Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next] r8169: make use of xmit_more
From: Holger Hoffstätte @ 2019-08-08 19:52 UTC (permalink / raw)
  To: Heiner Kallweit, Realtek linux nic maintainers, David Miller
  Cc: netdev@vger.kernel.org, Sander Eikelenboom, Eric Dumazet
In-Reply-To: <a19bb3de-a866-d342-7cca-020fef219d03@gmail.com>

On 8/8/19 8:17 PM, Heiner Kallweit wrote:
> On 08.08.2019 17:53, Holger Hoffstätte wrote:
>> On 8/8/19 4:37 PM, Holger Hoffstätte wrote:
>>>
>>> Hello Heiner -
>>>
>>> On 7/28/19 11:25 AM, Heiner Kallweit wrote:
>>>> There was a previous attempt to use xmit_more, but the change had to be
>>>> reverted because under load sometimes a transmit timeout occurred [0].
>>>> Maybe this was caused by a missing memory barrier, the new attempt
>>>> keeps the memory barrier before the call to netif_stop_queue like it
>>>> is used by the driver as of today. The new attempt also changes the
>>>> order of some calls as suggested by Eric.
>>>>
>>>> [0] https://lkml.org/lkml/2019/2/10/39
>>>>
>>>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
>>>
>>> I decided to take one for the team and merged this into my 5.2.x tree (just
>>> fixing up the path) and it has been working fine for the last 2 weeks in two
>>> machines..until today, when for the first time in forever some random NFS traffic
>>> made this old friend come out from under the couch:
>>>
>>> [Aug 8 14:13] ------------[ cut here ]------------
>>> [  +0.000006] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
>>> [  +0.000021] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:442 dev_watchdog+0x21f/0x230
>>> [  +0.000001] Modules linked in: lz4 lz4_compress lz4_decompress nfsd auth_rpcgss oid_registry lockd grace sunrpc sch_fq_codel btrfs xor zstd_compress raid6_pq zstd_decompress bfq jitterentropy_rng nct6775 hwmon_vid coretemp hwmon x86_pkg_temp_thermal aesni_intel aes_x86_64 i915 glue_helper crypto_simd cryptd i2c_i801 intel_gtt i2c_algo_bit iosf_mbi drm_kms_helper syscopyarea usbhid sysfillrect r8169 sysimgblt fb_sys_fops realtek drm libphy drm_panel_orientation_quirks i2c_core video backlight mq_deadline
>>> [  +0.000026] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.2.7 #1
>>> [  +0.000001] Hardware name: System manufacturer System Product Name/P8Z68-V LX, BIOS 4105 07/01/2013
>>> [  +0.000004] RIP: 0010:dev_watchdog+0x21f/0x230
>>> [  +0.000002] Code: 3b 00 75 ea eb ad 4c 89 ef c6 05 1c 45 bd 00 01 e8 66 35 fc ff 44 89 e1 4c 89 ee 48 c7 c7 e8 5e fc 81 48 89 c2 e8 90 df 92 ff <0f> 0b eb 8e 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 66 66 66 66 90
>>> [  +0.000002] RSP: 0018:ffffc90000118e68 EFLAGS: 00010286
>>> [  +0.000002] RAX: 0000000000000000 RBX: ffff8887f7837600 RCX: 0000000000000303
>>> [  +0.000001] RDX: 0000000000000001 RSI: 0000000000000092 RDI: ffffffff827a488c
>>> [  +0.000001] RBP: ffff8887f9fbc440 R08: 0000000000000303 R09: 0000000000000003
>>> [  +0.000001] R10: 000000000001004c R11: 0000000000000001 R12: 0000000000000000
>>> [  +0.000009] R13: ffff8887f9fbc000 R14: ffffffff8173aa20 R15: dead000000000200
>>> [  +0.000001] FS:  0000000000000000(0000) GS:ffff8887ff580000(0000) knlGS:0000000000000000
>>> [  +0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  +0.000001] CR2: 00007f8d1c04d000 CR3: 0000000002209001 CR4: 00000000000606e0
>>> [  +0.000000] Call Trace:
>>> [  +0.000002]  <IRQ>
>>> [  +0.000005]  call_timer_fn+0x2b/0x120
>>> [  +0.000002]  expire_timers+0xa4/0x100
>>> [  +0.000001]  run_timer_softirq+0x8c/0x170
>>> [  +0.000002]  ? __hrtimer_run_queues+0x13a/0x290
>>> [  +0.000003]  ? sched_clock_cpu+0xe/0x130
>>> [  +0.000003]  __do_softirq+0xeb/0x2de
>>> [  +0.000003]  irq_exit+0x9d/0xe0
>>> [  +0.000002]  smp_apic_timer_interrupt+0x60/0x110
>>> [  +0.000003]  apic_timer_interrupt+0xf/0x20
>>> [  +0.000001]  </IRQ>
>>> [  +0.000003] RIP: 0010:cpuidle_enter_state+0xad/0x930
>>> [  +0.000001] Code: c5 66 66 66 66 90 31 ff e8 90 99 9e ff 80 7c 24 0b 00 74 12 9c 58 f6 c4 02 0f 85 39 08 00 00 31 ff e8 e7 26 a2 ff fb 45 85 e4 <0f> 88 34 02 00 00 49 63 cc 4c 2b 2c 24 48 8d 04 49 48 c1 e0 05 8b
>>> [  +0.000000] RSP: 0018:ffffc9000008be50 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
>>> [  +0.000001] RAX: ffff8887ff5a9180 RBX: ffffffff822b6c40 RCX: 000000000000001f
>>> [  +0.000001] RDX: 0000000000000000 RSI: 0000000033087154 RDI: 0000000000000000
>>> [  +0.000001] RBP: ffff8887ff5b1310 R08: 000030d021fae397 R09: ffff8887ff59c8c0
>>> [  +0.000000] R10: ffff8887ff59c8c0 R11: 0000000000000006 R12: 0000000000000004
>>> [  +0.000001] R13: 000030d021fae397 R14: 0000000000000004 R15: ffff8887fc281600
>>> [  +0.000001]  cpuidle_enter+0x29/0x40
>>> [  +0.000002]  do_idle+0x1e5/0x280
>>> [  +0.000001]  cpu_startup_entry+0x19/0x20
>>> [  +0.000002]  start_secondary+0x186/0x1c0
>>> [  +0.000001]  secondary_startup_64+0xa4/0xb0
>>> [  +0.000001] ---[ end trace 99493c768580f4fd ]---
>>>
>>> The device is:
>>>
>>> Aug  7 23:19:09 tux kernel: libphy: r8169: probed
>>> Aug  7 23:19:09 tux kernel: r8169 0000:04:00.0 eth0: RTL8168evl/8111evl, c8:60:00:68:33:cc, XID 2c9, IRQ 36
>>> Aug  7 23:19:09 tux kernel: r8169 0000:04:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
>>> Aug  7 23:19:12 tux kernel: RTL8211E Gigabit Ethernet r8169-400:00: attached PHY driver [RTL8211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-400:00, irq=IGNORE)
>>> Aug  7 23:19:13 tux kernel: r8169 0000:04:00.0 eth0: No native access to PCI extended config space, falling back to CSI
>>>
>>> and using fq_codel, of course.
>>>
>>> This cpuidle hiccup used to be completely gone without xmit_more and this was
>>> the first (and so far only) time since merging it (regardless of load).
>>> Also, while I'm using BMQ as CPU scheduler, that hasn't made a difference for
>>> this particular problem in the past (with MuQSS/PDS) either; way back when I had
>>> Eric's previous attempt(s) it also hiccupped with CFS.
>>>
>>> Revert or wait for more reports when -next is merged in 5.4?
>>
>> Another question/data point: I've had the whole basket of offloads activated:
>>
>>    ethtool --offload eth0 rx on tx on gro on gso on sg on tso on
>>
>> and this caused zero problems without the xmit_more patch. However I just saw
>> that net-next has a patch where TSO is disabled due to a known HW defect in
>> RTL8168evl, which is of course what I have. Could this be the reason for the
>> stall/hiccup when xmit_more has its fingers in the pie? I kind of know what
>> xmit_more does, just not how it could interact with a possibly broken TSO that
>> nevertheless seems to work fine otherwise..
>>
> 
> I was about to ask exactly that, whether you have TSO enabled. I don't know what
> can trigger the HW issue, it was just confirmed by Realtek that this chip version
> has a problem with TSO. So the logical conclusion is: test w/o TSO, ideally the
> linux-next version.

So disabling TSO alone didn't work - it leads to reduced throughout (~70 MB/s in iperf).
Instead I decided to backport 93681cd7d94f ("r8169: enable HW csum and TSO"), which
wasn't easy due to cleanups/renamings of dependencies, but I managed to backport
it and .. got the same problem of reduced throughout. wat?!

After lots of trial & error I started disabling all offloads and finally found
that sg (Scatter-Gather) enabled alone - without TSO - will lead to the throughput
drop. So the culprit seems 93681cd7d94f, which disabled TSO on my NIC, but left
sg on by default. This weas repeatable - switch on sg, throughput drop; turn it
off - smooth sailing, now with reduced buffers.

I modified the relevant bits to disable tso & sg like this:

	/* RTL8168e-vl has a HW issue with TSO */
	if (tp->mac_version == RTL_GIGA_MAC_VER_34) {
+		dev->vlan_features &= ~(NETIF_F_ALL_TSO|NETIF_F_SG);
+		dev->hw_features &= ~(NETIF_F_ALL_TSO|NETIF_F_SG);
+		dev->features &= ~(NETIF_F_ALL_TSO|NETIF_F_SG);
	}

This seems to work since it restores performance without sg/tso by default
and without any additional offloads, yet with xmit_more in the mix.
We'll see whether that is stable over the next few days, but I strongly
suspect it will be good and that the hiccups were due to xmit_more/TSO
interaction.

thanks,
Holger

^ permalink raw reply

* Re: [PATCH net-next 3/3] net: phy: realtek: add support for the 2.5Gbps PHY in RTL8125
From: Heiner Kallweit @ 2019-08-08 19:46 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Florian Fainelli, David Miller, netdev@vger.kernel.org
In-Reply-To: <20190808193743.GL27917@lunn.ch>

On 08.08.2019 21:37, Andrew Lunn wrote:
> On Thu, Aug 08, 2019 at 09:05:54PM +0200, Heiner Kallweit wrote:
>> This adds support for the integrated 2.5Gbps PHY in Realtek RTL8125.
>> Advertisement of 2.5Gbps mode is done via a vendor-specific register.
>> Same applies to reading NBase-T link partner advertisement.
>> Unfortunately this 2.5Gbps PHY shares the PHY ID with the integrated
>> 1Gbps PHY's in other Realtek network chips and so far no method is
>> known to differentiate them.
> 
> That is not nice.
> 
Indeed.

> Do you have any contacts in Realtek who can provide us with
> information? Maybe there is another undocumented vendor specific
> register?
> 
I have a contact in Realtek who provided the information about
the vendor-specific registers used in the patch. I also asked for
a method to auto-detect 2.5Gbps support but have no feedback so far.
What may contribute to the problem is that also the integrated 1Gbps
PHY's (all with the same PHY ID) differ significantly from each other,
depending on the network chip version.

> 	Andrew
> 
Heiner

^ permalink raw reply

* Re: [bpf-next v3 PATCH 2/3] samples/bpf: make xdp_fwd more practically usable via devmap lookup
From: Y Song @ 2019-08-08 19:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, Daniel Borkmann, Alexei Starovoitov, a.s.protopopov,
	David Ahern, Toke Høiland-Jørgensen
In-Reply-To: <156528106270.22124.2563148023961869582.stgit@firesoul>

On Thu, Aug 8, 2019 at 9:17 AM Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
> This address the TODO in samples/bpf/xdp_fwd_kern.c, which points out
> that the chosen egress index should be checked for existence in the
> devmap. This can now be done via taking advantage of Toke's work in
> commit 0cdbb4b09a06 ("devmap: Allow map lookups from eBPF").
>
> This change makes xdp_fwd more practically usable, as this allows for
> a mixed environment, where IP-forwarding fallback to network stack, if
> the egress device isn't configured to use XDP.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Reviewed-by: David Ahern <dsahern@gmail.com>

Acked-by: Yonghong Song <yhs@fb.com>

> ---
>  samples/bpf/xdp_fwd_kern.c |   17 +++++++++++------
>  samples/bpf/xdp_fwd_user.c |   33 ++++++++++++++++++++++-----------
>  2 files changed, 33 insertions(+), 17 deletions(-)
>
> diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c
> index e6ffc4ea06f4..a43d6953c054 100644
> --- a/samples/bpf/xdp_fwd_kern.c
> +++ b/samples/bpf/xdp_fwd_kern.c
> @@ -104,13 +104,18 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
>
>         rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags);
>
> -       /* verify egress index has xdp support
> -        * TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with
> -        *       cannot pass map_type 14 into func bpf_map_lookup_elem#1:
> -        * NOTE: without verification that egress index supports XDP
> -        *       forwarding packets are dropped.
> -        */
>         if (rc == 0) {
> +               /* Verify egress index has been configured as TX-port.
> +                * (Note: User can still have inserted an egress ifindex that
> +                * doesn't support XDP xmit, which will result in packet drops).
> +                *
> +                * Note: lookup in devmap supported since 0cdbb4b09a0.
> +                * If not supported will fail with:
> +                *  cannot pass map_type 14 into func bpf_map_lookup_elem#1:
> +                */
> +               if (!bpf_map_lookup_elem(&xdp_tx_ports, &fib_params.ifindex))
> +                       return XDP_PASS;
> +
>                 if (h_proto == htons(ETH_P_IP))
>                         ip_decrease_ttl(iph);
>                 else if (h_proto == htons(ETH_P_IPV6))
> diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c
> index ba012d9f93dd..97ff1dad7669 100644
> --- a/samples/bpf/xdp_fwd_user.c
> +++ b/samples/bpf/xdp_fwd_user.c
> @@ -27,14 +27,20 @@
>  #include "libbpf.h"
>  #include <bpf/bpf.h>
>
> -
> -static int do_attach(int idx, int fd, const char *name)
> +static int do_attach(int idx, int prog_fd, int map_fd, const char *name)
>  {
>         int err;
>
> -       err = bpf_set_link_xdp_fd(idx, fd, 0);
> -       if (err < 0)
> +       err = bpf_set_link_xdp_fd(idx, prog_fd, 0);
> +       if (err < 0) {
>                 printf("ERROR: failed to attach program to %s\n", name);
> +               return err;
> +       }
> +
> +       /* Adding ifindex as a possible egress TX port */
> +       err = bpf_map_update_elem(map_fd, &idx, &idx, 0);
> +       if (err)
> +               printf("ERROR: failed using device %s as TX-port\n", name);
>
>         return err;
>  }
> @@ -47,6 +53,9 @@ static int do_detach(int idx, const char *name)
>         if (err < 0)
>                 printf("ERROR: failed to detach program from %s\n", name);
>
> +       /* TODO: Remember to cleanup map, when adding use of shared map
> +        *  bpf_map_delete_elem((map_fd, &idx);
> +        */
>         return err;
>  }
>
> @@ -67,10 +76,10 @@ int main(int argc, char **argv)
>         };
>         const char *prog_name = "xdp_fwd";
>         struct bpf_program *prog;
> +       int prog_fd, map_fd = -1;
>         char filename[PATH_MAX];
>         struct bpf_object *obj;
>         int opt, i, idx, err;
> -       int prog_fd, map_fd;
>         int attach = 1;
>         int ret = 0;
>
> @@ -103,8 +112,14 @@ int main(int argc, char **argv)
>                         return 1;
>                 }
>
> -               if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
> +               err = bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd);
> +               if (err) {
> +                       printf("Does kernel support devmap lookup?\n");
> +                       /* If not, the error message will be:
> +                        *  "cannot pass map_type 14 into func bpf_map_lookup_elem#1"
> +                        */
>                         return 1;
> +               }
>
>                 prog = bpf_object__find_program_by_title(obj, prog_name);
>                 prog_fd = bpf_program__fd(prog);
> @@ -119,10 +134,6 @@ int main(int argc, char **argv)
>                         return 1;
>                 }
>         }
> -       if (attach) {
> -               for (i = 1; i < 64; ++i)
> -                       bpf_map_update_elem(map_fd, &i, &i, 0);
> -       }
>
>         for (i = optind; i < argc; ++i) {
>                 idx = if_nametoindex(argv[i]);
> @@ -138,7 +149,7 @@ int main(int argc, char **argv)
>                         if (err)
>                                 ret = err;
>                 } else {
> -                       err = do_attach(idx, prog_fd, argv[i]);
> +                       err = do_attach(idx, prog_fd, map_fd, argv[i]);
>                         if (err)
>                                 ret = err;
>                 }
>

^ permalink raw reply

* Re: [PATCH v5 bpf-next] BPF: helpers: New helper to obtain namespace data from current task
From: Y Song @ 2019-08-08 19:44 UTC (permalink / raw)
  To: Carlos Antonio Neira Bustos
  Cc: Yonghong Song, netdev@vger.kernel.org, ebiederm@xmission.com,
	brouer@redhat.com, quentin.monnet@netronome.com
In-Reply-To: <20190808174848.poybtaagg5ctle7t@dev00>

On Thu, Aug 8, 2019 at 10:52 AM Carlos Antonio Neira Bustos
<cneirabustos@gmail.com> wrote:
>
> Yonghong,
>
> I have modified the patch following your feedback.
> Let me know if I'm missing something.

Yes, I have some other requests about formating.
https://lore.kernel.org/netdev/20190808174848.poybtaagg5ctle7t@dev00/T/#t
Could you address it as well?

>
> Bests
>
> From 70f8d5584700c9cfc82c006901d8ee9595c53f15 Mon Sep 17 00:00:00 2001
> From: Carlos <cneirabustos@gmail.com>
> Date: Wed, 7 Aug 2019 20:04:30 -0400
> Subject: [PATCH] [PATCH v6 bpf-next] BPF: New helper to obtain namespace data
>  from current task
>
> This helper obtains the active namespace from current and returns pid, tgid,
> device and namespace id as seen from that namespace, allowing to instrument
> a process inside a container.
> Device is read from /proc/self/ns/pid, as in the future it's possible that
> different pid_ns files may belong to different devices, according
> to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
> conference.
> Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
> scripts but this helper returns the pid as seen by the root namespace which is
> fine when a bcc script is not executed inside a container.
> When the process of interest is inside a container, pid filtering will not work
> if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
> returning the pid as it's seen by the current namespace where the script is
> executing.
>
> This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
> used to do pid filtering even inside a container.
>
> For example a bcc script using bpf_get_current_pid_tgid() (tools/funccount.py):
>
>         u32 pid = bpf_get_current_pid_tgid() >> 32;
>         if (pid != <pid_arg_passed_in>)
>                 return 0;
> Could be modified to use bpf_get_current_pidns_info() as follows:
>
>         struct bpf_pidns pidns;
>         bpf_get_current_pidns_info(&pidns, sizeof(struct bpf_pidns));
>         u32 pid = pidns.tgid;
>         u32 nsid = pidns.nsid;
>         if ((pid != <pid_arg_passed_in>) && (nsid != <nsid_arg_passed_in>))
>                 return 0;
>
> To find out the name PID namespace id of a process, you could use this command:
>
> $ ps -h -o pidns -p <pid_of_interest>
>
> Or this other command:
>
> $ ls -Li /proc/<pid_of_interest>/ns/pid
>
> Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
> ---
>  fs/internal.h                                      |   2 -
>  fs/namei.c                                         |   1 -
>  include/linux/bpf.h                                |   1 +
>  include/linux/namei.h                              |   4 +
>  include/uapi/linux/bpf.h                           |  27 +++-
>  kernel/bpf/core.c                                  |   1 +
>  kernel/bpf/helpers.c                               |  64 ++++++++++
>  kernel/trace/bpf_trace.c                           |   2 +
>  samples/bpf/Makefile                               |   3 +
>  samples/bpf/trace_ns_info_user.c                   |  35 ++++++
>  samples/bpf/trace_ns_info_user_kern.c              |  44 +++++++
>  tools/include/uapi/linux/bpf.h                     |  27 +++-
>  tools/testing/selftests/bpf/Makefile               |   2 +-
>  tools/testing/selftests/bpf/bpf_helpers.h          |   3 +
>  .../testing/selftests/bpf/progs/test_pidns_kern.c  |  51 ++++++++
>  tools/testing/selftests/bpf/test_pidns.c           | 138 +++++++++++++++++++++
>  16 files changed, 399 insertions(+), 6 deletions(-)
>  create mode 100644 samples/bpf/trace_ns_info_user.c
>  create mode 100644 samples/bpf/trace_ns_info_user_kern.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_pidns_kern.c
>  create mode 100644 tools/testing/selftests/bpf/test_pidns.c
>
> diff --git a/fs/internal.h b/fs/internal.h
> index 315fcd8d237c..6647e15dd419 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -59,8 +59,6 @@ extern int finish_clean_context(struct fs_context *fc);
>  /*
>   * namei.c
>   */
> -extern int filename_lookup(int dfd, struct filename *name, unsigned flags,
> -                          struct path *path, struct path *root);
>  extern int user_path_mountpoint_at(int, const char __user *, unsigned int, struct path *);
>  extern int vfs_path_lookup(struct dentry *, struct vfsmount *,
>                            const char *, unsigned int, struct path *);
> diff --git a/fs/namei.c b/fs/namei.c
> index 209c51a5226c..a89fc72a4a10 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -19,7 +19,6 @@
>  #include <linux/export.h>
>  #include <linux/kernel.h>
>  #include <linux/slab.h>
> -#include <linux/fs.h>
>  #include <linux/namei.h>
>  #include <linux/pagemap.h>
>  #include <linux/fsnotify.h>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index f9a506147c8a..e4adf5e05afd 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1050,6 +1050,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
>  extern const struct bpf_func_proto bpf_strtol_proto;
>  extern const struct bpf_func_proto bpf_strtoul_proto;
>  extern const struct bpf_func_proto bpf_tcp_sock_proto;
> +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
>
>  /* Shared helpers among cBPF and eBPF. */
>  void bpf_user_rnd_init_once(void);
> diff --git a/include/linux/namei.h b/include/linux/namei.h
> index 9138b4471dbf..b45c8b6f7cb4 100644
> --- a/include/linux/namei.h
> +++ b/include/linux/namei.h
> @@ -6,6 +6,7 @@
>  #include <linux/path.h>
>  #include <linux/fcntl.h>
>  #include <linux/errno.h>
> +#include <linux/fs.h>
>
>  enum { MAX_NESTED_LINKS = 8 };
>
> @@ -97,6 +98,9 @@ extern void unlock_rename(struct dentry *, struct dentry *);
>
>  extern void nd_jump_link(struct path *path);
>
> +extern int filename_lookup(int dfd, struct filename *name, unsigned flags,
> +                          struct path *path, struct path *root);
> +
>  static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
>  {
>         ((char *) name)[min(len, maxlen)] = '\0';
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 4393bd4b2419..b0d4869fb860 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2741,6 +2741,24 @@ union bpf_attr {
>   *             **-EOPNOTSUPP** kernel configuration does not enable SYN cookies
>   *
>   *             **-EPROTONOSUPPORT** IP packet version is not 4 or 6
> + *
> + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
> + *     Description
> + *             Copies into *pidns* pid, namespace id and tgid as seen by the
> + *             current namespace and also device from /proc/self/ns/pid.
> + *             *size_of_pidns* must be the size of *pidns*
> + *
> + *             This helper is used when pid filtering is needed inside a
> + *             container as bpf_get_current_tgid() helper returns always the
> + *             pid id as seen by the root namespace.
> + *     Return
> + *             0 on success
> + *
> + *             **-EINVAL** if *size_of_pidns* is not valid or unable to get ns, pid
> + *             or tgid of the current task.
> + *
> + *             **-ENOMEM**  if allocation fails.
> + *
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -2853,7 +2871,8 @@ union bpf_attr {
>         FN(sk_storage_get),             \
>         FN(sk_storage_delete),          \
>         FN(send_signal),                \
> -       FN(tcp_gen_syncookie),
> +       FN(tcp_gen_syncookie),          \
> +       FN(get_current_pidns_info),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -3604,4 +3623,10 @@ struct bpf_sockopt {
>         __s32   retval;
>  };
>
> +struct bpf_pidns_info {
> +       __u32 dev;
> +       __u32 nsid;
> +       __u32 tgid;
> +       __u32 pid;
> +};
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 8191a7db2777..3159f2a0188c 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -2038,6 +2038,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
>  const struct bpf_func_proto bpf_get_current_comm_proto __weak;
>  const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
>  const struct bpf_func_proto bpf_get_local_storage_proto __weak;
> +const struct bpf_func_proto bpf_get_current_pidns_info __weak;
>
>  const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
>  {
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 5e28718928ca..41fbf1f28a48 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -11,6 +11,12 @@
>  #include <linux/uidgid.h>
>  #include <linux/filter.h>
>  #include <linux/ctype.h>
> +#include <linux/pid_namespace.h>
> +#include <linux/major.h>
> +#include <linux/stat.h>
> +#include <linux/namei.h>
> +#include <linux/version.h>
> +
>
>  #include "../../lib/kstrtox.h"
>
> @@ -312,6 +318,64 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
>         preempt_enable();
>  }
>
> +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
> +        size)
> +{
> +       const char *pidns_path = "/proc/self/ns/pid";
> +       struct pid_namespace *pidns = NULL;
> +       struct filename *tmp = NULL;
> +       struct inode *inode;
> +       struct path kp;
> +       pid_t tgid = 0;
> +       pid_t pid = 0;
> +       int ret;
> +       int len;
> +
> +       if (unlikely(size != sizeof(struct bpf_pidns_info)))
> +               return -EINVAL;
> +       pidns = task_active_pid_ns(current);
> +       if (unlikely(!pidns))
> +               goto clear;
> +       pidns_info->nsid =  pidns->ns.inum;
> +       pid = task_pid_nr_ns(current, pidns);
> +       if (unlikely(!pid))
> +               goto clear;
> +       tgid = task_tgid_nr_ns(current, pidns);
> +       if (unlikely(!tgid))
> +               goto clear;
> +       pidns_info->tgid = (u32) tgid;
> +       pidns_info->pid = (u32) pid;
> +       tmp = kmem_cache_alloc(names_cachep, GFP_ATOMIC);
> +       if (unlikely(!tmp)) {
> +               memset((void *)pidns_info, 0, (size_t) size);
> +               return -ENOMEM;
> +       }
> +       len = strlen(pidns_path) + 1;
> +       memcpy((char *)tmp->name, pidns_path, len);
> +       tmp->uptr = NULL;
> +       tmp->aname = NULL;
> +       tmp->refcnt = 1;
> +       ret = filename_lookup(AT_FDCWD, tmp, 0, &kp, NULL);
> +       if (ret) {
> +               memset((void *)pidns_info, 0, (size_t) size);
> +               return ret;
> +       }
> +       inode = d_backing_inode(kp.dentry);
> +       pidns_info->dev = inode->i_sb->s_dev;
> +       return 0;
> +clear:
> +       memset((void *)pidns_info, 0, (size_t) size);
> +       return -EINVAL;
> +}
> +
> +const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
> +       .func           = bpf_get_current_pidns_info,
> +       .gpl_only       = false,
> +       .ret_type       = RET_INTEGER,
> +       .arg1_type      = ARG_PTR_TO_UNINIT_MEM,
> +       .arg2_type      = ARG_CONST_SIZE,
> +};
> +
>  #ifdef CONFIG_CGROUPS
>  BPF_CALL_0(bpf_get_current_cgroup_id)
>  {
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index ca1255d14576..5e1dc22765a5 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -709,6 +709,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  #endif
>         case BPF_FUNC_send_signal:
>                 return &bpf_send_signal_proto;
> +       case BPF_FUNC_get_current_pidns_info:
> +               return &bpf_get_current_pidns_info_proto;
>         default:
>                 return NULL;
>         }
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index 1d9be26b4edd..238453ff27d2 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -53,6 +53,7 @@ hostprogs-y += task_fd_query
>  hostprogs-y += xdp_sample_pkts
>  hostprogs-y += ibumad
>  hostprogs-y += hbm
> +hostprogs-y += trace_ns_info
>
>  # Libbpf dependencies
>  LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
> @@ -109,6 +110,7 @@ task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
>  xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
>  ibumad-objs := bpf_load.o ibumad_user.o $(TRACE_HELPERS)
>  hbm-objs := bpf_load.o hbm.o $(CGROUP_HELPERS)
> +trace_ns_info-objs := bpf_load.o trace_ns_info_user.o
>
>  # Tell kbuild to always build the programs
>  always := $(hostprogs-y)
> @@ -170,6 +172,7 @@ always += xdp_sample_pkts_kern.o
>  always += ibumad_kern.o
>  always += hbm_out_kern.o
>  always += hbm_edt_kern.o
> +always += trace_ns_info_user_kern.o
>
>  KBUILD_HOSTCFLAGS += -I$(objtree)/usr/include
>  KBUILD_HOSTCFLAGS += -I$(srctree)/tools/lib/bpf/
> diff --git a/samples/bpf/trace_ns_info_user.c b/samples/bpf/trace_ns_info_user.c
> new file mode 100644
> index 000000000000..e06d08db6f30
> --- /dev/null
> +++ b/samples/bpf/trace_ns_info_user.c
> @@ -0,0 +1,35 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +
> +#include <stdio.h>
> +#include <linux/bpf.h>
> +#include <unistd.h>
> +#include "bpf/libbpf.h"
> +#include "bpf_load.h"
> +
> +/* This code was taken verbatim from tracex1_user.c, it's used
> + * to exercize bpf_get_current_pidns_info() helper call.
> + */
> +int main(int ac, char **argv)
> +{
> +       FILE *f;
> +       char filename[256];
> +
> +       snprintf(filename, sizeof(filename), "%s_user_kern.o", argv[0]);
> +       printf("loading %s\n", filename);
> +
> +       if (load_bpf_file(filename)) {
> +               printf("%s", bpf_log_buf);
> +               return 1;
> +       }
> +
> +       f = popen("taskset 1 ping  localhost", "r");
> +       (void) f;
> +       read_trace_pipe();
> +       return 0;
> +}
> diff --git a/samples/bpf/trace_ns_info_user_kern.c b/samples/bpf/trace_ns_info_user_kern.c
> new file mode 100644
> index 000000000000..96675e02b707
> --- /dev/null
> +++ b/samples/bpf/trace_ns_info_user_kern.c
> @@ -0,0 +1,44 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +#include <linux/skbuff.h>
> +#include <linux/netdevice.h>
> +#include <linux/version.h>
> +#include <uapi/linux/bpf.h>
> +#include "bpf_helpers.h"
> +
> +typedef __u64 u64;
> +typedef __u32 u32;
> +
> +
> +/* kprobe is NOT a stable ABI
> + * kernel functions can be removed, renamed or completely change semantics.
> + * Number of arguments and their positions can change, etc.
> + * In such case this bpf+kprobe example will no longer be meaningful
> + */
> +
> +/* This will call bpf_get_current_pidns_info() to display pid and ns values
> + * as seen by the current namespace, on the far left you will see the pid as
> + * seen as by the root namespace.
> + */
> +
> +SEC("kprobe/__netif_receive_skb_core")
> +int bpf_prog1(struct pt_regs *ctx)
> +{
> +       char fmt[] = "nsid:%u, dev: %u,  pid:%u\n";
> +       struct bpf_pidns_info nsinfo;
> +       int ok = 0;
> +
> +       ok = bpf_get_current_pidns_info(&nsinfo, sizeof(nsinfo));
> +       if (ok == 0)
> +               bpf_trace_printk(fmt, sizeof(fmt), (u32)nsinfo.nsid,
> +                                (u32) nsinfo.dev, (u32)nsinfo.pid);
> +
> +       return 0;
> +}
> +char _license[] SEC("license") = "GPL";
> +u32 _version SEC("version") = LINUX_VERSION_CODE;
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 4393bd4b2419..b0d4869fb860 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -2741,6 +2741,24 @@ union bpf_attr {
>   *             **-EOPNOTSUPP** kernel configuration does not enable SYN cookies
>   *
>   *             **-EPROTONOSUPPORT** IP packet version is not 4 or 6
> + *
> + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
> + *     Description
> + *             Copies into *pidns* pid, namespace id and tgid as seen by the
> + *             current namespace and also device from /proc/self/ns/pid.
> + *             *size_of_pidns* must be the size of *pidns*
> + *
> + *             This helper is used when pid filtering is needed inside a
> + *             container as bpf_get_current_tgid() helper returns always the
> + *             pid id as seen by the root namespace.
> + *     Return
> + *             0 on success
> + *
> + *             **-EINVAL** if *size_of_pidns* is not valid or unable to get ns, pid
> + *             or tgid of the current task.
> + *
> + *             **-ENOMEM**  if allocation fails.
> + *
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -2853,7 +2871,8 @@ union bpf_attr {
>         FN(sk_storage_get),             \
>         FN(sk_storage_delete),          \
>         FN(send_signal),                \
> -       FN(tcp_gen_syncookie),
> +       FN(tcp_gen_syncookie),          \
> +       FN(get_current_pidns_info),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -3604,4 +3623,10 @@ struct bpf_sockopt {
>         __s32   retval;
>  };
>
> +struct bpf_pidns_info {
> +       __u32 dev;
> +       __u32 nsid;
> +       __u32 tgid;
> +       __u32 pid;
> +};
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 3bd0f4a0336a..1f97b571b581 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -29,7 +29,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
>         test_cgroup_storage test_select_reuseport test_section_names \
>         test_netcnt test_tcpnotify_user test_sock_fields test_sysctl test_hashmap \
>         test_btf_dump test_cgroup_attach xdping test_sockopt test_sockopt_sk \
> -       test_sockopt_multi test_tcp_rtt
> +       test_sockopt_multi test_tcp_rtt test_pidns
>
>  BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
>  TEST_GEN_FILES = $(BPF_OBJ_FILES)
> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
> index 120aa86c58d3..c96795a9d983 100644
> --- a/tools/testing/selftests/bpf/bpf_helpers.h
> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
> @@ -231,6 +231,9 @@ static int (*bpf_send_signal)(unsigned sig) = (void *)BPF_FUNC_send_signal;
>  static long long (*bpf_tcp_gen_syncookie)(struct bpf_sock *sk, void *ip,
>                                           int ip_len, void *tcp, int tcp_len) =
>         (void *) BPF_FUNC_tcp_gen_syncookie;
> +static int (*bpf_get_current_pidns_info)(struct bpf_pidns_info *buf,
> +                                        unsigned int buf_size) =
> +       (void *) BPF_FUNC_get_current_pidns_info;
>
>  /* llvm builtin functions that eBPF C program may use to
>   * emit BPF_LD_ABS and BPF_LD_IND instructions
> diff --git a/tools/testing/selftests/bpf/progs/test_pidns_kern.c b/tools/testing/selftests/bpf/progs/test_pidns_kern.c
> new file mode 100644
> index 000000000000..e1d2facfa762
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_pidns_kern.c
> @@ -0,0 +1,51 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +
> +#include <linux/bpf.h>
> +#include <errno.h>
> +#include "bpf_helpers.h"
> +
> +struct bpf_map_def SEC("maps") nsidmap = {
> +       .type = BPF_MAP_TYPE_ARRAY,
> +       .key_size = sizeof(__u32),
> +       .value_size = sizeof(__u32),
> +       .max_entries = 1,
> +};
> +
> +struct bpf_map_def SEC("maps") pidmap = {
> +       .type = BPF_MAP_TYPE_ARRAY,
> +       .key_size = sizeof(__u32),
> +       .value_size = sizeof(__u32),
> +       .max_entries = 1,
> +};
> +
> +SEC("tracepoint/syscalls/sys_enter_nanosleep")
> +int trace(void *ctx)
> +{
> +       struct bpf_pidns_info nsinfo;
> +       __u32 key = 0, *expected_pid, *val;
> +       char fmt[] = "ERROR nspid:%d\n";
> +
> +       if (bpf_get_current_pidns_info(&nsinfo, sizeof(nsinfo)))
> +               return -EINVAL;
> +
> +       expected_pid = bpf_map_lookup_elem(&pidmap, &key);
> +
> +
> +       if (!expected_pid || *expected_pid != nsinfo.pid)
> +               return 0;
> +
> +       val = bpf_map_lookup_elem(&nsidmap, &key);
> +       if (val)
> +               *val = nsinfo.nsid;
> +
> +       return 0;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> +__u32 _version SEC("version") = 1;
> diff --git a/tools/testing/selftests/bpf/test_pidns.c b/tools/testing/selftests/bpf/test_pidns.c
> new file mode 100644
> index 000000000000..a7254055f294
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/test_pidns.c
> @@ -0,0 +1,138 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <syscall.h>
> +#include <unistd.h>
> +#include <linux/perf_event.h>
> +#include <sys/ioctl.h>
> +#include <sys/time.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +
> +#include <linux/bpf.h>
> +#include <bpf/bpf.h>
> +#include <bpf/libbpf.h>
> +
> +#include "cgroup_helpers.h"
> +#include "bpf_rlimit.h"
> +
> +#define CHECK(condition, tag, format...) ({            \
> +       int __ret = !!(condition);                      \
> +       if (__ret) {                                    \
> +               printf("%s:FAIL:%s ", __func__, tag);   \
> +               printf(format);                         \
> +       } else {                                        \
> +               printf("%s:PASS:%s\n", __func__, tag);  \
> +       }                                               \
> +       __ret;                                          \
> +})
> +
> +static int bpf_find_map(const char *test, struct bpf_object *obj,
> +                       const char *name)
> +{
> +       struct bpf_map *map;
> +
> +       map = bpf_object__find_map_by_name(obj, name);
> +       if (!map)
> +               return -1;
> +       return bpf_map__fd(map);
> +}
> +
> +
> +int main(int argc, char **argv)
> +{
> +       const char *probe_name = "syscalls/sys_enter_nanosleep";
> +       const char *file = "test_pidns_kern.o";
> +       int err, bytes, efd, prog_fd, pmu_fd;
> +       int pidmap_fd, nsidmap_fd;
> +       struct perf_event_attr attr = {};
> +       struct bpf_object *obj;
> +       __u32 knsid = 0;
> +       __u32 key = 0, pid;
> +       int exit_code = 1;
> +       struct stat st;
> +       char buf[256];
> +
> +       err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, &prog_fd);
> +       if (CHECK(err, "bpf_prog_load", "err %d errno %d\n", err, errno))
> +               goto cleanup_cgroup_env;
> +
> +       nsidmap_fd = bpf_find_map(__func__, obj, "nsidmap");
> +       if (CHECK(nsidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
> +                 nsidmap_fd, errno))
> +               goto close_prog;
> +
> +       pidmap_fd = bpf_find_map(__func__, obj, "pidmap");
> +       if (CHECK(pidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
> +                 pidmap_fd, errno))
> +               goto close_prog;
> +
> +       pid = getpid();
> +       bpf_map_update_elem(pidmap_fd, &key, &pid, 0);
> +
> +       snprintf(buf, sizeof(buf),
> +                "/sys/kernel/debug/tracing/events/%s/id", probe_name);
> +       efd = open(buf, O_RDONLY, 0);
> +       if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
> +               goto close_prog;
> +       bytes = read(efd, buf, sizeof(buf));
> +       close(efd);
> +       if (CHECK(bytes <= 0 || bytes >= sizeof(buf), "read",
> +                 "bytes %d errno %d\n", bytes, errno))
> +               goto close_prog;
> +
> +       attr.config = strtol(buf, NULL, 0);
> +       attr.type = PERF_TYPE_TRACEPOINT;
> +       attr.sample_type = PERF_SAMPLE_RAW;
> +       attr.sample_period = 1;
> +       attr.wakeup_events = 1;
> +
> +       pmu_fd = syscall(__NR_perf_event_open, &attr, getpid(), -1, -1, 0);
> +       if (CHECK(pmu_fd < 0, "perf_event_open", "err %d errno %d\n", pmu_fd,
> +                 errno))
> +               goto close_prog;
> +
> +       err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
> +       if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n", err,
> +                 errno))
> +               goto close_pmu;
> +
> +       err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
> +       if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n", err,
> +                 errno))
> +               goto close_pmu;
> +
> +       /* trigger some syscalls */
> +       sleep(1);
> +
> +       err = bpf_map_lookup_elem(nsidmap_fd, &key, &knsid);
> +       if (CHECK(err, "bpf_map_lookup_elem", "err %d errno %d\n", err, errno))
> +               goto close_pmu;
> +
> +       if (stat("/proc/self/ns/pid", &st))
> +               goto close_pmu;
> +
> +       if (CHECK(knsid != (__u32) st.st_ino, "compare_namespace_id",
> +                 "kern knsid %u user unsid %u\n", knsid, (__u32) st.st_ino))
> +               goto close_pmu;
> +
> +       exit_code = 0;
> +       printf("%s:PASS\n", argv[0]);
> +
> +close_pmu:
> +       close(pmu_fd);
> +close_prog:
> +       bpf_object__close(obj);
> +cleanup_cgroup_env:
> +       return exit_code;
> +}
> --
> 2.11.0
>
>
>
>
>
>
> On Thu, Aug 08, 2019 at 05:09:51AM +0000, Yonghong Song wrote:
> >
> >
> > On 8/7/19 6:22 PM, Carlos Antonio Neira Bustos wrote:
> > > The code has been modified to avoid syscalls that could sleep.
> > > Please let me know if any other modification is needed.
> > >
> > >  From be0384c0fa209a78c1567936e8db4e35b9a7c0f8 Mon Sep 17 00:00:00 2001
> > > From: Carlos <cneirabustos@gmail.com>
> > > Date: Wed, 7 Aug 2019 20:04:30 -0400
> > > Subject: [PATCH] [PATCH v5 bpf-next] BPF: New helper to obtain namespace data
> > >   from current task
> > >
> > > This helper obtains the active namespace from current and returns pid, tgid,
> > > device and namespace id as seen from that namespace, allowing to instrument
> > > a process inside a container.
> > > Device is read from /proc/self/ns/pid, as in the future it's possible that
> > > different pid_ns files may belong to different devices, according
> > > to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
> > > conference.
> > > Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
> > > scripts but this helper returns the pid as seen by the root namespace which is
> > > fine when a bcc script is not executed inside a container.
> > > When the process of interest is inside a container, pid filtering will not work
> > > if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
> > > returning the pid as it's seen by the current namespace where the script is
> > > executing.
> > >
> > > This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
> > > used to do pid filtering even inside a container.
> > >
> > > For example a bcc script using bpf_get_current_pid_tgid() (tools/funccount.py):
> > >
> > >          u32 pid = bpf_get_current_pid_tgid() >> 32;
> > >          if (pid != <pid_arg_passed_in>)
> > >                  return 0;
> > > Could be modified to use bpf_get_current_pidns_info() as follows:
> > >
> > >          struct bpf_pidns pidns;
> > >          bpf_get_current_pidns_info(&pidns, sizeof(struct bpf_pidns));
> > >          u32 pid = pidns.tgid;
> > >          u32 nsid = pidns.nsid;
> > >          if ((pid != <pid_arg_passed_in>) && (nsid != <nsid_arg_passed_in>))
> > >                  return 0;
> > >
> > > To find out the name PID namespace id of a process, you could use this command:
> > >
> > > $ ps -h -o pidns -p <pid_of_interest>
> > >
> > > Or this other command:
> > >
> > > $ ls -Li /proc/<pid_of_interest>/ns/pid
> > >
> > > Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
> > > ---
> > >   fs/namei.c                                         |   2 +-
> > >   include/linux/bpf.h                                |   1 +
> > >   include/linux/namei.h                              |   4 +
> > >   include/uapi/linux/bpf.h                           |  29 ++++-
> > >   kernel/bpf/core.c                                  |   1 +
> > >   kernel/bpf/helpers.c                               |  78 ++++++++++++
> > >   kernel/trace/bpf_trace.c                           |   2 +
> > >   samples/bpf/Makefile                               |   3 +
> > >   samples/bpf/trace_ns_info_user.c                   |  35 ++++++
> > >   samples/bpf/trace_ns_info_user_kern.c              |  44 +++++++
> > >   tools/include/uapi/linux/bpf.h                     |  29 ++++-
> > >   tools/testing/selftests/bpf/Makefile               |   2 +-
> > >   tools/testing/selftests/bpf/bpf_helpers.h          |   3 +
> > >   .../testing/selftests/bpf/progs/test_pidns_kern.c  |  51 ++++++++
> > >   tools/testing/selftests/bpf/test_pidns.c           | 138 +++++++++++++++++++++
> > >   15 files changed, 418 insertions(+), 4 deletions(-)
> > >   create mode 100644 samples/bpf/trace_ns_info_user.c
> > >   create mode 100644 samples/bpf/trace_ns_info_user_kern.c
> > >   create mode 100644 tools/testing/selftests/bpf/progs/test_pidns_kern.c
> > >   create mode 100644 tools/testing/selftests/bpf/test_pidns.c
> > >
> > > diff --git a/fs/namei.c b/fs/namei.c
> > > index 209c51a5226c..d1eca36972d2 100644
> > > --- a/fs/namei.c
> > > +++ b/fs/namei.c
> > > @@ -19,7 +19,6 @@
> > >   #include <linux/export.h>
> > >   #include <linux/kernel.h>
> > >   #include <linux/slab.h>
> > > -#include <linux/fs.h>
> > >   #include <linux/namei.h>
> > >   #include <linux/pagemap.h>
> > >   #include <linux/fsnotify.h>
> > > @@ -2355,6 +2354,7 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
> > >     putname(name);
> > >     return retval;
> > >   }
> > > +EXPORT_SYMBOL(filename_lookup);
> >
> > No need to export symbols. bpf uses it and bpf is in the core, not in
> > modules.
> >
> > >
> > >   /* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
> > >   static int path_parentat(struct nameidata *nd, unsigned flags,
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index f9a506147c8a..e4adf5e05afd 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -1050,6 +1050,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
> > >   extern const struct bpf_func_proto bpf_strtol_proto;
> > >   extern const struct bpf_func_proto bpf_strtoul_proto;
> > >   extern const struct bpf_func_proto bpf_tcp_sock_proto;
> > > +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
> > >
> > >   /* Shared helpers among cBPF and eBPF. */
> > >   void bpf_user_rnd_init_once(void);
> > > diff --git a/include/linux/namei.h b/include/linux/namei.h
> > > index 9138b4471dbf..2c24e8c71d46 100644
> > > --- a/include/linux/namei.h
> > > +++ b/include/linux/namei.h
> > > @@ -6,6 +6,7 @@
> > >   #include <linux/path.h>
> > >   #include <linux/fcntl.h>
> > >   #include <linux/errno.h>
> > > +#include <linux/fs.h>
> > >
> > >   enum { MAX_NESTED_LINKS = 8 };
> > >
> > > @@ -97,6 +98,9 @@ extern void unlock_rename(struct dentry *, struct dentry *);
> > >
> > >   extern void nd_jump_link(struct path *path);
> > >
> > > +extern int filename_lookup(int dfd, struct filename *name, unsigned int flags,
> > > +               struct path *path, struct path *root);
> >
> > The previous definition in fs/internal.h should be removed.
> >
> > > +
> > >   static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
> > >   {
> > >     ((char *) name)[min(len, maxlen)] = '\0';
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index 4393bd4b2419..6f601f7106e2 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -2741,6 +2741,26 @@ union bpf_attr {
> > >    *                **-EOPNOTSUPP** kernel configuration does not enable SYN cookies
> > >    *
> > >    *                **-EPROTONOSUPPORT** IP packet version is not 4 or 6
> > > + *
> > > + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
> > > + * Description
> > > + *         Copies into *pidns* pid, namespace id and tgid as seen by the
> > > + *         current namespace and also device from /proc/self/ns/pid.
> > > + *         *size_of_pidns* must be the size of *pidns*
> > > + *
> > > + *         This helper is used when pid filtering is needed inside a
> > > + *         container as bpf_get_current_tgid() helper returns always the
> > > + *         pid id as seen by the root namespace.
> > > + * Return
> > > + *         0 on success
> > > + *
> > > + *         **-EINVAL**  if unable to get ns, pid or tgid of current task.
> > > + *         Or if size_of_pidns is not valid.
> >
> > Maybe reword by following the code sequence.
> >     if *size_of_pidns* is not valid or unable to get ns, pid or tgid of
> >     the current task.
> >
> > > + *
> > > + *         **-ENOMEM**  if allocation fails.
> >
> > Maybe some other error codes in filename_lookup() function?
> >
> > > + *
> > > + *         If unable to get the inode from /proc/self/ns/pid an error code
> > > + *         will be returned.
> >
> > You do not need this. The description of error code cases should cover this.
> >
> > >    */
> > >   #define __BPF_FUNC_MAPPER(FN)             \
> > >     FN(unspec),                     \
> > > @@ -2853,7 +2873,8 @@ union bpf_attr {
> > >     FN(sk_storage_get),             \
> > >     FN(sk_storage_delete),          \
> > >     FN(send_signal),                \
> > > -   FN(tcp_gen_syncookie),
> > > +   FN(tcp_gen_syncookie),          \
> > > +   FN(get_current_pidns_info),
> > >
> > >   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > >    * function eBPF program intends to call
> > > @@ -3604,4 +3625,10 @@ struct bpf_sockopt {
> > >     __s32   retval;
> > >   };
> > >
> > > +struct bpf_pidns_info {
> > > +   __u32 dev;
> > > +   __u32 nsid;
> > > +   __u32 tgid;
> > > +   __u32 pid;
> > > +};
> > >   #endif /* _UAPI__LINUX_BPF_H__ */
> > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > > index 8191a7db2777..3159f2a0188c 100644
> > > --- a/kernel/bpf/core.c
> > > +++ b/kernel/bpf/core.c
> > > @@ -2038,6 +2038,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
> > >   const struct bpf_func_proto bpf_get_current_comm_proto __weak;
> > >   const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
> > >   const struct bpf_func_proto bpf_get_local_storage_proto __weak;
> > > +const struct bpf_func_proto bpf_get_current_pidns_info __weak;
> > >
> > >   const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
> > >   {
> > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > > index 5e28718928ca..571f24077db2 100644
> > > --- a/kernel/bpf/helpers.c
> > > +++ b/kernel/bpf/helpers.c
> > > @@ -11,6 +11,12 @@
> > >   #include <linux/uidgid.h>
> > >   #include <linux/filter.h>
> > >   #include <linux/ctype.h>
> > > +#include <linux/pid_namespace.h>
> > > +#include <linux/major.h>
> > > +#include <linux/stat.h>
> > > +#include <linux/namei.h>
> > > +#include <linux/version.h>
> > > +
> > >
> > >   #include "../../lib/kstrtox.h"
> > >
> > > @@ -312,6 +318,78 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
> > >     preempt_enable();
> > >   }
> > >
> > > +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
> > > +    size)
> > > +{
> > > +   const char *name = "/proc/self/ns/pid";
> >
> > maybe rename this variable to pidns_path?
> >
> > > +   struct pid_namespace *pidns = NULL;
> > > +   struct filename *tmp = NULL;
> >
> > Maybe rename this variable to name?
> >
> > > +   int len = strlen(name) + 1;
> >
> > We can delay this assignment later until it is needed.
> >
> > > +   struct inode *inode;
> > > +   struct path kp;
> > > +   pid_t tgid = 0;
> > > +   pid_t pid = 0;
> > > +   int ret;
> > > +
> > > +   if (unlikely(size != sizeof(struct bpf_pidns_info)))
> > > +           return -EINVAL;
> > > +
> > > +   pidns = task_active_pid_ns(current);
> > > +
> >
> > we can save an empty line here.
> >
> > > +   if (unlikely(!pidns))
> > > +           goto clear;
> > > +
> > > +   pidns_info->nsid =  pidns->ns.inum;
> > > +   pid = task_pid_nr_ns(current, pidns);
> > > +
> >
> > We can save an empty line here.
> >
> > > +   if (unlikely(!pid))
> > > +           goto clear;
> > > +
> > > +   tgid = task_tgid_nr_ns(current, pidns);
> > > +
> > ditto. save an empty line.
> > > +   if (unlikely(!tgid))
> > > +           goto clear;
> > > +
> > > +   pidns_info->tgid = (u32) tgid;
> > > +   pidns_info->pid = (u32) pid;
> > > +
> > > +   tmp = kmem_cache_alloc(names_cachep, GFP_ATOMIC);
> > > +   if (unlikely(!tmp)) {
> > > +           memset((void *)pidns_info, 0, (size_t) size);
> > > +           return -ENOMEM;
> > > +   }
> > > +
> > > +   memcpy((char *)tmp->name, name, len);
> > > +   tmp->uptr = NULL;
> > > +   tmp->aname = NULL;
> > > +   tmp->refcnt = 1;
> > > +
> > ditto. save an empty line.
> > > +   ret = filename_lookup(AT_FDCWD, tmp, 0, &kp, NULL);
> > > +
> > ditto. save an empty line.
> > > +   if (ret) {
> > > +           memset((void *)pidns_info, 0, (size_t) size);
> > > +           return ret;
> > > +   }
> > > +
> > > +   inode = d_backing_inode(kp.dentry);
> > > +   pidns_info->dev = inode->i_sb->s_dev;
> > > +
> > > +   return 0;
> > > +
> > > +clear:
> > > +   memset((void *)pidns_info, 0, (size_t) size);
> > > +
> > save an empty line.
> > > +   return -EINVAL;
> > > +}
> > > +
> > > +const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
> > > +   .func   = bpf_get_current_pidns_info,
> > make the "= " aligned with others?
> > > +   .gpl_only       = false,
> > > +   .ret_type       = RET_INTEGER,
> > > +   .arg1_type      = ARG_PTR_TO_UNINIT_MEM,
> > > +   .arg2_type      = ARG_CONST_SIZE,
> > > +};
> > > +
> > >   #ifdef CONFIG_CGROUPS
> > >   BPF_CALL_0(bpf_get_current_cgroup_id)
> > >   {
> > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > > index ca1255d14576..5e1dc22765a5 100644
> > > --- a/kernel/trace/bpf_trace.c
> > > +++ b/kernel/trace/bpf_trace.c
> > > @@ -709,6 +709,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > >   #endif
> > >     case BPF_FUNC_send_signal:
> > >             return &bpf_send_signal_proto;
> > > +   case BPF_FUNC_get_current_pidns_info:
> > > +           return &bpf_get_current_pidns_info_proto;
> > >     default:
> > >             return NULL;
> > >     }
> > > diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> > > index 1d9be26b4edd..238453ff27d2 100644
> > > --- a/samples/bpf/Makefile
> > > +++ b/samples/bpf/Makefile
> > > @@ -53,6 +53,7 @@ hostprogs-y += task_fd_query
> > >   hostprogs-y += xdp_sample_pkts
> > >   hostprogs-y += ibumad
> > >   hostprogs-y += hbm
> > > +hostprogs-y += trace_ns_info
> > [...]

^ permalink raw reply

* Re: [PATCH net] net: phy: rtl8211f: do a double read to get real time link status
From: Andrew Lunn @ 2019-08-08 19:40 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: Yonglong Liu, davem, netdev, linux-kernel, linuxarm, salil.mehta,
	yisen.zhuang, shiju.jose
In-Reply-To: <b1140603-f05b-2373-445f-c1d7a43ff012@gmail.com>

> @@ -568,6 +568,11 @@ int phy_start_aneg(struct phy_device *phydev)
>  	if (err < 0)
>  		goto out_unlock;
>  
> +	/* The PHY may not yet have cleared aneg-completed and link-up bit
> +	 * w/o this delay when the following read is done.
> +	 */
> +	usleep_range(1000, 2000);
> +

Hi Heiner

Does 802.3 C22 say anything about this?

If this PHY is broken with respect to the standard, i would prefer the
workaround is in the PHY specific driver code, not generic core code.

	   Andrew

^ permalink raw reply

* Re: [PATCH v2 02/15] net: phy: adin: hook genphy_read_abilities() to get_features
From: Heiner Kallweit @ 2019-08-08 19:32 UTC (permalink / raw)
  To: Andrew Lunn, Alexandru Ardelean
  Cc: netdev, devicetree, linux-kernel, davem, robh+dt, mark.rutland,
	f.fainelli
In-Reply-To: <20190808152403.GB27917@lunn.ch>

On 08.08.2019 17:24, Andrew Lunn wrote:
> On Thu, Aug 08, 2019 at 03:30:13PM +0300, Alexandru Ardelean wrote:
>> The ADIN PHYs can operate with Clause 45, however they are not typical for
>> how phylib considers Clause 45 PHYs.
>>
>> If the `features` field & the `get_features` hook are unspecified, and the
>> device wants to operate via Clause 45, it would also try to read features
>> via the `genphy_c45_pma_read_abilities()`, which will try to read PMA regs
>> that are unsupported.
>>
>> Hooking the `genphy_read_abilities()` function to the `get_features` hook
>> will ensure that this does not happen and the PHY features are read
>> correctly regardless of Clause 22 or Clause 45 operation.
> 
> I think we need to stop and think about a PHY which supports both C22
> and C45.
> 
> How does bus enumeration work? Is it discovered twice?  I've always
> considered phydev->is_c45 means everything is c45, not that some
> registers can be accessed via c45. But the driver is mixing c22 and
> c45. Does the driver actually require c45? Are some features which are
> only accessibly via C45? What does C45 actually bring us for this
> device?
> 
genphy_c45_pma_read_abilities() is only called if phydev->is_c45 is set.
And this flag means that the PHY complies with Clause 45 incl. all the
standard devices like PMA. In the case here only some vendor-specific
registers can be accessed via Clause 45 and therefore is_c45 shouldn't
bet set. As a consequence this patch isn't needed.

>      Andrew
> 
Heiner


^ permalink raw reply

* Re: [PATCH v2 13/15] net: phy: adin: configure downshift on config_init
From: Heiner Kallweit @ 2019-08-08 19:38 UTC (permalink / raw)
  To: Alexandru Ardelean, netdev, devicetree, linux-kernel
  Cc: davem, robh+dt, mark.rutland, f.fainelli, andrew
In-Reply-To: <20190808123026.17382-14-alexandru.ardelean@analog.com>

On 08.08.2019 14:30, Alexandru Ardelean wrote:
> Down-speed auto-negotiation may not always be enabled, in which case the
> PHY won't down-shift to 100 or 10 during auto-negotiation.
> 
> This change enables downshift and configures the number of retries to
> default 8 (maximum supported value).
> 
> The change has been adapted from the Marvell PHY driver.
> 
Instead of a fixed downshift setting (like in the Marvell driver) you
may consider to implement the ethtool phy-tunable ETHTOOL_PHY_DOWNSHIFT.
See the Aquantia PHY driver for an example.
Then the user can configure whether he wants downshift and if yes after
how many retries.

> Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
> ---
>  drivers/net/phy/adin.c | 39 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
[...]

Heiner

^ permalink raw reply

* Re: [PATCH net-next 3/3] net: phy: realtek: add support for the 2.5Gbps PHY in RTL8125
From: Andrew Lunn @ 2019-08-08 19:37 UTC (permalink / raw)
  To: Heiner Kallweit; +Cc: Florian Fainelli, David Miller, netdev@vger.kernel.org
In-Reply-To: <64769c3d-42b6-8eb8-26e4-722869408986@gmail.com>

On Thu, Aug 08, 2019 at 09:05:54PM +0200, Heiner Kallweit wrote:
> This adds support for the integrated 2.5Gbps PHY in Realtek RTL8125.
> Advertisement of 2.5Gbps mode is done via a vendor-specific register.
> Same applies to reading NBase-T link partner advertisement.
> Unfortunately this 2.5Gbps PHY shares the PHY ID with the integrated
> 1Gbps PHY's in other Realtek network chips and so far no method is
> known to differentiate them.

That is not nice.

Do you have any contacts in Realtek who can provide us with
information? Maybe there is another undocumented vendor specific
register?

	Andrew

^ permalink raw reply

* Re: [PATCH net-next 1/3] net: phy: prepare phylib to deal with PHY's extending Clause 22
From: Andrew Lunn @ 2019-08-08 19:32 UTC (permalink / raw)
  To: Heiner Kallweit; +Cc: Florian Fainelli, David Miller, netdev@vger.kernel.org
In-Reply-To: <214bedc0-4ae0-b15f-e03c-173f17527417@gmail.com>

On Thu, Aug 08, 2019 at 09:03:42PM +0200, Heiner Kallweit wrote:
> The integrated PHY in 2.5Gbps chip RTL8125 is the first (known to me)
> PHY that uses standard Clause 22 for all modes up to 1Gbps and adds
> 2.5Gbps control using vendor-specific registers. To use phylib for
> the standard part little extensions are needed:
> - Move most of genphy_config_aneg to a new function
>   __genphy_config_aneg that takes a parameter whether restarting
>   auto-negotiation is needed (depending on whether content of
>   vendor-specific advertisement register changed).
> - Don't clear phydev->lp_advertising in genphy_read_status so that
>   we can set non-C22 mode flags before.
> 
> Basically both changes mimic the behavior of the equivalent Clause 45
> functions.
> 
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
> ---
>  drivers/net/phy/phy_device.c | 35 +++++++++++++++--------------------
>  include/linux/phy.h          |  8 +++++++-
>  2 files changed, 22 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index 7ddd91df9..bd7e7db8c 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -1571,11 +1571,9 @@ static int genphy_config_advert(struct phy_device *phydev)
>  	/* Only allow advertising what this PHY supports */
>  	linkmode_and(phydev->advertising, phydev->advertising,
>  		     phydev->supported);
> -	if (!ethtool_convert_link_mode_to_legacy_u32(&advertise,
> -						     phydev->advertising))
> -		phydev_warn(phydev, "PHY advertising (%*pb) more modes than genphy supports, some modes not advertised.\n",
> -			    __ETHTOOL_LINK_MODE_MASK_NBITS,
> -			    phydev->advertising);
> +
> +	ethtool_convert_link_mode_to_legacy_u32(&advertise,
> +						phydev->advertising);

Hi Heiner

linkmode_adv_to_mii_adv_t() would remove the need to use
ethtool_convert_link_mode_to_legacy_u32(), and this warning would also
go away. 

   Andrew

^ permalink raw reply

* Re: [PATCH bpf-next v5 0/6] xdp: Add devmap_hash map type
From: Toke Høiland-Jørgensen @ 2019-08-08 19:30 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Alexei Starovoitov, Network Development,
	David Miller, Jesper Dangaard Brouer, Jakub Kicinski,
	Björn Töpel, Yonghong Song
In-Reply-To: <CAADnVQJpYeQ68V5BE2r3BhbraBh7G8dSd8zknFUJxtW4GwNkuA@mail.gmail.com>

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Fri, Jul 26, 2019 at 9:06 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> This series adds a new map type, devmap_hash, that works like the existing
>> devmap type, but using a hash-based indexing scheme. This is useful for the use
>> case where a devmap is indexed by ifindex (for instance for use with the routing
>> table lookup helper). For this use case, the regular devmap needs to be sized
>> after the maximum ifindex number, not the number of devices in it. A hash-based
>> indexing scheme makes it possible to size the map after the number of devices it
>> should contain instead.
>>
>> This was previously part of my patch series that also turned the regular
>> bpf_redirect() helper into a map-based one; for this series I just pulled out
>> the patches that introduced the new map type.
>>
>> Changelog:
>>
>> v5:
>>
>> - Dynamically set the number of hash buckets by rounding up max_entries to the
>>   nearest power of two (mirroring the regular hashmap), as suggested by Jesper.
>
> fyi I'm waiting for Jesper to review this new version.

Ping Jesper? :)

-Toke

^ permalink raw reply

* Re: [v3,3/4] tools: bpftool: add bash-completion for net attach/detach
From: Daniel T. Lee @ 2019-08-08 19:28 UTC (permalink / raw)
  To: Quentin Monnet; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <5cd88036-8682-8d26-b795-caf96e1e883f@netronome.com>

On Fri, Aug 9, 2019 at 1:48 AM Quentin Monnet
<quentin.monnet@netronome.com> wrote:
>
> 2019-08-07 11:25 UTC+0900 ~ Daniel T. Lee <danieltimlee@gmail.com>
> > This commit adds bash-completion for new "net attach/detach"
> > subcommand for attaching XDP program on interface.
> >
> > Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
> > ---
> >  tools/bpf/bpftool/bash-completion/bpftool | 64 +++++++++++++++++++----
> >  1 file changed, 55 insertions(+), 9 deletions(-)
> >
> > diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
> > index c8f42e1fcbc9..1d81cb09d478 100644
> > --- a/tools/bpf/bpftool/bash-completion/bpftool
> > +++ b/tools/bpf/bpftool/bash-completion/bpftool
> > @@ -201,6 +201,10 @@ _bpftool()
> >              _bpftool_get_prog_tags
> >              return 0
> >              ;;
> > +        dev)
> > +            _sysfs_get_netdevs
> > +            return 0
> > +            ;;
>
> Makes sense to have this for "dev", thanks! But it seems you missed one
> place where it was used, for "bpftool feature probe" (We have "[[ $prev
> == "dev" ]] && _sysfs_get_netdevs && return 0"). Could you also remove
> that one please?
>
> Other than this looks good, thanks:
>
> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>

My bad. Thanks for letting me know.
I'll update it with the next version of patch.

Thank you for your review.
I really appreciate it.

^ permalink raw reply

* Re: [PATCH net] net: phy: rtl8211f: do a double read to get real time link status
From: Heiner Kallweit @ 2019-08-08 19:26 UTC (permalink / raw)
  To: Yonglong Liu, davem, andrew
  Cc: netdev, linux-kernel, linuxarm, salil.mehta, yisen.zhuang,
	shiju.jose
In-Reply-To: <c15f820b-cc80-9a93-4c48-1b60bc14f73a@huawei.com>

On 08.08.2019 08:21, Yonglong Liu wrote:
> 
> 
> On 2019/8/8 14:11, Heiner Kallweit wrote:
>> On 08.08.2019 03:15, Yonglong Liu wrote:
>>>
>>>
>>> On 2019/8/8 0:47, Heiner Kallweit wrote:
>>>> On 07.08.2019 15:16, Yonglong Liu wrote:
>>>>> [   27.232781] hns3 0000:bd:00.3 eth7: net open
>>>>> [   27.237303] 8021q: adding VLAN 0 to HW filter on device eth7
>>>>> [   27.242972] IPv6: ADDRCONF(NETDEV_CHANGE): eth7: link becomes ready
>>>>> [   27.244449] hns3 0000:bd:00.3: invalid speed (-1)
>>>>> [   27.253904] hns3 0000:bd:00.3 eth7: failed to adjust link.
>>>>> [   27.259379] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change UP -> RUNNING
>>>>> [   27.924903] hns3 0000:bd:00.3 eth7: link up
>>>>> [   28.280479] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change RUNNING -> NOLINK
>>>>> [   29.208452] hns3 0000:bd:00.3 eth7: link down
>>>>> [   32.376745] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change NOLINK -> RUNNING
>>>>> [   33.208448] hns3 0000:bd:00.3 eth7: link up
>>>>> [   35.253821] hns3 0000:bd:00.3 eth7: net stop
>>>>> [   35.258270] hns3 0000:bd:00.3 eth7: link down
>>>>>
>>>>> When using rtl8211f in polling mode, may get a invalid speed,
>>>>> because of reading a fake link up and autoneg complete status
>>>>> immediately after starting autoneg:
>>>>>
>>>>>         ifconfig-1176  [007] ....    27.232763: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x00 val:0x1040
>>>>>   kworker/u257:1-670   [015] ....    27.232805: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x04 val:0x01e1
>>>>>   kworker/u257:1-670   [015] ....    27.232815: mdio_access: mii-0000:bd:00.3 write phy:0x07 reg:0x04 val:0x05e1
>>>>>   kworker/u257:1-670   [015] ....    27.232869: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x79ad
>>>>>   kworker/u257:1-670   [015] ....    27.232904: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x09 val:0x0200
>>>>>   kworker/u257:1-670   [015] ....    27.232940: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x00 val:0x1040
>>>>>   kworker/u257:1-670   [015] ....    27.232949: mdio_access: mii-0000:bd:00.3 write phy:0x07 reg:0x00 val:0x1240
>>>>>   kworker/u257:1-670   [015] ....    27.233003: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x79ad
>>>>>   kworker/u257:1-670   [015] ....    27.233039: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x0a val:0x3002
>>>>>   kworker/u257:1-670   [015] ....    27.233074: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x09 val:0x0200
>>>>>   kworker/u257:1-670   [015] ....    27.233110: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x05 val:0x0000
>>>>>   kworker/u257:1-670   [000] ....    28.280475: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x7989
>>>>>   kworker/u257:1-670   [000] ....    29.304471: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x7989
>>>>>
>>>>> According to the datasheet of rtl8211f, to get the real time
>>>>> link status, need to read MII_BMSR twice.
>>>>>
>>>>> This patch add a read_status hook for rtl8211f, and do a fake
>>>>> phy_read before genphy_read_status(), so that can get real link
>>>>> status in genphy_read_status().
>>>>>
>>>>> Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
>>>>> ---
>>>>>  drivers/net/phy/realtek.c | 13 +++++++++++++
>>>>>  1 file changed, 13 insertions(+)
>>>>>
>>>> Is this an accidental resubmit? Because we discussed this in
>>>> https://marc.info/?t=156413509900003&r=1&w=2 and a fix has
>>>> been applied already.
>>>>
>>>> Heiner
>>>>
>>>> .
>>>>
>>>
>>> In https://marc.info/?t=156413509900003&r=1&w=2 , the invalid speed
>>> recurrence rate is almost 100%, and I had test the solution about
>>> 5 times and it works. But yesterday it happen again suddenly, and than
>>> I fount that the recurrence rate reduce to 10%. This time we get 0x79ad
>>> after autoneg started which is not 0x798d from last discussion.
>>>
>>>
>>>
>> OK, I'll have a look.
>> However the approach is wrong. The double read is related to the latching
>> of link-down events. This is done by all PHY's and not specific to RT8211F.
>> Also it's not related to the problem. I assume any sufficient delay would
>> do instead of the read.
>>
>> .
>>
> 
> So you will send a new patch to fix this problem? I am waiting for it,
> and can do a full test this time.
> 
> 
Can you try the following? This delay should give thy PHY enough time
to clear both bits before the following read is done.

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index ef7aa738e..32f327a44 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -568,6 +568,11 @@ int phy_start_aneg(struct phy_device *phydev)
 	if (err < 0)
 		goto out_unlock;
 
+	/* The PHY may not yet have cleared aneg-completed and link-up bit
+	 * w/o this delay when the following read is done.
+	 */
+	usleep_range(1000, 2000);
+
 	if (phy_is_started(phydev))
 		err = phy_check_link_status(phydev);
 out_unlock:
-- 
2.22.0



^ permalink raw reply related

* Re: [v3,1/4] tools: bpftool: add net attach command to attach XDP on interface
From: Daniel T. Lee @ 2019-08-08 19:22 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <20190808104948.7e72dea0@cakuba.netronome.com>

On Fri, Aug 9, 2019 at 2:50 AM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> On Thu, 8 Aug 2019 07:15:22 +0900, Daniel T. Lee wrote:
> > > > +             return -EINVAL;
> > > > +     }
> > > > +
> > > > +     NEXT_ARG();
> > >
> > > nit: the new line should be before NEXT_ARG(), IOV NEXT_ARG() belongs
> > > to the code which consumed the argument
> > >
> >
> > I'm not sure I'm following.
> > Are you saying that, at here the newline shouldn't be necessary?
>
> I mean this is better:
>
>         if (!is_prefix(*argv, "bla-bla"))
>                 return -EINVAL;
>         NEXT_ARG();
>
>         if (!is_prefix(*argv, "bla-bla"))
>                 return -EINVAL;
>         NEXT_ARG();
>
> Than this:
>
>         if (!is_prefix(*argv, "bla-bla"))
>                 return -EINVAL;
>
>         NEXT_ARG();
>         if (!is_prefix(*argv, "bla-bla"))
>                 return -EINVAL;
>
>         NEXT_ARG();
>
> Because the NEXT_ARG() "belongs" to the code that "consumed" the option.
>
> So instead of this:
>
>      attach_type = parse_attach_type(*argv);
>      if (attach_type == max_net_attach_type) {
>              p_err("invalid net attach/detach type");
>              return -EINVAL;
>      }
>
>      NEXT_ARG();
>      progfd = prog_parse_fd(&argc, &argv);
>      if (progfd < 0)
>              return -EINVAL;
>
> This seems more logical to me:
>
>      attach_type = parse_attach_type(*argv);
>      if (attach_type == max_net_attach_type) {
>              p_err("invalid net attach/detach type");
>              return -EINVAL;
>      }
>      NEXT_ARG();
>
>      progfd = prog_parse_fd(&argc, &argv);
>      if (progfd < 0)
>              return -EINVAL;

Oh. I see.
I'll update NEXT_ARG line stick to the code which "consumes" the option.

Thanks for the review! :)

^ permalink raw reply

* Re: [PATCH net v3] net/tls: prevent skb_orphan() from leaking TLS plain text with offload
From: Boris Pismenny @ 2019-08-08 19:18 UTC (permalink / raw)
  To: Jakub Kicinski, davem@davemloft.net
  Cc: netdev@vger.kernel.org, davejwatson@fb.com, Aviad Yehezkel,
	john.fastabend@gmail.com, daniel@iogearbox.net,
	willemb@google.com, edumazet@google.com,
	alexei.starovoitov@gmail.com, oss-drivers@netronome.com
In-Reply-To: <20190808000359.20785-1-jakub.kicinski@netronome.com>



On 08/08/2019 3:03, Jakub Kicinski wrote:
> sk_validate_xmit_skb() and drivers depend on the sk member of
> struct sk_buff to identify segments requiring encryption.
> Any operation which removes or does not preserve the original TLS
> socket such as skb_orphan() or skb_clone() will cause clear text
> leaks.
> 
> Make the TCP socket underlying an offloaded TLS connection
> mark all skbs as decrypted, if TLS TX is in offload mode.
> Then in sk_validate_xmit_skb() catch skbs which have no socket
> (or a socket with no validation) and decrypted flag set.
> 
> Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
> sk->sk_validate_xmit_skb are slightly interchangeable right now,
> they all imply TLS offload. The new checks are guarded by
> CONFIG_TLS_DEVICE because that's the option guarding the
> sk_buff->decrypted member.
> 
> Second, smaller issue with orphaning is that it breaks
> the guarantee that packets will be delivered to device
> queues in-order. All TLS offload drivers depend on that
> scheduling property. This means skb_orphan_partial()'s
> trick of preserving partial socket references will cause
> issues in the drivers. We need a full orphan, and as a
> result netem delay/throttling will cause all TLS offload
> skbs to be dropped.
> 
> Reusing the sk_buff->decrypted flag also protects from
> leaking clear text when incoming, decrypted skb is redirected
> (e.g. by TC).
> 
> See commit 0608c69c9a80 ("bpf: sk_msg, sock{map|hash} redirect
> through ULP") for justification why the internal flag is safe.
> The only location which could leak the flag in is tcp_bpf_sendmsg(),
> which is taken care of by clearing the previously unused bit.
> 
> v2:
>  - remove superfluous decrypted mark copy (Willem);
>  - remove the stale doc entry (Boris);
>  - rely entirely on EOR marking to prevent coalescing (Boris);
>  - use an internal sendpages flag instead of marking the socket
>    (Boris).
> v3 (Willem):
>  - reorganize the can_skb_orphan_partial() condition;
>  - fix the flag leak-in through tcp_bpf_sendmsg.
> 
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

LGTM

Reviewed-by: Boris Pismenny <borisp@mellanox.com>

^ permalink raw reply

* Re: [bpf-next v3 PATCH 3/3] samples/bpf: xdp_fwd explain bpf_fib_lookup return codes
From: Toke Høiland-Jørgensen @ 2019-08-08 19:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov
  Cc: a.s.protopopov, dsahern, ys114321, Jesper Dangaard Brouer
In-Reply-To: <156528106777.22124.12162740342925045912.stgit@firesoul>

Jesper Dangaard Brouer <brouer@redhat.com> writes:

> Make it clear that this XDP program depend on the network
> stack to do the ARP resolution.  This is connected with the
> BPF_FIB_LKUP_RET_NO_NEIGH return code from bpf_fib_lookup().
>
> Another common mistake (seen via XDP-tutorial) is that users
> don't realize that sysctl net.ipv{4,6}.conf.all.forwarding
> setting is honored by bpf_fib_lookup.
>
> Reported-by: Anton Protopopov <a.s.protopopov@gmail.com>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Reviewed-by: David Ahern <dsahern@gmail.com>
> Acked-by: Yonghong Song <yhs@fb.com>

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>

^ permalink raw reply

* Re: [bonding][patch] Regarding a bonding lacp issue
From: Jay Vosburgh @ 2019-08-08 19:17 UTC (permalink / raw)
  To: Felix; +Cc: vfalico, andy, netdev, linux-kernel
In-Reply-To: <8799b243-36da-4baf-8c67-aeb5f978c34f.fei.feng@linux.alibaba.com>

Felix <fei.feng@linux.alibaba.com> wrote:

>Dear Mainteners,
>
>Recently I hit a packet drop issue in bonding driver on Linux 4.9. Please
>see details below. Please take a look to see if my understanding is
>correct. Many thanks.
>
>What is the problem?
>The bonding driver starts to send packets even if the Partner(Switch)'s
>Collecting bit is not enabled yet. Partner would drop all packets until
>its Collecting bit is enabled.
>
>What is the root cuase?
>According to LACP spec, the Actor need to check Partner's Sync and
>Collecting bits before enable its Distributing bit and Distributing
>function. Please see the PIC below.

	The diagram you reference is found in 802.1AX-2014 figure 6-21,
which shows the state diagram for an independent control implementation,
i.e., collecting and distributing are managed independently.

	However, Linux bonding implements coupled control, which is
shown in figure 6-22.  Here, there is no Partner.Collecting requirement
on the state transition from ATTACHED to COLLECTING_DISTRIBUTING.

	To quote 802.1AX-2014 6.4.15:

	As independent control is not possible, the coupled control
	state machine does not wait for the Partner to signal that
	collection has started before enabling both collection and
	distribution.

	Now, that said, I agree that what you're seeing is likely
explained by this behavior, and your fix should resolve the immediate
problem (that bonding sends packets before the peer has enabled
COLLECTING).

	However, your fix does put bonding out of compliance with the
standard, as it does not really implement COLLECTING and DISTRIBUTING as
discrete states.  In particular, if the peer in your case were to later
clear Partner.Collecting, bonding will not react to this as a figure
6-21 independent control implementation would (which isn't a change from
current behavior, but currently this isn't expected).

	So, in my opinion a patch like this should have a comment
attached noting that we are deliberately not in compliance with the
standard in this specific situation.  The proper fix is to implement
figure 6-21 separate state.

	Lastly, are you able to test and generate a patch against
current upstream, instead of 4.9?

	-J

>How to fix?
>Please see the diff as following. And the patch is attached.
>
>--- ../origin/linux-4.9.188/drivers/net/bonding/bond_3ad.c 2019-08-07
>00:29:42.000000000 +0800
>+++ drivers/net/bonding/bond_3ad.c 2019-08-08 23:13:29.015640197 +0800
>@@ -937,6 +937,7 @@
>     */
>    if ((port->sm_vars & AD_PORT_SELECTED) &&
>        (port->partner_oper.port_state & AD_STATE_SYNCHRONIZATION) &&
>+       (port->partner_oper.port_state & AD_STATE_COLLECTING) &&
>        !__check_agg_selection_timer(port)) {
>     if (port->aggregator->is_active)
>      port->sm_mux_state =
>
>------
>Thanks,
>Felix

---
	-Jay Vosburgh, jay.vosburgh@canonical.com

^ permalink raw reply

* Re: [bpf-next v3 PATCH 2/3] samples/bpf: make xdp_fwd more practically usable via devmap lookup
From: Toke Høiland-Jørgensen @ 2019-08-08 19:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov
  Cc: a.s.protopopov, dsahern, ys114321, Jesper Dangaard Brouer
In-Reply-To: <156528106270.22124.2563148023961869582.stgit@firesoul>

Jesper Dangaard Brouer <brouer@redhat.com> writes:

> This address the TODO in samples/bpf/xdp_fwd_kern.c, which points out
> that the chosen egress index should be checked for existence in the
> devmap. This can now be done via taking advantage of Toke's work in
> commit 0cdbb4b09a06 ("devmap: Allow map lookups from eBPF").
>
> This change makes xdp_fwd more practically usable, as this allows for
> a mixed environment, where IP-forwarding fallback to network stack, if
> the egress device isn't configured to use XDP.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Reviewed-by: David Ahern <dsahern@gmail.com>

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>

^ permalink raw reply

* Re: [bpf-next v3 PATCH 1/3] samples/bpf: xdp_fwd rename devmap name to be xdp_tx_ports
From: Toke Høiland-Jørgensen @ 2019-08-08 19:16 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov
  Cc: a.s.protopopov, dsahern, ys114321, Jesper Dangaard Brouer
In-Reply-To: <156528105763.22124.16079929264188823516.stgit@firesoul>

Jesper Dangaard Brouer <brouer@redhat.com> writes:

> The devmap name 'tx_port' came from a copy-paste from xdp_redirect_map
> which only have a single TX port. Change name to xdp_tx_ports
> to make it more descriptive.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Reviewed-by: David Ahern <dsahern@gmail.com>
> Acked-by: Yonghong Song <yhs@fb.com>


Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>

^ permalink raw reply

* [PATCH net-next 2/3] net: phy: add phy_modify_paged_changed
From: Heiner Kallweit @ 2019-08-08 19:04 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <ddbf28b9-f32e-7399-10a6-27b79ca0aaf9@gmail.com>

Add helper function phy_modify_paged_changed, behavior is the same
as for phy_modify_changed.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/phy/phy-core.c | 29 ++++++++++++++++++++++++-----
 include/linux/phy.h        |  2 ++
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/drivers/net/phy/phy-core.c b/drivers/net/phy/phy-core.c
index 16667fbac..9ae3abb2d 100644
--- a/drivers/net/phy/phy-core.c
+++ b/drivers/net/phy/phy-core.c
@@ -783,24 +783,43 @@ int phy_write_paged(struct phy_device *phydev, int page, u32 regnum, u16 val)
 EXPORT_SYMBOL(phy_write_paged);
 
 /**
- * phy_modify_paged() - Convenience function for modifying a paged register
+ * phy_modify_paged_changed() - Function for modifying a paged register
  * @phydev: a pointer to a &struct phy_device
  * @page: the page for the phy
  * @regnum: register number
  * @mask: bit mask of bits to clear
  * @set: bit mask of bits to set
  *
- * Same rules as for phy_read() and phy_write().
+ * Returns negative errno, 0 if there was no change, and 1 in case of change
  */
-int phy_modify_paged(struct phy_device *phydev, int page, u32 regnum,
-		     u16 mask, u16 set)
+int phy_modify_paged_changed(struct phy_device *phydev, int page, u32 regnum,
+			     u16 mask, u16 set)
 {
 	int ret = 0, oldpage;
 
 	oldpage = phy_select_page(phydev, page);
 	if (oldpage >= 0)
-		ret = __phy_modify(phydev, regnum, mask, set);
+		ret = __phy_modify_changed(phydev, regnum, mask, set);
 
 	return phy_restore_page(phydev, oldpage, ret);
 }
+EXPORT_SYMBOL(phy_modify_paged_changed);
+
+/**
+ * phy_modify_paged() - Convenience function for modifying a paged register
+ * @phydev: a pointer to a &struct phy_device
+ * @page: the page for the phy
+ * @regnum: register number
+ * @mask: bit mask of bits to clear
+ * @set: bit mask of bits to set
+ *
+ * Same rules as for phy_read() and phy_write().
+ */
+int phy_modify_paged(struct phy_device *phydev, int page, u32 regnum,
+		     u16 mask, u16 set)
+{
+	int ret = phy_modify_paged_changed(phydev, page, regnum, mask, set);
+
+	return ret < 0 ? ret : 0;
+}
 EXPORT_SYMBOL(phy_modify_paged);
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 7117825ee..781f4810c 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -984,6 +984,8 @@ int phy_select_page(struct phy_device *phydev, int page);
 int phy_restore_page(struct phy_device *phydev, int oldpage, int ret);
 int phy_read_paged(struct phy_device *phydev, int page, u32 regnum);
 int phy_write_paged(struct phy_device *phydev, int page, u32 regnum, u16 val);
+int phy_modify_paged_changed(struct phy_device *phydev, int page, u32 regnum,
+			     u16 mask, u16 set);
 int phy_modify_paged(struct phy_device *phydev, int page, u32 regnum,
 		     u16 mask, u16 set);
 
-- 
2.22.0



^ permalink raw reply related

* [PATCH net-next 1/3] net: phy: prepare phylib to deal with PHY's extending Clause 22
From: Heiner Kallweit @ 2019-08-08 19:03 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <ddbf28b9-f32e-7399-10a6-27b79ca0aaf9@gmail.com>

The integrated PHY in 2.5Gbps chip RTL8125 is the first (known to me)
PHY that uses standard Clause 22 for all modes up to 1Gbps and adds
2.5Gbps control using vendor-specific registers. To use phylib for
the standard part little extensions are needed:
- Move most of genphy_config_aneg to a new function
  __genphy_config_aneg that takes a parameter whether restarting
  auto-negotiation is needed (depending on whether content of
  vendor-specific advertisement register changed).
- Don't clear phydev->lp_advertising in genphy_read_status so that
  we can set non-C22 mode flags before.

Basically both changes mimic the behavior of the equivalent Clause 45
functions.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/phy/phy_device.c | 35 +++++++++++++++--------------------
 include/linux/phy.h          |  8 +++++++-
 2 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 7ddd91df9..bd7e7db8c 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1571,11 +1571,9 @@ static int genphy_config_advert(struct phy_device *phydev)
 	/* Only allow advertising what this PHY supports */
 	linkmode_and(phydev->advertising, phydev->advertising,
 		     phydev->supported);
-	if (!ethtool_convert_link_mode_to_legacy_u32(&advertise,
-						     phydev->advertising))
-		phydev_warn(phydev, "PHY advertising (%*pb) more modes than genphy supports, some modes not advertised.\n",
-			    __ETHTOOL_LINK_MODE_MASK_NBITS,
-			    phydev->advertising);
+
+	ethtool_convert_link_mode_to_legacy_u32(&advertise,
+						phydev->advertising);
 
 	/* Setup standard advertisement */
 	err = phy_modify_changed(phydev, MII_ADVERTISE,
@@ -1681,18 +1679,20 @@ int genphy_restart_aneg(struct phy_device *phydev)
 EXPORT_SYMBOL(genphy_restart_aneg);
 
 /**
- * genphy_config_aneg - restart auto-negotiation or write BMCR
+ * __genphy_config_aneg - restart auto-negotiation or write BMCR
  * @phydev: target phy_device struct
+ * @changed: whether autoneg is requested
  *
  * Description: If auto-negotiation is enabled, we configure the
  *   advertising, and then restart auto-negotiation.  If it is not
  *   enabled, then we write the BMCR.
  */
-int genphy_config_aneg(struct phy_device *phydev)
+int __genphy_config_aneg(struct phy_device *phydev, bool changed)
 {
-	int err, changed;
+	int err;
 
-	changed = genphy_config_eee_advert(phydev);
+	if (genphy_config_eee_advert(phydev))
+		changed = true;
 
 	if (AUTONEG_ENABLE != phydev->autoneg)
 		return genphy_setup_forced(phydev);
@@ -1700,10 +1700,10 @@ int genphy_config_aneg(struct phy_device *phydev)
 	err = genphy_config_advert(phydev);
 	if (err < 0) /* error */
 		return err;
+	else if (err)
+		changed = true;
 
-	changed |= err;
-
-	if (changed == 0) {
+	if (!changed) {
 		/* Advertisement hasn't changed, but maybe aneg was never on to
 		 * begin with?  Or maybe phy was isolated?
 		 */
@@ -1713,18 +1713,15 @@ int genphy_config_aneg(struct phy_device *phydev)
 			return ctl;
 
 		if (!(ctl & BMCR_ANENABLE) || (ctl & BMCR_ISOLATE))
-			changed = 1; /* do restart aneg */
+			changed = true; /* do restart aneg */
 	}
 
 	/* Only restart aneg if we are advertising something different
 	 * than we were before.
 	 */
-	if (changed > 0)
-		return genphy_restart_aneg(phydev);
-
-	return 0;
+	return changed ? genphy_restart_aneg(phydev) : 0;
 }
-EXPORT_SYMBOL(genphy_config_aneg);
+EXPORT_SYMBOL(__genphy_config_aneg);
 
 /**
  * genphy_aneg_done - return auto-negotiation status
@@ -1811,8 +1808,6 @@ int genphy_read_status(struct phy_device *phydev)
 	phydev->pause = 0;
 	phydev->asym_pause = 0;
 
-	linkmode_zero(phydev->lp_advertising);
-
 	if (phydev->autoneg == AUTONEG_ENABLE && phydev->autoneg_complete) {
 		if (phydev->is_gigabit_capable) {
 			lpagb = phy_read(phydev, MII_STAT1000);
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 462b90b73..7117825ee 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -1069,7 +1069,7 @@ int genphy_read_abilities(struct phy_device *phydev);
 int genphy_setup_forced(struct phy_device *phydev);
 int genphy_restart_aneg(struct phy_device *phydev);
 int genphy_config_eee_advert(struct phy_device *phydev);
-int genphy_config_aneg(struct phy_device *phydev);
+int __genphy_config_aneg(struct phy_device *phydev, bool changed);
 int genphy_aneg_done(struct phy_device *phydev);
 int genphy_update_link(struct phy_device *phydev);
 int genphy_read_status(struct phy_device *phydev);
@@ -1077,6 +1077,12 @@ int genphy_suspend(struct phy_device *phydev);
 int genphy_resume(struct phy_device *phydev);
 int genphy_loopback(struct phy_device *phydev, bool enable);
 int genphy_soft_reset(struct phy_device *phydev);
+
+static inline int genphy_config_aneg(struct phy_device *phydev)
+{
+	return __genphy_config_aneg(phydev, false);
+}
+
 static inline int genphy_no_soft_reset(struct phy_device *phydev)
 {
 	return 0;
-- 
2.22.0



^ permalink raw reply related

* [PATCH net-next 3/3] net: phy: realtek: add support for the 2.5Gbps PHY in RTL8125
From: Heiner Kallweit @ 2019-08-08 19:05 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <ddbf28b9-f32e-7399-10a6-27b79ca0aaf9@gmail.com>

This adds support for the integrated 2.5Gbps PHY in Realtek RTL8125.
Advertisement of 2.5Gbps mode is done via a vendor-specific register.
Same applies to reading NBase-T link partner advertisement.
Unfortunately this 2.5Gbps PHY shares the PHY ID with the integrated
1Gbps PHY's in other Realtek network chips and so far no method is
known to differentiate them. This means we can't autodetect 2.5Gbps
support and the network driver has to add 2.5Gbps to the supported
and advertised modes.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/phy/realtek.c | 48 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c
index a669945eb..35c981476 100644
--- a/drivers/net/phy/realtek.c
+++ b/drivers/net/phy/realtek.c
@@ -39,6 +39,11 @@
 #define RTL8366RB_POWER_SAVE			0x15
 #define RTL8366RB_POWER_SAVE_ON			BIT(12)
 
+#define RTL_ADV_2500FULL			BIT(7)
+#define RTL_LPADV_10000FULL			BIT(11)
+#define RTL_LPADV_5000FULL			BIT(6)
+#define RTL_LPADV_2500FULL			BIT(5)
+
 MODULE_DESCRIPTION("Realtek PHY driver");
 MODULE_AUTHOR("Johnson Leung");
 MODULE_LICENSE("GPL");
@@ -256,6 +261,47 @@ static int rtl8366rb_config_init(struct phy_device *phydev)
 	return ret;
 }
 
+static int rtlgen_config_aneg(struct phy_device *phydev)
+{
+	int ret = 0;
+
+	if (linkmode_test_bit(ETHTOOL_LINK_MODE_2500baseT_Full_BIT,
+	    phydev->supported) && phydev->autoneg == AUTONEG_ENABLE) {
+		u16 adv2500 = 0;
+
+		if (linkmode_test_bit(ETHTOOL_LINK_MODE_2500baseT_Full_BIT,
+				      phydev->advertising))
+			adv2500 = RTL_ADV_2500FULL;
+
+		ret = phy_modify_paged_changed(phydev, 0xa5d, 0x12,
+					       RTL_ADV_2500FULL, adv2500);
+		if (ret < 0)
+			return ret;
+	}
+
+	return __genphy_config_aneg(phydev, ret);
+}
+
+static int rtlgen_read_status(struct phy_device *phydev)
+{
+	if (linkmode_test_bit(ETHTOOL_LINK_MODE_2500baseT_Full_BIT,
+	    phydev->supported) && phydev->autoneg == AUTONEG_ENABLE) {
+		int lpadv = phy_read_paged(phydev, 0xa5d, 0x13);
+
+		if (lpadv < 0)
+			return lpadv;
+
+		linkmode_mod_bit(ETHTOOL_LINK_MODE_10000baseT_Full_BIT,
+			phydev->lp_advertising, lpadv & RTL_LPADV_10000FULL);
+		linkmode_mod_bit(ETHTOOL_LINK_MODE_5000baseT_Full_BIT,
+			phydev->lp_advertising, lpadv & RTL_LPADV_5000FULL);
+		linkmode_mod_bit(ETHTOOL_LINK_MODE_2500baseT_Full_BIT,
+			phydev->lp_advertising, lpadv & RTL_LPADV_2500FULL);
+	}
+
+	return genphy_read_status(phydev);
+}
+
 static struct phy_driver realtek_drvs[] = {
 	{
 		PHY_ID_MATCH_EXACT(0x00008201),
@@ -328,6 +374,8 @@ static struct phy_driver realtek_drvs[] = {
 	}, {
 		PHY_ID_MATCH_EXACT(0x001cc800),
 		.name		= "Generic Realtek PHY",
+		.config_aneg	= rtlgen_config_aneg,
+		.read_status	= rtlgen_read_status,
 		.suspend	= genphy_suspend,
 		.resume		= genphy_resume,
 		.read_page	= rtl821x_read_page,
-- 
2.22.0



^ permalink raw reply related

* Re: [PATCH net-next] net/ncsi: allow to customize BMC MAC Address offset
From: Tao Ren @ 2019-08-08 19:02 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jakub Kicinski, Samuel Mendoza-Jonas, David S . Miller,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	openbmc@lists.ozlabs.org, William Kennington, Joel Stanley
In-Reply-To: <20190808133209.GB32706@lunn.ch>

Hi Andrew,

On 8/8/19 6:32 AM, Andrew Lunn wrote:
>> Let me prepare patch v2 using device tree. I'm not sure if standard
>> "mac-address" fits this situation because all we need is an offset
>> (integer) and BMC MAC is calculated by adding the offset to NIC's
>> MAC address. Anyways, let me work out v2 patch we can discuss more
>> then.
> 
> Hi Tao
> 
> I don't know BMC terminology. By NICs MAC address, you are referring
> to the hosts MAC address? The MAC address the big CPU is using for its
> interface?  Where does this NIC get its MAC address from? If the BMCs
> bootloader has access to it, it can set the mac-address property in
> the device tree.

Sorry for the confusion and let me clarify more:

The NIC here refers to the Network controller which provide network connectivity for both BMC (via NC-SI) and Host (for example, via PCIe).

On Facebook Yamp BMC, BMC sends NCSI_OEM_GET_MAC command (as an ethernet packet) to the Network Controller while bringing up eth0, and the (Broadcom) Network Controller replies with the Base MAC Address reserved for the platform. As for Yamp, Base-MAC and Base-MAC+1 are used by Host (big CPU) and Base-MAC+2 are assigned to BMC. In my opinion, Base MAC and MAC address assignments are controlled by Network Controller, which is transparent to both BMC and Host.

I'm not sure if I understand your suggestion correctly: do you mean we should move the logic (GET_MAC from Network Controller, adding offset and configuring BMC MAC) from kernel to boot loader?

Sam posted several ncsi patches for u-boot recently. Sam, do we support the work (implemented in this patch) in uboot? Or are we planning to do so?


Thanks,

Tao

^ permalink raw reply

* [PATCH net-next 0/3] net: phy: realtek: add support for integrated 2.5Gbps PHY in RTL8125
From: Heiner Kallweit @ 2019-08-08 19:02 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, David Miller; +Cc: netdev@vger.kernel.org

This series adds support for the integrated 2.5Gbps PHY in RTL8125.
First two patches add necessary functionality to phylib.

Heiner Kallweit (3):
  net: phy: prepare phylib to deal with PHY's extending Clause 22
  net: phy: add phy_modify_paged_changed
  net: phy: realtek: add support for the 2.5Gbps PHY in RTL8125

 drivers/net/phy/phy-core.c   | 29 ++++++++++++++++++----
 drivers/net/phy/phy_device.c | 35 +++++++++++---------------
 drivers/net/phy/realtek.c    | 48 ++++++++++++++++++++++++++++++++++++
 include/linux/phy.h          | 10 +++++++-
 4 files changed, 96 insertions(+), 26 deletions(-)

-- 
2.22.0


^ permalink raw reply

* Re: [PATCH net-next v1 1/8] netfilter: inlined four headers files into another one.
From: Jozsef Kadlecsik @ 2019-08-08 18:49 UTC (permalink / raw)
  To: Jeremy Sowden
  Cc: Pablo Neira Ayuso, Netfilter Devel, Net Dev, Masahiro Yamada
In-Reply-To: <20190807141705.4864-2-jeremy@azazel.net>

Hi Jeremy,

On Wed, 7 Aug 2019, Jeremy Sowden wrote:

> linux/netfilter/ipset/ip_set.h included four other header files:
> 
>   include/linux/netfilter/ipset/ip_set_comment.h
>   include/linux/netfilter/ipset/ip_set_counter.h
>   include/linux/netfilter/ipset/ip_set_skbinfo.h
>   include/linux/netfilter/ipset/ip_set_timeout.h
> 
> Of these the first three were not included anywhere else.  The last,
> ip_set_timeout.h, was included in a couple of other places, but defined
> inline functions which call other inline functions defined in ip_set.h,
> so ip_set.h had to be included before it.
> 
> Inlined all four into ip_set.h, and updated the other files that
> included ip_set_timeout.h.
> 
> Signed-off-by: Jeremy Sowden <jeremy@azazel.net>

Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>

Also, for the ipset part of patch 2/8, thank!

Best regards,
Jozsef

> ---
>  include/linux/netfilter/ipset/ip_set.h        | 238 +++++++++++++++++-
>  .../linux/netfilter/ipset/ip_set_comment.h    |  73 ------
>  .../linux/netfilter/ipset/ip_set_counter.h    |  84 -------
>  .../linux/netfilter/ipset/ip_set_skbinfo.h    |  42 ----
>  .../linux/netfilter/ipset/ip_set_timeout.h    |  77 ------
>  net/netfilter/ipset/ip_set_hash_gen.h         |   2 +-
>  net/netfilter/xt_set.c                        |   1 -
>  7 files changed, 235 insertions(+), 282 deletions(-)
>  delete mode 100644 include/linux/netfilter/ipset/ip_set_comment.h
>  delete mode 100644 include/linux/netfilter/ipset/ip_set_counter.h
>  delete mode 100644 include/linux/netfilter/ipset/ip_set_skbinfo.h
>  delete mode 100644 include/linux/netfilter/ipset/ip_set_timeout.h
> 
> diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
> index 12ad9b1853b4..9bc255a8461b 100644
> --- a/include/linux/netfilter/ipset/ip_set.h
> +++ b/include/linux/netfilter/ipset/ip_set.h
> @@ -452,10 +452,240 @@ bitmap_bytes(u32 a, u32 b)
>  	return 4 * ((((b - a + 8) / 8) + 3) / 4);
>  }
>  
> -#include <linux/netfilter/ipset/ip_set_timeout.h>
> -#include <linux/netfilter/ipset/ip_set_comment.h>
> -#include <linux/netfilter/ipset/ip_set_counter.h>
> -#include <linux/netfilter/ipset/ip_set_skbinfo.h>
> +/* How often should the gc be run by default */
> +#define IPSET_GC_TIME			(3 * 60)
> +
> +/* Timeout period depending on the timeout value of the given set */
> +#define IPSET_GC_PERIOD(timeout) \
> +	((timeout/3) ? min_t(u32, (timeout)/3, IPSET_GC_TIME) : 1)
> +
> +/* Entry is set with no timeout value */
> +#define IPSET_ELEM_PERMANENT	0
> +
> +/* Set is defined with timeout support: timeout value may be 0 */
> +#define IPSET_NO_TIMEOUT	UINT_MAX
> +
> +/* Max timeout value, see msecs_to_jiffies() in jiffies.h */
> +#define IPSET_MAX_TIMEOUT	(UINT_MAX >> 1)/MSEC_PER_SEC
> +
> +#define ip_set_adt_opt_timeout(opt, set)	\
> +((opt)->ext.timeout != IPSET_NO_TIMEOUT ? (opt)->ext.timeout : (set)->timeout)
> +
> +static inline unsigned int
> +ip_set_timeout_uget(struct nlattr *tb)
> +{
> +	unsigned int timeout = ip_set_get_h32(tb);
> +
> +	/* Normalize to fit into jiffies */
> +	if (timeout > IPSET_MAX_TIMEOUT)
> +		timeout = IPSET_MAX_TIMEOUT;
> +
> +	return timeout;
> +}
> +
> +static inline bool
> +ip_set_timeout_expired(const unsigned long *t)
> +{
> +	return *t != IPSET_ELEM_PERMANENT && time_is_before_jiffies(*t);
> +}
> +
> +static inline void
> +ip_set_timeout_set(unsigned long *timeout, u32 value)
> +{
> +	unsigned long t;
> +
> +	if (!value) {
> +		*timeout = IPSET_ELEM_PERMANENT;
> +		return;
> +	}
> +
> +	t = msecs_to_jiffies(value * MSEC_PER_SEC) + jiffies;
> +	if (t == IPSET_ELEM_PERMANENT)
> +		/* Bingo! :-) */
> +		t--;
> +	*timeout = t;
> +}
> +
> +static inline u32
> +ip_set_timeout_get(const unsigned long *timeout)
> +{
> +	u32 t;
> +
> +	if (*timeout == IPSET_ELEM_PERMANENT)
> +		return 0;
> +
> +	t = jiffies_to_msecs(*timeout - jiffies)/MSEC_PER_SEC;
> +	/* Zero value in userspace means no timeout */
> +	return t == 0 ? 1 : t;
> +}
> +
> +static inline char*
> +ip_set_comment_uget(struct nlattr *tb)
> +{
> +	return nla_data(tb);
> +}
> +
> +/* Called from uadd only, protected by the set spinlock.
> + * The kadt functions don't use the comment extensions in any way.
> + */
> +static inline void
> +ip_set_init_comment(struct ip_set *set, struct ip_set_comment *comment,
> +		    const struct ip_set_ext *ext)
> +{
> +	struct ip_set_comment_rcu *c = rcu_dereference_protected(comment->c, 1);
> +	size_t len = ext->comment ? strlen(ext->comment) : 0;
> +
> +	if (unlikely(c)) {
> +		set->ext_size -= sizeof(*c) + strlen(c->str) + 1;
> +		kfree_rcu(c, rcu);
> +		rcu_assign_pointer(comment->c, NULL);
> +	}
> +	if (!len)
> +		return;
> +	if (unlikely(len > IPSET_MAX_COMMENT_SIZE))
> +		len = IPSET_MAX_COMMENT_SIZE;
> +	c = kmalloc(sizeof(*c) + len + 1, GFP_ATOMIC);
> +	if (unlikely(!c))
> +		return;
> +	strlcpy(c->str, ext->comment, len + 1);
> +	set->ext_size += sizeof(*c) + strlen(c->str) + 1;
> +	rcu_assign_pointer(comment->c, c);
> +}
> +
> +/* Used only when dumping a set, protected by rcu_read_lock() */
> +static inline int
> +ip_set_put_comment(struct sk_buff *skb, const struct ip_set_comment *comment)
> +{
> +	struct ip_set_comment_rcu *c = rcu_dereference(comment->c);
> +
> +	if (!c)
> +		return 0;
> +	return nla_put_string(skb, IPSET_ATTR_COMMENT, c->str);
> +}
> +
> +/* Called from uadd/udel, flush or the garbage collectors protected
> + * by the set spinlock.
> + * Called when the set is destroyed and when there can't be any user
> + * of the set data anymore.
> + */
> +static inline void
> +ip_set_comment_free(struct ip_set *set, struct ip_set_comment *comment)
> +{
> +	struct ip_set_comment_rcu *c;
> +
> +	c = rcu_dereference_protected(comment->c, 1);
> +	if (unlikely(!c))
> +		return;
> +	set->ext_size -= sizeof(*c) + strlen(c->str) + 1;
> +	kfree_rcu(c, rcu);
> +	rcu_assign_pointer(comment->c, NULL);
> +}
> +
> +static inline void
> +ip_set_add_bytes(u64 bytes, struct ip_set_counter *counter)
> +{
> +	atomic64_add((long long)bytes, &(counter)->bytes);
> +}
> +
> +static inline void
> +ip_set_add_packets(u64 packets, struct ip_set_counter *counter)
> +{
> +	atomic64_add((long long)packets, &(counter)->packets);
> +}
> +
> +static inline u64
> +ip_set_get_bytes(const struct ip_set_counter *counter)
> +{
> +	return (u64)atomic64_read(&(counter)->bytes);
> +}
> +
> +static inline u64
> +ip_set_get_packets(const struct ip_set_counter *counter)
> +{
> +	return (u64)atomic64_read(&(counter)->packets);
> +}
> +
> +static inline bool
> +ip_set_match_counter(u64 counter, u64 match, u8 op)
> +{
> +	switch (op) {
> +	case IPSET_COUNTER_NONE:
> +		return true;
> +	case IPSET_COUNTER_EQ:
> +		return counter == match;
> +	case IPSET_COUNTER_NE:
> +		return counter != match;
> +	case IPSET_COUNTER_LT:
> +		return counter < match;
> +	case IPSET_COUNTER_GT:
> +		return counter > match;
> +	}
> +	return false;
> +}
> +
> +static inline void
> +ip_set_update_counter(struct ip_set_counter *counter,
> +		      const struct ip_set_ext *ext, u32 flags)
> +{
> +	if (ext->packets != ULLONG_MAX &&
> +	    !(flags & IPSET_FLAG_SKIP_COUNTER_UPDATE)) {
> +		ip_set_add_bytes(ext->bytes, counter);
> +		ip_set_add_packets(ext->packets, counter);
> +	}
> +}
> +
> +static inline bool
> +ip_set_put_counter(struct sk_buff *skb, const struct ip_set_counter *counter)
> +{
> +	return nla_put_net64(skb, IPSET_ATTR_BYTES,
> +			     cpu_to_be64(ip_set_get_bytes(counter)),
> +			     IPSET_ATTR_PAD) ||
> +	       nla_put_net64(skb, IPSET_ATTR_PACKETS,
> +			     cpu_to_be64(ip_set_get_packets(counter)),
> +			     IPSET_ATTR_PAD);
> +}
> +
> +static inline void
> +ip_set_init_counter(struct ip_set_counter *counter,
> +		    const struct ip_set_ext *ext)
> +{
> +	if (ext->bytes != ULLONG_MAX)
> +		atomic64_set(&(counter)->bytes, (long long)(ext->bytes));
> +	if (ext->packets != ULLONG_MAX)
> +		atomic64_set(&(counter)->packets, (long long)(ext->packets));
> +}
> +
> +static inline void
> +ip_set_get_skbinfo(struct ip_set_skbinfo *skbinfo,
> +		   const struct ip_set_ext *ext,
> +		   struct ip_set_ext *mext, u32 flags)
> +{
> +	mext->skbinfo = *skbinfo;
> +}
> +
> +static inline bool
> +ip_set_put_skbinfo(struct sk_buff *skb, const struct ip_set_skbinfo *skbinfo)
> +{
> +	/* Send nonzero parameters only */
> +	return ((skbinfo->skbmark || skbinfo->skbmarkmask) &&
> +		nla_put_net64(skb, IPSET_ATTR_SKBMARK,
> +			      cpu_to_be64((u64)skbinfo->skbmark << 32 |
> +					  skbinfo->skbmarkmask),
> +			      IPSET_ATTR_PAD)) ||
> +	       (skbinfo->skbprio &&
> +		nla_put_net32(skb, IPSET_ATTR_SKBPRIO,
> +			      cpu_to_be32(skbinfo->skbprio))) ||
> +	       (skbinfo->skbqueue &&
> +		nla_put_net16(skb, IPSET_ATTR_SKBQUEUE,
> +			      cpu_to_be16(skbinfo->skbqueue)));
> +}
> +
> +static inline void
> +ip_set_init_skbinfo(struct ip_set_skbinfo *skbinfo,
> +		    const struct ip_set_ext *ext)
> +{
> +	*skbinfo = ext->skbinfo;
> +}
>  
>  #define IP_SET_INIT_KEXT(skb, opt, set)			\
>  	{ .bytes = (skb)->len, .packets = 1,		\
> diff --git a/include/linux/netfilter/ipset/ip_set_comment.h b/include/linux/netfilter/ipset/ip_set_comment.h
> deleted file mode 100644
> index 0b894d81bbf2..000000000000
> --- a/include/linux/netfilter/ipset/ip_set_comment.h
> +++ /dev/null
> @@ -1,73 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0-only */
> -#ifndef _IP_SET_COMMENT_H
> -#define _IP_SET_COMMENT_H
> -
> -/* Copyright (C) 2013 Oliver Smith <oliver@8.c.9.b.0.7.4.0.1.0.0.2.ip6.arpa>
> - */
> -
> -#ifdef __KERNEL__
> -
> -static inline char*
> -ip_set_comment_uget(struct nlattr *tb)
> -{
> -	return nla_data(tb);
> -}
> -
> -/* Called from uadd only, protected by the set spinlock.
> - * The kadt functions don't use the comment extensions in any way.
> - */
> -static inline void
> -ip_set_init_comment(struct ip_set *set, struct ip_set_comment *comment,
> -		    const struct ip_set_ext *ext)
> -{
> -	struct ip_set_comment_rcu *c = rcu_dereference_protected(comment->c, 1);
> -	size_t len = ext->comment ? strlen(ext->comment) : 0;
> -
> -	if (unlikely(c)) {
> -		set->ext_size -= sizeof(*c) + strlen(c->str) + 1;
> -		kfree_rcu(c, rcu);
> -		rcu_assign_pointer(comment->c, NULL);
> -	}
> -	if (!len)
> -		return;
> -	if (unlikely(len > IPSET_MAX_COMMENT_SIZE))
> -		len = IPSET_MAX_COMMENT_SIZE;
> -	c = kmalloc(sizeof(*c) + len + 1, GFP_ATOMIC);
> -	if (unlikely(!c))
> -		return;
> -	strlcpy(c->str, ext->comment, len + 1);
> -	set->ext_size += sizeof(*c) + strlen(c->str) + 1;
> -	rcu_assign_pointer(comment->c, c);
> -}
> -
> -/* Used only when dumping a set, protected by rcu_read_lock() */
> -static inline int
> -ip_set_put_comment(struct sk_buff *skb, const struct ip_set_comment *comment)
> -{
> -	struct ip_set_comment_rcu *c = rcu_dereference(comment->c);
> -
> -	if (!c)
> -		return 0;
> -	return nla_put_string(skb, IPSET_ATTR_COMMENT, c->str);
> -}
> -
> -/* Called from uadd/udel, flush or the garbage collectors protected
> - * by the set spinlock.
> - * Called when the set is destroyed and when there can't be any user
> - * of the set data anymore.
> - */
> -static inline void
> -ip_set_comment_free(struct ip_set *set, struct ip_set_comment *comment)
> -{
> -	struct ip_set_comment_rcu *c;
> -
> -	c = rcu_dereference_protected(comment->c, 1);
> -	if (unlikely(!c))
> -		return;
> -	set->ext_size -= sizeof(*c) + strlen(c->str) + 1;
> -	kfree_rcu(c, rcu);
> -	rcu_assign_pointer(comment->c, NULL);
> -}
> -
> -#endif
> -#endif
> diff --git a/include/linux/netfilter/ipset/ip_set_counter.h b/include/linux/netfilter/ipset/ip_set_counter.h
> deleted file mode 100644
> index 3400958c07be..000000000000
> --- a/include/linux/netfilter/ipset/ip_set_counter.h
> +++ /dev/null
> @@ -1,84 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0-only */
> -#ifndef _IP_SET_COUNTER_H
> -#define _IP_SET_COUNTER_H
> -
> -/* Copyright (C) 2015 Jozsef Kadlecsik <kadlec@netfilter.org> */
> -
> -#ifdef __KERNEL__
> -
> -static inline void
> -ip_set_add_bytes(u64 bytes, struct ip_set_counter *counter)
> -{
> -	atomic64_add((long long)bytes, &(counter)->bytes);
> -}
> -
> -static inline void
> -ip_set_add_packets(u64 packets, struct ip_set_counter *counter)
> -{
> -	atomic64_add((long long)packets, &(counter)->packets);
> -}
> -
> -static inline u64
> -ip_set_get_bytes(const struct ip_set_counter *counter)
> -{
> -	return (u64)atomic64_read(&(counter)->bytes);
> -}
> -
> -static inline u64
> -ip_set_get_packets(const struct ip_set_counter *counter)
> -{
> -	return (u64)atomic64_read(&(counter)->packets);
> -}
> -
> -static inline bool
> -ip_set_match_counter(u64 counter, u64 match, u8 op)
> -{
> -	switch (op) {
> -	case IPSET_COUNTER_NONE:
> -		return true;
> -	case IPSET_COUNTER_EQ:
> -		return counter == match;
> -	case IPSET_COUNTER_NE:
> -		return counter != match;
> -	case IPSET_COUNTER_LT:
> -		return counter < match;
> -	case IPSET_COUNTER_GT:
> -		return counter > match;
> -	}
> -	return false;
> -}
> -
> -static inline void
> -ip_set_update_counter(struct ip_set_counter *counter,
> -		      const struct ip_set_ext *ext, u32 flags)
> -{
> -	if (ext->packets != ULLONG_MAX &&
> -	    !(flags & IPSET_FLAG_SKIP_COUNTER_UPDATE)) {
> -		ip_set_add_bytes(ext->bytes, counter);
> -		ip_set_add_packets(ext->packets, counter);
> -	}
> -}
> -
> -static inline bool
> -ip_set_put_counter(struct sk_buff *skb, const struct ip_set_counter *counter)
> -{
> -	return nla_put_net64(skb, IPSET_ATTR_BYTES,
> -			     cpu_to_be64(ip_set_get_bytes(counter)),
> -			     IPSET_ATTR_PAD) ||
> -	       nla_put_net64(skb, IPSET_ATTR_PACKETS,
> -			     cpu_to_be64(ip_set_get_packets(counter)),
> -			     IPSET_ATTR_PAD);
> -}
> -
> -static inline void
> -ip_set_init_counter(struct ip_set_counter *counter,
> -		    const struct ip_set_ext *ext)
> -{
> -	if (ext->bytes != ULLONG_MAX)
> -		atomic64_set(&(counter)->bytes, (long long)(ext->bytes));
> -	if (ext->packets != ULLONG_MAX)
> -		atomic64_set(&(counter)->packets, (long long)(ext->packets));
> -}
> -
> -#endif /* __KERNEL__ */
> -#endif /* _IP_SET_COUNTER_H */
> diff --git a/include/linux/netfilter/ipset/ip_set_skbinfo.h b/include/linux/netfilter/ipset/ip_set_skbinfo.h
> deleted file mode 100644
> index 3a2df02dbd55..000000000000
> --- a/include/linux/netfilter/ipset/ip_set_skbinfo.h
> +++ /dev/null
> @@ -1,42 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0-only */
> -#ifndef _IP_SET_SKBINFO_H
> -#define _IP_SET_SKBINFO_H
> -
> -/* Copyright (C) 2015 Jozsef Kadlecsik <kadlec@netfilter.org> */
> -
> -#ifdef __KERNEL__
> -
> -static inline void
> -ip_set_get_skbinfo(struct ip_set_skbinfo *skbinfo,
> -		   const struct ip_set_ext *ext,
> -		   struct ip_set_ext *mext, u32 flags)
> -{
> -	mext->skbinfo = *skbinfo;
> -}
> -
> -static inline bool
> -ip_set_put_skbinfo(struct sk_buff *skb, const struct ip_set_skbinfo *skbinfo)
> -{
> -	/* Send nonzero parameters only */
> -	return ((skbinfo->skbmark || skbinfo->skbmarkmask) &&
> -		nla_put_net64(skb, IPSET_ATTR_SKBMARK,
> -			      cpu_to_be64((u64)skbinfo->skbmark << 32 |
> -					  skbinfo->skbmarkmask),
> -			      IPSET_ATTR_PAD)) ||
> -	       (skbinfo->skbprio &&
> -		nla_put_net32(skb, IPSET_ATTR_SKBPRIO,
> -			      cpu_to_be32(skbinfo->skbprio))) ||
> -	       (skbinfo->skbqueue &&
> -		nla_put_net16(skb, IPSET_ATTR_SKBQUEUE,
> -			      cpu_to_be16(skbinfo->skbqueue)));
> -}
> -
> -static inline void
> -ip_set_init_skbinfo(struct ip_set_skbinfo *skbinfo,
> -		    const struct ip_set_ext *ext)
> -{
> -	*skbinfo = ext->skbinfo;
> -}
> -
> -#endif /* __KERNEL__ */
> -#endif /* _IP_SET_SKBINFO_H */
> diff --git a/include/linux/netfilter/ipset/ip_set_timeout.h b/include/linux/netfilter/ipset/ip_set_timeout.h
> deleted file mode 100644
> index 2be60e379ecf..000000000000
> --- a/include/linux/netfilter/ipset/ip_set_timeout.h
> +++ /dev/null
> @@ -1,77 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0-only */
> -#ifndef _IP_SET_TIMEOUT_H
> -#define _IP_SET_TIMEOUT_H
> -
> -/* Copyright (C) 2003-2013 Jozsef Kadlecsik <kadlec@netfilter.org> */
> -
> -#ifdef __KERNEL__
> -
> -/* How often should the gc be run by default */
> -#define IPSET_GC_TIME			(3 * 60)
> -
> -/* Timeout period depending on the timeout value of the given set */
> -#define IPSET_GC_PERIOD(timeout) \
> -	((timeout/3) ? min_t(u32, (timeout)/3, IPSET_GC_TIME) : 1)
> -
> -/* Entry is set with no timeout value */
> -#define IPSET_ELEM_PERMANENT	0
> -
> -/* Set is defined with timeout support: timeout value may be 0 */
> -#define IPSET_NO_TIMEOUT	UINT_MAX
> -
> -/* Max timeout value, see msecs_to_jiffies() in jiffies.h */
> -#define IPSET_MAX_TIMEOUT	(UINT_MAX >> 1)/MSEC_PER_SEC
> -
> -#define ip_set_adt_opt_timeout(opt, set)	\
> -((opt)->ext.timeout != IPSET_NO_TIMEOUT ? (opt)->ext.timeout : (set)->timeout)
> -
> -static inline unsigned int
> -ip_set_timeout_uget(struct nlattr *tb)
> -{
> -	unsigned int timeout = ip_set_get_h32(tb);
> -
> -	/* Normalize to fit into jiffies */
> -	if (timeout > IPSET_MAX_TIMEOUT)
> -		timeout = IPSET_MAX_TIMEOUT;
> -
> -	return timeout;
> -}
> -
> -static inline bool
> -ip_set_timeout_expired(const unsigned long *t)
> -{
> -	return *t != IPSET_ELEM_PERMANENT && time_is_before_jiffies(*t);
> -}
> -
> -static inline void
> -ip_set_timeout_set(unsigned long *timeout, u32 value)
> -{
> -	unsigned long t;
> -
> -	if (!value) {
> -		*timeout = IPSET_ELEM_PERMANENT;
> -		return;
> -	}
> -
> -	t = msecs_to_jiffies(value * MSEC_PER_SEC) + jiffies;
> -	if (t == IPSET_ELEM_PERMANENT)
> -		/* Bingo! :-) */
> -		t--;
> -	*timeout = t;
> -}
> -
> -static inline u32
> -ip_set_timeout_get(const unsigned long *timeout)
> -{
> -	u32 t;
> -
> -	if (*timeout == IPSET_ELEM_PERMANENT)
> -		return 0;
> -
> -	t = jiffies_to_msecs(*timeout - jiffies)/MSEC_PER_SEC;
> -	/* Zero value in userspace means no timeout */
> -	return t == 0 ? 1 : t;
> -}
> -
> -#endif	/* __KERNEL__ */
> -#endif /* _IP_SET_TIMEOUT_H */
> diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h
> index 0feb77fa9edc..2e541cb3b37d 100644
> --- a/net/netfilter/ipset/ip_set_hash_gen.h
> +++ b/net/netfilter/ipset/ip_set_hash_gen.h
> @@ -7,7 +7,7 @@
>  #include <linux/rcupdate.h>
>  #include <linux/jhash.h>
>  #include <linux/types.h>
> -#include <linux/netfilter/ipset/ip_set_timeout.h>
> +#include <linux/netfilter/ipset/ip_set.h>
>  
>  #define __ipset_dereference_protected(p, c)	rcu_dereference_protected(p, c)
>  #define ipset_dereference_protected(p, set) \
> diff --git a/net/netfilter/xt_set.c b/net/netfilter/xt_set.c
> index ecbfa291fb70..731bc2cafae4 100644
> --- a/net/netfilter/xt_set.c
> +++ b/net/netfilter/xt_set.c
> @@ -14,7 +14,6 @@
>  
>  #include <linux/netfilter/x_tables.h>
>  #include <linux/netfilter/ipset/ip_set.h>
> -#include <linux/netfilter/ipset/ip_set_timeout.h>
>  #include <uapi/linux/netfilter/xt_set.h>
>  
>  MODULE_LICENSE("GPL");
> -- 
> 2.20.1
> 
> 

-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply

* Re: [PATCH bpf 2/2] tools: bpftool: add error message on pin failure
From: Andrii Nakryiko @ 2019-08-08 18:54 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, Daniel Borkmann, Networking, bpf, oss-drivers,
	Andy Lutomirski, Quentin Monnet
In-Reply-To: <20190807001923.19483-3-jakub.kicinski@netronome.com>

On Tue, Aug 6, 2019 at 5:21 PM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> No error message is currently printed if the pin syscall
> itself fails. It got lost in the loadall refactoring.
>
> Fixes: 77380998d91d ("bpftool: add loadall command")
> Reported-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
> ---

Acked-by: Andrii Nakryiko <andriin@fb.com>

> CC: luto@kernel.org, sdf@google.com
>
>  tools/bpf/bpftool/common.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
> index c52a6ffb8949..6a71324be628 100644
> --- a/tools/bpf/bpftool/common.c
> +++ b/tools/bpf/bpftool/common.c
> @@ -204,7 +204,11 @@ int do_pin_fd(int fd, const char *name)
>         if (err)
>                 return err;
>
> -       return bpf_obj_pin(fd, name);
> +       err = bpf_obj_pin(fd, name);
> +       if (err)
> +               p_err("can't pin the object (%s): %s", name, strerror(errno));
> +
> +       return err;
>  }
>
>  int do_pin_any(int argc, char **argv, int (*get_fd_by_id)(__u32))
> --
> 2.21.0
>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox