Netdev List
* Re: [PATCH net-next 3/3] net: phy: realtek: add support for the 2.5Gbps PHY in RTL8125
From: Andrew Lunn @ 2019-08-08 20:20 UTC (permalink / raw)
  To: Heiner Kallweit; +Cc: Florian Fainelli, David Miller, netdev@vger.kernel.org
In-Reply-To: <f34d1117-510f-861f-59f0-51e0e87ead1e@gmail.com>

> I have a contact in Realtek who provided the information about
> the vendor-specific registers used in the patch. I also asked for
> a method to auto-detect 2.5Gbps support but have no feedback so far.
> What may contribute to the problem is that the integrated 1Gbps
> PHYs (all with the same PHY ID) also differ significantly from each other,
> depending on the network chip version.

Hi Heiner

Some of the PHYs embedded in Marvell switches have an OUI, but no
product ID. We work around this brokenness by trapping the reads to
the ID registers in the MDIO bus controller driver and inserting the
switch product ID. The Marvell PHY driver then recognises these IDs
and does the right thing.

Maybe you can do something similar here?

      Andrew
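
To make the idea above concrete, here is a minimal user-space sketch of
trapping reads to the PHY ID registers and patching in a product ID. It is
not the Marvell driver code: struct fixup_bus, fixup_mdio_read() and
demo_raw_read() are hypothetical names, and the register values returned
by the fake bus are demo data only. The only facts assumed are that the
PHY identifier lives in MII registers 2 and 3, and that the low 10 bits of
register 3 carry model and revision.

```c
#include <stdint.h>

#define MII_PHYSID1 0x02 /* PHY identifier, upper 16 bits */
#define MII_PHYSID2 0x03 /* PHY identifier, lower 16 bits */

/* Hypothetical raw bus accessor; a real driver would talk to hardware. */
typedef int (*raw_mdio_read_fn)(int addr, int reg);

/* Hypothetical bus context: raw accessor plus the ID bits to patch in. */
struct fixup_bus {
	raw_mdio_read_fn raw_read;
	uint16_t model_rev; /* bits 9:0 to insert when the PHY reports none */
};

/* Fake bus for the demo: an OUI is present, model and revision are zero. */
static int demo_raw_read(int addr, int reg)
{
	(void)addr;
	if (reg == MII_PHYSID1)
		return 0x0141;
	if (reg == MII_PHYSID2)
		return 0x0c00;
	return 0;
}

/* Read a register, substituting a product ID if the PHY has none. */
static int fixup_mdio_read(const struct fixup_bus *bus, int addr, int reg)
{
	int val = bus->raw_read(addr, reg);

	if (val < 0 || reg != MII_PHYSID2)
		return val;

	/* All-zero model/revision bits are treated as "no product ID". */
	if ((val & 0x03ff) == 0)
		val |= bus->model_rev & 0x03ff;

	return val;
}
```

A PHY driver matching on the patched ID would then bind as usual, without
needing to know that the ID was synthesized by the bus driver.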


* Re: [PATCH v2 bpf-next] btf: expose BTF info through sysfs
From: Yonghong Song @ 2019-08-08 20:21 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf@vger.kernel.org, netdev@vger.kernel.org,
	Alexei Starovoitov, daniel@iogearbox.net, Kernel Team,
	Masahiro Yamada, Arnaldo Carvalho de Melo, Jiri Olsa,
	Sam Ravnborg
In-Reply-To: <CAEf4BzYAZ7x+PY0t90ty9RVSm1FSmc9XqY216DtJCA-giK3fUg@mail.gmail.com>



On 8/8/19 10:47 AM, Andrii Nakryiko wrote:
> On Wed, Aug 7, 2019 at 9:24 PM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 8/7/19 5:32 PM, Andrii Nakryiko wrote:
>>> Make .BTF section allocated and expose its contents through sysfs.
>>>
>>> /sys/kernel/btf directory is created to contain all the BTFs present
>>> inside kernel. Currently there is only kernel's main BTF, represented as
>>> /sys/kernel/btf/kernel file. Once kernel modules' BTFs are supported,
>>> each module will expose its BTF as /sys/kernel/btf/<module-name> file.
>>>
>>> Current approach relies on a few pieces coming together:
>>> 1. pahole is used to take almost final vmlinux image (modulo .BTF and
>>>      kallsyms) and generate .BTF section by converting DWARF info into
>>>      BTF. This section is not allocated and not mapped to any segment,
>>>      though, so is not yet accessible from inside kernel at runtime.
>>> 2. objcopy dumps .BTF contents into binary file and subsequently
>>>      converts binary file into linkable object file with automatically
>>>      generated symbols _binary__btf_kernel_bin_start and
>>>      _binary__btf_kernel_bin_end, pointing to start and end, respectively,
>>>      of BTF raw data.
>>> 3. final vmlinux image is generated by linking this object file (and
>>>      kallsyms, if necessary). sysfs_btf.c then creates
>>>      /sys/kernel/btf/kernel file and exposes embedded BTF contents through
>>>      it. This allows, e.g., libbpf and bpftool to access BTF info at
>>>      well-known location, without resorting to searching for vmlinux image
>>>      on disk (location of which is not standardized and vmlinux image
>>>      might not be even available in some scenarios, e.g., inside qemu
>>>      during testing).
>>>
>>> Alternative approach using .incbin assembler directive to embed BTF
>>> contents directly was attempted but didn't work, because sysfs_btf.o is
>>> not re-compiled during link-vmlinux.sh stage. This is required, though,
>>> to update embedded BTF data (initially empty data is embedded, then
>>> pahole generates BTF info and we need to regenerate sysfs_btf.o with
>>> updated contents, but it's too late at that point).
>>>
>>> If BTF couldn't be generated due to missing or too old pahole,
>>> sysfs_btf.c handles that gracefully by detecting that
>>> _binary__btf_kernel_bin_start (weak symbol) is 0 and not creating
>>> /sys/kernel/btf at all.
>>>
>>> v1->v2:
>>> - allow kallsyms stage to re-use vmlinux generated by gen_btf();
>>>
>>> Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
>>> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
>>> Cc: Jiri Olsa <jolsa@kernel.org>
>>> Cc: Sam Ravnborg <sam@ravnborg.org>
>>> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>>> ---
> 
> [...]
> 
>>> +
>>> +     # dump .BTF section into raw binary file to link with final vmlinux
>>> +     bin_arch=$(${OBJDUMP} -f ${1} | grep architecture | \
>>> +             cut -d, -f1 | cut -d' ' -f2)
>>> +     ${OBJCOPY} --dump-section .BTF=.btf.kernel.bin ${1} 2>/dev/null
>>> +     ${OBJCOPY} -I binary -O ${CONFIG_OUTPUT_FORMAT} -B ${bin_arch} \
>>> +             --rename-section .data=.BTF .btf.kernel.bin ${2}
>>
>> Currently, the binary size on my config is about 2.6MB. Do you think
>> we could or need to compress it to make it smaller? I tried gzip
>> and the compressed size is 0.9MB.
> 
> I'd really prefer to keep it uncompressed for two main reasons:
> - by having this in uncompressed form, kernel itself can use this BTF
> data from inside with almost no additional memory (except maybe for
> index from type ID to actual location of type info), which opens up a
> lot of new and interesting opportunities, like kernel returning its
> own BTF and BTF type ID for various types (think about driver metadata,
> all those special maps, etc).
> - if we are doing compression, now we need to decide on best
> compression format, teach it to libbpf (which will make libbpf also
> bigger and dependent on extra libraries), etc.
> 
> So basically, in exchange for 1-1.5MB extra memory we get a bunch of
> new problems we normally don't have to deal with.

Yes, I am aware of this tradeoff. I just wanted to make sure this has
been discussed. I am totally fine with leaving it uncompressed.
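
With the data left uncompressed, a consumer can sanity-check the raw
bytes with a plain header check before parsing further. The sketch below
is an illustration only, not libbpf code: struct btf_hdr and
btf_blob_looks_sane() are hypothetical names, the field layout mirrors
struct btf_header from the BTF UAPI header, and the check assumes the blob
is in native byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BTF_MAGIC 0xeB9F

/* Local copy of the BTF header layout (mirrors the UAPI definition). */
struct btf_hdr {
	uint16_t magic;
	uint8_t  version;
	uint8_t  flags;
	uint32_t hdr_len;
	uint32_t type_off; /* offset of type section, relative to header end */
	uint32_t type_len;
	uint32_t str_off;  /* offset of string section, relative to header end */
	uint32_t str_len;
};

/* Return 1 if the blob starts with a plausible BTF header, 0 otherwise. */
static int btf_blob_looks_sane(const void *buf, size_t len)
{
	struct btf_hdr h;
	size_t body;

	if (len < sizeof(h))
		return 0;
	memcpy(&h, buf, sizeof(h)); /* avoid unaligned access */

	if (h.magic != BTF_MAGIC)
		return 0;
	if (h.hdr_len < sizeof(h) || h.hdr_len > len)
		return 0;

	/* Both sections must fit in the bytes that follow the header. */
	body = len - h.hdr_len;
	if (h.type_off > body || h.type_len > body - h.type_off)
		return 0;
	if (h.str_off > body || h.str_len > body - h.str_off)
		return 0;

	return 1;
}
```

A tool would typically read /sys/kernel/btf/kernel into a buffer and call
the check on it before handing the data to a real BTF parser.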

> 
>>
>>>    }
>>>
>>>    # Create ${2} .o file with all symbols from the ${1} object file
>>> @@ -153,6 +164,7 @@ sortextable()
>>>    # Delete output files in case of error
>>>    cleanup()
>>>    {
> 
> [...]
> 


* [pull request][net 00/12] Mellanox, mlx5 fixes 2019-08-08
From: Saeed Mahameed @ 2019-08-08 20:21 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev@vger.kernel.org, Saeed Mahameed

Hi Dave,

This series introduces some fixes to the mlx5 driver.

Highlights:
1) From Tariq, critical mlx5 kTLS fixes to better align with HW specs.
2) From Aya, fixes to the mlx5 tx devlink health reporter.
3) From Maxim, aRFS parsing to use the flow dissector to avoid relying on
invalid skb fields.

Please pull and let me know if there is any problem.

For -stable v4.3
 ('net/mlx5e: Only support tx/rx pause setting for port owner')
For -stable v4.9
 ('net/mlx5e: Use flow keys dissector to parse packets for ARFS')
For -stable v5.1
 ('net/mlx5e: Fix false negative indication on tx reporter CQE recovery')
 ('net/mlx5e: Remove redundant check in CQE recovery flow of tx reporter')
 ('net/mlx5e: ethtool, Avoid setting speed to 56GBASE when autoneg off')

Note: when merged with net-next this minor conflict will pop up:
++<<<<<<< (net-next)
 +      if (is_eswitch_flow) {
 +              flow->esw_attr->match_level = match_level;
 +              flow->esw_attr->tunnel_match_level = tunnel_match_level;
++=======
+       if (flow->flags & MLX5E_TC_FLOW_ESWITCH) {
+               flow->esw_attr->inner_match_level = inner_match_level;
+               flow->esw_attr->outer_match_level = outer_match_level;
++>>>>>>> (net)

To resolve, use hunks from net (2nd) and replace:
if (flow->flags & MLX5E_TC_FLOW_ESWITCH) 
with
if (is_eswitch_flow)

Thanks,
Saeed.

---
The following changes since commit f6649feb264ed10ce425455df48242c0e704cba2:

  Merge tag 'batadv-net-for-davem-20190808' of git://git.open-mesh.org/linux-merge (2019-08-08 11:25:39 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-fixes-2019-08-08

for you to fetch changes up to a4e508cab623951dc4754f346e5673714f3bbade:

  net/mlx5e: Remove redundant check in CQE recovery flow of tx reporter (2019-08-08 13:01:20 -0700)

----------------------------------------------------------------
mlx5-fixes-2019-08-08

----------------------------------------------------------------
Aya Levin (3):
      net/mlx5e: Fix false negative indication on tx reporter CQE recovery
      net/mlx5e: Fix error flow of CQE recovery on tx reporter
      net/mlx5e: Remove redundant check in CQE recovery flow of tx reporter

Huy Nguyen (2):
      net/mlx5: Support inner header match criteria for non decap flow action
      net/mlx5e: Only support tx/rx pause setting for port owner

Maxim Mikityanskiy (1):
      net/mlx5e: Use flow keys dissector to parse packets for ARFS

Mohamad Heib (1):
      net/mlx5e: ethtool, Avoid setting speed to 56GBASE when autoneg off

Tariq Toukan (5):
      net/mlx5: crypto, Fix wrong offset in encryption key command
      net/mlx5: kTLS, Fix wrong TIS opmod constants
      net/mlx5e: kTLS, Fix progress params context WQE layout
      net/mlx5e: kTLS, Fix tisn field name
      net/mlx5e: kTLS, Fix tisn field placement

 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  9 +-
 .../ethernet/mellanox/mlx5/core/en/reporter_tx.c   | 19 ++---
 .../ethernet/mellanox/mlx5/core/en_accel/ktls.h    |  6 +-
 .../ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c | 10 +--
 drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c  | 97 ++++++++--------------
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 11 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  1 -
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c    | 31 ++++---
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h  |  4 +-
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c | 12 +--
 .../net/ethernet/mellanox/mlx5/core/lib/crypto.c   |  1 +
 include/linux/mlx5/device.h                        |  4 +-
 include/linux/mlx5/mlx5_ifc.h                      |  5 +-
 13 files changed, 101 insertions(+), 109 deletions(-)


* [net 01/12] net/mlx5e: Use flow keys dissector to parse packets for ARFS
From: Saeed Mahameed @ 2019-08-08 20:22 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Maxim Mikityanskiy, Tariq Toukan,
	Saeed Mahameed
In-Reply-To: <20190808202025.11303-1-saeedm@mellanox.com>

From: Maxim Mikityanskiy <maximmi@mellanox.com>

The current ARFS code relies on certain fields to be set in the SKB
(e.g. transport_header) and extracts IP addresses and ports by custom
code that parses the packet. The necessary SKB fields, however, are not
always set at that point, which leads to an out-of-bounds access. Use
skb_flow_dissect_flow_keys() to get the necessary information reliably,
fix the out-of-bounds access and reuse the code.

Fixes: 18c908e477dc ("net/mlx5e: Add accelerated RFS support")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_arfs.c | 97 +++++++------------
 1 file changed, 34 insertions(+), 63 deletions(-)
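
For readers skimming the diff, the core of the change is that rule lookup
compares a stored tuple against dissected flow keys instead of re-parsing
packet headers. The following user-space sketch shows that comparison
order (ports, then ethertype, then addresses). It is not the driver code:
struct demo_keys and demo_keys_equal() are hypothetical stand-ins for
struct flow_keys and arfs_cmp(), and byte order is ignored for simplicity.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_ETYPE_IPV4 0x0800
#define DEMO_ETYPE_IPV6 0x86DD

/* Hypothetical, simplified stand-in for dissected flow keys. */
struct demo_keys {
	uint16_t n_proto;  /* ethertype */
	uint8_t  ip_proto; /* L4 protocol, used to pick the table */
	uint16_t src_port;
	uint16_t dst_port;
	union {
		struct { uint32_t src, dst; } v4;
		struct { uint8_t src[16], dst[16]; } v6;
	} addrs;
};

/* Return 1 if both keys describe the same flow, 0 otherwise. */
static int demo_keys_equal(const struct demo_keys *a,
			   const struct demo_keys *b)
{
	/* Cheap checks first: ports, then ethertype. */
	if (a->src_port != b->src_port || a->dst_port != b->dst_port)
		return 0;
	if (a->n_proto != b->n_proto)
		return 0;

	if (a->n_proto == DEMO_ETYPE_IPV4)
		return a->addrs.v4.src == b->addrs.v4.src &&
		       a->addrs.v4.dst == b->addrs.v4.dst;
	if (a->n_proto == DEMO_ETYPE_IPV6)
		return !memcmp(a->addrs.v6.src, b->addrs.v6.src, 16) &&
		       !memcmp(a->addrs.v6.dst, b->addrs.v6.dst, 16);

	/* Anything that is neither IPv4 nor IPv6 never matches. */
	return 0;
}
```

Because every field comes from one dissector pass, the comparison never
touches skb header pointers that may not have been set yet.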

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c b/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
index 8657e0f26995..2c75b2752f58 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
@@ -437,12 +437,6 @@ arfs_hash_bucket(struct arfs_table *arfs_t, __be16 src_port,
 	return &arfs_t->rules_hash[bucket_idx];
 }
 
-static u8 arfs_get_ip_proto(const struct sk_buff *skb)
-{
-	return (skb->protocol == htons(ETH_P_IP)) ?
-		ip_hdr(skb)->protocol : ipv6_hdr(skb)->nexthdr;
-}
-
 static struct arfs_table *arfs_get_table(struct mlx5e_arfs_tables *arfs,
 					 u8 ip_proto, __be16 etype)
 {
@@ -602,31 +596,9 @@ static void arfs_handle_work(struct work_struct *work)
 	arfs_may_expire_flow(priv);
 }
 
-/* return L4 destination port from ip4/6 packets */
-static __be16 arfs_get_dst_port(const struct sk_buff *skb)
-{
-	char *transport_header;
-
-	transport_header = skb_transport_header(skb);
-	if (arfs_get_ip_proto(skb) == IPPROTO_TCP)
-		return ((struct tcphdr *)transport_header)->dest;
-	return ((struct udphdr *)transport_header)->dest;
-}
-
-/* return L4 source port from ip4/6 packets */
-static __be16 arfs_get_src_port(const struct sk_buff *skb)
-{
-	char *transport_header;
-
-	transport_header = skb_transport_header(skb);
-	if (arfs_get_ip_proto(skb) == IPPROTO_TCP)
-		return ((struct tcphdr *)transport_header)->source;
-	return ((struct udphdr *)transport_header)->source;
-}
-
 static struct arfs_rule *arfs_alloc_rule(struct mlx5e_priv *priv,
 					 struct arfs_table *arfs_t,
-					 const struct sk_buff *skb,
+					 const struct flow_keys *fk,
 					 u16 rxq, u32 flow_id)
 {
 	struct arfs_rule *rule;
@@ -641,19 +613,19 @@ static struct arfs_rule *arfs_alloc_rule(struct mlx5e_priv *priv,
 	INIT_WORK(&rule->arfs_work, arfs_handle_work);
 
 	tuple = &rule->tuple;
-	tuple->etype = skb->protocol;
+	tuple->etype = fk->basic.n_proto;
+	tuple->ip_proto = fk->basic.ip_proto;
 	if (tuple->etype == htons(ETH_P_IP)) {
-		tuple->src_ipv4 = ip_hdr(skb)->saddr;
-		tuple->dst_ipv4 = ip_hdr(skb)->daddr;
+		tuple->src_ipv4 = fk->addrs.v4addrs.src;
+		tuple->dst_ipv4 = fk->addrs.v4addrs.dst;
 	} else {
-		memcpy(&tuple->src_ipv6, &ipv6_hdr(skb)->saddr,
+		memcpy(&tuple->src_ipv6, &fk->addrs.v6addrs.src,
 		       sizeof(struct in6_addr));
-		memcpy(&tuple->dst_ipv6, &ipv6_hdr(skb)->daddr,
+		memcpy(&tuple->dst_ipv6, &fk->addrs.v6addrs.dst,
 		       sizeof(struct in6_addr));
 	}
-	tuple->ip_proto = arfs_get_ip_proto(skb);
-	tuple->src_port = arfs_get_src_port(skb);
-	tuple->dst_port = arfs_get_dst_port(skb);
+	tuple->src_port = fk->ports.src;
+	tuple->dst_port = fk->ports.dst;
 
 	rule->flow_id = flow_id;
 	rule->filter_id = priv->fs.arfs.last_filter_id++ % RPS_NO_FILTER;
@@ -664,37 +636,33 @@ static struct arfs_rule *arfs_alloc_rule(struct mlx5e_priv *priv,
 	return rule;
 }
 
-static bool arfs_cmp_ips(struct arfs_tuple *tuple,
-			 const struct sk_buff *skb)
+static bool arfs_cmp(const struct arfs_tuple *tuple, const struct flow_keys *fk)
 {
-	if (tuple->etype == htons(ETH_P_IP) &&
-	    tuple->src_ipv4 == ip_hdr(skb)->saddr &&
-	    tuple->dst_ipv4 == ip_hdr(skb)->daddr)
-		return true;
-	if (tuple->etype == htons(ETH_P_IPV6) &&
-	    (!memcmp(&tuple->src_ipv6, &ipv6_hdr(skb)->saddr,
-		     sizeof(struct in6_addr))) &&
-	    (!memcmp(&tuple->dst_ipv6, &ipv6_hdr(skb)->daddr,
-		     sizeof(struct in6_addr))))
-		return true;
+	if (tuple->src_port != fk->ports.src || tuple->dst_port != fk->ports.dst)
+		return false;
+	if (tuple->etype != fk->basic.n_proto)
+		return false;
+	if (tuple->etype == htons(ETH_P_IP))
+		return tuple->src_ipv4 == fk->addrs.v4addrs.src &&
+		       tuple->dst_ipv4 == fk->addrs.v4addrs.dst;
+	if (tuple->etype == htons(ETH_P_IPV6))
+		return !memcmp(&tuple->src_ipv6, &fk->addrs.v6addrs.src,
+			       sizeof(struct in6_addr)) &&
+		       !memcmp(&tuple->dst_ipv6, &fk->addrs.v6addrs.dst,
+			       sizeof(struct in6_addr));
 	return false;
 }
 
 static struct arfs_rule *arfs_find_rule(struct arfs_table *arfs_t,
-					const struct sk_buff *skb)
+					const struct flow_keys *fk)
 {
 	struct arfs_rule *arfs_rule;
 	struct hlist_head *head;
-	__be16 src_port = arfs_get_src_port(skb);
-	__be16 dst_port = arfs_get_dst_port(skb);
 
-	head = arfs_hash_bucket(arfs_t, src_port, dst_port);
+	head = arfs_hash_bucket(arfs_t, fk->ports.src, fk->ports.dst);
 	hlist_for_each_entry(arfs_rule, head, hlist) {
-		if (arfs_rule->tuple.src_port == src_port &&
-		    arfs_rule->tuple.dst_port == dst_port &&
-		    arfs_cmp_ips(&arfs_rule->tuple, skb)) {
+		if (arfs_cmp(&arfs_rule->tuple, fk))
 			return arfs_rule;
-		}
 	}
 
 	return NULL;
@@ -707,20 +675,24 @@ int mlx5e_rx_flow_steer(struct net_device *dev, const struct sk_buff *skb,
 	struct mlx5e_arfs_tables *arfs = &priv->fs.arfs;
 	struct arfs_table *arfs_t;
 	struct arfs_rule *arfs_rule;
+	struct flow_keys fk;
+
+	if (!skb_flow_dissect_flow_keys(skb, &fk, 0))
+		return -EPROTONOSUPPORT;
 
-	if (skb->protocol != htons(ETH_P_IP) &&
-	    skb->protocol != htons(ETH_P_IPV6))
+	if (fk.basic.n_proto != htons(ETH_P_IP) &&
+	    fk.basic.n_proto != htons(ETH_P_IPV6))
 		return -EPROTONOSUPPORT;
 
 	if (skb->encapsulation)
 		return -EPROTONOSUPPORT;
 
-	arfs_t = arfs_get_table(arfs, arfs_get_ip_proto(skb), skb->protocol);
+	arfs_t = arfs_get_table(arfs, fk.basic.ip_proto, fk.basic.n_proto);
 	if (!arfs_t)
 		return -EPROTONOSUPPORT;
 
 	spin_lock_bh(&arfs->arfs_lock);
-	arfs_rule = arfs_find_rule(arfs_t, skb);
+	arfs_rule = arfs_find_rule(arfs_t, &fk);
 	if (arfs_rule) {
 		if (arfs_rule->rxq == rxq_index) {
 			spin_unlock_bh(&arfs->arfs_lock);
@@ -728,8 +700,7 @@ int mlx5e_rx_flow_steer(struct net_device *dev, const struct sk_buff *skb,
 		}
 		arfs_rule->rxq = rxq_index;
 	} else {
-		arfs_rule = arfs_alloc_rule(priv, arfs_t, skb,
-					    rxq_index, flow_id);
+		arfs_rule = arfs_alloc_rule(priv, arfs_t, &fk, rxq_index, flow_id);
 		if (!arfs_rule) {
 			spin_unlock_bh(&arfs->arfs_lock);
 			return -ENOMEM;
-- 
2.21.0



* [net 02/12] net/mlx5: Support inner header match criteria for non decap flow action
From: Saeed Mahameed @ 2019-08-08 20:22 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Huy Nguyen, Roi Dayan, Saeed Mahameed
In-Reply-To: <20190808202025.11303-1-saeedm@mellanox.com>

From: Huy Nguyen <huyn@mellanox.com>

An OVS application creates an offloaded drop rule that drops VXLAN
traffic with both inner and outer header match criteria. The mlx5_core
driver correctly detects the inner and outer header match criteria, but
does not enable the inner header match criteria due to an incorrect
assumption in mlx5_eswitch_add_offloaded_rule that only a decap rule
needs inner header criteria.

Solution:
Remove mlx5_esw_flow_attr's match_level and tunnel_match_level and add
two new members: inner_match_level and outer_match_level.
inner/outer_match_level is set to NONE if the inner/outer match criteria
are not specified in the tc rule creation request. The decap assumption is
removed and the code just needs to check for inner/outer_match_level to
enable the corresponding bit in firmware's match_criteria_enable value.

Fixes: 6363651d6dd7 ("net/mlx5e: Properly set steering match levels for offloaded TC decap rules")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   | 31 ++++++++++++-------
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  4 +--
 .../mellanox/mlx5/core/eswitch_offloads.c     | 12 +++----
 3 files changed, 26 insertions(+), 21 deletions(-)
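
The logic described in the solution above can be sketched in isolation as
follows. This is an illustrative user-space model, not the driver code:
the DEMO_* names and bit positions are hypothetical, chosen only to show
that each enable bit now depends on its own match level and that the
inline-mode check uses the inner level when one is set.

```c
#include <stdint.h>

/* Hypothetical match levels, ordered from "nothing" to "L4". */
enum demo_match_level {
	DEMO_MATCH_NONE,
	DEMO_MATCH_L2,
	DEMO_MATCH_L3,
	DEMO_MATCH_L4,
};

/* Hypothetical enable bits for the two header sets. */
#define DEMO_CRITERIA_OUTER (1u << 0)
#define DEMO_CRITERIA_INNER (1u << 2)

/* Each bit is driven by its own level; no decap special case remains. */
static uint8_t demo_criteria_enable(enum demo_match_level inner,
				    enum demo_match_level outer)
{
	uint8_t enable = 0;

	if (outer != DEMO_MATCH_NONE)
		enable |= DEMO_CRITERIA_OUTER;
	if (inner != DEMO_MATCH_NONE)
		enable |= DEMO_CRITERIA_INNER;
	return enable;
}

/* Level used for the min-inline check: inner if present, else outer. */
static enum demo_match_level
demo_non_tunnel_level(enum demo_match_level inner,
		      enum demo_match_level outer)
{
	return inner == DEMO_MATCH_NONE ? outer : inner;
}
```

With this shape, a drop rule that matches on both VXLAN outer headers and
inner headers gets both bits set even though it has no decap action.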

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 7ecfc53cf5f6..deeb65da99f3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1480,7 +1480,7 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
 			      struct mlx5_flow_spec *spec,
 			      struct flow_cls_offload *f,
 			      struct net_device *filter_dev,
-			      u8 *match_level, u8 *tunnel_match_level)
+			      u8 *inner_match_level, u8 *outer_match_level)
 {
 	struct netlink_ext_ack *extack = f->common.extack;
 	void *headers_c = MLX5_ADDR_OF(fte_match_param, spec->match_criteria,
@@ -1495,8 +1495,9 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
 	struct flow_dissector *dissector = rule->match.dissector;
 	u16 addr_type = 0;
 	u8 ip_proto = 0;
+	u8 *match_level;
 
-	*match_level = MLX5_MATCH_NONE;
+	match_level = outer_match_level;
 
 	if (dissector->used_keys &
 	    ~(BIT(FLOW_DISSECTOR_KEY_META) |
@@ -1524,12 +1525,14 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
 	}
 
 	if (mlx5e_get_tc_tun(filter_dev)) {
-		if (parse_tunnel_attr(priv, spec, f, filter_dev, tunnel_match_level))
+		if (parse_tunnel_attr(priv, spec, f, filter_dev,
+				      outer_match_level))
 			return -EOPNOTSUPP;
 
-		/* In decap flow, header pointers should point to the inner
+		/* At this point, header pointers should point to the inner
 		 * headers, outer header were already set by parse_tunnel_attr
 		 */
+		match_level = inner_match_level;
 		headers_c = get_match_headers_criteria(MLX5_FLOW_CONTEXT_ACTION_DECAP,
 						       spec);
 		headers_v = get_match_headers_value(MLX5_FLOW_CONTEXT_ACTION_DECAP,
@@ -1831,35 +1834,41 @@ static int parse_cls_flower(struct mlx5e_priv *priv,
 			    struct flow_cls_offload *f,
 			    struct net_device *filter_dev)
 {
+	u8 inner_match_level, outer_match_level, non_tunnel_match_level;
 	struct netlink_ext_ack *extack = f->common.extack;
 	struct mlx5_core_dev *dev = priv->mdev;
 	struct mlx5_eswitch *esw = dev->priv.eswitch;
 	struct mlx5e_rep_priv *rpriv = priv->ppriv;
-	u8 match_level, tunnel_match_level = MLX5_MATCH_NONE;
 	struct mlx5_eswitch_rep *rep;
 	int err;
 
-	err = __parse_cls_flower(priv, spec, f, filter_dev, &match_level, &tunnel_match_level);
+	inner_match_level = MLX5_MATCH_NONE;
+	outer_match_level = MLX5_MATCH_NONE;
+
+	err = __parse_cls_flower(priv, spec, f, filter_dev, &inner_match_level,
+				 &outer_match_level);
+	non_tunnel_match_level = (inner_match_level == MLX5_MATCH_NONE) ?
+				 outer_match_level : inner_match_level;
 
 	if (!err && (flow->flags & MLX5E_TC_FLOW_ESWITCH)) {
 		rep = rpriv->rep;
 		if (rep->vport != MLX5_VPORT_UPLINK &&
 		    (esw->offloads.inline_mode != MLX5_INLINE_MODE_NONE &&
-		    esw->offloads.inline_mode < match_level)) {
+		    esw->offloads.inline_mode < non_tunnel_match_level)) {
 			NL_SET_ERR_MSG_MOD(extack,
 					   "Flow is not offloaded due to min inline setting");
 			netdev_warn(priv->netdev,
 				    "Flow is not offloaded due to min inline setting, required %d actual %d\n",
-				    match_level, esw->offloads.inline_mode);
+				    non_tunnel_match_level, esw->offloads.inline_mode);
 			return -EOPNOTSUPP;
 		}
 	}
 
 	if (flow->flags & MLX5E_TC_FLOW_ESWITCH) {
-		flow->esw_attr->match_level = match_level;
-		flow->esw_attr->tunnel_match_level = tunnel_match_level;
+		flow->esw_attr->inner_match_level = inner_match_level;
+		flow->esw_attr->outer_match_level = outer_match_level;
 	} else {
-		flow->nic_attr->match_level = match_level;
+		flow->nic_attr->match_level = non_tunnel_match_level;
 	}
 
 	return err;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index a38e8a3c7c9a..04685dbb280c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -377,8 +377,8 @@ struct mlx5_esw_flow_attr {
 		struct mlx5_termtbl_handle *termtbl;
 	} dests[MLX5_MAX_FLOW_FWD_VPORTS];
 	u32	mod_hdr_id;
-	u8	match_level;
-	u8	tunnel_match_level;
+	u8	inner_match_level;
+	u8	outer_match_level;
 	struct mlx5_fc *counter;
 	u32	chain;
 	u16	prio;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 089ae4d48a82..0323fd078271 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -207,14 +207,10 @@ mlx5_eswitch_add_offloaded_rule(struct mlx5_eswitch *esw,
 
 	mlx5_eswitch_set_rule_source_port(esw, spec, attr);
 
-	if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_DECAP) {
-		if (attr->tunnel_match_level != MLX5_MATCH_NONE)
-			spec->match_criteria_enable |= MLX5_MATCH_OUTER_HEADERS;
-		if (attr->match_level != MLX5_MATCH_NONE)
-			spec->match_criteria_enable |= MLX5_MATCH_INNER_HEADERS;
-	} else if (attr->match_level != MLX5_MATCH_NONE) {
+	if (attr->outer_match_level != MLX5_MATCH_NONE)
 		spec->match_criteria_enable |= MLX5_MATCH_OUTER_HEADERS;
-	}
+	if (attr->inner_match_level != MLX5_MATCH_NONE)
+		spec->match_criteria_enable |= MLX5_MATCH_INNER_HEADERS;
 
 	if (flow_act.action & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR)
 		flow_act.modify_id = attr->mod_hdr_id;
@@ -290,7 +286,7 @@ mlx5_eswitch_add_fwd_rule(struct mlx5_eswitch *esw,
 	mlx5_eswitch_set_rule_source_port(esw, spec, attr);
 
 	spec->match_criteria_enable |= MLX5_MATCH_MISC_PARAMETERS;
-	if (attr->match_level != MLX5_MATCH_NONE)
+	if (attr->outer_match_level != MLX5_MATCH_NONE)
 		spec->match_criteria_enable |= MLX5_MATCH_OUTER_HEADERS;
 
 	rule = mlx5_add_flow_rules(fast_fdb, spec, &flow_act, dest, i);
-- 
2.21.0



* [net 03/12] net/mlx5e: Only support tx/rx pause setting for port owner
From: Saeed Mahameed @ 2019-08-08 20:22 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Huy Nguyen, Parav Pandit, Saeed Mahameed
In-Reply-To: <20190808202025.11303-1-saeedm@mellanox.com>

From: Huy Nguyen <huyn@mellanox.com>

Only support changing tx/rx pause frame setting if the net device
is the vport group manager.

Fixes: 3c2d18ef22df ("net/mlx5e: Support ethtool get/set_pauseparam")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 03bed714bac3..ee9fa0c2c8b9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -1338,6 +1338,9 @@ int mlx5e_ethtool_set_pauseparam(struct mlx5e_priv *priv,
 	struct mlx5_core_dev *mdev = priv->mdev;
 	int err;
 
+	if (!MLX5_CAP_GEN(mdev, vport_group_manager))
+		return -EOPNOTSUPP;
+
 	if (pauseparam->autoneg)
 		return -EINVAL;
 
-- 
2.21.0



* [net 04/12] net/mlx5e: ethtool, Avoid setting speed to 56GBASE when autoneg off
From: Saeed Mahameed @ 2019-08-08 20:22 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev@vger.kernel.org, Mohamad Heib, Saeed Mahameed
In-Reply-To: <20190808202025.11303-1-saeedm@mellanox.com>

From: Mohamad Heib <mohamadh@mellanox.com>

Setting speed to 56GBASE is allowed only with auto-negotiation enabled.

This patch prevents setting speed to 56GBASE when auto-negotiation is disabled.

Fixes: f62b8bb8f2d3 ("net/mlx5: Extend mlx5_core to support ConnectX-4 Ethernet functionality")
Signed-off-by: Mohamad Heib <mohamadh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index ee9fa0c2c8b9..e89dba790a2d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -1081,6 +1081,14 @@ int mlx5e_ethtool_set_link_ksettings(struct mlx5e_priv *priv,
 	link_modes = autoneg == AUTONEG_ENABLE ? ethtool2ptys_adver_func(adver) :
 		mlx5e_port_speed2linkmodes(mdev, speed, !ext);
 
+	if ((link_modes & MLX5E_PROT_MASK(MLX5E_56GBASE_R4)) &&
+	    autoneg != AUTONEG_ENABLE) {
+		netdev_err(priv->netdev, "%s: 56G link speed requires autoneg enabled\n",
+			   __func__);
+		err = -EINVAL;
+		goto out;
+	}
+
 	link_modes = link_modes & eproto.cap;
 	if (!link_modes) {
 		netdev_err(priv->netdev, "%s: Not supported link mode(s) requested",
-- 
2.21.0



* [net 05/12] net/mlx5: crypto, Fix wrong offset in encryption key command
From: Saeed Mahameed @ 2019-08-08 20:22 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev@vger.kernel.org, Tariq Toukan, Saeed Mahameed
In-Reply-To: <20190808202025.11303-1-saeedm@mellanox.com>

From: Tariq Toukan <tariqt@mellanox.com>

Fix the 128b key offset in key encryption key creation command,
per the HW specification.

Fixes: 45d3b55dc665 ("net/mlx5: Add crypto library to support create/destroy encryption key")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/lib/crypto.c | 1 +
 1 file changed, 1 insertion(+)
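
As a rough illustration of what the one-line fix does, the sketch below
places a key into a fixed-size key field at a size-dependent offset. It
is a user-space model under stated assumptions: the 32-byte field size is
an assumption not spelled out in the patch, and demo_place_key() is a
hypothetical helper, not a driver function. The 128-bit branch mirrors
the added "key_p += sz_bytes" line.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_KEY_FIELD_BYTES 32 /* assumed size of the key field */

/*
 * Copy a key of key_bits length into the key field.
 * Returns 0 on success, -1 for an unsupported key size.
 */
static int demo_place_key(uint8_t field[DEMO_KEY_FIELD_BYTES],
			  const uint8_t *key, size_t key_bits)
{
	size_t sz_bytes = key_bits / 8;
	size_t offset;

	switch (key_bits) {
	case 128:
		offset = sz_bytes; /* short key goes after a 16-byte gap */
		break;
	case 256:
		offset = 0;        /* full-size key fills the whole field */
		break;
	default:
		return -1;
	}

	memset(field, 0, DEMO_KEY_FIELD_BYTES);
	memcpy(field + offset, key, sz_bytes);
	return 0;
}
```

Under these assumptions a 128-bit key ends up in the second half of the
field, while a 256-bit key is copied from the start as before.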

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/crypto.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/crypto.c
index ea9ee88491e5..ea1d4d26ece0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/crypto.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/crypto.c
@@ -27,6 +27,7 @@ int mlx5_create_encryption_key(struct mlx5_core_dev *mdev,
 	case 128:
 		general_obj_key_size =
 			MLX5_GENERAL_OBJECT_TYPE_ENCRYPTION_KEY_KEY_SIZE_128;
+		key_p += sz_bytes;
 		break;
 	case 256:
 		general_obj_key_size =
-- 
2.21.0



* [net 06/12] net/mlx5: kTLS, Fix wrong TIS opmod constants
From: Saeed Mahameed @ 2019-08-08 20:22 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev@vger.kernel.org, Tariq Toukan, Saeed Mahameed
In-Reply-To: <20190808202025.11303-1-saeedm@mellanox.com>

From: Tariq Toukan <tariqt@mellanox.com>

Fix the used constants for TLS TIS opmods, per the HW specification.

Fixes: a12ff35e0fb7 ("net/mlx5: Introduce TLS TX offload hardware bits and structures")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/linux/mlx5/device.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index ce9839c8bc1a..c2f056b5766d 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -446,11 +446,11 @@ enum {
 };
 
 enum {
-	MLX5_OPC_MOD_TLS_TIS_STATIC_PARAMS = 0x20,
+	MLX5_OPC_MOD_TLS_TIS_STATIC_PARAMS = 0x1,
 };
 
 enum {
-	MLX5_OPC_MOD_TLS_TIS_PROGRESS_PARAMS = 0x20,
+	MLX5_OPC_MOD_TLS_TIS_PROGRESS_PARAMS = 0x1,
 };
 
 enum {
-- 
2.21.0



* [net 07/12] net/mlx5e: kTLS, Fix progress params context WQE layout
From: Saeed Mahameed @ 2019-08-08 20:22 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev@vger.kernel.org, Tariq Toukan, Saeed Mahameed
In-Reply-To: <20190808202025.11303-1-saeedm@mellanox.com>

From: Tariq Toukan <tariqt@mellanox.com>

The TLS progress params context WQE should not include an
Eth segment, so drop it.
In addition, align the tls_progress_params layout with the
HW specification document:
- fix the tisn field name.
- remove the valid bit.

Fixes: a12ff35e0fb7 ("net/mlx5: Introduce TLS TX offload hardware bits and structures")
Fixes: d2ead1f360e8 ("net/mlx5e: Add kTLS TX HW offload support")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h             | 9 +++++++--
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls.h  | 6 ++++--
 .../net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c   | 4 ++--
 include/linux/mlx5/mlx5_ifc.h                            | 5 ++---
 4 files changed, 15 insertions(+), 9 deletions(-)
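
The size computation change can be illustrated with a small stand-alone
model. The structs below are hypothetical stand-ins (16-byte control and
Eth segments, a 12-byte context) and do not reproduce the real mlx5
layouts; the point is only that measuring with offsetof() up to the
context member excludes the Eth segment, whereas sizeof() of the whole
WQE struct includes it.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PROGRESS_CTX_BYTES 12 /* hypothetical context size */

struct demo_ctrl_seg { uint8_t raw[16]; }; /* stand-in control segment */
struct demo_eth_seg  { uint8_t raw[16]; }; /* stand-in Eth segment */

/* The context overlays the Eth segment instead of following it. */
struct demo_tx_wqe {
	struct demo_ctrl_seg ctrl;
	union {
		struct demo_eth_seg eth;
		uint8_t progress_ctx[1];
	} u;
};

/* Old style: whole struct (ctrl + eth) plus the context. */
#define DEMO_OLD_WQE_SZ \
	(sizeof(struct demo_ctrl_seg) + sizeof(struct demo_eth_seg) + \
	 DEMO_PROGRESS_CTX_BYTES)

/* New style: bytes up to the context member plus the context. */
#define DEMO_NEW_WQE_SZ \
	(offsetof(struct demo_tx_wqe, u.progress_ctx) + \
	 DEMO_PROGRESS_CTX_BYTES)
```

In this model the new size is smaller by exactly the size of the Eth
segment, which matches the intent of dropping it from the WQE.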

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index ce1be2a84231..f6b64a03cd06 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -184,8 +184,13 @@ static inline int mlx5e_get_max_num_channels(struct mlx5_core_dev *mdev)
 
 struct mlx5e_tx_wqe {
 	struct mlx5_wqe_ctrl_seg ctrl;
-	struct mlx5_wqe_eth_seg  eth;
-	struct mlx5_wqe_data_seg data[0];
+	union {
+		struct {
+			struct mlx5_wqe_eth_seg  eth;
+			struct mlx5_wqe_data_seg data[0];
+		};
+		u8 tls_progress_params_ctx[0];
+	};
 };
 
 struct mlx5e_rx_wqe_ll {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls.h
index 407da83474ef..b7298f9ee3d3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls.h
@@ -11,12 +11,14 @@
 #include "accel/tls.h"
 
 #define MLX5E_KTLS_STATIC_UMR_WQE_SZ \
-	(sizeof(struct mlx5e_umr_wqe) + MLX5_ST_SZ_BYTES(tls_static_params))
+	(offsetof(struct mlx5e_umr_wqe, tls_static_params_ctx) + \
+	 MLX5_ST_SZ_BYTES(tls_static_params))
 #define MLX5E_KTLS_STATIC_WQEBBS \
 	(DIV_ROUND_UP(MLX5E_KTLS_STATIC_UMR_WQE_SZ, MLX5_SEND_WQE_BB))
 
 #define MLX5E_KTLS_PROGRESS_WQE_SZ \
-	(sizeof(struct mlx5e_tx_wqe) + MLX5_ST_SZ_BYTES(tls_progress_params))
+	(offsetof(struct mlx5e_tx_wqe, tls_progress_params_ctx) + \
+	 MLX5_ST_SZ_BYTES(tls_progress_params))
 #define MLX5E_KTLS_PROGRESS_WQEBBS \
 	(DIV_ROUND_UP(MLX5E_KTLS_PROGRESS_WQE_SZ, MLX5_SEND_WQE_BB))
 #define MLX5E_KTLS_MAX_DUMP_WQEBBS 2
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
index 3766545ce259..9f67bfb559f1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
@@ -80,7 +80,7 @@ build_static_params(struct mlx5e_umr_wqe *wqe, u16 pc, u32 sqn,
 static void
 fill_progress_params_ctx(void *ctx, struct mlx5e_ktls_offload_context_tx *priv_tx)
 {
-	MLX5_SET(tls_progress_params, ctx, pd, priv_tx->tisn);
+	MLX5_SET(tls_progress_params, ctx, tisn, priv_tx->tisn);
 	MLX5_SET(tls_progress_params, ctx, record_tracker_state,
 		 MLX5E_TLS_PROGRESS_PARAMS_RECORD_TRACKER_STATE_START);
 	MLX5_SET(tls_progress_params, ctx, auth_state,
@@ -104,7 +104,7 @@ build_progress_params(struct mlx5e_tx_wqe *wqe, u16 pc, u32 sqn,
 					     PROGRESS_PARAMS_DS_CNT);
 	cseg->fm_ce_se         = fence ? MLX5_FENCE_MODE_INITIATOR_SMALL : 0;
 
-	fill_progress_params_ctx(wqe->data, priv_tx);
+	fill_progress_params_ctx(wqe->tls_progress_params_ctx, priv_tx);
 }
 
 static void tx_fill_wi(struct mlx5e_txqsq *sq,
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index ec571fd7fcf8..b8b570c30b5e 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -10054,9 +10054,8 @@ struct mlx5_ifc_tls_static_params_bits {
 };
 
 struct mlx5_ifc_tls_progress_params_bits {
-	u8         valid[0x1];
-	u8         reserved_at_1[0x7];
-	u8         pd[0x18];
+	u8         reserved_at_0[0x8];
+	u8         tisn[0x18];
 
 	u8         next_record_tcp_sn[0x20];
 
-- 
2.21.0
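
The size change in this patch can be illustrated outside the driver. The following is a minimal sketch, not the driver code: the segment structs are 16-byte stand-ins, and CTX_BYTES is a hypothetical context size chosen only for the arithmetic. It shows why offsetof() of the context member gives a smaller WQE size than sizeof() of the whole struct once the context overlays the Eth segment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in segments; the real mlx5 segment layouts live in the driver. */
struct ctrl_seg { uint8_t raw[16]; };
struct eth_seg  { uint8_t raw[16]; };
struct data_seg { uint8_t raw[16]; };

/* Same shape as the patched mlx5e_tx_wqe: after the control segment comes
 * either eth + data, or the context bytes directly. Zero-length arrays and
 * anonymous members are GNU C extensions, as used in the kernel.
 */
struct tx_wqe {
	struct ctrl_seg ctrl;
	union {
		struct {
			struct eth_seg  eth;
			struct data_seg data[0];
		};
		uint8_t progress_params_ctx[0];
	};
};

enum { CTX_BYTES = 16 }; /* hypothetical context size, for illustration */

/* Old sizing: counts the Eth segment even though the WQE must not carry it. */
static size_t wqe_size_old(void)
{
	return sizeof(struct tx_wqe) + CTX_BYTES;
}

/* New sizing: the context starts right after the control segment. */
static size_t wqe_size_new(void)
{
	return offsetof(struct tx_wqe, progress_params_ctx) + CTX_BYTES;
}
```

With these stand-in sizes the old formula yields 48 bytes and the new one 32; the difference is exactly the Eth segment that the commit message says should be dropped.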


^ permalink raw reply related

* [net 08/12] net/mlx5e: kTLS, Fix tisn field name
From: Saeed Mahameed @ 2019-08-08 20:22 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev@vger.kernel.org, Tariq Toukan, Saeed Mahameed
In-Reply-To: <20190808202025.11303-1-saeedm@mellanox.com>

From: Tariq Toukan <tariqt@mellanox.com>

Use the proper tisn field name from the union in struct mlx5_wqe_ctrl_seg.

Fixes: d2ead1f360e8 ("net/mlx5e: Add kTLS TX HW offload support")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
index 9f67bfb559f1..cfc9e7d457e3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
@@ -69,7 +69,7 @@ build_static_params(struct mlx5e_umr_wqe *wqe, u16 pc, u32 sqn,
 	cseg->qpn_ds           = cpu_to_be32((sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
 					     STATIC_PARAMS_DS_CNT);
 	cseg->fm_ce_se         = fence ? MLX5_FENCE_MODE_INITIATOR_SMALL : 0;
-	cseg->imm              = cpu_to_be32(priv_tx->tisn);
+	cseg->tisn             = cpu_to_be32(priv_tx->tisn);
 
 	ucseg->flags = MLX5_UMR_INLINE;
 	ucseg->bsf_octowords = cpu_to_be16(MLX5_ST_SZ_BYTES(tls_static_params) / 16);
@@ -278,7 +278,7 @@ tx_post_resync_dump(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 
 	cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8)  | MLX5_OPCODE_DUMP);
 	cseg->qpn_ds           = cpu_to_be32((sq->sqn << 8) | ds_cnt);
-	cseg->imm              = cpu_to_be32(tisn);
+	cseg->tisn             = cpu_to_be32(tisn);
 	cseg->fm_ce_se         = first ? MLX5_FENCE_MODE_INITIATOR_SMALL : 0;
 
 	eseg->inline_hdr.sz = cpu_to_be16(ihs);
@@ -434,7 +434,7 @@ struct sk_buff *mlx5e_ktls_handle_tx_skb(struct net_device *netdev,
 	priv_tx->expected_seq = seq + datalen;
 
 	cseg = &(*wqe)->ctrl;
-	cseg->imm = cpu_to_be32(priv_tx->tisn);
+	cseg->tisn = cpu_to_be32(priv_tx->tisn);
 
 	stats->tls_encrypted_packets += skb_is_gso(skb) ? skb_shinfo(skb)->gso_segs : 1;
 	stats->tls_encrypted_bytes   += datalen;
-- 
2.21.0


^ permalink raw reply related

* [net 09/12] net/mlx5e: kTLS, Fix tisn field placement
From: Saeed Mahameed @ 2019-08-08 20:22 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev@vger.kernel.org, Tariq Toukan, Saeed Mahameed
In-Reply-To: <20190808202025.11303-1-saeedm@mellanox.com>

From: Tariq Toukan <tariqt@mellanox.com>

Shift the tisn field in the WQE control segment, per the
HW specification.

Fixes: d2ead1f360e8 ("net/mlx5e: Add kTLS TX HW offload support")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
index cfc9e7d457e3..8b93101e1a09 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
@@ -69,7 +69,7 @@ build_static_params(struct mlx5e_umr_wqe *wqe, u16 pc, u32 sqn,
 	cseg->qpn_ds           = cpu_to_be32((sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
 					     STATIC_PARAMS_DS_CNT);
 	cseg->fm_ce_se         = fence ? MLX5_FENCE_MODE_INITIATOR_SMALL : 0;
-	cseg->tisn             = cpu_to_be32(priv_tx->tisn);
+	cseg->tisn             = cpu_to_be32(priv_tx->tisn << 8);
 
 	ucseg->flags = MLX5_UMR_INLINE;
 	ucseg->bsf_octowords = cpu_to_be16(MLX5_ST_SZ_BYTES(tls_static_params) / 16);
@@ -278,7 +278,7 @@ tx_post_resync_dump(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 
 	cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8)  | MLX5_OPCODE_DUMP);
 	cseg->qpn_ds           = cpu_to_be32((sq->sqn << 8) | ds_cnt);
-	cseg->tisn             = cpu_to_be32(tisn);
+	cseg->tisn             = cpu_to_be32(tisn << 8);
 	cseg->fm_ce_se         = first ? MLX5_FENCE_MODE_INITIATOR_SMALL : 0;
 
 	eseg->inline_hdr.sz = cpu_to_be16(ihs);
@@ -434,7 +434,7 @@ struct sk_buff *mlx5e_ktls_handle_tx_skb(struct net_device *netdev,
 	priv_tx->expected_seq = seq + datalen;
 
 	cseg = &(*wqe)->ctrl;
-	cseg->tisn = cpu_to_be32(priv_tx->tisn);
+	cseg->tisn = cpu_to_be32(priv_tx->tisn << 8);
 
 	stats->tls_encrypted_packets += skb_is_gso(skb) ? skb_shinfo(skb)->gso_segs : 1;
 	stats->tls_encrypted_bytes   += datalen;
-- 
2.21.0
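
The effect of the added shift can be seen in the byte layout of the dword. This is a user-space sketch under two assumptions not stated in the patch: that the TIS number is at most 24 bits wide, and that the low byte of the dword is left zero. put_be32() and encode_tisn() are hypothetical helper names; put_be32() produces the same bytes that storing the result of cpu_to_be32() would.

```c
#include <assert.h>
#include <stdint.h>

/* Store a 32-bit value most-significant byte first. */
static void put_be32(uint8_t out[4], uint32_t v)
{
	out[0] = (uint8_t)(v >> 24);
	out[1] = (uint8_t)(v >> 16);
	out[2] = (uint8_t)(v >> 8);
	out[3] = (uint8_t)v;
}

/* Place the TIS number in the upper three bytes of the dword, as the
 * "tisn << 8" in the patch does; the lowest byte stays zero.
 */
static void encode_tisn(uint8_t out[4], uint32_t tisn)
{
	put_be32(out, (tisn & 0xffffffu) << 8);
}
```

Without the shift, the same TIS number would occupy the lower three bytes instead, which is the placement this patch corrects.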


^ permalink raw reply related

* [net 10/12] net/mlx5e: Fix false negative indication on tx reporter CQE recovery
From: Saeed Mahameed @ 2019-08-08 20:22 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Aya Levin, Tariq Toukan, Saeed Mahameed
In-Reply-To: <20190808202025.11303-1-saeedm@mellanox.com>

From: Aya Levin <ayal@mellanox.com>

Remove wrong error return value when SQ is not in error state.
CQE recovery on TX reporter queries the sq state. If the sq is not in
error state, the sq is either in ready or reset state. Ready state is a
good state which doesn't require recovery, and reset state is a temporary
state which ends in ready state. With this patch, CQE recovery in this
scenario is successful.

Fixes: de8650a82071 ("net/mlx5e: Add tx reporter support")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
index f3d98748b211..b307234b4e05 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
@@ -86,10 +86,8 @@ static int mlx5e_tx_reporter_err_cqe_recover(struct mlx5e_txqsq *sq)
 		return err;
 	}
 
-	if (state != MLX5_SQC_STATE_ERR) {
-		netdev_err(dev, "SQ 0x%x not in ERROR state\n", sq->sqn);
-		return -EINVAL;
-	}
+	if (state != MLX5_SQC_STATE_ERR)
+		return 0;
 
 	mlx5e_tx_disable_queue(sq->txq);
 
-- 
2.21.0
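
The changed return value can be sketched with a mock queue. The enum and struct below are stand-ins, not the driver's types; the point is only that a queue which is not in the error state reports success without doing any recovery work.

```c
#include <assert.h>

/* Stand-ins for the MLX5_SQC_STATE_* values. */
enum sq_hw_state { SQ_STATE_RST, SQ_STATE_RDY, SQ_STATE_ERR };

struct mock_sq {
	enum sq_hw_state hw_state;
	int recoveries; /* counts completed recoveries */
};

static int err_cqe_recover(struct mock_sq *sq)
{
	/* Ready or reset: nothing to recover, and not a failure. */
	if (sq->hw_state != SQ_STATE_ERR)
		return 0;

	/* Placeholder for disabling the queue, flushing, and resetting. */
	sq->hw_state = SQ_STATE_RDY;
	sq->recoveries++;
	return 0;
}
```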


^ permalink raw reply related

* [net 11/12] net/mlx5e: Fix error flow of CQE recovery on tx reporter
From: Saeed Mahameed @ 2019-08-08 20:22 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Aya Levin, Tariq Toukan, Saeed Mahameed
In-Reply-To: <20190808202025.11303-1-saeedm@mellanox.com>

From: Aya Levin <ayal@mellanox.com>

CQE recovery function begins with test and set of recovery bit. Add an
error flow which ensures clearing of this bit when leaving the recovery
function, to allow further recoveries to take place. This allows removal
of clearing recovery bit on sq activate.

Fixes: de8650a82071 ("net/mlx5e: Add tx reporter support")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/en/reporter_tx.c | 12 ++++++++----
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c    |  1 -
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
index b307234b4e05..b91814ecfbc9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
@@ -83,17 +83,17 @@ static int mlx5e_tx_reporter_err_cqe_recover(struct mlx5e_txqsq *sq)
 	if (err) {
 		netdev_err(dev, "Failed to query SQ 0x%x state. err = %d\n",
 			   sq->sqn, err);
-		return err;
+		goto out;
 	}
 
 	if (state != MLX5_SQC_STATE_ERR)
-		return 0;
+		goto out;
 
 	mlx5e_tx_disable_queue(sq->txq);
 
 	err = mlx5e_wait_for_sq_flush(sq);
 	if (err)
-		return err;
+		goto out;
 
 	/* At this point, no new packets will arrive from the stack as TXQ is
 	 * marked with QUEUE_STATE_DRV_XOFF. In addition, NAPI cleared all
@@ -102,13 +102,17 @@ static int mlx5e_tx_reporter_err_cqe_recover(struct mlx5e_txqsq *sq)
 
 	err = mlx5e_sq_to_ready(sq, state);
 	if (err)
-		return err;
+		goto out;
 
 	mlx5e_reset_txqsq_cc_pc(sq);
 	sq->stats->recover++;
+	clear_bit(MLX5E_SQ_STATE_RECOVERING, &sq->state);
 	mlx5e_activate_txqsq(sq);
 
 	return 0;
+out:
+	clear_bit(MLX5E_SQ_STATE_RECOVERING, &sq->state);
+	return err;
 }
 
 static int mlx5_tx_health_report(struct devlink_health_reporter *tx_reporter,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 6c712c5be4d8..9d5f6e56188f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1321,7 +1321,6 @@ static int mlx5e_open_txqsq(struct mlx5e_channel *c,
 void mlx5e_activate_txqsq(struct mlx5e_txqsq *sq)
 {
 	sq->txq = netdev_get_tx_queue(sq->channel->netdev, sq->txq_ix);
-	clear_bit(MLX5E_SQ_STATE_RECOVERING, &sq->state);
 	set_bit(MLX5E_SQ_STATE_ENABLED, &sq->state);
 	netdev_tx_reset_queue(sq->txq);
 	netif_tx_start_queue(sq->txq);
-- 
2.21.0
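
The error flow added here follows the usual single-exit cleanup pattern. The sketch below uses a mock queue with hypothetical fields and, as a simplification, clears the flag at one common label on both success and failure, whereas the patch clears it before activation on the success path. What it shares with the patch is that the recovering flag is cleared on every way out of the function, so a later error can trigger another recovery.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct mock_sq {
	bool recovering; /* stand-in for MLX5E_SQ_STATE_RECOVERING */
	bool in_error;   /* stand-in for the queried hardware state */
	int  flush_err;  /* error injected into the flush step */
	int  recoveries;
};

static int recover(struct mock_sq *sq)
{
	int err = 0;

	if (!sq->in_error)
		goto out;

	err = sq->flush_err; /* stand-in for waiting on the SQ flush */
	if (err)
		goto out;

	sq->in_error = false;
	sq->recoveries++;
out:
	/* Always drop the flag so that further recoveries can take place. */
	sq->recovering = false;
	return err;
}
```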


^ permalink raw reply related

* [net 12/12] net/mlx5e: Remove redundant check in CQE recovery flow of tx reporter
From: Saeed Mahameed @ 2019-08-08 20:22 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev@vger.kernel.org, Aya Levin, Saeed Mahameed
In-Reply-To: <20190808202025.11303-1-saeedm@mellanox.com>

From: Aya Levin <ayal@mellanox.com>

Remove check of recovery bit, in the beginning of the CQE recovery
function. This test is already performed right before the reporter
is invoked, when CQE error is detected.

Fixes: de8650a82071 ("net/mlx5e: Add tx reporter support")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
index b91814ecfbc9..c7f86453c638 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
@@ -76,9 +76,6 @@ static int mlx5e_tx_reporter_err_cqe_recover(struct mlx5e_txqsq *sq)
 	u8 state;
 	int err;
 
-	if (!test_bit(MLX5E_SQ_STATE_RECOVERING, &sq->state))
-		return 0;
-
 	err = mlx5_core_query_sq_state(mdev, sq->sqn, &state);
 	if (err) {
 		netdev_err(dev, "Failed to query SQ 0x%x state. err = %d\n",
-- 
2.21.0
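
The reason the check is redundant can be sketched as follows, assuming the error-CQE path gates the reporter with an atomic test-and-set of the recovery bit. on_error_cqe() and recovery_done() are hypothetical names, and C11 atomic_flag stands in for the kernel's bit operations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct mock_sq {
	atomic_flag recovering; /* stand-in for the recovery bit */
	int reports;            /* how many times the reporter was invoked */
};

/* Called when an error CQE is seen. Only the caller that flips the flag
 * from clear to set invokes the reporter, so the recovery function can
 * rely on the flag being set when it runs.
 */
static bool on_error_cqe(struct mock_sq *sq)
{
	if (atomic_flag_test_and_set(&sq->recovering))
		return false; /* a recovery is already pending */
	sq->reports++;
	return true;
}

/* Called when recovery finishes, successfully or not. */
static void recovery_done(struct mock_sq *sq)
{
	atomic_flag_clear(&sq->recovering);
}
```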


^ permalink raw reply related

* Re: [PATCH net-next 3/3] net: phy: realtek: add support for the 2.5Gbps PHY in RTL8125
From: Heiner Kallweit @ 2019-08-08 20:24 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Florian Fainelli, David Miller, netdev@vger.kernel.org
In-Reply-To: <20190808202029.GN27917@lunn.ch>

On 08.08.2019 22:20, Andrew Lunn wrote:
>> I have a contact in Realtek who provided the information about
>> the vendor-specific registers used in the patch. I also asked for
>> a method to auto-detect 2.5Gbps support but have no feedback so far.
>> What may contribute to the problem is that also the integrated 1Gbps
>> PHY's (all with the same PHY ID) differ significantly from each other,
>> depending on the network chip version.
> 
> Hi Heiner
> 
> Some of the PHYs embedded in Marvell switches have an OUI, but no
> product ID. We work around this brokenness by trapping the reads to
> the ID registers in the MDIO bus controller driver and inserting the
> switch product ID. The Marvell PHY driver then recognises these IDs
> and does the right thing.
> 
> Maybe you can do something similar here?
> 
Yes, this would be an idea. Let me check.

>       Andrew
> 
Thanks, Heiner
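
The workaround Andrew describes can be sketched in isolation. This is a mock, not the Marvell driver: struct fake_bus, raw_read() and trapped_read() are hypothetical, and the 6-bit model field at bits 9:4 of PHY ID register 2 follows the standard Clause 22 identifier layout. The bus read is wrapped so that a PHY reporting an empty model field gets a product ID inserted before the PHY driver sees the value.

```c
#include <assert.h>
#include <stdint.h>

#define MII_PHYSID2        0x03   /* PHY identifier register 2 */
#define PHYSID2_MODEL_MASK 0x03f0 /* model number, bits 9:4 */

struct fake_bus {
	uint16_t regs[32];    /* register file of one mock PHY */
	uint16_t product_id;  /* ID of the enclosing chip, known to the bus driver */
};

static uint16_t raw_read(const struct fake_bus *bus, int reg)
{
	return bus->regs[reg & 0x1f];
}

/* Trap reads of the ID register: if the PHY left the model field empty,
 * fill it from the chip's product ID so a PHY driver can match on it.
 */
static uint16_t trapped_read(const struct fake_bus *bus, int reg)
{
	uint16_t val = raw_read(bus, reg);

	if (reg == MII_PHYSID2 && !(val & PHYSID2_MODEL_MASK))
		val |= (uint16_t)((bus->product_id << 4) & PHYSID2_MODEL_MASK);
	return val;
}
```

A PHY that already reports a model number is passed through unchanged, so only the broken IDs are rewritten.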


^ permalink raw reply

* Re: [v3,4/4] tools: bpftool: add documentation for net attach/detach
From: Daniel T. Lee @ 2019-08-08 20:28 UTC (permalink / raw)
  To: Quentin Monnet; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <1cc16243-ad5a-87f3-7727-31a58599bf04@netronome.com>

On Fri, Aug 9, 2019 at 1:48 AM Quentin Monnet
<quentin.monnet@netronome.com> wrote:
>
> 2019-08-07 11:25 UTC+0900 ~ Daniel T. Lee <danieltimlee@gmail.com>
> > Since a new sub-command 'net attach/detach' has been added for
> > attaching an XDP program to an interface,
> > this commit documents the usage and sample output of `net attach/detach`.
> >
> > Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
> > ---
> >  .../bpf/bpftool/Documentation/bpftool-net.rst | 51 +++++++++++++++++--
> >  1 file changed, 48 insertions(+), 3 deletions(-)
> >
> > diff --git a/tools/bpf/bpftool/Documentation/bpftool-net.rst b/tools/bpf/bpftool/Documentation/bpftool-net.rst
> > index d8e5237a2085..4ad1a380e186 100644
> > --- a/tools/bpf/bpftool/Documentation/bpftool-net.rst
> > +++ b/tools/bpf/bpftool/Documentation/bpftool-net.rst
> > @@ -15,17 +15,22 @@ SYNOPSIS
> >       *OPTIONS* := { [{ **-j** | **--json** }] [{ **-p** | **--pretty** }] }
> >
> >       *COMMANDS* :=
> > -     { **show** | **list** } [ **dev** name ] | **help**
> > +     { **show** | **list** | **attach** | **detach** | **help** }
> >
> >  NET COMMANDS
> >  ============
> >
> > -|    **bpftool** **net { show | list } [ dev name ]**
> > +|    **bpftool** **net { show | list }** [ **dev** *name* ]
> > +|    **bpftool** **net attach** *ATTACH_TYPE* *PROG* **dev** *name* [ **overwrite** ]
> > +|    **bpftool** **net detach** *ATTACH_TYPE* **dev** *name*
>
> Nit: Could we have "name" in capital letters (everywhere in the file),
> to make this file consistent with the formatting used for
> bpftool-prog.rst and bpftool-map.rst?
>

I'll change every "name" to capital "NAME" in the next version of the patch.

> >  |    **bpftool** **net help**
> > +|
> > +|    *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* }
> > +|    *ATTACH_TYPE* := { **xdp** | **xdpgeneric** | **xdpdrv** | **xdpoffload** }
> >
> >  DESCRIPTION
> >  ===========
> > -     **bpftool net { show | list } [ dev name ]**
> > +     **bpftool net { show | list }** [ **dev** *name* ]
> >                    List bpf program attachments in the kernel networking subsystem.
> >
> >                    Currently, only device driver xdp attachments and tc filter
> > @@ -47,6 +52,18 @@ DESCRIPTION
> >                    all bpf programs attached to non clsact qdiscs, and finally all
> >                    bpf programs attached to root and clsact qdisc.
> >
> > +     **bpftool** **net attach** *ATTACH_TYPE* *PROG* **dev** *name* [ **overwrite** ]
> > +                  Attach bpf program *PROG* to network interface *name* with
> > +                  type specified by *ATTACH_TYPE*. Previously attached bpf program
> > +                  can be replaced by the command used with **overwrite** option.
> > +                  Currently, *ATTACH_TYPE* only contains XDP programs.
>
> Other nit: "ATTACH_TYPE only contains XDP programs" sounds odd to me.
> Could we maybe phrase this something like: "Currently, only XDP-related
> modes are supported for ATTACH_TYPE"?
>
> Also, could you please provide a brief description of the different
> attach types? In particular, explaining what "xdp" alone stands for
> might be useful.
>

I'll replace the phrase and add a brief description of the attach types.

> Thanks,
> Quentin
>
> > +
> > +     **bpftool** **net detach** *ATTACH_TYPE* **dev** *name*
> > +                  Detach bpf program attached to network interface *name* with
> > +                  type specified by *ATTACH_TYPE*. To detach bpf program, same
> > +                  *ATTACH_TYPE* previously used for attach must be specified.
> > +                  Currently, *ATTACH_TYPE* only contains XDP programs.

Thank you for taking the time to review.

^ permalink raw reply

* Re: [PATCH v5 bpf-next] BPF: helpers: New helper to obtain namespace data from current task
From: Carlos Antonio Neira Bustos @ 2019-08-08 20:32 UTC (permalink / raw)
  To: Y Song
  Cc: Yonghong Song, netdev@vger.kernel.org, ebiederm@xmission.com,
	brouer@redhat.com, quentin.monnet@netronome.com
In-Reply-To: <CAH3MdRUiQJ4e4rRAE4WrbzG8LWvnuDC4J-UYQc1wRA7AEN=7+g@mail.gmail.com>

Hi Yonghong,

Sorry, just to be sure: the only thing I'm missing is the error codes from
filename_lookup()? I'll work on that.

Bests
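
For reference, the filtering intent behind the helper in the quoted patch can be written as a plain predicate. This is a user-space sketch, not BPF code: struct pidns_info mirrors the fields of the proposed struct bpf_pidns_info, and matches_target() is a hypothetical name. An event belongs to the process of interest only when both the tgid and the namespace id match, so it should be skipped when either one differs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the fields of the proposed struct bpf_pidns_info. */
struct pidns_info {
	uint32_t dev;
	uint32_t nsid;
	uint32_t tgid;
	uint32_t pid;
};

/* Keep an event only if it comes from the wanted tgid inside the wanted
 * PID namespace; the same tgid number in another namespace is a
 * different process.
 */
static bool matches_target(const struct pidns_info *info,
			   uint32_t want_tgid, uint32_t want_nsid)
{
	return info->tgid == want_tgid && info->nsid == want_nsid;
}
```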

> > > Maybe reword by following the code sequence.
> > >     if *size_of_pidns* is not valid or unable to get ns, pid or tgid of
> > >     the current task.
> > >
> > > > + *
> > > > + *         **-ENOMEM**  if allocation fails.
> > >
> > > Maybe some other error codes in filename_lookup() function?
> > >
> > > > + *
> > > > + *         If unable to get the inode from /proc/self/ns/pid an error code
> > > > + *         will be returned.
> > >
> > > You do not need this. The description of error code cases should cover this.
>

On Thu, Aug 08, 2019 at 12:44:22PM -0700, Y Song wrote:
> On Thu, Aug 8, 2019 at 10:52 AM Carlos Antonio Neira Bustos
> <cneirabustos@gmail.com> wrote:
> >
> > Yonghong,
> >
> > I have modified the patch following your feedback.
> > Let me know if I'm missing something.
> 
> Yes, I have some other requests about formatting.
> https://lore.kernel.org/netdev/20190808174848.poybtaagg5ctle7t@dev00/T/#t
> Could you address it as well?
> 
> >
> > Bests
> >
> > From 70f8d5584700c9cfc82c006901d8ee9595c53f15 Mon Sep 17 00:00:00 2001
> > From: Carlos <cneirabustos@gmail.com>
> > Date: Wed, 7 Aug 2019 20:04:30 -0400
> > Subject: [PATCH] [PATCH v6 bpf-next] BPF: New helper to obtain namespace data
> >  from current task
> >
> > This helper obtains the active namespace from current and returns pid, tgid,
> > device and namespace id as seen from that namespace, allowing to instrument
> > a process inside a container.
> > Device is read from /proc/self/ns/pid, as in the future it's possible that
> > different pid_ns files may belong to different devices, according
> > to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
> > conference.
> > Currently bpf_get_current_pid_tgid() is used to do pid filtering in bcc's
> > scripts, but this helper returns the pid as seen by the root namespace, which is
> > fine when a bcc script is not executed inside a container.
> > When the process of interest is inside a container, pid filtering will not work
> > if bpf_get_current_pid_tgid() is used. This helper addresses this limitation by
> > returning the pid as it's seen by the current namespace where the script is
> > executing.
> >
> > This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
> > used to do pid filtering even inside a container.
> >
> > For example a bcc script using bpf_get_current_pid_tgid() (tools/funccount.py):
> >
> >         u32 pid = bpf_get_current_pid_tgid() >> 32;
> >         if (pid != <pid_arg_passed_in>)
> >                 return 0;
> > Could be modified to use bpf_get_current_pidns_info() as follows:
> >
> >         struct bpf_pidns_info pidns;
> >         bpf_get_current_pidns_info(&pidns, sizeof(struct bpf_pidns_info));
> >         u32 pid = pidns.tgid;
> >         u32 nsid = pidns.nsid;
> >         if ((pid != <pid_arg_passed_in>) || (nsid != <nsid_arg_passed_in>))
> >                 return 0;
> >
> > To find out the PID namespace id of a process, you could use this command:
> >
> > $ ps -h -o pidns -p <pid_of_interest>
> >
> > Or this other command:
> >
> > $ ls -Li /proc/<pid_of_interest>/ns/pid
> >
> > Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
> > ---
> >  fs/internal.h                                      |   2 -
> >  fs/namei.c                                         |   1 -
> >  include/linux/bpf.h                                |   1 +
> >  include/linux/namei.h                              |   4 +
> >  include/uapi/linux/bpf.h                           |  27 +++-
> >  kernel/bpf/core.c                                  |   1 +
> >  kernel/bpf/helpers.c                               |  64 ++++++++++
> >  kernel/trace/bpf_trace.c                           |   2 +
> >  samples/bpf/Makefile                               |   3 +
> >  samples/bpf/trace_ns_info_user.c                   |  35 ++++++
> >  samples/bpf/trace_ns_info_user_kern.c              |  44 +++++++
> >  tools/include/uapi/linux/bpf.h                     |  27 +++-
> >  tools/testing/selftests/bpf/Makefile               |   2 +-
> >  tools/testing/selftests/bpf/bpf_helpers.h          |   3 +
> >  .../testing/selftests/bpf/progs/test_pidns_kern.c  |  51 ++++++++
> >  tools/testing/selftests/bpf/test_pidns.c           | 138 +++++++++++++++++++++
> >  16 files changed, 399 insertions(+), 6 deletions(-)
> >  create mode 100644 samples/bpf/trace_ns_info_user.c
> >  create mode 100644 samples/bpf/trace_ns_info_user_kern.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/test_pidns_kern.c
> >  create mode 100644 tools/testing/selftests/bpf/test_pidns.c
> >
> > diff --git a/fs/internal.h b/fs/internal.h
> > index 315fcd8d237c..6647e15dd419 100644
> > --- a/fs/internal.h
> > +++ b/fs/internal.h
> > @@ -59,8 +59,6 @@ extern int finish_clean_context(struct fs_context *fc);
> >  /*
> >   * namei.c
> >   */
> > -extern int filename_lookup(int dfd, struct filename *name, unsigned flags,
> > -                          struct path *path, struct path *root);
> >  extern int user_path_mountpoint_at(int, const char __user *, unsigned int, struct path *);
> >  extern int vfs_path_lookup(struct dentry *, struct vfsmount *,
> >                            const char *, unsigned int, struct path *);
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 209c51a5226c..a89fc72a4a10 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -19,7 +19,6 @@
> >  #include <linux/export.h>
> >  #include <linux/kernel.h>
> >  #include <linux/slab.h>
> > -#include <linux/fs.h>
> >  #include <linux/namei.h>
> >  #include <linux/pagemap.h>
> >  #include <linux/fsnotify.h>
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index f9a506147c8a..e4adf5e05afd 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1050,6 +1050,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
> >  extern const struct bpf_func_proto bpf_strtol_proto;
> >  extern const struct bpf_func_proto bpf_strtoul_proto;
> >  extern const struct bpf_func_proto bpf_tcp_sock_proto;
> > +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
> >
> >  /* Shared helpers among cBPF and eBPF. */
> >  void bpf_user_rnd_init_once(void);
> > diff --git a/include/linux/namei.h b/include/linux/namei.h
> > index 9138b4471dbf..b45c8b6f7cb4 100644
> > --- a/include/linux/namei.h
> > +++ b/include/linux/namei.h
> > @@ -6,6 +6,7 @@
> >  #include <linux/path.h>
> >  #include <linux/fcntl.h>
> >  #include <linux/errno.h>
> > +#include <linux/fs.h>
> >
> >  enum { MAX_NESTED_LINKS = 8 };
> >
> > @@ -97,6 +98,9 @@ extern void unlock_rename(struct dentry *, struct dentry *);
> >
> >  extern void nd_jump_link(struct path *path);
> >
> > +extern int filename_lookup(int dfd, struct filename *name, unsigned flags,
> > +                          struct path *path, struct path *root);
> > +
> >  static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
> >  {
> >         ((char *) name)[min(len, maxlen)] = '\0';
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 4393bd4b2419..b0d4869fb860 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -2741,6 +2741,24 @@ union bpf_attr {
> >   *             **-EOPNOTSUPP** kernel configuration does not enable SYN cookies
> >   *
> >   *             **-EPROTONOSUPPORT** IP packet version is not 4 or 6
> > + *
> > + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
> > + *     Description
> > + *             Copies into *pidns* pid, namespace id and tgid as seen by the
> > + *             current namespace and also device from /proc/self/ns/pid.
> > + *             *size_of_pidns* must be the size of *pidns*
> > + *
> > + *             This helper is used when pid filtering is needed inside a
> > + *             container, as the bpf_get_current_pid_tgid() helper always
> > + *             returns the pid as seen by the root namespace.
> > + *     Return
> > + *             0 on success
> > + *
> > + *             **-EINVAL** if *size_of_pidns* is not valid or unable to get ns, pid
> > + *             or tgid of the current task.
> > + *
> > + *             **-ENOMEM**  if allocation fails.
> > + *
> >   */
> >  #define __BPF_FUNC_MAPPER(FN)          \
> >         FN(unspec),                     \
> > @@ -2853,7 +2871,8 @@ union bpf_attr {
> >         FN(sk_storage_get),             \
> >         FN(sk_storage_delete),          \
> >         FN(send_signal),                \
> > -       FN(tcp_gen_syncookie),
> > +       FN(tcp_gen_syncookie),          \
> > +       FN(get_current_pidns_info),
> >
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> >   * function eBPF program intends to call
> > @@ -3604,4 +3623,10 @@ struct bpf_sockopt {
> >         __s32   retval;
> >  };
> >
> > +struct bpf_pidns_info {
> > +       __u32 dev;
> > +       __u32 nsid;
> > +       __u32 tgid;
> > +       __u32 pid;
> > +};
> >  #endif /* _UAPI__LINUX_BPF_H__ */
> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > index 8191a7db2777..3159f2a0188c 100644
> > --- a/kernel/bpf/core.c
> > +++ b/kernel/bpf/core.c
> > @@ -2038,6 +2038,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
> >  const struct bpf_func_proto bpf_get_current_comm_proto __weak;
> >  const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
> >  const struct bpf_func_proto bpf_get_local_storage_proto __weak;
> > +const struct bpf_func_proto bpf_get_current_pidns_info_proto __weak;
> >
> >  const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
> >  {
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index 5e28718928ca..41fbf1f28a48 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -11,6 +11,12 @@
> >  #include <linux/uidgid.h>
> >  #include <linux/filter.h>
> >  #include <linux/ctype.h>
> > +#include <linux/pid_namespace.h>
> > +#include <linux/major.h>
> > +#include <linux/stat.h>
> > +#include <linux/namei.h>
> > +#include <linux/version.h>
> > +
> >
> >  #include "../../lib/kstrtox.h"
> >
> > @@ -312,6 +318,64 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
> >         preempt_enable();
> >  }
> >
> > +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
> > +        size)
> > +{
> > +       const char *pidns_path = "/proc/self/ns/pid";
> > +       struct pid_namespace *pidns = NULL;
> > +       struct filename *tmp = NULL;
> > +       struct inode *inode;
> > +       struct path kp;
> > +       pid_t tgid = 0;
> > +       pid_t pid = 0;
> > +       int ret;
> > +       int len;
> > +
> > +       if (unlikely(size != sizeof(struct bpf_pidns_info)))
> > +               return -EINVAL;
> > +       pidns = task_active_pid_ns(current);
> > +       if (unlikely(!pidns))
> > +               goto clear;
> > +       pidns_info->nsid =  pidns->ns.inum;
> > +       pid = task_pid_nr_ns(current, pidns);
> > +       if (unlikely(!pid))
> > +               goto clear;
> > +       tgid = task_tgid_nr_ns(current, pidns);
> > +       if (unlikely(!tgid))
> > +               goto clear;
> > +       pidns_info->tgid = (u32) tgid;
> > +       pidns_info->pid = (u32) pid;
> > +       tmp = kmem_cache_alloc(names_cachep, GFP_ATOMIC);
> > +       if (unlikely(!tmp)) {
> > +               memset((void *)pidns_info, 0, (size_t) size);
> > +               return -ENOMEM;
> > +       }
> > +       len = strlen(pidns_path) + 1;
> > +       memcpy((char *)tmp->name, pidns_path, len);
> > +       tmp->uptr = NULL;
> > +       tmp->aname = NULL;
> > +       tmp->refcnt = 1;
> > +       ret = filename_lookup(AT_FDCWD, tmp, 0, &kp, NULL);
> > +       if (ret) {
> > +               memset((void *)pidns_info, 0, (size_t) size);
> > +               return ret;
> > +       }
> > +       inode = d_backing_inode(kp.dentry);
> > +       pidns_info->dev = inode->i_sb->s_dev;
> > +       return 0;
> > +clear:
> > +       memset((void *)pidns_info, 0, (size_t) size);
> > +       return -EINVAL;
> > +}
> > +
> > +const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
> > +       .func           = bpf_get_current_pidns_info,
> > +       .gpl_only       = false,
> > +       .ret_type       = RET_INTEGER,
> > +       .arg1_type      = ARG_PTR_TO_UNINIT_MEM,
> > +       .arg2_type      = ARG_CONST_SIZE,
> > +};
> > +
> >  #ifdef CONFIG_CGROUPS
> >  BPF_CALL_0(bpf_get_current_cgroup_id)
> >  {
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index ca1255d14576..5e1dc22765a5 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -709,6 +709,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> >  #endif
> >         case BPF_FUNC_send_signal:
> >                 return &bpf_send_signal_proto;
> > +       case BPF_FUNC_get_current_pidns_info:
> > +               return &bpf_get_current_pidns_info_proto;
> >         default:
> >                 return NULL;
> >         }
> > diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> > index 1d9be26b4edd..238453ff27d2 100644
> > --- a/samples/bpf/Makefile
> > +++ b/samples/bpf/Makefile
> > @@ -53,6 +53,7 @@ hostprogs-y += task_fd_query
> >  hostprogs-y += xdp_sample_pkts
> >  hostprogs-y += ibumad
> >  hostprogs-y += hbm
> > +hostprogs-y += trace_ns_info
> >
> >  # Libbpf dependencies
> >  LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
> > @@ -109,6 +110,7 @@ task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
> >  xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
> >  ibumad-objs := bpf_load.o ibumad_user.o $(TRACE_HELPERS)
> >  hbm-objs := bpf_load.o hbm.o $(CGROUP_HELPERS)
> > +trace_ns_info-objs := bpf_load.o trace_ns_info_user.o
> >
> >  # Tell kbuild to always build the programs
> >  always := $(hostprogs-y)
> > @@ -170,6 +172,7 @@ always += xdp_sample_pkts_kern.o
> >  always += ibumad_kern.o
> >  always += hbm_out_kern.o
> >  always += hbm_edt_kern.o
> > +always += trace_ns_info_user_kern.o
> >
> >  KBUILD_HOSTCFLAGS += -I$(objtree)/usr/include
> >  KBUILD_HOSTCFLAGS += -I$(srctree)/tools/lib/bpf/
> > diff --git a/samples/bpf/trace_ns_info_user.c b/samples/bpf/trace_ns_info_user.c
> > new file mode 100644
> > index 000000000000..e06d08db6f30
> > --- /dev/null
> > +++ b/samples/bpf/trace_ns_info_user.c
> > @@ -0,0 +1,35 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of version 2 of the GNU General Public
> > + * License as published by the Free Software Foundation.
> > + */
> > +
> > +#include <stdio.h>
> > +#include <linux/bpf.h>
> > +#include <unistd.h>
> > +#include "bpf/libbpf.h"
> > +#include "bpf_load.h"
> > +
> > +/* This code was taken verbatim from tracex1_user.c, it's used
> > + * to exercise the bpf_get_current_pidns_info() helper call.
> > + */
> > +int main(int ac, char **argv)
> > +{
> > +       FILE *f;
> > +       char filename[256];
> > +
> > +       snprintf(filename, sizeof(filename), "%s_user_kern.o", argv[0]);
> > +       printf("loading %s\n", filename);
> > +
> > +       if (load_bpf_file(filename)) {
> > +               printf("%s", bpf_log_buf);
> > +               return 1;
> > +       }
> > +
> > +       f = popen("taskset 1 ping  localhost", "r");
> > +       (void) f;
> > +       read_trace_pipe();
> > +       return 0;
> > +}
> > diff --git a/samples/bpf/trace_ns_info_user_kern.c b/samples/bpf/trace_ns_info_user_kern.c
> > new file mode 100644
> > index 000000000000..96675e02b707
> > --- /dev/null
> > +++ b/samples/bpf/trace_ns_info_user_kern.c
> > @@ -0,0 +1,44 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of version 2 of the GNU General Public
> > + * License as published by the Free Software Foundation.
> > + */
> > +#include <linux/skbuff.h>
> > +#include <linux/netdevice.h>
> > +#include <linux/version.h>
> > +#include <uapi/linux/bpf.h>
> > +#include "bpf_helpers.h"
> > +
> > +typedef __u64 u64;
> > +typedef __u32 u32;
> > +
> > +
> > +/* kprobe is NOT a stable ABI
> > + * kernel functions can be removed, renamed or completely change semantics.
> > + * Number of arguments and their positions can change, etc.
> > + * In such case this bpf+kprobe example will no longer be meaningful
> > + */
> > +
> > +/* This will call bpf_get_current_pidns_info() to display pid and ns values
> > + * as seen by the current namespace; on the far left you will see the pid as
> > + * seen by the root namespace.
> > + */
> > +
> > +SEC("kprobe/__netif_receive_skb_core")
> > +int bpf_prog1(struct pt_regs *ctx)
> > +{
> > +       char fmt[] = "nsid:%u, dev: %u,  pid:%u\n";
> > +       struct bpf_pidns_info nsinfo;
> > +       int ok = 0;
> > +
> > +       ok = bpf_get_current_pidns_info(&nsinfo, sizeof(nsinfo));
> > +       if (ok == 0)
> > +               bpf_trace_printk(fmt, sizeof(fmt), (u32)nsinfo.nsid,
> > +                                (u32) nsinfo.dev, (u32)nsinfo.pid);
> > +
> > +       return 0;
> > +}
> > +char _license[] SEC("license") = "GPL";
> > +u32 _version SEC("version") = LINUX_VERSION_CODE;
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 4393bd4b2419..b0d4869fb860 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -2741,6 +2741,24 @@ union bpf_attr {
> >   *             **-EOPNOTSUPP** kernel configuration does not enable SYN cookies
> >   *
> >   *             **-EPROTONOSUPPORT** IP packet version is not 4 or 6
> > + *
> > + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
> > + *     Description
> > + *             Copies into *pidns* pid, namespace id and tgid as seen by the
> > + *             current namespace and also device from /proc/self/ns/pid.
> > + *             *size_of_pidns* must be the size of *pidns*
> > + *
> > + *             This helper is used when pid filtering is needed inside a
> > + *             container as bpf_get_current_tgid() helper returns always the
> > + *             pid id as seen by the root namespace.
> > + *     Return
> > + *             0 on success
> > + *
> > + *             **-EINVAL** if *size_of_pidns* is not valid or unable to get ns, pid
> > + *             or tgid of the current task.
> > + *
> > + *             **-ENOMEM**  if allocation fails.
> > + *
> >   */
> >  #define __BPF_FUNC_MAPPER(FN)          \
> >         FN(unspec),                     \
> > @@ -2853,7 +2871,8 @@ union bpf_attr {
> >         FN(sk_storage_get),             \
> >         FN(sk_storage_delete),          \
> >         FN(send_signal),                \
> > -       FN(tcp_gen_syncookie),
> > +       FN(tcp_gen_syncookie),          \
> > +       FN(get_current_pidns_info),
> >
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> >   * function eBPF program intends to call
> > @@ -3604,4 +3623,10 @@ struct bpf_sockopt {
> >         __s32   retval;
> >  };
> >
> > +struct bpf_pidns_info {
> > +       __u32 dev;
> > +       __u32 nsid;
> > +       __u32 tgid;
> > +       __u32 pid;
> > +};
> >  #endif /* _UAPI__LINUX_BPF_H__ */
> > diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> > index 3bd0f4a0336a..1f97b571b581 100644
> > --- a/tools/testing/selftests/bpf/Makefile
> > +++ b/tools/testing/selftests/bpf/Makefile
> > @@ -29,7 +29,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
> >         test_cgroup_storage test_select_reuseport test_section_names \
> >         test_netcnt test_tcpnotify_user test_sock_fields test_sysctl test_hashmap \
> >         test_btf_dump test_cgroup_attach xdping test_sockopt test_sockopt_sk \
> > -       test_sockopt_multi test_tcp_rtt
> > +       test_sockopt_multi test_tcp_rtt test_pidns
> >
> >  BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
> >  TEST_GEN_FILES = $(BPF_OBJ_FILES)
> > diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
> > index 120aa86c58d3..c96795a9d983 100644
> > --- a/tools/testing/selftests/bpf/bpf_helpers.h
> > +++ b/tools/testing/selftests/bpf/bpf_helpers.h
> > @@ -231,6 +231,9 @@ static int (*bpf_send_signal)(unsigned sig) = (void *)BPF_FUNC_send_signal;
> >  static long long (*bpf_tcp_gen_syncookie)(struct bpf_sock *sk, void *ip,
> >                                           int ip_len, void *tcp, int tcp_len) =
> >         (void *) BPF_FUNC_tcp_gen_syncookie;
> > +static int (*bpf_get_current_pidns_info)(struct bpf_pidns_info *buf,
> > +                                        unsigned int buf_size) =
> > +       (void *) BPF_FUNC_get_current_pidns_info;
> >
> >  /* llvm builtin functions that eBPF C program may use to
> >   * emit BPF_LD_ABS and BPF_LD_IND instructions
> > diff --git a/tools/testing/selftests/bpf/progs/test_pidns_kern.c b/tools/testing/selftests/bpf/progs/test_pidns_kern.c
> > new file mode 100644
> > index 000000000000..e1d2facfa762
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/test_pidns_kern.c
> > @@ -0,0 +1,51 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of version 2 of the GNU General Public
> > + * License as published by the Free Software Foundation.
> > + */
> > +
> > +#include <linux/bpf.h>
> > +#include <errno.h>
> > +#include "bpf_helpers.h"
> > +
> > +struct bpf_map_def SEC("maps") nsidmap = {
> > +       .type = BPF_MAP_TYPE_ARRAY,
> > +       .key_size = sizeof(__u32),
> > +       .value_size = sizeof(__u32),
> > +       .max_entries = 1,
> > +};
> > +
> > +struct bpf_map_def SEC("maps") pidmap = {
> > +       .type = BPF_MAP_TYPE_ARRAY,
> > +       .key_size = sizeof(__u32),
> > +       .value_size = sizeof(__u32),
> > +       .max_entries = 1,
> > +};
> > +
> > +SEC("tracepoint/syscalls/sys_enter_nanosleep")
> > +int trace(void *ctx)
> > +{
> > +       struct bpf_pidns_info nsinfo;
> > +       __u32 key = 0, *expected_pid, *val;
> > +       char fmt[] = "ERROR nspid:%d\n";
> > +
> > +       if (bpf_get_current_pidns_info(&nsinfo, sizeof(nsinfo)))
> > +               return -EINVAL;
> > +
> > +       expected_pid = bpf_map_lookup_elem(&pidmap, &key);
> > +
> > +
> > +       if (!expected_pid || *expected_pid != nsinfo.pid)
> > +               return 0;
> > +
> > +       val = bpf_map_lookup_elem(&nsidmap, &key);
> > +       if (val)
> > +               *val = nsinfo.nsid;
> > +
> > +       return 0;
> > +}
> > +
> > +char _license[] SEC("license") = "GPL";
> > +__u32 _version SEC("version") = 1;
> > diff --git a/tools/testing/selftests/bpf/test_pidns.c b/tools/testing/selftests/bpf/test_pidns.c
> > new file mode 100644
> > index 000000000000..a7254055f294
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/test_pidns.c
> > @@ -0,0 +1,138 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of version 2 of the GNU General Public
> > + * License as published by the Free Software Foundation.
> > + */
> > +
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <errno.h>
> > +#include <fcntl.h>
> > +#include <syscall.h>
> > +#include <unistd.h>
> > +#include <linux/perf_event.h>
> > +#include <sys/ioctl.h>
> > +#include <sys/time.h>
> > +#include <sys/types.h>
> > +#include <sys/stat.h>
> > +
> > +#include <linux/bpf.h>
> > +#include <bpf/bpf.h>
> > +#include <bpf/libbpf.h>
> > +
> > +#include "cgroup_helpers.h"
> > +#include "bpf_rlimit.h"
> > +
> > +#define CHECK(condition, tag, format...) ({            \
> > +       int __ret = !!(condition);                      \
> > +       if (__ret) {                                    \
> > +               printf("%s:FAIL:%s ", __func__, tag);   \
> > +               printf(format);                         \
> > +       } else {                                        \
> > +               printf("%s:PASS:%s\n", __func__, tag);  \
> > +       }                                               \
> > +       __ret;                                          \
> > +})
> > +
> > +static int bpf_find_map(const char *test, struct bpf_object *obj,
> > +                       const char *name)
> > +{
> > +       struct bpf_map *map;
> > +
> > +       map = bpf_object__find_map_by_name(obj, name);
> > +       if (!map)
> > +               return -1;
> > +       return bpf_map__fd(map);
> > +}
> > +
> > +
> > +int main(int argc, char **argv)
> > +{
> > +       const char *probe_name = "syscalls/sys_enter_nanosleep";
> > +       const char *file = "test_pidns_kern.o";
> > +       int err, bytes, efd, prog_fd, pmu_fd;
> > +       int pidmap_fd, nsidmap_fd;
> > +       struct perf_event_attr attr = {};
> > +       struct bpf_object *obj;
> > +       __u32 knsid = 0;
> > +       __u32 key = 0, pid;
> > +       int exit_code = 1;
> > +       struct stat st;
> > +       char buf[256];
> > +
> > +       err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, &prog_fd);
> > +       if (CHECK(err, "bpf_prog_load", "err %d errno %d\n", err, errno))
> > +               goto cleanup_cgroup_env;
> > +
> > +       nsidmap_fd = bpf_find_map(__func__, obj, "nsidmap");
> > +       if (CHECK(nsidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
> > +                 nsidmap_fd, errno))
> > +               goto close_prog;
> > +
> > +       pidmap_fd = bpf_find_map(__func__, obj, "pidmap");
> > +       if (CHECK(pidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
> > +                 pidmap_fd, errno))
> > +               goto close_prog;
> > +
> > +       pid = getpid();
> > +       bpf_map_update_elem(pidmap_fd, &key, &pid, 0);
> > +
> > +       snprintf(buf, sizeof(buf),
> > +                "/sys/kernel/debug/tracing/events/%s/id", probe_name);
> > +       efd = open(buf, O_RDONLY, 0);
> > +       if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
> > +               goto close_prog;
> > +       bytes = read(efd, buf, sizeof(buf));
> > +       close(efd);
> > +       if (CHECK(bytes <= 0 || bytes >= sizeof(buf), "read",
> > +                 "bytes %d errno %d\n", bytes, errno))
> > +               goto close_prog;
> > +
> > +       attr.config = strtol(buf, NULL, 0);
> > +       attr.type = PERF_TYPE_TRACEPOINT;
> > +       attr.sample_type = PERF_SAMPLE_RAW;
> > +       attr.sample_period = 1;
> > +       attr.wakeup_events = 1;
> > +
> > +       pmu_fd = syscall(__NR_perf_event_open, &attr, getpid(), -1, -1, 0);
> > +       if (CHECK(pmu_fd < 0, "perf_event_open", "err %d errno %d\n", pmu_fd,
> > +                 errno))
> > +               goto close_prog;
> > +
> > +       err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
> > +       if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n", err,
> > +                 errno))
> > +               goto close_pmu;
> > +
> > +       err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
> > +       if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n", err,
> > +                 errno))
> > +               goto close_pmu;
> > +
> > +       /* trigger some syscalls */
> > +       sleep(1);
> > +
> > +       err = bpf_map_lookup_elem(nsidmap_fd, &key, &knsid);
> > +       if (CHECK(err, "bpf_map_lookup_elem", "err %d errno %d\n", err, errno))
> > +               goto close_pmu;
> > +
> > +       if (stat("/proc/self/ns/pid", &st))
> > +               goto close_pmu;
> > +
> > +       if (CHECK(knsid != (__u32) st.st_ino, "compare_namespace_id",
> > +                 "kern knsid %u user unsid %u\n", knsid, (__u32) st.st_ino))
> > +               goto close_pmu;
> > +
> > +       exit_code = 0;
> > +       printf("%s:PASS\n", argv[0]);
> > +
> > +close_pmu:
> > +       close(pmu_fd);
> > +close_prog:
> > +       bpf_object__close(obj);
> > +cleanup_cgroup_env:
> > +       return exit_code;
> > +}
> > --
> > 2.11.0
> >
> >
> >
> >
> >
> >
> > On Thu, Aug 08, 2019 at 05:09:51AM +0000, Yonghong Song wrote:
> > >
> > >
> > > On 8/7/19 6:22 PM, Carlos Antonio Neira Bustos wrote:
> > > > The code has been modified to avoid syscalls that could sleep.
> > > > Please let me know if any other modification is needed.
> > > >
> > > >  From be0384c0fa209a78c1567936e8db4e35b9a7c0f8 Mon Sep 17 00:00:00 2001
> > > > From: Carlos <cneirabustos@gmail.com>
> > > > Date: Wed, 7 Aug 2019 20:04:30 -0400
> > > > Subject: [PATCH] [PATCH v5 bpf-next] BPF: New helper to obtain namespace data
> > > >   from current task
> > > >
> > > > This helper obtains the active namespace from current and returns pid, tgid,
> > > > device and namespace id as seen from that namespace, allowing to instrument
> > > > a process inside a container.
> > > > Device is read from /proc/self/ns/pid, as in the future it's possible that
> > > > different pid_ns files may belong to different devices, according
> > > > to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
> > > > conference.
> > > > Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
> > > > scripts but this helper returns the pid as seen by the root namespace which is
> > > > fine when a bcc script is not executed inside a container.
> > > > When the process of interest is inside a container, pid filtering will not work
> > > > if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
> > > > returning the pid as it's seen by the current namespace where the script is
> > > > executing.
> > > >
> > > > This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
> > > > used to do pid filtering even inside a container.
> > > >
> > > > For example a bcc script using bpf_get_current_pid_tgid() (tools/funccount.py):
> > > >
> > > >          u32 pid = bpf_get_current_pid_tgid() >> 32;
> > > >          if (pid != <pid_arg_passed_in>)
> > > >                  return 0;
> > > > Could be modified to use bpf_get_current_pidns_info() as follows:
> > > >
> > > > >          struct bpf_pidns_info pidns;
> > > > >          bpf_get_current_pidns_info(&pidns, sizeof(struct bpf_pidns_info));
> > > > >          u32 pid = pidns.tgid;
> > > > >          u32 nsid = pidns.nsid;
> > > > >          if ((pid != <pid_arg_passed_in>) || (nsid != <nsid_arg_passed_in>))
> > > >                  return 0;
> > > >
> > > > > To find out the PID namespace id of a process, you could use this command:
> > > >
> > > > $ ps -h -o pidns -p <pid_of_interest>
> > > >
> > > > Or this other command:
> > > >
> > > > $ ls -Li /proc/<pid_of_interest>/ns/pid
> > > >
> > > > Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
> > > > ---
> > > >   fs/namei.c                                         |   2 +-
> > > >   include/linux/bpf.h                                |   1 +
> > > >   include/linux/namei.h                              |   4 +
> > > >   include/uapi/linux/bpf.h                           |  29 ++++-
> > > >   kernel/bpf/core.c                                  |   1 +
> > > >   kernel/bpf/helpers.c                               |  78 ++++++++++++
> > > >   kernel/trace/bpf_trace.c                           |   2 +
> > > >   samples/bpf/Makefile                               |   3 +
> > > >   samples/bpf/trace_ns_info_user.c                   |  35 ++++++
> > > >   samples/bpf/trace_ns_info_user_kern.c              |  44 +++++++
> > > >   tools/include/uapi/linux/bpf.h                     |  29 ++++-
> > > >   tools/testing/selftests/bpf/Makefile               |   2 +-
> > > >   tools/testing/selftests/bpf/bpf_helpers.h          |   3 +
> > > >   .../testing/selftests/bpf/progs/test_pidns_kern.c  |  51 ++++++++
> > > >   tools/testing/selftests/bpf/test_pidns.c           | 138 +++++++++++++++++++++
> > > >   15 files changed, 418 insertions(+), 4 deletions(-)
> > > >   create mode 100644 samples/bpf/trace_ns_info_user.c
> > > >   create mode 100644 samples/bpf/trace_ns_info_user_kern.c
> > > >   create mode 100644 tools/testing/selftests/bpf/progs/test_pidns_kern.c
> > > >   create mode 100644 tools/testing/selftests/bpf/test_pidns.c
> > > >
> > > > diff --git a/fs/namei.c b/fs/namei.c
> > > > index 209c51a5226c..d1eca36972d2 100644
> > > > --- a/fs/namei.c
> > > > +++ b/fs/namei.c
> > > > @@ -19,7 +19,6 @@
> > > >   #include <linux/export.h>
> > > >   #include <linux/kernel.h>
> > > >   #include <linux/slab.h>
> > > > -#include <linux/fs.h>
> > > >   #include <linux/namei.h>
> > > >   #include <linux/pagemap.h>
> > > >   #include <linux/fsnotify.h>
> > > > @@ -2355,6 +2354,7 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
> > > >     putname(name);
> > > >     return retval;
> > > >   }
> > > > +EXPORT_SYMBOL(filename_lookup);
> > >
> > > No need to export symbols. bpf uses it and bpf is in the core, not in
> > > modules.
> > >
> > > >
> > > >   /* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
> > > >   static int path_parentat(struct nameidata *nd, unsigned flags,
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index f9a506147c8a..e4adf5e05afd 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > > > @@ -1050,6 +1050,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
> > > >   extern const struct bpf_func_proto bpf_strtol_proto;
> > > >   extern const struct bpf_func_proto bpf_strtoul_proto;
> > > >   extern const struct bpf_func_proto bpf_tcp_sock_proto;
> > > > +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
> > > >
> > > >   /* Shared helpers among cBPF and eBPF. */
> > > >   void bpf_user_rnd_init_once(void);
> > > > diff --git a/include/linux/namei.h b/include/linux/namei.h
> > > > index 9138b4471dbf..2c24e8c71d46 100644
> > > > --- a/include/linux/namei.h
> > > > +++ b/include/linux/namei.h
> > > > @@ -6,6 +6,7 @@
> > > >   #include <linux/path.h>
> > > >   #include <linux/fcntl.h>
> > > >   #include <linux/errno.h>
> > > > +#include <linux/fs.h>
> > > >
> > > >   enum { MAX_NESTED_LINKS = 8 };
> > > >
> > > > @@ -97,6 +98,9 @@ extern void unlock_rename(struct dentry *, struct dentry *);
> > > >
> > > >   extern void nd_jump_link(struct path *path);
> > > >
> > > > +extern int filename_lookup(int dfd, struct filename *name, unsigned int flags,
> > > > +               struct path *path, struct path *root);
> > >
> > > The previous definition in fs/internal.h should be removed.
> > >
> > > > +
> > > >   static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
> > > >   {
> > > >     ((char *) name)[min(len, maxlen)] = '\0';
> > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > index 4393bd4b2419..6f601f7106e2 100644
> > > > --- a/include/uapi/linux/bpf.h
> > > > +++ b/include/uapi/linux/bpf.h
> > > > @@ -2741,6 +2741,26 @@ union bpf_attr {
> > > >    *                **-EOPNOTSUPP** kernel configuration does not enable SYN cookies
> > > >    *
> > > >    *                **-EPROTONOSUPPORT** IP packet version is not 4 or 6
> > > > + *
> > > > + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
> > > > + * Description
> > > > + *         Copies into *pidns* pid, namespace id and tgid as seen by the
> > > > + *         current namespace and also device from /proc/self/ns/pid.
> > > > + *         *size_of_pidns* must be the size of *pidns*
> > > > + *
> > > > + *         This helper is used when pid filtering is needed inside a
> > > > + *         container as bpf_get_current_tgid() helper returns always the
> > > > + *         pid id as seen by the root namespace.
> > > > + * Return
> > > > + *         0 on success
> > > > + *
> > > > + *         **-EINVAL**  if unable to get ns, pid or tgid of current task.
> > > > + *         Or if size_of_pidns is not valid.
> > >
> > > Maybe reword by following the code sequence.
> > >     if *size_of_pidns* is not valid or unable to get ns, pid or tgid of
> > >     the current task.
> > >
> > > > + *
> > > > + *         **-ENOMEM**  if allocation fails.
> > >
> > > Maybe some other error codes in filename_lookup() function?
> > >
> > > > + *
> > > > + *         If unable to get the inode from /proc/self/ns/pid an error code
> > > > + *         will be returned.
> > >
> > > You do not need this. The description of error code cases should cover this.
> > >
> > > >    */
> > > >   #define __BPF_FUNC_MAPPER(FN)             \
> > > >     FN(unspec),                     \
> > > > @@ -2853,7 +2873,8 @@ union bpf_attr {
> > > >     FN(sk_storage_get),             \
> > > >     FN(sk_storage_delete),          \
> > > >     FN(send_signal),                \
> > > > -   FN(tcp_gen_syncookie),
> > > > +   FN(tcp_gen_syncookie),          \
> > > > +   FN(get_current_pidns_info),
> > > >
> > > >   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > > >    * function eBPF program intends to call
> > > > @@ -3604,4 +3625,10 @@ struct bpf_sockopt {
> > > >     __s32   retval;
> > > >   };
> > > >
> > > > +struct bpf_pidns_info {
> > > > +   __u32 dev;
> > > > +   __u32 nsid;
> > > > +   __u32 tgid;
> > > > +   __u32 pid;
> > > > +};
> > > >   #endif /* _UAPI__LINUX_BPF_H__ */
> > > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > > > index 8191a7db2777..3159f2a0188c 100644
> > > > --- a/kernel/bpf/core.c
> > > > +++ b/kernel/bpf/core.c
> > > > @@ -2038,6 +2038,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
> > > >   const struct bpf_func_proto bpf_get_current_comm_proto __weak;
> > > >   const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
> > > >   const struct bpf_func_proto bpf_get_local_storage_proto __weak;
> > > > +const struct bpf_func_proto bpf_get_current_pidns_info __weak;
> > > >
> > > >   const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
> > > >   {
> > > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > > > index 5e28718928ca..571f24077db2 100644
> > > > --- a/kernel/bpf/helpers.c
> > > > +++ b/kernel/bpf/helpers.c
> > > > @@ -11,6 +11,12 @@
> > > >   #include <linux/uidgid.h>
> > > >   #include <linux/filter.h>
> > > >   #include <linux/ctype.h>
> > > > +#include <linux/pid_namespace.h>
> > > > +#include <linux/major.h>
> > > > +#include <linux/stat.h>
> > > > +#include <linux/namei.h>
> > > > +#include <linux/version.h>
> > > > +
> > > >
> > > >   #include "../../lib/kstrtox.h"
> > > >
> > > > @@ -312,6 +318,78 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
> > > >     preempt_enable();
> > > >   }
> > > >
> > > > +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
> > > > +    size)
> > > > +{
> > > > +   const char *name = "/proc/self/ns/pid";
> > >
> > > maybe rename this variable to pidns_path?
> > >
> > > > +   struct pid_namespace *pidns = NULL;
> > > > +   struct filename *tmp = NULL;
> > >
> > > Maybe rename this variable to name?
> > >
> > > > +   int len = strlen(name) + 1;
> > >
> > > We can delay this assignment later until it is needed.
> > >
> > > > +   struct inode *inode;
> > > > +   struct path kp;
> > > > +   pid_t tgid = 0;
> > > > +   pid_t pid = 0;
> > > > +   int ret;
> > > > +
> > > > +   if (unlikely(size != sizeof(struct bpf_pidns_info)))
> > > > +           return -EINVAL;
> > > > +
> > > > +   pidns = task_active_pid_ns(current);
> > > > +
> > >
> > > we can save an empty line here.
> > >
> > > > +   if (unlikely(!pidns))
> > > > +           goto clear;
> > > > +
> > > > +   pidns_info->nsid =  pidns->ns.inum;
> > > > +   pid = task_pid_nr_ns(current, pidns);
> > > > +
> > >
> > > We can save an empty line here.
> > >
> > > > +   if (unlikely(!pid))
> > > > +           goto clear;
> > > > +
> > > > +   tgid = task_tgid_nr_ns(current, pidns);
> > > > +
> > > ditto. save an empty line.
> > > > +   if (unlikely(!tgid))
> > > > +           goto clear;
> > > > +
> > > > +   pidns_info->tgid = (u32) tgid;
> > > > +   pidns_info->pid = (u32) pid;
> > > > +
> > > > +   tmp = kmem_cache_alloc(names_cachep, GFP_ATOMIC);
> > > > +   if (unlikely(!tmp)) {
> > > > +           memset((void *)pidns_info, 0, (size_t) size);
> > > > +           return -ENOMEM;
> > > > +   }
> > > > +
> > > > +   memcpy((char *)tmp->name, name, len);
> > > > +   tmp->uptr = NULL;
> > > > +   tmp->aname = NULL;
> > > > +   tmp->refcnt = 1;
> > > > +
> > > ditto. save an empty line.
> > > > +   ret = filename_lookup(AT_FDCWD, tmp, 0, &kp, NULL);
> > > > +
> > > ditto. save an empty line.
> > > > +   if (ret) {
> > > > +           memset((void *)pidns_info, 0, (size_t) size);
> > > > +           return ret;
> > > > +   }
> > > > +
> > > > +   inode = d_backing_inode(kp.dentry);
> > > > +   pidns_info->dev = inode->i_sb->s_dev;
> > > > +
> > > > +   return 0;
> > > > +
> > > > +clear:
> > > > +   memset((void *)pidns_info, 0, (size_t) size);
> > > > +
> > > save an empty line.
> > > > +   return -EINVAL;
> > > > +}
> > > > +
> > > > +const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
> > > > +   .func   = bpf_get_current_pidns_info,
> > > make the "= " aligned with others?
> > > > +   .gpl_only       = false,
> > > > +   .ret_type       = RET_INTEGER,
> > > > +   .arg1_type      = ARG_PTR_TO_UNINIT_MEM,
> > > > +   .arg2_type      = ARG_CONST_SIZE,
> > > > +};
> > > > +
> > > >   #ifdef CONFIG_CGROUPS
> > > >   BPF_CALL_0(bpf_get_current_cgroup_id)
> > > >   {
> > > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > > > index ca1255d14576..5e1dc22765a5 100644
> > > > --- a/kernel/trace/bpf_trace.c
> > > > +++ b/kernel/trace/bpf_trace.c
> > > > @@ -709,6 +709,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > > >   #endif
> > > >     case BPF_FUNC_send_signal:
> > > >             return &bpf_send_signal_proto;
> > > > +   case BPF_FUNC_get_current_pidns_info:
> > > > +           return &bpf_get_current_pidns_info_proto;
> > > >     default:
> > > >             return NULL;
> > > >     }
> > > > diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> > > > index 1d9be26b4edd..238453ff27d2 100644
> > > > --- a/samples/bpf/Makefile
> > > > +++ b/samples/bpf/Makefile
> > > > @@ -53,6 +53,7 @@ hostprogs-y += task_fd_query
> > > >   hostprogs-y += xdp_sample_pkts
> > > >   hostprogs-y += ibumad
> > > >   hostprogs-y += hbm
> > > > +hostprogs-y += trace_ns_info
> > > [...]

^ permalink raw reply

* Re: [PATCH net] net: phy: rtl8211f: do a double read to get real time link status
From: Andrew Lunn @ 2019-08-08 20:34 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: Yonglong Liu, davem, netdev, linux-kernel, linuxarm, salil.mehta,
	yisen.zhuang, shiju.jose
In-Reply-To: <26e2c5c9-915c-858b-d091-e5bfa7ab6a5b@gmail.com>

On Thu, Aug 08, 2019 at 10:01:39PM +0200, Heiner Kallweit wrote:
> On 08.08.2019 21:40, Andrew Lunn wrote:
> >> @@ -568,6 +568,11 @@ int phy_start_aneg(struct phy_device *phydev)
> >>  	if (err < 0)
> >>  		goto out_unlock;
> >>  
> >> +	/* The PHY may not yet have cleared aneg-completed and link-up bit
> >> +	 * w/o this delay when the following read is done.
> >> +	 */
> >> +	usleep_range(1000, 2000);
> >> +
> > 
> > Hi Heiner
> > 
> > Does 802.3 C22 say anything about this?
> > 
> C22 says:
> "The Auto-Negotiation process shall be restarted by setting bit 0.9 to a logic one. This bit is self-
> clearing, and a PHY shall return a value of one in bit 0.9 until the Auto-Negotiation process has been
> initiated."
> 
> Maybe we should read bit 0.9 in genphy_update_link() after having read BMSR and report
> aneg-complete and link-up as false (regardless of their current value) if 0.9 is set.

Yes. That sounds sensible.
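
A minimal sketch of that check, written as a standalone helper rather
than as a patch to genphy_update_link(). The helper name
mask_link_status() is hypothetical; the bit values follow the Clause 22
register layout and match BMCR_ANRESTART, BMSR_ANEGCOMPLETE and
BMSR_LSTATUS in linux/mii.h:

```c
#include <stdint.h>

/* Clause 22 register bits, same values as in include/uapi/linux/mii.h. */
#define BMCR_ANRESTART		0x0200	/* bit 0.9: restart auto-negotiation */
#define BMSR_ANEGCOMPLETE	0x0020	/* bit 1.5: auto-negotiation complete */
#define BMSR_LSTATUS		0x0004	/* bit 1.2: link status */

/* Hypothetical helper: while bit 0.9 still reads as one, the PHY has not
 * yet initiated the new auto-negotiation, so aneg-complete and link-up in
 * BMSR may be stale. Report both as cleared in that case.
 */
static uint16_t mask_link_status(uint16_t bmcr, uint16_t bmsr)
{
	if (bmcr & BMCR_ANRESTART)
		bmsr &= ~(BMSR_ANEGCOMPLETE | BMSR_LSTATUS);

	return bmsr;
}
```

In a real driver the two inputs would come from reads of MII_BMCR and
MII_BMSR; the sketch only shows the masking step.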

     Andrew

^ permalink raw reply

* Re: [PATCH v2 13/15] net: phy: adin: configure downshift on config_init
From: Andrew Lunn @ 2019-08-08 20:39 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: Alexandru Ardelean, netdev, devicetree, linux-kernel, davem,
	robh+dt, mark.rutland, f.fainelli
In-Reply-To: <420c8e15-3361-a722-4ad1-3c448b1d3bc1@gmail.com>

On Thu, Aug 08, 2019 at 09:38:40PM +0200, Heiner Kallweit wrote:
> On 08.08.2019 14:30, Alexandru Ardelean wrote:
> > Down-speed auto-negotiation may not always be enabled, in which case the
> > PHY won't down-shift to 100 or 10 during auto-negotiation.
> > 
> > This change enables downshift and configures the number of retries to
> > default 8 (maximum supported value).
> > 
> > The change has been adapted from the Marvell PHY driver.
> > 
> Instead of a fixed downshift setting (like in the Marvell driver) you
> may consider implementing the ethtool phy-tunable ETHTOOL_PHY_DOWNSHIFT.

Hi Alexandru

Oops, sorry, my bad.

I looked at marvell_set_downshift(), and assumed it was connected to
the phy-tunable. I have patches somewhere which do that, but they
have not made it into mainline yet.

> See the Aquantia PHY driver for an example.

Yes, that does have all the tunable stuff.
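
A rough sketch of the validation a downshift tunable typically needs, written
as plain user-space C rather than real driver code. The two DOWNSHIFT_DEV_*
values are the ones defined in the ethtool UAPI header; the 8-retry maximum is
taken from the patch description above; struct example_downshift and both
functions are hypothetical and do not touch any PHY register.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Special values defined by the ethtool UAPI in <linux/ethtool.h>. */
#define DOWNSHIFT_DEV_DEFAULT_COUNT 0xff
#define DOWNSHIFT_DEV_DISABLE       0

/* Assumption: 8 retries is the hardware maximum, per the patch description. */
#define EXAMPLE_MAX_RETRIES 8

/* Hypothetical software model of the PHY's downshift configuration. */
struct example_downshift {
	bool enabled;
	uint8_t retries;
};

/* Hypothetical setter: the checks a set_tunable callback would perform. */
static int example_set_downshift(struct example_downshift *cfg, uint8_t count)
{
	if (count == DOWNSHIFT_DEV_DISABLE) {
		cfg->enabled = false;
		cfg->retries = 0;
		return 0;
	}
	if (count == DOWNSHIFT_DEV_DEFAULT_COUNT)
		count = EXAMPLE_MAX_RETRIES;
	if (count > EXAMPLE_MAX_RETRIES)
		return -E2BIG;

	cfg->enabled = true;
	cfg->retries = count;
	return 0;
}

/* Hypothetical getter: report the count the way ethtool would show it. */
static uint8_t example_get_downshift(const struct example_downshift *cfg)
{
	return cfg->enabled ? cfg->retries : DOWNSHIFT_DEV_DISABLE;
}
```

A real driver would hook equivalent logic into the get_tunable/set_tunable
callbacks of struct phy_driver and write the result to the vendor register.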

     Andrew

^ permalink raw reply

* Re: [PATCH net-next] taprio: remove unused variable 'entry_list_policy'
From: Vinicius Costa Gomes @ 2019-08-08 20:42 UTC (permalink / raw)
  To: David Miller, yuehaibing; +Cc: jhs, xiyou.wangcong, jiri, linux-kernel, netdev
In-Reply-To: <20190808.113813.478689798535715440.davem@davemloft.net>

Hi,

David Miller <davem@davemloft.net> writes:

> From: YueHaibing <yuehaibing@huawei.com>
> Date: Thu, 8 Aug 2019 22:26:23 +0800
>
>> net/sched/sch_taprio.c:680:32: warning:
>>  entry_list_policy defined but not used [-Wunused-const-variable=]
>> 
>> It is not used since commit a3d43c0d56f1 ("taprio: Add
>> support adding an admin schedule")
>> 
>> Reported-by: Hulk Robot <hulkci@huawei.com>
>> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
>
> This is probably unintentional and a bug, we should be using that
> policy value to validate that the sched list is indeed a nested
> attribute.

Removing this policy should be fine.

One of the points of commit a3d43c0d56f1 ("taprio: Add support adding an
admin schedule"), as explained in its commit message, is that it removes
support for schedules using the TCA_TAPRIO_ATTR_SCHED_SINGLE_ENTRY
attribute, which were never used (it now returns "not supported"). The
parsing of those schedules was the only user of this policy.

>
> I'm not applying this without at least a better and clear commit
> message explaining why we shouldn't be using this policy any more.

YueHaibing may use the text above in the commit message of a new spin of
this patch if you think it's clear enough.


Cheers,
--
Vinicius

^ permalink raw reply

* Re: [PATCH v5 bpf-next] BPF: helpers: New helper to obtain namespacedata from current task
From: Yonghong Song @ 2019-08-08 20:47 UTC (permalink / raw)
  To: carlos antonio neira bustos, Y Song
  Cc: netdev@vger.kernel.org, ebiederm@xmission.com, brouer@redhat.com,
	quentin.monnet@netronome.com
In-Reply-To: <5d4c856b.1c69fb81.2aa4f.32dd@mx.google.com>



On 8/8/19 1:26 PM, carlos antonio neira bustos wrote:
> Hi Yonghong,
> 
> I’m sorry, just to be sure: I’m only missing the error codes from 
> filename_lookup(), right?

From a kernel functionality point of view, yes, I am talking about the
error codes returned by filename_lookup().
For example, if CONFIG_PID_NS or CONFIG_NAMESPACES is not
defined in the config, the path "/proc/self/ns/pid" will not exist
and an error code will be returned. It may be -ENOTDIR
if CONFIG_NAMESPACES is not defined or -ECHILD if CONFIG_PID_NS
is not defined. Please double check.
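
For illustration only, a small user-space analogue of that failure mode. It
uses stat(), as the selftest in the patch does, and shows that a missing path
surfaces as a negative errno value instead of a valid inode. The errno seen
from user space is not necessarily the code filename_lookup() returns inside
the kernel, and probe_ns_path() is a hypothetical helper name.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: look up a namespace file by path.
 * Returns 0 and stores the inode number on success, or a negative
 * errno value (for example -ENOENT) when the path cannot be resolved.
 */
static int probe_ns_path(const char *path, ino_t *ino)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -errno;

	*ino = st.st_ino;
	return 0;
}
```

Calling probe_ns_path("/proc/self/ns/pid", &ino) on a kernel built without
the relevant options would return a negative value, which is the case the
helper's documentation should describe.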

Please do follow the advice in
https://lore.kernel.org/netdev/20190808174848.poybtaagg5ctle7t@dev00/T/#t
to break the single patch into multiple patches.

I only reviewed the kernel code. I will review the tools/ code
once the patch is resent as properly formatted, split-up commits.

Also, please cc the patches to the bpf mailing list at
bpf@vger.kernel.org

> 
> Bests
> 
> Maybe some other error codes in filename_lookup() function?
> 
>  > + *
> 
>  > + *                      If unable to get the inode from 
> /proc/self/ns/pid an error code
> 
>  > + *                      will be returned.
> 
> From: Y Song <ys114321@gmail.com>
> Sent: 08 August 2019 15:44
> To: Carlos Antonio Neira Bustos <cneirabustos@gmail.com>
> Cc: Yonghong Song <yhs@fb.com>; netdev@vger.kernel.org;
> ebiederm@xmission.com; brouer@redhat.com; quentin.monnet@netronome.com
> Subject: Re: [PATCH v5 bpf-next] BPF: helpers: New helper to obtain
> namespacedata from current task
> 
> On Thu, Aug 8, 2019 at 10:52 AM Carlos Antonio Neira Bustos
> 
> <cneirabustos@gmail.com> wrote:
> 
>  >
> 
>  > Yonghong,
> 
>  >
> 
>  > I have modified the patch following your feedback.
> 
>  > Let me know if I'm missing something.
> 
> Yes, I have some other requests about formatting.
> 
> https://lore.kernel.org/netdev/20190808174848.poybtaagg5ctle7t@dev00/T/#t
> 
> Could you address it as well?
> 
>  >
> 
>  > Bests
> 
>  >
> 
>  > From 70f8d5584700c9cfc82c006901d8ee9595c53f15 Mon Sep 17 00:00:00 2001
> 
>  > From: Carlos <cneirabustos@gmail.com>
> 
>  > Date: Wed, 7 Aug 2019 20:04:30 -0400
> 
>  > Subject: [PATCH] [PATCH v6 bpf-next] BPF: New helper to obtain 
> namespace data
> 
>  >  from current task
> 
>  >
> 
>  > This helper obtains the active namespace from current and returns pid, tgid,
>  > device and namespace id as seen from that namespace, allowing to instrument
>  > a process inside a container.
>  > Device is read from /proc/self/ns/pid, as in the future it's possible that
>  > different pid_ns files may belong to different devices, according
>  > to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
>  > conference.
>  > Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
>  > scripts but this helper returns the pid as seen by the root namespace which is
>  > fine when a bcc script is not executed inside a container.
>  > When the process of interest is inside a container, pid filtering will not work
>  > if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
>  > returning the pid as it's seen by the current namespace where the script is
>  > executing.
>  >
>  > This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
>  > used to do pid filtering even inside a container.
> 
>  >
> 
>  > For example a bcc script using bpf_get_current_pid_tgid() 
> (tools/funccount.py):
> 
>  >
> 
>  >         u32 pid = bpf_get_current_pid_tgid() >> 32;
> 
>  >         if (pid != <pid_arg_passed_in>)
> 
>  >                 return 0;
> 
>  > Could be modified to use bpf_get_current_pidns_info() as follows:
> 
>  >
> 
>  >         struct bpf_pidns_info pidns;
> 
>  >         bpf_get_current_pidns_info(&pidns, sizeof(pidns));
> 
>  >         u32 pid = pidns.tgid;
> 
>  >         u32 nsid = pidns.nsid;
> 
>  >         if ((pid != <pid_arg_passed_in>) || (nsid != <nsid_arg_passed_in>))
> 
>  >                 return 0;
> 
>  >
> 
>  > To find out the PID namespace id of a process, you could use this command:
> 
>  >
> 
>  > $ ps -h -o pidns -p <pid_of_interest>
> 
>  >
> 
>  > Or this other command:
> 
>  >
> 
>  > $ ls -Li /proc/<pid_of_interest>/ns/pid
> 
>  >
> 
>  > Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
> 
>  > ---
> 
>  >  fs/internal.h                                      |   2 -
> 
>  >  fs/namei.c                                         |   1 -
> 
>  >  include/linux/bpf.h                                |   1 +
> 
>  >  include/linux/namei.h                              |   4 +
> 
>  >  include/uapi/linux/bpf.h                           |  27 +++-
> 
>  >  kernel/bpf/core.c                                  |   1 +
> 
>  >  kernel/bpf/helpers.c                               |  64 ++++++++++
> 
>  >  kernel/trace/bpf_trace.c                           |   2 +
> 
>  >  samples/bpf/Makefile                               |   3 +
> 
>  >  samples/bpf/trace_ns_info_user.c                   |  35 ++++++
> 
>  >  samples/bpf/trace_ns_info_user_kern.c              |  44 +++++++
> 
>  >  tools/include/uapi/linux/bpf.h                     |  27 +++-
> 
>  >  tools/testing/selftests/bpf/Makefile               |   2 +-
> 
>  >  tools/testing/selftests/bpf/bpf_helpers.h          |   3 +
> 
>  >  .../testing/selftests/bpf/progs/test_pidns_kern.c  |  51 ++++++++
> 
>  >  tools/testing/selftests/bpf/test_pidns.c           | 138 
> +++++++++++++++++++++
> 
>  >  16 files changed, 399 insertions(+), 6 deletions(-)
> 
>  >  create mode 100644 samples/bpf/trace_ns_info_user.c
> 
>  >  create mode 100644 samples/bpf/trace_ns_info_user_kern.c
> 
>  >  create mode 100644 tools/testing/selftests/bpf/progs/test_pidns_kern.c
> 
>  >  create mode 100644 tools/testing/selftests/bpf/test_pidns.c
> 
>  >
> 
>  > diff --git a/fs/internal.h b/fs/internal.h
> 
>  > index 315fcd8d237c..6647e15dd419 100644
> 
>  > --- a/fs/internal.h
> 
>  > +++ b/fs/internal.h
> 
>  > @@ -59,8 +59,6 @@ extern int finish_clean_context(struct fs_context *fc);
> 
>  >  /*
> 
>  >   * namei.c
> 
>  >   */
> 
>  > -extern int filename_lookup(int dfd, struct filename *name, unsigned 
> flags,
> 
>  > -                          struct path *path, struct path *root);
> 
>  >  extern int user_path_mountpoint_at(int, const char __user *, 
> unsigned int, struct path *);
> 
>  >  extern int vfs_path_lookup(struct dentry *, struct vfsmount *,
> 
>  >                            const char *, unsigned int, struct path *);
> 
>  > diff --git a/fs/namei.c b/fs/namei.c
> 
>  > index 209c51a5226c..a89fc72a4a10 100644
> 
>  > --- a/fs/namei.c
> 
>  > +++ b/fs/namei.c
> 
>  > @@ -19,7 +19,6 @@
> 
>  >  #include <linux/export.h>
> 
>  >  #include <linux/kernel.h>
> 
>  >  #include <linux/slab.h>
> 
>  > -#include <linux/fs.h>
> 
>  >  #include <linux/namei.h>
> 
>  >  #include <linux/pagemap.h>
> 
>  >  #include <linux/fsnotify.h>
> 
>  > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> 
>  > index f9a506147c8a..e4adf5e05afd 100644
> 
>  > --- a/include/linux/bpf.h
> 
>  > +++ b/include/linux/bpf.h
> 
>  > @@ -1050,6 +1050,7 @@ extern const struct bpf_func_proto 
> bpf_get_local_storage_proto;
> 
>  >  extern const struct bpf_func_proto bpf_strtol_proto;
> 
>  >  extern const struct bpf_func_proto bpf_strtoul_proto;
> 
>  >  extern const struct bpf_func_proto bpf_tcp_sock_proto;
> 
>  > +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
> 
>  >
> 
>  >  /* Shared helpers among cBPF and eBPF. */
> 
>  >  void bpf_user_rnd_init_once(void);
> 
>  > diff --git a/include/linux/namei.h b/include/linux/namei.h
> 
>  > index 9138b4471dbf..b45c8b6f7cb4 100644
> 
>  > --- a/include/linux/namei.h
> 
>  > +++ b/include/linux/namei.h
> 
>  > @@ -6,6 +6,7 @@
> 
>  >  #include <linux/path.h>
> 
>  >  #include <linux/fcntl.h>
> 
>  >  #include <linux/errno.h>
> 
>  > +#include <linux/fs.h>
> 
>  >
> 
>  >  enum { MAX_NESTED_LINKS = 8 };
> 
>  >
> 
>  > @@ -97,6 +98,9 @@ extern void unlock_rename(struct dentry *, struct 
> dentry *);
> 
>  >
> 
>  >  extern void nd_jump_link(struct path *path);
> 
>  >
> 
>  > +extern int filename_lookup(int dfd, struct filename *name, unsigned 
> flags,
> 
>  > +                          struct path *path, struct path *root);
> 
>  > +
> 
>  >  static inline void nd_terminate_link(void *name, size_t len, size_t 
> maxlen)
> 
>  >  {
> 
>  >         ((char *) name)[min(len, maxlen)] = '\0';
> 
>  > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> 
>  > index 4393bd4b2419..b0d4869fb860 100644
> 
>  > --- a/include/uapi/linux/bpf.h
> 
>  > +++ b/include/uapi/linux/bpf.h
> 
>  > @@ -2741,6 +2741,24 @@ union bpf_attr {
> 
>  >   *             **-EOPNOTSUPP** kernel configuration does not enable 
> SYN cookies
> 
>  >   *
> 
>  >   *             **-EPROTONOSUPPORT** IP packet version is not 4 or 6
> 
>  > + *
> 
>  > + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 
> size_of_pidns)
> 
>  > + *     Description
> 
>  > + *             Copies into *pidns* pid, namespace id and tgid as 
> seen by the
> 
>  > + *             current namespace and also device from /proc/self/ns/pid.
> 
>  > + *             *size_of_pidns* must be the size of *pidns*
> 
>  > + *
> 
>  > + *             This helper is used when pid filtering is needed inside a
> 
>  > + *             container as the bpf_get_current_pid_tgid() helper always returns the
> 
>  > + *             pid as seen by the root namespace.
> 
>  > + *     Return
> 
>  > + *             0 on success
> 
>  > + *
> 
>  > + *             **-EINVAL** if *size_of_pidns* is not valid or unable 
> to get ns, pid
> 
>  > + *             or tgid of the current task.
> 
>  > + *
> 
>  > + *             **-ENOMEM**  if allocation fails.
> 
>  > + *
> 
>  >   */
> 
>  >  #define __BPF_FUNC_MAPPER(FN)          \
> 
>  >         FN(unspec),                     \
> 
>  > @@ -2853,7 +2871,8 @@ union bpf_attr {
> 
>  >         FN(sk_storage_get),             \
> 
>  >         FN(sk_storage_delete),          \
> 
>  >         FN(send_signal),                \
> 
>  > -       FN(tcp_gen_syncookie),
> 
>  > +       FN(tcp_gen_syncookie),          \
> 
>  > +       FN(get_current_pidns_info),
> 
>  >
> 
>  >  /* integer value in 'imm' field of BPF_CALL instruction selects 
> which helper
> 
>  >   * function eBPF program intends to call
> 
>  > @@ -3604,4 +3623,10 @@ struct bpf_sockopt {
> 
>  >         __s32   retval;
> 
>  >  };
> 
>  >
> 
>  > +struct bpf_pidns_info {
> 
>  > +       __u32 dev;
> 
>  > +       __u32 nsid;
> 
>  > +       __u32 tgid;
> 
>  > +       __u32 pid;
> 
>  > +};
> 
>  >  #endif /* _UAPI__LINUX_BPF_H__ */
> 
>  > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> 
>  > index 8191a7db2777..3159f2a0188c 100644
> 
>  > --- a/kernel/bpf/core.c
> 
>  > +++ b/kernel/bpf/core.c
> 
>  > @@ -2038,6 +2038,7 @@ const struct bpf_func_proto 
> bpf_get_current_uid_gid_proto __weak;
> 
>  >  const struct bpf_func_proto bpf_get_current_comm_proto __weak;
> 
>  >  const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
> 
>  >  const struct bpf_func_proto bpf_get_local_storage_proto __weak;
> 
>  > +const struct bpf_func_proto bpf_get_current_pidns_info_proto __weak;
> 
>  >
> 
>  >  const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
> 
>  >  {
> 
>  > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> 
>  > index 5e28718928ca..41fbf1f28a48 100644
> 
>  > --- a/kernel/bpf/helpers.c
> 
>  > +++ b/kernel/bpf/helpers.c
> 
>  > @@ -11,6 +11,12 @@
> 
>  >  #include <linux/uidgid.h>
> 
>  >  #include <linux/filter.h>
> 
>  >  #include <linux/ctype.h>
> 
>  > +#include <linux/pid_namespace.h>
> 
>  > +#include <linux/major.h>
> 
>  > +#include <linux/stat.h>
> 
>  > +#include <linux/namei.h>
> 
>  > +#include <linux/version.h>
> 
>  > +
> 
>  >
> 
>  >  #include "../../lib/kstrtox.h"
> 
>  >
> 
>  > @@ -312,6 +318,64 @@ void copy_map_value_locked(struct bpf_map *map, 
> void *dst, void *src,
> 
>  >         preempt_enable();
> 
>  >  }
> 
>  >
> 
>  > +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, 
> pidns_info, u32,
> 
>  > +        size)
> 
>  > +{
> 
>  > +       const char *pidns_path = "/proc/self/ns/pid";
> 
>  > +       struct pid_namespace *pidns = NULL;
> 
>  > +       struct filename *tmp = NULL;
> 
>  > +       struct inode *inode;
> 
>  > +       struct path kp;
> 
>  > +       pid_t tgid = 0;
> 
>  > +       pid_t pid = 0;
> 
>  > +       int ret;
> 
>  > +       int len;
> 
>  > +
> 
>  > +       if (unlikely(size != sizeof(struct bpf_pidns_info)))
> 
>  > +               return -EINVAL;
> 
>  > +       pidns = task_active_pid_ns(current);
> 
>  > +       if (unlikely(!pidns))
> 
>  > +               goto clear;
> 
>  > +       pidns_info->nsid =  pidns->ns.inum;
> 
>  > +       pid = task_pid_nr_ns(current, pidns);
> 
>  > +       if (unlikely(!pid))
> 
>  > +               goto clear;
> 
>  > +       tgid = task_tgid_nr_ns(current, pidns);
> 
>  > +       if (unlikely(!tgid))
> 
>  > +               goto clear;
> 
>  > +       pidns_info->tgid = (u32) tgid;
> 
>  > +       pidns_info->pid = (u32) pid;
> 
>  > +       tmp = kmem_cache_alloc(names_cachep, GFP_ATOMIC);
> 
>  > +       if (unlikely(!tmp)) {
> 
>  > +               memset((void *)pidns_info, 0, (size_t) size);
> 
>  > +               return -ENOMEM;
> 
>  > +       }
> 
>  > +       len = strlen(pidns_path) + 1;
> 
>  > +       memcpy((char *)tmp->name, pidns_path, len);
> 
>  > +       tmp->uptr = NULL;
> 
>  > +       tmp->aname = NULL;
> 
>  > +       tmp->refcnt = 1;
> 
>  > +       ret = filename_lookup(AT_FDCWD, tmp, 0, &kp, NULL);
> 
>  > +       if (ret) {
> 
>  > +               memset((void *)pidns_info, 0, (size_t) size);
> 
>  > +               return ret;
> 
>  > +       }
> 
>  > +       inode = d_backing_inode(kp.dentry);
> 
>  > +       pidns_info->dev = inode->i_sb->s_dev;
> 
>  > +       return 0;
> 
>  > +clear:
> 
>  > +       memset((void *)pidns_info, 0, (size_t) size);
> 
>  > +       return -EINVAL;
> 
>  > +}
> 
>  > +
> 
>  > +const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
> 
>  > +       .func           = bpf_get_current_pidns_info,
> 
>  > +       .gpl_only       = false,
> 
>  > +       .ret_type       = RET_INTEGER,
> 
>  > +       .arg1_type      = ARG_PTR_TO_UNINIT_MEM,
> 
>  > +       .arg2_type      = ARG_CONST_SIZE,
> 
>  > +};
> 
>  > +
> 
>  >  #ifdef CONFIG_CGROUPS
> 
>  >  BPF_CALL_0(bpf_get_current_cgroup_id)
> 
>  >  {
> 
>  > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> 
>  > index ca1255d14576..5e1dc22765a5 100644
> 
>  > --- a/kernel/trace/bpf_trace.c
> 
>  > +++ b/kernel/trace/bpf_trace.c
> 
>  > @@ -709,6 +709,8 @@ tracing_func_proto(enum bpf_func_id func_id, 
> const struct bpf_prog *prog)
> 
>  >  #endif
> 
>  >         case BPF_FUNC_send_signal:
> 
>  >                 return &bpf_send_signal_proto;
> 
>  > +       case BPF_FUNC_get_current_pidns_info:
> 
>  > +               return &bpf_get_current_pidns_info_proto;
> 
>  >         default:
> 
>  >                 return NULL;
> 
>  >         }
> 
>  > diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> 
>  > index 1d9be26b4edd..238453ff27d2 100644
> 
>  > --- a/samples/bpf/Makefile
> 
>  > +++ b/samples/bpf/Makefile
> 
>  > @@ -53,6 +53,7 @@ hostprogs-y += task_fd_query
> 
>  >  hostprogs-y += xdp_sample_pkts
> 
>  >  hostprogs-y += ibumad
> 
>  >  hostprogs-y += hbm
> 
>  > +hostprogs-y += trace_ns_info
> 
>  >
> 
>  >  # Libbpf dependencies
> 
>  >  LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
> 
>  > @@ -109,6 +110,7 @@ task_fd_query-objs := bpf_load.o 
> task_fd_query_user.o $(TRACE_HELPERS)
> 
>  >  xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
> 
>  >  ibumad-objs := bpf_load.o ibumad_user.o $(TRACE_HELPERS)
> 
>  >  hbm-objs := bpf_load.o hbm.o $(CGROUP_HELPERS)
> 
>  > +trace_ns_info-objs := bpf_load.o trace_ns_info_user.o
> 
>  >
> 
>  >  # Tell kbuild to always build the programs
> 
>  >  always := $(hostprogs-y)
> 
>  > @@ -170,6 +172,7 @@ always += xdp_sample_pkts_kern.o
> 
>  >  always += ibumad_kern.o
> 
>  >  always += hbm_out_kern.o
> 
>  >  always += hbm_edt_kern.o
> 
>  > +always += trace_ns_info_user_kern.o
> 
>  >
> 
>  >  KBUILD_HOSTCFLAGS += -I$(objtree)/usr/include
> 
>  >  KBUILD_HOSTCFLAGS += -I$(srctree)/tools/lib/bpf/
> 
>  > diff --git a/samples/bpf/trace_ns_info_user.c 
> b/samples/bpf/trace_ns_info_user.c
> 
>  > new file mode 100644
> 
>  > index 000000000000..e06d08db6f30
> 
>  > --- /dev/null
> 
>  > +++ b/samples/bpf/trace_ns_info_user.c
> 
>  > @@ -0,0 +1,35 @@
> 
>  > +// SPDX-License-Identifier: GPL-2.0
> 
>  > +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> 
>  > + *
> 
>  > + * This program is free software; you can redistribute it and/or
> 
>  > + * modify it under the terms of version 2 of the GNU General Public
> 
>  > + * License as published by the Free Software Foundation.
> 
>  > + */
> 
>  > +
> 
>  > +#include <stdio.h>
> 
>  > +#include <linux/bpf.h>
> 
>  > +#include <unistd.h>
> 
>  > +#include "bpf/libbpf.h"
> 
>  > +#include "bpf_load.h"
> 
>  > +
> 
>  > +/* This code was taken verbatim from tracex1_user.c, it's used
> 
>  > + * to exercise bpf_get_current_pidns_info() helper call.
> 
>  > + */
> 
>  > +int main(int ac, char **argv)
> 
>  > +{
> 
>  > +       FILE *f;
> 
>  > +       char filename[256];
> 
>  > +
> 
>  > +       snprintf(filename, sizeof(filename), "%s_user_kern.o", argv[0]);
> 
>  > +       printf("loading %s\n", filename);
> 
>  > +
> 
>  > +       if (load_bpf_file(filename)) {
> 
>  > +               printf("%s", bpf_log_buf);
> 
>  > +               return 1;
> 
>  > +       }
> 
>  > +
> 
>  > +       f = popen("taskset 1 ping  localhost", "r");
> 
>  > +       (void) f;
> 
>  > +       read_trace_pipe();
> 
>  > +       return 0;
> 
>  > +}
> 
>  > diff --git a/samples/bpf/trace_ns_info_user_kern.c 
> b/samples/bpf/trace_ns_info_user_kern.c
> 
>  > new file mode 100644
> 
>  > index 000000000000..96675e02b707
> 
>  > --- /dev/null
> 
>  > +++ b/samples/bpf/trace_ns_info_user_kern.c
> 
>  > @@ -0,0 +1,44 @@
> 
>  > +// SPDX-License-Identifier: GPL-2.0
> 
>  > +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> 
>  > + *
> 
>  > + * This program is free software; you can redistribute it and/or
> 
>  > + * modify it under the terms of version 2 of the GNU General Public
> 
>  > + * License as published by the Free Software Foundation.
> 
>  > + */
> 
>  > +#include <linux/skbuff.h>
> 
>  > +#include <linux/netdevice.h>
> 
>  > +#include <linux/version.h>
> 
>  > +#include <uapi/linux/bpf.h>
> 
>  > +#include "bpf_helpers.h"
> 
>  > +
> 
>  > +typedef __u64 u64;
> 
>  > +typedef __u32 u32;
> 
>  > +
> 
>  > +
> 
>  > +/* kprobe is NOT a stable ABI
> 
>  > + * kernel functions can be removed, renamed or completely change 
> semantics.
> 
>  > + * Number of arguments and their positions can change, etc.
> 
>  > + * In such case this bpf+kprobe example will no longer be meaningful
> 
>  > + */
> 
>  > +
> 
>  > +/* This will call bpf_get_current_pidns_info() to display pid and ns 
> values
> 
>  > + * as seen by the current namespace, on the far left you will see 
> the pid as
> 
>  > + * seen by the root namespace.
> 
>  > + */
> 
>  > +
> 
>  > +SEC("kprobe/__netif_receive_skb_core")
> 
>  > +int bpf_prog1(struct pt_regs *ctx)
> 
>  > +{
> 
>  > +       char fmt[] = "nsid:%u, dev: %u,  pid:%u\n";
> 
>  > +       struct bpf_pidns_info nsinfo;
> 
>  > +       int ok = 0;
> 
>  > +
> 
>  > +       ok = bpf_get_current_pidns_info(&nsinfo, sizeof(nsinfo));
> 
>  > +       if (ok == 0)
> 
>  > +               bpf_trace_printk(fmt, sizeof(fmt), (u32)nsinfo.nsid,
> 
>  > +                                (u32) nsinfo.dev, (u32)nsinfo.pid);
> 
>  > +
> 
>  > +       return 0;
> 
>  > +}
> 
>  > +char _license[] SEC("license") = "GPL";
> 
>  > +u32 _version SEC("version") = LINUX_VERSION_CODE;
> 
>  > diff --git a/tools/include/uapi/linux/bpf.h 
> b/tools/include/uapi/linux/bpf.h
> 
>  > index 4393bd4b2419..b0d4869fb860 100644
> 
>  > --- a/tools/include/uapi/linux/bpf.h
> 
>  > +++ b/tools/include/uapi/linux/bpf.h
> 
>  > @@ -2741,6 +2741,24 @@ union bpf_attr {
> 
>  >   *             **-EOPNOTSUPP** kernel configuration does not enable 
> SYN cookies
> 
>  >   *
> 
>  >   *             **-EPROTONOSUPPORT** IP packet version is not 4 or 6
> 
>  > + *
> 
>  > + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 
> size_of_pidns)
> 
>  > + *     Description
> 
>  > + *             Copies into *pidns* pid, namespace id and tgid as 
> seen by the
> 
>  > + *             current namespace and also device from /proc/self/ns/pid.
> 
>  > + *             *size_of_pidns* must be the size of *pidns*
> 
>  > + *
> 
>  > + *             This helper is used when pid filtering is needed inside a
> 
>  > + *             container as the bpf_get_current_pid_tgid() helper always returns the
> 
>  > + *             pid as seen by the root namespace.
> 
>  > + *     Return
> 
>  > + *             0 on success
> 
>  > + *
> 
>  > + *             **-EINVAL** if *size_of_pidns* is not valid or unable 
> to get ns, pid
> 
>  > + *             or tgid of the current task.
> 
>  > + *
> 
>  > + *             **-ENOMEM**  if allocation fails.
> 
>  > + *
> 
>  >   */
> 
>  >  #define __BPF_FUNC_MAPPER(FN)          \
> 
>  >         FN(unspec),                     \
> 
>  > @@ -2853,7 +2871,8 @@ union bpf_attr {
> 
>  >         FN(sk_storage_get),             \
> 
>  >         FN(sk_storage_delete),          \
> 
>  >         FN(send_signal),                \
> 
>  > -       FN(tcp_gen_syncookie),
> 
>  > +       FN(tcp_gen_syncookie),          \
> 
>  > +       FN(get_current_pidns_info),
> 
>  >
> 
>  >  /* integer value in 'imm' field of BPF_CALL instruction selects 
> which helper
> 
>  >   * function eBPF program intends to call
> 
>  > @@ -3604,4 +3623,10 @@ struct bpf_sockopt {
> 
>  >         __s32   retval;
> 
>  >  };
> 
>  >
> 
>  > +struct bpf_pidns_info {
> 
>  > +       __u32 dev;
> 
>  > +       __u32 nsid;
> 
>  > +       __u32 tgid;
> 
>  > +       __u32 pid;
> 
>  > +};
> 
>  >  #endif /* _UAPI__LINUX_BPF_H__ */
> 
>  > diff --git a/tools/testing/selftests/bpf/Makefile 
> b/tools/testing/selftests/bpf/Makefile
> 
>  > index 3bd0f4a0336a..1f97b571b581 100644
> 
>  > --- a/tools/testing/selftests/bpf/Makefile
> 
>  > +++ b/tools/testing/selftests/bpf/Makefile
> 
>  > @@ -29,7 +29,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps 
> test_lru_map test_lpm_map test
> 
>  >         test_cgroup_storage test_select_reuseport test_section_names \
> 
>  >         test_netcnt test_tcpnotify_user test_sock_fields test_sysctl 
> test_hashmap \
> 
>  >         test_btf_dump test_cgroup_attach xdping test_sockopt 
> test_sockopt_sk \
> 
>  > -       test_sockopt_multi test_tcp_rtt
> 
>  > +       test_sockopt_multi test_tcp_rtt test_pidns
> 
>  >
> 
>  >  BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
> 
>  >  TEST_GEN_FILES = $(BPF_OBJ_FILES)
> 
>  > diff --git a/tools/testing/selftests/bpf/bpf_helpers.h 
> b/tools/testing/selftests/bpf/bpf_helpers.h
> 
>  > index 120aa86c58d3..c96795a9d983 100644
> 
>  > --- a/tools/testing/selftests/bpf/bpf_helpers.h
> 
>  > +++ b/tools/testing/selftests/bpf/bpf_helpers.h
> 
>  > @@ -231,6 +231,9 @@ static int (*bpf_send_signal)(unsigned sig) = 
> (void *)BPF_FUNC_send_signal;
> 
>  >  static long long (*bpf_tcp_gen_syncookie)(struct bpf_sock *sk, void *ip,
> 
>  >                                           int ip_len, void *tcp, int 
> tcp_len) =
> 
>  >         (void *) BPF_FUNC_tcp_gen_syncookie;
> 
>  > +static int (*bpf_get_current_pidns_info)(struct bpf_pidns_info *buf,
> 
>  > +                                        unsigned int buf_size) =
> 
>  > +       (void *) BPF_FUNC_get_current_pidns_info;
> 
>  >
> 
>  >  /* llvm builtin functions that eBPF C program may use to
> 
>  >   * emit BPF_LD_ABS and BPF_LD_IND instructions
> 
>  > diff --git a/tools/testing/selftests/bpf/progs/test_pidns_kern.c 
> b/tools/testing/selftests/bpf/progs/test_pidns_kern.c
> 
>  > new file mode 100644
> 
>  > index 000000000000..e1d2facfa762
> 
>  > --- /dev/null
> 
>  > +++ b/tools/testing/selftests/bpf/progs/test_pidns_kern.c
> 
>  > @@ -0,0 +1,51 @@
> 
>  > +// SPDX-License-Identifier: GPL-2.0
> 
>  > +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> 
>  > + *
> 
>  > + * This program is free software; you can redistribute it and/or
> 
>  > + * modify it under the terms of version 2 of the GNU General Public
> 
>  > + * License as published by the Free Software Foundation.
> 
>  > + */
> 
>  > +
> 
>  > +#include <linux/bpf.h>
> 
>  > +#include <errno.h>
> 
>  > +#include "bpf_helpers.h"
> 
>  > +
> 
>  > +struct bpf_map_def SEC("maps") nsidmap = {
> 
>  > +       .type = BPF_MAP_TYPE_ARRAY,
> 
>  > +       .key_size = sizeof(__u32),
> 
>  > +       .value_size = sizeof(__u32),
> 
>  > +       .max_entries = 1,
> 
>  > +};
> 
>  > +
> 
>  > +struct bpf_map_def SEC("maps") pidmap = {
> 
>  > +       .type = BPF_MAP_TYPE_ARRAY,
> 
>  > +       .key_size = sizeof(__u32),
> 
>  > +       .value_size = sizeof(__u32),
> 
>  > +       .max_entries = 1,
> 
>  > +};
> 
>  > +
> 
>  > +SEC("tracepoint/syscalls/sys_enter_nanosleep")
> 
>  > +int trace(void *ctx)
> 
>  > +{
> 
>  > +       struct bpf_pidns_info nsinfo;
> 
>  > +       __u32 key = 0, *expected_pid, *val;
> 
>  > +       char fmt[] = "ERROR nspid:%d\n";
> 
>  > +
> 
>  > +       if (bpf_get_current_pidns_info(&nsinfo, sizeof(nsinfo)))
> 
>  > +               return -EINVAL;
> 
>  > +
> 
>  > +       expected_pid = bpf_map_lookup_elem(&pidmap, &key);
> 
>  > +
> 
>  > +
> 
>  > +       if (!expected_pid || *expected_pid != nsinfo.pid)
> 
>  > +               return 0;
> 
>  > +
> 
>  > +       val = bpf_map_lookup_elem(&nsidmap, &key);
> 
>  > +       if (val)
> 
>  > +               *val = nsinfo.nsid;
> 
>  > +
> 
>  > +       return 0;
> 
>  > +}
> 
>  > +
> 
>  > +char _license[] SEC("license") = "GPL";
> 
>  > +__u32 _version SEC("version") = 1;
> 
>  > diff --git a/tools/testing/selftests/bpf/test_pidns.c 
> b/tools/testing/selftests/bpf/test_pidns.c
> 
>  > new file mode 100644
> 
>  > index 000000000000..a7254055f294
> 
>  > --- /dev/null
> 
>  > +++ b/tools/testing/selftests/bpf/test_pidns.c
> 
>  > @@ -0,0 +1,138 @@
> 
>  > +// SPDX-License-Identifier: GPL-2.0
> 
>  > +/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
> 
>  > + *
> 
>  > + * This program is free software; you can redistribute it and/or
> 
>  > + * modify it under the terms of version 2 of the GNU General Public
> 
>  > + * License as published by the Free Software Foundation.
> 
>  > + */
> 
>  > +
> 
>  > +#include <stdio.h>
> 
>  > +#include <stdlib.h>
> 
>  > +#include <string.h>
> 
>  > +#include <errno.h>
> 
>  > +#include <fcntl.h>
> 
>  > +#include <syscall.h>
> 
>  > +#include <unistd.h>
> 
>  > +#include <linux/perf_event.h>
> 
>  > +#include <sys/ioctl.h>
> 
>  > +#include <sys/time.h>
> 
>  > +#include <sys/types.h>
> 
>  > +#include <sys/stat.h>
> 
>  > +
> 
>  > +#include <linux/bpf.h>
> 
>  > +#include <bpf/bpf.h>
> 
>  > +#include <bpf/libbpf.h>
> 
>  > +
> 
>  > +#include "cgroup_helpers.h"
> 
>  > +#include "bpf_rlimit.h"
> 
>  > +
> 
>  > +#define CHECK(condition, tag, format...) ({            \
> 
>  > +       int __ret = !!(condition);                      \
> 
>  > +       if (__ret) {                                    \
> 
>  > +               printf("%s:FAIL:%s ", __func__, tag);   \
> 
>  > +               printf(format);                         \
> 
>  > +       } else {                                        \
> 
>  > +               printf("%s:PASS:%s\n", __func__, tag);  \
> 
>  > +       }                                               \
> 
>  > +       __ret;                                          \
> 
>  > +})
> 
>  > +
> 
>  > +static int bpf_find_map(const char *test, struct bpf_object *obj,
> 
>  > +                       const char *name)
> 
>  > +{
> 
>  > +       struct bpf_map *map;
> 
>  > +
> 
>  > +       map = bpf_object__find_map_by_name(obj, name);
> 
>  > +       if (!map)
> 
>  > +               return -1;
> 
>  > +       return bpf_map__fd(map);
> 
>  > +}
> 
>  > +
> 
>  > +
> 
>  > +int main(int argc, char **argv)
> 
>  > +{
> 
>  > +       const char *probe_name = "syscalls/sys_enter_nanosleep";
> 
>  > +       const char *file = "test_pidns_kern.o";
> 
>  > +       int err, bytes, efd, prog_fd, pmu_fd;
> 
>  > +       int pidmap_fd, nsidmap_fd;
> 
>  > +       struct perf_event_attr attr = {};
> 
>  > +       struct bpf_object *obj;
> 
>  > +       __u32 knsid = 0;
> 
>  > +       __u32 key = 0, pid;
> 
>  > +       int exit_code = 1;
> 
>  > +       struct stat st;
> 
>  > +       char buf[256];
> 
>  > +
> 
>  > +       err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, 
> &prog_fd);
> 
>  > +       if (CHECK(err, "bpf_prog_load", "err %d errno %d\n", err, errno))
> 
>  > +               goto cleanup_cgroup_env;
> 
>  > +
> 
>  > +       nsidmap_fd = bpf_find_map(__func__, obj, "nsidmap");
> 
>  > +       if (CHECK(nsidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
> 
>  > +                 nsidmap_fd, errno))
> 
>  > +               goto close_prog;
> 
>  > +
> 
>  > +       pidmap_fd = bpf_find_map(__func__, obj, "pidmap");
> 
>  > +       if (CHECK(pidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
> 
>  > +                 pidmap_fd, errno))
> 
>  > +               goto close_prog;
> 
>  > +
> 
>  > +       pid = getpid();
> 
>  > +       bpf_map_update_elem(pidmap_fd, &key, &pid, 0);
> 
>  > +
> 
>  > +       snprintf(buf, sizeof(buf),
> 
>  > +                "/sys/kernel/debug/tracing/events/%s/id", probe_name);
> 
>  > +       efd = open(buf, O_RDONLY, 0);
> 
>  > +       if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
> 
>  > +               goto close_prog;
> 
>  > +       bytes = read(efd, buf, sizeof(buf));
> 
>  > +       close(efd);
> 
>  > +       if (CHECK(bytes <= 0 || bytes >= sizeof(buf), "read",
> 
>  > +                 "bytes %d errno %d\n", bytes, errno))
> 
>  > +               goto close_prog;
> 
>  > +
> 
>  > +       attr.config = strtol(buf, NULL, 0);
> 
>  > +       attr.type = PERF_TYPE_TRACEPOINT;
> 
>  > +       attr.sample_type = PERF_SAMPLE_RAW;
> 
>  > +       attr.sample_period = 1;
> 
>  > +       attr.wakeup_events = 1;
> 
>  > +
> 
>  > +       pmu_fd = syscall(__NR_perf_event_open, &attr, getpid(), -1, -1, 0);
> 
>  > +       if (CHECK(pmu_fd < 0, "perf_event_open", "err %d errno %d\n", pmu_fd,
> 
>  > +                 errno))
> 
>  > +               goto close_prog;
> 
>  > +
> 
>  > +       err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
> 
>  > +       if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n", err,
> 
>  > +                 errno))
> 
>  > +               goto close_pmu;
> 
>  > +
> 
>  > +       err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
> 
>  > +       if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n", err,
> 
>  > +                 errno))
> 
>  > +               goto close_pmu;
> 
>  > +
> 
>  > +       /* trigger some syscalls */
> 
>  > +       sleep(1);
> 
>  > +
> 
>  > +       err = bpf_map_lookup_elem(nsidmap_fd, &key, &knsid);
> 
>  > +       if (CHECK(err, "bpf_map_lookup_elem", "err %d errno %d\n", err, errno))
> 
>  > +               goto close_pmu;
> 
>  > +
> 
>  > +       if (stat("/proc/self/ns/pid", &st))
> 
>  > +               goto close_pmu;
> 
>  > +
> 
>  > +       if (CHECK(knsid != (__u32) st.st_ino, "compare_namespace_id",
> 
>  > +                 "kern knsid %u user unsid %u\n", knsid, (__u32) st.st_ino))
> 
>  > +               goto close_pmu;
> 
>  > +
> 
>  > +       exit_code = 0;
> 
>  > +       printf("%s:PASS\n", argv[0]);
> 
>  > +
> 
>  > +close_pmu:
> 
>  > +       close(pmu_fd);
> 
>  > +close_prog:
> 
>  > +       bpf_object__close(obj);
> 
>  > +cleanup_cgroup_env:
> 
>  > +       return exit_code;
> 
>  > +}
> 
>  > --
> 
>  > 2.11.0
> 
>  >
> 
>  > On Thu, Aug 08, 2019 at 05:09:51AM +0000, Yonghong Song wrote:
> 
>  > >
> 
>  > >
> 
>  > > On 8/7/19 6:22 PM, Carlos Antonio Neira Bustos wrote:
> 
>  > > > The code has been modified to avoid syscalls that could sleep.
> 
>  > > > Please let me know if any other modification is needed.
> 
>  > > >
> 
>  > > >  From be0384c0fa209a78c1567936e8db4e35b9a7c0f8 Mon Sep 17 00:00:00 2001
> 
>  > > > From: Carlos <cneirabustos@gmail.com>
> 
>  > > > Date: Wed, 7 Aug 2019 20:04:30 -0400
> 
>  > > > Subject: [PATCH] [PATCH v5 bpf-next] BPF: New helper to obtain namespace data
> 
>  > > >   from current task
> 
>  > > >
> 
>  > > > This helper obtains the active namespace from current and returns pid, tgid,
> 
>  > > > device and namespace id as seen from that namespace, which allows
> 
>  > > > instrumenting a process inside a container.
> 
>  > > > Device is read from /proc/self/ns/pid, as in the future it's possible that
> 
>  > > > different pid_ns files may belong to different devices, according
> 
>  > > > to the discussion between Eric Biederman and Yonghong at the 2017 Linux
> 
>  > > > Plumbers Conference.
> 
>  > > > Currently bpf_get_current_pid_tgid() is used to do pid filtering in bcc's
> 
>  > > > scripts, but this helper returns the pid as seen by the root namespace, which is
> 
>  > > > fine when a bcc script is not executed inside a container.
> 
>  > > > When the process of interest is inside a container, pid filtering will not work
> 
>  > > > if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
> 
>  > > > by returning the pid as it's seen by the current namespace where the script is
> 
>  > > > executing.
> 
>  > > >
> 
>  > > > This helper has the same use cases as bpf_get_current_pid_tgid(), as it can be
> 
>  > > > used to do pid filtering even inside a container.
> 
>  > > >
> 
>  > > > For example, a bcc script using bpf_get_current_pid_tgid() (tools/funccount.py):
> 
>  > > >
> 
>  > > >          u32 pid = bpf_get_current_pid_tgid() >> 32;
> 
>  > > >          if (pid != <pid_arg_passed_in>)
> 
>  > > >                  return 0;
> 
>  > > > Could be modified to use bpf_get_current_pidns_info() as follows:
> 
>  > > >
> 
>  > > >          struct bpf_pidns_info pidns;
> 
>  > > >          bpf_get_current_pidns_info(&pidns, sizeof(struct bpf_pidns_info));
> 
>  > > >          u32 pid = pidns.tgid;
> 
>  > > >          u32 nsid = pidns.nsid;
> 
>  > > >          if ((pid != <pid_arg_passed_in>) || (nsid != <nsid_arg_passed_in>))
> 
>  > > >                  return 0;
> 
>  > > >
> 
>  > > > To find out the PID namespace id of a process, you could use this command:
> 
>  > > >
> 
>  > > > $ ps -h -o pidns -p <pid_of_interest>
> 
>  > > >
> 
>  > > > Or this other command:
> 
>  > > >
> 
>  > > > $ ls -Li /proc/<pid_of_interest>/ns/pid
> 
>  > > >
> 
>  > > > Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
> 
>  > > > ---
> 
>  > > >   fs/namei.c                                         |   2 +-
> 
>  > > >   include/linux/bpf.h                                |   1 +
> 
>  > > >   include/linux/namei.h                              |   4 +
> 
>  > > >   include/uapi/linux/bpf.h                           |  29 ++++-
> 
>  > > >   kernel/bpf/core.c                                  |   1 +
> 
>  > > >   kernel/bpf/helpers.c                               |  78 ++++++++++++
> 
>  > > >   kernel/trace/bpf_trace.c                           |   2 +
> 
>  > > >   samples/bpf/Makefile                               |   3 +
> 
>  > > >   samples/bpf/trace_ns_info_user.c                   |  35 ++++++
> 
>  > > >   samples/bpf/trace_ns_info_user_kern.c              |  44 +++++++
> 
>  > > >   tools/include/uapi/linux/bpf.h                     |  29 ++++-
> 
>  > > >   tools/testing/selftests/bpf/Makefile               |   2 +-
> 
>  > > >   tools/testing/selftests/bpf/bpf_helpers.h          |   3 +
> 
>  > > >   .../testing/selftests/bpf/progs/test_pidns_kern.c  |  51 ++++++++
> 
>  > > >   tools/testing/selftests/bpf/test_pidns.c           | 138 +++++++++++++++++++++
> 
>  > > >   15 files changed, 418 insertions(+), 4 deletions(-)
> 
>  > > >   create mode 100644 samples/bpf/trace_ns_info_user.c
> 
>  > > >   create mode 100644 samples/bpf/trace_ns_info_user_kern.c
> 
>  > > >   create mode 100644 tools/testing/selftests/bpf/progs/test_pidns_kern.c
> 
>  > > >   create mode 100644 tools/testing/selftests/bpf/test_pidns.c
> 
>  > > >
> 
>  > > > diff --git a/fs/namei.c b/fs/namei.c
> 
>  > > > index 209c51a5226c..d1eca36972d2 100644
> 
>  > > > --- a/fs/namei.c
> 
>  > > > +++ b/fs/namei.c
> 
>  > > > @@ -19,7 +19,6 @@
> 
>  > > >   #include <linux/export.h>
> 
>  > > >   #include <linux/kernel.h>
> 
>  > > >   #include <linux/slab.h>
> 
>  > > > -#include <linux/fs.h>
> 
>  > > >   #include <linux/namei.h>
> 
>  > > >   #include <linux/pagemap.h>
> 
>  > > >   #include <linux/fsnotify.h>
> 
>  > > > @@ -2355,6 +2354,7 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
> 
>  > > >     putname(name);
> 
>  > > >     return retval;
> 
>  > > >   }
> 
>  > > > +EXPORT_SYMBOL(filename_lookup);
> 
>  > >
> 
>  > > No need to export symbols. bpf uses it and bpf is in the core, not in
> 
>  > > modules.
> 
>  > >
> 
>  > > >
> 
>  > > >   /* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
> 
>  > > >   static int path_parentat(struct nameidata *nd, unsigned flags,
> 
>  > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> 
>  > > > index f9a506147c8a..e4adf5e05afd 100644
> 
>  > > > --- a/include/linux/bpf.h
> 
>  > > > +++ b/include/linux/bpf.h
> 
>  > > > @@ -1050,6 +1050,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
> 
>  > > >   extern const struct bpf_func_proto bpf_strtol_proto;
> 
>  > > >   extern const struct bpf_func_proto bpf_strtoul_proto;
> 
>  > > >   extern const struct bpf_func_proto bpf_tcp_sock_proto;
> 
>  > > > +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
> 
>  > > >
> 
>  > > >   /* Shared helpers among cBPF and eBPF. */
> 
>  > > >   void bpf_user_rnd_init_once(void);
> 
>  > > > diff --git a/include/linux/namei.h b/include/linux/namei.h
> 
>  > > > index 9138b4471dbf..2c24e8c71d46 100644
> 
>  > > > --- a/include/linux/namei.h
> 
>  > > > +++ b/include/linux/namei.h
> 
>  > > > @@ -6,6 +6,7 @@
> 
>  > > >   #include <linux/path.h>
> 
>  > > >   #include <linux/fcntl.h>
> 
>  > > >   #include <linux/errno.h>
> 
>  > > > +#include <linux/fs.h>
> 
>  > > >
> 
>  > > >   enum { MAX_NESTED_LINKS = 8 };
> 
>  > > >
> 
>  > > > @@ -97,6 +98,9 @@ extern void unlock_rename(struct dentry *, struct dentry *);
> 
>  > > >
> 
>  > > >   extern void nd_jump_link(struct path *path);
> 
>  > > >
> 
>  > > > +extern int filename_lookup(int dfd, struct filename *name, unsigned int flags,
> 
>  > > > +               struct path *path, struct path *root);
> 
>  > >
> 
>  > > The previous definition in fs/internal.h should be removed.
> 
>  > >
> 
>  > > > +
> 
>  > > >   static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
> 
>  > > >   {
> 
>  > > >     ((char *) name)[min(len, maxlen)] = '\0';
> 
>  > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> 
>  > > > index 4393bd4b2419..6f601f7106e2 100644
> 
>  > > > --- a/include/uapi/linux/bpf.h
> 
>  > > > +++ b/include/uapi/linux/bpf.h
> 
>  > > > @@ -2741,6 +2741,26 @@ union bpf_attr {
> 
>  > > >    *                **-EOPNOTSUPP** kernel configuration does not enable SYN cookies
> 
>  > > >    *
> 
>  > > >    *                **-EPROTONOSUPPORT** IP packet version is not 4 or 6
> 
>  > > > + *
> 
>  > > > + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
> 
>  > > > + * Description
> 
>  > > > + *         Copies into *pidns* pid, namespace id and tgid as seen by the
> 
>  > > > + *         current namespace and also device from /proc/self/ns/pid.
> 
>  > > > + *         *size_of_pidns* must be the size of *pidns*
> 
>  > > > + *
> 
>  > > > + *         This helper is used when pid filtering is needed inside a
> 
>  > > > + *         container, as the bpf_get_current_pid_tgid() helper always
> 
>  > > > + *         returns the pid as seen by the root namespace.
> 
>  > > > + * Return
> 
>  > > > + *         0 on success
> 
>  > > > + *
> 
>  > > > + *         **-EINVAL**  if unable to get ns, pid or tgid of current task.
> 
>  > > > + *         Or if size_of_pidns is not valid.
> 
>  > >
> 
>  > > Maybe reword by following the code sequence.
> 
>  > >     if *size_of_pidns* is not valid or unable to get ns, pid or tgid of
> 
>  > >     the current task.
> 
>  > >
> 
>  > > > + *
> 
>  > > > + *         **-ENOMEM**  if allocation fails.
> 
>  > >
> 
>  > > Maybe some other error codes in filename_lookup() function?
> 
>  > >
> 
>  > > > + *
> 
>  > > > + *         If unable to get the inode from /proc/self/ns/pid an error code
> 
>  > > > + *         will be returned.
> 
>  > >
> 
>  > > You do not need this. The description of error code cases should cover this.
> 
>  > >
> 
>  > > >    */
> 
>  > > >   #define __BPF_FUNC_MAPPER(FN)             \
> 
>  > > >     FN(unspec),                     \
> 
>  > > > @@ -2853,7 +2873,8 @@ union bpf_attr {
> 
>  > > >     FN(sk_storage_get),             \
> 
>  > > >     FN(sk_storage_delete),          \
> 
>  > > >     FN(send_signal),                \
> 
>  > > > -   FN(tcp_gen_syncookie),
> 
>  > > > +   FN(tcp_gen_syncookie),          \
> 
>  > > > +   FN(get_current_pidns_info),
> 
>  > > >
> 
>  > > >   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> 
>  > > >    * function eBPF program intends to call
> 
>  > > > @@ -3604,4 +3625,10 @@ struct bpf_sockopt {
> 
>  > > >     __s32   retval;
> 
>  > > >   };
> 
>  > > >
> 
>  > > > +struct bpf_pidns_info {
> 
>  > > > +   __u32 dev;
> 
>  > > > +   __u32 nsid;
> 
>  > > > +   __u32 tgid;
> 
>  > > > +   __u32 pid;
> 
>  > > > +};
> 
>  > > >   #endif /* _UAPI__LINUX_BPF_H__ */
> 
>  > > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> 
>  > > > index 8191a7db2777..3159f2a0188c 100644
> 
>  > > > --- a/kernel/bpf/core.c
> 
>  > > > +++ b/kernel/bpf/core.c
> 
>  > > > @@ -2038,6 +2038,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
> 
>  > > >   const struct bpf_func_proto bpf_get_current_comm_proto __weak;
> 
>  > > >   const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
> 
>  > > >   const struct bpf_func_proto bpf_get_local_storage_proto __weak;
> 
>  > > > +const struct bpf_func_proto bpf_get_current_pidns_info_proto __weak;
> 
>  > > >
> 
>  > > >   const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
> 
>  > > >   {
> 
>  > > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> 
>  > > > index 5e28718928ca..571f24077db2 100644
> 
>  > > > --- a/kernel/bpf/helpers.c
> 
>  > > > +++ b/kernel/bpf/helpers.c
> 
>  > > > @@ -11,6 +11,12 @@
> 
>  > > >   #include <linux/uidgid.h>
> 
>  > > >   #include <linux/filter.h>
> 
>  > > >   #include <linux/ctype.h>
> 
>  > > > +#include <linux/pid_namespace.h>
> 
>  > > > +#include <linux/major.h>
> 
>  > > > +#include <linux/stat.h>
> 
>  > > > +#include <linux/namei.h>
> 
>  > > > +#include <linux/version.h>
> 
>  > > > +
> 
>  > > >
> 
>  > > >   #include "../../lib/kstrtox.h"
> 
>  > > >
> 
>  > > > @@ -312,6 +318,78 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
> 
>  > > >     preempt_enable();
> 
>  > > >   }
> 
>  > > >
> 
>  > > > +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
> 
>  > > > +    size)
> 
>  > > > +{
> 
>  > > > +   const char *name = "/proc/self/ns/pid";
> 
>  > >
> 
>  > > maybe rename this variable to pidns_path?
> 
>  > >
> 
>  > > > +   struct pid_namespace *pidns = NULL;
> 
>  > > > +   struct filename *tmp = NULL;
> 
>  > >
> 
>  > > Maybe rename this variable to name?
> 
>  > >
> 
>  > > > +   int len = strlen(name) + 1;
> 
>  > >
> 
>  > > We can delay this assignment later until it is needed.
> 
>  > >
> 
>  > > > +   struct inode *inode;
> 
>  > > > +   struct path kp;
> 
>  > > > +   pid_t tgid = 0;
> 
>  > > > +   pid_t pid = 0;
> 
>  > > > +   int ret;
> 
>  > > > +
> 
>  > > > +   if (unlikely(size != sizeof(struct bpf_pidns_info)))
> 
>  > > > +           return -EINVAL;
> 
>  > > > +
> 
>  > > > +   pidns = task_active_pid_ns(current);
> 
>  > > > +
> 
>  > >
> 
>  > > we can save an empty line here.
> 
>  > >
> 
>  > > > +   if (unlikely(!pidns))
> 
>  > > > +           goto clear;
> 
>  > > > +
> 
>  > > > +   pidns_info->nsid =  pidns->ns.inum;
> 
>  > > > +   pid = task_pid_nr_ns(current, pidns);
> 
>  > > > +
> 
>  > >
> 
>  > > We can save an empty line here.
> 
>  > >
> 
>  > > > +   if (unlikely(!pid))
> 
>  > > > +           goto clear;
> 
>  > > > +
> 
>  > > > +   tgid = task_tgid_nr_ns(current, pidns);
> 
>  > > > +
> 
>  > > ditto. save an empty line.
> 
>  > > > +   if (unlikely(!tgid))
> 
>  > > > +           goto clear;
> 
>  > > > +
> 
>  > > > +   pidns_info->tgid = (u32) tgid;
> 
>  > > > +   pidns_info->pid = (u32) pid;
> 
>  > > > +
> 
>  > > > +   tmp = kmem_cache_alloc(names_cachep, GFP_ATOMIC);
> 
>  > > > +   if (unlikely(!tmp)) {
> 
>  > > > +           memset((void *)pidns_info, 0, (size_t) size);
> 
>  > > > +           return -ENOMEM;
> 
>  > > > +   }
> 
>  > > > +
> 
>  > > > +   memcpy((char *)tmp->name, name, len);
> 
>  > > > +   tmp->uptr = NULL;
> 
>  > > > +   tmp->aname = NULL;
> 
>  > > > +   tmp->refcnt = 1;
> 
>  > > > +
> 
>  > > ditto. save an empty line.
> 
>  > > > +   ret = filename_lookup(AT_FDCWD, tmp, 0, &kp, NULL);
> 
>  > > > +
> 
>  > > ditto. save an empty line.
> 
>  > > > +   if (ret) {
> 
>  > > > +           memset((void *)pidns_info, 0, (size_t) size);
> 
>  > > > +           return ret;
> 
>  > > > +   }
> 
>  > > > +
> 
>  > > > +   inode = d_backing_inode(kp.dentry);
> 
>  > > > +   pidns_info->dev = inode->i_sb->s_dev;
> 
>  > > > +
> 
>  > > > +   return 0;
> 
>  > > > +
> 
>  > > > +clear:
> 
>  > > > +   memset((void *)pidns_info, 0, (size_t) size);
> 
>  > > > +
> 
>  > > save an empty line.
> 
>  > > > +   return -EINVAL;
> 
>  > > > +}
> 
>  > > > +
> 
>  > > > +const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
> 
>  > > > +   .func   = bpf_get_current_pidns_info,
> 
>  > > make the "= " aligned with others?
> 
>  > > > +   .gpl_only       = false,
> 
>  > > > +   .ret_type       = RET_INTEGER,
> 
>  > > > +   .arg1_type      = ARG_PTR_TO_UNINIT_MEM,
> 
>  > > > +   .arg2_type      = ARG_CONST_SIZE,
> 
>  > > > +};
> 
>  > > > +
> 
>  > > >   #ifdef CONFIG_CGROUPS
> 
>  > > >   BPF_CALL_0(bpf_get_current_cgroup_id)
> 
>  > > >   {
> 
>  > > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> 
>  > > > index ca1255d14576..5e1dc22765a5 100644
> 
>  > > > --- a/kernel/trace/bpf_trace.c
> 
>  > > > +++ b/kernel/trace/bpf_trace.c
> 
>  > > > @@ -709,6 +709,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> 
>  > > >   #endif
> 
>  > > >     case BPF_FUNC_send_signal:
> 
>  > > >             return &bpf_send_signal_proto;
> 
>  > > > +   case BPF_FUNC_get_current_pidns_info:
> 
>  > > > +           return &bpf_get_current_pidns_info_proto;
> 
>  > > >     default:
> 
>  > > >             return NULL;
> 
>  > > >     }
> 
>  > > > diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> 
>  > > > index 1d9be26b4edd..238453ff27d2 100644
> 
>  > > > --- a/samples/bpf/Makefile
> 
>  > > > +++ b/samples/bpf/Makefile
> 
>  > > > @@ -53,6 +53,7 @@ hostprogs-y += task_fd_query
> 
>  > > >   hostprogs-y += xdp_sample_pkts
> 
>  > > >   hostprogs-y += ibumad
> 
>  > > >   hostprogs-y += hbm
> 
>  > > > +hostprogs-y += trace_ns_info
> 
>  > > [...]
> 

^ permalink raw reply

* Re: [PATCH net-next] net: openvswitch: Set OvS recirc_id from tc chain index
From: Pravin Shelar @ 2019-08-08 20:53 UTC (permalink / raw)
  To: Paul Blakey
  Cc: Linux Kernel Network Developers, David S. Miller, Justin Pettit,
	Simon Horman, Marcelo Ricardo Leitner, Vlad Buslov, Jiri Pirko,
	Roi Dayan, Yossi Kuperman, Rony Efraim, Oz Shlomo
In-Reply-To: <1565179722-22488-1-git-send-email-paulb@mellanox.com>

On Wed, Aug 7, 2019 at 5:08 AM Paul Blakey <paulb@mellanox.com> wrote:
>
> Offloaded OvS datapath rules are translated one to one to tc rules,
> for example the following simplified OvS rule:
>
> recirc_id(0),in_port(dev1),eth_type(0x0800),ct_state(-trk) actions:ct(),recirc(2)
>
> Will be translated to the following tc rule:
>
> $ tc filter add dev dev1 ingress \
>             prio 1 chain 0 proto ip \
>                 flower tcp ct_state -trk \
>                 action ct pipe \
>                 action goto chain 2
>
> Received packets will first travel through tc, and if they aren't stolen
> by it, like in the above rule, they will continue to the OvS datapath.
> Since we already did some actions (action ct in this case) which might
> modify the packets, and updated action stats, we would like to continue
> the processing with the correct recirc_id in OvS (here recirc_id(2))
> where we left off.
>
> To support this, introduce a new skb extension for tc, which
> will be used for translating tc chain to ovs recirc_id to
> handle these miss cases. Last tc chain index will be set
> by tc goto chain action and read by OvS datapath.
>
> Signed-off-by: Paul Blakey <paulb@mellanox.com>
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
> Acked-by: Jiri Pirko <jiri@mellanox.com>
> ---
>  include/linux/skbuff.h    | 13 +++++++++++++
>  include/net/sch_generic.h |  5 ++++-
>  net/core/skbuff.c         |  6 ++++++
>  net/openvswitch/flow.c    |  9 +++++++++
>  net/sched/Kconfig         | 13 +++++++++++++
>  net/sched/act_api.c       |  1 +
>  net/sched/cls_api.c       | 12 ++++++++++++
>  7 files changed, 58 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 3aef8d8..fb2a792 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -279,6 +279,16 @@ struct nf_bridge_info {
>  };
>  #endif
>
> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
> +/* Chain in tc_skb_ext will be used to share the tc chain with
> + * ovs recirc_id. It will be set to the current chain by tc
> + * and read by ovs to recirc_id.
> + */
> +struct tc_skb_ext {
> +       __u32 chain;
> +};
> +#endif
> +
>  struct sk_buff_head {
>         /* These two members must be first. */
>         struct sk_buff  *next;
> @@ -4050,6 +4060,9 @@ enum skb_ext_id {
>  #ifdef CONFIG_XFRM
>         SKB_EXT_SEC_PATH,
>  #endif
> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
> +       TC_SKB_EXT,
> +#endif
>         SKB_EXT_NUM, /* must be last */
>  };
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 6b6b012..871feea 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -275,7 +275,10 @@ struct tcf_result {
>                         unsigned long   class;
>                         u32             classid;
>                 };
> -               const struct tcf_proto *goto_tp;
> +               struct {
> +                       const struct tcf_proto *goto_tp;
> +                       u32 goto_index;
> +               };
>
>                 /* used in the skb_tc_reinsert function */
>                 struct {
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index ea8e8d3..2b40b5a 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4087,6 +4087,9 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
>  #ifdef CONFIG_XFRM
>         [SKB_EXT_SEC_PATH] = SKB_EXT_CHUNKSIZEOF(struct sec_path),
>  #endif
> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
> +       [TC_SKB_EXT] = SKB_EXT_CHUNKSIZEOF(struct tc_skb_ext),
> +#endif
>  };
>
>  static __always_inline unsigned int skb_ext_total_length(void)
> @@ -4098,6 +4101,9 @@ static __always_inline unsigned int skb_ext_total_length(void)
>  #ifdef CONFIG_XFRM
>                 skb_ext_type_len[SKB_EXT_SEC_PATH] +
>  #endif
> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
> +               skb_ext_type_len[TC_SKB_EXT] +
> +#endif
>                 0;
>  }
>
> diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
> index bc89e16..0287ead 100644
> --- a/net/openvswitch/flow.c
> +++ b/net/openvswitch/flow.c
> @@ -816,6 +816,9 @@ static int key_extract_mac_proto(struct sk_buff *skb)
>  int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
>                          struct sk_buff *skb, struct sw_flow_key *key)
>  {
> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
> +       struct tc_skb_ext *tc_ext;
> +#endif
>         int res, err;
>
>         /* Extract metadata from packet. */
> @@ -848,7 +851,13 @@ int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
>         if (res < 0)
>                 return res;
>         key->mac_proto = res;
> +
> +#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
> +       tc_ext = skb_ext_find(skb, TC_SKB_EXT);
> +       key->recirc_id = tc_ext ? tc_ext->chain : 0;
> +#else
>         key->recirc_id = 0;
> +#endif
>
In most cases the config would be turned on, so the ifdef is not that
useful. Can you add a static key to avoid searching the skb-ext in
non-offload cases?
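
A rough user-space sketch of that idea, assuming a plain boolean stands in
for the kernel static key (in the kernel this would be something like
DEFINE_STATIC_KEY_FALSE() tested with static_branch_unlikely()). All names
below (tc_ext_enabled, fake_skb, extract_recirc_id) are made up for
illustration and are not from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct tc_skb_ext from the patch. */
struct tc_ext {
	uint32_t chain;
};

/* Hypothetical, simplified packet: ext is NULL when no extension is set. */
struct fake_skb {
	struct tc_ext *ext;
};

/* Stand-in for a static key. Assumption: it would be switched on when
 * offload is set up, and stay off otherwise.
 */
static bool tc_ext_enabled;

/* Counts how often the extension lookup actually runs. */
static unsigned int ext_lookups;

static struct tc_ext *ext_find(const struct fake_skb *skb)
{
	ext_lookups++;
	return skb->ext;
}

/* Only search for the extension when the key is on; otherwise take the
 * fast path and use recirc_id 0, as the code did before the patch.
 */
static uint32_t extract_recirc_id(const struct fake_skb *skb)
{
	struct tc_ext *ext;

	assert(skb != NULL);
	if (!tc_ext_enabled)
		return 0;

	ext = ext_find(skb);
	return ext ? ext->chain : 0;
}
```

With the key off, no lookup is done at all, which is the point of the
request; where exactly the key gets enabled is left open here.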

^ permalink raw reply

* Re: [PATCH net-next 00/10] drop_monitor: Capture dropped packets and metadata
From: David Ahern @ 2019-08-08 21:08 UTC (permalink / raw)
  To: Ido Schimmel, netdev
  Cc: davem, nhorman, jiri, toke, roopa, nikolay, jakub.kicinski, andy,
	f.fainelli, andrew, vivien.didelot, mlxsw, Ido Schimmel
In-Reply-To: <20190807103059.15270-1-idosch@idosch.org>

On 8/7/19 4:30 AM, Ido Schimmel wrote:
> Example usage with patched dropwatch [1] can be found here [2]. Example
> dissection of drop monitor netlink events with patched wireshark [3] can
> be found here [4]. I will submit both changes upstream after the kernel
> changes are accepted. Another change worth making is adding a dropmon
> pseudo interface to libpcap, similar to the nflog interface [5]. This
> will allow users to specifically listen on dropmon traffic instead of
> capturing all netlink packets via the nlmon netdev.

Nice work, Ido.

On top of your dropwatch changes I added the ability to print the
payload as hex, e.g.:

Issue Ctrl-C to stop monitoring
drop at: nf_hook_slow+0x59/0x98 (0xffffffff814ec532)
input port ifindex: 1
timestamp: Thu Aug  8 15:04:02 2019 360015026 nsec
length: 64
00 00 00 00 00 00 00 00  00 00 00 00 08 00 45 00      ........ ......E.
00 3c e7 50 40 00 40 06  55 69 7f 00 00 01 7f 00      .<.P@.@. Ui......
00 01 80 2c 30 39 74 b9  c7 4d 00 00 00 00 a0 02      ...,09t. .M......
ff d7 fe 30 00 00 02 04  ff d7 04 02 08 0a 53 79      ...0.... ......Sy
original length: 74


It seems the skb protocol is also needed to properly parse the payload,
i.e., to know that it starts with an Ethernet header, followed by IP and TCP.
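
To illustrate, here is a small sketch that decodes the start of the dump
above under the hard-coded assumption that the payload begins with an
Ethernet header. The names (payload, parsed, parse_eth_ipv4_tcp) are
hypothetical and not part of dropwatch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First 38 bytes of the payload shown in the dump above. */
static const uint8_t payload[] = {
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00,	/* destination MAC */
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00,	/* source MAC */
	0x08, 0x00,				/* EtherType */
	0x45, 0x00, 0x00, 0x3c, 0xe7, 0x50, 0x40, 0x00,
	0x40, 0x06, 0x55, 0x69,			/* TTL, protocol, checksum */
	0x7f, 0x00, 0x00, 0x01,			/* source address */
	0x7f, 0x00, 0x00, 0x01,			/* destination address */
	0x80, 0x2c, 0x30, 0x39,			/* TCP source/destination port */
};

struct parsed {
	uint16_t ethertype;
	uint8_t ip_proto;
	uint16_t sport;
	uint16_t dport;
};

/* Read a 16-bit big-endian (network order) value. */
static uint16_t be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Returns 0 on success, -1 if buf is not a parseable Ethernet/IPv4/TCP
 * frame. Assumes the buffer starts with an Ethernet header.
 */
static int parse_eth_ipv4_tcp(const uint8_t *buf, size_t len,
			      struct parsed *out)
{
	size_t ihl;

	assert(buf != NULL && out != NULL);
	if (len < 14 + 20)
		return -1;
	out->ethertype = be16(buf + 12);
	if (out->ethertype != 0x0800)	/* not IPv4 */
		return -1;
	if ((buf[14] >> 4) != 4)	/* IP version field */
		return -1;
	ihl = (size_t)(buf[14] & 0x0f) * 4;
	if (ihl < 20 || len < 14 + ihl + 4)
		return -1;
	out->ip_proto = buf[14 + 9];
	if (out->ip_proto != 6)		/* not TCP */
		return -1;
	out->sport = be16(buf + 14 + ihl);
	out->dport = be16(buf + 14 + ihl + 2);
	return 0;
}
```

With that assumption the sample decodes as TCP from 127.0.0.1 port 32812 to
127.0.0.1 port 12345. Without the protocol in the event, the "starts with
Ethernet" part is only a guess.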

^ permalink raw reply

* Re: [PATCH] powerpc/kmcent2: update the ethernet devices' phy properties
From: Valentin Longchamp @ 2019-08-08 21:09 UTC (permalink / raw)
  To: Madalin-cristian Bucur
  Cc: Scott Wood, linuxppc-dev@lists.ozlabs.org,
	galak@kernel.crashing.org, netdev@vger.kernel.org
In-Reply-To: <VI1PR04MB55679AAE8DDC3160B9CCE073ECDC0@VI1PR04MB5567.eurprd04.prod.outlook.com>

On Tue, Jul 30, 2019 at 11:44, Madalin-cristian Bucur
<madalin.bucur@nxp.com> wrote:
>
> > -----Original Message-----
> >
> > > On Sun, Jul 14, 2019 at 22:05, Valentin Longchamp
> > > <valentin@longchamp.me> wrote:
> > > >
> > > > Change all phy-connection-type properties to phy-mode that are better
> > > > supported by the fman driver.
> > > >
> > > > Use the more readable fixed-link node for the 2 sgmii links.
> > > >
> > > > Change the RGMII link to rgmii-id as the clock delays are added by the
> > > > phy.
> > > >
> > > > Signed-off-by: Valentin Longchamp <valentin@longchamp.me>
> >
> > I don't see any other uses of phy-mode in arch/powerpc/boot/dts/fsl, and I see
> > lots of phy-connection-type with fman.  Madalin, does this patch look OK?
> >
> > -Scott
>
> Hi,
>
> we are using "phy-connection-type" not "phy-mode" for the NXP (former Freescale)
> DPAA platforms. While the two seem to be interchangeable ("phy-mode" seems to be
> more recent, looking at the device tree bindings), the driver code in Linux seems
> to use one or the other, not both so one should stick with the variant the driver
> is using. To make things more complex, there may be dependencies in bootloaders,
> I see code in u-boot using only "phy-connection-type" or only "phy-mode".
>
> I'd leave "phy-connection-type" as is.

So I have finally had time to have a look, and now I understand what
happens. You are right, there are bootloader dependencies: u-boot
calls fdt_fixup_phy_connection(), which in our case somehow adds the
phy-connection-type property (or changes it if it is already in the
device tree) with a wrong value. A phy-mode property in the device
tree is not changed by u-boot and, by chance, is picked up by the
kernel fman driver (via of_get_phy_mode()) in preference to
phy-connection-type, so the patch below fixes it for us.

I agree with you, it's not correct to have both phy-connection-type
and phy-mode. Ideally, u-boot on the board should be reworked so that
it does not perform the above wrong fixup. However, in an "unfixed"
.dtb (I have disabled fdt_fixup_phy_connection), the device tree in
the end only has either phy-connection-type or phy-mode, according to
what was chosen in the .dts file. And the fman driver works well with
both (thanks to the call to of_get_phy_mode()). I would therefore
argue that even if all other DPAA platforms use phy-connection-type,
phy-mode is valid as well. (Furthermore, we already have hundreds of
such boards in the field and we don't really support "remote" u-boot
updates, so the u-boot fix is going to be difficult for us to roll out.)
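
The lookup order described above can be modelled in a few lines of
user-space C. This is only a sketch of the behaviour, not the kernel code:
struct prop, find_prop() and get_phy_mode() are hypothetical names, and the
property values in the usage note are examples:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified device tree property: a name/value pair. */
struct prop {
	const char *name;
	const char *value;
};

/* Return the value of the first property called name, or NULL. */
static const char *find_prop(const struct prop *props, size_t n,
			     const char *name)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(props[i].name, name) == 0)
			return props[i].value;
	return NULL;
}

/* Model of the lookup order: "phy-mode" wins, and "phy-connection-type"
 * is only consulted when "phy-mode" is absent.
 */
static const char *get_phy_mode(const struct prop *props, size_t n)
{
	const char *mode;

	assert(props != NULL || n == 0);
	mode = find_prop(props, n, "phy-mode");
	if (!mode)
		mode = find_prop(props, n, "phy-connection-type");
	return mode;
}
```

For a node carrying both phy-mode = "rgmii-id" and a bootloader-added
phy-connection-type = "rgmii", this model returns "rgmii-id", which matches
what we see on our boards.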

Valentin

>
> Madalin
>
> > > > ---
> > > >  arch/powerpc/boot/dts/fsl/kmcent2.dts | 16 +++++++++++-----
> > > >  1 file changed, 11 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/arch/powerpc/boot/dts/fsl/kmcent2.dts
> > > > b/arch/powerpc/boot/dts/fsl/kmcent2.dts
> > > > index 48b7f9797124..c3e0741cafb1 100644
> > > > --- a/arch/powerpc/boot/dts/fsl/kmcent2.dts
> > > > +++ b/arch/powerpc/boot/dts/fsl/kmcent2.dts
> > > > @@ -210,13 +210,19 @@
> > > >
> > > >                 fman@400000 {
> > > >                         ethernet@e0000 {
> > > > -                               fixed-link = <0 1 1000 0 0>;
> > > > -                               phy-connection-type = "sgmii";
> > > > +                               phy-mode = "sgmii";
> > > > +                               fixed-link {
> > > > +                                       speed = <1000>;
> > > > +                                       full-duplex;
> > > > +                               };
> > > >                         };
> > > >
> > > >                         ethernet@e2000 {
> > > > -                               fixed-link = <1 1 1000 0 0>;
> > > > -                               phy-connection-type = "sgmii";
> > > > +                               phy-mode = "sgmii";
> > > > +                               fixed-link {
> > > > +                                       speed = <1000>;
> > > > +                                       full-duplex;
> > > > +                               };
> > > >                         };
> > > >
> > > >                         ethernet@e4000 {
> > > > @@ -229,7 +235,7 @@
> > > >
> > > >                         ethernet@e8000 {
> > > >                                 phy-handle = <&front_phy>;
> > > > -                               phy-connection-type = "rgmii";
> > > > +                               phy-mode = "rgmii-id";
> > > >                         };
> > > >
> > > >                         mdio0: mdio@fc000 {
> > > > --
> > > > 2.17.1
> > > >
> > >
> > >
>

^ permalink raw reply

