Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH iproute2-rc 1/8] rdma: Update uapi headers to add statistic counter support
From: Stephen Hemminger @ 2019-07-15 20:52 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Leon Romanovsky, netdev, David Ahern, Mark Zhang,
	RDMA mailing list
In-Reply-To: <20190710072455.9125-2-leon@kernel.org>

On Wed, 10 Jul 2019 10:24:48 +0300
Leon Romanovsky <leon@kernel.org> wrote:

> From: Mark Zhang <markz@mellanox.com>
> 
> Update rdma_netlink.h to kernel commit 6e7be47a5345 ("RDMA/nldev:
> Allow get default counter statistics through RDMA netlink").
> 
> Signed-off-by: Mark Zhang <markz@mellanox.com>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>

I am waiting on this until it gets to Linus's tree.

^ permalink raw reply

* Re: [bpf PATCH v2 2/6] bpf: tls fix transition through disconnect with close
From: John Fastabend @ 2019-07-15 20:58 UTC (permalink / raw)
  To: Jakub Kicinski, John Fastabend; +Cc: ast, daniel, netdev, edumazet, bpf
In-Reply-To: <20190711201633.552292e6@cakuba.netronome.com>

Jakub Kicinski wrote:
> On Thu, 11 Jul 2019 14:25:54 -0700, John Fastabend wrote:
> > Jakub Kicinski wrote:
> > > On Thu, 11 Jul 2019 09:47:16 -0700, John Fastabend wrote:  
> > > > Jakub Kicinski wrote:  
> > > > > On Wed, 10 Jul 2019 12:34:17 -0700, Jakub Kicinski wrote:    
> > > > > > > > > +		if (sk->sk_prot->unhash)
> > > > > > > > > +			sk->sk_prot->unhash(sk);
> > > > > > > > > +	}
> > > > > > > > > +
> > > > > > > > > +	ctx = tls_get_ctx(sk);
> > > > > > > > > +	if (ctx->tx_conf == TLS_SW || ctx->rx_conf == TLS_SW)
> > > > > > > > > +		tls_sk_proto_cleanup(sk, ctx, timeo);    
> > > > > 
> > > > > Do we still need to hook into unhash? With patch 6 in place perhaps we
> > > > > can just do disconnect 🥺    
> > > > 
> > > > ?? "can just do a disconnect", not sure I folow. We still need unhash
> > > > in cases where we have a TLS socket transition from ESTABLISHED
> > > > to LISTEN state without calling close(). This is independent of if
> > > > sockmap is running or not.
> > > > 
> > > > Originally, I thought this would be extremely rare but I did see it
> > > > in real applications on the sockmap side so presumably it is possible
> > > > here as well.  
> > > 
> > > Ugh, sorry, I meant shutdown. Instead of replacing the unhash callback
> > > replace the shutdown callback. We probably shouldn't release the socket
> > > lock either there, but we can sleep, so I'll be able to run the device
> > > connection remove callback (which sleep).
> > 
> > ah OK seems doable to me. Do you want to write that on top of this
> > series? Or would you like to push it onto your branch and I can pull
> > it in push the rest of the patches on top and send it out? I think
> > if you can get to it in the next few days then it makes sense to wait.
> 
> Mm.. perhaps its easiest if we forget about HW for now and get SW 
> to work? Once you get the SW to 100% I can probably figure out what 
> to do for HW, but I feel like we got too many moving parts ATM.

Hi Jack,

I went ahead and pushed a v3 with your patches at the front. This resolves
a set of issues for me so I think it makes sense to push now and look
to resolve any further issues later. I'll look into the close with pending
data potential issue to see if it is/is-not a real issue.

Thanks,
John

^ permalink raw reply

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Paul Moore @ 2019-07-15 21:04 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Serge E. Hallyn, Tycho Andersen, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	ebiederm, nhorman
In-Reply-To: <20190708180558.5bar6ripag3sdadl@madcap2.tricolour.ca>

On Mon, Jul 8, 2019 at 2:06 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-05-30 15:29, Paul Moore wrote:

...

> > [REMINDER: It is an "*audit* container ID" and not a general
> > "container ID" ;)  Smiley aside, I'm not kidding about that part.]
> >
> > I'm not interested in supporting/merging something that isn't useful;
> > if this doesn't work for your use case then we need to figure out what
> > would work.  It sounds like nested containers are much more common in
> > the lxc world, can you elaborate a bit more on this?
> >
> > As far as the possible solutions you mention above, I'm not sure I
> > like the per-userns audit container IDs, I'd much rather just emit the
> > necessary tracking information via the audit record stream and let the
> > log analysis tools figure it out.  However, the bigger question is how
> > to limit (re)setting the audit container ID when you are in a non-init
> > userns.  For reasons already mentioned, using capable() is a non
> > starter for everything but the initial userns, and using ns_capable()
> > is equally poor as it essentially allows any userns the ability to
> > munge it's audit container ID (obviously not good).  It appears we
> > need a different method for controlling access to the audit container
> > ID.
>
> We're not quite ready yet for multiple audit daemons and possibly not
> yet for audit namespaces, but this is starting to look a lot like the
> latter.

A few quick comments on audit namespaces: the audit container ID is
not envisioned as a new namespace (even in nested form) and neither do
I consider running multiple audit daemons to be a new namespace.

From my perspective we create namespaces to allow us to redefine a
global resource for some subset of the system, e.g. providing a unique
/tmp for some number of processes on the system.  While it may be
tempting to think of the audit container ID as something we could
"namespace", especially when multiple audit daemons are concerned, in
some ways this would be counter productive; the audit container ID is
intended to be a global ID that can be used to associate audit event
records with a "container" where the "container" is defined by an
orchestrator outside the audit subsystem.  The global nature of the
audit container ID allows us to maintain a sane(ish) view of the
system in the audit log, if we were to "namespace" the audit container
ID such that the value was no longer guaranteed to be unique
throughout the system, we would need to additionally track the audit
namespace along with the audit container ID which starts to border on
insanity IMHO.

> If we can't trust ns_capable() then why are we passing on
> CAP_AUDIT_CONTROL?  It is being passed down and not stripped purposely
> by the orchestrator/engine.  If ns_capable() isn't inherited how is it
> gained otherwise?  Can it be inserted by cotainer image?  I think the
> answer is "no".  Either we trust ns_capable() or we have audit
> namespaces (recommend based on user namespace) (or both).

My thinking is that since ns_capable() checks the credentials with
respect to the current user namespace we can't rely on it to control
access since it would be possible for a privileged process running
inside an unprivileged container to manipulate the audit container ID
(containerized process has CAP_AUDIT_CONTROL, e.g. running as root in
the container, while the container itself does not).

> At this point I would say we are at an impasse unless we trust
> ns_capable() or we implement audit namespaces.

I'm not sure how we can trust ns_capable(), but if you can think of a
way I would love to hear it.  I'm also not sure how namespacing audit
is helpful (see my above comments), but if you think it is please
explain.

> I don't think another mechanism to trust nested orchestrators/engines
> will buy us anything.
>
> Am I missing something?

Based on your questions/comments above it looks like your
understanding of ns_capable() does not match mine; if I'm thinking
about ns_capable() incorrectly, please educate me.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* [PATCH v1] net: fec: optionally reset PHY via a reset-controller
From: Sven Van Asbroeck @ 2019-07-15 21:05 UTC (permalink / raw)
  To: Fugang Duan; +Cc: David S . Miller, netdev, linux-kernel

The current fec driver allows the PHY to be reset via a gpio,
specified in the devicetree. However, some PHYs need to be reset
in a more complex way.

To accommodate such PHYs, allow an optional reset controller
in the fec devicetree. If no reset controller is found, the
fec will fall back to the legacy reset behaviour.

Example:
&fec {
	phy-mode = "rgmii";
	resets = <&phy_reset>;
	reset-names = "phy";
	status = "okay";
};

Signed-off-by: Sven Van Asbroeck <TheSven73@gmail.com>
---

Will send a Documentation patch if this receives a positive review.

 drivers/net/ethernet/freescale/fec_main.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 38f10f7dcbc3..5a5f3ed6f16d 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -61,6 +61,7 @@
 #include <linux/regulator/consumer.h>
 #include <linux/if_vlan.h>
 #include <linux/pinctrl/consumer.h>
+#include <linux/reset.h>
 #include <linux/prefetch.h>
 #include <soc/imx/cpuidle.h>
 
@@ -3335,6 +3336,7 @@ static int fec_enet_get_irq_cnt(struct platform_device *pdev)
 static int
 fec_probe(struct platform_device *pdev)
 {
+	struct reset_control *phy_reset;
 	struct fec_enet_private *fep;
 	struct fec_platform_data *pdata;
 	struct net_device *ndev;
@@ -3490,7 +3492,9 @@ fec_probe(struct platform_device *pdev)
 	pm_runtime_set_active(&pdev->dev);
 	pm_runtime_enable(&pdev->dev);
 
-	ret = fec_reset_phy(pdev);
+	phy_reset = devm_reset_control_get_exclusive(&pdev->dev, "phy");
+	ret = IS_ERR(phy_reset) ? fec_reset_phy(pdev) :
+			reset_control_reset(phy_reset);
 	if (ret)
 		goto failed_reset;
 
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Paul Moore @ 2019-07-15 21:09 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Tycho Andersen, Serge E. Hallyn, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	ebiederm, nhorman
In-Reply-To: <20190708181237.5poheliito7zpvmc@madcap2.tricolour.ca>

On Mon, Jul 8, 2019 at 2:12 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-05-30 19:26, Paul Moore wrote:

...

> > I like the creativity, but I worry that at some point these
> > limitations are going to be raised (limits have a funny way of doing
> > that over time) and we will be in trouble.  I say "trouble" because I
> > want to be able to quickly do an audit container ID comparison and
> > we're going to pay a penalty for these larger values (we'll need this
> > when we add multiple auditd support and the requisite record routing).
> >
> > Thinking about this makes me also realize we probably need to think a
> > bit longer about audit container ID conflicts between orchestrators.
> > Right now we just take the value that is given to us by the
> > orchestrator, but if we want to allow multiple container orchestrators
> > to work without some form of cooperation in userspace (I think we have
> > to assume the orchestrators will not talk to each other) we likely
> > need to have some way to block reuse of an audit container ID.  We
> > would either need to prevent the orchestrator from explicitly setting
> > an audit container ID to a currently in use value, or instead generate
> > the audit container ID in the kernel upon an event triggered by the
> > orchestrator (e.g. a write to a /proc file).  I suspect we should
> > start looking at the idr code, I think we will need to make use of it.
>
> To address this, I'd suggest that it is enforced to only allow the
> setting of descendants and to maintain a master list of audit container
> identifiers (with a hash table if necessary later) that includes the
> container owner.

We're discussing the audit container ID management policy elsewhere in
this thread so I won't comment on that here, but I did want to say
that we will likely need something better than a simple list of audit
container IDs from the start.  It's common for systems to have
thousands of containers now (or multiple thousands), which tells me
that a list is a poor choice.  You mentioned a hash table, so I would
suggest starting with that over the list for the initial patchset.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* [PATCH] bnx2x: Prevent load reordering in tx completion processing
From: Brian King @ 2019-07-15 21:41 UTC (permalink / raw)
  To: GR-everest-linux-l2; +Cc: skalluru, aelior, netdev, Brian King

This patch fixes an issue seen on Power systems with bnx2x which results
in the skb is NULL WARN_ON in bnx2x_free_tx_pkt firing due to the skb
pointer getting loaded in bnx2x_free_tx_pkt prior to the hw_cons
load in bnx2x_tx_int. Adding a read memory barrier resolves the issue.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 656ed80..e2be5a6 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -285,6 +285,9 @@ int bnx2x_tx_int(struct bnx2x *bp, struct bnx2x_fp_txdata *txdata)
 	hw_cons = le16_to_cpu(*txdata->tx_cons_sb);
 	sw_cons = txdata->tx_pkt_cons;
 
+	/* Ensure subsequent loads occur after hw_cons */
+	smp_rmb();
+
 	while (sw_cons != hw_cons) {
 		u16 pkt_cons;
 
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH 8/9] acpi: Use built-in RCU list checking for acpi_ioremaps list (v1)
From: Rafael J. Wysocki @ 2019-07-15 21:44 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: Linux Kernel Mailing List, Alexey Kuznetsov, Bjorn Helgaas,
	Borislav Petkov, c0d1n61at3, David S. Miller, edumazet,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jonathan Corbet, Josh Triplett, Kees Cook,
	Kernel Hardening, Cc: Android Kernel, Lai Jiangshan, Len Brown,
	ACPI Devel Maling List, open list:DOCUMENTATION, Linux PCI,
	Linux PM, Mathieu Desnoyers, NeilBrown, netdev, Oleg Nesterov,
	Paul E. McKenney, Pavel Machek, Peter Zijlstra, Rafael J. Wysocki,
	Rasmus Villemoes, rcu, Steven Rostedt, Tejun Heo, Thomas Gleixner,
	Will Deacon, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)
In-Reply-To: <20190715143705.117908-9-joel@joelfernandes.org>

On Mon, Jul 15, 2019 at 4:43 PM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> list_for_each_entry_rcu has built-in RCU and lock checking. Make use of
> it for acpi_ioremaps list traversal.
>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  drivers/acpi/osl.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> index 9c0edf2fc0dd..2f9d0d20b836 100644
> --- a/drivers/acpi/osl.c
> +++ b/drivers/acpi/osl.c
> @@ -14,6 +14,7 @@
>  #include <linux/slab.h>
>  #include <linux/mm.h>
>  #include <linux/highmem.h>
> +#include <linux/lockdep.h>
>  #include <linux/pci.h>
>  #include <linux/interrupt.h>
>  #include <linux/kmod.h>
> @@ -80,6 +81,7 @@ struct acpi_ioremap {
>
>  static LIST_HEAD(acpi_ioremaps);
>  static DEFINE_MUTEX(acpi_ioremap_lock);
> +#define acpi_ioremap_lock_held() lock_is_held(&acpi_ioremap_lock.dep_map)
>
>  static void __init acpi_request_region (struct acpi_generic_address *gas,
>         unsigned int length, char *desc)
> @@ -206,7 +208,7 @@ acpi_map_lookup(acpi_physical_address phys, acpi_size size)
>  {
>         struct acpi_ioremap *map;
>
> -       list_for_each_entry_rcu(map, &acpi_ioremaps, list)
> +       list_for_each_entry_rcu(map, &acpi_ioremaps, list, acpi_ioremap_lock_held())
>                 if (map->phys <= phys &&
>                     phys + size <= map->phys + map->size)
>                         return map;
> @@ -249,7 +251,7 @@ acpi_map_lookup_virt(void __iomem *virt, acpi_size size)
>  {
>         struct acpi_ioremap *map;
>
> -       list_for_each_entry_rcu(map, &acpi_ioremaps, list)
> +       list_for_each_entry_rcu(map, &acpi_ioremaps, list, acpi_ioremap_lock_held())
>                 if (map->virt <= virt &&
>                     virt + size <= map->virt + map->size)
>                         return map;
> --
> 2.22.0.510.g264f2c817a-goog
>

^ permalink raw reply

* Re: [PATCH v2 bpf-next] selftests/bpf: fix "alu with different scalars 1" on s390
From: Alexei Starovoitov @ 2019-07-15 22:13 UTC (permalink / raw)
  To: Ilya Leoshkevich; +Cc: bpf, Network Development, Y Song
In-Reply-To: <20190704085224.65223-1-iii@linux.ibm.com>

On Thu, Jul 4, 2019 at 1:53 AM Ilya Leoshkevich <iii@linux.ibm.com> wrote:
>
> BPF_LDX_MEM is used to load the least significant byte of the retrieved
> test_val.index, however, on big-endian machines it ends up retrieving
> the most significant byte.
>
> Use the correct least significant byte offset on big-endian machines.
>
> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
> ---
>
> v1->v2:
> - use __BYTE_ORDER instead of __BYTE_ORDER__.
>
>  tools/testing/selftests/bpf/verifier/value_ptr_arith.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/tools/testing/selftests/bpf/verifier/value_ptr_arith.c b/tools/testing/selftests/bpf/verifier/value_ptr_arith.c
> index c3de1a2c9dc5..e5940c4e8b8f 100644
> --- a/tools/testing/selftests/bpf/verifier/value_ptr_arith.c
> +++ b/tools/testing/selftests/bpf/verifier/value_ptr_arith.c
> @@ -183,7 +183,11 @@
>         BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
>         BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
>         BPF_EXIT_INSN(),
> +#if __BYTE_ORDER == __LITTLE_ENDIAN
>         BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
> +#else
> +       BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, sizeof(int) - 1),
> +#endif

I think tests should be arch and endian independent where possible.
In this case test_val.index is 4 byte and 4 byte load should work just as well.

^ permalink raw reply

* Re: [PATCH v2 bpf-next] selftests/bpf: fix "alu with different scalars 1" on s390
From: Daniel Borkmann @ 2019-07-15 22:16 UTC (permalink / raw)
  To: Alexei Starovoitov, Ilya Leoshkevich; +Cc: bpf, Network Development, Y Song
In-Reply-To: <CAADnVQ+H9bOW+EY6=AKt7mqgdEgaPhc1Wk_o=Ez43CracLCaiA@mail.gmail.com>

On 7/16/19 12:13 AM, Alexei Starovoitov wrote:
> On Thu, Jul 4, 2019 at 1:53 AM Ilya Leoshkevich <iii@linux.ibm.com> wrote:
>>
>> BPF_LDX_MEM is used to load the least significant byte of the retrieved
>> test_val.index, however, on big-endian machines it ends up retrieving
>> the most significant byte.
>>
>> Use the correct least significant byte offset on big-endian machines.
>>
>> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
>> ---
>>
>> v1->v2:
>> - use __BYTE_ORDER instead of __BYTE_ORDER__.
>>
>>  tools/testing/selftests/bpf/verifier/value_ptr_arith.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/tools/testing/selftests/bpf/verifier/value_ptr_arith.c b/tools/testing/selftests/bpf/verifier/value_ptr_arith.c
>> index c3de1a2c9dc5..e5940c4e8b8f 100644
>> --- a/tools/testing/selftests/bpf/verifier/value_ptr_arith.c
>> +++ b/tools/testing/selftests/bpf/verifier/value_ptr_arith.c
>> @@ -183,7 +183,11 @@
>>         BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
>>         BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
>>         BPF_EXIT_INSN(),
>> +#if __BYTE_ORDER == __LITTLE_ENDIAN
>>         BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
>> +#else
>> +       BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, sizeof(int) - 1),
>> +#endif
> 
> I think tests should be arch and endian independent where possible.
> In this case test_val.index is 4 byte and 4 byte load should work just as well.

Yes, agree, this should be fixed with BPF_W as load.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH v3 bpf 0/3] fix BTF verification size resolution
From: Daniel Borkmann @ 2019-07-15 22:18 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, netdev, ast, yhs; +Cc: andrii.nakryiko, kernel-team
In-Reply-To: <20190712172557.4039121-1-andriin@fb.com>

On 7/12/19 7:25 PM, Andrii Nakryiko wrote:
> BTF size resolution logic isn't always resolving type size correctly, leading
> to erroneous map creation failures due to value size mismatch.
> 
> This patch set:
> 1. fixes the issue (patch #1);
> 2. adds tests for trickier cases (patch #2);
> 3. and converts few test cases utilizing BTF-defined maps, that previously
>    couldn't use typedef'ed arrays due to kernel bug (patch #3).
> 
> Andrii Nakryiko (3):
>   bpf: fix BTF verifier size resolution logic
>   selftests/bpf: add trickier size resolution tests
>   selftests/bpf: use typedef'ed arrays as map values
> 
>  kernel/bpf/btf.c                              | 19 ++--
>  .../bpf/progs/test_get_stack_rawtp.c          |  3 +-
>  .../bpf/progs/test_stacktrace_build_id.c      |  3 +-
>  .../selftests/bpf/progs/test_stacktrace_map.c |  2 +-
>  tools/testing/selftests/bpf/test_btf.c        | 88 +++++++++++++++++++
>  5 files changed, 104 insertions(+), 11 deletions(-)
> 

Applied, thanks!

^ permalink raw reply

* Re: [PATCH bpf] selftests/bpf: fix attach_probe on s390
From: Daniel Borkmann @ 2019-07-15 22:18 UTC (permalink / raw)
  To: Ilya Leoshkevich, bpf, netdev; +Cc: gor, heiko.carstens
In-Reply-To: <20190712134142.90668-1-iii@linux.ibm.com>

On 7/12/19 3:41 PM, Ilya Leoshkevich wrote:
> attach_probe test fails, because it cannot install a kprobe on a
> non-existent sys_nanosleep symbol.
> 
> Use the correct symbol name for the nanosleep syscall on 64-bit s390.
> Don't bother adding one for 31-bit mode, since tests are compiled only
> in 64-bit mode.
> 
> Fixes: 1e8611bbdfc9 ("selftests/bpf: add kprobe/uprobe selftests")
> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
> Acked-by: Vasily Gorbik <gor@linux.ibm.com>

Applied, thanks!

^ permalink raw reply

* Re: [PATCH bpf] selftests/bpf: make directory prerequisites order-only
From: Daniel Borkmann @ 2019-07-15 22:19 UTC (permalink / raw)
  To: Ilya Leoshkevich, bpf, netdev; +Cc: gor, heiko.carstens
In-Reply-To: <20190712135631.91398-1-iii@linux.ibm.com>

On 7/12/19 3:56 PM, Ilya Leoshkevich wrote:
> When directories are used as prerequisites in Makefiles, they can cause
> a lot of unnecessary rebuilds, because a directory is considered changed
> whenever a file in this directory is added, removed or modified.
> 
> If the only thing a target is interested in is the existence of the
> directory it depends on, which is the case for selftests/bpf, this
> directory should be specified as an order-only prerequisite: it would
> still be created in case it does not exist, but it would not trigger a
> rebuild of a target in case it's considered changed.
> 
> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>

Applied, thanks!

^ permalink raw reply

* Re: [oss-drivers] Re: [RFC bpf-next 2/8] bpf: extend list based insn patching infra to verification layer
From: Andrii Nakryiko @ 2019-07-15 22:29 UTC (permalink / raw)
  To: Jiong Wang
  Cc: Andrii Nakryiko, Alexei Starovoitov, Daniel Borkmann, Edward Cree,
	Naveen N. Rao, Jakub Kicinski, bpf, Networking, oss-drivers
In-Reply-To: <87h87n39aj.fsf@netronome.com>

On Mon, Jul 15, 2019 at 3:02 AM Jiong Wang <jiong.wang@netronome.com> wrote:
>
>
> Andrii Nakryiko writes:
>
> > On Thu, Jul 11, 2019 at 5:20 AM Jiong Wang <jiong.wang@netronome.com> wrote:
> >>
> >>
> >> Jiong Wang writes:
> >>
> >> > Andrii Nakryiko writes:
> >> >
> >> >> On Thu, Jul 4, 2019 at 2:32 PM Jiong Wang <jiong.wang@netronome.com> wrote:
> >> >>>
> >> >>> Verification layer also needs to handle auxiliar info as well as adjusting
> >> >>> subprog start.
> >> >>>
> >> >>> At this layer, insns inside patch buffer could be jump, but they should
> >> >>> have been resolved, meaning they shouldn't jump to insn outside of the
> >> >>> patch buffer. Lineration function for this layer won't touch insns inside
> >> >>> patch buffer.
> >> >>>
> >> >>> Adjusting subprog is finished along with adjusting jump target when the
> >> >>> input will cover bpf to bpf call insn, re-register subprog start is cheap.
> >> >>> But adjustment when there is insn deleteion is not considered yet.
> >> >>>
> >> >>> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
> >> >>> ---
> >> >>>  kernel/bpf/verifier.c | 150 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >> >>>  1 file changed, 150 insertions(+)
> >> >>>
> >> >>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >> >>> index a2e7637..2026d64 100644
> >> >>> --- a/kernel/bpf/verifier.c
> >> >>> +++ b/kernel/bpf/verifier.c
> >> >>> @@ -8350,6 +8350,156 @@ static void opt_hard_wire_dead_code_branches(struct bpf_verifier_env *env)
> >> >>>         }
> >> >>>  }
> >> >>>
> >> >>> +/* Linearize bpf list insn to array (verifier layer). */
> >> >>> +static struct bpf_verifier_env *
> >> >>> +verifier_linearize_list_insn(struct bpf_verifier_env *env,
> >> >>> +                            struct bpf_list_insn *list)
> >> >>
> >> >> It's unclear why this returns env back? It's not allocating a new env,
> >> >> so it's weird and unnecessary. Just return error code.
> >> >
> >> > The reason is I was thinking we have two layers in BPF, the core and the
> >> > verifier.
> >> >
> >> > For core layer (the relevant file is core.c), when doing patching, the
> >> > input is insn list and bpf_prog, the linearization should linearize the
> >> > insn list into insn array, and also whatever others affect inside bpf_prog
> >> > due to changing on insns, for example line info inside prog->aux. So the
> >> > return value is bpf_prog for core layer linearization hook.
> >> >
> >> > For verifier layer, it is similar, but the context if bpf_verifier_env, the
> >> > linearization hook should linearize the insn list, and also those affected
> >> > inside env, for example bpf_insn_aux_data, so the return value is
> >> > bpf_verifier_env, meaning returning an updated verifier context
> >> > (bpf_verifier_env) after insn list linearization.
> >>
> >> Realized your point is no new env is allocated, so just return error
> >> code. Yes, the env pointer is not changed, just internal data is
> >> updated. Return bpf_verifier_env mostly is trying to make the hook more
> >> clear that it returns an updated "context" where the linearization happens,
> >> for verifier layer, it is bpf_verifier_env, and for core layer, it is
> >> bpf_prog, so return value was designed to return these two types.
> >
> > Oh, I missed that core layer returns bpf_prog*. I think this is
> > confusing as hell and is very contrary to what one would expect. If
> > the function doesn't allocate those objects, it shouldn't return them,
> > except for rare cases of some accessor functions. Me reading this,
> > I'll always be suprised and will have to go skim code just to check
> > whether those functions really return new bpf_prog or
> > bpf_verifier_env, respectively.
>
> bpf_prog_realloc do return new bpf_prog, so we will need to return bpf_prog
> * for core layer.

Ah, I see, then it would make sense for core layer, but still is very
confusing for verifier_linearize_list_insn.
I still hope for unified solution, so it shouldn't matter. But it
pointed me to a bug in your code, see below.

>
> >
> > Please change them both to just return error code.
> >
> >>
> >> >
> >> > Make sense?
> >> >
> >> > Regards,
> >> > Jiong
> >> >
> >> >>
> >> >>> +{
> >> >>> +       u32 *idx_map, idx, orig_cnt, fini_cnt = 0;
> >> >>> +       struct bpf_subprog_info *new_subinfo;
> >> >>> +       struct bpf_insn_aux_data *new_data;
> >> >>> +       struct bpf_prog *prog = env->prog;
> >> >>> +       struct bpf_verifier_env *ret_env;
> >> >>> +       struct bpf_insn *insns, *insn;
> >> >>> +       struct bpf_list_insn *elem;
> >> >>> +       int ret;
> >> >>> +
> >> >>> +       /* Calculate final size. */
> >> >>> +       for (elem = list; elem; elem = elem->next)
> >> >>> +               if (!(elem->flag & LIST_INSN_FLAG_REMOVED))
> >> >>> +                       fini_cnt++;
> >> >>> +
> >> >>> +       orig_cnt = prog->len;
> >> >>> +       insns = prog->insnsi;
> >> >>> +       /* If prog length remains same, nothing else to do. */
> >> >>> +       if (fini_cnt == orig_cnt) {
> >> >>> +               for (insn = insns, elem = list; elem; elem = elem->next, insn++)
> >> >>> +                       *insn = elem->insn;
> >> >>> +               return env;
> >> >>> +       }
> >> >>> +       /* Realloc insn buffer when necessary. */
> >> >>> +       if (fini_cnt > orig_cnt)
> >> >>> +               prog = bpf_prog_realloc(prog, bpf_prog_size(fini_cnt),
> >> >>> +                                       GFP_USER);
> >> >>> +       if (!prog)
> >> >>> +               return ERR_PTR(-ENOMEM);

On realloc failure, prog will be non-NULL, so you need to handle error
properly (and propagate it, instead of returning -ENOMEM):

if (IS_ERR(prog))
    return ERR_PTR(prog);


> >> >>> +       insns = prog->insnsi;
> >> >>> +       prog->len = fini_cnt;
> >> >>> +       ret_env = env;
> >> >>> +
> >> >>> +       /* idx_map[OLD_IDX] = NEW_IDX */
> >> >>> +       idx_map = kvmalloc(orig_cnt * sizeof(u32), GFP_KERNEL);
> >> >>> +       if (!idx_map)
> >> >>> +               return ERR_PTR(-ENOMEM);
> >> >>> +       memset(idx_map, 0xff, orig_cnt * sizeof(u32));
> >> >>> +
> >> >>> +       /* Use the same alloc method used when allocating env->insn_aux_data. */
> >> >>> +       new_data = vzalloc(array_size(sizeof(*new_data), fini_cnt));
> >> >>> +       if (!new_data) {
> >> >>> +               kvfree(idx_map);
> >> >>> +               return ERR_PTR(-ENOMEM);
> >> >>> +       }
> >> >>> +
> >> >>> +       /* Copy over insn + calculate idx_map. */
> >> >>> +       for (idx = 0, elem = list; elem; elem = elem->next) {
> >> >>> +               int orig_idx = elem->orig_idx - 1;
> >> >>> +
> >> >>> +               if (orig_idx >= 0) {
> >> >>> +                       idx_map[orig_idx] = idx;
> >> >>> +
> >> >>> +                       if (elem->flag & LIST_INSN_FLAG_REMOVED)
> >> >>> +                               continue;
> >> >>> +
> >> >>> +                       new_data[idx] = env->insn_aux_data[orig_idx];
> >> >>> +
> >> >>> +                       if (elem->flag & LIST_INSN_FLAG_PATCHED)
> >> >>> +                               new_data[idx].zext_dst =
> >> >>> +                                       insn_has_def32(env, &elem->insn);
> >> >>> +               } else {
> >> >>> +                       new_data[idx].seen = true;
> >> >>> +                       new_data[idx].zext_dst = insn_has_def32(env,
> >> >>> +                                                               &elem->insn);
> >> >>> +               }
> >> >>> +               insns[idx++] = elem->insn;
> >> >>> +       }
> >> >>> +
> >> >>> +       new_subinfo = kvzalloc(sizeof(env->subprog_info), GFP_KERNEL);
> >> >>> +       if (!new_subinfo) {
> >> >>> +               kvfree(idx_map);
> >> >>> +               vfree(new_data);
> >> >>> +               return ERR_PTR(-ENOMEM);
> >> >>> +       }
> >> >>> +       memcpy(new_subinfo, env->subprog_info, sizeof(env->subprog_info));
> >> >>> +       memset(env->subprog_info, 0, sizeof(env->subprog_info));
> >> >>> +       env->subprog_cnt = 0;
> >> >>> +       env->prog = prog;
> >> >>> +       ret = add_subprog(env, 0);
> >> >>> +       if (ret < 0) {
> >> >>> +               ret_env = ERR_PTR(ret);
> >> >>> +               goto free_all_ret;
> >> >>> +       }
> >> >>> +       /* Relocate jumps using idx_map.
> >> >>> +        *   old_dst = jmp_insn.old_target + old_pc + 1;
> >> >>> +        *   new_dst = idx_map[old_dst] = jmp_insn.new_target + new_pc + 1;
> >> >>> +        *   jmp_insn.new_target = new_dst - new_pc - 1;
> >> >>> +        */
> >> >>> +       for (idx = 0, elem = list; elem; elem = elem->next) {
> >> >>> +               int orig_idx = elem->orig_idx;
> >> >>> +
> >> >>> +               if (elem->flag & LIST_INSN_FLAG_REMOVED)
> >> >>> +                       continue;
> >> >>> +               if ((elem->flag & LIST_INSN_FLAG_PATCHED) || !orig_idx) {
> >> >>> +                       idx++;
> >> >>> +                       continue;
> >> >>> +               }
> >> >>> +
> >> >>> +               ret = bpf_jit_adj_imm_off(&insns[idx], orig_idx - 1, idx,
> >> >>> +                                         idx_map);
> >> >>> +               if (ret < 0) {
> >> >>> +                       ret_env = ERR_PTR(ret);
> >> >>> +                       goto free_all_ret;
> >> >>> +               }
> >> >>> +               /* Recalculate subprog start as we are at bpf2bpf call insn. */
> >> >>> +               if (ret > 0) {
> >> >>> +                       ret = add_subprog(env, idx + insns[idx].imm + 1);
> >> >>> +                       if (ret < 0) {
> >> >>> +                               ret_env = ERR_PTR(ret);
> >> >>> +                               goto free_all_ret;
> >> >>> +                       }
> >> >>> +               }
> >> >>> +               idx++;
> >> >>> +       }
> >> >>> +       if (ret < 0) {
> >> >>> +               ret_env = ERR_PTR(ret);
> >> >>> +               goto free_all_ret;
> >> >>> +       }
> >> >>> +
> >> >>> +       env->subprog_info[env->subprog_cnt].start = fini_cnt;
> >> >>> +       for (idx = 0; idx <= env->subprog_cnt; idx++)
> >> >>> +               new_subinfo[idx].start = env->subprog_info[idx].start;
> >> >>> +       memcpy(env->subprog_info, new_subinfo, sizeof(env->subprog_info));
> >> >>> +
> >> >>> +       /* Adjust linfo.
> >> >>> +        * FIXME: no support for insn removal at the moment.
> >> >>> +        */
> >> >>> +       if (prog->aux->nr_linfo) {
> >> >>> +               struct bpf_line_info *linfo = prog->aux->linfo;
> >> >>> +               u32 nr_linfo = prog->aux->nr_linfo;
> >> >>> +
> >> >>> +               for (idx = 0; idx < nr_linfo; idx++)
> >> >>> +                       linfo[idx].insn_off = idx_map[linfo[idx].insn_off];
> >> >>> +       }
> >> >>> +       vfree(env->insn_aux_data);
> >> >>> +       env->insn_aux_data = new_data;
> >> >>> +       goto free_mem_list_ret;
> >> >>> +free_all_ret:
> >> >>> +       vfree(new_data);
> >> >>> +free_mem_list_ret:
> >> >>> +       kvfree(new_subinfo);
> >> >>> +       kvfree(idx_map);
> >> >>> +       return ret_env;
> >> >>> +}
> >> >>> +
> >> >>>  static int opt_remove_dead_code(struct bpf_verifier_env *env)
> >> >>>  {
> >> >>>         struct bpf_insn_aux_data *aux_data = env->insn_aux_data;
> >> >>> --
> >> >>> 2.7.4
> >> >>>
> >>
>

^ permalink raw reply

* Re: [PATCH bpf] samples/bpf: build with -D__TARGET_ARCH_$(SRCARCH)
From: Daniel Borkmann @ 2019-07-15 22:20 UTC (permalink / raw)
  To: Ilya Leoshkevich, bpf, netdev; +Cc: gor, heiko.carstens
In-Reply-To: <20190715091103.4030-1-iii@linux.ibm.com>

On 7/15/19 11:11 AM, Ilya Leoshkevich wrote:
> While $ARCH can be relatively flexible (see Makefile and
> tools/scripts/Makefile.arch), $SRCARCH always corresponds to a directory
> name under arch/.
> 
> Therefore, build samples with -D__TARGET_ARCH_$(SRCARCH), since that
> matches the expectations of bpf_helpers.h.
> 
> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
> Acked-by: Vasily Gorbik <gor@linux.ibm.com>

Applied, thanks!

^ permalink raw reply

* Re: [PATCH bpf 0/5] bpf: allow wide (u64) aligned loads for some fields of bpf_sock_addr
From: Daniel Borkmann @ 2019-07-15 22:21 UTC (permalink / raw)
  To: Stanislav Fomichev, netdev, bpf; +Cc: davem, ast, Yonghong Song
In-Reply-To: <20190715163956.204061-1-sdf@google.com>

On 7/15/19 6:39 PM, Stanislav Fomichev wrote:
> When fixing selftests by adding support for wide stores, Yonghong
> reported that he had seen some examples where clang generates
> single u64 loads for two adjacent u32s as well:
> http://lore.kernel.org/netdev/a66c937f-94c0-eaf8-5b37-8587d66c0c62@fb.com
> 
> Let's support aligned u64 reads for some bpf_sock_addr fields
> as well.
> 
> (This can probably wait for bpf-next, I'll defer to Younhong and the
> maintainers.)
> 
> Cc: Yonghong Song <yhs@fb.com>
> 
> Stanislav Fomichev (5):
>   bpf: rename bpf_ctx_wide_store_ok to bpf_ctx_wide_access_ok
>   bpf: allow wide aligned loads for bpf_sock_addr user_ip6 and
>     msg_src_ip6
>   selftests/bpf: rename verifier/wide_store.c to verifier/wide_access.c
>   selftests/bpf: add selftests for wide loads
>   bpf: sync bpf.h to tools/
> 
>  include/linux/filter.h                        |  2 +-
>  include/uapi/linux/bpf.h                      |  4 +-
>  net/core/filter.c                             | 24 ++++--
>  tools/include/uapi/linux/bpf.h                |  4 +-
>  .../selftests/bpf/verifier/wide_access.c      | 73 +++++++++++++++++++
>  .../selftests/bpf/verifier/wide_store.c       | 36 ---------
>  6 files changed, 95 insertions(+), 48 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/verifier/wide_access.c
>  delete mode 100644 tools/testing/selftests/bpf/verifier/wide_store.c
> 

Applied, thanks!

^ permalink raw reply

* Re: [PATCH bpf] selftests/bpf: put test_stub.o into $(OUTPUT)
From: Daniel Borkmann @ 2019-07-15 22:20 UTC (permalink / raw)
  To: Ilya Leoshkevich, bpf, netdev; +Cc: gor, heiko.carstens
In-Reply-To: <20190712135950.91600-1-iii@linux.ibm.com>

On 7/12/19 3:59 PM, Ilya Leoshkevich wrote:
> Add a rule to put test_stub.o in $(OUTPUT) and change the references to
> it accordingly. This prevents test_stub.o from being created in the
> source directory.
> 
> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>

Applied, thanks!

^ permalink raw reply

* Re: [PATCH iproute2 net-next v2 1/6] Kernel header update for hardware offloading changes.
From: Patel, Vedang @ 2019-07-15 22:39 UTC (permalink / raw)
  To: David Ahern
  Cc: Stephen Hemminger, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Gomes, Vinicius, netdev@vger.kernel.org, Dorileo, Leandro,
	Jakub Kicinski, Murali Karicheri
In-Reply-To: <2c13cf19-0b4a-2149-1624-040191cedad9@gmail.com>

Ok I will send out the patches again. 

Thanks,
Vedang

> On Jul 15, 2019, at 1:16 PM, David Ahern <dsahern@gmail.com> wrote:
> 
> On 7/15/19 1:50 PM, Stephen Hemminger wrote:
>> On Mon, 15 Jul 2019 19:40:19 +0000
>> "Patel, Vedang" <vedang.patel@intel.com> wrote:
>> 
>>> Hi Stephen, 
>>> 
>>> The kernel patches corresponding to this series have been merged. I just wanted to check whether these iproute2 related patches are on your TODO list.
>>> 
>>> Let me know if you need any information from me on these patches.
>>> 
>>> Thanks,
>>> Vedang Patel
>> 
>> 
>> David Ahern handles iproute2 next
>> 
>> https://patchwork.ozlabs.org/patch/1111466/
>> 
> 
> given the long time delay between when the iproute2 patches were posted
> and when the kernel side was accepted you will need to re-send the
> iproute2 patches.


^ permalink raw reply

* [PATCH iproute2 net-next v3 1/5] etf: Add skip_sock_check
From: Vedang Patel @ 2019-07-15 22:51 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, stephen, vinicius.gomes,
	leandro.maciel.dorileo, jakub.kicinski, m-karicheri2, dsahern,
	Vedang Patel

ETF Qdisc currently checks for a socket with SO_TXTIME socket option. If
either is not present, the packet is dropped. In the future commits, we
want other Qdiscs to add packet with launchtime to the ETF Qdisc. Also,
there are some packets (e.g. ICMP packets) which may not have a socket
associated with them.  So, add an option to skip this check.

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 tc/q_etf.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/tc/q_etf.c b/tc/q_etf.c
index 76aca476c61d..c2090589bc64 100644
--- a/tc/q_etf.c
+++ b/tc/q_etf.c
@@ -130,6 +130,13 @@ static int etf_parse_opt(struct qdisc_util *qu, int argc,
 				explain_clockid(*argv);
 				return -1;
 			}
+		} else if (strcmp(*argv, "skip_sock_check") == 0) {
+			if (opt.flags & TC_ETF_SKIP_SOCK_CHECK) {
+				fprintf(stderr, "etf: duplicate \"skip_sock_check\" specification\n");
+				return -1;
+			}
+
+			opt.flags |= TC_ETF_SKIP_SOCK_CHECK;
 		} else if (strcmp(*argv, "help") == 0) {
 			explain();
 			return -1;
@@ -171,8 +178,10 @@ static int etf_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 	print_uint(PRINT_ANY, "delta", "delta %d ", qopt->delta);
 	print_string(PRINT_ANY, "offload", "offload %s ",
 				(qopt->flags & TC_ETF_OFFLOAD_ON) ? "on" : "off");
-	print_string(PRINT_ANY, "deadline_mode", "deadline_mode %s",
+	print_string(PRINT_ANY, "deadline_mode", "deadline_mode %s ",
 				(qopt->flags & TC_ETF_DEADLINE_MODE_ON) ? "on" : "off");
+	print_string(PRINT_ANY, "skip_sock_check", "skip_sock_check %s",
+				(qopt->flags & TC_ETF_SKIP_SOCK_CHECK) ? "on" : "off");
 
 	return 0;
 }
-- 
2.7.3


^ permalink raw reply related

* [PATCH iproute2 net-next v3 2/5] taprio: Add support for setting flags
From: Vedang Patel @ 2019-07-15 22:51 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, stephen, vinicius.gomes,
	leandro.maciel.dorileo, jakub.kicinski, m-karicheri2, dsahern,
	Vedang Patel
In-Reply-To: <1563231104-19912-1-git-send-email-vedang.patel@intel.com>

From: Vinicius Costa Gomes <vinicius.gomes@intel.com>

This allows a new parameter, flags, to be passed to taprio. Currently, it
only supports enabling the txtime-assist mode. But, we plan to add
different modes for taprio (e.g. hardware offloading) and this parameter
will be useful in enabling those modes.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 tc/q_taprio.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/tc/q_taprio.c b/tc/q_taprio.c
index 62c8c591da99..930ecb9d1eef 100644
--- a/tc/q_taprio.c
+++ b/tc/q_taprio.c
@@ -154,6 +154,7 @@ static struct sched_entry *create_entry(uint32_t gatemask, uint32_t interval, ui
 static int taprio_parse_opt(struct qdisc_util *qu, int argc,
 			    char **argv, struct nlmsghdr *n, const char *dev)
 {
+	__u32 taprio_flags = UINT32_MAX;
 	__s32 clockid = CLOCKID_INVALID;
 	struct tc_mqprio_qopt opt = { };
 	__s64 cycle_time_extension = 0;
@@ -281,6 +282,17 @@ static int taprio_parse_opt(struct qdisc_util *qu, int argc,
 				explain_clockid(*argv);
 				return -1;
 			}
+		} else if (strcmp(*argv, "flags") == 0) {
+			NEXT_ARG();
+			if (taprio_flags != UINT32_MAX) {
+				fprintf(stderr, "taprio: duplicate \"flags\" specification\n");
+				return -1;
+			}
+			if (get_u32(&taprio_flags, *argv, 0)) {
+				PREV_ARG();
+				return -1;
+			}
+
 		} else if (strcmp(*argv, "help") == 0) {
 			explain();
 			return -1;
@@ -297,6 +309,9 @@ static int taprio_parse_opt(struct qdisc_util *qu, int argc,
 	if (clockid != CLOCKID_INVALID)
 		addattr_l(n, 1024, TCA_TAPRIO_ATTR_SCHED_CLOCKID, &clockid, sizeof(clockid));
 
+	if (taprio_flags != UINT32_MAX)
+		addattr_l(n, 1024, TCA_TAPRIO_ATTR_FLAGS, &taprio_flags, sizeof(taprio_flags));
+
 	if (opt.num_tc > 0)
 		addattr_l(n, 1024, TCA_TAPRIO_ATTR_PRIOMAP, &opt, sizeof(opt));
 
@@ -405,6 +420,7 @@ static int taprio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 	struct rtattr *tb[TCA_TAPRIO_ATTR_MAX + 1];
 	struct tc_mqprio_qopt *qopt = 0;
 	__s32 clockid = CLOCKID_INVALID;
+	__u32 taprio_flags = 0;
 	int i;
 
 	if (opt == NULL)
@@ -442,6 +458,11 @@ static int taprio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 
 	print_string(PRINT_ANY, "clockid", "clockid %s", get_clock_name(clockid));
 
+	if (tb[TCA_TAPRIO_ATTR_FLAGS]) {
+		taprio_flags = rta_getattr_u32(tb[TCA_TAPRIO_ATTR_FLAGS]);
+		print_uint(PRINT_ANY, "flags", " flags %x", taprio_flags);
+	}
+
 	print_schedule(f, tb);
 
 	if (tb[TCA_TAPRIO_ATTR_ADMIN_SCHED]) {
-- 
2.7.3


^ permalink raw reply related

* [PATCH iproute2 net-next v3 3/5] taprio: add support for setting txtime_delay.
From: Vedang Patel @ 2019-07-15 22:51 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, stephen, vinicius.gomes,
	leandro.maciel.dorileo, jakub.kicinski, m-karicheri2, dsahern,
	Vedang Patel
In-Reply-To: <1563231104-19912-1-git-send-email-vedang.patel@intel.com>

This adds support for setting the txtime_delay parameter which is useful
for the txtime offload mode of taprio.

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 tc/q_taprio.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/tc/q_taprio.c b/tc/q_taprio.c
index 930ecb9d1eef..13bd5c44cac9 100644
--- a/tc/q_taprio.c
+++ b/tc/q_taprio.c
@@ -52,7 +52,7 @@ static void explain(void)
 		"		[num_tc NUMBER] [map P0 P1 ...] "
 		"		[queues COUNT@OFFSET COUNT@OFFSET COUNT@OFFSET ...] "
 		"		[ [sched-entry index cmd gate-mask interval] ... ] "
-		"		[base-time time] "
+		"		[base-time time] [txtime-delay delay]"
 		"\n"
 		"CLOCKID must be a valid SYS-V id (i.e. CLOCK_TAI)\n");
 }
@@ -162,6 +162,7 @@ static int taprio_parse_opt(struct qdisc_util *qu, int argc,
 	struct rtattr *tail, *l;
 	__s64 cycle_time = 0;
 	__s64 base_time = 0;
+	__s32 txtime_delay = 0;
 	int err, idx;
 
 	INIT_LIST_HEAD(&sched_entries);
@@ -293,6 +294,17 @@ static int taprio_parse_opt(struct qdisc_util *qu, int argc,
 				return -1;
 			}
 
+		} else if (strcmp(*argv, "txtime-delay") == 0) {
+			NEXT_ARG();
+			if (txtime_delay != 0) {
+				fprintf(stderr, "taprio: duplicate \"txtime-delay\" specification\n");
+				return -1;
+			}
+			if (get_s32(&txtime_delay, *argv, 0)) {
+				PREV_ARG();
+				return -1;
+			}
+
 		} else if (strcmp(*argv, "help") == 0) {
 			explain();
 			return -1;
@@ -315,6 +327,9 @@ static int taprio_parse_opt(struct qdisc_util *qu, int argc,
 	if (opt.num_tc > 0)
 		addattr_l(n, 1024, TCA_TAPRIO_ATTR_PRIOMAP, &opt, sizeof(opt));
 
+	if (txtime_delay)
+		addattr_l(n, 1024, TCA_TAPRIO_ATTR_TXTIME_DELAY, &txtime_delay, sizeof(txtime_delay));
+
 	if (base_time)
 		addattr_l(n, 1024, TCA_TAPRIO_ATTR_SCHED_BASE_TIME, &base_time, sizeof(base_time));
 
@@ -421,6 +436,7 @@ static int taprio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 	struct tc_mqprio_qopt *qopt = 0;
 	__s32 clockid = CLOCKID_INVALID;
 	__u32 taprio_flags = 0;
+	__s32 txtime_delay = 0;
 	int i;
 
 	if (opt == NULL)
@@ -463,6 +479,11 @@ static int taprio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 		print_uint(PRINT_ANY, "flags", " flags %x", taprio_flags);
 	}
 
+	if (tb[TCA_TAPRIO_ATTR_TXTIME_DELAY]) {
+		txtime_delay = rta_getattr_s32(tb[TCA_TAPRIO_ATTR_TXTIME_DELAY]);
+		print_int(PRINT_ANY, "txtime_delay", " txtime delay %d", txtime_delay);
+	}
+
 	print_schedule(f, tb);
 
 	if (tb[TCA_TAPRIO_ATTR_ADMIN_SCHED]) {
-- 
2.7.3


^ permalink raw reply related

* [PATCH iproute2 net-next v3 4/5] tc: etf: Add documentation for skip-skb-check.
From: Vedang Patel @ 2019-07-15 22:51 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, stephen, vinicius.gomes,
	leandro.maciel.dorileo, jakub.kicinski, m-karicheri2, dsahern,
	Vedang Patel
In-Reply-To: <1563231104-19912-1-git-send-email-vedang.patel@intel.com>

Document the newly added option (skip-skb-check) on the etf man-page.

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 man/man8/tc-etf.8 | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/man/man8/tc-etf.8 b/man/man8/tc-etf.8
index 30a12de7d2c7..2e01a591dbaa 100644
--- a/man/man8/tc-etf.8
+++ b/man/man8/tc-etf.8
@@ -106,6 +106,16 @@ referred to as "Launch Time" or "Time-Based Scheduling" by the
 documentation of network interface controllers.
 The default is for this option to be disabled.
 
+.TP
+skip_skb_check
+.br
+.BR etf(8)
+currently drops any packet which does not have a socket associated with it or
+if the socket does not have SO_TXTIME socket option set. But, this will not
+work if the launchtime is set by another entity inside the kernel (e.g. some
+other Qdisc). Setting the skip_skb_check will skip checking for a socket
+associated with the packet.
+
 .SH EXAMPLES
 
 ETF is used to enforce a Quality of Service. It controls when each
-- 
2.7.3


^ permalink raw reply related

* [PATCH iproute2 net-next v3 5/5] tc: taprio: Update documentation
From: Vedang Patel @ 2019-07-15 22:51 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, stephen, vinicius.gomes,
	leandro.maciel.dorileo, jakub.kicinski, m-karicheri2, dsahern,
	Vedang Patel
In-Reply-To: <1563231104-19912-1-git-send-email-vedang.patel@intel.com>

Add documentation for the latest options, flags and txtime-delay, to the
taprio manpage.

This also adds an example to run tc in txtime offload mode.

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 man/man8/tc-taprio.8 | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/man/man8/tc-taprio.8 b/man/man8/tc-taprio.8
index 850be9b03649..e1d19ba19089 100644
--- a/man/man8/tc-taprio.8
+++ b/man/man8/tc-taprio.8
@@ -112,6 +112,26 @@ means that traffic class 0 is "active" for that schedule entry.
 long that state defined by <command> and <gate mask> should be held
 before moving to the next entry.
 
+.TP
+flags
+.br
+Specifies different modes for taprio. Currently, only txtime-assist is
+supported which can be enabled by setting it to 0x1. In this mode, taprio will
+set the transmit timestamp depending on the interval in which the packet needs
+to be transmitted. It will then utililize the
+.BR etf(8)
+qdisc to sort and transmit the packets at the right time. The second example
+can be used as a reference to configure this mode.
+
+.TP
+txtime-delay
+.br
+This parameter is specific to the txtime offload mode. It specifies the maximum
+time a packet might take to reach the network card from the taprio qdisc. The
+value should always be greater than the delta specified in the
+.BR etf(8)
+qdisc.
+
 .SH EXAMPLES
 
 The following example shows how an traffic schedule with three traffic
@@ -137,6 +157,26 @@ reference CLOCK_TAI. The schedule is composed of three entries each of
               clockid CLOCK_TAI
 .EE
 
+Following is an example to enable the txtime offload mode in taprio. See
+.BR etf(8)
+for more information about configuring the ETF qdisc.
+
+.EX
+# tc qdisc replace dev eth0 parent root handle 100 taprio \\
+              num_tc 3 \\
+              map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \\
+              queues 1@0 1@0 1@0 \\
+              base-time 1528743495910289987 \\
+              sched-entry S 01 300000 \\
+              sched-entry S 02 300000 \\
+              sched-entry S 04 400000 \\
+              flags 0x1 \\
+              txtime-delay 200000 \\
+              clockid CLOCK_TAI
+
+# tc qdisc replace dev $IFACE parent 100:1 etf skip_skb_check \\
+              offload delta 200000 clockid CLOCK_TAI
+.EE
 
 .SH AUTHORS
 Vinicius Costa Gomes <vinicius.gomes@intel.com>
-- 
2.7.3


^ permalink raw reply related

* Re: [PATCH V35 23/29] bpf: Restrict bpf when kernel lockdown is in confidentiality mode
From: Daniel Borkmann @ 2019-07-15 22:54 UTC (permalink / raw)
  To: Matthew Garrett, jmorris
  Cc: linux-security-module, linux-kernel, linux-api, David Howells,
	Alexei Starovoitov, Matthew Garrett, netdev, Chun-Yi Lee
In-Reply-To: <20190715195946.223443-24-matthewgarrett@google.com>

On 7/15/19 9:59 PM, Matthew Garrett wrote:
> From: David Howells <dhowells@redhat.com>
> 
> bpf_read() and bpf_read_str() could potentially be abused to (eg) allow
> private keys in kernel memory to be leaked. Disable them if the kernel
> has been locked down in confidentiality mode.
> 
> Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> cc: netdev@vger.kernel.org
> cc: Chun-Yi Lee <jlee@suse.com>
> cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  include/linux/security.h     |  1 +
>  kernel/trace/bpf_trace.c     | 10 ++++++++++
>  security/lockdown/lockdown.c |  1 +
>  3 files changed, 12 insertions(+)
> 
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 987d8427f091..8dd1741a52cd 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -118,6 +118,7 @@ enum lockdown_reason {
>  	LOCKDOWN_INTEGRITY_MAX,
>  	LOCKDOWN_KCORE,
>  	LOCKDOWN_KPROBES,
> +	LOCKDOWN_BPF_READ,
>  	LOCKDOWN_CONFIDENTIALITY_MAX,
>  };
>  
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index ca1255d14576..605908da61c5 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -142,7 +142,12 @@ BPF_CALL_3(bpf_probe_read, void *, dst, u32, size, const void *, unsafe_ptr)
>  {
>  	int ret;
>  
> +	ret = security_locked_down(LOCKDOWN_BPF_READ);
> +	if (ret)
> +		goto out;
> +
>  	ret = probe_kernel_read(dst, unsafe_ptr, size);
> +out:
>  	if (unlikely(ret < 0))
>  		memset(dst, 0, size);

Hmm, does security_locked_down() ever return a code > 0 or why do you
have the double check on return code? If not, then for clarity the
ret code from security_locked_down() should be checked as 'ret < 0'
as well and out label should be at the memset directly instead.

> @@ -569,6 +574,10 @@ BPF_CALL_3(bpf_probe_read_str, void *, dst, u32, size,
>  {
>  	int ret;
>  
> +	ret = security_locked_down(LOCKDOWN_BPF_READ);
> +	if (ret)
> +		goto out;
> +
>  	/*
>  	 * The strncpy_from_unsafe() call will likely not fill the entire
>  	 * buffer, but that's okay in this circumstance as we're probing
> @@ -579,6 +588,7 @@ BPF_CALL_3(bpf_probe_read_str, void *, dst, u32, size,
>  	 * is returned that can be used for bpf_perf_event_output() et al.
>  	 */
>  	ret = strncpy_from_unsafe(dst, unsafe_ptr, size);
> +out:
>  	if (unlikely(ret < 0))
>  		memset(dst, 0, size);

Ditto.

Thanks,
Daniel

^ permalink raw reply

* Re: [RFC bpf-next 0/8] bpf: accelerate insn patching speed
From: Andrii Nakryiko @ 2019-07-15 22:55 UTC (permalink / raw)
  To: Jiong Wang
  Cc: Alexei Starovoitov, Daniel Borkmann, Edward Cree, Naveen N. Rao,
	Andrii Nakryiko, Jakub Kicinski, bpf, Networking, oss-drivers,
	Yonghong Song
In-Reply-To: <87k1cj3b69.fsf@netronome.com>

On Mon, Jul 15, 2019 at 2:21 AM Jiong Wang <jiong.wang@netronome.com> wrote:
>
>
> Andrii Nakryiko writes:
>
> > On Thu, Jul 11, 2019 at 4:22 AM Jiong Wang <jiong.wang@netronome.com> wrote:
> >>
> >>
> >> Andrii Nakryiko writes:
> >>
> >> > On Thu, Jul 4, 2019 at 2:31 PM Jiong Wang <jiong.wang@netronome.com> wrote:
> >> >>
> >> >> This is an RFC based on latest bpf-next about acclerating insn patching
> >> >> speed, it is now near the shape of final PATCH set, and we could see the
> >> >> changes migrating to list patching would brings, so send out for
> >> >> comments. Most of the info are in cover letter. I splitted the code in a
> >> >> way to show API migration more easily.
> >> >
> >> >
> >> > Hey Jiong,
> >> >
> >> >
> >> > Sorry, took me a while to get to this and learn more about instruction
> >> > patching. Overall this looks good and I think is a good direction.
> >> > I'll post high-level feedback here, and some more
> >> > implementation-specific ones in corresponding patches.
> >>
> >> Great, thanks very much for the feedbacks. Most of your feedbacks are
> >> hitting those pain points I exactly had ran into. For some of them, I
> >> thought similar solutions like yours, but failed due to various
> >> reasons. Let's go through them again, I could have missed some important
> >> things.
> >>
> >> Please see my replies below.
> >
> > Thanks for thoughtful reply :)
> >
> >>
> >> >>
> >> >> Test Results
> >> >> ===
> >> >>   - Full pass on test_verifier/test_prog/test_prog_32 under all three
> >> >>     modes (interpreter, JIT, JIT with blinding).
> >> >>
> >> >>   - Benchmarking shows 10 ~ 15x faster on medium sized prog, and reduce
> >> >>     patching time from 5100s (nearly one and a half hour) to less than
> >> >>     0.5s for 1M insn patching.
> >> >>
> >> >> Known Issues
> >> >> ===
> >> >>   - The following warning is triggered when running scale test which
> >> >>     contains 1M insns and patching:
> >> >>       warning of mm/page_alloc.c:4639 __alloc_pages_nodemask+0x29e/0x330
> >> >>
> >> >>     This is caused by existing code, it can be reproduced on bpf-next
> >> >>     master with jit blinding enabled, then run scale unit test, it will
> >> >>     shown up after half an hour. After this set, patching is very fast, so
> >> >>     it shows up quickly.
> >> >>
> >> >>   - No line info adjustment support when doing insn delete, subprog adj
> >> >>     is with bug when doing insn delete as well. Generally, removal of insns
> >> >>     could possibly cause remove of entire line or subprog, therefore
> >> >>     entries of prog->aux->linfo or env->subprog needs to be deleted. I
> >> >>     don't have good idea and clean code for integrating this into the
> >> >>     linearization code at the moment, will do more experimenting,
> >> >>     appreciate ideas and suggestions on this.
> >> >
> >> > Is there any specific problem to detect which line info to delete? Or
> >> > what am I missing besides careful implementation?
> >>
> >> Mostly line info and subprog info are range info which covers a range of
> >> insns. Deleting insns could causing you adjusting the range or removing one
> >> range entirely. subprog info could be fully recalcuated during
> >> linearization while line info I need some careful implementation and I
> >> failed to have clean code for this during linearization also as said no
> >> unit tests to help me understand whether the code is correct or not.
> >>
> >
> > Ok, that's good that it's just about clean implementation. Try to
> > implement it as clearly as possible. Then post it here, and if it can
> > be improved someone (me?) will try to help to clean it up further.
> >
> > Not a big expert on line info, so can't comment on that,
> > unfortunately. Maybe Yonghong can chime in (cc'ed)
> >
> >
> >> I will described this latter, spent too much time writing the following
> >> reply. Might worth an separate discussion thread.
> >>
> >> >>
> >> >>     Insn delete doesn't happen on normal programs, for example Cilium
> >> >>     benchmarks, and happens rarely on test_progs, so the test coverage is
> >> >>     not good. That's also why this RFC have a full pass on selftest with
> >> >>     this known issue.
> >> >
> >> > I hope you'll add test for deletion (and w/ corresponding line info)
> >> > in final patch set :)
> >>
> >> Will try. Need to spend some time on BTF format.
> >> >
> >> >>
> >> >>   - Could further use mem pool to accelerate the speed, changes are trivial
> >> >>     on top of this RFC, and could be 2x extra faster. Not included in this
> >> >>     RFC as reducing the algo complexity from quadratic to linear of insn
> >> >>     number is the first step.
> >> >
> >> > Honestly, I think that would add more complexity than necessary, and I
> >> > think we can further speed up performance without that, see below.
> >> >
> >> >>
> >> >> Background
> >> >> ===
> >> >> This RFC aims to accelerate BPF insn patching speed, patching means expand
> >> >> one bpf insn at any offset inside bpf prog into a set of new insns, or
> >> >> remove insns.
> >> >>
> >> >> At the moment, insn patching is quadratic of insn number, this is due to
> >> >> branch targets of jump insns needs to be adjusted, and the algo used is:
> >> >>
> >> >>   for insn inside prog
> >> >>     patch insn + regeneate bpf prog
> >> >>     for insn inside new prog
> >> >>       adjust jump target
> >> >>
> >> >> This is causing significant time spending when a bpf prog requires large
> >> >> amount of patching on different insns. Benchmarking shows it could take
> >> >> more than half minutes to finish patching when patching number is more
> >> >> than 50K, and the time spent could be more than one hour when patching
> >> >> number is around 1M.
> >> >>
> >> >>   15000   :    3s
> >> >>   45000   :   29s
> >> >>   95000   :  125s
> >> >>   195000  :  712s
> >> >>   1000000 : 5100s
> >> >>
> >> >> This RFC introduces new patching infrastructure. Before doing insn
> >> >> patching, insns in bpf prog are turned into a singly linked list, insert
> >> >> new insns just insert new list node, delete insns just set delete flag.
> >> >> And finally, the list is linearized back into array, and branch target
> >> >> adjustment is done for all jump insns during linearization. This algo
> >> >> brings the time complexity from quadratic to linear of insn number.
> >> >>
> >> >> Benchmarking shows the new patching infrastructure could be 10 ~ 15x faster
> >> >> on medium sized prog, and for a 1M patching it reduce the time from 5100s
> >> >> to less than 0.5s.
> >> >>
> >> >> Patching API
> >> >> ===
> >> >> Insn patching could happen on two layers inside BPF. One is "core layer"
> >> >> where only BPF insns are patched. The other is "verification layer" where
> >> >> insns have corresponding aux info as well high level subprog info, so
> >> >> insn patching means aux info needs to be patched as well, and subprog info
> >> >> needs to be adjusted. BPF prog also has debug info associated, so line info
> >> >> should always be updated after insn patching.
> >> >>
> >> >> So, list creation, destroy, insert, delete is the same for both layer,
> >> >> but lineration is different. "verification layer" patching require extra
> >> >> work. Therefore the patch APIs are:
> >> >>
> >> >>    list creation:                bpf_create_list_insn
> >> >>    list patch:                   bpf_patch_list_insn
> >> >>    list pre-patch:               bpf_prepatch_list_insn
> >> >
> >> > I think pre-patch name is very confusing, until I read full
> >> > description I couldn't understand what it's supposed to be used for.
> >> > Speaking of bpf_patch_list_insn, patch is also generic enough to leave
> >> > me wondering whether instruction buffer is inserted after instruction,
> >> > or instruction is replaced with a bunch of instructions.
> >> >
> >> > So how about two more specific names:
> >> > bpf_patch_list_insn -> bpf_list_insn_replace (meaning replace given
> >> > instruction with a list of patch instructions)
> >> > bpf_prepatch_list_insn -> bpf_list_insn_prepend (well, I think this
> >> > one is pretty clear).
> >>
> >> My sense on English word is not great, will switch to above which indeed
> >> reads more clear.
> >>
> >> >>    list lineration (core layer): prog = bpf_linearize_list_insn(prog, list)
> >> >>    list lineration (veri layer): env = verifier_linearize_list_insn(env, list)
> >> >
> >> > These two functions are both quite involved, as well as share a lot of
> >> > common code. I'd rather have one linearize instruction, that takes env
> >> > as an optional parameter. If env is specified (which is the case for
> >> > all cases except for constant blinding pass), then adjust aux_data and
> >> > subprogs along the way.
> >>
> >> Two version of lineration and how to unify them was a painpoint to me. I
> >> thought to factor out some of the common code out, but it actually doesn't
> >> count much, the final size counting + insnsi resize parts are the same,
> >> then things start to diverge since the "Copy over insn" loop.
> >>
> >> verifier layer needs to copy and initialize aux data etc. And jump
> >> relocation is different. At core layer, the use case is JIT blinding which
> >> could expand an jump_imm insn into a and/or/jump_reg sequence, and the
> >
> > Sorry, I didn't get what "could expand an jump_imm insn into a
> > and/or/jump_reg sequence", maybe you can clarify if I'm missing
> > something.
> >
> > But from your cover letter description, core layer has no jumps at
> > all, while verifier has jumps inside patch buffer. So, if you support
> > jumps inside of patch buffer, it will automatically work for core
> > layer. Or what am I missing?
>
> I meant in core layer (JIT blinding), there is the following patching:
>
> input:
>   insn 0             insn 0
>   insn 1             insn 1
>   jmp_imm   >>       mov_imm  \
>   insn 2             xor_imm    insn seq expanded from jmp_imm
>   insn 3             jmp_reg  /
>                      insn 2
>                      insn 3
>
>
> jmp_imm is the insn that will be patched, and the actually transformation
> is to expand it into mov_imm/xor_imm/jmp_reg sequence. "jmp_reg", sitting
> at the end of the patch buffer, must jump to the same destination as the
> original jmp_imm, so "jmp_reg" is an insn inside patch buffer but should
> be relocated, and the jump destination is outside of patch buffer.


Ok, great, thanks for explaining, yeah it's definitely something that
we should be able to support. BUT. It got me thinking a bit more and I
think I have simpler and more elegant solution now, again, supporting
both core-layer and verifier-layer operations.

struct bpf_patchable_insn {
   struct bpf_patchable_insn *next;
   struct bpf_insn insn;
   int orig_idx; /* original non-patched index */
   int new_idx;  /* new index, will be filled only during linearization */
};

struct bpf_patcher {
    /* dummy head node of a chain of patchable instructions */
    struct bpf_patchable_insn insn_head;
    /* dynamic array of size(original instruction count)
     * this is a map from original instruction index to a first
     * patchable instruction that replaced that instruction (or
     * just original instruction as bpf_patchable_insn).
     */
    int *orig_idx_to_patchable_insn;
    int cnt;
};

Few points, but it should be pretty clear just from comments and definitions:
1. When you created bpf_patcher, you create patchabe_insn list, fill
orig_idx_to_patchable_insn map to store proper pointers. This array is
NEVER changed after that.
2. When replacing instruction, you re-use struct bpf_patchable_insn
for first patched instruction, then append after that (not prepend to
next instruction to not disrupt orig_idx -> patchable_insn mapping).
3. During linearizations, you first traverse the chain of instructions
and trivially assing new_idxs.
4. No need for patchabe_insn->target anymore. All jumps use relative
instruction offsets, right? So when you need to determine new
instruction index during linearization, you just do (after you
calculated new instruction indicies):

func adjust_jmp(struct bpf_patcher* patcher, struct bpf_patchable_insn *insn) {
   int old_jmp_idx = insn->orig_idx + jmp_offset_of(insn->insn);
   int new_jmp_idx = patcher->orig_idx_to_patchable_insn[old_jmp_idx]->new_idx;
   adjust_jmp_offset(insn->insn, new_jmp_idx) - insn->orig_idx;
}

The idea is that we want to support quick look-up by original
instruction index. That's what orig_idx_to_patchable_insn provides. On
the other hand, no existing instruction is ever referencing newly
patched instruction by its new offset, so with careful implementation,
you can transparently support all the cases, regardless if it's in
core layer or verifier layer (so, e.g., verifier layer patched
instructions now will be able to jump out of patched buffer, if
necessary, neat, right?).

It is cleaner than everything we've discussed so far. Unless I missed
something critical (it's all quite convoluted, so I might have
forgotten some parts already). Let me know what you think.


>
> This means for core layer (jit blinding), it needs to take care of insn
> inside patch buffer.
>
> > Just compared two version of linearize side by side. From what I can
> > see, unified version could look like this, high-level:
> >
> > 1. Count final insn count (but see my other suggestions how to avoid
> > that altogether). If not changed - exit.
> > 2. Realloc insn buffer, copy just instructions (not aux_data yet).
> > Build idx_map, if necessary.
> > 3. (if env) then bpf_patchable_insn has aux_data, so now do another
> > pass and copy it into resulting array.
> > 4. (if env) Copy sub info. Though I'd see if we can just reuse old
> > ones and just adjust offsets. I'm not sure why we need to allocate new
> > array, subprogram count shouldn't change, right?
>
> If there is no dead insn elimination opt, then we could just adjust
> offsets. When there is insn deleting, I feel the logic becomes more
> complex. One subprog could be completely deleted or partially deleted, so
> I feel just recalculate the whole subprog info as a side-product is
> much simpler.

What's the situation where entirety of subprog can be deleted?


>
> > 5. (common) Relocate jumps. Not clear why core layer doesn't care
> > about PATCHED (or, alternatively, why verifier layer cares).
>
> See above, in this RFC, core layer care PATCHED during relocating jumps,
> and verifier layer doesn't.
>
> > And again, with targets pointer it will look totally different (and
> > simpler).
>
> Yes, will see how the code looks.
>
> > 6. (if env) adjust subprogs
> > 7. (common) Adjust prog's line info.
> >
> > The devil is in the details, but I think this will still be better if
> > contained in one function if a bunch of `if (env)` checks. Still
> > pretty linear.
> >
> >> jump_reg is at the end of the patch buffer, it should be relocated. While
> >> all use case in verifier layer, no jump in the prog will be patched and all
> >> new jumps in patch buffer will jump inside the buffer locally so no need to
> >> resolve.
> >>
> >> And yes we could unify them into one and control the diverge using
> >> argument, but then where to place the function is an issue. My
> >> understanding is verifier.c is designed to be on top of core.c and core.c
> >> should not reference and no need to be aware of any verifier specific data
> >> structures, for example env or bpf_aux_insn_data etc.
> >
> > Func prototype where it is. Maybe forward-declare verifier env struct.
> > Implementation in verifier.c?
> >
> >>
> >> So, in this RFC, I had choosed to write separate linerization function for
> >> core and verifier layer. Does this make sense?
> >
> > See above. Let's still try to make it better.
> >
> >>
> >> >
> >> > This would keep logic less duplicated and shouldn't complexity beyond
> >> > few null checks in few places.
> >> >
> >> >>    list destroy:                 bpf_destroy_list_insn
> >> >>
> >> >
> >> > I'd also add a macro foreach_list_insn instead of explicit for loops
> >> > in multiple places. That would also allow to skip deleted instructions
> >> > transparently.
> >> >
> >> >> list patch could change the insn at patch point, it will invalid the aux
> >> >
> >> > typo: invalid -> invalidate
> >>
> >> Ack.
> >>
> >> >
> >> >> info at patching point. list pre-patch insert new insns before patch point
> >> >> where the insn and associated aux info are not touched, it is used for
> >> >> example in convert_ctx_access when generating prologue.
> >> >>
> >> >> Typical API sequence for one patching pass:
> >> >>
> >> >>    struct bpf_list_insn list = bpf_create_list_insn(struct bpf_prog);
> >> >>    for (elem = list; elem; elem = elem->next)
> >> >>       patch_buf = gen_patch_buf_logic;
> >> >>       elem = bpf_patch_list_insn(elem, patch_buf, cnt);
> >> >>    bpf_prog = bpf_linearize_list_insn(list)
> >> >>    bpf_destroy_list_insn(list)
> >> >>
> >> >> Several patching passes could also share the same list:
> >> >>
> >> >>    struct bpf_list_insn list = bpf_create_list_insn(struct bpf_prog);
> >> >>    for (elem = list; elem; elem = elem->next)
> >> >>       patch_buf = gen_patch_buf_logic1;
> >> >>       elem = bpf_patch_list_insn(elem, patch_buf, cnt);
> >> >>    for (elem = list; elem; elem = elem->next)
> >> >>       patch_buf = gen_patch_buf_logic2;
> >> >>       elem = bpf_patch_list_insn(elem, patch_buf, cnt);
> >> >>    bpf_prog = bpf_linearize_list_insn(list)
> >> >>    bpf_destroy_list_insn(list)
> >> >>
> >> >> but note new inserted insns int early passes won't have aux info except
> >> >> zext info. So, if one patch pass requires all aux info updated and
> >> >> recalculated for all insns including those pathced, it should first
> >> >> linearize the old list, then re-create the list. The RFC always create and
> >> >> linearize the list for each migrated patching pass separately.
> >> >
> >> > I think we should do just one list creation, few passes of patching
> >> > and then linearize once. That will save quite a lot of memory
> >> > allocation and will speed up a lot of things. All the verifier
> >> > patching happens one after the other without any other functionality
> >> > in between, so there shouldn't be any problem.
> >>
> >> Yes, as mentioned above, it is possible and I had tried to do it in an very
> >> initial impl. IIRC convert_ctx_access + fixup_bpf_calls could share the
> >> same list, but then the 32-bit zero extension insertion pass requires
> >> aux.zext_dst set properly for all instructions including those patched
> >
> > So zext_dst. Seems like it's easily calculatable, so doesn't seem like
> > it even needs to be accessed from aux_data.
> >
> > But. I can see at least two ways to do this:
> > 1. those patching passes that care about aux_data, should just do
> > extra check for NULL. Because when we adjust insns now, we just leave
> > zero-initialized aux_data, except for zext_dst and seen. So it's easy
> > to default to them if aux_data is NULL for patchable_insn.
> > 2. just allocate and fill them out them when applying patch insns
> > buffer. It's not a duplication, we already fill them out during
> > patching today. So just do the same, except through malloc()'ed
> > pointer instead. At the end they will be copied into linear resulting
> > array during linearization (uniformly with non-patched insns).
> >
> >> one which we need to linearize the list first (as we set zext_dst during
> >> linerization), or the other choice is we do the zext_dst initialization
> >> during bpf_patch_list_insn, but this then make bpf_patch_list_insn diverge
> >> between core and verifier layer.
> >
> > List construction is much simpler, even if we have to have extra
> > check, similar to `if (env) { do_extra(); }`, IMO, it's fine.
> >
> >>
> >> > As for aux_data. We can solve that even more simply and reliably by
> >> > storing a pointer along the struct bpf_list_insn
> >>
> >> This is exactly what I had implemented initially, but then the issue is how
> >> to handle aux_data for patched insn? IIRC I was leave it as a NULL pointer,
> >> but later found zext_dst info is required for all insns, so I end up
> >> duplicating zext_dst in bpf_list_insn.
> >
> > See above. No duplication. You have a pointer. Whether aux_data is in
> > original array or was malloc()'ed, doesn't matter. But no duplication
> > of fields.
> >
> >>
> >> This leads me worrying we need to keep duplicating fields there as soon as
> >> there is new similar requirements in future patching pass and I thought it
> >> might be better to just reference the aux_data inside env using orig_idx,
> >> this avoids duplicating information, but we need to make sure used fields
> >> inside aux_data for patched insn update-to-date during linearization or
> >> patching list.
> >>
> >> > (btw, how about calling it bpf_patchable_insn?).
> >>
> >> No preference, will use this one.
> >>
> >> > Here's how I propose to represent this patchable instruction:
> >> >
> >> > struct bpf_list_insn {
> >> >        struct bpf_insn insn;
> >> >        struct bpf_list_insn *next;
> >> >        struct bpf_list_insn *target;
> >> >        struct bpf_insn_aux_data *aux_data;
> >> >        s32 orig_idx; // can repurpose this to have three meanings:
> >> >                      // -2 - deleted
> >> >                      // -1 - patched/inserted insn
> >> >                      // >=0 - original idx
> >>
> >> I actually had experimented the -2/-1/0 trick, exactly the same number
> >> assignment :) IIRC the code was not clear compared with using flag, the
> >> reason seems to be:
> >>   1. we still need orig_idx of an patched insn somehow, meaning negate the
> >>      index.
> >
> > Not following, original index with be >=0, no?
> >
> >>   2. somehow somecode need to know whether one insn is deleted or patched
> >>      after the negation, so I end up with some ugly code.
> >
> > So that's why you'll have constants with descriptive name for -2 and -1.
> >
> >>
> >> Anyway, I might had not thought hard enough on this, I will retry using the
> >> special index instead of flag, hopefully I could have clean code this time.
> >>
> >
> > Yeah, please try again. All those `orig_idx = insn->orig_idx - 1; if
> > (orig_idx >= 0) { ... }` are very confusing.
> >
> >> > };
> >> >
> >> > The idea would be as follows:
> >> > 1. when creating original list, target pointer will point directly to
> >> > a patchable instruction wrapper for jumps/calls. This will allow to
> >> > stop tracking and re-calculating jump offsets and instruction indicies
> >> > until linearization.
> >>
> >> Not sure I have followed the idea of "target" pointer. At the moment we are
> >> using index mapping array (generated as by-product during coping insn).
> >>
> >> While the "target" pointer means to during list initialization, each jump
> >> insn will have target initialized to the list node of the converted jump
> >> destination insn, and all those non-jump insns are with NULL? Then during
> >> linearization you assign index to each list node (could be done as
> >> by-product of other pass) before insn coping which could then relocate the
> >> insn during the coping as the "target" would have final index calculated?
> >> Am I following correctly?
> >
> > Yes, I think you are understanding correctly what I'm saying. For
> > implementation, you can do it in few ways, through few passes or with
> > some additional data, is less important. See what's cleanest.
> >
> >>
> >> > 2. aux_data is also filled at that point. Later at linearization time
> >> > you'd just iterate over all the instructions in final order and copy
> >> > original aux_data, if it's present. And then just repace env's
> >> > aux_data array at the end, should be very simple and fast.
> >>
> >> As explained, I am worried making aux_data a pointer will causing
> >> duplicating some fields into list_insn if the fields are required for
> >> patched insns.
> >
> > Addressed above, I don't think there will be any duplication, because
> > we pass aux_data by pointer.
> >
> >>
> >> > 3. during fix_bpf_calls, zext, ctx rewrite passes, we'll reuse the
> >> > same list of instructions and those passes will just keep inserting
> >> > instruction buffers. Given we have restriction that all the jumps are
> >> > only within patch buffer, it will be trivial to construct proper
> >> > patchable instruction wrappers for newly added instructions, with NULL
> >> > for aux_data and possibly non-NULL target (if it's a JMP insn).
> >> > 4. After those passes, linearize, adjust subprogs (for this you'll
> >> > probably still need to create index mapping, right?), copy or create
> >> > new aux_data.
> >> > 5. Done.
> >> >
> >> > What do you think? I think this should be overall simpler and faster.
> >> > But let me know if I'm missing something.
> >>
> >> Thanks for all these thoughts, they are very good suggestions and reminds
> >> me to revisit some points I had forgotten. I will do the following things:
> >>
> >>   1. retry the negative index solution to eliminate flag if the result code
> >>      could be clean.
> >>   2. the "target" pointer seems make sense, it makes list_insn bigger but
> >>      normally space trade with time, so I will try to implement it to see
> >>      how the code looks like.
> >>   3. I still have concerns on making aux_data as pointer. Mostly due to
> >>      patched insn will have NULL pointer and in case aux info of patched
> >>      insn is required, we need to duplicate info inside list_insn. For
> >>      example 32-bit zext opt requires zext_dst.
> >>
> >
> >
> > So one more thing I wanted to suggest. I'll try to keep high-level
> > suggestions here.
> >
> > What about having a wrapper for patchable_insn list, where you can
> > store some additional data, like final count and whatever else. It
> > will eliminate some passes (counting) and will make list handling
> > easier (because you can have a dummy head pointer, so no special
> > handling of first element
>
> Will try it.
>
> > you had this concern in patch #1, I
> > believe). But it will be clear if it's beneficial once implemented.
>
> >> Regards,
> >> Jiong
> >>
> >> >>
> >> >> Compared with old patching code, this new infrastructure has much less core
> >> >> code, even though the final code has a couple of extra lines but that is
> >> >> mostly due to for list based infrastructure, we need to do more error
> >> >> checks, so the list and associated aux data structure could be freed when
> >> >> errors happens.
> >> >>
> >> >> Patching Restrictions
> >> >> ===
> >> >>   - For core layer, the linearization assume no new jumps inside patch buf.
> >> >>     Currently, the only user of this layer is jit blinding.
> >> >>   - For verifier layer, there could be new jumps inside patch buf, but
> >> >>     they should have branch target resolved themselves, meaning new jumps
> >> >>     doesn't jump to insns out of the patch buf. This is the case for all
> >> >>     existing verifier layer users.
> >> >>   - bpf_insn_aux_data for all patched insns including the one at patch
> >> >>     point are invalidated, only 32-bit zext info will be recalcuated.
> >> >>     If the aux data of insn at patch point needs to be retained, it is
> >> >>     purely insn insertion, so need to use the pre-patch API.
> >> >>
> >> >> I plan to send out a PATCH set once I finished insn deletion line info adj
> >> >> support, please have a looks at this RFC, and appreciate feedbacks.
> >> >>
> >> >> Jiong Wang (8):
> >> >>   bpf: introducing list based insn patching infra to core layer
> >> >>   bpf: extend list based insn patching infra to verification layer
> >> >>   bpf: migrate jit blinding to list patching infra
> >> >>   bpf: migrate convert_ctx_accesses to list patching infra
> >> >>   bpf: migrate fixup_bpf_calls to list patching infra
> >> >>   bpf: migrate zero extension opt to list patching infra
> >> >>   bpf: migrate insn remove to list patching infra
> >> >>   bpf: delete all those code around old insn patching infrastructure
> >> >>
> >> >>  include/linux/bpf_verifier.h |   1 -
> >> >>  include/linux/filter.h       |  27 +-
> >> >>  kernel/bpf/core.c            | 431 +++++++++++++++++-----------
> >> >>  kernel/bpf/verifier.c        | 649 +++++++++++++++++++------------------------
> >> >>  4 files changed, 580 insertions(+), 528 deletions(-)
> >> >>
> >> >> --
> >> >> 2.7.4
> >> >>
> >>
>

^ permalink raw reply

* Re: [RFC bpf-next 0/8] bpf: accelerate insn patching speed
From: Andrii Nakryiko @ 2019-07-15 23:00 UTC (permalink / raw)
  To: Jiong Wang
  Cc: Alexei Starovoitov, Daniel Borkmann, Edward Cree, Naveen N. Rao,
	Andrii Nakryiko, Jakub Kicinski, bpf, Networking, oss-drivers,
	Yonghong Song
In-Reply-To: <CAEf4BzYDAVUgajz4=dRTu5xQDddp5pi2s=T1BdFmRLZjOwGypQ@mail.gmail.com>

On Mon, Jul 15, 2019 at 3:55 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Jul 15, 2019 at 2:21 AM Jiong Wang <jiong.wang@netronome.com> wrote:
> >
> >
> > Andrii Nakryiko writes:
> >
> > > On Thu, Jul 11, 2019 at 4:22 AM Jiong Wang <jiong.wang@netronome.com> wrote:
> > >>
> > >>
> > >> Andrii Nakryiko writes:
> > >>
> > >> > On Thu, Jul 4, 2019 at 2:31 PM Jiong Wang <jiong.wang@netronome.com> wrote:
> > >> >>
> > >> >> This is an RFC based on latest bpf-next about acclerating insn patching
> > >> >> speed, it is now near the shape of final PATCH set, and we could see the
> > >> >> changes migrating to list patching would brings, so send out for
> > >> >> comments. Most of the info are in cover letter. I splitted the code in a
> > >> >> way to show API migration more easily.
> > >> >
> > >> >
> > >> > Hey Jiong,
> > >> >
> > >> >
> > >> > Sorry, took me a while to get to this and learn more about instruction
> > >> > patching. Overall this looks good and I think is a good direction.
> > >> > I'll post high-level feedback here, and some more
> > >> > implementation-specific ones in corresponding patches.
> > >>
> > >> Great, thanks very much for the feedbacks. Most of your feedbacks are
> > >> hitting those pain points I exactly had ran into. For some of them, I
> > >> thought similar solutions like yours, but failed due to various
> > >> reasons. Let's go through them again, I could have missed some important
> > >> things.
> > >>
> > >> Please see my replies below.
> > >
> > > Thanks for thoughtful reply :)
> > >
> > >>
> > >> >>
> > >> >> Test Results
> > >> >> ===
> > >> >>   - Full pass on test_verifier/test_prog/test_prog_32 under all three
> > >> >>     modes (interpreter, JIT, JIT with blinding).
> > >> >>
> > >> >>   - Benchmarking shows 10 ~ 15x faster on medium sized prog, and reduce
> > >> >>     patching time from 5100s (nearly one and a half hour) to less than
> > >> >>     0.5s for 1M insn patching.
> > >> >>
> > >> >> Known Issues
> > >> >> ===
> > >> >>   - The following warning is triggered when running scale test which
> > >> >>     contains 1M insns and patching:
> > >> >>       warning of mm/page_alloc.c:4639 __alloc_pages_nodemask+0x29e/0x330
> > >> >>
> > >> >>     This is caused by existing code, it can be reproduced on bpf-next
> > >> >>     master with jit blinding enabled, then run scale unit test, it will
> > >> >>     shown up after half an hour. After this set, patching is very fast, so
> > >> >>     it shows up quickly.
> > >> >>
> > >> >>   - No line info adjustment support when doing insn delete, subprog adj
> > >> >>     is with bug when doing insn delete as well. Generally, removal of insns
> > >> >>     could possibly cause remove of entire line or subprog, therefore
> > >> >>     entries of prog->aux->linfo or env->subprog needs to be deleted. I
> > >> >>     don't have good idea and clean code for integrating this into the
> > >> >>     linearization code at the moment, will do more experimenting,
> > >> >>     appreciate ideas and suggestions on this.
> > >> >
> > >> > Is there any specific problem to detect which line info to delete? Or
> > >> > what am I missing besides careful implementation?
> > >>
> > >> Mostly line info and subprog info are range info which covers a range of
> > >> insns. Deleting insns could causing you adjusting the range or removing one
> > >> range entirely. subprog info could be fully recalcuated during
> > >> linearization while line info I need some careful implementation and I
> > >> failed to have clean code for this during linearization also as said no
> > >> unit tests to help me understand whether the code is correct or not.
> > >>
> > >
> > > Ok, that's good that it's just about clean implementation. Try to
> > > implement it as clearly as possible. Then post it here, and if it can
> > > be improved someone (me?) will try to help to clean it up further.
> > >
> > > Not a big expert on line info, so can't comment on that,
> > > unfortunately. Maybe Yonghong can chime in (cc'ed)
> > >
> > >
> > >> I will described this latter, spent too much time writing the following
> > >> reply. Might worth an separate discussion thread.
> > >>
> > >> >>
> > >> >>     Insn delete doesn't happen on normal programs, for example Cilium
> > >> >>     benchmarks, and happens rarely on test_progs, so the test coverage is
> > >> >>     not good. That's also why this RFC have a full pass on selftest with
> > >> >>     this known issue.
> > >> >
> > >> > I hope you'll add test for deletion (and w/ corresponding line info)
> > >> > in final patch set :)
> > >>
> > >> Will try. Need to spend some time on BTF format.
> > >> >
> > >> >>
> > >> >>   - Could further use mem pool to accelerate the speed, changes are trivial
> > >> >>     on top of this RFC, and could be 2x extra faster. Not included in this
> > >> >>     RFC as reducing the algo complexity from quadratic to linear of insn
> > >> >>     number is the first step.
> > >> >
> > >> > Honestly, I think that would add more complexity than necessary, and I
> > >> > think we can further speed up performance without that, see below.
> > >> >
> > >> >>
> > >> >> Background
> > >> >> ===
> > >> >> This RFC aims to accelerate BPF insn patching speed, patching means expand
> > >> >> one bpf insn at any offset inside bpf prog into a set of new insns, or
> > >> >> remove insns.
> > >> >>
> > >> >> At the moment, insn patching is quadratic of insn number, this is due to
> > >> >> branch targets of jump insns needs to be adjusted, and the algo used is:
> > >> >>
> > >> >>   for insn inside prog
> > >> >>     patch insn + regeneate bpf prog
> > >> >>     for insn inside new prog
> > >> >>       adjust jump target
> > >> >>
> > >> >> This is causing significant time spending when a bpf prog requires large
> > >> >> amount of patching on different insns. Benchmarking shows it could take
> > >> >> more than half minutes to finish patching when patching number is more
> > >> >> than 50K, and the time spent could be more than one hour when patching
> > >> >> number is around 1M.
> > >> >>
> > >> >>   15000   :    3s
> > >> >>   45000   :   29s
> > >> >>   95000   :  125s
> > >> >>   195000  :  712s
> > >> >>   1000000 : 5100s
> > >> >>
> > >> >> This RFC introduces new patching infrastructure. Before doing insn
> > >> >> patching, insns in bpf prog are turned into a singly linked list, insert
> > >> >> new insns just insert new list node, delete insns just set delete flag.
> > >> >> And finally, the list is linearized back into array, and branch target
> > >> >> adjustment is done for all jump insns during linearization. This algo
> > >> >> brings the time complexity from quadratic to linear of insn number.
> > >> >>
> > >> >> Benchmarking shows the new patching infrastructure could be 10 ~ 15x faster
> > >> >> on medium sized prog, and for a 1M patching it reduce the time from 5100s
> > >> >> to less than 0.5s.
> > >> >>
> > >> >> Patching API
> > >> >> ===
> > >> >> Insn patching could happen on two layers inside BPF. One is "core layer"
> > >> >> where only BPF insns are patched. The other is "verification layer" where
> > >> >> insns have corresponding aux info as well high level subprog info, so
> > >> >> insn patching means aux info needs to be patched as well, and subprog info
> > >> >> needs to be adjusted. BPF prog also has debug info associated, so line info
> > >> >> should always be updated after insn patching.
> > >> >>
> > >> >> So, list creation, destroy, insert, delete is the same for both layer,
> > >> >> but lineration is different. "verification layer" patching require extra
> > >> >> work. Therefore the patch APIs are:
> > >> >>
> > >> >>    list creation:                bpf_create_list_insn
> > >> >>    list patch:                   bpf_patch_list_insn
> > >> >>    list pre-patch:               bpf_prepatch_list_insn
> > >> >
> > >> > I think pre-patch name is very confusing, until I read full
> > >> > description I couldn't understand what it's supposed to be used for.
> > >> > Speaking of bpf_patch_list_insn, patch is also generic enough to leave
> > >> > me wondering whether instruction buffer is inserted after instruction,
> > >> > or instruction is replaced with a bunch of instructions.
> > >> >
> > >> > So how about two more specific names:
> > >> > bpf_patch_list_insn -> bpf_list_insn_replace (meaning replace given
> > >> > instruction with a list of patch instructions)
> > >> > bpf_prepatch_list_insn -> bpf_list_insn_prepend (well, I think this
> > >> > one is pretty clear).
> > >>
> > >> My sense on English word is not great, will switch to above which indeed
> > >> reads more clear.
> > >>
> > >> >>    list lineration (core layer): prog = bpf_linearize_list_insn(prog, list)
> > >> >>    list lineration (veri layer): env = verifier_linearize_list_insn(env, list)
> > >> >
> > >> > These two functions are both quite involved, as well as share a lot of
> > >> > common code. I'd rather have one linearize instruction, that takes env
> > >> > as an optional parameter. If env is specified (which is the case for
> > >> > all cases except for constant blinding pass), then adjust aux_data and
> > >> > subprogs along the way.
> > >>
> > >> Two version of lineration and how to unify them was a painpoint to me. I
> > >> thought to factor out some of the common code out, but it actually doesn't
> > >> count much, the final size counting + insnsi resize parts are the same,
> > >> then things start to diverge since the "Copy over insn" loop.
> > >>
> > >> verifier layer needs to copy and initialize aux data etc. And jump
> > >> relocation is different. At core layer, the use case is JIT blinding which
> > >> could expand an jump_imm insn into a and/or/jump_reg sequence, and the
> > >
> > > Sorry, I didn't get what "could expand an jump_imm insn into a
> > > and/or/jump_reg sequence", maybe you can clarify if I'm missing
> > > something.
> > >
> > > But from your cover letter description, core layer has no jumps at
> > > all, while verifier has jumps inside patch buffer. So, if you support
> > > jumps inside of patch buffer, it will automatically work for core
> > > layer. Or what am I missing?
> >
> > I meant in core layer (JIT blinding), there is the following patching:
> >
> > input:
> >   insn 0             insn 0
> >   insn 1             insn 1
> >   jmp_imm   >>       mov_imm  \
> >   insn 2             xor_imm    insn seq expanded from jmp_imm
> >   insn 3             jmp_reg  /
> >                      insn 2
> >                      insn 3
> >
> >
> > jmp_imm is the insn that will be patched, and the actually transformation
> > is to expand it into mov_imm/xor_imm/jmp_reg sequence. "jmp_reg", sitting
> > at the end of the patch buffer, must jump to the same destination as the
> > original jmp_imm, so "jmp_reg" is an insn inside patch buffer but should
> > be relocated, and the jump destination is outside of patch buffer.
>
>
> Ok, great, thanks for explaining, yeah it's definitely something that
> we should be able to support. BUT. It got me thinking a bit more and I
> think I have simpler and more elegant solution now, again, supporting
> both core-layer and verifier-layer operations.
>
> struct bpf_patchable_insn {
>    struct bpf_patchable_insn *next;
>    struct bpf_insn insn;
>    int orig_idx; /* original non-patched index */
>    int new_idx;  /* new index, will be filled only during linearization */
> };
>
> struct bpf_patcher {
>     /* dummy head node of a chain of patchable instructions */
>     struct bpf_patchable_insn insn_head;
>     /* dynamic array of size(original instruction count)
>      * this is a map from original instruction index to a first
>      * patchable instruction that replaced that instruction (or
>      * just original instruction as bpf_patchable_insn).
>      */
>     int *orig_idx_to_patchable_insn;
>     int cnt;
> };
>
> Few points, but it should be pretty clear just from comments and definitions:
> 1. When you created bpf_patcher, you create patchabe_insn list, fill
> orig_idx_to_patchable_insn map to store proper pointers. This array is
> NEVER changed after that.
> 2. When replacing instruction, you re-use struct bpf_patchable_insn
> for first patched instruction, then append after that (not prepend to
> next instruction to not disrupt orig_idx -> patchable_insn mapping).
> 3. During linearizations, you first traverse the chain of instructions
> and trivially assing new_idxs.
> 4. No need for patchabe_insn->target anymore. All jumps use relative
> instruction offsets, right? So when you need to determine new
> instruction index during linearization, you just do (after you
> calculated new instruction indicies):
>
> func adjust_jmp(struct bpf_patcher* patcher, struct bpf_patchable_insn *insn) {
>    int old_jmp_idx = insn->orig_idx + jmp_offset_of(insn->insn);
>    int new_jmp_idx = patcher->orig_idx_to_patchable_insn[old_jmp_idx]->new_idx;
>    adjust_jmp_offset(insn->insn, new_jmp_idx) - insn->orig_idx;
> }

Forgot to mention. To handle deleted insns, you can either traverse
till you find first non-deleted instruction after that one, or during
filling of new_idx, just make sure to that new_idx of deleted
instruction is the same as the first non-deleted instruction's
new_idx.

For subprogs (presuming there is no case where we can just eliminate
entire subprog), you can just do simple look up from original index to
a new index. No need to copy/recalculate, etc. It's just orig_idx ->
new_idx mapping.

>
> The idea is that we want to support quick look-up by original
> instruction index. That's what orig_idx_to_patchable_insn provides. On
> the other hand, no existing instruction is ever referencing newly
> patched instruction by its new offset, so with careful implementation,
> you can transparently support all the cases, regardless if it's in
> core layer or verifier layer (so, e.g., verifier layer patched
> instructions now will be able to jump out of patched buffer, if
> necessary, neat, right?).
>
> It is cleaner than everything we've discussed so far. Unless I missed
> something critical (it's all quite convoluted, so I might have
> forgotten some parts already). Let me know what you think.
>
>
> >
> > This means for core layer (jit blinding), it needs to take care of insn
> > inside patch buffer.
> >
> > > Just compared two version of linearize side by side. From what I can
> > > see, unified version could look like this, high-level:
> > >
> > > 1. Count final insn count (but see my other suggestions how to avoid
> > > that altogether). If not changed - exit.
> > > 2. Realloc insn buffer, copy just instructions (not aux_data yet).
> > > Build idx_map, if necessary.
> > > 3. (if env) then bpf_patchable_insn has aux_data, so now do another
> > > pass and copy it into resulting array.
> > > 4. (if env) Copy sub info. Though I'd see if we can just reuse old
> > > ones and just adjust offsets. I'm not sure why we need to allocate new
> > > array, subprogram count shouldn't change, right?
> >
> > If there is no dead insn elimination opt, then we could just adjust
> > offsets. When there is insn deleting, I feel the logic becomes more
> > complex. One subprog could be completely deleted or partially deleted, so
> > I feel just recalculate the whole subprog info as a side-product is
> > much simpler.
>
> What's the situation where entirety of subprog can be deleted?
>
>
> >
> > > 5. (common) Relocate jumps. Not clear why core layer doesn't care
> > > about PATCHED (or, alternatively, why verifier layer cares).
> >
> > See above, in this RFC, core layer care PATCHED during relocating jumps,
> > and verifier layer doesn't.
> >
> > > And again, with targets pointer it will look totally different (and
> > > simpler).
> >
> > Yes, will see how the code looks.
> >
> > > 6. (if env) adjust subprogs
> > > 7. (common) Adjust prog's line info.
> > >
> > > The devil is in the details, but I think this will still be better if
> > > contained in one function if a bunch of `if (env)` checks. Still
> > > pretty linear.
> > >
> > >> jump_reg is at the end of the patch buffer, it should be relocated. While
> > >> all use case in verifier layer, no jump in the prog will be patched and all
> > >> new jumps in patch buffer will jump inside the buffer locally so no need to
> > >> resolve.
> > >>
> > >> And yes we could unify them into one and control the diverge using
> > >> argument, but then where to place the function is an issue. My
> > >> understanding is verifier.c is designed to be on top of core.c and core.c
> > >> should not reference and no need to be aware of any verifier specific data
> > >> structures, for example env or bpf_aux_insn_data etc.
> > >
> > > Func prototype where it is. Maybe forward-declare verifier env struct.
> > > Implementation in verifier.c?
> > >
> > >>
> > >> So, in this RFC, I had choosed to write separate linerization function for
> > >> core and verifier layer. Does this make sense?
> > >
> > > See above. Let's still try to make it better.
> > >
> > >>
> > >> >
> > >> > This would keep logic less duplicated and shouldn't complexity beyond
> > >> > few null checks in few places.
> > >> >
> > >> >>    list destroy:                 bpf_destroy_list_insn
> > >> >>
> > >> >
> > >> > I'd also add a macro foreach_list_insn instead of explicit for loops
> > >> > in multiple places. That would also allow to skip deleted instructions
> > >> > transparently.
> > >> >
> > >> >> list patch could change the insn at patch point, it will invalid the aux
> > >> >
> > >> > typo: invalid -> invalidate
> > >>
> > >> Ack.
> > >>
> > >> >
> > >> >> info at patching point. list pre-patch insert new insns before patch point
> > >> >> where the insn and associated aux info are not touched, it is used for
> > >> >> example in convert_ctx_access when generating prologue.
> > >> >>
> > >> >> Typical API sequence for one patching pass:
> > >> >>
> > >> >>    struct bpf_list_insn list = bpf_create_list_insn(struct bpf_prog);
> > >> >>    for (elem = list; elem; elem = elem->next)
> > >> >>       patch_buf = gen_patch_buf_logic;
> > >> >>       elem = bpf_patch_list_insn(elem, patch_buf, cnt);
> > >> >>    bpf_prog = bpf_linearize_list_insn(list)
> > >> >>    bpf_destroy_list_insn(list)
> > >> >>
> > >> >> Several patching passes could also share the same list:
> > >> >>
> > >> >>    struct bpf_list_insn list = bpf_create_list_insn(struct bpf_prog);
> > >> >>    for (elem = list; elem; elem = elem->next)
> > >> >>       patch_buf = gen_patch_buf_logic1;
> > >> >>       elem = bpf_patch_list_insn(elem, patch_buf, cnt);
> > >> >>    for (elem = list; elem; elem = elem->next)
> > >> >>       patch_buf = gen_patch_buf_logic2;
> > >> >>       elem = bpf_patch_list_insn(elem, patch_buf, cnt);
> > >> >>    bpf_prog = bpf_linearize_list_insn(list)
> > >> >>    bpf_destroy_list_insn(list)
> > >> >>
> > >> >> but note new inserted insns int early passes won't have aux info except
> > >> >> zext info. So, if one patch pass requires all aux info updated and
> > >> >> recalculated for all insns including those pathced, it should first
> > >> >> linearize the old list, then re-create the list. The RFC always create and
> > >> >> linearize the list for each migrated patching pass separately.
> > >> >
> > >> > I think we should do just one list creation, few passes of patching
> > >> > and then linearize once. That will save quite a lot of memory
> > >> > allocation and will speed up a lot of things. All the verifier
> > >> > patching happens one after the other without any other functionality
> > >> > in between, so there shouldn't be any problem.
> > >>
> > >> Yes, as mentioned above, it is possible and I had tried to do it in an very
> > >> initial impl. IIRC convert_ctx_access + fixup_bpf_calls could share the
> > >> same list, but then the 32-bit zero extension insertion pass requires
> > >> aux.zext_dst set properly for all instructions including those patched
> > >
> > > So zext_dst. Seems like it's easily calculatable, so doesn't seem like
> > > it even needs to be accessed from aux_data.
> > >
> > > But. I can see at least two ways to do this:
> > > 1. those patching passes that care about aux_data, should just do
> > > extra check for NULL. Because when we adjust insns now, we just leave
> > > zero-initialized aux_data, except for zext_dst and seen. So it's easy
> > > to default to them if aux_data is NULL for patchable_insn.
> > > 2. just allocate and fill them out them when applying patch insns
> > > buffer. It's not a duplication, we already fill them out during
> > > patching today. So just do the same, except through malloc()'ed
> > > pointer instead. At the end they will be copied into linear resulting
> > > array during linearization (uniformly with non-patched insns).
> > >
> > >> one which we need to linearize the list first (as we set zext_dst during
> > >> linerization), or the other choice is we do the zext_dst initialization
> > >> during bpf_patch_list_insn, but this then make bpf_patch_list_insn diverge
> > >> between core and verifier layer.
> > >
> > > List construction is much simpler, even if we have to have extra
> > > check, similar to `if (env) { do_extra(); }`, IMO, it's fine.
> > >
> > >>
> > >> > As for aux_data. We can solve that even more simply and reliably by
> > >> > storing a pointer along the struct bpf_list_insn
> > >>
> > >> This is exactly what I had implemented initially, but then the issue is how
> > >> to handle aux_data for patched insn? IIRC I was leave it as a NULL pointer,
> > >> but later found zext_dst info is required for all insns, so I end up
> > >> duplicating zext_dst in bpf_list_insn.
> > >
> > > See above. No duplication. You have a pointer. Whether aux_data is in
> > > original array or was malloc()'ed, doesn't matter. But no duplication
> > > of fields.
> > >
> > >>
> > >> This leads me worrying we need to keep duplicating fields there as soon as
> > >> there is new similar requirements in future patching pass and I thought it
> > >> might be better to just reference the aux_data inside env using orig_idx,
> > >> this avoids duplicating information, but we need to make sure used fields
> > >> inside aux_data for patched insn update-to-date during linearization or
> > >> patching list.
> > >>
> > >> > (btw, how about calling it bpf_patchable_insn?).
> > >>
> > >> No preference, will use this one.
> > >>
> > >> > Here's how I propose to represent this patchable instruction:
> > >> >
> > >> > struct bpf_list_insn {
> > >> >        struct bpf_insn insn;
> > >> >        struct bpf_list_insn *next;
> > >> >        struct bpf_list_insn *target;
> > >> >        struct bpf_insn_aux_data *aux_data;
> > >> >        s32 orig_idx; // can repurpose this to have three meanings:
> > >> >                      // -2 - deleted
> > >> >                      // -1 - patched/inserted insn
> > >> >                      // >=0 - original idx
> > >>
> > >> I actually had experimented the -2/-1/0 trick, exactly the same number
> > >> assignment :) IIRC the code was not clear compared with using flag, the
> > >> reason seems to be:
> > >>   1. we still need orig_idx of an patched insn somehow, meaning negate the
> > >>      index.
> > >
> > > Not following, original index with be >=0, no?
> > >
> > >>   2. somehow somecode need to know whether one insn is deleted or patched
> > >>      after the negation, so I end up with some ugly code.
> > >
> > > So that's why you'll have constants with descriptive name for -2 and -1.
> > >
> > >>
> > >> Anyway, I might had not thought hard enough on this, I will retry using the
> > >> special index instead of flag, hopefully I could have clean code this time.
> > >>
> > >
> > > Yeah, please try again. All those `orig_idx = insn->orig_idx - 1; if
> > > (orig_idx >= 0) { ... }` are very confusing.
> > >
> > >> > };
> > >> >
> > >> > The idea would be as follows:
> > >> > 1. when creating original list, target pointer will point directly to
> > >> > a patchable instruction wrapper for jumps/calls. This will allow to
> > >> > stop tracking and re-calculating jump offsets and instruction indicies
> > >> > until linearization.
> > >>
> > >> Not sure I have followed the idea of "target" pointer. At the moment we are
> > >> using index mapping array (generated as by-product during coping insn).
> > >>
> > >> While the "target" pointer means to during list initialization, each jump
> > >> insn will have target initialized to the list node of the converted jump
> > >> destination insn, and all those non-jump insns are with NULL? Then during
> > >> linearization you assign index to each list node (could be done as
> > >> by-product of other pass) before insn coping which could then relocate the
> > >> insn during the coping as the "target" would have final index calculated?
> > >> Am I following correctly?
> > >
> > > Yes, I think you are understanding correctly what I'm saying. For
> > > implementation, you can do it in few ways, through few passes or with
> > > some additional data, is less important. See what's cleanest.
> > >
> > >>
> > >> > 2. aux_data is also filled at that point. Later at linearization time
> > >> > you'd just iterate over all the instructions in final order and copy
> > >> > original aux_data, if it's present. And then just repace env's
> > >> > aux_data array at the end, should be very simple and fast.
> > >>
> > >> As explained, I am worried making aux_data a pointer will causing
> > >> duplicating some fields into list_insn if the fields are required for
> > >> patched insns.
> > >
> > > Addressed above, I don't think there will be any duplication, because
> > > we pass aux_data by pointer.
> > >
> > >>
> > >> > 3. during fix_bpf_calls, zext, ctx rewrite passes, we'll reuse the
> > >> > same list of instructions and those passes will just keep inserting
> > >> > instruction buffers. Given we have restriction that all the jumps are
> > >> > only within patch buffer, it will be trivial to construct proper
> > >> > patchable instruction wrappers for newly added instructions, with NULL
> > >> > for aux_data and possibly non-NULL target (if it's a JMP insn).
> > >> > 4. After those passes, linearize, adjust subprogs (for this you'll
> > >> > probably still need to create index mapping, right?), copy or create
> > >> > new aux_data.
> > >> > 5. Done.
> > >> >
> > >> > What do you think? I think this should be overall simpler and faster.
> > >> > But let me know if I'm missing something.
> > >>
> > >> Thanks for all these thoughts, they are very good suggestions and reminds
> > >> me to revisit some points I had forgotten. I will do the following things:
> > >>
> > >>   1. retry the negative index solution to eliminate flag if the result code
> > >>      could be clean.
> > >>   2. the "target" pointer seems make sense, it makes list_insn bigger but
> > >>      normally space trade with time, so I will try to implement it to see
> > >>      how the code looks like.
> > >>   3. I still have concerns on making aux_data as pointer. Mostly due to
> > >>      patched insn will have NULL pointer and in case aux info of patched
> > >>      insn is required, we need to duplicate info inside list_insn. For
> > >>      example 32-bit zext opt requires zext_dst.
> > >>
> > >
> > >
> > > So one more thing I wanted to suggest. I'll try to keep high-level
> > > suggestions here.
> > >
> > > What about having a wrapper for patchable_insn list, where you can
> > > store some additional data, like final count and whatever else. It
> > > will eliminate some passes (counting) and will make list handling
> > > easier (because you can have a dummy head pointer, so no special
> > > handling of first element
> >
> > Will try it.
> >
> > > you had this concern in patch #1, I
> > > believe). But it will be clear if it's beneficial once implemented.
> >
> > >> Regards,
> > >> Jiong
> > >>
> > >> >>
> > >> >> Compared with old patching code, this new infrastructure has much less core
> > >> >> code, even though the final code has a couple of extra lines but that is
> > >> >> mostly due to for list based infrastructure, we need to do more error
> > >> >> checks, so the list and associated aux data structure could be freed when
> > >> >> errors happens.
> > >> >>
> > >> >> Patching Restrictions
> > >> >> ===
> > >> >>   - For core layer, the linearization assume no new jumps inside patch buf.
> > >> >>     Currently, the only user of this layer is jit blinding.
> > >> >>   - For verifier layer, there could be new jumps inside patch buf, but
> > >> >>     they should have branch target resolved themselves, meaning new jumps
> > >> >>     doesn't jump to insns out of the patch buf. This is the case for all
> > >> >>     existing verifier layer users.
> > >> >>   - bpf_insn_aux_data for all patched insns including the one at patch
> > >> >>     point are invalidated, only 32-bit zext info will be recalcuated.
> > >> >>     If the aux data of insn at patch point needs to be retained, it is
> > >> >>     purely insn insertion, so need to use the pre-patch API.
> > >> >>
> > >> >> I plan to send out a PATCH set once I finished insn deletion line info adj
> > >> >> support, please have a looks at this RFC, and appreciate feedbacks.
> > >> >>
> > >> >> Jiong Wang (8):
> > >> >>   bpf: introducing list based insn patching infra to core layer
> > >> >>   bpf: extend list based insn patching infra to verification layer
> > >> >>   bpf: migrate jit blinding to list patching infra
> > >> >>   bpf: migrate convert_ctx_accesses to list patching infra
> > >> >>   bpf: migrate fixup_bpf_calls to list patching infra
> > >> >>   bpf: migrate zero extension opt to list patching infra
> > >> >>   bpf: migrate insn remove to list patching infra
> > >> >>   bpf: delete all those code around old insn patching infrastructure
> > >> >>
> > >> >>  include/linux/bpf_verifier.h |   1 -
> > >> >>  include/linux/filter.h       |  27 +-
> > >> >>  kernel/bpf/core.c            | 431 +++++++++++++++++-----------
> > >> >>  kernel/bpf/verifier.c        | 649 +++++++++++++++++++------------------------
> > >> >>  4 files changed, 580 insertions(+), 528 deletions(-)
> > >> >>
> > >> >> --
> > >> >> 2.7.4
> > >> >>
> > >>
> >

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox