Netdev List

Netdev List
 help / color / mirror / Atom feed

* lockup in hacked 4.20.17+ kernel, maybe addrconf_verify_work related?
From: Ben Greear @ 2019-07-10 18:51 UTC (permalink / raw)
  To: netdev

Hello,

This is with my hacked-upon kernel, could be my mistake, etc.  But, curious
if anyone else has seen a similar deadlock?  I was running a complicated automated
wifi test, for what that is worth.

66707.212081] ath: regdomain 0x8348 dynamically updated by user
[67044.625948] INFO: task kworker/0:0:27387 blocked for more than 180 seconds.
[67044.631831]       Tainted: G        W  O      4.20.17+ #30
[67044.636180] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[67044.647424] kworker/0:0     D    0 27387      2 0x80000080
[67044.647447] Workqueue: ipv6_addrconf addrconf_verify_work [ipv6]
[67044.647448] Call Trace:
[67044.647455]  ? __schedule+0x29e/0x880
[67044.647457]  schedule+0x2a/0x80
[67044.647459]  schedule_preempt_disabled+0xc/0x20
[67044.647460]  __mutex_lock.isra.10+0x2e7/0x4f0
[67044.647468]  addrconf_verify_work+0x5/0x10 [ipv6]
[67044.647472]  process_one_work+0x1f3/0x420
[67044.647473]  worker_thread+0x28/0x3c0
[67044.647475]  ? process_one_work+0x420/0x420
[67044.647476]  kthread+0x10b/0x130
[67044.647478]  ? kthread_create_on_node+0x60/0x60
[67044.647480]  ret_from_fork+0x1f/0x30
[67044.647491] INFO: task DNS Resolver #6:19810 blocked for more than 180 seconds.
[67044.656865]       Tainted: G        W  O      4.20.17+ #30
[67044.661364] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[67044.670861] DNS Resolver #6 D    0 19810  19213 0x00000080
[67044.670863] Call Trace:
[67044.670870]  ? __schedule+0x29e/0x880
[67044.670872]  schedule+0x2a/0x80
[67044.670874]  schedule_preempt_disabled+0xc/0x20
[67044.670876]  __mutex_lock.isra.10+0x2e7/0x4f0
[67044.670879]  ? netlink_lookup+0x111/0x160
[67044.670881]  __netlink_dump_start+0x4f/0x1d0
[67044.670883]  ? rtnl_xdp_prog_skb+0x60/0x60
[67044.670885]  rtnetlink_rcv_msg+0x25c/0x390
[67044.670886]  ? rtnl_xdp_prog_skb+0x60/0x60
[67044.670888]  ? rtnl_calcit.isra.31+0x110/0x110
[67044.670889]  netlink_rcv_skb+0x44/0x120
[67044.670891]  netlink_unicast+0x18b/0x220
[67044.670893]  netlink_sendmsg+0x1ff/0x3d0
[67044.670896]  sock_sendmsg+0x2b/0x40
[67044.670898]  __sys_sendto+0xe9/0x150
[67044.670901]  ? __audit_syscall_exit+0x216/0x280
[67044.670903]  __x64_sys_sendto+0x1f/0x30
[67044.670906]  do_syscall_64+0x4a/0xf0
[67044.670909]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[67044.670911] RIP: 0033:0x7f94b119b6b0
[67044.670915] Code: Bad RIP value.
[67044.670915] RSP: 002b:00007f94817942e0 EFLAGS: 00000293 ORIG_RAX: 000000000000002c
[67044.670917] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f94b119b6b0
[67044.670917] RDX: 0000000000000014 RSI: 00007f9481795420 RDI: 000000000000004d
[67044.670918] RBP: 00007f9481795420 R08: 00007f94817953c4 R09: 000000000000000c
[67044.670919] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000014
[67044.670919] R13: 00007f94817953c4 R14: 00007f947ff42208 R15: 000000000000004d
[67044.670933] INFO: task kworker/0:2:17500 blocked for more than 180 seconds.
[67044.678868]       Tainted: G        W  O      4.20.17+ #30
[67044.685860] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[67044.692746] kworker/0:2     D    0 17500      2 0x80000080
[67044.692787] Workqueue: events_power_efficient reg_check_chans_work [cfg80211]
[67044.692788] Call Trace:
[67044.692795]  ? __schedule+0x29e/0x880
[67044.692797]  schedule+0x2a/0x80
[67044.692799]  schedule_preempt_disabled+0xc/0x20
[67044.692800]  __mutex_lock.isra.10+0x2e7/0x4f0
[67044.692810]  reg_check_chans_work+0x28/0x350 [cfg80211]
[67044.692815]  process_one_work+0x1f3/0x420
[67044.692817]  worker_thread+0x28/0x3c0
[67044.692819]  ? process_one_work+0x420/0x420
[67044.692821]  kthread+0x10b/0x130
[67044.692822]  ? kthread_create_on_node+0x60/0x60
[67044.692825]  ret_from_fork+0x1f/0x30
[67044.692833] INFO: task iw:1488 blocked for more than 180 seconds.
[67044.700860]       Tainted: G        W  O      4.20.17+ #30
[67044.705216] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[67044.714862] iw              D    0  1488   1487 0x00000080
[67044.714864] Call Trace:
[67044.714871]  ? __schedule+0x29e/0x880
[67044.714874]  schedule+0x2a/0x80
[67044.714876]  schedule_preempt_disabled+0xc/0x20
[67044.714877]  __mutex_lock.isra.10+0x2e7/0x4f0
[67044.714915]  nl80211_dump_wiphy+0x1d/0x1b0 [cfg80211]
[67044.714919]  genl_lock_dumpit+0x23/0x40
[67044.714921]  netlink_dump+0x16d/0x360
[67044.714923]  __netlink_dump_start+0x168/0x1d0
[67044.714925]  genl_family_rcv_msg+0x25f/0x3a0
[67044.714927]  ? genl_lock_dumpit+0x40/0x40
[67044.714928]  ? genl_lock_done+0x40/0x40
[67044.714929]  ? genl_unlock+0x10/0x10
[67044.714931]  ? netlink_unicast+0x1ff/0x220
[67044.714932]  genl_rcv_msg+0x42/0x87
[67044.714934]  ? genl_family_rcv_msg+0x3a0/0x3a0
[67044.714935]  netlink_rcv_skb+0x44/0x120
[67044.714937]  genl_rcv+0x1f/0x30
[67044.714939]  netlink_unicast+0x18b/0x220
[67044.714940]  netlink_sendmsg+0x1ff/0x3d0
[67044.714944]  sock_sendmsg+0x2b/0x40
[67044.714946]  ___sys_sendmsg+0x28a/0x2f0
[67044.714947]  ? ___sys_recvmsg+0x156/0x1d0
[67044.714950]  ? __alloc_pages_nodemask+0x111/0x280
[67044.714954]  ? alloc_pages_vma+0x6f/0x1c0
[67044.714957]  ? page_add_new_anon_rmap+0x72/0xb0
[67044.714958]  ? __handle_mm_fault+0x7db/0x12c0
[67044.714961]  __sys_sendmsg+0x52/0xa0
[67044.714964]  do_syscall_64+0x4a/0xf0
[67044.714967]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[67044.714969] RIP: 0033:0x7fa9c4af15a7
[67044.714972] Code: Bad RIP value.
[67044.714973] RSP: 002b:00007fffdd7ac818 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[67044.714974] RAX: ffffffffffffffda RBX: 00000000021ae990 RCX: 00007fa9c4af15a7
[67044.714975] RDX: 0000000000000000 RSI: 00007fffdd7ac8b0 RDI: 0000000000000008
[67044.714976] RBP: 00000000021b3d80 R08: 0000000000000004 R09: 00007fa9c4dabf20
[67044.714976] R10: 0000000000000170 R11: 0000000000000246 R12: 00000000021b3ec0
[67044.714977] R13: 00007fffdd7ac8b0 R14: 00000000021b3ec0 R15: 00007fffdd7acb18
[67044.714980] INFO: task sshd:1763 blocked for more than 180 seconds.
[67044.720810]       Tainted: G        W  O      4.20.17+ #30
[67044.725186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[67044.732027] sshd            D    0  1763   1355 0x00000080
[67044.732029] Call Trace:
[67044.732038]  ? __schedule+0x29e/0x880
[67044.732040]  schedule+0x2a/0x80
[67044.732042]  schedule_preempt_disabled+0xc/0x20
[67044.732043]  __mutex_lock.isra.10+0x2e7/0x4f0
[67044.732046]  ? netlink_lookup+0x111/0x160
[67044.732048]  __netlink_dump_start+0x4f/0x1d0
[67044.732051]  ? rtnl_xdp_prog_skb+0x60/0x60
[67044.732052]  rtnetlink_rcv_msg+0x25c/0x390
[67044.732054]  ? rtnl_xdp_prog_skb+0x60/0x60
[67044.732055]  ? rtnl_calcit.isra.31+0x110/0x110
[67044.732057]  netlink_rcv_skb+0x44/0x120
[67044.732059]  netlink_unicast+0x18b/0x220
[67044.732060]  netlink_sendmsg+0x1ff/0x3d0
[67044.732064]  sock_sendmsg+0x2b/0x40
[67044.732066]  __sys_sendto+0xe9/0x150
[67044.732070]  ? __audit_syscall_exit+0x216/0x280
[67044.732071]  __x64_sys_sendto+0x1f/0x30
[67044.732075]  do_syscall_64+0x4a/0xf0
[67044.732077]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[67044.732079] RIP: 0033:0x7f16e29c765a
[67044.732082] Code: Bad RIP value.
[67044.732083] RSP: 002b:00007ffe57e52e88 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[67044.732084] RAX: ffffffffffffffda RBX: 00007ffe57e53f80 RCX: 00007f16e29c765a
[67044.732085] RDX: 0000000000000014 RSI: 00007ffe57e53f80 RDI: 0000000000000003
[67044.732085] RBP: 00007ffe57e53fd0 R08: 00007ffe57e53f24 R09: 000000000000000c
[67044.732086] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe57e53f24
[67044.732087] R13: 00007ffe57e54160 R14: 0000000000000000 R15: 0000000000000003

Thanks,
Ben
-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply

* RE: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
From: Prout, Andrew - LLSC - MITLL @ 2019-07-10 18:53 UTC (permalink / raw)
  To: Eric Dumazet, Christoph Paasch
  Cc: David S . Miller, netdev, Greg Kroah-Hartman, Jonathan Looney,
	Neal Cardwell, Tyler Hicks, Yuchung Cheng, Bruce Curtis,
	Jonathan Lemon, Dustin Marquess
In-Reply-To: <96791fd5-8d36-2e00-3fef-60b23bea05e5@gmail.com>

On 7/10/19 2:29 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On 7/10/19 8:23 PM, Prout, Andrew - LLSC - MITLL wrote:
>> On 6/17/19 8:19 PM, Christoph Paasch wrote:
>>>
>>> Yes, this does the trick for my packetdrill-test.
>>>
>>> I wonder, is there a way we could end up in a situation where we can't
>>> retransmit anymore?
>>> For example, sk_wmem_queued has grown so much that the new test fails.
>>> Then, if we legitimately need to fragment in __tcp_retransmit_skb() we
>>> won't be able to do so. So we will never retransmit. And if no ACK
>>> comes back in to make some room we are stuck, no?
>> 
>> We seem to be having exactly this problem. We’re running on the 4.14 branch. After recently updating our kernel, we’ve been having a problem with TCP connections stalling / dying off without disconnecting. They're stuck and never recover.
>> 
>> I bisected the problem to 4.14.127 commit 9daf226ff92679d09aeca1b5c1240e3607153336 (commit f070ef2ac66716357066b683fb0baf55f8191a2e upstream): tcp: tcp_fragment() should apply sane memory limits. That lead me to this thread.
>> 
>> Our environment is a supercomputing center: lots of servers interconnected with a non-blocking 10Gbit ethernet network. We’ve zeroed in on the problem in two situations: remote users on VPN accessing large files via samba and compute jobs using Intel MPI over TCP/IP/ethernet. It certainly affects other situations, many of our workloads have been unstable since this patch went into production, but those are the two we clearly identified as they fail reliably every time. We had to take the system down for unscheduled maintenance to roll back to an older kernel.
>> 
>> The TCPWqueueTooBig count is incrementing when the problem occurs.
>> 
>> Using ftrace/trace-cmd on an affected process, it appears the call stack is:
>> run_timer_softirq
>> expire_timers
>> call_timer_fn
>> tcp_write_timer
>> tcp_write_timer_handler
>> tcp_retransmit_timer
>> tcp_retransmit_skb
>> __tcp_retransmit_skb
>> tcp_fragment
>> 
>> Andrew Prout
>> MIT Lincoln Laboratory Supercomputing Center
>> 
>
> What was the kernel version you used exactly ?
>
> This problem is supposed to be fixed in v4.14.131

Our initial rollout was v4.14.130, but I reproduced it with v4.14.132 as well, reliably for the samba test and once (not reliably) with synthetic test I was trying. A patched v4.14.132 with this patch partially reverted (just the four lines from tcp_fragment deleted) passed the samba test.

The synthetic test was a pair of simple send/recv test programs under the following conditions:
-The send socket was non-blocking
-SO_SNDBUF set to 128KiB
-The receiver NIC was being flooded with traffic from multiple hosts (to induce packet loss/retransmits)
-Load was on both systems: a while(1) program spinning on each CPU core
-The receiver was on an older unaffected kernel


^ permalink raw reply

* Re: [PATCH v6 0/5] net: macb: cover letter
From: David Miller @ 2019-07-10 18:47 UTC (permalink / raw)
  To: pthombar
  Cc: andrew, nicolas.ferre, f.fainelli, linux, netdev, hkallweit1,
	linux-kernel, rafalc, piotrs, aniljoy, arthurm, stevenh, mparab
In-Reply-To: <1562769391-31803-1-git-send-email-pthombar@cadence.com>


As announced clearly yesterday on this list, the net-next tree is closed.

Please resubmit these changes when the tree opens back up again.

Thank you.

^ permalink raw reply

* Re: [PATCH v2 bpf] selftests/bpf: fix bpf_target_sparc check
From: David Miller @ 2019-07-10 18:42 UTC (permalink / raw)
  To: iii; +Cc: bpf, netdev, andrii.nakryiko
In-Reply-To: <20190710115654.44841-1-iii@linux.ibm.com>

From: Ilya Leoshkevich <iii@linux.ibm.com>
Date: Wed, 10 Jul 2019 13:56:54 +0200

> bpf_helpers.h fails to compile on sparc: the code should be checking
> for defined(bpf_target_sparc), but checks simply for bpf_target_sparc.
> 
> Also change #ifdef bpf_target_powerpc to #if defined() for consistency.
> 
> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply

* RE: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
From: Prout, Andrew - LLSC - MITLL @ 2019-07-10 18:23 UTC (permalink / raw)
  To: Christoph Paasch, Eric Dumazet
  Cc: David S . Miller, netdev, Greg Kroah-Hartman, Jonathan Looney,
	Neal Cardwell, Tyler Hicks, Yuchung Cheng, Bruce Curtis,
	Jonathan Lemon, Dustin Marquess
In-Reply-To: <CALMXkpZ4isoXpFp_5=nVUcWrt5TofYVhpdAjv7LkCH7RFW1tYw@mail.gmail.com>

On 6/17/19 8:19 PM, Christoph Paasch wrote:
> 
> Yes, this does the trick for my packetdrill-test.
> 
> I wonder, is there a way we could end up in a situation where we can't
> retransmit anymore?
> For example, sk_wmem_queued has grown so much that the new test fails.
> Then, if we legitimately need to fragment in __tcp_retransmit_skb() we
> won't be able to do so. So we will never retransmit. And if no ACK
> comes back in to make some room we are stuck, no?

We seem to be having exactly this problem. We’re running on the 4.14 branch. After recently updating our kernel, we’ve been having a problem with TCP connections stalling / dying off without disconnecting. They're stuck and never recover.

I bisected the problem to 4.14.127 commit 9daf226ff92679d09aeca1b5c1240e3607153336 (commit f070ef2ac66716357066b683fb0baf55f8191a2e upstream): tcp: tcp_fragment() should apply sane memory limits. That lead me to this thread.

Our environment is a supercomputing center: lots of servers interconnected with a non-blocking 10Gbit ethernet network. We’ve zeroed in on the problem in two situations: remote users on VPN accessing large files via samba and compute jobs using Intel MPI over TCP/IP/ethernet. It certainly affects other situations, many of our workloads have been unstable since this patch went into production, but those are the two we clearly identified as they fail reliably every time. We had to take the system down for unscheduled maintenance to roll back to an older kernel.

The TCPWqueueTooBig count is incrementing when the problem occurs.

Using ftrace/trace-cmd on an affected process, it appears the call stack is:
run_timer_softirq
expire_timers
call_timer_fn
tcp_write_timer
tcp_write_timer_handler
tcp_retransmit_timer
tcp_retransmit_skb
__tcp_retransmit_skb
tcp_fragment

Andrew Prout
MIT Lincoln Laboratory Supercomputing Center

^ permalink raw reply

* Re: [GIT PULL] Keys: Set 4 - Key ACLs for 5.3
From: Linus Torvalds @ 2019-07-10 18:35 UTC (permalink / raw)
  To: David Howells
  Cc: James Morris James Morris, keyrings, Netdev, linux-nfs, CIFS,
	linux-afs, linux-fsdevel, linux-integrity, LSM List,
	Linux List Kernel Mailing
In-Reply-To: <28477.1562362239@warthog.procyon.org.uk>

On Fri, Jul 5, 2019 at 2:30 PM David Howells <dhowells@redhat.com> wrote:
>
> Here's my fourth block of keyrings changes for the next merge window.  They
> change the permissions model used by keys and keyrings to be based on an
> internal ACL by the following means:

It turns out that this is broken, and I'll probably have to revert the
merge entirely.

With this merge in place, I can't boot any of the machines that have
an encrypted disk setup. The boot just stops at

  systemd[1]: Started Forward Password Requests to Plymouth Directory Watch.
  systemd[1]: Reached target Paths.

and never gets any further. I never get the prompt for a passphrase
for the disk encryption.

Apparently not a lot of developers are using encrypted volumes for
their development machines.

I'm not sure if the only requirement is an encrypted volume, or if
this is also particular to a F30 install in case you need to be able
to reproduce. But considering that you have a redhat email address,
I'm sure you can find a F30 install somewhere with an encrypted disk.

David, if you can fix this quickly, I'll hold off on the revert of it
all, but I can wait only so long. I've stopped merging stuff since I
noticed my machines don't work (this merge window has not been
pleasant so far - in addition to this issue I had another entirely
unrelated boot failure which made bisecting this one even more fun).

So if I don't see a quick fix, I'll just revert in order to then
continue to do pull requests later today. Because I do not want to do
further pulls with something that I can't boot as a base.

                 Linus

^ permalink raw reply

* [PATCH net-next 2/2] nfp: flower: ensure ip protocol is specified for L4 matches
From: John Hurley @ 2019-07-10 18:30 UTC (permalink / raw)
  To: netdev; +Cc: davem, simon.horman, jakub.kicinski, oss-drivers, John Hurley
In-Reply-To: <1562783430-7031-1-git-send-email-john.hurley@netronome.com>

Flower rules on the NFP firmware are able to match on an IP protocol
field. When parsing rules in the driver, unknown IP protocols are only
rejected when further matches are to be carried out on layer 4 fields, as
the firmware will not be able to extract such fields from packets.

L4 protocol dissectors such as FLOW_DISSECTOR_KEY_PORTS are only parsed if
an IP protocol is specified. This leaves a loophole whereby a rule that
attempts to match on transport layer information such as port numbers but
does not explicitly give an IP protocol type can be incorrectly offloaded
(in this case with wildcard port numbers matches).

Fix this by rejecting the offload of flows that attempt to match on L4
information, not only when matching on an unknown IP protocol type, but
also when the protocol is wildcarded.

Fixes: 2a04784594f6 ("nfp: flower: check L4 matches on unknown IP protocols")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/flower/offload.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 885f968..faa8ba0 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -386,18 +386,15 @@ nfp_flower_calculate_key_layers(struct nfp_app *app,
 			key_layer |= NFP_FLOWER_LAYER_TP;
 			key_size += sizeof(struct nfp_flower_tp_ports);
 			break;
-		default:
-			/* Other ip proto - we need check the masks for the
-			 * remainder of the key to ensure we can offload.
-			 */
-			if (nfp_flower_check_higher_than_l3(flow)) {
-				NL_SET_ERR_MSG_MOD(extack, "unsupported offload: unknown IP protocol with L4 matches not supported");
-				return -EOPNOTSUPP;
-			}
-			break;
 		}
 	}

+	if (!(key_layer & NFP_FLOWER_LAYER_TP) &&
+	    nfp_flower_check_higher_than_l3(flow)) {
+		NL_SET_ERR_MSG_MOD(extack, "unsupported offload: cannot match on L4 information without specified IP protocol type");
+		return -EOPNOTSUPP;
+	}
+
 	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_TCP)) {
 		struct flow_match_tcp tcp;
 		u32 tcp_flags;
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 1/2] nfp: flower: fix ethernet check on match fields
From: John Hurley @ 2019-07-10 18:30 UTC (permalink / raw)
  To: netdev; +Cc: davem, simon.horman, jakub.kicinski, oss-drivers, John Hurley
In-Reply-To: <1562783430-7031-1-git-send-email-john.hurley@netronome.com>

NFP firmware does not explicitly match on an ethernet type field. Rather,
each rule has a bitmask of match fields that can be used to infer the
ethernet type.

Currently, if a flower rule contains an unknown ethernet type, a check is
carried out for matches on other fields of the packet. If matches on
layer 3 or 4 are found, then the offload is rejected as firmware will not
be able to extract these fields from a packet with an ethernet type it
does not currently understand.

However, if a rule contains an unknown ethernet type without any L3 (or
above) matches then this will effectively be offloaded as a rule with a
wildcarded ethertype. This can lead to misclassifications on the firmware.

Fix this issue by rejecting all flower rules that specify a match on an
unknown ethernet type.

Further ensure correct offloads by moving the 'L3 and above' check to any
rule that does not specify an ethernet type and rejecting rules with
further matches. This means that we can still offload rules with a
wildcarded ethertype if they only match on L2 fields but will prevent
rules which match on further fields that we cannot be sure if the firmware
will be able to extract.

Fixes: af9d842c1354 ("nfp: extend flower add flow offload")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/flower/offload.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 7e725fa..885f968 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -368,15 +368,12 @@ nfp_flower_calculate_key_layers(struct nfp_app *app,
 			break;

 		default:
-			/* Other ethtype - we need check the masks for the
-			 * remainder of the key to ensure we can offload.
-			 */
-			if (nfp_flower_check_higher_than_mac(flow)) {
-				NL_SET_ERR_MSG_MOD(extack, "unsupported offload: non IPv4/IPv6 offload with L3/L4 matches not supported");
-				return -EOPNOTSUPP;
-			}
-			break;
+			NL_SET_ERR_MSG_MOD(extack, "unsupported offload: match on given EtherType is not supported");
+			return -EOPNOTSUPP;
 		}
+	} else if (nfp_flower_check_higher_than_mac(flow)) {
+		NL_SET_ERR_MSG_MOD(extack, "unsupported offload: cannot match above L2 without specified EtherType");
+		return -EOPNOTSUPP;
 	}

 	if (basic.mask && basic.mask->ip_proto) {
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 0/2] Fix bugs in NFP flower match offload
From: John Hurley @ 2019-07-10 18:30 UTC (permalink / raw)
  To: netdev; +Cc: davem, simon.horman, jakub.kicinski, oss-drivers, John Hurley

This patchset contains bug fixes for corner cases in the match fields of
flower offloads. The patches ensure that flows that should not be
supported are not (incorrectly) offloaded. These include rules that match
on layer 3 and/or 4 data without specified ethernet or ip protocol fields.

John Hurley (2):
  nfp: flower: fix ethernet check on match fields
  nfp: flower: ensure ip protocol is specified for L4 matches

 .../net/ethernet/netronome/nfp/flower/offload.c    | 28 +++++++++-------------
 1 file changed, 11 insertions(+), 17 deletions(-)

-- 
2.7.4

^ permalink raw reply

* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
From: Eric Dumazet @ 2019-07-10 18:28 UTC (permalink / raw)
  To: Prout, Andrew - LLSC - MITLL, Christoph Paasch
  Cc: David S . Miller, netdev, Greg Kroah-Hartman, Jonathan Looney,
	Neal Cardwell, Tyler Hicks, Yuchung Cheng, Bruce Curtis,
	Jonathan Lemon, Dustin Marquess
In-Reply-To: <63cd99ed3d0c440185ebec3ad12327fc@ll.mit.edu>



On 7/10/19 8:23 PM, Prout, Andrew - LLSC - MITLL wrote:
> On 6/17/19 8:19 PM, Christoph Paasch wrote:
>>
>> Yes, this does the trick for my packetdrill-test.
>>
>> I wonder, is there a way we could end up in a situation where we can't
>> retransmit anymore?
>> For example, sk_wmem_queued has grown so much that the new test fails.
>> Then, if we legitimately need to fragment in __tcp_retransmit_skb() we
>> won't be able to do so. So we will never retransmit. And if no ACK
>> comes back in to make some room we are stuck, no?
> 
> We seem to be having exactly this problem. We’re running on the 4.14 branch. After recently updating our kernel, we’ve been having a problem with TCP connections stalling / dying off without disconnecting. They're stuck and never recover.
> 
> I bisected the problem to 4.14.127 commit 9daf226ff92679d09aeca1b5c1240e3607153336 (commit f070ef2ac66716357066b683fb0baf55f8191a2e upstream): tcp: tcp_fragment() should apply sane memory limits. That lead me to this thread.
> 
> Our environment is a supercomputing center: lots of servers interconnected with a non-blocking 10Gbit ethernet network. We’ve zeroed in on the problem in two situations: remote users on VPN accessing large files via samba and compute jobs using Intel MPI over TCP/IP/ethernet. It certainly affects other situations, many of our workloads have been unstable since this patch went into production, but those are the two we clearly identified as they fail reliably every time. We had to take the system down for unscheduled maintenance to roll back to an older kernel.
> 
> The TCPWqueueTooBig count is incrementing when the problem occurs.
> 
> Using ftrace/trace-cmd on an affected process, it appears the call stack is:
> run_timer_softirq
> expire_timers
> call_timer_fn
> tcp_write_timer
> tcp_write_timer_handler
> tcp_retransmit_timer
> tcp_retransmit_skb
> __tcp_retransmit_skb
> tcp_fragment
> 
> Andrew Prout
> MIT Lincoln Laboratory Supercomputing Center
> 

What was the kernel version you used exactly ?

This problem is supposed to be fixed in v4.14.131


^ permalink raw reply

* [PATCH net-next] net/mlx5e: Provide cb_list pointer when setting up tc block on rep
From: Vlad Buslov @ 2019-07-10 18:25 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, pablo, saeedm, Vlad Buslov

Recent refactoring of tc block offloads infrastructure introduced new
flow_block_cb_setup_simple() method intended to be used as unified way for
all drivers to register offload callbacks. However, commit that actually
extended all users (drivers) with block cb list and provided it to
flow_block infra missed mlx5 en_rep. This leads to following NULL-pointer
dereference when creating Qdisc:

[  278.385175] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  278.393233] #PF: supervisor read access in kernel mode
[  278.399446] #PF: error_code(0x0000) - not-present page
[  278.405847] PGD 8000000850e73067 P4D 8000000850e73067 PUD 8620cd067 PMD 0
[  278.414141] Oops: 0000 [#1] SMP PTI
[  278.419019] CPU: 7 PID: 3369 Comm: tc Not tainted 5.2.0-rc6+ #492
[  278.426580] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
[  278.435853] RIP: 0010:flow_block_cb_setup_simple+0xc4/0x190
[  278.442953] Code: 10 48 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 49 89 00 48 05 00 01 00 00 49 89 40 08 31 c0 c3 b8 a1 ff ff ff c3 f3 c3 <48> 8b 06 48 39 c6 75 0a eb 1a 48 8b 00 48 39 c6 74 12
 48 3b 50 28
[  278.464829] RSP: 0018:ffffaf07c3f97990 EFLAGS: 00010246
[  278.471648] RAX: 0000000000000000 RBX: ffff9b43ed4c7680 RCX: ffff9b43d5f80840
[  278.480408] RDX: ffffffffc0491650 RSI: 0000000000000000 RDI: ffffaf07c3f97998
[  278.489110] RBP: ffff9b43ddff9000 R08: ffff9b43d5f80840 R09: 0000000000000001
[  278.497838] R10: 0000000000000009 R11: 00000000000003ad R12: ffffaf07c3f97c08
[  278.506595] R13: ffff9b43d5f80000 R14: ffff9b43ed4c7680 R15: ffff9b43dfa20b40
[  278.515374] FS:  00007f796be1b400(0000) GS:ffff9b43ef840000(0000) knlGS:0000000000000000
[  278.525099] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  278.532453] CR2: 0000000000000000 CR3: 0000000840398002 CR4: 00000000001606e0
[  278.541197] Call Trace:
[  278.545252]  tcf_block_offload_cmd.isra.52+0x7e/0xb0
[  278.551871]  tcf_block_get_ext+0x365/0x3e0
[  278.557569]  qdisc_create+0x15c/0x4e0
[  278.562859]  ? kmem_cache_alloc_trace+0x1a2/0x1c0
[  278.569235]  tc_modify_qdisc+0x1c8/0x780
[  278.574761]  rtnetlink_rcv_msg+0x291/0x340
[  278.580518]  ? _cond_resched+0x15/0x40
[  278.585856]  ? rtnl_calcit.isra.29+0x120/0x120
[  278.591868]  netlink_rcv_skb+0x4a/0x110
[  278.597198]  netlink_unicast+0x1a0/0x250
[  278.602601]  netlink_sendmsg+0x2c1/0x3c0
[  278.608022]  sock_sendmsg+0x5b/0x60
[  278.612969]  ___sys_sendmsg+0x289/0x310
[  278.618231]  ? do_wp_page+0x99/0x730
[  278.623216]  ? page_add_new_anon_rmap+0xbe/0x140
[  278.629298]  ? __handle_mm_fault+0xc84/0x1360
[  278.635113]  ? __sys_sendmsg+0x5e/0xa0
[  278.640285]  __sys_sendmsg+0x5e/0xa0
[  278.645239]  do_syscall_64+0x5b/0x1b0
[  278.650274]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  278.656697] RIP: 0033:0x7f796abdeb87
[  278.661628] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 80 00 00 00 00 8b 05 6a 2b 2c 00 48 63 d2 48 63 ff 85 c0 75 18 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 59 f3 c3 0f 1f 80 00 00 00 00 53
 48 89 f3 48
[  278.683248] RSP: 002b:00007ffde213ba48 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[  278.692245] RAX: ffffffffffffffda RBX: 000000005d261e6f RCX: 00007f796abdeb87
[  278.700862] RDX: 0000000000000000 RSI: 00007ffde213bab0 RDI: 0000000000000003
[  278.709527] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000006
[  278.718167] R10: 000000000000000c R11: 0000000000000246 R12: 0000000000000001
[  278.726743] R13: 000000000067b580 R14: 0000000000000000 R15: 0000000000000000
[  278.735302] Modules linked in: dummy vxlan ip6_udp_tunnel udp_tunnel sch_ingress nfsv3 nfs_acl nfs lockd grace fscache bridge stp llc sunrpc mlx5_ib ib_uverbs intel_rapl ib_core sb_edac x86_pkg_temp_
thermal intel_powerclamp coretemp kvm_intel kvm mlx5_core irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel igb ghash_clmulni_intel ses mei_me enclosure mlxfw ipmi_ssif intel_cstate iTCO_wdt ptp mei
pps_core iTCO_vendor_support pcspkr joydev intel_uncore i2c_i801 ipmi_si lpc_ich intel_rapl_perf ioatdma wmi dca pcc_cpufreq ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad ast i2c_algo_bit drm_k
ms_helper ttm drm mpt3sas raid_class scsi_transport_sas
[  278.802263] CR2: 0000000000000000
[  278.807170] ---[ end trace b1f0a442a279e66f ]---

Extend en_rep with new static mlx5e_rep_block_cb_list list and pass it to
flow_block_cb_setup_simple() function instead of hardcoded NULL pointer.

Fixes: 955bcb6ea0df ("drivers: net: use flow block API")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 10ef90a7bddd..7245d287633d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -1175,6 +1175,8 @@ static int mlx5e_rep_setup_tc_cb(enum tc_setup_type type, void *type_data,
 	}
 }
 
+static LIST_HEAD(mlx5e_rep_block_cb_list);
+
 static int mlx5e_rep_setup_tc(struct net_device *dev, enum tc_setup_type type,
 			      void *type_data)
 {
@@ -1182,7 +1184,8 @@ static int mlx5e_rep_setup_tc(struct net_device *dev, enum tc_setup_type type,
 
 	switch (type) {
 	case TC_SETUP_BLOCK:
-		return flow_block_cb_setup_simple(type_data, NULL,
+		return flow_block_cb_setup_simple(type_data,
+						  &mlx5e_rep_block_cb_list,
 						  mlx5e_rep_setup_tc_cb,
 						  priv, priv, true);
 	default:
-- 
2.21.0


^ permalink raw reply related

* [PATCH bpf-next v9 2/2] selftests/bpf: Add selftests for bpf_perf_event_output
From: Allan Zhang @ 2019-07-10 18:18 UTC (permalink / raw)
  To: netdev, bpf, songliubraving, daniel, andrii.nakryiko; +Cc: ast, allanzhang
In-Reply-To: <20190710181811.127374-1-allanzhang@google.com>

From: allanzhang <allanzhang@google.com>

Software event output is only enabled by a few prog types.
This test is to ensure that all supported types are enabled for
bpf_perf_event_output successfully.

Signed-off-by: Allan Zhang <allanzhang@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
---
 tools/testing/selftests/bpf/test_verifier.c   | 12 ++-
 .../selftests/bpf/verifier/event_output.c     | 94 +++++++++++++++++++
 2 files changed, 105 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/verifier/event_output.c

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index c5514daf8865..5e98d7c37eb7 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -50,7 +50,7 @@
 #define MAX_INSNS	BPF_MAXINSNS
 #define MAX_TEST_INSNS	1000000
 #define MAX_FIXUPS	8
-#define MAX_NR_MAPS	18
+#define MAX_NR_MAPS	19
 #define MAX_TEST_RUNS	8
 #define POINTER_VALUE	0xcafe4all
 #define TEST_DATA_LEN	64
@@ -84,6 +84,7 @@ struct bpf_test {
 	int fixup_map_array_wo[MAX_FIXUPS];
 	int fixup_map_array_small[MAX_FIXUPS];
 	int fixup_sk_storage_map[MAX_FIXUPS];
+	int fixup_map_event_output[MAX_FIXUPS];
 	const char *errstr;
 	const char *errstr_unpriv;
 	uint32_t retval, retval_unpriv, insn_processed;
@@ -627,6 +628,7 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
 	int *fixup_map_array_wo = test->fixup_map_array_wo;
 	int *fixup_map_array_small = test->fixup_map_array_small;
 	int *fixup_sk_storage_map = test->fixup_sk_storage_map;
+	int *fixup_map_event_output = test->fixup_map_event_output;
 
 	if (test->fill_helper) {
 		test->fill_insns = calloc(MAX_TEST_INSNS, sizeof(struct bpf_insn));
@@ -788,6 +790,14 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
 			fixup_sk_storage_map++;
 		} while (*fixup_sk_storage_map);
 	}
+	if (*fixup_map_event_output) {
+		map_fds[18] = __create_map(BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+					   sizeof(int), sizeof(int), 1, 0);
+		do {
+			prog[*fixup_map_event_output].imm = map_fds[18];
+			fixup_map_event_output++;
+		} while (*fixup_map_event_output);
+	}
 }
 
 static int set_admin(bool admin)
diff --git a/tools/testing/selftests/bpf/verifier/event_output.c b/tools/testing/selftests/bpf/verifier/event_output.c
new file mode 100644
index 000000000000..130553e19eca
--- /dev/null
+++ b/tools/testing/selftests/bpf/verifier/event_output.c
@@ -0,0 +1,94 @@
+/* instructions used to output a skb based software event, produced
+ * from code snippet:
+ * struct TMP {
+ *  uint64_t tmp;
+ * } tt;
+ * tt.tmp = 5;
+ * bpf_perf_event_output(skb, &connection_tracking_event_map, 0,
+ *			 &tt, sizeof(tt));
+ * return 1;
+ *
+ * the bpf assembly from llvm is:
+ *        0:       b7 02 00 00 05 00 00 00         r2 = 5
+ *        1:       7b 2a f8 ff 00 00 00 00         *(u64 *)(r10 - 8) = r2
+ *        2:       bf a4 00 00 00 00 00 00         r4 = r10
+ *        3:       07 04 00 00 f8 ff ff ff         r4 += -8
+ *        4:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00    r2 = 0ll
+ *        6:       b7 03 00 00 00 00 00 00         r3 = 0
+ *        7:       b7 05 00 00 08 00 00 00         r5 = 8
+ *        8:       85 00 00 00 19 00 00 00         call 25
+ *        9:       b7 00 00 00 01 00 00 00         r0 = 1
+ *       10:       95 00 00 00 00 00 00 00         exit
+ *
+ *     The reason I put the code here instead of fill_helpers is that map fixup
+ *     is against the insns, instead of filled prog.
+ */
+
+#define __PERF_EVENT_INSNS__					\
+	BPF_MOV64_IMM(BPF_REG_2, 5),				\
+	BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -8),		\
+	BPF_MOV64_REG(BPF_REG_4, BPF_REG_10),			\
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, -8),			\
+	BPF_LD_MAP_FD(BPF_REG_2, 0),				\
+	BPF_MOV64_IMM(BPF_REG_3, 0),				\
+	BPF_MOV64_IMM(BPF_REG_5, 8),				\
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,		\
+		     BPF_FUNC_perf_event_output),		\
+	BPF_MOV64_IMM(BPF_REG_0, 1),				\
+	BPF_EXIT_INSN(),
+{
+	"perfevent for sockops",
+	.insns = { __PERF_EVENT_INSNS__ },
+	.prog_type = BPF_PROG_TYPE_SOCK_OPS,
+	.fixup_map_event_output = { 4 },
+	.result = ACCEPT,
+	.retval = 1,
+},
+{
+	"perfevent for tc",
+	.insns =  { __PERF_EVENT_INSNS__ },
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.fixup_map_event_output = { 4 },
+	.result = ACCEPT,
+	.retval = 1,
+},
+{
+	"perfevent for lwt out",
+	.insns =  { __PERF_EVENT_INSNS__ },
+	.prog_type = BPF_PROG_TYPE_LWT_OUT,
+	.fixup_map_event_output = { 4 },
+	.result = ACCEPT,
+	.retval = 1,
+},
+{
+	"perfevent for xdp",
+	.insns =  { __PERF_EVENT_INSNS__ },
+	.prog_type = BPF_PROG_TYPE_XDP,
+	.fixup_map_event_output = { 4 },
+	.result = ACCEPT,
+	.retval = 1,
+},
+{
+	"perfevent for socket filter",
+	.insns =  { __PERF_EVENT_INSNS__ },
+	.prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
+	.fixup_map_event_output = { 4 },
+	.result = ACCEPT,
+	.retval = 1,
+},
+{
+	"perfevent for sk_skb",
+	.insns =  { __PERF_EVENT_INSNS__ },
+	.prog_type = BPF_PROG_TYPE_SK_SKB,
+	.fixup_map_event_output = { 4 },
+	.result = ACCEPT,
+	.retval = 1,
+},
+{
+	"perfevent for cgroup skb",
+	.insns =  { __PERF_EVENT_INSNS__ },
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.fixup_map_event_output = { 4 },
+	.result = ACCEPT,
+	.retval = 1,
+},
-- 
2.22.0.410.gd8fdbe21b5-goog


^ permalink raw reply related

* [PATCH bpf-next v9 1/2] bpf: Allow bpf_skb_event_output for a few prog types
From: Allan Zhang @ 2019-07-10 18:18 UTC (permalink / raw)
  To: netdev, bpf, songliubraving, daniel, andrii.nakryiko; +Cc: ast, allanzhang
In-Reply-To: <20190710181811.127374-1-allanzhang@google.com>

From: allanzhang <allanzhang@google.com>

Software event output is only enabled by a few prog types right now (TC,
LWT out, XDP, sockops). Many other skb based prog types need
bpf_skb_event_output to produce software event.

Added socket_filter, cg_skb, sk_skb prog types to generate sw event.

Test bpf code is generated from code snippet:

struct TMP {
    uint64_t tmp;
} tt;
tt.tmp = 5;
bpf_perf_event_output(skb, &connection_tracking_event_map, 0,
                      &tt, sizeof(tt));
return 1;

the bpf assembly from llvm is:
       0:       b7 02 00 00 05 00 00 00         r2 = 5
       1:       7b 2a f8 ff 00 00 00 00         *(u64 *)(r10 - 8) = r2
       2:       bf a4 00 00 00 00 00 00         r4 = r10
       3:       07 04 00 00 f8 ff ff ff         r4 += -8
       4:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00    r2 = 0ll
       6:       b7 03 00 00 00 00 00 00         r3 = 0
       7:       b7 05 00 00 08 00 00 00         r5 = 8
       8:       85 00 00 00 19 00 00 00         call 25
       9:       b7 00 00 00 01 00 00 00         r0 = 1
      10:       95 00 00 00 00 00 00 00         exit

Signed-off-by: Allan Zhang <allanzhang@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
---
 net/core/filter.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 2014d76e0d2a..b75fcf412628 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5958,6 +5958,8 @@ sk_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_socket_cookie_proto;
 	case BPF_FUNC_get_socket_uid:
 		return &bpf_get_socket_uid_proto;
+	case BPF_FUNC_perf_event_output:
+		return &bpf_skb_event_output_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
@@ -5978,6 +5980,8 @@ cg_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_sk_storage_get_proto;
 	case BPF_FUNC_sk_storage_delete:
 		return &bpf_sk_storage_delete_proto;
+	case BPF_FUNC_perf_event_output:
+		return &bpf_skb_event_output_proto;
 #ifdef CONFIG_SOCK_CGROUP_DATA
 	case BPF_FUNC_skb_cgroup_id:
 		return &bpf_skb_cgroup_id_proto;
@@ -6226,6 +6230,8 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_sk_redirect_map_proto;
 	case BPF_FUNC_sk_redirect_hash:
 		return &bpf_sk_redirect_hash_proto;
+	case BPF_FUNC_perf_event_output:
+		return &bpf_skb_event_output_proto;
 #ifdef CONFIG_INET
 	case BPF_FUNC_sk_lookup_tcp:
 		return &bpf_sk_lookup_tcp_proto;
-- 
2.22.0.410.gd8fdbe21b5-goog


^ permalink raw reply related

* [PATCH bpf-next v9 0/2] bpf: Allow bpf_skb_event_output for more prog types
From: Allan Zhang @ 2019-07-10 18:18 UTC (permalink / raw)
  To: netdev, bpf, songliubraving, daniel, andrii.nakryiko; +Cc: ast, Allan Zhang

Software event output is only enabled by a few prog types right now (TC,
LWT out, XDP, sockops). Many other skb based prog types need
bpf_skb_event_output to produce software event.

More prog types are enabled to access bpf_skb_event_output in this
patch.

v9 changes:
add "Acked-by" field.

v8 changes:
No actual change, just cc to netdev@vger.kernel.org and
bpf@vger.kernel.org.
v7 patches are acked by Song Liu.

v7 changes:
Reformat from hints by scripts/checkpatch.pl, including Song's comment
on signed-off-by name to captical case in cover letter.
3 of hints are ignored:
1. new file mode.
2. SPDX-License-Identifier for event_output.c since all files under
   this dir have no such line.
3. "Macros ... enclosed in parentheses" for macro in event_output.c
   due to code's nature.

Change patch 02 subject "bpf:..." to "selftests/bpf:..."

v6 changes:
Fix Signed-off-by, fix fixup map creation.

v5 changes:
Fix typos, reformat comments in event_output.c, move revision history to
cover letter.

v4 changes:
Reformating log message.

v3 changes:
Reformating log message.

v2 changes:
Reformating log message.

Allan Zhang (2):
  bpf: Allow bpf_skb_event_output for a few prog types
  selftests/bpf: Add selftests for bpf_perf_event_output

 net/core/filter.c                             |  6 ++
 tools/testing/selftests/bpf/test_verifier.c   | 12 ++-
 .../selftests/bpf/verifier/event_output.c     | 94 +++++++++++++++++++
 3 files changed, 111 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/verifier/event_output.c

-- 
2.22.0.410.gd8fdbe21b5-goog

^ permalink raw reply

* Re: [PATCH V2 1/1 (was 0/1 by accident)] tools/dtrace: initial implementation of DTrace
From: Kris Van Hees @ 2019-07-10 18:12 UTC (permalink / raw)
  To: Kris Van Hees
  Cc: netdev, bpf, dtrace-devel, linux-kernel, rostedt, mhiramat, acme,
	ast, daniel, Peter Zijlstra, Chris Mason
In-Reply-To: <201907101542.x6AFgOO9012232@userv0121.oracle.com>

This patch's subject should of course be [PATCH V2 1/1] rather than 0/1.
Sorry about that.

On Wed, Jul 10, 2019 at 08:42:24AM -0700, Kris Van Hees wrote:
> This initial implementation of a tiny subset of DTrace functionality
> provides the following options:
> 
> 	dtrace [-lvV] [-b bufsz] -s script
> 	    -b  set trace buffer size
> 	    -l  list probes (only works with '-s script' for now)
> 	    -s  enable or list probes for the specified BPF program
> 	    -V  report DTrace API version
> 
> The patch comprises quite a bit of code due to DTrace requiring a few
> crucial components, even in its most basic form.
> 
> The code is structured around the command line interface implemented in
> dtrace.c.  It provides option parsing and drives the three modes of
> operation that are currently implemented:
> 
> 1. Report DTrace API version information.
> 	Report the version information and terminate.
> 
> 2. List probes in BPF programs.
> 	Initialize the list of probes that DTrace recognizes, load BPF
> 	programs, parse all BPF ELF section names, resolve them into
> 	known probes, and emit the probe names.  Then terminate.
> 
> 3. Load BPF programs and collect tracing data.
> 	Initialize the list of probes that DTrace recognizes, load BPF
> 	programs and attach them to their corresponding probes, set up
> 	perf event output buffers, and start processing tracing data.
> 
> This implementation makes extensive use of BPF (handled by dt_bpf.c) and
> the perf event output ring buffer (handled by dt_buffer.c).  DTrace-style
> probe handling (dt_probe.c) offers an interface to probes that hides the
> implementation details of the individual probe types by provider (dt_fbt.c
> and dt_syscall.c).  Probe lookup by name uses a hashtable implementation
> (dt_hash.c).  The dt_utils.c code populates a list of online CPU ids, so
> we know what CPUs we can obtain tracing data from.
> 
> Building the tool is trivial because its only dependency (libbpf) is in
> the kernel tree under tools/lib/bpf.  A simple 'make' in the tools/dtrace
> directory suffices.
> 
> The 'dtrace' executable needs to run as root because BPF programs cannot
> be loaded by non-root users.
> 
> Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
> Reviewed-by: David Mc Lean <david.mclean@oracle.com>
> Reviewed-by: Eugene Loh <eugene.loh@oracle.com>
> ---
> Changes in v2:
>         - Use ring_buffer_read_head() and ring_buffer_write_tail() to
>           avoid use of volatile.
>         - Handle perf events that wrap around the ring buffer boundary.
>         - Remove unnecessary PERF_EVENT_IOC_ENABLE.
>         - Remove -I$(srctree)/tools/perf from KBUILD_HOSTCFLAGS since it
>           is not actually used.
>         - Use PT_REGS_PARM1(x), etc instead of my own macros.  Adding 
>           PT_REGS_PARM6(x) in bpf_sample.c because we need to be able to
>           support up to 6 arguments passed by registers.
> ---
>  MAINTAINERS                |   6 +
>  tools/dtrace/Makefile      |  87 ++++++++++
>  tools/dtrace/bpf_sample.c  | 146 ++++++++++++++++
>  tools/dtrace/dt_bpf.c      | 185 ++++++++++++++++++++
>  tools/dtrace/dt_buffer.c   | 338 +++++++++++++++++++++++++++++++++++++
>  tools/dtrace/dt_fbt.c      | 201 ++++++++++++++++++++++
>  tools/dtrace/dt_hash.c     | 211 +++++++++++++++++++++++
>  tools/dtrace/dt_probe.c    | 230 +++++++++++++++++++++++++
>  tools/dtrace/dt_syscall.c  | 179 ++++++++++++++++++++
>  tools/dtrace/dt_utils.c    | 132 +++++++++++++++
>  tools/dtrace/dtrace.c      | 249 +++++++++++++++++++++++++++
>  tools/dtrace/dtrace.h      |  13 ++
>  tools/dtrace/dtrace_impl.h | 101 +++++++++++
>  13 files changed, 2078 insertions(+)
>  create mode 100644 tools/dtrace/Makefile
>  create mode 100644 tools/dtrace/bpf_sample.c
>  create mode 100644 tools/dtrace/dt_bpf.c
>  create mode 100644 tools/dtrace/dt_buffer.c
>  create mode 100644 tools/dtrace/dt_fbt.c
>  create mode 100644 tools/dtrace/dt_hash.c
>  create mode 100644 tools/dtrace/dt_probe.c
>  create mode 100644 tools/dtrace/dt_syscall.c
>  create mode 100644 tools/dtrace/dt_utils.c
>  create mode 100644 tools/dtrace/dtrace.c
>  create mode 100644 tools/dtrace/dtrace.h
>  create mode 100644 tools/dtrace/dtrace_impl.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index cfa9ed89c031..410240732d55 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5485,6 +5485,12 @@ W:	https://linuxtv.org
>  S:	Odd Fixes
>  F:	drivers/media/pci/dt3155/
>  
> +DTRACE
> +M:	Kris Van Hees <kris.van.hees@oracle.com>
> +L:	dtrace-devel@oss.oracle.com
> +S:	Maintained
> +F:	tools/dtrace/
> +
>  DVB_USB_AF9015 MEDIA DRIVER
>  M:	Antti Palosaari <crope@iki.fi>
>  L:	linux-media@vger.kernel.org
> diff --git a/tools/dtrace/Makefile b/tools/dtrace/Makefile
> new file mode 100644
> index 000000000000..03ae498d1429
> --- /dev/null
> +++ b/tools/dtrace/Makefile
> @@ -0,0 +1,87 @@
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# This Makefile is based on samples/bpf.
> +#
> +# Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
> +
> +DT_VERSION		:= 2.0.0
> +DT_GIT_VERSION		:= $(shell git rev-parse HEAD 2>/dev/null || \
> +				   echo Unknown)
> +
> +DTRACE_PATH		?= $(abspath $(srctree)/$(src))
> +TOOLS_PATH		:= $(DTRACE_PATH)/..
> +SAMPLES_PATH		:= $(DTRACE_PATH)/../../samples
> +
> +hostprogs-y		:= dtrace
> +
> +LIBBPF			:= $(TOOLS_PATH)/lib/bpf/libbpf.a
> +OBJS			:= dt_bpf.o dt_buffer.o dt_utils.o dt_probe.o \
> +			   dt_hash.o \
> +			   dt_fbt.o dt_syscall.o
> +
> +dtrace-objs		:= $(OBJS) dtrace.o
> +
> +always			:= $(hostprogs-y)
> +always			+= bpf_sample.o
> +
> +KBUILD_HOSTCFLAGS	+= -DDT_VERSION=\"$(DT_VERSION)\"
> +KBUILD_HOSTCFLAGS	+= -DDT_GIT_VERSION=\"$(DT_GIT_VERSION)\"
> +KBUILD_HOSTCFLAGS	+= -I$(srctree)/tools/lib
> +KBUILD_HOSTCFLAGS	+= -I$(srctree)/tools/include/uapi
> +KBUILD_HOSTCFLAGS	+= -I$(srctree)/tools/include/
> +KBUILD_HOSTCFLAGS	+= -I$(srctree)/usr/include
> +
> +KBUILD_HOSTLDLIBS	:= $(LIBBPF) -lelf
> +
> +LLC			?= llc
> +CLANG			?= clang
> +LLVM_OBJCOPY		?= llvm-objcopy
> +
> +ifdef CROSS_COMPILE
> +HOSTCC			= $(CROSS_COMPILE)gcc
> +CLANG_ARCH_ARGS		= -target $(ARCH)
> +endif
> +
> +all:
> +	$(MAKE) -C ../../ $(CURDIR)/ DTRACE_PATH=$(CURDIR)
> +
> +clean:
> +	$(MAKE) -C ../../ M=$(CURDIR) clean
> +	@rm -f *~
> +
> +$(LIBBPF): FORCE
> +	$(MAKE) -C $(dir $@) RM='rm -rf' LDFLAGS= srctree=$(DTRACE_PATH)/../../ O=
> +
> +FORCE:
> +
> +.PHONY: verify_cmds verify_target_bpf $(CLANG) $(LLC)
> +
> +verify_cmds: $(CLANG) $(LLC)
> +	@for TOOL in $^ ; do \
> +		if ! (which -- "$${TOOL}" > /dev/null 2>&1); then \
> +			echo "*** ERROR: Cannot find LLVM tool $${TOOL}" ;\
> +			exit 1; \
> +		else true; fi; \
> +	done
> +
> +verify_target_bpf: verify_cmds
> +	@if ! (${LLC} -march=bpf -mattr=help > /dev/null 2>&1); then \
> +		echo "*** ERROR: LLVM (${LLC}) does not support 'bpf' target" ;\
> +		echo "   NOTICE: LLVM version >= 3.7.1 required" ;\
> +		exit 2; \
> +	else true; fi
> +
> +$(DTRACE_PATH)/*.c: verify_target_bpf $(LIBBPF)
> +$(src)/*.c: verify_target_bpf $(LIBBPF)
> +
> +$(obj)/%.o: $(src)/%.c
> +	@echo "  CLANG-bpf " $@
> +	$(Q)$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \
> +		-I$(srctree)/tools/testing/selftests/bpf/ \
> +		-D__KERNEL__ -D__BPF_TRACING__ -Wno-unused-value -Wno-pointer-sign \
> +		-D__TARGET_ARCH_$(ARCH) -Wno-compare-distinct-pointer-types \
> +		-Wno-gnu-variable-sized-type-not-at-end \
> +		-Wno-address-of-packed-member -Wno-tautological-compare \
> +		-Wno-unknown-warning-option $(CLANG_ARCH_ARGS) \
> +		-I$(srctree)/samples/bpf/ -include asm_goto_workaround.h \
> +		-O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf $(LLC_FLAGS) -filetype=obj -o $@
> diff --git a/tools/dtrace/bpf_sample.c b/tools/dtrace/bpf_sample.c
> new file mode 100644
> index 000000000000..9862f75f92d3
> --- /dev/null
> +++ b/tools/dtrace/bpf_sample.c
> @@ -0,0 +1,146 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * This sample DTrace BPF tracing program demonstrates how actions can be
> + * associated with different probe types.
> + *
> + * The kprobe/ksys_write probe is a Function Boundary Tracing (FBT) entry probe
> + * on the ksys_write(fd, buf, count) function in the kernel.  Arguments to the
> + * function can be retrieved from the CPU registers (struct pt_regs).
> + *
> + * The tracepoint/syscalls/sys_enter_write probe is a System Call entry probe
> + * for the write(d, buf, count) system call.  Arguments to the system call can
> + * be retrieved from the tracepoint data passed to the BPF program as context
> + * struct syscall_data) when the probe fires.
> + *
> + * The BPF program associated with each probe prepares a DTrace BPF context
> + * (struct dt_bpf_context) that stores the probe ID and up to 10 arguments.
> + * Only 3 arguments are used in this sample.  Then the prorgams call a shared
> + * BPF function (bpf_action) that implements the actual action to be taken when
> + * a probe fires.  It prepares a data record to be stored in the tracing buffer
> + * and submits it to the buffer.  The data in the data record is obtained from
> + * the DTrace BPF context.
> + *
> + * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
> + */
> +#include <uapi/linux/bpf.h>
> +#include <linux/ptrace.h>
> +#include <linux/version.h>
> +#include <uapi/linux/unistd.h>
> +#include "bpf_helpers.h"
> +
> +#include "dtrace.h"
> +
> +struct syscall_data {
> +	struct pt_regs *regs;
> +	long syscall_nr;
> +	long arg[6];
> +};
> +
> +struct bpf_map_def SEC("maps") buffers = {
> +	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
> +	.key_size = sizeof(u32),
> +	.value_size = sizeof(u32),
> +	.max_entries = NR_CPUS,
> +};
> +
> +#if defined(bpf_target_x86)
> +# define PT_REGS_PARM6(x)	((x)->r9)
> +#elif defined(bpf_target_s390x)
> +# define PT_REGS_PARM6(x)	((x)->gprs[7])
> +#elif defined(bpf_target_arm)
> +# define PT_REGS_PARM6(x)	((x)->uregs[5])
> +#elif defined(bpf_target_arm64)
> +# define PT_REGS_PARM6(x)	((x)->regs[5])
> +#elif defined(bpf_target_mips)
> +# define PT_REGS_PARM6(x)	((x)->regs[9])
> +#elif defined(bpf_target_powerpc)
> +# define PT_REGS_PARM6(x)	((x)->gpr[8])
> +#elif defined(bpf_target_sparc)
> +# define PT_REGS_PARM6(x)	((x)->u_regs[UREG_I5])
> +#else
> +# error Argument retrieval from pt_regs is not supported yet on this arch.
> +#endif
> +
> +/*
> + * We must pass a valid BPF context pointer because the bpf_perf_event_output()
> + * helper requires a BPF context pointer as first argument (and the verifier is
> + * validating that we pass a value that is known to be a context pointer).
> + *
> + * This BPF function implements the following D action:
> + * {
> + *	trace(curthread);
> + *	trace(arg0);
> + *	trace(arg1);
> + *	trace(arg2);
> + * }
> + *
> + * Expected output will look like:
> + *   CPU     ID
> + *    15  70423 0xffff8c0968bf8ec0 0x00000000000001 0x0055e019eb3f60 0x0000000000002c
> + *    15  18876 0xffff8c0968bf8ec0 0x00000000000001 0x0055e019eb3f60 0x0000000000002c
> + *    |   |     +-- curthread      +--> arg0 (fd)   +--> arg1 (buf)  +-- arg2 (count)
> + *    |   |
> + *    |   +--> probe ID
> + *    |
> + *    +--> CPU the probe fired on
> + */
> +static noinline int bpf_action(void *bpf_ctx, struct dt_bpf_context *ctx)
> +{
> +	int			cpu = bpf_get_smp_processor_id();
> +	struct data {
> +		u32	probe_id;	/* mandatory */
> +
> +		u64	task;		/* first data item (current task) */
> +		u64	arg0;		/* 2nd data item (arg0, fd) */
> +		u64	arg1;		/* 3rd data item (arg1, buf) */
> +		u64	arg2;		/* 4th data item (arg2, count) */
> +	}			rec;
> +
> +	memset(&rec, 0, sizeof(rec));
> +
> +	rec.probe_id = ctx->probe_id;
> +	rec.task = bpf_get_current_task();
> +	rec.arg0 = ctx->argv[0];
> +	rec.arg1 = ctx->argv[1];
> +	rec.arg2 = ctx->argv[2];
> +
> +	bpf_perf_event_output(bpf_ctx, &buffers, cpu, &rec, sizeof(rec));
> +
> +	return 0;
> +}
> +
> +SEC("kprobe/ksys_write")
> +int bpf_kprobe(struct pt_regs *regs)
> +{
> +	struct dt_bpf_context	ctx;
> +
> +	memset(&ctx, 0, sizeof(ctx));
> +
> +	ctx.probe_id = 18876;
> +	ctx.argv[0] = PT_REGS_PARM1(regs);
> +	ctx.argv[1] = PT_REGS_PARM2(regs);
> +	ctx.argv[2] = PT_REGS_PARM3(regs);
> +	ctx.argv[3] = PT_REGS_PARM4(regs);
> +	ctx.argv[4] = PT_REGS_PARM5(regs);
> +	ctx.argv[5] = PT_REGS_PARM6(regs);
> +
> +	return bpf_action(regs, &ctx);
> +}
> +
> +SEC("tracepoint/syscalls/sys_enter_write")
> +int bpf_tp(struct syscall_data *scd)
> +{
> +	struct dt_bpf_context	ctx;
> +
> +	memset(&ctx, 0, sizeof(ctx));
> +
> +	ctx.probe_id = 70423;
> +	ctx.argv[0] = scd->arg[0];
> +	ctx.argv[1] = scd->arg[1];
> +	ctx.argv[2] = scd->arg[2];
> +
> +	return bpf_action(scd, &ctx);
> +}
> +
> +char _license[] SEC("license") = "GPL";
> +u32 _version SEC("version") = LINUX_VERSION_CODE;
> diff --git a/tools/dtrace/dt_bpf.c b/tools/dtrace/dt_bpf.c
> new file mode 100644
> index 000000000000..78c90de016c6
> --- /dev/null
> +++ b/tools/dtrace/dt_bpf.c
> @@ -0,0 +1,185 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * This file provides the interface for handling BPF.  It uses the bpf library
> + * to interact with BPF ELF object files.
> + *
> + * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
> + */
> +#include <errno.h>
> +#include <stdarg.h>
> +#include <stdio.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <bpf/libbpf.h>
> +#include <linux/kernel.h>
> +#include <linux/perf_event.h>
> +#include <sys/ioctl.h>
> +
> +#include "dtrace_impl.h"
> +
> +/*
> + * Validate the output buffer map that is specified in the BPF ELF object.  It
> + * must match the following definition to be valid:
> + *
> + * struct bpf_map_def SEC("maps") buffers = {
> + *	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
> + *	.key_size = sizeof(u32),
> + *	.value_size = sizeof(u32),
> + *	.max_entries = num,
> + * };
> + * where num is greater than dt_maxcpuid.
> + */
> +static int is_valid_buffers(const struct bpf_map_def *mdef)
> +{
> +	return mdef->type == BPF_MAP_TYPE_PERF_EVENT_ARRAY &&
> +	       mdef->key_size == sizeof(u32) &&
> +	       mdef->value_size == sizeof(u32) &&
> +	       mdef->max_entries > dt_maxcpuid;
> +}
> +
> +/*
> + * List the probes specified in the given BPF ELF object file.
> + */
> +int dt_bpf_list_probes(const char *fn)
> +{
> +	struct bpf_object	*obj;
> +	struct bpf_program	*prog;
> +	int			rc, fd;
> +
> +	libbpf_set_print(NULL);
> +
> +	/*
> +	 * Listing probes is done before the DTrace command line utility loads
> +	 * the supplied programs.  We load them here without attaching them to
> +	 * probes so that we can retrieve the ELF section names for each BPF
> +	 * program.  The section name indicates the probe that the program is
> +	 * associated with.
> +	 */
> +	rc = bpf_prog_load(fn, BPF_PROG_TYPE_UNSPEC, &obj, &fd);
> +	if (rc)
> +		return rc;
> +
> +	/*
> +	 * Loop through the programs in the BPF ELF object, and try to resolve
> +	 * the section names into probes.  Use the supplied callback function
> +	 * to emit the probe description.
> +	 */
> +	for (prog = bpf_program__next(NULL, obj); prog != NULL;
> +	     prog = bpf_program__next(prog, obj)) {
> +		struct dt_probe	*probe;
> +
> +		probe = dt_probe_resolve_event(bpf_program__title(prog, false));
> +
> +		printf("%5d %10s %17s %33s %s\n", probe->id,
> +		       probe->prv_name ? probe->prv_name : "",
> +		       probe->mod_name ? probe->mod_name : "",
> +		       probe->fun_name ? probe->fun_name : "",
> +		       probe->prb_name ? probe->prb_name : "");
> +	}
> +
> +
> +	/* Done with the BPF ELF object.  */
> +	bpf_object__close(obj);
> +
> +	return 0;
> +}
> +
> +/*
> + * Load the given BPF ELF object file.
> + */
> +int dt_bpf_load_file(const char *fn)
> +{
> +	struct bpf_object	*obj;
> +	struct bpf_map		*map;
> +	struct bpf_program	*prog;
> +	int			rc, fd;
> +
> +	libbpf_set_print(NULL);
> +
> +	/* Load the BPF ELF object file. */
> +	rc = bpf_prog_load(fn, BPF_PROG_TYPE_UNSPEC, &obj, &fd);
> +	if (rc)
> +		return rc;
> +
> +	/* Validate buffers map. */
> +	map = bpf_object__find_map_by_name(obj, "buffers");
> +	if (map && is_valid_buffers(bpf_map__def(map)))
> +		dt_bufmap_fd = bpf_map__fd(map);
> +	else
> +		goto fail;
> +
> +	/*
> +	 * Loop through the programs and resolve each into the matching probe.
> +	 * Attach the program to the probe.
> +	 */
> +	for (prog = bpf_program__next(NULL, obj); prog != NULL;
> +	     prog = bpf_program__next(prog, obj)) {
> +		struct dt_probe	*probe;
> +
> +		probe = dt_probe_resolve_event(bpf_program__title(prog, false));
> +		if (!probe)
> +			return -ENOENT;
> +		if (probe->prov && probe->prov->attach)
> +			probe->prov->attach(bpf_program__title(prog, false),
> +					    bpf_program__fd(prog));
> +	}
> +
> +	return 0;
> +
> +fail:
> +	bpf_object__close(obj);
> +	return -EINVAL;
> +}
> +
> +/*
> + * Store the (key, value) pair in the map referenced by the given fd.
> + */
> +int dt_bpf_map_update(int fd, const void *key, const void *val)
> +{
> +	union bpf_attr	attr;
> +
> +	memset(&attr, 0, sizeof(attr));
> +
> +	attr.map_fd = fd;
> +	attr.key = (u64)(unsigned long)key;
> +	attr.value = (u64)(unsigned long)val;
> +	attr.flags = 0;
> +
> +	return bpf(BPF_MAP_UPDATE_ELEM, &attr);
> +}
> +
> +/*
> + * Attach a trace event and associate a BPF program with it.
> + */
> +int dt_bpf_attach(int event_id, int bpf_fd)
> +{
> +	int			event_fd;
> +	int			rc;
> +	struct perf_event_attr	attr = {};
> +
> +	attr.type = PERF_TYPE_TRACEPOINT;
> +	attr.sample_type = PERF_SAMPLE_RAW;
> +	attr.sample_period = 1;
> +	attr.wakeup_events = 1;
> +	attr.config = event_id;
> +
> +	/*
> +	 * Register the event (based on its id), and obtain a fd.  It gets
> +	 * created as an enabled probe, so we don't have to explicitly enable
> +	 * it.
> +	 */
> +	event_fd = perf_event_open(&attr, -1, 0, -1, 0);
> +	if (event_fd < 0) {
> +		perror("sys_perf_event_open");
> +		return -1;
> +	}
> +
> +	/* Associate the BPF program with the event. */
> +	rc = ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, bpf_fd);
> +	if (rc < 0) {
> +		perror("PERF_EVENT_IOC_SET_BPF");
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> diff --git a/tools/dtrace/dt_buffer.c b/tools/dtrace/dt_buffer.c
> new file mode 100644
> index 000000000000..19bb7e4cfc92
> --- /dev/null
> +++ b/tools/dtrace/dt_buffer.c
> @@ -0,0 +1,338 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * This file provides the tracing buffer handling for DTrace.  It makes use of
> + * the perf event output ring buffers that can be written to from BPF programs.
> + *
> + * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
> + */
> +#include <errno.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <syscall.h>
> +#include <unistd.h>
> +#include <sys/epoll.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <linux/bpf.h>
> +#include <linux/perf_event.h>
> +#include <linux/ring_buffer.h>
> +
> +#include "dtrace_impl.h"
> +
> +/*
> + * Probe data is recorded in per-CPU perf ring buffers.
> + */
> +struct dtrace_buffer {
> +	int	cpu;			/* ID of CPU that uses this buffer */
> +	int	fd;			/* fd of perf output buffer */
> +	size_t	page_size;		/* size of each page in buffer */
> +	size_t	data_size;		/* total buffer size */
> +	u8	*base;			/* address of buffer */
> +	u8	*endp;			/* address of end of buffer */
> +	u8	*tmp;			/* temporary event buffer */
> +	u32	tmp_len;		/* length of temporary event buffer */
> +};
> +
> +static struct dtrace_buffer	*dt_buffers;
> +
> +/*
> + * File descriptor for the BPF map that holds the buffers for the online CPUs.
> + * The map is a bpf_array indexed by CPU id, and it stores a file descriptor as
> + * value (the fd for the perf_event that represents the CPU buffer).
> + */
> +int				dt_bufmap_fd = -1;
> +
> +/*
> + * Create a perf_event buffer for the given DTrace buffer.  This will create
> + * a perf_event ring_buffer, mmap it, and enable the perf_event that owns the
> + * buffer.
> + */
> +static int perf_buffer_open(struct dtrace_buffer *buf)
> +{
> +	int			pefd;
> +	struct perf_event_attr	attr = {};
> +
> +	/*
> +	 * Event configuration for BPF-generated output in perf_event ring
> +	 * buffers.  The event is created in enabled state.
> +	 */
> +	attr.config = PERF_COUNT_SW_BPF_OUTPUT;
> +	attr.type = PERF_TYPE_SOFTWARE;
> +	attr.sample_type = PERF_SAMPLE_RAW;
> +	attr.sample_period = 1;
> +	attr.wakeup_events = 1;
> +	pefd = perf_event_open(&attr, -1, buf->cpu, -1, PERF_FLAG_FD_CLOEXEC);
> +	if (pefd < 0) {
> +		fprintf(stderr, "perf_event_open(cpu %d): %s\n", buf->cpu,
> +			strerror(errno));
> +		goto fail;
> +	}
> +
> +	/*
> +	 * We add buf->page_size to the buf->data_size, because perf maintains
> +	 * a meta-data page at the beginning of the memory region.  That page
> +	 * is used for reader/writer symchronization.
> +	 */
> +	buf->fd = pefd;
> +	buf->base = mmap(NULL, buf->page_size + buf->data_size,
> +			 PROT_READ | PROT_WRITE, MAP_SHARED, buf->fd, 0);
> +	buf->endp = buf->base + buf->page_size + buf->data_size - 1;
> +	if (!buf->base)
> +		goto fail;
> +
> +	return 0;
> +
> +fail:
> +	if (buf->base) {
> +		munmap(buf->base, buf->page_size + buf->data_size);
> +		buf->base = NULL;
> +		buf->endp = NULL;
> +	}
> +	if (buf->fd) {
> +		close(buf->fd);
> +		buf->fd = -1;
> +	}
> +
> +	return -1;
> +}
> +
> +/*
> + * Close the given DTrace buffer.  This function disables the perf_event that
> + * owns the buffer, munmaps the memory space, and closes the perf buffer fd.
> + */
> +static void perf_buffer_close(struct dtrace_buffer *buf)
> +{
> +	/*
> +	 * If the perf buffer failed to open, there is no need to close it.
> +	 */
> +	if (buf->fd < 0)
> +		return;
> +
> +	if (ioctl(buf->fd, PERF_EVENT_IOC_DISABLE, 0) < 0)
> +		fprintf(stderr, "PERF_EVENT_IOC_DISABLE(cpu %d): %s\n",
> +			buf->cpu, strerror(errno));
> +
> +	munmap(buf->base, buf->page_size + buf->data_size);
> +
> +	if (close(buf->fd))
> +		fprintf(stderr, "perf buffer close(cpu %d): %s\n",
> +			buf->cpu, strerror(errno));
> +
> +	buf->base = NULL;
> +	buf->fd = -1;
> +}
> +
> +/*
> + * Initialize the probe data buffers (one per online CPU).  Each buffer will
> + * contain the given number of pages (i.e. total size of each buffer will be
> + * num_pages * getpagesize()).  This function also sets up an event polling
> + * descriptor that monitors all CPU buffers at once.
> + */
> +int dt_buffer_init(int num_pages)
> +{
> +	int	i;
> +	int	epoll_fd;
> +
> +	if (dt_bufmap_fd < 0)
> +		return -EINVAL;
> +
> +	/* Allocate the per-CPU buffer structs. */
> +	dt_buffers = calloc(dt_numcpus, sizeof(struct dtrace_buffer));
> +	if (dt_buffers == NULL)
> +		return -ENOMEM;
> +
> +	/* Set up the event polling file descriptor. */
> +	epoll_fd = epoll_create1(EPOLL_CLOEXEC);
> +	if (epoll_fd < 0) {
> +		free(dt_buffers);
> +		return -errno;
> +	}
> +
> +	for (i = 0; i < dt_numcpus; i++) {
> +		int			cpu = dt_cpuids[i];
> +		struct epoll_event	ev;
> +		struct dtrace_buffer	*buf = &dt_buffers[i];
> +
> +		buf->cpu = cpu;
> +		buf->page_size = getpagesize();
> +		buf->data_size = num_pages * buf->page_size;
> +		buf->tmp = NULL;
> +		buf->tmp_len = 0;
> +
> +		/* Try to create the perf buffer for this DTrace buffer. */
> +		if (perf_buffer_open(buf) == -1)
> +			continue;
> +
> +		/* Store the perf buffer fd in the buffer map. */
> +		dt_bpf_map_update(dt_bufmap_fd, &cpu, &buf->fd);
> +
> +		/* Add the buffer to the event polling descriptor. */
> +		ev.events = EPOLLIN;
> +		ev.data.ptr = buf;
> +		if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, buf->fd, &ev) == -1) {
> +			fprintf(stderr, "EPOLL_CTL_ADD(cpu %d): %s\n",
> +				buf->cpu, strerror(errno));
> +			continue;
> +		}
> +	}
> +
> +	return epoll_fd;
> +}
> +
> +/*
> + * Clean up the buffers.
> + */
> +void dt_buffer_exit(int epoll_fd)
> +{
> +	int	i;
> +
> +	for (i = 0; i < dt_numcpus; i++)
> +		perf_buffer_close(&dt_buffers[i]);
> +
> +	free(dt_buffers);
> +	close(epoll_fd);
> +}
> +
> +/*
> + * Process and output the probe data at the supplied address.
> + */
> +static void output_event(int cpu, u64 *buf)
> +{
> +	u8				*data = (u8 *)buf;
> +	struct perf_event_header	*hdr;
> +
> +	hdr = (struct perf_event_header *)data;
> +	data += sizeof(struct perf_event_header);
> +
> +	if (hdr->type == PERF_RECORD_SAMPLE) {
> +		u8		*ptr = data;
> +		u32		i, size, probe_id;
> +
> +		/*
> +		 * struct {
> +		 *	struct perf_event_header	header;
> +		 *	u32				size;
> +		 *	u32				probe_id;
> +		 *	u32				gap;
> +		 *	u64				data[n];
> +		 * }
> +		 * and data points to the 'size' member at this point.
> +		 */
> +		if (ptr > (u8 *)buf + hdr->size) {
> +			fprintf(stderr, "BAD: corrupted sample header\n");
> +			return;
> +		}
> +
> +		size = *(u32 *)data;
> +		data += sizeof(size);
> +		ptr += sizeof(size) + size;
> +		if (ptr != (u8 *)buf + hdr->size) {
> +			fprintf(stderr, "BAD: invalid sample size\n");
> +			return;
> +		}
> +
> +		probe_id = *(u32 *)data;
> +		data += sizeof(probe_id);
> +		size -= sizeof(probe_id);
> +		data += sizeof(u32);		/* skip 32-bit gap */
> +		size -= sizeof(u32);
> +		buf = (u64 *)data;
> +
> +		printf("%3d %6d ", cpu, probe_id);
> +		for (i = 0, size /= sizeof(u64); i < size; i++)
> +			printf("%#016lx ", buf[i]);
> +		printf("\n");
> +	} else if (hdr->type == PERF_RECORD_LOST) {
> +		u64	lost;
> +
> +		/*
> +		 * struct {
> +		 *	struct perf_event_header	header;
> +		 *	u64				id;
> +		 *	u64				lost;
> +		 * }
> +		 * and data points to the 'id' member at this point.
> +		 */
> +		lost = *(u64 *)(data + sizeof(u64));
> +
> +		printf("[%ld probes dropped]\n", lost);
> +	} else
> +		fprintf(stderr, "UNKNOWN: record type %d\n", hdr->type);
> +}
> +
> +/*
> + * Process the available probe data in the given buffer.
> + */
> +static void process_data(struct dtrace_buffer *buf)
> +{
> +	struct perf_event_mmap_page	*rb_page = (void *)buf->base;
> +	struct perf_event_header	*hdr;
> +	u8				*base;
> +	u64				head, tail;
> +
> +	/* Set base to be the start of the buffer data. */
> +	base = buf->base + buf->page_size;
> +
> +	for (;;) {
> +		head = ring_buffer_read_head(rb_page);
> +		tail = rb_page->data_tail;
> +
> +		if (tail == head)
> +			break;
> +
> +		do {
> +			u8	*event = base + tail % buf->data_size;
> +			u32	len;
> +
> +			hdr = (struct perf_event_header *)event;
> +			len = hdr->size;
> +
> +			/*
> +			 * If the perf event data wraps around the boundary of
> +			 * the buffer, we make a copy in contiguous memory.
> +			 */
> +			if (event + len > buf->endp) {
> +				u8	*dst;
> +				u32	num;
> +
> +				/* Increase buffer as needed. */
> +				if (buf->tmp_len < len) {
> +					buf->tmp = realloc(buf->tmp, len);
> +					buf->tmp_len = len;
> +				}
> +
> +				dst = buf->tmp;
> +				num = buf->endp - event + 1;
> +				memcpy(dst, event, num);
> +				memcpy(dst + num, base, len - num);
> +
> +				event = dst;
> +			}
> +
> +			output_event(buf->cpu, (u64 *)event);
> +
> +			tail += hdr->size;
> +		} while (tail != head);
> +
> +		ring_buffer_write_tail(rb_page, tail);
> +	}
> +}
> +
> +/*
> + * Wait for data to become available in any of the buffers.
> + */
> +int dt_buffer_poll(int epoll_fd, int timeout)
> +{
> +	struct epoll_event	events[dt_numcpus];
> +	int			i, cnt;
> +
> +	cnt = epoll_wait(epoll_fd, events, dt_numcpus, timeout);
> +	if (cnt < 0)
> +		return -errno;
> +
> +	for (i = 0; i < cnt; i++)
> +		process_data((struct dtrace_buffer *)events[i].data.ptr);
> +
> +	return cnt;
> +}
> diff --git a/tools/dtrace/dt_fbt.c b/tools/dtrace/dt_fbt.c
> new file mode 100644
> index 000000000000..fcf95243bf97
> --- /dev/null
> +++ b/tools/dtrace/dt_fbt.c
> @@ -0,0 +1,201 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * The Function Boundary Tracing (FBT) provider for DTrace.
> + *
> + * FBT probes are exposed by the kernel as kprobes.  They are listed in the
> + * TRACEFS/available_filter_functions file.  Some kprobes are associated with
> + * a specific kernel module, while most are in the core kernel.
> + *
> + * Mapping from event name to DTrace probe name:
> + *
> + *      <name>					fbt:vmlinux:<name>:entry
> + *						fbt:vmlinux:<name>:return
> + *   or
> + *      <name> [<modname>]			fbt:<modname>:<name>:entry
> + *						fbt:<modname>:<name>:return
> + *
> + * Mapping from BPF section name to DTrace probe name:
> + *
> + *      kprobe/<name>				fbt:vmlinux:<name>:entry
> + *      kretprobe/<name>			fbt:vmlinux:<name>:return
> + *
> + * (Note that the BPF section does not carry information about the module that
> + *  the function is found in.  This means that BPF section name cannot be used
> + *  to distinguish between functions with the same name occurring in different
> + *  modules.)
> + *
> + * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
> + */
> +#include <fcntl.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <linux/bpf.h>
> +#include <sys/stat.h>
> +#include <sys/types.h>
> +
> +#include "dtrace_impl.h"
> +
> +#define KPROBE_EVENTS	TRACEFS "kprobe_events"
> +#define PROBE_LIST	TRACEFS "available_filter_functions"
> +
> +static const char	provname[] = "fbt";
> +static const char	modname[] = "vmlinux";
> +
> +/*
> + * Scan the PROBE_LIST file and add entry and return probes for every function
> + * that is listed.
> + */
> +static int fbt_populate(void)
> +{
> +	FILE			*f;
> +	char			buf[256];
> +	char			*p;
> +
> +	f = fopen(PROBE_LIST, "r");
> +	if (f == NULL)
> +		return -1;
> +
> +	while (fgets(buf, sizeof(buf), f)) {
> +		/*
> +		 * Here buf is either "funcname\n" or "funcname [modname]\n".
> +		 */
> +		p = strchr(buf, '\n');
> +		if (p) {
> +			*p = '\0';
> +			if (p > buf && *(--p) == ']')
> +				*p = '\0';
> +		} else {
> +			/* If we didn't see a newline, the line was too long.
> +			 * Report it, and continue until the end of the line.
> +			 */
> +			fprintf(stderr, "%s: Line too long: %s\n",
> +				PROBE_LIST, buf);
> +			do
> +				fgets(buf, sizeof(buf), f);
> +			while (strchr(buf, '\n') == NULL);
> +			continue;
> +		}
> +
> +		/*
> +		 * Now buf is either "funcname" or "funcname [modname".  If
> +		 * there is no module name provided, we will use the default.
> +		 */
> +		p = strchr(buf, ' ');
> +		if (p) {
> +			*p++ = '\0';
> +			if (*p == '[')
> +				p++;
> +		}
> +
> +		dt_probe_new(&dt_fbt, provname, p ? p : modname, buf, "entry");
> +		dt_probe_new(&dt_fbt, provname, p ? p : modname, buf, "return");
> +	}
> +
> +	fclose(f);
> +
> +	return 0;
> +}
> +
> +#define ENTRY_PREFIX	"kprobe/"
> +#define EXIT_PREFIX	"kretprobe/"
> +
> +/*
> + * Perform a probe lookup based on an event name (BPF ELF section name).
> + */
> +static struct dt_probe *fbt_resolve_event(const char *name)
> +{
> +	const char	*prbname;
> +	struct dt_probe	tmpl;
> +	struct dt_probe	*probe;
> +
> +	if (!name)
> +		return NULL;
> +
> +	if (strncmp(name, ENTRY_PREFIX, sizeof(ENTRY_PREFIX) - 1) == 0) {
> +		name += sizeof(ENTRY_PREFIX) - 1;
> +		prbname = "entry";
> +	} else if (strncmp(name, EXIT_PREFIX, sizeof(EXIT_PREFIX) - 1) == 0) {
> +		name += sizeof(EXIT_PREFIX) - 1;
> +		prbname = "return";
> +	} else
> +		return NULL;
> +
> +	memset(&tmpl, 0, sizeof(tmpl));
> +	tmpl.prv_name = provname;
> +	tmpl.mod_name = modname;
> +	tmpl.fun_name = name;
> +	tmpl.prb_name = prbname;
> +
> +	probe = dt_probe_by_name(&tmpl);
> +
> +	return probe;
> +}
> +
> +/*
> + * Attach the given BPF program (identified by its file descriptor) to the
> + * kprobe identified by the given section name.
> + */
> +static int fbt_attach(const char *name, int bpf_fd)
> +{
> +	char    efn[256];
> +	char    buf[256];
> +	int	event_id, fd, rc;
> +
> +	name += 7;				/* skip "kprobe/" */
> +	snprintf(buf, sizeof(buf), "p:%s %s\n", name, name);
> +
> +	/*
> +	 * Register the kprobe with the tracing subsystem.  This will create
> +	 * a tracepoint event.
> +	 */
> +	fd = open(KPROBE_EVENTS, O_WRONLY | O_APPEND);
> +	if (fd < 0) {
> +		perror(KPROBE_EVENTS);
> +		return -1;
> +	}
> +	rc = write(fd, buf, strlen(buf));
> +	if (rc < 0) {
> +		perror(KPROBE_EVENTS);
> +		close(fd);
> +		return -1;
> +	}
> +	close(fd);
> +
> +	/*
> +	 * Read the tracepoint event id for the kprobe we just registered.
> +	 */
> +	strcpy(efn, EVENTSFS);
> +	strcat(efn, "kprobes/");
> +	strcat(efn, name);
> +	strcat(efn, "/id");
> +
> +	fd = open(efn, O_RDONLY);
> +	if (fd < 0) {
> +		perror(efn);
> +		return -1;
> +	}
> +	rc = read(fd, buf, sizeof(buf));
> +	if (rc < 0 || rc >= sizeof(buf)) {
> +		perror(efn);
> +		close(fd);
> +		return -1;
> +	}
> +	close(fd);
> +	buf[rc] = '\0';
> +	event_id = atoi(buf);
> +
> +	/*
> +	 * Attaching a BPF program (by file descriptor) to an event (by ID) is
> +	 * a generic operation provided by the BPF interface code.
> +	 */
> +	return dt_bpf_attach(event_id, bpf_fd);
> +}
> +
> +struct dt_provider	dt_fbt = {
> +	.name		= "fbt",
> +	.populate	= &fbt_populate,
> +	.resolve_event	= &fbt_resolve_event,
> +	.attach		= &fbt_attach,
> +};
> diff --git a/tools/dtrace/dt_hash.c b/tools/dtrace/dt_hash.c
> new file mode 100644
> index 000000000000..b1f563bc0773
> --- /dev/null
> +++ b/tools/dtrace/dt_hash.c
> @@ -0,0 +1,211 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * This file provides a generic hashtable implementation for probes.
> + *
> + * The hashtable is created with 4 user-provided functions:
> + *	hval(probe)		- calculate a hash value for the given probe
> + *	cmp(probe1, probe2)	- compare two probes
> + *	add(head, probe)	- add a probe to a list of probes
> + *	del(head, probe)	- delete a probe from a list of probes
> + *
> + * Probes are hashed into a hashtable slot based on the return value of
> + * hval(probe).  Each hashtable slot holds a list of buckets, with each
> + * bucket storing probes that are equal under the cmp(probe1, probe2)
> + * function. Probes are added to the list of probes in a bucket using the
> + * add(head, probe) function, and they are deleted using a call to the
> + * del(head, probe) function.
> + *
> + * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
> + */
> +#include <errno.h>
> +#include <stdint.h>
> +#include <stdlib.h>
> +
> +#include "dtrace_impl.h"
> +
> +/*
> + * Hashtable implementation for probes.
> + */
> +struct dt_hbucket {
> +	u32			hval;
> +	struct dt_hbucket	*next;
> +	struct dt_probe		*head;
> +	int			nprobes;
> +};
> +
> +struct dt_htab {
> +	struct dt_hbucket	**tab;
> +	int			size;
> +	int			mask;
> +	int			nbuckets;
> +	dt_hval_fn		hval;		/* calculate hash value */
> +	dt_cmp_fn		cmp;		/* compare 2 probes */
> +	dt_add_fn		add;		/* add probe to list */
> +	dt_del_fn		del;		/* delete probe from list */
> +};
> +
> +/*
> + * Create a new (empty) hashtable.
> + */
> +struct dt_htab *dt_htab_new(dt_hval_fn hval, dt_cmp_fn cmp, dt_add_fn add,
> +			    dt_del_fn del)
> +{
> +	struct dt_htab	*htab = malloc(sizeof(struct dt_htab));
> +
> +	if (!htab)
> +		return NULL;
> +
> +	htab->size = 1;
> +	htab->mask = htab->size - 1;
> +	htab->nbuckets = 0;
> +	htab->hval = hval;
> +	htab->cmp = cmp;
> +	htab->add = add;
> +	htab->del = del;
> +
> +	htab->tab = calloc(htab->size, sizeof(struct dt_hbucket *));
> +	if (!htab->tab) {
> +		free(htab);
> +		return NULL;
> +	}
> +
> +	return htab;
> +}
> +
> +/*
> + * Resize the hashtable by doubling the number of slots.
> + */
> +static int resize(struct dt_htab *htab)
> +{
> +	int			i;
> +	int			osize = htab->size;
> +	int			nsize = osize << 1;
> +	int			nmask = nsize - 1;
> +	struct dt_hbucket	**ntab;
> +
> +	ntab = calloc(nsize, sizeof(struct dt_hbucket *));
> +	if (!ntab)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < osize; i++) {
> +		struct dt_hbucket	*bucket, *next;
> +
> +		for (bucket = htab->tab[i]; bucket; bucket = next) {
> +			int	idx	= bucket->hval & nmask;
> +
> +			next = bucket->next;
> +			bucket->next = ntab[idx];
> +			ntab[idx] = bucket;
> +		}
> +	}
> +
> +	free(htab->tab);
> +	htab->tab = ntab;
> +	htab->size = nsize;
> +	htab->mask = nmask;
> +
> +	return 0;
> +}
> +
> +/*
> + * Add a probe to the hashtable.  Resize if necessary, and allocate a new
> + * bucket if necessary.
> + */
> +int dt_htab_add(struct dt_htab *htab, struct dt_probe *probe)
> +{
> +	u32			hval = htab->hval(probe);
> +	int			idx;
> +	struct dt_hbucket	*bucket;
> +
> +retry:
> +	idx = hval & htab->mask;
> +	for (bucket = htab->tab[idx]; bucket; bucket = bucket->next) {
> +		if (htab->cmp(bucket->head, probe) == 0)
> +			goto add;
> +	}
> +
> +	if ((htab->nbuckets >> 1) > htab->size) {
> +		int	err;
> +
> +		err = resize(htab);
> +		if (err)
> +			return err;
> +
> +		goto retry;
> +	}
> +
> +	bucket = malloc(sizeof(struct dt_hbucket));
> +	if (!bucket)
> +		return -ENOMEM;
> +
> +	bucket->hval = hval;
> +	bucket->next = htab->tab[idx];
> +	bucket->head = NULL;
> +	bucket->nprobes = 0;
> +	htab->tab[idx] = bucket;
> +	htab->nbuckets++;
> +
> +add:
> +	bucket->head = htab->add(bucket->head, probe);
> +	bucket->nprobes++;
> +
> +	return 0;
> +}
> +
> +/*
> + * Find a probe in the hashtable.
> + */
> +struct dt_probe *dt_htab_lookup(const struct dt_htab *htab,
> +				const struct dt_probe *probe)
> +{
> +	u32			hval = htab->hval(probe);
> +	int			idx = hval & htab->mask;
> +	struct dt_hbucket	*bucket;
> +
> +	for (bucket = htab->tab[idx]; bucket; bucket = bucket->next) {
> +		if (htab->cmp(bucket->head, probe) == 0)
> +			return bucket->head;
> +	}
> +
> +	return NULL;
> +}
> +
> +/*
> + * Remove a probe from the hashtable.  If we are deleting the last probe in a
> + * bucket, get rid of the bucket.
> + */
> +int dt_htab_del(struct dt_htab *htab, struct dt_probe *probe)
> +{
> +	u32			hval = htab->hval(probe);
> +	int			idx = hval & htab->mask;
> +	struct dt_hbucket	*bucket;
> +	struct dt_probe		*head;
> +
> +	for (bucket = htab->tab[idx]; bucket; bucket = bucket->next) {
> +		if (htab->cmp(bucket->head, probe) == 0)
> +			break;
> +	}
> +
> +	if (bucket == NULL)
> +		return -ENOENT;
> +
> +	head = htab->del(bucket->head, probe);
> +	if (!head) {
> +		struct dt_hbucket	*b = htab->tab[idx];
> +
> +		if (bucket == b)
> +			htab->tab[idx] = bucket->next;
> +		else {
> +			while (b->next != bucket)
> +				b = b->next;
> +
> +			b->next = bucket->next;
> +		}
> +
> +		htab->nbuckets--;
> +		free(bucket);
> +	} else
> +		bucket->head = head;
> +
> +	return 0;
> +}
> diff --git a/tools/dtrace/dt_probe.c b/tools/dtrace/dt_probe.c
> new file mode 100644
> index 000000000000..0b6228eaff29
> --- /dev/null
> +++ b/tools/dtrace/dt_probe.c
> @@ -0,0 +1,230 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * This file implements the interface to probes grouped by provider.
> + *
> + * Probes are named by a set of 4 identifiers:
> + *	- provider name
> + *	- module name
> + *	- function name
> + *	- probe name
> + *
> + * The Fully Qualified Name (FQN) is "provider:module:function:name".
> + *
> + * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
> + */
> +#include <errno.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <linux/bpf.h>
> +#include <linux/kernel.h>
> +
> +#include "dtrace_impl.h"
> +
> +static struct dt_provider      *dt_providers[] = {
> +							&dt_fbt,
> +							&dt_syscall,
> +						 };
> +
> +static struct dt_htab	*ht_byfqn;
> +
> +static u32		next_probe_id;
> +
> +/*
> + * Calculate a hash value based on a given string and an initial value.  The
> + * initial value is used to calculate compound hash values, e.g.
> + *
> + *	u32	hval;
> + *
> + *	hval = str2hval(str1, 0);
> + *	hval = str2hval(str2, hval);
> + */
> +static u32 str2hval(const char *p, u32 hval)
> +{
> +	u32	g;
> +
> +	if (!p)
> +		return hval;
> +
> +	while (*p) {
> +		hval = (hval << 4) + *p++;
> +		g = hval & 0xf0000000;
> +		if (g != 0)
> +			hval ^= g >> 24;
> +
> +		hval &= ~g;
> +	}
> +
> +	return hval;
> +}
> +
> +/*
> + * String compare function that can handle either or both strings being NULL.
> + */
> +static int safe_strcmp(const char *p, const char *q)
> +{
> +	return (!p) ? (!q) ? 0
> +			   : -1
> +		    : (!q) ? 1
> +			   : strcmp(p, q);
> +}
> +
> +/*
> + * Calculate the hash value of a probe as the cummulative hash value of the
> + * FQN.
> + */
> +static u32 fqn_hval(const struct dt_probe *probe)
> +{
> +	u32	hval = 0;
> +
> +	hval = str2hval(probe->prv_name, hval);
> +	hval = str2hval(":", hval);
> +	hval = str2hval(probe->mod_name, hval);
> +	hval = str2hval(":", hval);
> +	hval = str2hval(probe->fun_name, hval);
> +	hval = str2hval(":", hval);
> +	hval = str2hval(probe->prb_name, hval);
> +
> +	return hval;
> +}
> +
> +/*
> + * Compare two probes based on the FQN.
> + */
> +static int fqn_cmp(const struct dt_probe *p, const struct dt_probe *q)
> +{
> +	int	rc;
> +
> +	rc = safe_strcmp(p->prv_name, q->prv_name);
> +	if (rc)
> +		return rc;
> +	rc = safe_strcmp(p->mod_name, q->mod_name);
> +	if (rc)
> +		return rc;
> +	rc = safe_strcmp(p->fun_name, q->fun_name);
> +	if (rc)
> +		return rc;
> +	rc = safe_strcmp(p->prb_name, q->prb_name);
> +	if (rc)
> +		return rc;
> +
> +	return 0;
> +}
> +
> +/*
> + * Add the given probe 'new' to the double-linked probe list 'head'.  Probe
> + * 'new' becomes the new list head.
> + */
> +static struct dt_probe *fqn_add(struct dt_probe *head, struct dt_probe *new)
> +{
> +	if (!head)
> +		return new;
> +
> +	new->he_fqn.next = head;
> +	head->he_fqn.prev = new;
> +
> +	return new;
> +}
> +
> +/*
> + * Remove the given probe 'probe' from the double-linked probe list 'head'.
> + * If we are deleting the current head, the next probe in the list is returned
> + * as the new head.  If that value is NULL, the list is now empty.
> + */
> +static struct dt_probe *fqn_del(struct dt_probe *head, struct dt_probe *probe)
> +{
> +	if (head == probe) {
> +		if (!probe->he_fqn.next)
> +			return NULL;
> +
> +		head = probe->he_fqn.next;
> +		head->he_fqn.prev = NULL;
> +		probe->he_fqn.next = NULL;
> +
> +		return head;
> +	}
> +
> +	if (!probe->he_fqn.next) {
> +		probe->he_fqn.prev->he_fqn.next = NULL;
> +		probe->he_fqn.prev = NULL;
> +
> +		return head;
> +	}
> +
> +	probe->he_fqn.prev->he_fqn.next = probe->he_fqn.next;
> +	probe->he_fqn.next->he_fqn.prev = probe->he_fqn.prev;
> +	probe->he_fqn.prev = probe->he_fqn.next = NULL;
> +
> +	return head;
> +}
> +
> +/*
> + * Initialize the probe handling by populating the FQN hashtable with probes
> + * from all providers.
> + */
> +int dt_probe_init(void)
> +{
> +	int	i;
> +
> +	ht_byfqn = dt_htab_new(fqn_hval, fqn_cmp, fqn_add, fqn_del);
> +
> +	for (i = 0; i < ARRAY_SIZE(dt_providers); i++) {
> +		if (dt_providers[i]->populate() < 0)
> +			return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Allocate a new probe and add it to the FQN hashtable.
> + */
> +int dt_probe_new(const struct dt_provider *prov, const char *pname,
> +		 const char *mname, const char *fname, const char *name)
> +{
> +	struct dt_probe	*probe;
> +
> +	probe = malloc(sizeof(struct dt_probe));
> +	if (!probe)
> +		return -ENOMEM;
> +
> +	memset(probe, 0, sizeof(struct dt_probe));
> +	probe->id = next_probe_id++;
> +	probe->prov = prov;
> +	probe->prv_name = pname ? strdup(pname) : NULL;
> +	probe->mod_name = mname ? strdup(mname) : NULL;
> +	probe->fun_name = fname ? strdup(fname) : NULL;
> +	probe->prb_name = name ? strdup(name) : NULL;
> +
> +	dt_htab_add(ht_byfqn, probe);
> +
> +	return 0;
> +}
> +
> +/*
> + * Perform a probe lookup based on FQN.
> + */
> +struct dt_probe *dt_probe_by_name(const struct dt_probe *tmpl)
> +{
> +	return dt_htab_lookup(ht_byfqn, tmpl);
> +}
> +
> +/*
> + * Resolve an event name (BPF ELF section name) into a probe.  We query each
> + * provider, and as soon as we get a hit, we return the result.
> + */
> +struct dt_probe *dt_probe_resolve_event(const char *name)
> +{
> +	int		i;
> +	struct dt_probe	*probe;
> +
> +	for (i = 0; i < ARRAY_SIZE(dt_providers); i++) {
> +		if (!dt_providers[i]->resolve_event)
> +			continue;
> +		probe = dt_providers[i]->resolve_event(name);
> +		if (probe)
> +			return probe;
> +	}
> +
> +	return NULL;
> +}
> diff --git a/tools/dtrace/dt_syscall.c b/tools/dtrace/dt_syscall.c
> new file mode 100644
> index 000000000000..6695a4a1c701
> --- /dev/null
> +++ b/tools/dtrace/dt_syscall.c
> @@ -0,0 +1,179 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * The syscall provider for DTrace.
> + *
> + * System call probes are exposed by the kernel as tracepoint events in the
> + * "syscalls" group.  Entry probe names start with "sys_enter_" and exit probes
> + * start with "sys_exit_".
> + *
> + * Mapping from event name to DTrace probe name:
> + *
> + *	syscalls:sys_enter_<name>		syscall:vmlinux:<name>:entry
> + *	syscalls:sys_exit_<name>		syscall:vmlinux:<name>:return
> + *
> + * Mapping from BPF section name to DTrace probe name:
> + *
> + *	tracepoint/syscalls/sys_enter_<name>	syscall:vmlinux:<name>:entry
> + *	tracepoint/syscalls/sys_exit_<name>	syscall:vmlinux:<name>:return
> + *
> + * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
> + */
> +#include <ctype.h>
> +#include <fcntl.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <linux/bpf.h>
> +#include <sys/stat.h>
> +#include <sys/types.h>
> +
> +#include "dtrace_impl.h"
> +
> +static const char	provname[] = "syscall";
> +static const char	modname[] = "vmlinux";
> +
> +#define PROBE_LIST	TRACEFS "available_events"
> +
> +#define PROV_PREFIX	"syscalls:"
> +#define ENTRY_PREFIX	"sys_enter_"
> +#define EXIT_PREFIX	"sys_exit_"
> +
> +/*
> + * Scan the PROBE_LIST file and add probes for any syscalls events.
> + */
> +static int syscall_populate(void)
> +{
> +	FILE			*f;
> +	char			buf[256];
> +
> +	f = fopen(PROBE_LIST, "r");
> +	if (f == NULL)
> +		return -1;
> +
> +	while (fgets(buf, sizeof(buf), f)) {
> +		char	*p;
> +
> +		/* * Here buf is "group:event".  */
> +		p = strchr(buf, '\n');
> +		if (p)
> +			*p = '\0';
> +		else {
> +			/*
> +			 * If we didn't see a newline, the line was too long.
> +			 * Report it, and continue until the end of the line.
> +			 */
> +			fprintf(stderr, "%s: Line too long: %s\n",
> +				PROBE_LIST, buf);
> +			do
> +				fgets(buf, sizeof(buf), f);
> +			while (strchr(buf, '\n') == NULL);
> +			continue;
> +		}
> +
> +		/* We need "group:" to match "syscalls:". */
> +		p = buf;
> +		if (memcmp(p, PROV_PREFIX, sizeof(PROV_PREFIX) - 1) != 0)
> +			continue;
> +
> +		p += sizeof(PROV_PREFIX) - 1;
> +		/*
> +		 * Now p will be just "event", and we are only interested in
> +		 * events that match "sys_enter_*" or "sys_exit_*".
> +		 */
> +		if (!memcmp(p, ENTRY_PREFIX, sizeof(ENTRY_PREFIX) - 1)) {
> +			p += sizeof(ENTRY_PREFIX) - 1;
> +			dt_probe_new(&dt_syscall, provname, modname, p,
> +				     "entry");
> +		} else if (!memcmp(p, EXIT_PREFIX, sizeof(EXIT_PREFIX) - 1)) {
> +			p += sizeof(EXIT_PREFIX) - 1;
> +			dt_probe_new(&dt_syscall, provname, modname, p,
> +				     "return");
> +		}
> +	}
> +
> +	fclose(f);
> +
> +	return 0;
> +}
> +
> +#define EVENT_PREFIX	"tracepoint/syscalls/"
> +
> +/*
> + * Perform a probe lookup based on an event name (BPF ELF section name).
> + */
> +static struct dt_probe *systrace_resolve_event(const char *name)
> +{
> +	const char	*prbname;
> +	struct dt_probe	tmpl;
> +	struct dt_probe	*probe;
> +
> +	if (!name)
> +		return NULL;
> +
> +	/* Exclude anything that is not a syscalls tracepoint */
> +	if (strncmp(name, EVENT_PREFIX, sizeof(EVENT_PREFIX) - 1) != 0)
> +		return NULL;
> +	name += sizeof(EVENT_PREFIX) - 1;
> +
> +	if (strncmp(name, ENTRY_PREFIX, sizeof(ENTRY_PREFIX) - 1) == 0) {
> +		name += sizeof(ENTRY_PREFIX) - 1;
> +		prbname = "entry";
> +	} else if (strncmp(name, EXIT_PREFIX, sizeof(EXIT_PREFIX) - 1) == 0) {
> +		name += sizeof(EXIT_PREFIX) - 1;
> +		prbname = "return";
> +	} else
> +		return NULL;
> +
> +	memset(&tmpl, 0, sizeof(tmpl));
> +	tmpl.prv_name = provname;
> +	tmpl.mod_name = modname;
> +	tmpl.fun_name = name;
> +	tmpl.prb_name = prbname;
> +
> +	probe = dt_probe_by_name(&tmpl);
> +
> +	return probe;
> +}
> +
> +#define SYSCALLSFS	EVENTSFS "syscalls/"
> +
> +/*
> + * Attach the given BPF program (identified by its file descriptor) to the
> + * event identified by the given section name.
> + */
> +static int syscall_attach(const char *name, int bpf_fd)
> +{
> +	char    efn[256];
> +	char    buf[256];
> +	int	event_id, fd, rc;
> +
> +	name += sizeof(EVENT_PREFIX) - 1;
> +	strcpy(efn, SYSCALLSFS);
> +	strcat(efn, name);
> +	strcat(efn, "/id");
> +
> +	fd = open(efn, O_RDONLY);
> +	if (fd < 0) {
> +		perror(efn);
> +		return -1;
> +	}
> +	rc = read(fd, buf, sizeof(buf));
> +	if (rc < 0 || rc >= sizeof(buf)) {
> +		perror(efn);
> +		close(fd);
> +		return -1;
> +	}
> +	close(fd);
> +	buf[rc] = '\0';
> +	event_id = atoi(buf);
> +
> +	return dt_bpf_attach(event_id, bpf_fd);
> +}
> +
> +struct dt_provider	dt_syscall = {
> +	.name		= "syscall",
> +	.populate	= &syscall_populate,
> +	.resolve_event	= &systrace_resolve_event,
> +	.attach		= &syscall_attach,
> +};
> diff --git a/tools/dtrace/dt_utils.c b/tools/dtrace/dt_utils.c
> new file mode 100644
> index 000000000000..55d51bae1d97
> --- /dev/null
> +++ b/tools/dtrace/dt_utils.c
> @@ -0,0 +1,132 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
> + */
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <fcntl.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <unistd.h>
> +
> +#include "dtrace_impl.h"
> +
> +#define BUF_SIZE	1024		/* max size for online cpu data */
> +
> +int	dt_numcpus;			/* number of online CPUs */
> +int	dt_maxcpuid;			/* highest CPU id */
> +int	*dt_cpuids;			/* list of CPU ids */
> +
> +/*
> + * Populate the online CPU id information from sysfs data.  We only do this
> + * once because we do not care about CPUs coming online after we started
> + * tracing.  If a CPU goes offline during tracing, we do not care either
> + * because that simply means that it won't be writing any new probe data into
> + * its buffer.
> + */
> +void cpu_list_populate(void)
> +{
> +	char buf[BUF_SIZE];
> +	int fd, cnt, start, end, i;
> +	int *cpu;
> +	char *p, *q;
> +
> +	fd = open("/sys/devices/system/cpu/online", O_RDONLY);
> +	if (fd < 0)
> +		goto fail;
> +	cnt = read(fd, buf, sizeof(buf));
> +	close(fd);
> +	if (cnt <= 0)
> +		goto fail;
> +
> +	/*
> +	 * The string should always end with a newline, but let's make sure.
> +	 */
> +	if (buf[cnt - 1] == '\n')
> +		buf[--cnt] = 0;
> +
> +	/*
> +	 * Count how many CPUs we have.
> +	 */
> +	dt_numcpus = 0;
> +	p = buf;
> +	do {
> +		start = (int)strtol(p, &q, 10);
> +		switch (*q) {
> +		case '-':		/* range */
> +			p = q + 1;
> +			end = (int)strtol(p, &q, 10);
> +			dt_numcpus += end - start + 1;
> +			if (*q == 0) {	/* end of string */
> +				p = q;
> +				break;
> +			}
> +			if (*q != ',')
> +				goto fail;
> +			p = q + 1;
> +			break;
> +		case 0:			/* end of string */
> +			dt_numcpus++;
> +			p = q;
> +			break;
> +		case ',':	/* gap  */
> +			dt_numcpus++;
> +			p = q + 1;
> +			break;
> +		}
> +	} while (*p != 0);
> +
> +	dt_cpuids = calloc(dt_numcpus,  sizeof(int));
> +	cpu = dt_cpuids;
> +
> +	/*
> +	 * Fill in the CPU ids.
> +	 */
> +	p = buf;
> +	do {
> +		start = (int)strtol(p, &q, 10);
> +		switch (*q) {
> +		case '-':		/* range */
> +			p = q + 1;
> +			end = (int)strtol(p, &q, 10);
> +			for (i = start; i <= end; i++)
> +				*cpu++ = i;
> +			if (*q == 0) {	/* end of string */
> +				p = q;
> +				break;
> +			}
> +			if (*q != ',')
> +				goto fail;
> +			p = q + 1;
> +			break;
> +		case 0:			/* end of string */
> +			*cpu = start;
> +			p = q;
> +			break;
> +		case ',':	/* gap  */
> +			*cpu++ = start;
> +			p = q + 1;
> +			break;
> +		}
> +	} while (*p != 0);
> +
> +	/* Record the highest CPU id of the set of online CPUs. */
> +	dt_maxcpuid = *(cpu - 1);
> +
> +	return;
> +fail:
> +	if (dt_cpuids)
> +		free(dt_cpuids);
> +
> +	dt_numcpus = 0;
> +	dt_maxcpuid = 0;
> +	dt_cpuids = NULL;
> +}
> +
> +void cpu_list_free(void)
> +{
> +	free(dt_cpuids);
> +	dt_numcpus = 0;
> +	dt_maxcpuid = 0;
> +	dt_cpuids = NULL;
> +}
> diff --git a/tools/dtrace/dtrace.c b/tools/dtrace/dtrace.c
> new file mode 100644
> index 000000000000..36ad526c1cd4
> --- /dev/null
> +++ b/tools/dtrace/dtrace.c
> @@ -0,0 +1,249 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
> + */
> +#include <errno.h>
> +#include <libgen.h>
> +#include <stdarg.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <linux/log2.h>
> +
> +#include "dtrace_impl.h"
> +
> +#define DTRACE_BUFSIZE	32		/* default buffer size (in pages) */
> +
> +#define DMODE_VERS	0		/* display version information (-V) */
> +#define DMODE_LIST	1		/* list probes (-l) */
> +#define DMODE_EXEC	2		/* compile program and start tracing */
> +
> +#define E_SUCCESS	0
> +#define E_ERROR		1
> +#define E_USAGE		2
> +
> +#define NUM_PAGES(sz)	(((sz) + getpagesize() - 1) / getpagesize())
> +
> +static const char		*dtrace_options = "+b:ls:V";
> +
> +static char			*g_pname;
> +static int			g_mode = DMODE_EXEC;
> +
> +static int usage(void)
> +{
> +	fprintf(stderr, "Usage: %s [-lV] [-b bufsz] -s script\n", g_pname);
> +	fprintf(stderr,
> +	"\t-b  set trace buffer size\n"
> +	"\t-l  list probes matching specified criteria\n"
> +	"\t-s  enable or list probes for the specified BPF program\n"
> +	"\t-V  report DTrace API version\n");
> +
> +	return E_USAGE;
> +}
> +
> +static u64 parse_size(const char *arg)
> +{
> +	long long	mul = 1;
> +	long long	neg, val;
> +	size_t		len;
> +	char		*end;
> +
> +	if (!arg)
> +		return -1;
> +
> +	len = strlen(arg);
> +	if (!len)
> +		return -1;
> +
> +	switch (arg[len - 1]) {
> +	case 't':
> +	case 'T':
> +		mul *= 1024;
> +		/* fall-through */
> +	case 'g':
> +	case 'G':
> +		mul *= 1024;
> +		/* fall-through */
> +	case 'm':
> +	case 'M':
> +		mul *= 1024;
> +		/* fall-through */
> +	case 'k':
> +	case 'K':
> +		mul *= 1024;
> +		/* fall-through */
> +	default:
> +		break;
> +	}
> +
> +	neg = strtoll(arg, NULL, 0);
> +	errno = 0;
> +	val = strtoull(arg, &end, 0) * mul;
> +
> +	if ((mul > 1 && end != &arg[len - 1]) || (mul == 1 && *end != '\0') ||
> +	    val < 0 || neg < 0 || errno != 0)
> +		return -1;
> +
> +	return val;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	int	i;
> +	int	modec = 0;
> +	int	bufsize = DTRACE_BUFSIZE;
> +	int	epoll_fd;
> +	int	cnt;
> +	char	**prgv;
> +	int	prgc;
> +
> +	g_pname = basename(argv[0]);
> +
> +	if (argc == 1)
> +		return usage();
> +
> +	prgc = 0;
> +	prgv = calloc(argc, sizeof(char *));
> +	if (!prgv) {
> +		fprintf(stderr, "failed to allocate memory for arguments: %s\n",
> +			strerror(errno));
> +		return E_ERROR;
> +	}
> +
> +	argv[0] = g_pname;			/* argv[0] for getopt errors */
> +
> +	for (optind = 1; optind < argc; optind++) {
> +		int	opt;
> +
> +		while ((opt = getopt(argc, argv, dtrace_options)) != EOF) {
> +			u64			val;
> +
> +			switch (opt) {
> +			case 'b':
> +				val = parse_size(optarg);
> +				if (val < 0) {
> +					fprintf(stderr, "invalid: -b %s\n",
> +						optarg);
> +					return E_ERROR;
> +				}
> +
> +				/*
> +				 * Bufsize needs to be a number of pages, and
> +				 * must be a power of 2.  This is required by
> +				 * the perf event buffer code.
> +				 */
> +				bufsize = roundup_pow_of_two(NUM_PAGES(val));
> +				if ((u64)bufsize * getpagesize() > val)
> +					fprintf(stderr,
> +						"bufsize increased to %ld\n",
> +						(u64)bufsize * getpagesize());
> +
> +				break;
> +			case 'l':
> +				g_mode = DMODE_LIST;
> +				modec++;
> +				break;
> +			case 's':
> +				prgv[prgc++] = optarg;
> +				break;
> +			case 'V':
> +				g_mode = DMODE_VERS;
> +				modec++;
> +				break;
> +			default:
> +				if (strchr(dtrace_options, opt) == NULL)
> +					return usage();
> +			}
> +		}
> +
> +		if (optind < argc) {
> +			fprintf(stderr, "unknown option '%s'\n", argv[optind]);
> +			return E_ERROR;
> +		}
> +	}
> +
> +	if (modec > 1) {
> +		fprintf(stderr,
> +			"only one of [-lV] can be specified at a time\n");
> +		return E_USAGE;
> +	}
> +
> +	/*
> +	 * We handle requests for version information first because we do not
> +	 * need probe information for it.
> +	 */
> +	if (g_mode == DMODE_VERS) {
> +		printf("%s\n"
> +		       "This is DTrace %s\n"
> +		       "dtrace(1) version-control ID: %s\n",
> +		       DT_VERS_STRING, DT_VERSION, DT_GIT_VERSION);
> +
> +		return E_SUCCESS;
> +	}
> +
> +	/* Initialize probes. */
> +	if (dt_probe_init() < 0) {
> +		fprintf(stderr, "failed to initialize probes: %s\n",
> +			strerror(errno));
> +		return E_ERROR;
> +	}
> +
> +	/*
> +	 * We handle requests to list probes next.
> +	 */
> +	if (g_mode == DMODE_LIST) {
> +		int	rc = 0;
> +
> +		printf("%5s %10s %17s %33s %s\n",
> +		       "ID", "PROVIDER", "MODULE", "FUNCTION", "NAME");
> +		for (i = 0; i < prgc; i++) {
> +			rc = dt_bpf_list_probes(prgv[i]);
> +			if (rc < 0)
> +				fprintf(stderr, "failed to load %s: %s\n",
> +					prgv[i], strerror(errno));
> +		}
> +
> +		return rc ? E_ERROR : E_SUCCESS;
> +	}
> +
> +	if (!prgc) {
> +		fprintf(stderr, "missing BPF program(s)\n");
> +		return E_ERROR;
> +	}
> +
> +	/* Process the BPF program. */
> +	for (i = 0; i < prgc; i++) {
> +		int	err;
> +
> +		err = dt_bpf_load_file(prgv[i]);
> +		if (err) {
> +			errno = -err;
> +			fprintf(stderr, "failed to load %s: %s\n",
> +				prgv[i], strerror(errno));
> +			return E_ERROR;
> +		}
> +	}
> +
> +	/* Get the list of online CPUs. */
> +	cpu_list_populate();
> +
> +	/* Initialize buffers. */
> +	epoll_fd = dt_buffer_init(bufsize);
> +	if (epoll_fd < 0) {
> +		errno = -epoll_fd;
> +		fprintf(stderr, "failed to allocate buffers: %s\n",
> +			strerror(errno));
> +		return E_ERROR;
> +	}
> +
> +	/* Process probe data. */
> +	printf("%3s %6s\n", "CPU", "ID");
> +	do {
> +		cnt = dt_buffer_poll(epoll_fd, 100);
> +	} while (cnt >= 0);
> +
> +	dt_buffer_exit(epoll_fd);
> +
> +	return E_SUCCESS;
> +}
> diff --git a/tools/dtrace/dtrace.h b/tools/dtrace/dtrace.h
> new file mode 100644
> index 000000000000..c79398432d17
> --- /dev/null
> +++ b/tools/dtrace/dtrace.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
> + */
> +#ifndef _UAPI_LINUX_DTRACE_H
> +#define _UAPI_LINUX_DTRACE_H
> +
> +struct dt_bpf_context {
> +	u32		probe_id;
> +	u64		argv[10];
> +};
> +
> +#endif /* _UAPI_LINUX_DTRACE_H */
> diff --git a/tools/dtrace/dtrace_impl.h b/tools/dtrace/dtrace_impl.h
> new file mode 100644
> index 000000000000..9aa51b4c4aee
> --- /dev/null
> +++ b/tools/dtrace/dtrace_impl.h
> @@ -0,0 +1,101 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
> + */
> +#ifndef _DTRACE_H
> +#define _DTRACE_H
> +
> +#include <unistd.h>
> +#include <bpf/libbpf.h>
> +#include <linux/types.h>
> +#include <linux/ptrace.h>
> +#include <linux/perf_event.h>
> +#include <sys/syscall.h>
> +
> +#include "dtrace.h"
> +
> +#define DT_DEBUG
> +
> +#define DT_VERS_STRING	"Oracle D 2.0.0"
> +
> +#define TRACEFS		"/sys/kernel/debug/tracing/"
> +#define EVENTSFS	TRACEFS "events/"
> +
> +extern int	dt_numcpus;
> +extern int	dt_maxcpuid;
> +extern int	*dt_cpuids;
> +
> +extern void cpu_list_populate(void);
> +extern void cpu_list_free(void);
> +
> +struct dt_provider {
> +	char		*name;
> +	int		(*populate)(void);
> +	struct dt_probe *(*resolve_event)(const char *name);
> +	int		(*attach)(const char *name, int bpf_fd);
> +};
> +
> +extern struct dt_provider	dt_fbt;
> +extern struct dt_provider	dt_syscall;
> +
> +struct dt_hentry {
> +	struct dt_probe		*next;
> +	struct dt_probe		*prev;
> +};
> +
> +struct dt_htab;
> +
> +typedef u32 (*dt_hval_fn)(const struct dt_probe *);
> +typedef int (*dt_cmp_fn)(const struct dt_probe *, const struct dt_probe *);
> +typedef struct dt_probe *(*dt_add_fn)(struct dt_probe *, struct dt_probe *);
> +typedef struct dt_probe *(*dt_del_fn)(struct dt_probe *, struct dt_probe *);
> +
> +extern struct dt_htab *dt_htab_new(dt_hval_fn hval, dt_cmp_fn cmp,
> +				   dt_add_fn add, dt_del_fn del);
> +extern int dt_htab_add(struct dt_htab *htab, struct dt_probe *probe);
> +extern struct dt_probe *dt_htab_lookup(const struct dt_htab *htab,
> +				       const struct dt_probe *probe);
> +extern int dt_htab_del(struct dt_htab *htab, struct dt_probe *probe);
> +
> +struct dt_probe {
> +	u32				id;
> +	int				event_fd;
> +	const struct dt_provider	*prov;
> +	const char			*prv_name;	/* provider name */
> +	const char			*mod_name;	/* module name */
> +	const char			*fun_name;	/* function name */
> +	const char			*prb_name;	/* probe name */
> +	struct dt_hentry		he_fqn;
> +};
> +
> +typedef void (*dt_probe_fn)(const struct dt_probe *probe);
> +
> +extern int dt_probe_init(void);
> +extern int dt_probe_new(const struct dt_provider *prov, const char *pname,
> +			const char *mname, const char *fname, const char *name);
> +extern struct dt_probe *dt_probe_by_name(const struct dt_probe *tmpl);
> +extern struct dt_probe *dt_probe_resolve_event(const char *name);
> +
> +extern int dt_bpf_list_probes(const char *fn);
> +extern int dt_bpf_load_file(const char *fn);
> +extern int dt_bpf_map_update(int fd, const void *key, const void *val);
> +extern int dt_bpf_attach(int event_id, int bpf_fd);
> +
> +extern int dt_bufmap_fd;
> +
> +extern int dt_buffer_init(int num_pages);
> +extern int dt_buffer_poll(int epoll_fd, int timeout);
> +extern void dt_buffer_exit(int epoll_fd);
> +
> +static inline int perf_event_open(struct perf_event_attr *attr, pid_t pid,
> +				  int cpu, int group_fd, unsigned long flags)
> +{
> +	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
> +}
> +
> +extern inline int bpf(enum bpf_cmd cmd, union bpf_attr *attr)
> +{
> +	return syscall(__NR_bpf, cmd, attr, sizeof(union bpf_attr));
> +}
> +
> +#endif /* _DTRACE_H */
> -- 
> 2.20.1

^ permalink raw reply

* Re: [PATCH bpf-next v8 0/2] bpf: Allow bpf_skb_event_output for more prog types
From: Andrii Nakryiko @ 2019-07-10 18:05 UTC (permalink / raw)
  To: Allan Zhang
  Cc: Networking, bpf, Song Liu, Daniel Borkmann, Alexei Starovoitov
In-Reply-To: <20190709180005.33406-1-allanzhang@google.com>

On Tue, Jul 9, 2019 at 11:03 AM Allan Zhang <allanzhang@google.com> wrote:
>
> Software event output is only enabled by a few prog types right now (TC,
> LWT out, XDP, sockops). Many other skb based prog types need
> bpf_skb_event_output to produce software event.
>
> More prog types are enabled to access bpf_skb_event_output in this
> patch.
>
> v8 changes:
> No actual change, just cc to netdev@vger.kernel.org and
> bpf@vger.kernel.org.
> v7 patches are acked by Song Liu.

You forgot to include Song's Acked-by into your patch commit messages. Just add

Acked-by: Song Liu <songliubraving@fb.com>

to both patches after your Signed-off-by.

>
> v7 changes:
> Reformat from hints by scripts/checkpatch.pl, including Song's comment
> on signed-off-by name to captical case in cover letter.
> 3 of hints are ignored:
> 1. new file mode.
> 2. SPDX-License-Identifier for event_output.c since all files under
>    this dir have no such line.
> 3. "Macros ... enclosed in parentheses" for macro in event_output.c
>    due to code's nature.
>
> Change patch 02 subject "bpf:..." to "selftests/bpf:..."
>
> v6 changes:
> Fix Signed-off-by, fix fixup map creation.
>
> v5 changes:
> Fix typos, reformat comments in event_output.c, move revision history to
> cover letter.
>
> v4 changes:
> Reformating log message.
>
> v3 changes:
> Reformating log message.
>
> v2 changes:
> Reformating log message.
>
> Allan Zhang (2):
>   bpf: Allow bpf_skb_event_output for a few prog types
>   selftests/bpf: Add selftests for bpf_perf_event_output
>
>  net/core/filter.c                             |  6 ++
>  tools/testing/selftests/bpf/test_verifier.c   | 12 ++-
>  .../selftests/bpf/verifier/event_output.c     | 94 +++++++++++++++++++
>  3 files changed, 111 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/bpf/verifier/event_output.c
>
> --
> 2.22.0.410.gd8fdbe21b5-goog
>

^ permalink raw reply

* [PATCH] net: phy: make exported variables non-static
From: Denis Efremov @ 2019-07-10 18:03 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Denis Efremov, Florian Fainelli, Heiner Kallweit, David S. Miller,
	netdev, linux-kernel

The variables phy_basic_ports_array, phy_fibre_port_array and
phy_all_ports_features_array are declared static and marked
EXPORT_SYMBOL_GPL(), which is at best an odd combination.
Because the variables were decided to be a part of API, this commit
removes the static attributes and adds the declarations to the header.

Fixes: 3c1bcc8614db ("net: ethernet: Convert phydev advertize and supported from u32 to link mode")
Signed-off-by: Denis Efremov <efremov@linux.com>
---
 drivers/net/phy/phy_device.c | 6 +++---
 include/linux/phy.h          | 3 +++
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index dcc93a873174..f94171e9aa45 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -56,19 +56,19 @@ EXPORT_SYMBOL_GPL(phy_10gbit_features);
 __ETHTOOL_DECLARE_LINK_MODE_MASK(phy_10gbit_fec_features) __ro_after_init;
 EXPORT_SYMBOL_GPL(phy_10gbit_fec_features);
 
-static const int phy_basic_ports_array[] = {
+const int phy_basic_ports_array[3] = {
 	ETHTOOL_LINK_MODE_Autoneg_BIT,
 	ETHTOOL_LINK_MODE_TP_BIT,
 	ETHTOOL_LINK_MODE_MII_BIT,
 };
 EXPORT_SYMBOL_GPL(phy_basic_ports_array);
 
-static const int phy_fibre_port_array[] = {
+const int phy_fibre_port_array[1] = {
 	ETHTOOL_LINK_MODE_FIBRE_BIT,
 };
 EXPORT_SYMBOL_GPL(phy_fibre_port_array);
 
-static const int phy_all_ports_features_array[] = {
+const int phy_all_ports_features_array[7] = {
 	ETHTOOL_LINK_MODE_Autoneg_BIT,
 	ETHTOOL_LINK_MODE_TP_BIT,
 	ETHTOOL_LINK_MODE_MII_BIT,
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 6424586fe2d6..ec54a788fc96 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -55,6 +55,9 @@ extern __ETHTOOL_DECLARE_LINK_MODE_MASK(phy_10gbit_full_features) __ro_after_ini
 #define PHY_10GBIT_FEC_FEATURES ((unsigned long *)&phy_10gbit_fec_features)
 #define PHY_10GBIT_FULL_FEATURES ((unsigned long *)&phy_10gbit_full_features)
 
+extern const int phy_basic_ports_array[3];
+extern const int phy_fibre_port_array[1];
+extern const int phy_all_ports_features_array[7];
 extern const int phy_10_100_features_array[4];
 extern const int phy_basic_t1_features_array[2];
 extern const int phy_gbit_features_array[2];
-- 
2.21.0


^ permalink raw reply related

* [PATCH v6 bpf-next 0/3] bpf: add bpf_descendant_of helper
From: Javier Honduvilla Coto @ 2019-07-10 18:00 UTC (permalink / raw)
  To: netdev; +Cc: yhs, kernel-team, jonhaslam
In-Reply-To: <20190410203631.1576576-1-javierhonduco@fb.com>

Hi all,

This patch adds the bpf_descendant_of helper which accepts a PID and
returns 1 if the PID of the process currently being executed is a
descendant of it or if it's itself. Returns 0 otherwise. The passed
PID should be the one as seen from the "global" pid namespace as the
processes' PIDs in the hierarchy are resolved using the context of said
initial namespace.

This is very useful in tracing programs when we want to filter by a
given PID and all the children it might spawn. The current workarounds
most people implement for this purpose have issues:

- Attaching to process spawning syscalls and dynamically add those PIDs
to some bpf map that would be used to filter is cumbersome and
potentially racy.
- Unrolling some loop to perform what this helper is doing consumes lots
of instructions. That and the impossibility to jump backwards makes it
really hard to be correct in really large process chains.


Let me know what do you think!

Thanks,

---
Changes in V6:
        - Small style fix
        - Clarify in the docs that we are resolving PIDs using the global,
initial PID namespace, and the provided *pid* argument should be global, too
        - Changed the way we assert on the helper return value

Changes in V5:
        - Addressed code review feedback
        - Renamed from progenyof => descendant_of as suggested by Jon Haslam
and Brendan Gregg

Changes in V4:
        - Rebased on latest bpf-next after merge window

Changes in V3:
        - Removed RCU read (un)locking as BPF programs alredy run in RCU locked
                context
        - progenyof(0) now returns 1, which, semantically makes more sense
        - Added new test case for PID 0 and changed sentinel value for errors
        - Rebase on latest bpf-next/master
        - Used my work email as somehow I accidentally used my personal one in v2

Changes in V2:
        - Adding missing docs in include/uapi/linux/bpf.h
-- 
2.17.1


^ permalink raw reply

* [PATCH v6 bpf-next 3/3] bpf: add tests for bpf_descendant_of
From: Javier Honduvilla Coto @ 2019-07-10 18:00 UTC (permalink / raw)
  To: netdev; +Cc: yhs, kernel-team, jonhaslam
In-Reply-To: <20190710180025.94726-1-javierhonduco@fb.com>

Adding the following test cases:

- bpf_descendant_of(current->pid) == 1
- bpf_descendant_of(current->real_parent->pid) == 1
- bpf_descendant_of(1) == 1
- bpf_descendant_of(0) == 1

- bpf_descendant_of(-1) == 0
- bpf_descendant_of(current->children[0]->pid) == 0

Signed-off-by: Javier Honduvilla Coto <javierhonduco@fb.com>
---
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |   2 +-
 tools/testing/selftests/bpf/bpf_helpers.h     |   3 +
 .../bpf/progs/test_descendant_of_kern.c       |  43 +++
 .../selftests/bpf/test_descendant_of_user.c   | 266 ++++++++++++++++++
 5 files changed, 314 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/test_descendant_of_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_descendant_of_user.c

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 90f70d2c7c22..4b63d7105ba2 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -43,3 +43,4 @@ test_sockopt
 test_sockopt_sk
 test_sockopt_multi
 test_tcp_rtt
+test_descendant_of_user
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 2620406a53ec..b3dc1e26c41c 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -27,7 +27,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
 	test_cgroup_storage test_select_reuseport test_section_names \
 	test_netcnt test_tcpnotify_user test_sock_fields test_sysctl test_hashmap \
 	test_btf_dump test_cgroup_attach xdping test_sockopt test_sockopt_sk \
-	test_sockopt_multi test_tcp_rtt
+	test_sockopt_multi test_tcp_rtt test_descendant_of_user
 
 BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
 TEST_GEN_FILES = $(BPF_OBJ_FILES)
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 5a3d92c8bec8..7525783ffbc9 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -1,4 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
+#include <sys/types.h>
+
 #ifndef __BPF_HELPERS_H
 #define __BPF_HELPERS_H
 
@@ -228,6 +230,7 @@ static void *(*bpf_sk_storage_get)(void *map, struct bpf_sock *sk,
 static int (*bpf_sk_storage_delete)(void *map, struct bpf_sock *sk) =
 	(void *)BPF_FUNC_sk_storage_delete;
 static int (*bpf_send_signal)(unsigned sig) = (void *)BPF_FUNC_send_signal;
+static int (*bpf_descendant_of)(pid_t pid) = (void *) BPF_FUNC_descendant_of;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/progs/test_descendant_of_kern.c b/tools/testing/selftests/bpf/progs/test_descendant_of_kern.c
new file mode 100644
index 000000000000..802e01595527
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_descendant_of_kern.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/bpf.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") pidmap = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 2,
+};
+
+struct bpf_map_def SEC("maps") resultmap = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 1,
+};
+
+SEC("tracepoint/syscalls/sys_enter_open")
+int trace(void *ctx)
+{
+	__u32 pid = bpf_get_current_pid_tgid();
+	__u32 current_key = 0, ancestor_key = 1, *expected_pid, *ancestor_pid;
+	__u32 *val;
+
+	expected_pid = bpf_map_lookup_elem(&pidmap, &current_key);
+	if (!expected_pid || *expected_pid != pid)
+		return 0;
+
+	ancestor_pid = bpf_map_lookup_elem(&pidmap, &ancestor_key);
+	if (!ancestor_pid)
+		return 0;
+
+	val = bpf_map_lookup_elem(&resultmap, &current_key);
+	if (val)
+		*val = bpf_descendant_of(*ancestor_pid);
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+__u32 _version SEC("version") = 1;
diff --git a/tools/testing/selftests/bpf/test_descendant_of_user.c b/tools/testing/selftests/bpf/test_descendant_of_user.c
new file mode 100644
index 000000000000..f616c8c976a4
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_descendant_of_user.c
@@ -0,0 +1,266 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <syscall.h>
+#include <unistd.h>
+#include <linux/perf_event.h>
+#include <sys/ioctl.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#define CHECK(condition, tag, format...)                                       \
+	({                                                                     \
+		int __ret = !!(condition);                                     \
+		if (__ret) {                                                   \
+			printf("%s:FAIL:%s ", __func__, tag);                  \
+			printf(format);                                        \
+		} else {                                                       \
+			printf("%s:PASS:%s\n", __func__, tag);                 \
+		}                                                              \
+		__ret;                                                         \
+	})
+
+static int bpf_find_map(const char *test, struct bpf_object *obj,
+			const char *name)
+{
+	struct bpf_map *map;
+
+	map = bpf_object__find_map_by_name(obj, name);
+	if (!map)
+		return -1;
+	return bpf_map__fd(map);
+}
+
+int main(int argc, char **argv)
+{
+	const char *probe_name = "syscalls/sys_enter_open";
+	const char *file = "test_descendant_of_kern.o";
+	int err, bytes, efd, prog_fd, pmu_fd;
+	int resultmap_fd, pidmap_fd;
+	struct perf_event_attr attr = {};
+	struct bpf_object *obj;
+	__u32 descendant_of_result = 0;
+	__u32 key = 0, pid;
+	int exit_code = EXIT_FAILURE;
+	char buf[256];
+
+	int child_pid, ancestor_pid, root_fd, nonexistant = -42;
+	__u32 ancestor_key = 1;
+	int pipefd[2];
+	char marker[1];
+
+	err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, &prog_fd);
+	if (CHECK(err, "bpf_prog_load", "err %d errno %d\n", err, errno))
+		goto fail;
+
+	resultmap_fd = bpf_find_map(__func__, obj, "resultmap");
+	if (CHECK(resultmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
+		  resultmap_fd, errno))
+		goto close_prog;
+
+	pidmap_fd = bpf_find_map(__func__, obj, "pidmap");
+	if (CHECK(pidmap_fd < 0, "bpf_find_map", "err %d errno %d\n", pidmap_fd,
+		  errno))
+		goto close_prog;
+
+	pid = getpid();
+	bpf_map_update_elem(pidmap_fd, &key, &pid, 0);
+	bpf_map_update_elem(pidmap_fd, &ancestor_key, &pid, 0);
+
+	snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/events/%s/id",
+		 probe_name);
+	efd = open(buf, O_RDONLY, 0);
+	if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
+		goto close_prog;
+	bytes = read(efd, buf, sizeof(buf));
+	close(efd);
+	if (CHECK(bytes <= 0 || bytes >= sizeof(buf), "read",
+		  "bytes %d errno %d\n", bytes, errno))
+		goto close_prog;
+
+	attr.config = strtol(buf, NULL, 0);
+	attr.type = PERF_TYPE_TRACEPOINT;
+	attr.sample_type = PERF_SAMPLE_RAW;
+	attr.sample_period = 1;
+	attr.wakeup_events = 1;
+
+	pmu_fd = syscall(__NR_perf_event_open, &attr, getpid(), -1, -1, 0);
+	if (CHECK(pmu_fd < 0, "perf_event_open", "err %d errno %d\n", pmu_fd,
+		  errno))
+		goto close_prog;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
+	if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n", err,
+		  errno))
+		goto close_pmu;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
+	if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n", err,
+		  errno))
+		goto close_pmu;
+
+	// Test that descendant_of(current->pid) is true
+	bpf_map_update_elem(pidmap_fd, &key, &pid, 0);
+	bpf_map_update_elem(pidmap_fd, &ancestor_key, &pid, 0);
+	bpf_map_update_elem(resultmap_fd, &key, &nonexistant, 0);
+
+	root_fd = open("/", O_RDONLY);
+	if (CHECK(efd < 0, "open", "errno %d\n", errno))
+		goto close_prog;
+	close(root_fd);
+
+	err = bpf_map_lookup_elem(resultmap_fd, &key, &descendant_of_result);
+	if (CHECK(err, "bpf_map_lookup_elem", "err %d errno %d\n", err, errno))
+		goto close_pmu;
+	if (CHECK(descendant_of_result != 1,
+		  "descendant_of is true with same pid", "%d == %d\n",
+		  descendant_of_result, 1))
+		goto close_pmu;
+
+	// Test that PID 1 an ancestor
+	bpf_map_update_elem(pidmap_fd, &key, &pid, 0);
+	ancestor_pid = 1;
+	bpf_map_update_elem(pidmap_fd, &ancestor_key, &ancestor_pid, 0);
+	bpf_map_update_elem(resultmap_fd, &key, &nonexistant, 0);
+
+	root_fd = open("/", O_RDONLY);
+	if (CHECK(efd < 0, "open", "errno %d\n", errno))
+		goto close_prog;
+	close(root_fd);
+
+	err = bpf_map_lookup_elem(resultmap_fd, &key, &descendant_of_result);
+	if (CHECK(err, "bpf_map_lookup_elem", "err %d errno %d\n", err, errno))
+		goto close_pmu;
+	if (CHECK(descendant_of_result != 1, "descendant_of reaches init",
+		  "%d == %d\n", descendant_of_result, 1))
+		goto close_pmu;
+
+	// Test that PID 0 is an ancestor
+	bpf_map_update_elem(pidmap_fd, &key, &pid, 0);
+	ancestor_pid = 0;
+	bpf_map_update_elem(pidmap_fd, &ancestor_key, &ancestor_pid, 0);
+	bpf_map_update_elem(resultmap_fd, &key, &nonexistant, 0);
+
+	root_fd = open("/", O_RDONLY);
+	if (CHECK(efd < 0, "open", "errno %d\n", errno))
+		goto close_prog;
+	close(root_fd);
+
+	err = bpf_map_lookup_elem(resultmap_fd, &key, &descendant_of_result);
+	if (CHECK(err, "bpf_map_lookup_elem", "err %d errno %d\n", err, errno))
+		goto close_pmu;
+	if (CHECK(descendant_of_result != 1, "PID 0 is our ancestor",
+		  "%d == %d\n", descendant_of_result, 1))
+		goto close_pmu;
+
+	// Test that we don't go over PID 0
+	bpf_map_update_elem(pidmap_fd, &key, &pid, 0);
+	ancestor_pid = -1;
+	bpf_map_update_elem(pidmap_fd, &ancestor_key, &ancestor_pid, 0);
+	bpf_map_update_elem(resultmap_fd, &key, &nonexistant, 0);
+
+	root_fd = open("/", O_RDONLY);
+	if (CHECK(efd < 0, "open", "errno %d\n", errno))
+		goto close_prog;
+	close(root_fd);
+
+	err = bpf_map_lookup_elem(resultmap_fd, &key, &descendant_of_result);
+	if (CHECK(err, "bpf_map_lookup_elem", "err %d errno %d\n", err, errno))
+		goto close_pmu;
+	if (CHECK(descendant_of_result != 0,
+		  "descendant_of does not go over PID 0", "%d == %d\n",
+		  descendant_of_result, 0))
+		goto close_pmu;
+
+	// Test that we are an ancestor of our child
+	pipe(pipefd);
+	child_pid = fork();
+	if (child_pid == -1) {
+		printf("fork failed\n");
+		goto close_pmu;
+	} else if (child_pid == 0) {
+		close(pipefd[1]);
+		read(pipefd[0], &marker, 1);
+
+		root_fd = open("/", O_RDONLY);
+		if (CHECK(efd < 0, "open", "errno %d\n", errno))
+			goto close_prog;
+		close(root_fd);
+
+		close(pipefd[0]);
+		_exit(EXIT_SUCCESS);
+	} else {
+		close(pipefd[0]);
+		bpf_map_update_elem(resultmap_fd, &key, &nonexistant, 0);
+		bpf_map_update_elem(pidmap_fd, &key, &child_pid, 0);
+		bpf_map_update_elem(pidmap_fd, &ancestor_key, &pid, 0);
+
+		write(pipefd[1], &marker, 1);
+		wait(NULL);
+		close(pipefd[1]);
+
+		err = bpf_map_lookup_elem(resultmap_fd, &key,
+					  &descendant_of_result);
+		if (CHECK(err, "bpf_map_lookup_elem", "err %d errno %d\n", err,
+			  errno))
+			goto close_pmu;
+		if (CHECK(descendant_of_result != 1, "descendant_of of parent",
+			  "%d == %d\n", descendant_of_result, 1))
+			goto close_pmu;
+	}
+
+	// Test that a child of ours doesn't belong to our ancestors
+	bpf_map_update_elem(pidmap_fd, &key, &pid, 0);
+	bpf_map_update_elem(resultmap_fd, &key, &nonexistant, 0);
+
+	pipe(pipefd);
+	child_pid = fork();
+	if (child_pid == -1) {
+		printf("fork failed\n");
+		goto close_pmu;
+	} else if (child_pid == 0) {
+		close(pipefd[1]);
+		read(pipefd[0], marker, 1);
+		close(pipefd[0]);
+		_exit(EXIT_SUCCESS);
+	} else {
+		close(pipefd[0]);
+
+		bpf_map_update_elem(pidmap_fd, &ancestor_key, &child_pid, 0);
+
+		root_fd = open("/", O_RDONLY);
+		if (CHECK(efd < 0, "open", "errno %d\n", errno))
+			goto close_prog;
+		close(root_fd);
+
+		write(pipefd[1], marker, 1);
+		wait(NULL);
+		close(pipefd[1]);
+
+		err = bpf_map_lookup_elem(resultmap_fd, &key,
+					  &descendant_of_result);
+		if (CHECK(err, "bpf_map_lookup_elem", "err %d errno %d\n", err,
+			  errno))
+			goto close_pmu;
+		if (CHECK(descendant_of_result != 0, "descendant_of of child",
+			  "%d == %d\n", descendant_of_result, 0))
+			goto close_pmu;
+	}
+
+	exit_code = EXIT_SUCCESS;
+	printf("%s:PASS\n", argv[0]);
+
+close_pmu:
+	close(pmu_fd);
+close_prog:
+	bpf_object__close(obj);
+fail:
+	return exit_code;
+}
-- 
2.17.1


^ permalink raw reply related

* [PATCH v6 bpf-next 2/3] bpf: sync kernel uapi headers
From: Javier Honduvilla Coto @ 2019-07-10 18:00 UTC (permalink / raw)
  To: netdev; +Cc: yhs, kernel-team, jonhaslam
In-Reply-To: <20190710180025.94726-1-javierhonduco@fb.com>

Sync kernel uapi headers.

Signed-off-by: Javier Honduvilla Coto <javierhonduco@fb.com>
---
 tools/include/uapi/linux/bpf.h | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 402208581b2d..505ee91898c2 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2710,6 +2710,23 @@ union bpf_attr {
  *		**-EPERM** if no permission to send the *sig*.
  *
  *		**-EAGAIN** if bpf program can try again.
+ *
+ * int bpf_descendant_of(pid_t pid)
+ *     Description
+ *             Determine if the process identified by *pid* is an ancestor
+ *             (or equal) of the user process executed in this tracing
+ *             context. This is useful when filtering events happening
+ *             to a process and all of its descendants.
+ *
+ *             Note that *pid* must be the pid from the global namespace
+ *             as the pids of the process chain will be resolved using the
+ *             initial pid namespace viewer context.
+ *     Return
+ *             * 1 if the process identified by *pid* is an ancestor, or equal,
+ *             of the currently executing process within the global pid
+ *             namespace
+ *
+ *             * 0 otherwise.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2821,7 +2838,8 @@ union bpf_attr {
 	FN(strtoul),			\
 	FN(sk_storage_get),		\
 	FN(sk_storage_delete),		\
-	FN(send_signal),
+	FN(send_signal),		\
+	FN(descendant_of),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
-- 
2.17.1


^ permalink raw reply related

* [PATCH v6 bpf-next 1/3] bpf: add bpf_descendant_of helper
From: Javier Honduvilla Coto @ 2019-07-10 18:00 UTC (permalink / raw)
  To: netdev; +Cc: yhs, kernel-team, jonhaslam
In-Reply-To: <20190710180025.94726-1-javierhonduco@fb.com>

This patch adds the bpf_descendant_of helper which accepts a PID and
returns 1 if the PID of the process currently being executed is a
descendant of it or if it's itself. Returns 0 otherwise. The passed
PID should be the one as seen from the "global" pid namespace as the
processes' PIDs in the hierarchy are resolved using the context of said
initial namespace.

This is very useful in tracing programs when we want to filter by a
given PID and all the children it might spawn. The current workarounds
most people implement for this purpose have issues:

- Attaching to process spawning syscalls and dynamically add those PIDs
  to some bpf map that would be used to filter is cumbersome and
potentially racy.
- Unrolling some loop to perform what this helper is doing consumes lots
  of instructions. That and the impossibility to jump backwards makes it
really hard to be correct in really large process chains.

Signed-off-by: Javier Honduvilla Coto <javierhonduco@fb.com>
---
 include/linux/bpf.h      |  1 +
 include/uapi/linux/bpf.h | 20 +++++++++++++++++++-
 kernel/bpf/core.c        |  1 +
 kernel/bpf/helpers.c     | 27 +++++++++++++++++++++++++++
 kernel/trace/bpf_trace.c |  2 ++
 5 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 18f4cc2c6acd..4e861138887d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1053,6 +1053,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
 extern const struct bpf_func_proto bpf_strtol_proto;
 extern const struct bpf_func_proto bpf_strtoul_proto;
 extern const struct bpf_func_proto bpf_tcp_sock_proto;
+extern const struct bpf_func_proto bpf_descendant_of_proto;
 
 /* Shared helpers among cBPF and eBPF. */
 void bpf_user_rnd_init_once(void);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 5695ab53e354..7e8c2bd654f5 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2713,6 +2713,23 @@ union bpf_attr {
  *		**-EPERM** if no permission to send the *sig*.
  *
  *		**-EAGAIN** if bpf program can try again.
+ *
+ * int bpf_descendant_of(pid_t pid)
+ *     Description
+ *             Determine if the process identified by *pid* is an ancestor
+ *             (or equal) of the user process executed in this tracing
+ *             context. This is useful when filtering events happening
+ *             to a process and all of its descendants.
+ *
+ *             Note that *pid* must be the pid from the global namespace
+ *             as the pids of the process chain will be resolved using the
+ *             initial pid namespace viewer context.
+ *     Return
+ *             * 1 if the process identified by *pid* is an ancestor, or equal,
+ *             of the currently executing process within the global pid
+ *             namespace
+ *
+ *             * 0 otherwise.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2824,7 +2841,8 @@ union bpf_attr {
 	FN(strtoul),			\
 	FN(sk_storage_get),		\
 	FN(sk_storage_delete),		\
-	FN(send_signal),
+	FN(send_signal),		\
+	FN(descendant_of),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 16079550db6d..8f7f0ec8cded 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2039,6 +2039,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
 const struct bpf_func_proto bpf_get_current_comm_proto __weak;
 const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
 const struct bpf_func_proto bpf_get_local_storage_proto __weak;
+const struct bpf_func_proto bpf_descendant_of_proto __weak;
 
 const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
 {
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 5e28718928ca..2214194e5f49 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -11,6 +11,7 @@
 #include <linux/uidgid.h>
 #include <linux/filter.h>
 #include <linux/ctype.h>
+#include <linux/init_task.h>
 
 #include "../../lib/kstrtox.h"
 
@@ -487,3 +488,29 @@ const struct bpf_func_proto bpf_strtoul_proto = {
 	.arg4_type	= ARG_PTR_TO_LONG,
 };
 #endif
+
+BPF_CALL_1(bpf_descendant_of, pid_t, pid)
+{
+	int result = 0;
+	struct task_struct *task = current;
+
+	if (pid == 0)
+		return 1;
+
+	while (task != &init_task) {
+		if (task->pid == pid) {
+			result = 1;
+			break;
+		}
+		task = rcu_dereference(task->real_parent);
+	}
+
+	return result;
+}
+
+const struct bpf_func_proto bpf_descendant_of_proto = {
+	.func		= bpf_descendant_of,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_ANYTHING,
+};
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ca1255d14576..797d7b4a8e9a 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -703,6 +703,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_prandom_u32_proto;
 	case BPF_FUNC_probe_read_str:
 		return &bpf_probe_read_str_proto;
+	case BPF_FUNC_descendant_of:
+		return &bpf_descendant_of_proto;
 #ifdef CONFIG_CGROUPS
 	case BPF_FUNC_get_current_cgroup_id:
 		return &bpf_get_current_cgroup_id_proto;
-- 
2.17.1


^ permalink raw reply related

* Re: linux-next: build failure after merge of the net-next tree
From: Jason Gunthorpe @ 2019-07-10 17:52 UTC (permalink / raw)
  To: Leon Romanovsky, Bernard Metzler
  Cc: Stephen Rothwell, Doug Ledford, David Miller, Networking,
	Linux Next Mailing List, Linux Kernel Mailing List
In-Reply-To: <20190709064346.GF7034@mtr-leonro.mtl.com>

On Tue, Jul 09, 2019 at 09:43:46AM +0300, Leon Romanovsky wrote:
> On Tue, Jul 09, 2019 at 01:56:36PM +1000, Stephen Rothwell wrote:
> > Hi all,
> >
> > After merging the net-next tree, today's linux-next build (x86_64
> > allmodconfig) failed like this:
> >
> > drivers/infiniband/sw/siw/siw_cm.c: In function 'siw_create_listen':
> > drivers/infiniband/sw/siw/siw_cm.c:1978:3: error: implicit declaration of function 'for_ifa'; did you mean 'fork_idle'? [-Werror=implicit-function-declaration]
> >    for_ifa(in_dev)
> >    ^~~~~~~
> >    fork_idle
> > drivers/infiniband/sw/siw/siw_cm.c:1978:18: error: expected ';' before '{' token
> >    for_ifa(in_dev)
> >                   ^
> >                   ;
> >    {
> >    ~
> >
> > Caused by commit
> >
> >   6c52fdc244b5 ("rdma/siw: connection management")
> >
> > from the rdma tree.  I don't know why this didn't fail after I mereged
> > that tree.
> 
> I had the same question, because I have this fix for a couple of days already.
> 
> From 56c9e15ec670af580daa8c3ffde9503af3042d67 Mon Sep 17 00:00:00 2001
> From: Leon Romanovsky <leonro@mellanox.com>
> Date: Sun, 7 Jul 2019 10:43:42 +0300
> Subject: [PATCH] Fixup to build SIW issue
> 
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
>  drivers/infiniband/sw/siw/siw_cm.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/siw/siw_cm.c b/drivers/infiniband/sw/siw/siw_cm.c
> index 8e618cb7261f..c883bf514341 100644
> +++ b/drivers/infiniband/sw/siw/siw_cm.c
> @@ -1954,6 +1954,7 @@ static void siw_drop_listeners(struct iw_cm_id *id)
>  int siw_create_listen(struct iw_cm_id *id, int backlog)
>  {
>  	struct net_device *dev = to_siw_dev(id->device)->netdev;
> +	const struct in_ifaddr *ifa;
>  	int rv = 0, listeners = 0;
> 
>  	siw_dbg(id->device, "id 0x%p: backlog %d\n", id, backlog);
> @@ -1975,8 +1976,7 @@ int siw_create_listen(struct iw_cm_id *id, int backlog)
>  			id, &s_laddr.sin_addr, ntohs(s_laddr.sin_port),
>  			&s_raddr->sin_addr, ntohs(s_raddr->sin_port));
> 
> -		for_ifa(in_dev)
> -		{
> +		in_dev_for_each_ifa_rcu(ifa, in_dev) {
>  			if (ipv4_is_zeronet(s_laddr.sin_addr.s_addr) ||

Hum. There is no rcu lock held here and we can't use RCU anyhow as
siw_listen_address will sleep.

I think this needs to use rtnl, as below. Bernard, please urgently
confirm. Thanks

diff --git a/drivers/infiniband/sw/siw/siw_cm.c b/drivers/infiniband/sw/siw/siw_cm.c
index 8e618cb7261f62..ee98e96a5bfaba 100644
--- a/drivers/infiniband/sw/siw/siw_cm.c
+++ b/drivers/infiniband/sw/siw/siw_cm.c
@@ -1965,6 +1965,7 @@ int siw_create_listen(struct iw_cm_id *id, int backlog)
 	 */
 	if (id->local_addr.ss_family == AF_INET) {
 		struct in_device *in_dev = in_dev_get(dev);
+		const struct in_ifaddr *ifa;
 		struct sockaddr_in s_laddr, *s_raddr;
 
 		memcpy(&s_laddr, &id->local_addr, sizeof(s_laddr));
@@ -1975,8 +1976,8 @@ int siw_create_listen(struct iw_cm_id *id, int backlog)
 			id, &s_laddr.sin_addr, ntohs(s_laddr.sin_port),
 			&s_raddr->sin_addr, ntohs(s_raddr->sin_port));
 
-		for_ifa(in_dev)
-		{
+		rtnl_lock();
+		in_dev_for_each_ifa_rtnl(ifa, in_dev) {
 			if (ipv4_is_zeronet(s_laddr.sin_addr.s_addr) ||
 			    s_laddr.sin_addr.s_addr == ifa->ifa_address) {
 				s_laddr.sin_addr.s_addr = ifa->ifa_address;
@@ -1988,7 +1989,7 @@ int siw_create_listen(struct iw_cm_id *id, int backlog)
 					listeners++;
 			}
 		}
-		endfor_ifa(in_dev);
+		rtnl_unlock();
 		in_dev_put(in_dev);
 	} else if (id->local_addr.ss_family == AF_INET6) {
 		struct inet6_dev *in6_dev = in6_dev_get(dev);

^ permalink raw reply related

* Re: [RFC bpf-next 2/8] bpf: extend list based insn patching infra to verification layer
From: Andrii Nakryiko @ 2019-07-10 17:50 UTC (permalink / raw)
  To: Jiong Wang
  Cc: Alexei Starovoitov, Daniel Borkmann, Edward Cree, Naveen N. Rao,
	Andrii Nakryiko, Jakub Kicinski, bpf, Networking, oss-drivers
In-Reply-To: <1562275611-31790-3-git-send-email-jiong.wang@netronome.com>

On Thu, Jul 4, 2019 at 2:32 PM Jiong Wang <jiong.wang@netronome.com> wrote:
>
> Verification layer also needs to handle auxiliar info as well as adjusting
> subprog start.
>
> At this layer, insns inside patch buffer could be jump, but they should
> have been resolved, meaning they shouldn't jump to insn outside of the
> patch buffer. Lineration function for this layer won't touch insns inside
> patch buffer.
>
> Adjusting subprog is finished along with adjusting jump target when the
> input will cover bpf to bpf call insn, re-register subprog start is cheap.
> But adjustment when there is insn deleteion is not considered yet.
>
> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
> ---
>  kernel/bpf/verifier.c | 150 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 150 insertions(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index a2e7637..2026d64 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -8350,6 +8350,156 @@ static void opt_hard_wire_dead_code_branches(struct bpf_verifier_env *env)
>         }
>  }
>
> +/* Linearize bpf list insn to array (verifier layer). */
> +static struct bpf_verifier_env *
> +verifier_linearize_list_insn(struct bpf_verifier_env *env,
> +                            struct bpf_list_insn *list)

It's unclear why this returns env back? It's not allocating a new env,
so it's weird and unnecessary. Just return error code.

> +{
> +       u32 *idx_map, idx, orig_cnt, fini_cnt = 0;
> +       struct bpf_subprog_info *new_subinfo;
> +       struct bpf_insn_aux_data *new_data;
> +       struct bpf_prog *prog = env->prog;
> +       struct bpf_verifier_env *ret_env;
> +       struct bpf_insn *insns, *insn;
> +       struct bpf_list_insn *elem;
> +       int ret;
> +
> +       /* Calculate final size. */
> +       for (elem = list; elem; elem = elem->next)
> +               if (!(elem->flag & LIST_INSN_FLAG_REMOVED))
> +                       fini_cnt++;
> +
> +       orig_cnt = prog->len;
> +       insns = prog->insnsi;
> +       /* If prog length remains same, nothing else to do. */
> +       if (fini_cnt == orig_cnt) {
> +               for (insn = insns, elem = list; elem; elem = elem->next, insn++)
> +                       *insn = elem->insn;
> +               return env;
> +       }
> +       /* Realloc insn buffer when necessary. */
> +       if (fini_cnt > orig_cnt)
> +               prog = bpf_prog_realloc(prog, bpf_prog_size(fini_cnt),
> +                                       GFP_USER);
> +       if (!prog)
> +               return ERR_PTR(-ENOMEM);
> +       insns = prog->insnsi;
> +       prog->len = fini_cnt;
> +       ret_env = env;
> +
> +       /* idx_map[OLD_IDX] = NEW_IDX */
> +       idx_map = kvmalloc(orig_cnt * sizeof(u32), GFP_KERNEL);
> +       if (!idx_map)
> +               return ERR_PTR(-ENOMEM);
> +       memset(idx_map, 0xff, orig_cnt * sizeof(u32));
> +
> +       /* Use the same alloc method used when allocating env->insn_aux_data. */
> +       new_data = vzalloc(array_size(sizeof(*new_data), fini_cnt));
> +       if (!new_data) {
> +               kvfree(idx_map);
> +               return ERR_PTR(-ENOMEM);
> +       }
> +
> +       /* Copy over insn + calculate idx_map. */
> +       for (idx = 0, elem = list; elem; elem = elem->next) {
> +               int orig_idx = elem->orig_idx - 1;
> +
> +               if (orig_idx >= 0) {
> +                       idx_map[orig_idx] = idx;
> +
> +                       if (elem->flag & LIST_INSN_FLAG_REMOVED)
> +                               continue;
> +
> +                       new_data[idx] = env->insn_aux_data[orig_idx];
> +
> +                       if (elem->flag & LIST_INSN_FLAG_PATCHED)
> +                               new_data[idx].zext_dst =
> +                                       insn_has_def32(env, &elem->insn);
> +               } else {
> +                       new_data[idx].seen = true;
> +                       new_data[idx].zext_dst = insn_has_def32(env,
> +                                                               &elem->insn);
> +               }
> +               insns[idx++] = elem->insn;
> +       }
> +
> +       new_subinfo = kvzalloc(sizeof(env->subprog_info), GFP_KERNEL);
> +       if (!new_subinfo) {
> +               kvfree(idx_map);
> +               vfree(new_data);
> +               return ERR_PTR(-ENOMEM);
> +       }
> +       memcpy(new_subinfo, env->subprog_info, sizeof(env->subprog_info));
> +       memset(env->subprog_info, 0, sizeof(env->subprog_info));
> +       env->subprog_cnt = 0;
> +       env->prog = prog;
> +       ret = add_subprog(env, 0);
> +       if (ret < 0) {
> +               ret_env = ERR_PTR(ret);
> +               goto free_all_ret;
> +       }
> +       /* Relocate jumps using idx_map.
> +        *   old_dst = jmp_insn.old_target + old_pc + 1;
> +        *   new_dst = idx_map[old_dst] = jmp_insn.new_target + new_pc + 1;
> +        *   jmp_insn.new_target = new_dst - new_pc - 1;
> +        */
> +       for (idx = 0, elem = list; elem; elem = elem->next) {
> +               int orig_idx = elem->orig_idx;
> +
> +               if (elem->flag & LIST_INSN_FLAG_REMOVED)
> +                       continue;
> +               if ((elem->flag & LIST_INSN_FLAG_PATCHED) || !orig_idx) {
> +                       idx++;
> +                       continue;
> +               }
> +
> +               ret = bpf_jit_adj_imm_off(&insns[idx], orig_idx - 1, idx,
> +                                         idx_map);
> +               if (ret < 0) {
> +                       ret_env = ERR_PTR(ret);
> +                       goto free_all_ret;
> +               }
> +               /* Recalculate subprog start as we are at bpf2bpf call insn. */
> +               if (ret > 0) {
> +                       ret = add_subprog(env, idx + insns[idx].imm + 1);
> +                       if (ret < 0) {
> +                               ret_env = ERR_PTR(ret);
> +                               goto free_all_ret;
> +                       }
> +               }
> +               idx++;
> +       }
> +       if (ret < 0) {
> +               ret_env = ERR_PTR(ret);
> +               goto free_all_ret;
> +       }
> +
> +       env->subprog_info[env->subprog_cnt].start = fini_cnt;
> +       for (idx = 0; idx <= env->subprog_cnt; idx++)
> +               new_subinfo[idx].start = env->subprog_info[idx].start;
> +       memcpy(env->subprog_info, new_subinfo, sizeof(env->subprog_info));
> +
> +       /* Adjust linfo.
> +        * FIXME: no support for insn removal at the moment.
> +        */
> +       if (prog->aux->nr_linfo) {
> +               struct bpf_line_info *linfo = prog->aux->linfo;
> +               u32 nr_linfo = prog->aux->nr_linfo;
> +
> +               for (idx = 0; idx < nr_linfo; idx++)
> +                       linfo[idx].insn_off = idx_map[linfo[idx].insn_off];
> +       }
> +       vfree(env->insn_aux_data);
> +       env->insn_aux_data = new_data;
> +       goto free_mem_list_ret;
> +free_all_ret:
> +       vfree(new_data);
> +free_mem_list_ret:
> +       kvfree(new_subinfo);
> +       kvfree(idx_map);
> +       return ret_env;
> +}
> +
>  static int opt_remove_dead_code(struct bpf_verifier_env *env)
>  {
>         struct bpf_insn_aux_data *aux_data = env->insn_aux_data;
> --
> 2.7.4
>

^ permalink raw reply

* Re: [RFC bpf-next 1/8] bpf: introducing list based insn patching infra to core layer
From: Andrii Nakryiko @ 2019-07-10 17:49 UTC (permalink / raw)
  To: Jiong Wang
  Cc: Alexei Starovoitov, Daniel Borkmann, Edward Cree, Naveen N. Rao,
	Andrii Nakryiko, Jakub Kicinski, bpf, Networking, oss-drivers
In-Reply-To: <1562275611-31790-2-git-send-email-jiong.wang@netronome.com>

On Thu, Jul 4, 2019 at 2:32 PM Jiong Wang <jiong.wang@netronome.com> wrote:
>
> This patch introduces list based bpf insn patching infra to bpf core layer
> which is lower than verification layer.
>
> This layer has bpf insn sequence as the solo input, therefore the tasks
> to be finished during list linerization is:
>   - copy insn
>   - relocate jumps
>   - relocation line info.
>
> Suggested-by: Alexei Starovoitov <ast@kernel.org>
> Suggested-by: Edward Cree <ecree@solarflare.com>
> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
> ---
>  include/linux/filter.h |  25 +++++
>  kernel/bpf/core.c      | 268 +++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 293 insertions(+)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 1fe53e7..1fea68c 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -842,6 +842,31 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
>                                        const struct bpf_insn *patch, u32 len);
>  int bpf_remove_insns(struct bpf_prog *prog, u32 off, u32 cnt);
>
> +int bpf_jit_adj_imm_off(struct bpf_insn *insn, int old_idx, int new_idx,
> +                       int idx_map[]);
> +
> +#define LIST_INSN_FLAG_PATCHED 0x1
> +#define LIST_INSN_FLAG_REMOVED 0x2
> +struct bpf_list_insn {
> +       struct bpf_insn insn;
> +       struct bpf_list_insn *next;
> +       s32 orig_idx;
> +       u32 flag;
> +};
> +
> +struct bpf_list_insn *bpf_create_list_insn(struct bpf_prog *prog);
> +void bpf_destroy_list_insn(struct bpf_list_insn *list);
> +/* Replace LIST_INSN with new list insns generated from PATCH. */
> +struct bpf_list_insn *bpf_patch_list_insn(struct bpf_list_insn *list_insn,
> +                                         const struct bpf_insn *patch,
> +                                         u32 len);
> +/* Pre-patch list_insn with insns inside PATCH, meaning LIST_INSN is not
> + * touched. New list insns are inserted before it.
> + */
> +struct bpf_list_insn *bpf_prepatch_list_insn(struct bpf_list_insn *list_insn,
> +                                            const struct bpf_insn *patch,
> +                                            u32 len);
> +
>  void bpf_clear_redirect_map(struct bpf_map *map);
>
>  static inline bool xdp_return_frame_no_direct(void)
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index e2c1b43..e60703e 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -502,6 +502,274 @@ int bpf_remove_insns(struct bpf_prog *prog, u32 off, u32 cnt)
>         return WARN_ON_ONCE(bpf_adj_branches(prog, off, off + cnt, off, false));
>  }
>
> +int bpf_jit_adj_imm_off(struct bpf_insn *insn, int old_idx, int new_idx,
> +                       s32 idx_map[])
> +{
> +       u8 code = insn->code;
> +       s64 imm;
> +       s32 off;
> +
> +       if (BPF_CLASS(code) != BPF_JMP && BPF_CLASS(code) != BPF_JMP32)
> +               return 0;
> +
> +       if (BPF_CLASS(code) == BPF_JMP &&
> +           (BPF_OP(code) == BPF_EXIT ||
> +            (BPF_OP(code) == BPF_CALL && insn->src_reg != BPF_PSEUDO_CALL)))
> +               return 0;
> +
> +       /* BPF to BPF call. */
> +       if (BPF_OP(code) == BPF_CALL) {
> +               imm = idx_map[old_idx + insn->imm + 1] - new_idx - 1;
> +               if (imm < S32_MIN || imm > S32_MAX)
> +                       return -ERANGE;
> +               insn->imm = imm;
> +               return 1;
> +       }
> +
> +       /* Jump. */
> +       off = idx_map[old_idx + insn->off + 1] - new_idx - 1;
> +       if (off < S16_MIN || off > S16_MAX)
> +               return -ERANGE;
> +       insn->off = off;
> +       return 0;
> +}
> +
> +void bpf_destroy_list_insn(struct bpf_list_insn *list)
> +{
> +       struct bpf_list_insn *elem, *next;
> +
> +       for (elem = list; elem; elem = next) {
> +               next = elem->next;
> +               kvfree(elem);
> +       }
> +}
> +
> +struct bpf_list_insn *bpf_create_list_insn(struct bpf_prog *prog)
> +{
> +       unsigned int idx, len = prog->len;
> +       struct bpf_list_insn *hdr, *prev;
> +       struct bpf_insn *insns;
> +
> +       hdr = kvzalloc(sizeof(*hdr), GFP_KERNEL);
> +       if (!hdr)
> +               return ERR_PTR(-ENOMEM);
> +
> +       insns = prog->insnsi;
> +       hdr->insn = insns[0];
> +       hdr->orig_idx = 1;
> +       prev = hdr;

I'm not sure why you need this "prologue" instead of handling first
instruction uniformly in for loop below?

> +
> +       for (idx = 1; idx < len; idx++) {
> +               struct bpf_list_insn *node = kvzalloc(sizeof(*node),
> +                                                     GFP_KERNEL);
> +
> +               if (!node) {
> +                       /* Destroy what has been allocated. */
> +                       bpf_destroy_list_insn(hdr);
> +                       return ERR_PTR(-ENOMEM);
> +               }
> +               node->insn = insns[idx];
> +               node->orig_idx = idx + 1;

Why orig_idx is 1-based? It's really confusing.

> +               prev->next = node;
> +               prev = node;
> +       }
> +
> +       return hdr;
> +}
> +
> +/* Linearize bpf list insn to array. */
> +static struct bpf_prog *bpf_linearize_list_insn(struct bpf_prog *prog,
> +                                               struct bpf_list_insn *list)
> +{
> +       u32 *idx_map, idx, prev_idx, fini_cnt = 0, orig_cnt = prog->len;
> +       struct bpf_insn *insns, *insn;
> +       struct bpf_list_insn *elem;
> +
> +       /* Calculate final size. */
> +       for (elem = list; elem; elem = elem->next)
> +               if (!(elem->flag & LIST_INSN_FLAG_REMOVED))
> +                       fini_cnt++;
> +
> +       insns = prog->insnsi;
> +       /* If prog length remains same, nothing else to do. */
> +       if (fini_cnt == orig_cnt) {
> +               for (insn = insns, elem = list; elem; elem = elem->next, insn++)
> +                       *insn = elem->insn;
> +               return prog;
> +       }
> +       /* Realloc insn buffer when necessary. */
> +       if (fini_cnt > orig_cnt)
> +               prog = bpf_prog_realloc(prog, bpf_prog_size(fini_cnt),
> +                                       GFP_USER);
> +       if (!prog)
> +               return ERR_PTR(-ENOMEM);
> +       insns = prog->insnsi;
> +       prog->len = fini_cnt;
> +
> +       /* idx_map[OLD_IDX] = NEW_IDX */
> +       idx_map = kvmalloc(orig_cnt * sizeof(u32), GFP_KERNEL);
> +       if (!idx_map)
> +               return ERR_PTR(-ENOMEM);
> +       memset(idx_map, 0xff, orig_cnt * sizeof(u32));
> +
> +       /* Copy over insn + calculate idx_map. */
> +       for (idx = 0, elem = list; elem; elem = elem->next) {
> +               int orig_idx = elem->orig_idx - 1;
> +
> +               if (orig_idx >= 0) {
> +                       idx_map[orig_idx] = idx;
> +
> +                       if (elem->flag & LIST_INSN_FLAG_REMOVED)
> +                               continue;
> +               }
> +               insns[idx++] = elem->insn;
> +       }
> +
> +       /* Relocate jumps using idx_map.
> +        *   old_dst = jmp_insn.old_target + old_pc + 1;
> +        *   new_dst = idx_map[old_dst] = jmp_insn.new_target + new_pc + 1;
> +        *   jmp_insn.new_target = new_dst - new_pc - 1;
> +        */
> +       for (idx = 0, prev_idx = 0, elem = list; elem; elem = elem->next) {
> +               int ret, orig_idx;
> +
> +               /* A removed insn doesn't increase new_pc */
> +               if (elem->flag & LIST_INSN_FLAG_REMOVED)
> +                       continue;
> +
> +               orig_idx = elem->orig_idx - 1;
> +               ret = bpf_jit_adj_imm_off(&insns[idx],
> +                                         orig_idx >= 0 ? orig_idx : prev_idx,
> +                                         idx, idx_map);
> +               idx++;
> +               if (ret < 0) {
> +                       kvfree(idx_map);
> +                       return ERR_PTR(ret);
> +               }
> +               if (orig_idx >= 0)
> +                       /* Record prev_idx. it is used for relocating jump insn
> +                        * inside patch buffer. For example, when doing jit
> +                        * blinding, a jump could be moved to some other
> +                        * positions inside the patch buffer, and its old_dst
> +                        * could be calculated using prev_idx.
> +                        */
> +                       prev_idx = orig_idx;
> +       }
> +
> +       /* Adjust linfo.
> +        *
> +        * NOTE: the prog reached core layer has been adjusted to contain insns
> +        *       for single function, however linfo contains information for
> +        *       whole program, so we need to make sure linfo beyond current
> +        *       function is handled properly.
> +        */
> +       if (prog->aux->nr_linfo) {
> +               u32 linfo_idx, insn_start, insn_end, nr_linfo, idx, delta;
> +               struct bpf_line_info *linfo;
> +
> +               linfo_idx = prog->aux->linfo_idx;
> +               linfo = &prog->aux->linfo[linfo_idx];
> +               insn_start = linfo[0].insn_off;
> +               insn_end = insn_start + orig_cnt;
> +               nr_linfo = prog->aux->nr_linfo - linfo_idx;
> +               delta = fini_cnt - orig_cnt;
> +               for (idx = 0; idx < nr_linfo; idx++) {
> +                       int adj_off;
> +
> +                       if (linfo[idx].insn_off >= insn_end) {
> +                               linfo[idx].insn_off += delta;
> +                               continue;
> +                       }
> +
> +                       adj_off = linfo[idx].insn_off - insn_start;
> +                       linfo[idx].insn_off = idx_map[adj_off] + insn_start;
> +               }
> +       }
> +       kvfree(idx_map);
> +
> +       return prog;
> +}
> +
> +struct bpf_list_insn *bpf_patch_list_insn(struct bpf_list_insn *list_insn,
> +                                         const struct bpf_insn *patch,
> +                                         u32 len)
> +{
> +       struct bpf_list_insn *prev, *next;
> +       u32 insn_delta = len - 1;
> +       u32 idx;
> +
> +       list_insn->insn = *patch;
> +       list_insn->flag |= LIST_INSN_FLAG_PATCHED;
> +
> +       /* Since our patchlet doesn't expand the image, we're done. */
> +       if (insn_delta == 0)
> +               return list_insn;
> +
> +       len--;
> +       patch++;
> +
> +       prev = list_insn;
> +       next = list_insn->next;
> +       for (idx = 0; idx < len; idx++) {
> +               struct bpf_list_insn *node = kvzalloc(sizeof(*node),
> +                                                     GFP_KERNEL);
> +
> +               if (!node) {
> +                       /* Link what's allocated, so list destroyer could
> +                        * free them.
> +                        */
> +                       prev->next = next;

Why this special handling, if you can just insert element so that list
is well-formed after each instruction?

> +                       return ERR_PTR(-ENOMEM);
> +               }
> +
> +               node->insn = patch[idx];
> +               prev->next = node;
> +               prev = node;

E.g.,

node->next = next;
prev->next = node;
prev = node;

> +       }
> +
> +       prev->next = next;

And no need for this either.

> +       return prev;
> +}
> +
> +struct bpf_list_insn *bpf_prepatch_list_insn(struct bpf_list_insn *list_insn,
> +                                            const struct bpf_insn *patch,
> +                                            u32 len)

prepatch and patch functions should share the same logic.

Prepend is just that - insert all instructions from buffer before current insns.
Patch -> replace current one with first instriction in a buffer, then
prepend remaining ones before the next instruction (so patch should
call info prepend, with adjusted count and array pointer).

> +{
> +       struct bpf_list_insn *prev, *node, *begin_node;
> +       u32 idx;
> +
> +       if (!len)
> +               return list_insn;
> +
> +       node = kvzalloc(sizeof(*node), GFP_KERNEL);
> +       if (!node)
> +               return ERR_PTR(-ENOMEM);
> +       node->insn = patch[0];
> +       begin_node = node;
> +       prev = node;
> +
> +       for (idx = 1; idx < len; idx++) {
> +               node = kvzalloc(sizeof(*node), GFP_KERNEL);
> +               if (!node) {
> +                       node = begin_node;
> +                       /* Release what's has been allocated. */
> +                       while (node) {
> +                               struct bpf_list_insn *next = node->next;
> +
> +                               kvfree(node);
> +                               node = next;
> +                       }
> +                       return ERR_PTR(-ENOMEM);
> +               }
> +               node->insn = patch[idx];
> +               prev->next = node;
> +               prev = node;
> +       }
> +
> +       prev->next = list_insn;
> +       return begin_node;
> +}
> +
>  void bpf_prog_kallsyms_del_subprogs(struct bpf_prog *fp)
>  {
>         int i;
> --
> 2.7.4
>

^ permalink raw reply

* Re: [RFC bpf-next 0/8] bpf: accelerate insn patching speed
From: Andrii Nakryiko @ 2019-07-10 17:39 UTC (permalink / raw)
  To: Jiong Wang
  Cc: Alexei Starovoitov, Daniel Borkmann, Edward Cree, Naveen N. Rao,
	Andrii Nakryiko, Jakub Kicinski, bpf, Networking, oss-drivers
In-Reply-To: <1562275611-31790-1-git-send-email-jiong.wang@netronome.com>

On Thu, Jul 4, 2019 at 2:31 PM Jiong Wang <jiong.wang@netronome.com> wrote:
>
> This is an RFC based on latest bpf-next about acclerating insn patching
> speed, it is now near the shape of final PATCH set, and we could see the
> changes migrating to list patching would brings, so send out for
> comments. Most of the info are in cover letter. I splitted the code in a
> way to show API migration more easily.


Hey Jiong,


Sorry, took me a while to get to this and learn more about instruction
patching. Overall this looks good and I think is a good direction.
I'll post high-level feedback here, and some more
implementation-specific ones in corresponding patches.


>
> Test Results
> ===
>   - Full pass on test_verifier/test_prog/test_prog_32 under all three
>     modes (interpreter, JIT, JIT with blinding).
>
>   - Benchmarking shows 10 ~ 15x faster on medium sized prog, and reduce
>     patching time from 5100s (nearly one and a half hour) to less than
>     0.5s for 1M insn patching.
>
> Known Issues
> ===
>   - The following warning is triggered when running scale test which
>     contains 1M insns and patching:
>       warning of mm/page_alloc.c:4639 __alloc_pages_nodemask+0x29e/0x330
>
>     This is caused by existing code, it can be reproduced on bpf-next
>     master with jit blinding enabled, then run scale unit test, it will
>     shown up after half an hour. After this set, patching is very fast, so
>     it shows up quickly.
>
>   - No line info adjustment support when doing insn delete, subprog adj
>     is with bug when doing insn delete as well. Generally, removal of insns
>     could possibly cause remove of entire line or subprog, therefore
>     entries of prog->aux->linfo or env->subprog needs to be deleted. I
>     don't have good idea and clean code for integrating this into the
>     linearization code at the moment, will do more experimenting,
>     appreciate ideas and suggestions on this.

Is there any specific problem to detect which line info to delete? Or
what am I missing besides careful implementation?

>
>     Insn delete doesn't happen on normal programs, for example Cilium
>     benchmarks, and happens rarely on test_progs, so the test coverage is
>     not good. That's also why this RFC have a full pass on selftest with
>     this known issue.

I hope you'll add test for deletion (and w/ corresponding line info)
in final patch set :)

>
>   - Could further use mem pool to accelerate the speed, changes are trivial
>     on top of this RFC, and could be 2x extra faster. Not included in this
>     RFC as reducing the algo complexity from quadratic to linear of insn
>     number is the first step.

Honestly, I think that would add more complexity than necessary, and I
think we can further speed up performance without that, see below.

>
> Background
> ===
> This RFC aims to accelerate BPF insn patching speed, patching means expand
> one bpf insn at any offset inside bpf prog into a set of new insns, or
> remove insns.
>
> At the moment, insn patching is quadratic of insn number, this is due to
> branch targets of jump insns needs to be adjusted, and the algo used is:
>
>   for insn inside prog
>     patch insn + regeneate bpf prog
>     for insn inside new prog
>       adjust jump target
>
> This is causing significant time spending when a bpf prog requires large
> amount of patching on different insns. Benchmarking shows it could take
> more than half minutes to finish patching when patching number is more
> than 50K, and the time spent could be more than one hour when patching
> number is around 1M.
>
>   15000   :    3s
>   45000   :   29s
>   95000   :  125s
>   195000  :  712s
>   1000000 : 5100s
>
> This RFC introduces new patching infrastructure. Before doing insn
> patching, insns in bpf prog are turned into a singly linked list, insert
> new insns just insert new list node, delete insns just set delete flag.
> And finally, the list is linearized back into array, and branch target
> adjustment is done for all jump insns during linearization. This algo
> brings the time complexity from quadratic to linear of insn number.
>
> Benchmarking shows the new patching infrastructure could be 10 ~ 15x faster
> on medium sized prog, and for a 1M patching it reduce the time from 5100s
> to less than 0.5s.
>
> Patching API
> ===
> Insn patching could happen on two layers inside BPF. One is "core layer"
> where only BPF insns are patched. The other is "verification layer" where
> insns have corresponding aux info as well high level subprog info, so
> insn patching means aux info needs to be patched as well, and subprog info
> needs to be adjusted. BPF prog also has debug info associated, so line info
> should always be updated after insn patching.
>
> So, list creation, destroy, insert, delete is the same for both layer,
> but lineration is different. "verification layer" patching require extra
> work. Therefore the patch APIs are:
>
>    list creation:                bpf_create_list_insn
>    list patch:                   bpf_patch_list_insn
>    list pre-patch:               bpf_prepatch_list_insn

I think pre-patch name is very confusing, until I read full
description I couldn't understand what it's supposed to be used for.
Speaking of bpf_patch_list_insn, patch is also generic enough to leave
me wondering whether instruction buffer is inserted after instruction,
or instruction is replaced with a bunch of instructions.

So how about two more specific names:
bpf_patch_list_insn -> bpf_list_insn_replace (meaning replace given
instruction with a list of patch instructions)
bpf_prepatch_list_insn -> bpf_list_insn_prepend (well, I think this
one is pretty clear).

>    list lineration (core layer): prog = bpf_linearize_list_insn(prog, list)
>    list lineration (veri layer): env = verifier_linearize_list_insn(env, list)

These two functions are both quite involved, as well as share a lot of
common code. I'd rather have one linearize instruction, that takes env
as an optional parameter. If env is specified (which is the case for
all cases except for constant blinding pass), then adjust aux_data and
subprogs along the way.

This would keep logic less duplicated and shouldn't complexity beyond
few null checks in few places.

>    list destroy:                 bpf_destroy_list_insn
>

I'd also add a macro foreach_list_insn instead of explicit for loops
in multiple places. That would also allow to skip deleted instructions
transparently.

> list patch could change the insn at patch point, it will invalid the aux

typo: invalid -> invalidate

> info at patching point. list pre-patch insert new insns before patch point
> where the insn and associated aux info are not touched, it is used for
> example in convert_ctx_access when generating prologue.
>
> Typical API sequence for one patching pass:
>
>    struct bpf_list_insn list = bpf_create_list_insn(struct bpf_prog);
>    for (elem = list; elem; elem = elem->next)
>       patch_buf = gen_patch_buf_logic;
>       elem = bpf_patch_list_insn(elem, patch_buf, cnt);
>    bpf_prog = bpf_linearize_list_insn(list)
>    bpf_destroy_list_insn(list)
>
> Several patching passes could also share the same list:
>
>    struct bpf_list_insn list = bpf_create_list_insn(struct bpf_prog);
>    for (elem = list; elem; elem = elem->next)
>       patch_buf = gen_patch_buf_logic1;
>       elem = bpf_patch_list_insn(elem, patch_buf, cnt);
>    for (elem = list; elem; elem = elem->next)
>       patch_buf = gen_patch_buf_logic2;
>       elem = bpf_patch_list_insn(elem, patch_buf, cnt);
>    bpf_prog = bpf_linearize_list_insn(list)
>    bpf_destroy_list_insn(list)
>
> but note new inserted insns int early passes won't have aux info except
> zext info. So, if one patch pass requires all aux info updated and
> recalculated for all insns including those pathced, it should first
> linearize the old list, then re-create the list. The RFC always create and
> linearize the list for each migrated patching pass separately.

I think we should do just one list creation, few passes of patching
and then linearize once. That will save quite a lot of memory
allocation and will speed up a lot of things. All the verifier
patching happens one after the other without any other functionality
in between, so there shouldn't be any problem.

As for aux_data. We can solve that even more simply and reliably by
storing a pointer along the struct bpf_list_insn (btw, how about
calling it bpf_patchable_insn?).

Here's how I propose to represent this patchable instruction:

struct bpf_list_insn {
       struct bpf_insn insn;
       struct bpf_list_insn *next;
       struct bpf_list_insn *target;
       struct bpf_insn_aux_data *aux_data;
       s32 orig_idx; // can repurpose this to have three meanings:
                     // -2 - deleted
                     // -1 - patched/inserted insn
                     // >=0 - original idx
};

The idea would be as follows:
1. when creating original list, target pointer will point directly to
a patchable instruction wrapper for jumps/calls. This will allow to
stop tracking and re-calculating jump offsets and instruction indicies
until linearization.
2. aux_data is also filled at that point. Later at linearization time
you'd just iterate over all the instructions in final order and copy
original aux_data, if it's present. And then just repace env's
aux_data array at the end, should be very simple and fast.
3. during fix_bpf_calls, zext, ctx rewrite passes, we'll reuse the
same list of instructions and those passes will just keep inserting
instruction buffers. Given we have restriction that all the jumps are
only within patch buffer, it will be trivial to construct proper
patchable instruction wrappers for newly added instructions, with NULL
for aux_data and possibly non-NULL target (if it's a JMP insn).
4. After those passes, linearize, adjust subprogs (for this you'll
probably still need to create index mapping, right?), copy or create
new aux_data.
5. Done.

What do you think? I think this should be overall simpler and faster.
But let me know if I'm missing something.

>
> Compared with old patching code, this new infrastructure has much less core
> code, even though the final code has a couple of extra lines but that is
> mostly due to for list based infrastructure, we need to do more error
> checks, so the list and associated aux data structure could be freed when
> errors happens.
>
> Patching Restrictions
> ===
>   - For core layer, the linearization assume no new jumps inside patch buf.
>     Currently, the only user of this layer is jit blinding.
>   - For verifier layer, there could be new jumps inside patch buf, but
>     they should have branch target resolved themselves, meaning new jumps
>     doesn't jump to insns out of the patch buf. This is the case for all
>     existing verifier layer users.
>   - bpf_insn_aux_data for all patched insns including the one at patch
>     point are invalidated, only 32-bit zext info will be recalcuated.
>     If the aux data of insn at patch point needs to be retained, it is
>     purely insn insertion, so need to use the pre-patch API.
>
> I plan to send out a PATCH set once I finished insn deletion line info adj
> support, please have a looks at this RFC, and appreciate feedbacks.
>
> Jiong Wang (8):
>   bpf: introducing list based insn patching infra to core layer
>   bpf: extend list based insn patching infra to verification layer
>   bpf: migrate jit blinding to list patching infra
>   bpf: migrate convert_ctx_accesses to list patching infra
>   bpf: migrate fixup_bpf_calls to list patching infra
>   bpf: migrate zero extension opt to list patching infra
>   bpf: migrate insn remove to list patching infra
>   bpf: delete all those code around old insn patching infrastructure
>
>  include/linux/bpf_verifier.h |   1 -
>  include/linux/filter.h       |  27 +-
>  kernel/bpf/core.c            | 431 +++++++++++++++++-----------
>  kernel/bpf/verifier.c        | 649 +++++++++++++++++++------------------------
>  4 files changed, 580 insertions(+), 528 deletions(-)
>
> --
> 2.7.4
>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox