Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next v2 16/21] net/ipv6: Add gfp_flags to route add functions
From: Eric Dumazet @ 2018-04-18 17:30 UTC (permalink / raw)
  To: David Ahern, Eric Dumazet, netdev
  Cc: davem, idosch, roopa, weiwan, kafai, yoshfuji
In-Reply-To: <77bf271b-7a66-8b53-ea6f-be86bb65590c@gmail.com>



On 04/18/2018 10:10 AM, David Ahern wrote:
> On 4/18/18 11:06 AM, Eric Dumazet wrote:
>>
>>
>> On 04/17/2018 05:33 PM, David Ahern wrote:
>>> Most FIB entries can be added using memory allocated with GFP_KERNEL.
>>> Add gfp_flags to ip6_route_add and addrconf_dst_alloc. Code paths that
>>> can be reached from the packet path (e.g., ndisc and autoconfig) or
>>> atomic notifiers use GFP_ATOMIC; paths from user context (adding
>>> addresses and routes) use GFP_KERNEL.
>>>
>>> Signed-off-by: David Ahern <dsahern@gmail.com>
>>> ---
>>
>>
>> Hmmm....
>>
>> ipv6_ifa_notify() calls __ipv6_ifa_notify() under rcu_read_lock_bh()/rcu_read_unlock_bh()
>>
>> So using GFP_KERNEL in __ipv6_ifa_notify() is certainly not allowed.
>>
>>
> 
> Thanks for catching that. Will add fix to my followup set.
> 
BTW, I am not sure why we use rcu_read_lock_bh()/rcu_read_unlock_bh() there :/

Maybe it is no longer needed.

^ permalink raw reply

* Re: [PATCH 1/2] net: netsec: enable tx-irq during open callback
From: David Miller @ 2018-04-18 17:30 UTC (permalink / raw)
  To: jassisinghbrar; +Cc: netdev, masahisa.kojima, ard.biesheuvel, jaswinder.singh
In-Reply-To: <CABb+yY3ikHT54JTH0eFDjTEpV=z1aNNrTeagC6dWbRvODCPg2w@mail.gmail.com>

From: Jassi Brar <jassisinghbrar@gmail.com>
Date: Wed, 18 Apr 2018 18:27:59 +0530

> Just to make sure, let me please mention that c009f413b79de52 and
> 9a00b697ce31e are very much needed in stable kernel. Without these we
> couldn't install any OS over network.

Ok, both patches in this series queued up for -stable.

^ permalink raw reply

* Re: [PATCH RFC net-next 00/11] udp gso
From: David Miller @ 2018-04-18 17:28 UTC (permalink / raw)
  To: sowmini.varadhan
  Cc: willemdebruijn.kernel, sridhar.samudrala, netdev, willemb
In-Reply-To: <20180418123103.GC19633@oracle.com>

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Wed, 18 Apr 2018 08:31:03 -0400

> However, I share Sridhar's concerns about the very fundamental change
> to UDP message boundary semantics here.  There is actually no such thing
> as a "segment" in udp, so in general this feature makes me a little
> uneasy.  Well behaved udp applications should already be sending mtu
> sized datagrams. And the not-so-well-behaved ones are probably relying
> on IP fragmentation/reassembly to take care of datagram boundary semantics
> for them?
> 
> As Sridhar points out, the feature is not really "negotiated" - one side
> unilaterally sets the option. If the receiver is a classic/POSIX UDP
> implementation, it will have no way of knowing that message boundaries
> have been re-adjusted at the sender.  

There are no "semantics".

What ends up on the wire is the same before the kernel/app changes as
afterwards.

The only difference is that instead of the application doing N - 1
sendmsg() calls with mtu sized writes, it's giving everything all at
once and asking the kernel to segment.

It even gives the application control over the size of the packets,
which I think is completely prudent in this situation.

^ permalink raw reply

* Re: SRIOV switchdev mode BoF minutes
From: Andy Gospodarek @ 2018-04-18 17:25 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Andy Gospodarek, Or Gerlitz, Samudrala, Sridhar, David Miller,
	Anjali Singhai Jain, Michael Chan, Simon Horman, John Fastabend,
	Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List
In-Reply-To: <20180418092634.13488a33@cakuba.netronome.com>

On Wed, Apr 18, 2018 at 09:26:34AM -0700, Jakub Kicinski wrote:
> On Wed, 18 Apr 2018 11:15:29 -0400, Andy Gospodarek wrote:
> > > A similar issue exists on multi-host for PFs, right?  If one of the
> > > hosts is down do we still show their PF repr?  IMHO yes.  
> > 
> > I would agree with that as well.  With today's model the VF reps are
> > created once a PF is put into switchdev mode, but I'm still working out
> > how we want to consider whether or not a PF rep for the other domains is
> > created locally or not and also how one can determine which domain is in
> > control.
> > 
> > Permanent config options (like NVRAM settings) could easily handle which
> > domain is in control, but that still does not mean that PF reps must be
> > created automatically, does it?
> 
> The control domain is tricky.  I'm not sure I understand how you could
> not have a PF rep for remote domains, though.  How do you configure
> switching to the PF netdev if there is no rep?

Yes, for complete control of all traffic using standard Linux APIs a PF
rep is a requirement.

^ permalink raw reply

* Re: [PATCH RFC net-next 00/11] udp gso
From: David Miller @ 2018-04-18 17:24 UTC (permalink / raw)
  To: pabeni; +Cc: willemdebruijn.kernel, netdev, willemb
In-Reply-To: <1524050274.2599.21.camel@redhat.com>

From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 18 Apr 2018 13:17:54 +0200

> When testing with Spectre/Meltdown mitigation in places, I expect
> that the most relevant part of the gain is due to the single syscall
> per burst.

I was going to say exactly this.

Batching schemes that were borderline beneficial before spectre/meltdown
are more worthwhile now.

^ permalink raw reply

* Re: [PATCH bpf-next v4 03/10] bpf: btf: Check members of struct/union
From: Jakub Kicinski @ 2018-04-18 17:22 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: netdev, Alexei Starovoitov, Daniel Borkmann, kernel-team
In-Reply-To: <20180417204243.4028831-4-kafai@fb.com>

On Tue, 17 Apr 2018 13:42:36 -0700, Martin KaFai Lau wrote:
> This patch checks a few things of struct's members:
> 
> 1) It has a valid size (e.g. a "const void" is invalid)
> 2) A member's size (+ its member's offset) does not exceed
>    the containing struct's size.
> 3) The member's offset satisfies the alignment requirement

Could we also introduce a requirement for members to have different
names?  Maybe it's there but I missed it.  Would BTF with duplicated
member names be considered valid?

> The above can only be done after the needs_resolve member's type
> is resolved.  Hence, the above is done together in
> btf_struct_resolve().
> 
> Each possible member's type (e.g. int, enum, modifier...) implements
> the check_member() ops which will be called from btf_struct_resolve().
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> Acked-by: Alexei Starovoitov <ast@fb.com>

^ permalink raw reply

* Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache
From: Paolo Abeni @ 2018-04-18 17:15 UTC (permalink / raw)
  To: Eric Dumazet, netdev; +Cc: David S. Miller
In-Reply-To: <890db004-4dfe-7f77-61ee-1ac0d7d2a24c@gmail.com>

Hi,

On Wed, 2018-04-18 at 09:56 -0700, Eric Dumazet wrote:
> 
> On 04/18/2018 03:22 AM, Paolo Abeni wrote:
> > This changeset extends the idea behind commit c8c8b127091b ("udp:
> > under rx pressure, try to condense skbs"), trading more BH cpu
> > time and memory bandwidth to decrease the load on the user space
> > receiver.
> > 
> > At boot time we allocate a limited amount of skbs with small
> > data buffer, storing them in per cpu arrays. Such skbs are never
> > freed.
> > 
> > At run time, under rx pressure, the BH tries to copy the current
> > skb contents into the cache - if the current cache skb is available,
> > and the ingress skb is small enough and without any head states.
> > 
> > When using the cache skb, the ingress skb is dropped by the BH
> > - while still hot on cache - and the cache skb is inserted into
> > the rx queue, after increasing its usage count. Also, the cache
> > array index is moved to the next entry.
> > 
> > The receive side is unmodified: in udp_rcvmsg() the usage skb
> > usage count is decreased and the skb is _not_ freed - since the
> > cache keeps usage > 0. Since skb->usage is hot in the cache of the
> > receiver at consume time - the receiver has just read skb->data,
> > which lies in the same cacheline - the whole skb_consume_udp() becomes
> > really cheap.
> > 
> > UDP receive performances under flood improve as follow:
> > 
> > NR RX queues	Kpps	Kpps	Delta (%)
> > 		Before	After
> > 
> > 1		2252	2305	2
> > 2		2151	2569	19
> > 4		2033	2396	17
> > 8		1969	2329	18
> > 
> > Overall performances of knotd DNS server under real traffic flood
> > improves as follow:
> > 
> > 		Kpps	Kpps	Delta (%)
> > 		Before	After
> > 
> > 		3777	3981	5
> 
> 
> It might be time for knotd DNS server to finally use SO_REUSEPORT instead of
> adding this bloat to the kernel ?
> 
> Sorry, 5% improvement while you easily can get 300% improvement with no kernel change
> is not appealing to me :/

Thank you for the feedback.
Sorry for not being clear about it, but knotd is using SO_REUSEPORT and
the above tests are leveraging it.

That 5% is on top of that 300%.

Cheers,

Paolo

^ permalink raw reply

* Re: [PATCH net-next v2 16/21] net/ipv6: Add gfp_flags to route add functions
From: David Ahern @ 2018-04-18 17:10 UTC (permalink / raw)
  To: Eric Dumazet, netdev; +Cc: davem, idosch, roopa, weiwan, kafai, yoshfuji
In-Reply-To: <7dd4a7ae-f940-70fd-fab8-3376d4662f70@gmail.com>

On 4/18/18 11:06 AM, Eric Dumazet wrote:
> 
> 
> On 04/17/2018 05:33 PM, David Ahern wrote:
>> Most FIB entries can be added using memory allocated with GFP_KERNEL.
>> Add gfp_flags to ip6_route_add and addrconf_dst_alloc. Code paths that
>> can be reached from the packet path (e.g., ndisc and autoconfig) or
>> atomic notifiers use GFP_ATOMIC; paths from user context (adding
>> addresses and routes) use GFP_KERNEL.
>>
>> Signed-off-by: David Ahern <dsahern@gmail.com>
>> ---
> 
> 
> Hmmm....
> 
> ipv6_ifa_notify() calls __ipv6_ifa_notify() under rcu_read_lock_bh()/rcu_read_unlock_bh()
> 
> So using GFP_KERNEL in __ipv6_ifa_notify() is certainly not allowed.
> 
> 

Thanks for catching that. Will add fix to my followup set.

^ permalink raw reply

* RE: SRIOV switchdev mode BoF minutes
From: Parikh, Neerav @ 2018-04-18 17:07 UTC (permalink / raw)
  To: Andy Gospodarek, Jakub Kicinski
  Cc: Or Gerlitz, Samudrala, Sridhar, David Miller, Singhai, Anjali,
	Michael Chan, Simon Horman, John Fastabend, Saeed Mahameed,
	Jiri Pirko, Rony Efraim, Linux Netdev List
In-Reply-To: <20180418151529.GL33938@C02RW35GFVH8.dhcp.broadcom.net>

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> On Behalf Of Andy Gospodarek
> Sent: Wednesday, April 18, 2018 8:15 AM
> To: Jakub Kicinski <jakub.kicinski@netronome.com>
> Cc: Andy Gospodarek <andrew.gospodarek@broadcom.com>; Or Gerlitz
> <gerlitz.or@gmail.com>; Samudrala, Sridhar <sridhar.samudrala@intel.com>;
> David Miller <davem@davemloft.net>; Singhai, Anjali
> <anjali.singhai@intel.com>; Michael Chan <michael.chan@broadcom.com>;
> Simon Horman <simon.horman@netronome.com>; John Fastabend
> <john.fastabend@gmail.com>; Saeed Mahameed <saeedm@mellanox.com>;
> Jiri Pirko <jiri@mellanox.com>; Rony Efraim <ronye@mellanox.com>; Linux
> Netdev List <netdev@vger.kernel.org>
> Subject: Re: SRIOV switchdev mode BoF minutes
> 
> On Tue, Apr 17, 2018 at 04:19:15PM -0700, Jakub Kicinski wrote:
> > On Tue, 17 Apr 2018 10:47:00 -0400, Andy Gospodarek wrote:
> > > There is also a school of thought that the VF reps could be
> > > pre-allocated on the SmartNIC so that any application processing that
> > > traffic would sit idle when no traffic arrives on the rep, but could
> > > process frames that do arrive when the VFs were created on the host.
> > > This implementation will depend on how resources are allocated on a
> > > given bit of hardware, but can really work well.
> >
> > +1 if there is no FW resource allocation issues IMHO it's okay to
> > just show all reprs for "remote PCIes (PFs and VFs)" on the SmartNIC/
> > controller.  The reprs should just show link down as if PCIe cable
> > was unpluged until host actually enables them.
> 
> Yes we are on the same page on this.
> 
> > A similar issue exists on multi-host for PFs, right?  If one of the
> > hosts is down do we still show their PF repr?  IMHO yes.
> 
> I would agree with that as well.  With today's model the VF reps are
> created once a PF is put into switchdev mode, but I'm still working out
> how we want to consider whether or not a PF rep for the other domains is
> created locally or not and also how one can determine which domain is in
> control.
> 
> Permanent config options (like NVRAM settings) could easily handle which
> domain is in control, but that still does not mean that PF reps must be
> created automatically, does it?
> 
> > That makes the thing looks more like a switch with cables being plugged
> > in and out.
> 
> Yes, that's exactly how I view it as well.

If we need to behave like a switch or emulate that mode then is there a 
thought around the usability model?

So, while whichever domain is in control the implication above is that the
max number of vports supported (VFs and PFs) will need to be represented
regardless of whether they're "Enabled" or not in a given "Switch".
By "Enabled" I mean SRIOV VFs may not have been enabled but still the
representor exists.
For example if there are 256 VFs supported on a given PF when someone
switches into the switchdev mode there will be ~257 representor netdevs
added into the system. And if you've multi-port, multi-function devices
the number of representor netdevs will increase accordingly.
While representor netdevs' naming may help a bit here but a user will 
need to determine and differentiate between the sprawl of representors
netdevs and data netdevs to identify which all can be added into an 
OVS bridge (or vSwitch).
And switching to the "switchdev mode" becomes a pre-requisite before
any of the vSwitch bridges that uses these representor netdevs.

While, in a SmartNIC where users are not managing the devices this may
be deployed based on the NIC FW/SW capabilities. 
But, I'm not sure how the same model applied on a standard host running 
Linux will work across devices.

^ permalink raw reply

* Re: [PATCH net-next v2 16/21] net/ipv6: Add gfp_flags to route add functions
From: Eric Dumazet @ 2018-04-18 17:06 UTC (permalink / raw)
  To: David Ahern, netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji
In-Reply-To: <20180418003327.19992-17-dsahern@gmail.com>



On 04/17/2018 05:33 PM, David Ahern wrote:
> Most FIB entries can be added using memory allocated with GFP_KERNEL.
> Add gfp_flags to ip6_route_add and addrconf_dst_alloc. Code paths that
> can be reached from the packet path (e.g., ndisc and autoconfig) or
> atomic notifiers use GFP_ATOMIC; paths from user context (adding
> addresses and routes) use GFP_KERNEL.
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>
> ---


Hmmm....

ipv6_ifa_notify() calls __ipv6_ifa_notify() under rcu_read_lock_bh()/rcu_read_unlock_bh()

So using GFP_KERNEL in __ipv6_ifa_notify() is certainly not allowed.

^ permalink raw reply

* Re: [PATCH bpf-next v4 07/10] bpf: btf: Add pretty print support to the basic arraymap
From: Jakub Kicinski @ 2018-04-18 17:05 UTC (permalink / raw)
  To: Daniel Borkmann, Martin KaFai Lau; +Cc: netdev, Alexei Starovoitov, kernel-team
In-Reply-To: <ecbcec5b-6e74-333a-de4d-3114aec27833@iogearbox.net>

On Wed, 18 Apr 2018 17:20:11 +0200, Daniel Borkmann wrote:
> Hi Martin,
> 
> first of all great work on the set! One issue that puzzled me
> while digesting it further below.
> 
> On 04/17/2018 10:42 PM, Martin KaFai Lau wrote:
> > This patch adds pretty print support to the basic arraymap.
> > Support for other bpf maps can be added later.
> > 
> > This patch adds new attrs to the BPF_MAP_CREATE command to allow
> > specifying the btf_fd, btf_key_id and btf_value_id.  The
> > BPF_MAP_CREATE can then associate the btf to the map if
> > the creating map supports BTF.  
> 
> Feels like this patch is doing two things at once, meaning i)
> attaching btf object to map object through bpf syscall at map
> creation time, and ...
> 
> > A BTF supported map needs to implement two new map ops,
> > map_seq_show_elem() and map_check_btf().  This patch has
> > implemented these new map ops for the basic arraymap.
> > 
> > It also adds file_operations to the pinned map
> > such that the pinned map can be opened and read.  
> 
> ... ii) pretty print map dump via bpf fs for array map.
> 
> > Here is a sample output when reading a pinned arraymap
> > with the following map's value:
> > 
> > struct map_value {
> > 	int count_a;
> > 	int count_b;
> > };
> > 
> > cat /sys/fs/bpf/pinned_array_map:
> > 
> > 0: {1,2}
> > 1: {3,4}
> > 2: {5,6}
> > ...  
> 
> Rather than adding this to the bpf fs itself, why not add full BTF
> support for the main debugging and introspection tool we have and
> ship with the kernel for BPF, namely bpftool? You can already dump
> the whole map's key/value pairs via the following command from a
> pinned file:
> 
>   bpftool map dump pinned /sys/fs/bpf/pinned_array_map
> 
> And given we already export the BTF info in your earlier patch through
> the BPF_OBJ_GET_INFO_BY_FD, this would fit perfectly for bpftool
> integration instead where the pretty-print which is done through the
> extra cb map_seq_show_elem (which only does a map lookup and print
> anyway) in this patch can simply all be done in user space w/o any
> additional kernel complexity.
> 
> Aside that this would be very valuable there it would also nicely
> demonstrate usage of it, but more importantly we could avoid implementing
> such pretty-print callback in the kernel for every other map type and
> then having two locations where a user now needs to go for debugging
> (bpftool being one, and cat of pinned file the other; this split seems
> confusing from a user perspective, imho, but also single key lookup +
> pretty-print cannot be realized with the latter whereas it's trivial
> with bpftool).
> 
> The same could be done for bpftool map lookup, updates, deletions, etc
> where the key resp. key/value pair can be specified through a struct
> like initializer from cmdline. (But dump/lookup should be good enough
> starting point initially.) Thoughts?

I felt the same way but I'm obviously biased so I didn't say anything ;)

I'm concerned about exposing the map data in this arbitrary format.
This will be a kernel ABI.  And the format even though it looks nice
and concise for an array may not be as good for more complex maps where
key is not an int.  Will key for hashes be specified as fields in {}?
Why the discrepancy between array and hash then (array's key isn't)?
This print type is also not any standard notation AFAIU.  In bpftool
user can just ask to render JSON, including actual field names.  Or we
can add other formats if we feel like or change the default one, it's
not kernel ABI.  I'd hate to see people deploying regexps to parse the
output of pinned map dump when there is modern tooling available...

How useful is it really to provide a dump on a pinned file?  I guess
this is mostly for beginners?  Is this really more handy than bpftool?

^ permalink raw reply

* Re: [PATCH bpf-next v3 01/11] bpf: adding bpf_xdp_adjust_tail helper
From: Alexei Starovoitov @ 2018-04-18 17:01 UTC (permalink / raw)
  To: Nikita V. Shirokov; +Cc: Alexei Starovoitov, Daniel Borkmann, netdev
In-Reply-To: <20180418044223.17685-2-tehnerd@tehnerd.com>

On Tue, Apr 17, 2018 at 09:42:13PM -0700, Nikita V. Shirokov wrote:
> Adding new bpf helper which would allow us to manipulate
> xdp's data_end pointer, and allow us to reduce packet's size
> indended use case: to generate ICMP messages from XDP context,
> where such message would contain truncated original packet.
> 
> Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

patches look great. thanks!

^ permalink raw reply

* Re: [PATCH bpf-next v2 01/11] bpf: adding bpf_xdp_adjust_tail helper
From: Alexei Starovoitov @ 2018-04-18 16:56 UTC (permalink / raw)
  To: Nikita V. Shirokov; +Cc: Alexei Starovoitov, Daniel Borkmann, netdev
In-Reply-To: <20180418042951.17183-2-tehnerd@tehnerd.com>

On Tue, Apr 17, 2018 at 09:29:41PM -0700, Nikita V. Shirokov wrote:
> Adding new bpf helper which would allow us to manipulate
> xdp's data_end pointer, and allow us to reduce packet's size
> indended use case: to generate ICMP messages from XDP context,
> where such message would contain truncated original packet.
> ---

your SOB is now missing in all patches.

^ permalink raw reply

* Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache
From: Eric Dumazet @ 2018-04-18 16:56 UTC (permalink / raw)
  To: Paolo Abeni, netdev; +Cc: David S. Miller, Eric Dumazet
In-Reply-To: <dacbc7a3626bb170629e02159ed2f90120f06382.1524045911.git.pabeni@redhat.com>



On 04/18/2018 03:22 AM, Paolo Abeni wrote:
> This changeset extends the idea behind commit c8c8b127091b ("udp:
> under rx pressure, try to condense skbs"), trading more BH cpu
> time and memory bandwidth to decrease the load on the user space
> receiver.
> 
> At boot time we allocate a limited amount of skbs with small
> data buffer, storing them in per cpu arrays. Such skbs are never
> freed.
> 
> At run time, under rx pressure, the BH tries to copy the current
> skb contents into the cache - if the current cache skb is available,
> and the ingress skb is small enough and without any head states.
> 
> When using the cache skb, the ingress skb is dropped by the BH
> - while still hot on cache - and the cache skb is inserted into
> the rx queue, after increasing its usage count. Also, the cache
> array index is moved to the next entry.
> 
> The receive side is unmodified: in udp_rcvmsg() the usage skb
> usage count is decreased and the skb is _not_ freed - since the
> cache keeps usage > 0. Since skb->usage is hot in the cache of the
> receiver at consume time - the receiver has just read skb->data,
> which lies in the same cacheline - the whole skb_consume_udp() becomes
> really cheap.
> 
> UDP receive performances under flood improve as follow:
> 
> NR RX queues	Kpps	Kpps	Delta (%)
> 		Before	After
> 
> 1		2252	2305	2
> 2		2151	2569	19
> 4		2033	2396	17
> 8		1969	2329	18
> 
> Overall performances of knotd DNS server under real traffic flood
> improves as follow:
> 
> 		Kpps	Kpps	Delta (%)
> 		Before	After
> 
> 		3777	3981	5


It might be time for knotd DNS server to finally use SO_REUSEPORT instead of
adding this bloat to the kernel ?

Sorry, 5% improvement while you easily can get 300% improvement with no kernel change
is not appealing to me :/

^ permalink raw reply

* Re: [PATCH net-next 2/2] netns: isolate seqnums to use per-netns locks
From: Eric W. Biederman @ 2018-04-18 16:55 UTC (permalink / raw)
  To: Christian Brauner
  Cc: davem, netdev, linux-kernel, avagin, ktkhai, serge, gregkh
In-Reply-To: <20180418152106.18519-3-christian.brauner@ubuntu.com>

Christian Brauner <christian.brauner@ubuntu.com> writes:

> Now that it's possible to have a different set of uevents in different
> network namespaces, per-network namespace uevent sequence numbers are
> introduced. This increases performance as locking is now restricted to the
> network namespace affected by the uevent rather than locking
> everything.

Numbers please.  I personally expect that the netlink mc_list issues
will swamp any benefit you get from this.

Eric

^ permalink raw reply

* [PATCH bpf-next v2 4/9] bpf/verifier: improve register value range tracking with ARSH
From: Yonghong Song @ 2018-04-18 16:54 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180418165444.2263237-1-yhs@fb.com>

When helpers like bpf_get_stack returns an int value
and later on used for arithmetic computation, the LSH and ARSH
operations are often required to get proper sign extension into
64-bit. For example, without this patch:
    54: R0=inv(id=0,umax_value=800)
    54: (bf) r8 = r0
    55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
    55: (67) r8 <<= 32
    56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
    56: (c7) r8 s>>= 32
    57: R8=inv(id=0)
With this patch:
    54: R0=inv(id=0,umax_value=800)
    54: (bf) r8 = r0
    55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
    55: (67) r8 <<= 32
    56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
    56: (c7) r8 s>>= 32
    57: R8=inv(id=0, umax_value=800,var_off=(0x0; 0x3ff))
With better range of "R8", later on when "R8" is added to other register,
e.g., a map pointer or scalar-value register, the better register
range can be derived and verifier failure may be avoided.

In our later example,
    ......
    usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
    if (usize < 0)
        return 0;
    ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
    ......
Without improving ARSH value range tracking, the register representing
"max_len - usize" will have smin_value equal to S64_MIN and will be
rejected by verifier.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/bpf/verifier.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a8302c3..6148d31 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2944,6 +2944,7 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
 		__update_reg_bounds(dst_reg);
 		break;
 	case BPF_RSH:
+	case BPF_ARSH:
 		if (umax_val >= insn_bitness) {
 			/* Shifts greater than 31 or 63 are undefined.
 			 * This includes shifts by a negative number.
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v2 8/9] tools/bpf: add a verifier test case for bpf_get_stack helper and ARSH
From: Yonghong Song @ 2018-04-18 16:54 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180418165444.2263237-1-yhs@fb.com>

The test_verifier already has a few ARSH test cases.
This patch adds a new test case which takes advantage of newly
improved verifier behavior for bpf_get_stack and ARSH.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/testing/selftests/bpf/test_verifier.c | 45 +++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 3e7718b..cd595ba 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -11423,6 +11423,51 @@ static struct bpf_test tests[] = {
 		.errstr = "BPF_XADD stores into R2 packet",
 		.prog_type = BPF_PROG_TYPE_XDP,
 	},
+	{
+		"bpf_get_stack return R0 within range",
+		.insns = {
+			BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+				     BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 28),
+			BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),
+			BPF_MOV64_IMM(BPF_REG_9, sizeof(struct test_val)),
+			BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_7),
+			BPF_MOV64_IMM(BPF_REG_3, sizeof(struct test_val)),
+			BPF_MOV64_IMM(BPF_REG_4, 256),
+			BPF_EMIT_CALL(BPF_FUNC_get_stack),
+			BPF_MOV64_IMM(BPF_REG_1, 0),
+			BPF_MOV64_REG(BPF_REG_8, BPF_REG_0),
+			BPF_ALU64_IMM(BPF_LSH, BPF_REG_8, 32),
+			BPF_ALU64_IMM(BPF_ARSH, BPF_REG_8, 32),
+			BPF_JMP_REG(BPF_JSLT, BPF_REG_1, BPF_REG_8, 16),
+			BPF_ALU64_REG(BPF_SUB, BPF_REG_9, BPF_REG_8),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_7),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_8),
+			BPF_MOV64_REG(BPF_REG_1, BPF_REG_9),
+			BPF_ALU64_IMM(BPF_LSH, BPF_REG_1, 32),
+			BPF_ALU64_IMM(BPF_ARSH, BPF_REG_1, 32),
+			BPF_MOV64_REG(BPF_REG_3, BPF_REG_2),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_3, BPF_REG_1),
+			BPF_MOV64_REG(BPF_REG_1, BPF_REG_7),
+			BPF_MOV64_IMM(BPF_REG_5, sizeof(struct test_val)),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_5),
+			BPF_JMP_REG(BPF_JGE, BPF_REG_3, BPF_REG_1, 4),
+			BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+			BPF_MOV64_REG(BPF_REG_3, BPF_REG_9),
+			BPF_MOV64_IMM(BPF_REG_4, 0),
+			BPF_EMIT_CALL(BPF_FUNC_get_stack),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map2 = { 4 },
+		.result = ACCEPT,
+		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
+	},
 };
 
 static int probe_filter_length(const struct bpf_insn *fp)
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v2 5/9] tools/bpf: add bpf_get_stack helper to tools headers
From: Yonghong Song @ 2018-04-18 16:54 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180418165444.2263237-1-yhs@fb.com>

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/include/uapi/linux/bpf.h            | 19 +++++++++++++++++--
 tools/testing/selftests/bpf/bpf_helpers.h |  2 ++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 9d07465..63a4529 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -517,6 +517,17 @@ union bpf_attr {
  *             other bits - reserved
  *     Return: >= 0 stackid on success or negative error
  *
+ * int bpf_get_stack(ctx, buf, size, flags)
+ *     walk user or kernel stack and store the ips in buf
+ *     @ctx: struct pt_regs*
+ *     @buf: user buffer to fill stack
+ *     @size: the buf size
+ *     @flags: bits 0-7 - numer of stack frames to skip
+ *             bit 8 - collect user stack instead of kernel
+ *             bit 11 - get build-id as well if user stack
+ *             other bits - reserved
+ *     Return: >= 0 size copied on success or negative error
+ *
  * s64 bpf_csum_diff(from, from_size, to, to_size, seed)
  *     calculate csum diff
  *     @from: raw from buffer
@@ -821,7 +832,8 @@ union bpf_attr {
 	FN(msg_apply_bytes),		\
 	FN(msg_cork_bytes),		\
 	FN(msg_pull_data),		\
-	FN(bind),
+	FN(bind),			\
+	FN(get_stack),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -855,11 +867,14 @@ enum bpf_func_id {
 /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
 #define BPF_F_TUNINFO_IPV6		(1ULL << 0)
 
-/* BPF_FUNC_get_stackid flags. */
+/* flags for both BPF_FUNC_get_stackid and BPF_FUNC_get_stack. */
 #define BPF_F_SKIP_FIELD_MASK		0xffULL
 #define BPF_F_USER_STACK		(1ULL << 8)
+/* flags used by BPF_FUNC_get_stackid only. */
 #define BPF_F_FAST_STACK_CMP		(1ULL << 9)
 #define BPF_F_REUSE_STACKID		(1ULL << 10)
+/* flags used by BPF_FUNC_get_stack only. */
+#define BPF_F_USER_BUILD_ID		(1ULL << 11)
 
 /* BPF_FUNC_skb_set_tunnel_key flags. */
 #define BPF_F_ZERO_CSUM_TX		(1ULL << 1)
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index d8223d9..acaed02 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -96,6 +96,8 @@ static int (*bpf_msg_pull_data)(void *ctx, int start, int end, int flags) =
 	(void *) BPF_FUNC_msg_pull_data;
 static int (*bpf_bind)(void *ctx, void *addr, int addr_len) =
 	(void *) BPF_FUNC_bind;
+static int (*bpf_get_stack)(void *ctx, void *buf, int size, int flags) =
+	(void *) BPF_FUNC_get_stack;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v2 6/9] samples/bpf: move common-purpose perf_event functions to bpf_load.c
From: Yonghong Song @ 2018-04-18 16:54 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180418165444.2263237-1-yhs@fb.com>

There is no functionality change in this patch. The common-purpose
perf_event functions are moved from trace_output_user.c to bpf_load.c
so that these function can be reused later.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 samples/bpf/bpf_load.c          | 104 ++++++++++++++++++++++++++++++++++++
 samples/bpf/bpf_load.h          |   5 ++
 samples/bpf/trace_output_user.c | 113 ++++------------------------------------
 3 files changed, 118 insertions(+), 104 deletions(-)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index bebe418..62aa5cc 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -713,3 +713,107 @@ struct ksym *ksym_search(long key)
 	return &syms[0];
 }
 
+static int page_size;
+static int page_cnt = 8;
+static volatile struct perf_event_mmap_page *header;
+
+static int perf_event_mmap(int fd)
+{
+	void *base;
+	int mmap_size;
+
+	page_size = getpagesize();
+	mmap_size = page_size * (page_cnt + 1);
+
+	base = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (base == MAP_FAILED) {
+		printf("mmap err\n");
+		return -1;
+	}
+
+	header = base;
+	return 0;
+}
+
+static int perf_event_poll(int fd)
+{
+	struct pollfd pfd = { .fd = fd, .events = POLLIN };
+
+	return poll(&pfd, 1, 1000);
+}
+
+struct perf_event_sample {
+	struct perf_event_header header;
+	__u32 size;
+	char data[];
+};
+
+static void perf_event_read(perf_event_print_fn fn)
+{
+	__u64 data_tail = header->data_tail;
+	__u64 data_head = header->data_head;
+	__u64 buffer_size = page_cnt * page_size;
+	void *base, *begin, *end;
+	char buf[256];
+
+	asm volatile("" ::: "memory"); /* in real code it should be smp_rmb() */
+	if (data_head == data_tail)
+		return;
+
+	base = ((char *)header) + page_size;
+
+	begin = base + data_tail % buffer_size;
+	end = base + data_head % buffer_size;
+
+	while (begin != end) {
+		struct perf_event_sample *e;
+
+		e = begin;
+		if (begin + e->header.size > base + buffer_size) {
+			long len = base + buffer_size - begin;
+
+			assert(len < e->header.size);
+			memcpy(buf, begin, len);
+			memcpy(buf + len, base, e->header.size - len);
+			e = (void *) buf;
+			begin = base + e->header.size - len;
+		} else if (begin + e->header.size == base + buffer_size) {
+			begin = base;
+		} else {
+			begin += e->header.size;
+		}
+
+		if (e->header.type == PERF_RECORD_SAMPLE) {
+			fn(e->data, e->size);
+		} else if (e->header.type == PERF_RECORD_LOST) {
+			struct {
+				struct perf_event_header header;
+				__u64 id;
+				__u64 lost;
+			} *lost = (void *) e;
+			printf("lost %lld events\n", lost->lost);
+		} else {
+			printf("unknown event type=%d size=%d\n",
+			       e->header.type, e->header.size);
+		}
+	}
+
+	__sync_synchronize(); /* smp_mb() */
+	header->data_tail = data_head;
+}
+
+int perf_event_poller(int fd, perf_event_exec_fn exec_fn,
+		      perf_event_print_fn output_fn)
+{
+	if (perf_event_mmap(fd) < 0)
+		return 1;
+
+	exec_fn();
+
+	for (;;) {
+		perf_event_poll(fd);
+		perf_event_read(output_fn);
+	}
+
+	return 0;
+}
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index 453c200..d618750 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -62,4 +62,9 @@ struct ksym {
 int load_kallsyms(void);
 struct ksym *ksym_search(long key);
 int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags);
+
+typedef void (*perf_event_exec_fn)(void);
+typedef void (*perf_event_print_fn)(void *data, int size);
+int perf_event_poller(int fd, perf_event_exec_fn exec_fn,
+		      perf_event_print_fn output_fn);
 #endif
diff --git a/samples/bpf/trace_output_user.c b/samples/bpf/trace_output_user.c
index ccca1e3..3d3991f 100644
--- a/samples/bpf/trace_output_user.c
+++ b/samples/bpf/trace_output_user.c
@@ -24,97 +24,6 @@
 
 static int pmu_fd;
 
-int page_size;
-int page_cnt = 8;
-volatile struct perf_event_mmap_page *header;
-
-typedef void (*print_fn)(void *data, int size);
-
-static int perf_event_mmap(int fd)
-{
-	void *base;
-	int mmap_size;
-
-	page_size = getpagesize();
-	mmap_size = page_size * (page_cnt + 1);
-
-	base = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
-	if (base == MAP_FAILED) {
-		printf("mmap err\n");
-		return -1;
-	}
-
-	header = base;
-	return 0;
-}
-
-static int perf_event_poll(int fd)
-{
-	struct pollfd pfd = { .fd = fd, .events = POLLIN };
-
-	return poll(&pfd, 1, 1000);
-}
-
-struct perf_event_sample {
-	struct perf_event_header header;
-	__u32 size;
-	char data[];
-};
-
-static void perf_event_read(print_fn fn)
-{
-	__u64 data_tail = header->data_tail;
-	__u64 data_head = header->data_head;
-	__u64 buffer_size = page_cnt * page_size;
-	void *base, *begin, *end;
-	char buf[256];
-
-	asm volatile("" ::: "memory"); /* in real code it should be smp_rmb() */
-	if (data_head == data_tail)
-		return;
-
-	base = ((char *)header) + page_size;
-
-	begin = base + data_tail % buffer_size;
-	end = base + data_head % buffer_size;
-
-	while (begin != end) {
-		struct perf_event_sample *e;
-
-		e = begin;
-		if (begin + e->header.size > base + buffer_size) {
-			long len = base + buffer_size - begin;
-
-			assert(len < e->header.size);
-			memcpy(buf, begin, len);
-			memcpy(buf + len, base, e->header.size - len);
-			e = (void *) buf;
-			begin = base + e->header.size - len;
-		} else if (begin + e->header.size == base + buffer_size) {
-			begin = base;
-		} else {
-			begin += e->header.size;
-		}
-
-		if (e->header.type == PERF_RECORD_SAMPLE) {
-			fn(e->data, e->size);
-		} else if (e->header.type == PERF_RECORD_LOST) {
-			struct {
-				struct perf_event_header header;
-				__u64 id;
-				__u64 lost;
-			} *lost = (void *) e;
-			printf("lost %lld events\n", lost->lost);
-		} else {
-			printf("unknown event type=%d size=%d\n",
-			       e->header.type, e->header.size);
-		}
-	}
-
-	__sync_synchronize(); /* smp_mb() */
-	header->data_tail = data_head;
-}
-
 static __u64 time_get_ns(void)
 {
 	struct timespec ts;
@@ -166,10 +75,17 @@ static void test_bpf_perf_event(void)
 	ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
 }
 
+static void exec_action(void)
+{
+	FILE *f;
+
+	f = popen("taskset 1 dd if=/dev/zero of=/dev/null", "r");
+	(void) f;
+}
+
 int main(int argc, char **argv)
 {
 	char filename[256];
-	FILE *f;
 
 	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
 
@@ -180,17 +96,6 @@ int main(int argc, char **argv)
 
 	test_bpf_perf_event();
 
-	if (perf_event_mmap(pmu_fd) < 0)
-		return 1;
-
-	f = popen("taskset 1 dd if=/dev/zero of=/dev/null", "r");
-	(void) f;
-
 	start_time = time_get_ns();
-	for (;;) {
-		perf_event_poll(pmu_fd);
-		perf_event_read(print_bpf_output);
-	}
-
-	return 0;
+	return perf_event_poller(pmu_fd, exec_action, print_bpf_output);
 }
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v2 2/9] bpf: add bpf_get_stack helper
From: Yonghong Song @ 2018-04-18 16:54 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180418165444.2263237-1-yhs@fb.com>

Currently, stackmap and bpf_get_stackid helper are provided
for bpf program to get the stack trace. This approach has
a limitation though. If two stack traces have the same hash,
only one will get stored in the stackmap table,
so some stack traces are missing from user perspective.

This patch implements a new helper, bpf_get_stack, will
send stack traces directly to bpf program. The bpf program
is able to see all stack traces, and then can do in-kernel
processing or send stack traces to user space through
shared map or bpf_perf_event_output.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf.h      |  1 +
 include/linux/filter.h   |  3 ++-
 include/uapi/linux/bpf.h | 19 ++++++++++++--
 kernel/bpf/core.c        |  5 ++++
 kernel/bpf/stackmap.c    | 67 ++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c     | 10 ++++++++
 kernel/bpf/verifier.c    |  3 +++
 kernel/trace/bpf_trace.c | 50 +++++++++++++++++++++++++++++++++++-
 8 files changed, 154 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 95a7abd..72ccb9a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -676,6 +676,7 @@ extern const struct bpf_func_proto bpf_get_current_comm_proto;
 extern const struct bpf_func_proto bpf_skb_vlan_push_proto;
 extern const struct bpf_func_proto bpf_skb_vlan_pop_proto;
 extern const struct bpf_func_proto bpf_get_stackid_proto;
+extern const struct bpf_func_proto bpf_get_stack_proto;
 extern const struct bpf_func_proto bpf_sock_map_update_proto;
 
 /* Shared helpers among cBPF and eBPF. */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index fc4e8f9..9b64f63 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -467,7 +467,8 @@ struct bpf_prog {
 				dst_needed:1,	/* Do we need dst entry? */
 				blinded:1,	/* Was blinded */
 				is_func:1,	/* program is a bpf function */
-				kprobe_override:1; /* Do we override a kprobe? */
+				kprobe_override:1, /* Do we override a kprobe? */
+				need_callchain_buf:1; /* Needs callchain buffer? */
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	enum bpf_attach_type	expected_attach_type; /* For some prog types */
 	u32			len;		/* Number of filter blocks */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c5ec897..dadca82 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -517,6 +517,17 @@ union bpf_attr {
  *             other bits - reserved
  *     Return: >= 0 stackid on success or negative error
  *
+ * int bpf_get_stack(ctx, buf, size, flags)
+ *     walk user or kernel stack and store the ips in buf
+ *     @ctx: struct pt_regs*
+ *     @buf: user buffer to fill stack
+ *     @size: the buf size
+ *     @flags: bits 0-7 - numer of stack frames to skip
+ *             bit 8 - collect user stack instead of kernel
+ *             bit 11 - get build-id as well if user stack
+ *             other bits - reserved
+ *     Return: >= 0 size copied on success or negative error
+ *
  * s64 bpf_csum_diff(from, from_size, to, to_size, seed)
  *     calculate csum diff
  *     @from: raw from buffer
@@ -821,7 +832,8 @@ union bpf_attr {
 	FN(msg_apply_bytes),		\
 	FN(msg_cork_bytes),		\
 	FN(msg_pull_data),		\
-	FN(bind),
+	FN(bind),			\
+	FN(get_stack),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -855,11 +867,14 @@ enum bpf_func_id {
 /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
 #define BPF_F_TUNINFO_IPV6		(1ULL << 0)
 
-/* BPF_FUNC_get_stackid flags. */
+/* flags for both BPF_FUNC_get_stackid and BPF_FUNC_get_stack. */
 #define BPF_F_SKIP_FIELD_MASK		0xffULL
 #define BPF_F_USER_STACK		(1ULL << 8)
+/* flags used by BPF_FUNC_get_stackid only. */
 #define BPF_F_FAST_STACK_CMP		(1ULL << 9)
 #define BPF_F_REUSE_STACKID		(1ULL << 10)
+/* flags used by BPF_FUNC_get_stack only. */
+#define BPF_F_USER_BUILD_ID		(1ULL << 11)
 
 /* BPF_FUNC_skb_set_tunnel_key flags. */
 #define BPF_F_ZERO_CSUM_TX		(1ULL << 1)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index d315b39..bf22eca 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -31,6 +31,7 @@
 #include <linux/rbtree_latch.h>
 #include <linux/kallsyms.h>
 #include <linux/rcupdate.h>
+#include <linux/perf_event.h>
 
 #include <asm/unaligned.h>
 
@@ -1709,6 +1710,10 @@ static void bpf_prog_free_deferred(struct work_struct *work)
 	aux = container_of(work, struct bpf_prog_aux, work);
 	if (bpf_prog_is_dev_bound(aux))
 		bpf_prog_offload_destroy(aux->prog);
+#ifdef CONFIG_PERF_EVENTS
+	if (aux->prog->need_callchain_buf)
+		put_callchain_buffers();
+#endif
 	for (i = 0; i < aux->func_cnt; i++)
 		bpf_jit_free(aux->func[i]);
 	if (aux->func_cnt) {
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 04f6ec1..4477cf6 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -402,6 +402,73 @@ const struct bpf_func_proto bpf_get_stackid_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_get_stack, struct pt_regs *, regs, void *, buf, u32, size,
+	   u64, flags)
+{
+	u32 init_nr, trace_nr, copy_len, elem_size, num_elem;
+	bool user_build_id = flags & BPF_F_USER_BUILD_ID;
+	u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
+	bool user = flags & BPF_F_USER_STACK;
+	struct perf_callchain_entry *trace;
+	bool kernel = !user;
+	int err = -EINVAL;
+	u64 *ips;
+
+	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
+			       BPF_F_USER_BUILD_ID)))
+		goto clear;
+	if (kernel && user_build_id)
+		goto clear;
+
+	elem_size = (user && user_build_id) ? sizeof(struct bpf_stack_build_id)
+					    : sizeof(u64);
+	if (unlikely(size % elem_size))
+		goto clear;
+
+	num_elem = size / elem_size;
+	if (sysctl_perf_event_max_stack < num_elem)
+		init_nr = 0;
+	else
+		init_nr = sysctl_perf_event_max_stack - num_elem;
+	trace = get_perf_callchain(regs, init_nr, kernel, user,
+				   sysctl_perf_event_max_stack, false, false);
+	if (unlikely(!trace))
+		goto err_fault;
+
+	trace_nr = trace->nr - init_nr;
+	if (trace_nr <= skip)
+		goto err_fault;
+
+	trace_nr -= skip;
+	trace_nr = (trace_nr <= num_elem) ? trace_nr : num_elem;
+	copy_len = trace_nr * elem_size;
+	ips = trace->ip + skip + init_nr;
+	if (user && user_build_id)
+		stack_map_get_build_id_offset(buf, ips, trace_nr, user);
+	else
+		memcpy(buf, ips, copy_len);
+
+	if (size > copy_len)
+		memset(buf + copy_len, 0, size - copy_len);
+	return copy_len;
+
+err_fault:
+	err = -EFAULT;
+clear:
+	memset(buf, 0, size);
+	return err;
+}
+
+const struct bpf_func_proto bpf_get_stack_proto = {
+	.func		= bpf_get_stack,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg3_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg4_type	= ARG_ANYTHING,
+};
+
 /* Called from eBPF program */
 static void *stack_map_lookup_elem(struct bpf_map *map, void *key)
 {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 4ca46df..584eb90 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1329,6 +1329,16 @@ static int bpf_prog_load(union bpf_attr *attr)
 	if (err)
 		goto free_used_maps;
 
+	if (prog->need_callchain_buf) {
+#ifdef CONFIG_PERF_EVENTS
+		err = get_callchain_buffers(sysctl_perf_event_max_stack);
+#else
+		err = -ENOTSUPP;
+#endif
+		if (err)
+			goto free_used_maps;
+	}
+
 	err = bpf_prog_new_fd(prog);
 	if (err < 0) {
 		/* failed to allocate fd.
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5dd1dcb..aba9425 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2460,6 +2460,9 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 	if (err)
 		return err;
 
+	if (func_id == BPF_FUNC_get_stack)
+		env->prog->need_callchain_buf = true;
+
 	if (changes_data)
 		clear_all_pkt_pointers(env);
 	return 0;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d88e96d..fe8476f 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -20,6 +20,7 @@
 #include "trace.h"
 
 u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+u64 bpf_get_stack(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
 /**
  * trace_call_bpf - invoke BPF program
@@ -577,6 +578,8 @@ kprobe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_perf_event_output_proto;
 	case BPF_FUNC_get_stackid:
 		return &bpf_get_stackid_proto;
+	case BPF_FUNC_get_stack:
+		return &bpf_get_stack_proto;
 	case BPF_FUNC_perf_event_read_value:
 		return &bpf_perf_event_read_value_proto;
 #ifdef CONFIG_BPF_KPROBE_OVERRIDE
@@ -664,6 +667,25 @@ static const struct bpf_func_proto bpf_get_stackid_proto_tp = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_get_stack_tp, void *, tp_buff, void *, buf, u32, size,
+	   u64, flags)
+{
+	struct pt_regs *regs = *(struct pt_regs **)tp_buff;
+
+	return bpf_get_stack((unsigned long) regs, (unsigned long) buf,
+			     (unsigned long) size, flags, 0);
+}
+
+static const struct bpf_func_proto bpf_get_stack_proto_tp = {
+	.func		= bpf_get_stack_tp,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg3_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg4_type	= ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *
 tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -672,6 +694,8 @@ tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_perf_event_output_proto_tp;
 	case BPF_FUNC_get_stackid:
 		return &bpf_get_stackid_proto_tp;
+	case BPF_FUNC_get_stack:
+		return &bpf_get_stack_proto_tp;
 	default:
 		return tracing_func_proto(func_id, prog);
 	}
@@ -734,6 +758,8 @@ pe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_perf_event_output_proto_tp;
 	case BPF_FUNC_get_stackid:
 		return &bpf_get_stackid_proto_tp;
+	case BPF_FUNC_get_stack:
+		return &bpf_get_stack_proto_tp;
 	case BPF_FUNC_perf_prog_read_value:
 		return &bpf_perf_prog_read_value_proto;
 	default:
@@ -744,7 +770,7 @@ pe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 /*
  * bpf_raw_tp_regs are separate from bpf_pt_regs used from skb/xdp
  * to avoid potential recursive reuse issue when/if tracepoints are added
- * inside bpf_*_event_output and/or bpf_get_stack_id
+ * inside bpf_*_event_output, bpf_get_stackid and/or bpf_get_stack
  */
 static DEFINE_PER_CPU(struct pt_regs, bpf_raw_tp_regs);
 BPF_CALL_5(bpf_perf_event_output_raw_tp, struct bpf_raw_tracepoint_args *, args,
@@ -787,6 +813,26 @@ static const struct bpf_func_proto bpf_get_stackid_proto_raw_tp = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_get_stack_raw_tp, struct bpf_raw_tracepoint_args *, args,
+	   void *, buf, u32, size, u64, flags)
+{
+	struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
+
+	perf_fetch_caller_regs(regs);
+	return bpf_get_stack((unsigned long) regs, (unsigned long) buf,
+			     (unsigned long) size, flags, 0);
+}
+
+static const struct bpf_func_proto bpf_get_stack_proto_raw_tp = {
+	.func		= bpf_get_stack_raw_tp,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg4_type	= ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *
 raw_tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -795,6 +841,8 @@ raw_tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_perf_event_output_proto_raw_tp;
 	case BPF_FUNC_get_stackid:
 		return &bpf_get_stackid_proto_raw_tp;
+	case BPF_FUNC_get_stack:
+		return &bpf_get_stack_proto_raw_tp;
 	default:
 		return tracing_func_proto(func_id, prog);
 	}
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v2 9/9] tools/bpf: add a test_progs test case for bpf_get_stack helper
From: Yonghong Song @ 2018-04-18 16:54 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180418165444.2263237-1-yhs@fb.com>

The test_stacktrace_map is enhanced to call bpf_get_stack
in the helper to get the stack trace as well.
The stack traces from bpf_get_stack and bpf_get_stackid
are compared to ensure that for the same stack as
represented as the same hash, their ip addresses
must be the same.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/testing/selftests/bpf/test_progs.c          | 41 ++++++++++++++++++++++-
 tools/testing/selftests/bpf/test_stacktrace_map.c | 20 +++++++++--
 2 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index faadbe2..8aa2844 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -865,9 +865,39 @@ static int compare_map_keys(int map1_fd, int map2_fd)
 	return 0;
 }
 
+static int compare_stack_ips(int smap_fd, int amap_fd)
+{
+	int max_len = PERF_MAX_STACK_DEPTH * sizeof(__u64);
+	__u32 key, next_key, *cur_key_p, *next_key_p;
+	char val_buf1[max_len], val_buf2[max_len];
+	int i, err;
+
+	cur_key_p = NULL;
+	next_key_p = &key;
+	while (bpf_map_get_next_key(smap_fd, cur_key_p, next_key_p) == 0) {
+		err = bpf_map_lookup_elem(smap_fd, next_key_p, val_buf1);
+		if (err)
+			return err;
+		err = bpf_map_lookup_elem(amap_fd, next_key_p, val_buf2);
+		if (err)
+			return err;
+		for (i = 0; i < max_len; i++) {
+			if (val_buf1[i] != val_buf2[i])
+				return -1;
+		}
+		key = *next_key_p;
+		cur_key_p = &key;
+		next_key_p = &next_key;
+	}
+	if (errno != ENOENT)
+		return -1;
+
+	return 0;
+}
+
 static void test_stacktrace_map()
 {
-	int control_map_fd, stackid_hmap_fd, stackmap_fd;
+	int control_map_fd, stackid_hmap_fd, stackmap_fd, stack_amap_fd;
 	const char *file = "./test_stacktrace_map.o";
 	int bytes, efd, err, pmu_fd, prog_fd;
 	struct perf_event_attr attr = {};
@@ -925,6 +955,10 @@ static void test_stacktrace_map()
 	if (stackmap_fd < 0)
 		goto disable_pmu;
 
+	stack_amap_fd = bpf_find_map(__func__, obj, "stack_amap");
+	if (stack_amap_fd < 0)
+		goto disable_pmu;
+
 	/* give some time for bpf program run */
 	sleep(1);
 
@@ -946,6 +980,11 @@ static void test_stacktrace_map()
 		  "err %d errno %d\n", err, errno))
 		goto disable_pmu_noerr;
 
+	err = compare_stack_ips(stackmap_fd, stack_amap_fd);
+	if (CHECK(err, "compare_stack_ips stackmap vs. stack_amap",
+		  "err %d errno %d\n", err, errno))
+		goto disable_pmu_noerr;
+
 	goto disable_pmu_noerr;
 disable_pmu:
 	error_cnt++;
diff --git a/tools/testing/selftests/bpf/test_stacktrace_map.c b/tools/testing/selftests/bpf/test_stacktrace_map.c
index 76d85c5d..f83c7b6 100644
--- a/tools/testing/selftests/bpf/test_stacktrace_map.c
+++ b/tools/testing/selftests/bpf/test_stacktrace_map.c
@@ -19,14 +19,21 @@ struct bpf_map_def SEC("maps") stackid_hmap = {
 	.type = BPF_MAP_TYPE_HASH,
 	.key_size = sizeof(__u32),
 	.value_size = sizeof(__u32),
-	.max_entries = 10000,
+	.max_entries = 16384,
 };
 
 struct bpf_map_def SEC("maps") stackmap = {
 	.type = BPF_MAP_TYPE_STACK_TRACE,
 	.key_size = sizeof(__u32),
 	.value_size = sizeof(__u64) * PERF_MAX_STACK_DEPTH,
-	.max_entries = 10000,
+	.max_entries = 16384,
+};
+
+struct bpf_map_def SEC("maps") stack_amap = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u64) * PERF_MAX_STACK_DEPTH,
+	.max_entries = 16384,
 };
 
 /* taken from /sys/kernel/debug/tracing/events/sched/sched_switch/format */
@@ -44,7 +51,10 @@ struct sched_switch_args {
 SEC("tracepoint/sched/sched_switch")
 int oncpu(struct sched_switch_args *ctx)
 {
+	__u32 max_len = PERF_MAX_STACK_DEPTH * sizeof(__u64);
 	__u32 key = 0, val = 0, *value_p;
+	void *stack_p;
+
 
 	value_p = bpf_map_lookup_elem(&control_map, &key);
 	if (value_p && *value_p)
@@ -52,8 +62,12 @@ int oncpu(struct sched_switch_args *ctx)
 
 	/* The size of stackmap and stackid_hmap should be the same */
 	key = bpf_get_stackid(ctx, &stackmap, 0);
-	if ((int)key >= 0)
+	if ((int)key >= 0) {
 		bpf_map_update_elem(&stackid_hmap, &key, &val, 0);
+		stack_p = bpf_map_lookup_elem(&stack_amap, &key);
+		if (stack_p)
+			bpf_get_stack(ctx, stack_p, max_len, 0);
+	}
 
 	return 0;
 }
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v2 7/9] samples/bpf: add a test for bpf_get_stack helper
From: Yonghong Song @ 2018-04-18 16:54 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180418165444.2263237-1-yhs@fb.com>

The test attached a kprobe program to kernel function sys_write.
It tested to get stack for user space, kernel space and user
space with build_id request. It also tested to get user
and kernel stack into the same buffer with back-to-back
bpf_get_stack helper calls.

Whenever the kernel stack is available, the user space
application will check to ensure that sys_write/SyS_write
is part of the stack.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 samples/bpf/Makefile               |   4 +
 samples/bpf/trace_get_stack_kern.c |  86 +++++++++++++++++++++
 samples/bpf/trace_get_stack_user.c | 150 +++++++++++++++++++++++++++++++++++++
 3 files changed, 240 insertions(+)
 create mode 100644 samples/bpf/trace_get_stack_kern.c
 create mode 100644 samples/bpf/trace_get_stack_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 4d6a6ed..94e7b10 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -44,6 +44,7 @@ hostprogs-y += xdp_monitor
 hostprogs-y += xdp_rxq_info
 hostprogs-y += syscall_tp
 hostprogs-y += cpustat
+hostprogs-y += trace_get_stack
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
@@ -95,6 +96,7 @@ xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
 xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
 cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
+trace_get_stack-objs := bpf_load.o $(LIBBPF) trace_get_stack_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -148,6 +150,7 @@ always += xdp_rxq_info_kern.o
 always += xdp2skb_meta_kern.o
 always += syscall_tp_kern.o
 always += cpustat_kern.o
+always += trace_get_stack_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -193,6 +196,7 @@ HOSTLOADLIBES_xdp_monitor += -lelf
 HOSTLOADLIBES_xdp_rxq_info += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
 HOSTLOADLIBES_cpustat += -lelf
+HOSTLOADLIBES_trace_get_stack += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/trace_get_stack_kern.c b/samples/bpf/trace_get_stack_kern.c
new file mode 100644
index 0000000..665e4ad
--- /dev/null
+++ b/samples/bpf/trace_get_stack_kern.c
@@ -0,0 +1,86 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/ptrace.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+/* Permit pretty deep stack traces */
+#define MAX_STACK 100
+struct stack_trace_t {
+	int pid;
+	int kern_stack_size;
+	int user_stack_size;
+	int user_stack_buildid_size;
+	u64 kern_stack[MAX_STACK];
+	u64 user_stack[MAX_STACK];
+	struct bpf_stack_build_id user_stack_buildid[MAX_STACK];
+};
+
+struct bpf_map_def SEC("maps") perfmap = {
+	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(u32),
+	.max_entries = 2,
+};
+
+struct bpf_map_def SEC("maps") stackdata_map = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(struct stack_trace_t),
+	.max_entries = 1,
+};
+
+struct bpf_map_def SEC("maps") rawdata_map = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = MAX_STACK * sizeof(u64) * 2,
+	.max_entries = 1,
+};
+
+SEC("kprobe/sys_write")
+int bpf_prog1(struct pt_regs *ctx)
+{
+	int max_len, max_buildid_len, usize, ksize, total_size;
+	struct stack_trace_t *data;
+	void *raw_data;
+	u32 key = 0;
+
+	data = bpf_map_lookup_elem(&stackdata_map, &key);
+	if (!data)
+		return 0;
+
+	max_len = MAX_STACK * sizeof(u64);
+	max_buildid_len = MAX_STACK * sizeof(struct bpf_stack_build_id);
+	data->pid = bpf_get_current_pid_tgid();
+	data->kern_stack_size = bpf_get_stack(ctx, data->kern_stack,
+					      max_len, 0);
+	data->user_stack_size = bpf_get_stack(ctx, data->user_stack, max_len,
+					    BPF_F_USER_STACK);
+	data->user_stack_buildid_size = bpf_get_stack(
+		ctx, data->user_stack_buildid, max_buildid_len,
+		BPF_F_USER_STACK | BPF_F_USER_BUILD_ID);
+	bpf_perf_event_output(ctx, &perfmap, 0, data, sizeof(*data));
+
+	/* write both kernel and user stacks to the same buffer */
+	raw_data = bpf_map_lookup_elem(&rawdata_map, &key);
+	if (!raw_data)
+		return 0;
+
+	usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
+	if (usize < 0)
+		return 0;
+
+	ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
+	if (ksize < 0)
+		return 0;
+
+	total_size = usize + ksize;
+	if (total_size > 0 && total_size <= max_len)
+		bpf_perf_event_output(ctx, &perfmap, 0, raw_data, total_size);
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/trace_get_stack_user.c b/samples/bpf/trace_get_stack_user.c
new file mode 100644
index 0000000..f64f5a5
--- /dev/null
+++ b/samples/bpf/trace_get_stack_user.c
@@ -0,0 +1,150 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <linux/perf_event.h>
+#include <linux/bpf.h>
+#include <errno.h>
+#include <assert.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <time.h>
+#include <signal.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+#include "perf-sys.h"
+
+static int pmu_fd;
+
+#define MAX_CNT 10ull
+#define MAX_STACK 100
+struct stack_trace_t {
+	int pid;
+	int kern_stack_size;
+	int user_stack_size;
+	int user_stack_buildid_size;
+	__u64 kern_stack[MAX_STACK];
+	__u64 user_stack[MAX_STACK];
+	struct bpf_stack_build_id user_stack_buildid[MAX_STACK];
+};
+
+static void print_bpf_output(void *data, int size)
+{
+	struct stack_trace_t *e = data;
+	int i, num_stack;
+	static __u64 cnt;
+	bool found = false;
+
+	cnt++;
+
+	if (size < sizeof(struct stack_trace_t)) {
+		__u64 *raw_data = data;
+
+		num_stack = size / sizeof(__u64);
+		printf("sample size = %d, raw stack\n\t", size);
+		for (i = 0; i < num_stack; i++) {
+			struct ksym *ks = ksym_search(raw_data[i]);
+
+			printf("0x%llx ", raw_data[i]);
+			if (ks && (strcmp(ks->name, "sys_write") == 0 ||
+				   strcmp(ks->name, "SyS_write") == 0))
+				found = true;
+		}
+		printf("\n");
+	} else {
+		printf("sample size = %d, pid %d\n", size, e->pid);
+		if (e->kern_stack_size > 0) {
+			num_stack = e->kern_stack_size / sizeof(__u64);
+			printf("\tkernel_stack(%d): ", num_stack);
+			for (i = 0; i < num_stack; i++) {
+				struct ksym *ks = ksym_search(e->kern_stack[i]);
+
+				printf("0x%llx ", e->kern_stack[i]);
+				if (ks && (strcmp(ks->name, "sys_write") == 0 ||
+					   strcmp(ks->name, "SyS_write") == 0))
+					found = true;
+			}
+			printf("\n");
+		}
+		if (e->user_stack_size > 0) {
+			num_stack = e->user_stack_size / sizeof(__u64);
+			printf("\tuser_stack(%d): ", num_stack);
+			for (i = 0; i < num_stack; i++)
+				printf("0x%llx ", e->user_stack[i]);
+			printf("\n");
+		}
+		if (e->user_stack_buildid_size > 0) {
+			num_stack = e->user_stack_buildid_size /
+				    sizeof(struct bpf_stack_build_id);
+			printf("\tuser_stack_buildid(%d): ", num_stack);
+			for (i = 0; i < num_stack; i++) {
+				int j;
+
+				printf("(%d, 0x", e->user_stack_buildid[i].status);
+				for (j = 0; j < BPF_BUILD_ID_SIZE; j++)
+					printf("%02x", e->user_stack_buildid[i].build_id[i]);
+				printf(", %llx) ", e->user_stack_buildid[i].offset);
+			}
+			printf("\n");
+		}
+	}
+	if (!found) {
+		printf("received %lld events, kern symbol not found, exiting ...\n", cnt);
+		kill(0, SIGINT);
+	}
+
+	if (cnt == MAX_CNT) {
+		printf("received max %lld events, exiting ...\n", cnt);
+		kill(0, SIGINT);
+	}
+}
+
+static void test_bpf_perf_event(void)
+{
+	struct perf_event_attr attr = {
+		.sample_type = PERF_SAMPLE_RAW,
+		.type = PERF_TYPE_SOFTWARE,
+		.config = PERF_COUNT_SW_BPF_OUTPUT,
+	};
+	int key = 0;
+
+	pmu_fd = sys_perf_event_open(&attr, -1/*pid*/, 0/*cpu*/, -1/*group_fd*/, 0);
+
+	assert(pmu_fd >= 0);
+	assert(bpf_map_update_elem(map_fd[0], &key, &pmu_fd, BPF_ANY) == 0);
+	ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
+}
+
+static void action(void)
+{
+	FILE *f;
+
+	f = popen("taskset 1 dd if=/dev/zero of=/dev/null", "r");
+	(void) f;
+}
+
+int main(int argc, char **argv)
+{
+	char filename[256];
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_kallsyms()) {
+		printf("failed to process /proc/kallsyms\n");
+		return 2;
+	}
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	test_bpf_perf_event();
+	return perf_event_poller(pmu_fd, action, print_bpf_output);
+}
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v2 3/9] bpf/verifier: refine retval R0 state for bpf_get_stack helper
From: Yonghong Song @ 2018-04-18 16:54 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180418165444.2263237-1-yhs@fb.com>

The special property of return values for helpers bpf_get_stack
and bpf_probe_read_str are captured in verifier.
Both helpers return a negative error code or
a length, which is equal to or smaller than the buffer
size argument. This additional information in the
verifier can avoid the condition such as "retval > bufsize"
in the bpf program. For example, for the code blow,
    usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
    if (usize < 0 || usize > max_len)
        return 0;
The verifier may have the following errors:
    52: (85) call bpf_get_stack#65
     R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
     R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
     R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
     R9_w=inv800 R10=fp0,call_-1
    53: (bf) r8 = r0
    54: (bf) r1 = r8
    55: (67) r1 <<= 32
    56: (bf) r2 = r1
    57: (77) r2 >>= 32
    58: (25) if r2 > 0x31f goto pc+33
     R0=inv(id=0) R1=inv(id=0,smax_value=9223372032559808512,
                         umax_value=18446744069414584320,
                         var_off=(0x0; 0xffffffff00000000))
     R2=inv(id=0,umax_value=799,var_off=(0x0; 0x3ff))
     R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
     R8=inv(id=0) R9=inv800 R10=fp0,call_-1
    59: (1f) r9 -= r8
    60: (c7) r1 s>>= 32
    61: (bf) r2 = r7
    62: (0f) r2 += r1
    math between map_value pointer and register with unbounded
    min value is not allowed
The failure is due to llvm compiler optimization where register "r2",
which is a copy of "r1", is tested for condition while later on "r1"
is used for map_ptr operation. The verifier is not able to track such
inst sequence effectively.

Without the "usize > max_len" condition, there is no llvm optimization
and the below generated code passed verifier:
    52: (85) call bpf_get_stack#65
     R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
     R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
     R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
     R9_w=inv800 R10=fp0,call_-1
    53: (b7) r1 = 0
    54: (bf) r8 = r0
    55: (67) r8 <<= 32
    56: (c7) r8 s>>= 32
    57: (6d) if r1 s> r8 goto pc+24
     R0=inv(id=0,umax_value=800) R1=inv0 R6=ctx(id=0,off=0,imm=0)
     R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
     R8=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R9=inv800
     R10=fp0,call_-1
    58: (bf) r2 = r7
    59: (0f) r2 += r8
    60: (1f) r9 -= r8
    61: (bf) r1 = r6

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/bpf/verifier.c | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index aba9425..a8302c3 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2333,10 +2333,32 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
 	return 0;
 }
 
+static void do_refine_retval_range(struct bpf_reg_state *regs, int ret_type,
+				   int func_id,
+				   struct bpf_reg_state *retval_state,
+				   bool is_check)
+{
+	struct bpf_reg_state *src_reg, *dst_reg;
+
+	if (ret_type != RET_INTEGER ||
+	    (func_id != BPF_FUNC_get_stack &&
+	     func_id != BPF_FUNC_probe_read_str))
+		return;
+
+	dst_reg = is_check ? retval_state : &regs[BPF_REG_0];
+	if (func_id == BPF_FUNC_get_stack)
+		src_reg = is_check ? &regs[BPF_REG_3] : retval_state;
+	else
+		src_reg = is_check ? &regs[BPF_REG_2] : retval_state;
+
+	dst_reg->smax_value = src_reg->smax_value;
+	dst_reg->umax_value = src_reg->umax_value;
+}
+
 static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn_idx)
 {
 	const struct bpf_func_proto *fn = NULL;
-	struct bpf_reg_state *regs;
+	struct bpf_reg_state *regs, retval_state;
 	struct bpf_call_arg_meta meta;
 	bool changes_data;
 	int i, err;
@@ -2415,6 +2437,10 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 	}
 
 	regs = cur_regs(env);
+
+	/* before reset caller saved regs, check special ret value */
+	do_refine_retval_range(regs, fn->ret_type, func_id, &retval_state, 1);
+
 	/* reset caller saved regs */
 	for (i = 0; i < CALLER_SAVED_REGS; i++) {
 		mark_reg_not_init(env, regs, caller_saved[i]);
@@ -2456,6 +2482,9 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 		return -EINVAL;
 	}
 
+	/* apply additional constraints to ret value */
+	do_refine_retval_range(regs, fn->ret_type, func_id, &retval_state, 0);
+
 	err = check_map_func_compatibility(env, meta.map_ptr, func_id);
 	if (err)
 		return err;
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v2 0/9] bpf: add bpf_get_stack helper
From: Yonghong Song @ 2018-04-18 16:54 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team

Currently, stackmap and bpf_get_stackid helper are provided
for bpf program to get the stack trace. This approach has
a limitation though. If two stack traces have the same hash,
only one will get stored in the stackmap table regardless of
whether BPF_F_REUSE_STACKID is specified or not,
so some stack traces may be missing from user perspective.

This patch implements a new helper, bpf_get_stack, will
send stack traces directly to bpf program. The bpf program
is able to see all stack traces, and then can do in-kernel
processing or send stack traces to user space through
shared map or bpf_perf_event_output.

Patches #1 and #2 implemented the core kernel support.
Patches #3 and #4 are two verifier improves to make
bpf programming easier. Patch #5 synced the new helper
to tools headers. Patches #6 and #7 added a test in
samples/bpf by attaching to a kprobe. Patch #8 added
a verifier test in tools/bpf for new verifier change
and Patch #9 added a test by attaching to a tracepoint.

Changelogs:
  v1 -> v2:
    . fix compilation error when CONFIG_PERF_EVENTS is not enabled

Yonghong Song (9):
  bpf: change prototype for stack_map_get_build_id_offset
  bpf: add bpf_get_stack helper
  bpf/verifier: refine retval R0 state for bpf_get_stack helper
  bpf/verifier: improve register value range tracking with ARSH
  tools/bpf: add bpf_get_stack helper to tools headers
  samples/bpf: move common-purpose perf_event functions to bpf_load.c
  samples/bpf: add a test for bpf_get_stack helper
  tools/bpf: add a verifier test case for bpf_get_stack helper and ARSH
  tools/bpf: add a test_progs test case for bpf_get_stack helper

 include/linux/bpf.h                               |   1 +
 include/linux/filter.h                            |   3 +-
 include/uapi/linux/bpf.h                          |  19 ++-
 kernel/bpf/core.c                                 |   5 +
 kernel/bpf/stackmap.c                             |  80 ++++++++++--
 kernel/bpf/syscall.c                              |  10 ++
 kernel/bpf/verifier.c                             |  35 ++++-
 kernel/trace/bpf_trace.c                          |  50 +++++++-
 samples/bpf/Makefile                              |   4 +
 samples/bpf/bpf_load.c                            | 104 +++++++++++++++
 samples/bpf/bpf_load.h                            |   5 +
 samples/bpf/trace_get_stack_kern.c                |  86 +++++++++++++
 samples/bpf/trace_get_stack_user.c                | 150 ++++++++++++++++++++++
 samples/bpf/trace_output_user.c                   | 113 ++--------------
 tools/include/uapi/linux/bpf.h                    |  19 ++-
 tools/testing/selftests/bpf/bpf_helpers.h         |   2 +
 tools/testing/selftests/bpf/test_progs.c          |  41 +++++-
 tools/testing/selftests/bpf/test_stacktrace_map.c |  20 ++-
 tools/testing/selftests/bpf/test_verifier.c       |  45 +++++++
 19 files changed, 669 insertions(+), 123 deletions(-)
 create mode 100644 samples/bpf/trace_get_stack_kern.c
 create mode 100644 samples/bpf/trace_get_stack_user.c

-- 
2.9.5

^ permalink raw reply

* [PATCH bpf-next v2 1/9] bpf: change prototype for stack_map_get_build_id_offset
From: Yonghong Song @ 2018-04-18 16:54 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180418165444.2263237-1-yhs@fb.com>

This patch didn't incur functionality change. The function prototype
got changed so that the same function can be reused later.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/bpf/stackmap.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 57eeb12..04f6ec1 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -262,16 +262,11 @@ static int stack_map_get_build_id(struct vm_area_struct *vma,
 	return ret;
 }
 
-static void stack_map_get_build_id_offset(struct bpf_map *map,
-					  struct stack_map_bucket *bucket,
+static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 					  u64 *ips, u32 trace_nr, bool user)
 {
 	int i;
 	struct vm_area_struct *vma;
-	struct bpf_stack_build_id *id_offs;
-
-	bucket->nr = trace_nr;
-	id_offs = (struct bpf_stack_build_id *)bucket->data;
 
 	/*
 	 * We cannot do up_read() in nmi context, so build_id lookup is
@@ -361,8 +356,10 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 			pcpu_freelist_pop(&smap->freelist);
 		if (unlikely(!new_bucket))
 			return -ENOMEM;
-		stack_map_get_build_id_offset(map, new_bucket, ips,
-					      trace_nr, user);
+		new_bucket->nr = trace_nr;
+		stack_map_get_build_id_offset(
+			(struct bpf_stack_build_id *)new_bucket->data,
+			ips, trace_nr, user);
 		trace_len = trace_nr * sizeof(struct bpf_stack_build_id);
 		if (hash_matches && bucket->nr == trace_nr &&
 		    memcmp(bucket->data, new_bucket->data, trace_len) == 0) {
-- 
2.9.5

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox