Netdev List
* Re: [PATCH rfc] packet: zerocopy packet_snd
From: Jason Wang @ 2014-11-27  9:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Willem de Bruijn, Network Development, David Miller, Eric Dumazet,
	Daniel Borkmann
In-Reply-To: <20141126211748.GA11904@redhat.com>



On Thu, Nov 27, 2014 at 5:17 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Nov 26, 2014 at 02:59:34PM -0500, Willem de Bruijn wrote:
>>  > The main problem with zero copy ATM is with queueing disciplines
>>  > which might keep the socket around essentially forever.
>>  > The case was described here:
>>  > https://lkml.org/lkml/2014/1/17/105
>>  > and of course this will make it more serious now that
>>  > more applications will be able to do this, so
>>  > chances that an administrator enables this
>>  > are higher.
>>  
>>  The denial of service issue raised there, that a single queue can
>>  block an entire virtio-net device, is less problematic in the case of
>>  packet sockets. A socket can run out of sk_wmem_alloc, but a prudent
>>  application can increase the limit or use separate sockets for
>>  separate flows.
> 
> Socket per flow? Maybe just use TCP then?  increasing the limit
> sounds like a wrong solution, it hurts security.
> 
>>  > One possible solution is some kind of timer orphaning frags
>>  > for skbs that have been around for too long.
>>  
>>  Perhaps this can be approximated without an explicit timer by calling
>>  skb_copy_ubufs on enqueue whenever qlen exceeds a threshold value?
> 
> Hard to say. Will have to see that patch to judge how robust this is.

This may not work: if the threshold is greater than the vring size
or the vhost_net pending limit, transmission may still be blocked.
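
To make the concern concrete, here is a toy userspace model, not kernel
code. It assumes the sender may have at most pending_limit zerocopy
buffers outstanding, that nothing is dequeued, and that only packets
enqueued while qlen is above the threshold get copied. The function name
and both parameters are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model (assumption, not kernel behaviour): returns true if the
 * sender reaches its pending limit before the queue ever grows past
 * the threshold, i.e. before any copy would be triggered.
 */
bool sender_stalls(unsigned int threshold, unsigned int pending_limit)
{
	unsigned int qlen = 0;		/* packets sitting in the queue */
	unsigned int pinned = 0;	/* zerocopy buffers still outstanding */

	while (pinned < pending_limit) {
		if (qlen > threshold)
			return false;	/* copy kicks in, sender keeps going */
		qlen++;
		pinned++;
	}
	return true;			/* limit hit while everything is pinned */
}
```

In this model a threshold of 256 with a pending limit of 64 stalls the
sender, while a threshold of 16 with the same limit does not.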


* Re: 3.12.33 Bug with ipvs
From: Julian Anastasov @ 2014-11-27  8:08 UTC (permalink / raw)
  To: Smart Weblications GmbH - Florian Wiessner; +Cc: netdev
In-Reply-To: <54763E3F.4020306@smart-weblications.de>


	Hello,

On Wed, 26 Nov 2014, Smart Weblications GmbH - Florian Wiessner wrote:

> Hi netdev,
> 
> On 3.12.33 i see this every 3 hours or so on a box with ip_vs running with a
> setup which made no problems on 3.10.40. Could someone give me hints how to
> debug this? It seems to happen instantly, when i add ip_vs_ftp and have some nat
> rules. Setup is like this:
> 

> [13230.431740] RIP  [<ffffffff814ff2fc>] xfrm_selector_match+0x25/0x2f6
> [13230.431772]  RSP <ffff88083fd83a68>
> [13230.431795] CR2: 00000000000600d0
> [13230.432240] ---[ end trace 103912aa204977dc ]---
> 
> node01:/ocfs2/usr/src/linux-3.12.33/scripts# ./decodecode </tmp/oops.log
> [13230.431464] Code: 5d 41 5e 41 5f c3 41 55 66 83 fa 02 41 54 55 48 89 fd 53 48
> 89 f3 41 50 74 11 31 c0 66 83 fa 0a 0f 85 ce 02 00 00 e9 fd 00 00 00 <0f> b6 47
> 2a 8b 17 8b 76 18 84 c0 74 1a b9 20 00 00 00 31 f2 29
> All code
> ========
>    0:   5d                      pop    %rbp
>    1:   41 5e                   pop    %r14
>    3:   41 5f                   pop    %r15
>    5:   c3                      retq
>    6:   41 55                   push   %r13
>    8:   66 83 fa 02             cmp    $0x2,%dx
>    c:   41 54                   push   %r12
>    e:   55                      push   %rbp
>    f:   48 89 fd                mov    %rdi,%rbp
>   12:   53                      push   %rbx
>   13:   48 89 f3                mov    %rsi,%rbx
>   16:   41 50                   push   %r8
>   18:   74 11                   je     0x2b
>   1a:   31 c0                   xor    %eax,%eax
>   1c:   66 83 fa 0a             cmp    $0xa,%dx
>   20:   0f 85 ce 02 00 00       jne    0x2f4
>   26:   e9 fd 00 00 00          jmpq   0x128
>   2b:*  0f b6 47 2a             movzbl 0x2a(%rdi),%eax          <-- trapping instruction

	The instruction above is the 'sel->prefixlen_d' read from
the addr4_match call in __xfrm4_selector_match. It looks like
we dereference sel (%rdi) with a bad value of 00000000000600a6.
xfrm_sk_policy_lookup() provides &pol->selector to
xfrm_selector_match, so pol has a bad value. I don't recall
such a problem, and I'm not sure whether the 3-hour period
matches some timer in xfrm.
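
The address arithmetic can be double-checked with a small userspace
sketch. The constants come from the oops above (CR2 and the 0x2a
displacement in the trapping instruction); the helper name
faulting_base is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: given the faulting
 * address reported in CR2 and the displacement encoded in the trapping
 * instruction (movzbl 0x2a(%rdi),%eax), recover the base register value.
 */
uint64_t faulting_base(uint64_t cr2, uint64_t disp)
{
	assert(cr2 >= disp);	/* the displacement cannot exceed the address */
	return cr2 - disp;
}
```

Calling faulting_base(0x600d0, 0x2a) gives the %rdi value mentioned
above.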

> Could someone shed some light on the decoded output and point me somewhere so i
> can debug this further?

	If no one else has an idea what could be wrong, can you try
some kernels between 3.10.40 and 3.12.33, or even the latest
kernel?

Regards

--
Julian Anastasov <ja@ssi.bg>


* RE: [patch net-next v3 09/17] bridge: add API to notify bridge driver of learned FBD on offloaded device
From: Arad, Ronen @ 2014-11-27  7:37 UTC (permalink / raw)
  To: Scott Feldman; +Cc: netdev@vger.kernel.org
In-Reply-To: <CAE4R7bC9o4ucO_Erb0tU4EwJNYj8SrapbKxNWvUTL8YD=RMBJQ@mail.gmail.com>



> -----Original Message-----
> From: Scott Feldman [mailto:sfeldma@gmail.com]
> Sent: Wednesday, November 26, 2014 11:04 PM
> To: Arad, Ronen
> Cc: netdev@vger.kernel.org
> Subject: Re: [patch net-next v3 09/17] bridge: add API to notify bridge driver
> of learned FBD on offloaded device
> 
> On Tue, Nov 25, 2014 at 9:37 PM, Arad, Ronen <ronen.arad@intel.com>
> wrote:
> >> >>>
> >> >>> Is there any case where this fdb entry gets re-used and is no
> >> >>> longer added by an external learning? Should we clear this flag
> >> >>> somewhere?
> >> >>
> >> >> Once the FDB entry is marked "added_by_external_learn" it stays
> >> >> marked as such until removed by aging cleanup process (or flushed
> >> >> due to interface down, etc).  If aged out (and now deleted), the
> >> >> FDB entry may come back either by SW learn or by HW learn.  If SW
> >> >> learn comes first, and then HW learn, HW learn will override and
> >> >> mark the existing FDB entry "added_by_external_learn".  So there
> >> >> is take-over by HW but no give-back to SW.  And there is no
> >> >> explicit clearing of the mark short of deleting the FDB entry.
> >> >> The mark is mostly for letting users know which FDB entries were
> >> >> learned by HW and synced to the bridge's FDB.
> >> >
> >> > Thanks, makes sense now. This is probably obvious in this context,
> >> > but maybe it would not hurt to come up with a documentation that
> >> > describe the offload API, FDB entry lifetime and HW/SW ownership etc...
> >>
> >> I have an updated Documentation/networking/switchdev.txt that covers
> >> the swdev APIs and usage and notes, but Jiri is being stingy with it.
> >> Will get this out, either in v4 or follow-on patches.  There is
> >> enough going on just with L2 offload that we're going to need some
> >> good documentation to guide implementers.
> >> --
> >
> > To control the lifetime of an externally learned FDB entry, the bridge shall
> > provide an API for the switch driver to update the freshness of externally
> > learned entries. Otherwise, the bridge aging will age entries which are
> > currently or frequently used by the HW.
> > Is this covered in the updated document?
> > Is this functionality planned for v4?
> 
> Hi Ronen,
> 
> It's already there: driver calls br_fdb_external_learn_add() to refresh FDB
> entry, which updates the fdb->updated and fdb->used timestamps,
> preventing bridge from prematurely aging out the entry.
> We'll make sure that detail gets in the doc.  It's up to the driver on how
> frequently it calls br_fdb_external_learn_add().  Maybe it just blindly makes
> the call every 1s.  That's what rocker driver does (as long as the FDB entry
> continues to get hits).  From the user's perspective, 1s update is nice when
> looking at the stats dump for fdbs, since the timestamps are in secs.
> 
> -scott

Hi Scott,

Thanks. This should work. I overlooked that but now I see it in the code with a clear comment.

-ronen



* Re: [PATCH rfc] packet: zerocopy packet_snd
From: Michael S. Tsirkin @ 2014-11-27  7:27 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, Eric Dumazet, Daniel Borkmann
In-Reply-To: <CA+FuTSf_qiO964ZK1gB9skd1iQMj3iodcccscCt=vo6j92MsuA@mail.gmail.com>

On Wed, Nov 26, 2014 at 06:05:16PM -0500, Willem de Bruijn wrote:
> On Wed, Nov 26, 2014 at 4:20 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Wed, Nov 26, 2014 at 02:59:34PM -0500, Willem de Bruijn wrote:
> >> > The main problem with zero copy ATM is with queueing disciplines
> >> > which might keep the socket around essentially forever.
> >> > The case was described here:
> >> > https://lkml.org/lkml/2014/1/17/105
> >> > and of course this will make it more serious now that
> >> > more applications will be able to do this, so
> >> > chances that an administrator enables this
> >> > are higher.
> >>
> >> The denial of service issue raised there, that a single queue can
> >> block an entire virtio-net device, is less problematic in the case of
> >> packet sockets. A socket can run out of sk_wmem_alloc, but a prudent
> >> application can increase the limit or use separate sockets for
> >> separate flows.
> >
> > Sounds like this interface is very hard to use correctly.
> 
> Actually, this socket alloc issue is the same for zerocopy and
> non-zerocopy. Packets can be held in deep queues at which point
> the packet socket is blocked. This is accepted behavior.
>
> From the above thread:
> 
> "It's ok for non-zerocopy packet to be blocked since VM1 thought the
> packets has been sent instead of pending in the virtqueue. So VM1 can
> still send packet to other destination."
> 
> This is very specific to virtio and vhost-net. I don't think that that
> concern applies to a packet interface.

Well, you are obviously building the interface with some use-case in mind.
Let's try to make it work for multiple use-cases.

So at some level, you are right.  The issue is not running out of wmem.
But I think I'm right too - this is hard to use correctly.

I think the difference is that with your patch, application
can't reuse the memory until packet is transmitted, otherwise junk goes
out on the network. Even closing the socket won't help.
Is this true?

I see this as a problem.

I'm trying to figure out how one would use this interface. One obvious
use would be to tunnel out raw packets directly from VM memory.
For this application, a zero copy packet never completing is a problem:
at minimum, you want to be able to remove the device, which
translates to a requirement that closing the socket effectively stops
using userspace memory. In case you want to be able to run e.g. a watchdog,
the ability to specify a deadline also seems beneficial.


> Another issue, though, is that the method currently really only helps
> TSO because all other paths cause a deep copy. There are more use
> cases once it can send up to 64KB MTU over loopback or send out
> GSO datagrams without triggering skb_copy_ubufs. I have not looked
> into how (or if) that can be achieved yet.

I think this was done intentionally at some point,
try to look at git history to find out the reasons.

> >
> >> > One possible solution is some kind of timer orphaning frags
> >> > for skbs that have been around for too long.
> >>
> >> Perhaps this can be approximated without an explicit timer by calling
> >> skb_copy_ubufs on enqueue whenever qlen exceeds a threshold value?
> >
> > Not sure.  I'll have to see that patch to judge.


* Re: Multiple DSA switch on shared MII
From: Rajib Karmakar @ 2014-11-27  7:18 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev
In-Reply-To: <CAOi_9k_3+9U4q=SHdeEJOi_efHkRvYdTQ=UBqZtafgDAv_oOHg@mail.gmail.com>

This could be a naive question; but can I add all LAN and WAN
interfaces in a single DSA switch (one cpu port, one netdev)?


* Re: [patch net-next v3 09/17] bridge: add API to notify bridge driver of learned FBD on offloaded device
From: Scott Feldman @ 2014-11-27  7:03 UTC (permalink / raw)
  To: Arad, Ronen; +Cc: netdev@vger.kernel.org
In-Reply-To: <E4CD12F19ABA0C4D8729E087A761DC3505D7E13E@ORSMSX101.amr.corp.intel.com>

On Tue, Nov 25, 2014 at 9:37 PM, Arad, Ronen <ronen.arad@intel.com> wrote:
>> >>>
>> >>> Is there any case where this fdb entry gets re-used and is no
>> >>> longer added by an external learning? Should we clear this flag somewhere?
>> >>
>> >> Once the FDB entry is marked "added_by_external_learn" it stays
>> >> marked as such until removed by aging cleanup process (or flushed
>> >> due to interface down, etc).  If aged out (and now deleted), the
>> >> FDB entry may come back either by SW learn or by HW learn.  If SW
>> >> learn comes first, and then HW learn, HW learn will override and
>> >> mark the existing FDB entry "added_by_external_learn".  So there is
>> >> take-over by HW but no give-back to SW.  And there is no explicit
>> >> clearing of the mark short of deleting the FDB entry.  The mark is
>> >> mostly for letting users know which FDB entries were learned by
>> >> HW and synced to the bridge's FDB.
>> >
>> > Thanks, makes sense now. This is probably obvious in this context,
>> > but maybe it would not hurt to come up with a documentation that
>> > describe the offload API, FDB entry lifetime and HW/SW ownership etc...
>>
>> I have an updated Documentation/networking/switchdev.txt that covers
>> the swdev APIs and usage and notes, but Jiri is being stingy with it.
>> Will get this out, either in v4 or follow-on patches.  There is enough
>> going on just with L2 offload that we're going to need some good
>> documentation to guide implementers.
>> --
>
> To control the lifetime of an externally learned FDB entry, the bridge shall provide an API for the switch driver to update the freshness of externally learned entries. Otherwise, the bridge aging will age entries which are currently or frequently used by the HW.
> Is this covered in the updated document?
> Is this functionality planned for v4?

Hi Ronen,

It's already there: driver calls br_fdb_external_learn_add() to
refresh FDB entry, which updates the fdb->updated and fdb->used
timestamps, preventing bridge from prematurely aging out the entry.
We'll make sure that detail gets in the doc.  It's up to the driver on
how frequently it calls br_fdb_external_learn_add().  Maybe it just
blindly makes the call every 1s.  That's what rocker driver does (as
long as the FDB entry continues to get hits).  From the user's
perspective, 1s update is nice when looking at the stats dump for
fdbs, since the timestamps are in secs.

-scott
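
As a rough illustration of the refresh-versus-aging interaction described
above, here is a userspace sketch. It is not the bridge code: struct
fdb_model and both functions are simplified stand-ins invented for this
example, and only the "updated" timestamp is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for a bridge FDB entry; not the kernel struct. */
struct fdb_model {
	unsigned long updated;		/* last refresh time, in seconds */
	bool added_by_external_learn;
};

/* Model of the driver refreshing an entry it still sees hits for. */
void fdb_model_refresh(struct fdb_model *fdb, unsigned long now)
{
	assert(fdb);
	fdb->updated = now;
	fdb->added_by_external_learn = true;
}

/* Model of the aging check: expired once idle for longer than ageing_time. */
bool fdb_model_expired(const struct fdb_model *fdb, unsigned long now,
		       unsigned long ageing_time)
{
	assert(fdb);
	return now - fdb->updated > ageing_time;
}
```

With a 300 second ageing time, an entry refreshed every second never
crosses the threshold, while one left alone for 301 seconds does.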


* Re: [PATCH net-next v4] ipvlan: Initial check-in of the IPVLAN driver.
From: Mahesh Bandewar @ 2014-11-27  6:55 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: netdev, Eric Dumazet, Maciej Zenczykowski, Laurent Chavey,
	Tim Hockin, David Miller, Brandon Philips, Pavel Emelianov
In-Reply-To: <5476859E.609@lab.ntt.co.jp>

On Wed, Nov 26, 2014 at 5:59 PM, Toshiaki Makita
<makita.toshiaki@lab.ntt.co.jp> wrote:
> On 2014/11/27 2:05, Mahesh Bandewar wrote:
>> On Tue, Nov 25, 2014 at 10:41 PM, Toshiaki Makita
>> <makita.toshiaki@lab.ntt.co.jp> wrote:
>>> Hi Mahesh,
>>>
>>> I found that deleting the last ipvlan device triggers WARN_ON() in
>>> rtmsg_ifinfo().
>>> ipvlan_nl_fillinfo() seems to return -EINVAL in that case.
>>>
>>>> +static int ipvlan_nl_fillinfo(struct sk_buff *skb,
>>>> +                           const struct net_device *dev)
>>>> +{
>>>> +     struct ipvl_dev *ipvlan = netdev_priv(dev);
>>>> +     struct ipvl_port *port = ipvlan_port_get_rtnl(ipvlan->phy_dev);
>>>> +     int ret = -EINVAL;
>>>> +
>>>> +     if (!port)
>>>> +             goto err;
>>>> +
>>>> +     ret = -EMSGSIZE;
>>>> +     if (nla_put_u16(skb, IFLA_IPVLAN_MODE, port->mode))
>>>> +             goto err;
>>>> +
>>>> +     return 0;
>>>> +
>>>> +err:
>>>> +     return ret;
>>>> +}
>>>
>>> rollback_registered_many() calls rtmsg_ifinfo() after calling ndo_uninit().
>>> ndo_uninit() (ipvlan_uninit() -> ipvlan_port_destroy() ->
>>> netdev_rx_handler_unregister()) sets rx_handler_data into NULL.
>>> So, we cannot dereference "port" in ipvlan_nl_fillinfo().
>>>
>> Calling fillinfo() after calling uninit() seems pointless on any
>> device.
>
> bonding needs calling rtmsg_ifinfo() after calling ndo_uninit().
> 56bfa7ee7c88 ("unregister_netdevice : move RTM_DELLINK to until after
> ndo_uninit")
>
Thanks I'll take a look.

>> But how are you hitting this case? Can you share the command
>> sequence with me?
>
> # ip link add link eth0 name ipvl0 type ipvlan
> # ip link del ipvl0
>
Yes, I could reproduce the issue when the master device is a bond.

> Thanks,
> Toshiaki Makita
>
>>
>> Thanks,
>> --mahesh..
>>
>>> Maybe "mode" should belong to struct ipvl_dev?
>>>
>>> Thanks,
>>> Toshiaki Makita
>
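
For reference, Toshiaki's suggestion quoted above (keeping "mode" in the
per-device structure) could look roughly like the userspace sketch below.
The structs and function names are simplified stand-ins invented for
illustration; they are not the real ipvlan structures or the eventual fix.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins; these are not the real ipvlan structures. */
struct port_model {
	unsigned short mode;
};

struct dev_model {
	struct port_model *port;	/* NULL once the port is destroyed */
	unsigned short mode;		/* hypothetical per-device copy */
};

/* Mirrors the quoted logic: fails when the port is already gone. */
int fill_mode_from_port(const struct dev_model *dev, unsigned short *out)
{
	assert(dev && out);
	if (!dev->port)
		return -EINVAL;
	*out = dev->port->mode;
	return 0;
}

/* Variant that reads the per-device copy and never touches the port. */
int fill_mode_from_dev(const struct dev_model *dev, unsigned short *out)
{
	assert(dev && out);
	*out = dev->mode;
	return 0;
}
```

The second variant still has a value to report after uninit has cleared
the port pointer, which is the situation hit on device deletion.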


* Re: [PATCH RFC net-next] net: Add GRO support for GRE tunneling of TEB packets
From: Or Gerlitz @ 2014-11-27  6:51 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Or Gerlitz, David S. Miller, Linux Netdev List, Eric Dumazet,
	H.K. Jerry Chu
In-Reply-To: <CA+mtBx_-TU+ktWY85bJ1gJOMZD1BOW5hbj=JdEHV5FohYxdjAg@mail.gmail.com>

On Wed, Nov 26, 2014 at 5:44 PM, Tom Herbert <therbert@google.com> wrote:
> On Wed, Nov 26, 2014 at 7:08 AM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
>> Add the missing parts in the gre gro handlers when the inner protocol
>> is ETH_P_TEB which is the case for OVS based GRE tunneling.

> I don't think this is the right approach. It would probably be better
> to add a gro_receive handler for ETH_P_TEB and then you wouldn't
> need to modify GRE path with special case code. That would also be
> applicable in geneve.

Makes sense, I'll do that, thanks for the feedback.


* Re: [patch net-next v3 02/17] net: make vid as a parameter for ndo_fdb_add/ndo_fdb_del
From: Scott Feldman @ 2014-11-27  6:50 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: John Fastabend, Jiri Pirko, Netdev, David S. Miller,
	nhorman@tuxdriver.com, Andy Gospodarek, Thomas Graf,
	dborkman@redhat.com, ogerlitz@mellanox.com, jesse@nicira.com,
	pshelar@nicira.com, azhou@nicira.com, ben@decadent.org.uk,
	stephen@networkplumber.org, Kirsher, Jeffrey T,
	vyasevic@redhat.com, Cong Wang, Eric Dumazet, Florian Fainelli,
	Roopa Prabhu, John Linville
In-Reply-To: <5475B952.2080500@mojatatu.com>

On Wed, Nov 26, 2014 at 1:28 AM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 11/25/14 22:59, Scott Feldman wrote:
>>
>> On Tue, Nov 25, 2014 at 5:19 PM, Jamal Hadi Salim <jhs@mojatatu.com>
>> wrote:
>>>
>>> On 11/25/14 21:36, Scott Feldman wrote:
>
>
>>>
>>>>> Ok, guess i am gonna have to go stare at the code some more.
>>>>> I thought we returned one of the error codes?
>>>>> A bitmask would work for a single entry - because you have two
>>>>> options add to h/ware and/or s/ware. So response is easy to encode.
>>>>> But if i have 1000 and they are sparsely populated (think an indexed
>>>>> table and i have indices 1, 23, 45, etc), then a bitmask would be
>>>>> hard to use.
>>>>
>>>>
>>>>
>>>> I'm confused by this discussion.
>>>
>>>
>>>
>>> This is about the policy which states "install as many as you can, dont
>>> worry about failures". In such a case, how do you tell user space back
>>> "oh, btw you know your request #1, #23, and 45 went ok, but nothing else
>>> worked". A simple return code wont work. You could return a code to
>>> say "some worked". At which case user space could dump and find out only
>>> #1, #23 and #45 worked.
>>
>>
>> You request for what?  That's my confusion.
>
>
> Scott, you are gonna make me do this all over again?;->
> The summary is there are three possible policies that could be
> identified by the user asking for a kernel operation.
> One use case example was to send a bunch of (for example)
> create/updates and request that the kernel should not abort on a
> failure of a single one but to keep going and create/update as many
> as possible. Is that part clear? I know it is not what you do,
> but there are use cases for that (Read John's response).
> Now assuming someone wants this and some entries failed;
> how do you tell user space back what was actually updated vs not?
> You could return a code which says "partial success".
> Forget whether the table is keyed or indexed but if you wanted
> to return more detailed info you would return an array/vector of some
> sort with status code per entry. Something netlink cant do.
> Is that a better description?
>
>> Are you trying to install
>> FDB entry into both SW and HW at same time?
>
>
>
> What is wrong with installing on both hardware and software? The
> point was to identify what kind of policies could be requested by
> the user; but even for the bridge why is it bad that i ask for
> both master&self?
> It is something I can do today with none of these patches.
>
>> And then do a bunch in a
>> batch?  I'm saying use MASTER for SW and SELF for HW in two steps,
>
>
> But that would be enforcing your policy on me.

Ok, I get it now.  I'm looking forward to seeing what solution people
come up with to solve this.

>
>> if
>> you want FDB entry installed in both Sw and HW.  Check your return
>> code each step.  Batch all to HW first, then batch all that PASSED to
>> SW.  I don't even know really why you're trying to install to both HW
>> and SW.  Install it to HW and be done. fdb_dump will set HW entries
>> via SELF.
>>
>
> First off: bad performance, but your call to do it that way
> (just please please dont enforce it on me;->)
>
> Lets take the hardware batching you mentioned above and see if
> i can help to clarify in the third policy choice (continue-on-failure).
> Lets say you have a keyed table such as the fdb table is.
> You send 10 entries to be created/added in hardware. #3 and #5 failed
> because you made a mistake and sent them with the same key. #9 and #10
> failed because the hardware doesnt have any more space.
> we didnt stop and go back for #3 and #5 because the user told
> us to continue and do the rest when we fail. And s/he did that because
> she wanted to put as many entries in hardware as possible without
> necessarily needing to know how much space exists.
>
>
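
One reading of the continue-on-failure example above, as a userspace
sketch: every entry is attempted and a per-entry status vector is filled
in, so the caller can tell duplicates from out-of-space failures. The
table model, its capacity and the function names are assumptions made for
illustration; nothing here is an existing netlink or bridge interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TABLE_CAP 6	/* arbitrary capacity for the example */

/* Toy keyed table standing in for a hardware FDB. */
struct table_model {
	unsigned int keys[TABLE_CAP];
	size_t used;
};

/* Add one key; returns 0, -EEXIST on duplicate, -ENOSPC when full. */
static int table_add(struct table_model *t, unsigned int key)
{
	size_t i;

	for (i = 0; i < t->used; i++)
		if (t->keys[i] == key)
			return -EEXIST;
	if (t->used == TABLE_CAP)
		return -ENOSPC;
	t->keys[t->used++] = key;
	return 0;
}

/*
 * Continue-on-failure batch: try every entry, record 0 or -errno per
 * entry in status[], and return how many entries were installed.
 */
size_t batch_add_continue(struct table_model *t, const unsigned int *keys,
			  size_t n, int *status)
{
	size_t i, ok = 0;

	assert(t && keys && status);
	for (i = 0; i < n; i++) {
		status[i] = table_add(t, keys[i]);
		if (status[i] == 0)
			ok++;
	}
	return ok;
}
```

Feeding it ten keys where #3 and #5 repeat earlier keys and the table
holds six entries reproduces the outcome described: #3 and #5 fail as
duplicates, #9 and #10 fail for lack of space, and the rest go in.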
>> Ah, Jamal, look again at patches 13-17/17 in last v3 set.  That was a
>> big steaming snickerdoodle just for you!  Now you can push policy
>> knobs down to port driver and or bridge to fine tune what ever you
>> want.  You'll find knobs for learning, flooding, learning sync to hw,
>> etc.  I thought you even ACKed some of these.
>
>
> I think it almost there.
> What you are missing is the policy decision to only sync when i
> say so. Having an ndo_ops is a necessity but i dont want the driver
> to decide for me just because it can ;->
> Telling hardware to learn is instructing it to self update its entries
> based on source lookup failure. That is distinctly different from
> telling to sync to the kernel. So if you add that knob we are in good
> shape.

It's there: IFLA_BRPORT_LEARNING_SYNC.  From iproute2:

$ bridge -d link show dev swp1
2: swp1 state UNKNOWN : <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
master br0 state forwarding priority 32 cost 2
    hairpin off guard off root_block off fastleave off learning off flood off
2: swp1 state UNKNOWN : <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0
    learning on learning_sync on hwmode swdev

Turn it off:

$ bridge link set dev swp1 hwmode swdev learning_sync off

And now:

$ bridge -d link show dev swp1
2: swp1 state UNKNOWN : <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
master br0 state forwarding priority 32 cost 2
    hairpin off guard off root_block off fastleave off learning off flood off
2: swp1 state UNKNOWN : <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0
    learning on learning_sync off hwmode swdev


> cheers,
> jamal
>
>
>> a) above knob is 14/17
>> patch, b) above is using existing learning knob on bridge, c) above I
>> don't get...no point in syncing that direction.
>>
>


* Re: [PATCH V1 net-next 1/2] pgtable: Add API to query if write combining is available
From: Or Gerlitz @ 2014-11-27  6:48 UTC (permalink / raw)
  To: Moshe Lazer
  Cc: David Miller, Or Gerlitz, Jack Morgenstein, talal@mellanox.com,
	Yevgeny Petrilin, Linux Netdev List, Amir Vadai, moshel
In-Reply-To: <543A4FE4.7010807@dev.mellanox.co.il>

On Sun, Oct 12, 2014 at 11:54 AM, Moshe Lazer <moshel@dev.mellanox.co.il> wrote:
>
> On 10/8/2014 7:24 PM, David Miller wrote:
>>
>> From: Moshe Lazer <moshel@dev.mellanox.co.il>
>> Date: Wed, 08 Oct 2014 11:44:57 +0300
>>
>>>> #if defined(__i386__) || defined(__x86_64__)
>>>>         if (map->type == _DRM_REGISTERS && !(map->flags &
>>>> _DRM_WRITE_COMBINING))
>>>>                 tmp = pgprot_noncached(tmp);
>>>>         else
>>>>                 tmp = pgprot_writecombine(tmp);
>>>> #elif defined(__powerpc__)
>>>>         pgprot_val(tmp) |= _PAGE_NO_CACHE;
>>>>         if (map->type == _DRM_REGISTERS)
>>>>                 pgprot_val(tmp) |= _PAGE_GUARDED;
>>>> #elif defined(__ia64__)
>>>>         if (efi_range_is_wc(vma->vm_start, vma->vm_end -
>>>>                                     vma->vm_start))
>>>>                 tmp = pgprot_writecombine(tmp);
>>>>         else
>>>>                 tmp = pgprot_noncached(tmp);
>>>> #elif defined(__sparc__) || defined(__arm__) || defined(__mips__)
>>>>         tmp = pgprot_noncached(tmp);
>>>> #endif
>>>
>>> The idea was to provide an indication as for whether the arch supports
>>> write-combining in general.
>>> If we want to benefit from blue flame operations, we need to map the
>>> blue flame registers as write combining - otherwise there is no
>>> benefit. So we would like to know if write combining is supported by
>>> the system or not.
>>>
>> You completely miss my point.  On a given architecture it might be
>> _illegal_ to map certain address ranges as write-combining without
>> checks like the ones above that ia64 needs.
>>
>> Therefore your proposed interface is by definition insufficient.
>
> Thanks David, I'll try to clarify my point.
> For me the writecombine_available() is a way to know if the
> pgprot_writecombine() is effective or just a cover for a call to
> pgprot_noncached().
> I want to use writecombine_available() regardless of the mapping
> address.
> For example in mlx4 query_device I want to indicate that blue-flame is not
> supported if  `writecombine_available() ==  false`.
> In this case we don't have the mapping address yet.
>
> Later on if an arch has write-combining (writecombine_available() ==  true)
> when we try to map the blue-flame registers (in mlx4_ib_mmap):
>
>     vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
>
>     if (io_remap_pfn_range(vma, vma->vm_start,
>         to_mucontext(context)->uar.pfn +
>         dev->dev->caps.num_uars,
>         PAGE_SIZE, vma->vm_page_prot))
>             return -EAGAIN;
>
> I can be sure that pgprot_writecombine() is not a cover for
> pgprot_noncached().
> The address checks that you mentioned should be part of io_remap_pfn_range,
> this function should fail if the vma->vm_start is not compatible to the
> vma->vm_page_prot.
> Please let me know if I misunderstood something.


Hi Dave,

Pinging you... could you respond to Moshe's email, which hopefully
addresses your comments?

Or.


* [PATCH net-next] vhost: remove unnecessary forward declarations in vhost.h
From: Jason Wang @ 2014-11-27  6:41 UTC (permalink / raw)
  To: mst, virtualization, netdev, linux-kernel; +Cc: kvm

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.h | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 3eda654..7d039ef 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -12,8 +12,6 @@
 #include <linux/virtio_ring.h>
 #include <linux/atomic.h>
 
-struct vhost_device;
-
 struct vhost_work;
 typedef void (*vhost_work_fn_t)(struct vhost_work *work);
 
@@ -54,8 +52,6 @@ struct vhost_log {
 	u64 len;
 };
 
-struct vhost_virtqueue;
-
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
-- 
1.9.1


* [PATCH net-next V3] tun/macvtap: use consume_skb() instead of kfree_skb() when needed
From: Jason Wang @ 2014-11-27  6:36 UTC (permalink / raw)
  To: davem, netdev, linux-kernel; +Cc: mst, Jason Wang, Eric Dumazet

To be more friendly with drop monitor, we should only call kfree_skb() when
the packets were dropped and use consume_skb() in other cases.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
Changes from V2:
- use unlikely() when necessary
Changes from V1:
- check the return value of tun/macvtap_put_user()
---
 drivers/net/macvtap.c | 5 ++++-
 drivers/net/tun.c     | 5 ++++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 42a80d3..86f6bf8 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -862,7 +862,10 @@ static ssize_t macvtap_do_read(struct macvtap_queue *q,
 		}
 		iov_iter_init(&iter, READ, iv, segs, len);
 		ret = macvtap_put_user(q, skb, &iter);
-		kfree_skb(skb);
+		if (unlikely(ret < 0))
+			kfree_skb(skb);
+		else
+			consume_skb(skb);
 		break;
 	}
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index ac53a73..82a9bf0 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1363,7 +1363,10 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 
 	iov_iter_init(&iter, READ, iv, segs, len);
 	ret = tun_put_user(tun, tfile, skb, &iter);
-	kfree_skb(skb);
+	if (unlikely(ret < 0))
+		kfree_skb(skb);
+	else
+		consume_skb(skb);
 
 	return ret;
 }
-- 
1.9.1
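
For readers unfamiliar with the distinction, the branch added by the patch
can be modelled in userspace as below. This is only a sketch: the counters
stand in for what a drop monitor would observe, and struct release_stats
and account_release are names invented for this example.

```c
#include <assert.h>

/* Counters standing in for what a drop monitor would observe. */
struct release_stats {
	unsigned long dropped;	/* would be freed with kfree_skb() */
	unsigned long consumed;	/* would be freed with consume_skb() */
};

/*
 * Model of the patched branch: a negative return from the put_user
 * step counts as a drop, anything else as normal consumption.
 */
void account_release(struct release_stats *st, long ret)
{
	assert(st);
	if (ret < 0)
		st->dropped++;
	else
		st->consumed++;
}
```

Only the failed copy shows up as a drop; successful reads no longer do.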


* Re: [patch net-next v2 05/10] rocker: introduce rocker switch driver
From: Florian Fainelli @ 2014-11-27 14:09 UTC (permalink / raw)
  To: Jiri Pirko, Thomas Graf
  Cc: John Fastabend, netdev, davem, nhorman, andy, dborkman, ogerlitz,
	jesse, pshelar, azhou, ben, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, john.r.fastabend, edumazet, jhs, sfeldma, roopa,
	linville, jasowang, ebiederm, nicolas.dichtel, ryazanov.s.a,
	buytenh, aviadr, nbd, alexei.starovoitov, Neil.Jerram, ronye,
	simon.horman, alexander.h.duyck, john.ronciak, mleitner, shrijeet,
	gospo, bcr
In-Reply-To: <20141111154017.GH1825@nanopsycho.lan>

Le 11/11/2014 07:40, Jiri Pirko a écrit :
> Tue, Nov 11, 2014 at 04:32:32PM CET, tgraf@suug.ch wrote:
>> On 11/11/14 at 04:19pm, Jiri Pirko wrote:
>>> Tue, Nov 11, 2014 at 03:29:46PM CET, tgraf@suug.ch wrote:
>>>> On 11/10/14 at 02:04pm, John Fastabend wrote:
>>>>> On 11/09/2014 02:51 AM, Jiri Pirko wrote:
>>>>>> +static int rocker_port_sw_parent_id_get(struct net_device *dev,
>>>>>> +					struct netdev_phys_item_id *psid)
>>>>>> +{
>>>>>> +	struct rocker_port *rocker_port = netdev_priv(dev);
>>>>>> +	struct rocker *rocker = rocker_port->rocker;
>>>>>> +
>>>>>
>>>>> hmm looks like you read this out of a magic switch register :) but
>>>>> my switch doesn't have this magic reg. I suppose the switch MAC address
>>>>> should work.
>>>>
>>>> This needs more work afterwards. Either we define that the switch ID
>>>> is only unique in combination with the parent ifindex or we need to
>>>> introduce a notion of uniqueness into the switch ID itself.
>>>
>>> This is something similar to physical port id. Each driver should take
>>> care of generating that id.
>>
>> If the ID is only unique within a driver, then the user space cannot
>> rely on using the ID to group switch ports. Multiple drivers might
>> come up with the same ID.
> 
> Well, as I said, it is the same as for physical port id. But if needed,
> there can be added some simple mechanism for the id registration
> ensuring their uniqueness.

We could use the idr/ida subsystem to provide a globally unique id per
switch device that gets registered. Ultimately, I suspect that a
management application might want to get some sense of the topology by
exploiting some unique HW properties such as:

- MDIO bus address for MDIO-connected switches
- SPI chip-select address
- GPIO(s) used to connect
- PCI bus/slot

These are also unique by design, and to them we can add any revision/OUI
register that is available to the driver. I can't think of a good way to
hash that to produce a unique identifier, but maybe we can use that
information somehow.
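
One way to sidestep the hashing question is to pack the properties into a
fixed-width ID instead of hashing them. The following is only a userspace
sketch of that idea: the bus kinds, the 8-byte layout, and the names
sw_id_make/sw_id_equal are assumptions made for illustration, not an
existing kernel interface.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical bus kinds; illustrative names, not kernel API. */
enum sw_bus_kind {
	SW_BUS_MDIO = 1,
	SW_BUS_SPI  = 2,
	SW_BUS_GPIO = 3,
	SW_BUS_PCI  = 4,
};

#define SW_ID_LEN 8

struct sw_id {
	uint8_t bytes[SW_ID_LEN];
};

/* Pack bus kind, bus instance and address on that bus into a fixed-width
 * ID. Layout: byte 0 = kind, bytes 1-2 = bus, bytes 3-6 = addr, byte 7 = 0.
 * Two devices get the same ID only if all three inputs match.
 */
static struct sw_id sw_id_make(enum sw_bus_kind kind, uint16_t bus,
			       uint32_t addr)
{
	struct sw_id id;

	memset(&id, 0, sizeof(id));
	id.bytes[0] = (uint8_t)kind;
	id.bytes[1] = (uint8_t)(bus >> 8);
	id.bytes[2] = (uint8_t)bus;
	id.bytes[3] = (uint8_t)(addr >> 24);
	id.bytes[4] = (uint8_t)(addr >> 16);
	id.bytes[5] = (uint8_t)(addr >> 8);
	id.bytes[6] = (uint8_t)addr;
	return id;
}

/* Byte-wise comparison, the way user space might group ports by ID. */
static int sw_id_equal(const struct sw_id *a, const struct sw_id *b)
{
	return memcmp(a->bytes, b->bytes, SW_ID_LEN) == 0;
}
```

Because the inputs are concatenated rather than hashed, an MDIO switch at
address 0x1e and an SPI switch on chip-select 0x1e end up with different
IDs. Whether such a layout would fit every bus type is an open question.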

> 
>>
>> Even now, multiple rocker instances would have the same ID.
> 
> It depends on what hw returns to driver.
> 

^ permalink raw reply

* Re: [PATCH] e1000: remove unused variables
From: Hisashi T Fujinaka @ 2014-11-27  5:59 UTC (permalink / raw)
  To: Sudip Mukherjee
  Cc: Jeff Kirsher, Jesse Brandeburg, Bruce Allan, Carolyn Wyborny,
	Don Skidmore, Greg Rose, Matthew Vick, John Ronciak,
	Mitch Williams, Linux NICS, e1000-devel, netdev, linux-kernel
In-Reply-To: <1417065728-5592-1-git-send-email-sudipm.mukherjee@gmail.com>

I'm pretty sure those double reads are there for a reason, so most of
this I'm going to have to check on Monday. We have a long holiday
weekend here in the US.

I'm not sure why you're bothering with an old driver like this, but if
you haven't actually tried this on all the hardware it pertains to, I'm
going to want to NAK this.
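
To illustrate why a read whose result is discarded can still matter, here
is a small userspace model of a clear-on-read register, in the spirit of
the "Clear any pending interrupt events" read in the quoted patch. The
struct and function names are made up for this sketch, and the real
register behaviour has to be confirmed against the hardware documentation.

```c
#include <stdint.h>

/* Simulated device with one clear-on-read register (hypothetical model). */
struct fake_dev {
	uint32_t icr;	/* pending interrupt cause bits */
};

/* Reading returns the pending bits and, as a side effect, clears them. */
static uint32_t fake_read_icr(struct fake_dev *dev)
{
	uint32_t val = dev->icr;

	dev->icr = 0;
	return val;
}

/* Clear pending events: the read itself is kept, only the value is
 * dropped. Removing the call would leave the bits set.
 */
static void fake_clear_pending(struct fake_dev *dev)
{
	(void)fake_read_icr(dev);
}
```

In this model, dropping the unused variable is harmless as long as the
read stays. Dropping the read changes device state.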

I should do this from my todd.fujinaka@intel.com account but it's 10PM
on the first day of a long holiday weekend.

On Thu, 27 Nov 2014, Sudip Mukherjee wrote:

> these variables were only being assigned some values, but were never
> used.
>
> Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
> ---
> drivers/net/ethernet/intel/e1000/e1000_hw.c   | 142 ++++++++++++--------------
> drivers/net/ethernet/intel/e1000/e1000_main.c |   3 -
> 2 files changed, 66 insertions(+), 79 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_hw.c b/drivers/net/ethernet/intel/e1000/e1000_hw.c
> index 45c8c864..7812f59 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_hw.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_hw.c
> @@ -154,7 +154,6 @@ static s32 e1000_set_phy_type(struct e1000_hw *hw)
>  */
> static void e1000_phy_init_script(struct e1000_hw *hw)
> {
> -	u32 ret_val;
> 	u16 phy_saved_data;
>
> 	if (hw->phy_init_script) {
> @@ -163,7 +162,7 @@ static void e1000_phy_init_script(struct e1000_hw *hw)
> 		/* Save off the current value of register 0x2F5B to be restored
> 		 * at the end of this routine.
> 		 */
> -		ret_val = e1000_read_phy_reg(hw, 0x2F5B, &phy_saved_data);
> +		e1000_read_phy_reg(hw, 0x2F5B, &phy_saved_data);
>
> 		/* Disabled the PHY transmitter */
> 		e1000_write_phy_reg(hw, 0x2F5B, 0x0003);
> @@ -402,7 +401,6 @@ s32 e1000_reset_hw(struct e1000_hw *hw)
> {
> 	u32 ctrl;
> 	u32 ctrl_ext;
> -	u32 icr;
> 	u32 manc;
> 	u32 led_ctrl;
> 	s32 ret_val;
> @@ -527,7 +525,7 @@ s32 e1000_reset_hw(struct e1000_hw *hw)
> 	ew32(IMC, 0xffffffff);
>
> 	/* Clear any pending interrupt events. */
> -	icr = er32(ICR);
> +	er32(ICR);
>
> 	/* If MWI was previously enabled, reenable it. */
> 	if (hw->mac_type == e1000_82542_rev2_0) {
> @@ -2396,16 +2394,13 @@ static s32 e1000_check_for_serdes_link_generic(struct e1000_hw *hw)
>  */
> s32 e1000_check_for_link(struct e1000_hw *hw)
> {
> -	u32 rxcw = 0;
> -	u32 ctrl;
> 	u32 status;
> 	u32 rctl;
> 	u32 icr;
> -	u32 signal = 0;
> 	s32 ret_val;
> 	u16 phy_data;
>
> -	ctrl = er32(CTRL);
> +	er32(CTRL);
> 	status = er32(STATUS);
>
> 	/* On adapters with a MAC newer than 82544, SW Definable pin 1 will be
> @@ -2414,12 +2409,9 @@ s32 e1000_check_for_link(struct e1000_hw *hw)
> 	 */
> 	if ((hw->media_type == e1000_media_type_fiber) ||
> 	    (hw->media_type == e1000_media_type_internal_serdes)) {
> -		rxcw = er32(RXCW);
> +		er32(RXCW);
>
> 		if (hw->media_type == e1000_media_type_fiber) {
> -			signal =
> -			    (hw->mac_type >
> -			     e1000_82544) ? E1000_CTRL_SWDPIN1 : 0;
> 			if (status & E1000_STATUS_LU)
> 				hw->get_link_status = false;
> 		}
> @@ -4698,78 +4690,76 @@ s32 e1000_led_off(struct e1000_hw *hw)
>  */
> static void e1000_clear_hw_cntrs(struct e1000_hw *hw)
> {
> -	volatile u32 temp;
> -
> -	temp = er32(CRCERRS);
> -	temp = er32(SYMERRS);
> -	temp = er32(MPC);
> -	temp = er32(SCC);
> -	temp = er32(ECOL);
> -	temp = er32(MCC);
> -	temp = er32(LATECOL);
> -	temp = er32(COLC);
> -	temp = er32(DC);
> -	temp = er32(SEC);
> -	temp = er32(RLEC);
> -	temp = er32(XONRXC);
> -	temp = er32(XONTXC);
> -	temp = er32(XOFFRXC);
> -	temp = er32(XOFFTXC);
> -	temp = er32(FCRUC);
> -
> -	temp = er32(PRC64);
> -	temp = er32(PRC127);
> -	temp = er32(PRC255);
> -	temp = er32(PRC511);
> -	temp = er32(PRC1023);
> -	temp = er32(PRC1522);
> -
> -	temp = er32(GPRC);
> -	temp = er32(BPRC);
> -	temp = er32(MPRC);
> -	temp = er32(GPTC);
> -	temp = er32(GORCL);
> -	temp = er32(GORCH);
> -	temp = er32(GOTCL);
> -	temp = er32(GOTCH);
> -	temp = er32(RNBC);
> -	temp = er32(RUC);
> -	temp = er32(RFC);
> -	temp = er32(ROC);
> -	temp = er32(RJC);
> -	temp = er32(TORL);
> -	temp = er32(TORH);
> -	temp = er32(TOTL);
> -	temp = er32(TOTH);
> -	temp = er32(TPR);
> -	temp = er32(TPT);
> -
> -	temp = er32(PTC64);
> -	temp = er32(PTC127);
> -	temp = er32(PTC255);
> -	temp = er32(PTC511);
> -	temp = er32(PTC1023);
> -	temp = er32(PTC1522);
> -
> -	temp = er32(MPTC);
> -	temp = er32(BPTC);
> +	er32(CRCERRS);
> +	er32(SYMERRS);
> +	er32(MPC);
> +	er32(SCC);
> +	er32(ECOL);
> +	er32(MCC);
> +	er32(LATECOL);
> +	er32(COLC);
> +	er32(DC);
> +	er32(SEC);
> +	er32(RLEC);
> +	er32(XONRXC);
> +	er32(XONTXC);
> +	er32(XOFFRXC);
> +	er32(XOFFTXC);
> +	er32(FCRUC);
> +
> +	er32(PRC64);
> +	er32(PRC127);
> +	er32(PRC255);
> +	er32(PRC511);
> +	er32(PRC1023);
> +	er32(PRC1522);
> +
> +	er32(GPRC);
> +	er32(BPRC);
> +	er32(MPRC);
> +	er32(GPTC);
> +	er32(GORCL);
> +	er32(GORCH);
> +	er32(GOTCL);
> +	er32(GOTCH);
> +	er32(RNBC);
> +	er32(RUC);
> +	er32(RFC);
> +	er32(ROC);
> +	er32(RJC);
> +	er32(TORL);
> +	er32(TORH);
> +	er32(TOTL);
> +	er32(TOTH);
> +	er32(TPR);
> +	er32(TPT);
> +
> +	er32(PTC64);
> +	er32(PTC127);
> +	er32(PTC255);
> +	er32(PTC511);
> +	er32(PTC1023);
> +	er32(PTC1522);
> +
> +	er32(MPTC);
> +	er32(BPTC);
>
> 	if (hw->mac_type < e1000_82543)
> 		return;
>
> -	temp = er32(ALGNERRC);
> -	temp = er32(RXERRC);
> -	temp = er32(TNCRS);
> -	temp = er32(CEXTERR);
> -	temp = er32(TSCTC);
> -	temp = er32(TSCTFC);
> +	er32(ALGNERRC);
> +	er32(RXERRC);
> +	er32(TNCRS);
> +	er32(CEXTERR);
> +	er32(TSCTC);
> +	er32(TSCTFC);
>
> 	if (hw->mac_type <= e1000_82544)
> 		return;
>
> -	temp = er32(MGTPRC);
> -	temp = er32(MGTPDC);
> -	temp = er32(MGTPTC);
> +	er32(MGTPRC);
> +	er32(MGTPDC);
> +	er32(MGTPTC);
> }
>
> /**
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
> index 24f3986..a70ea46 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -2443,7 +2443,6 @@ static void e1000_watchdog(struct work_struct *work)
> 	if (link) {
> 		if (!netif_carrier_ok(netdev)) {
> 			u32 ctrl;
> -			bool txb2b = true;
> 			/* update snapshot of PHY registers on LSC */
> 			e1000_get_speed_and_duplex(hw,
> 						   &adapter->link_speed,
> @@ -2465,11 +2464,9 @@ static void e1000_watchdog(struct work_struct *work)
> 			adapter->tx_timeout_factor = 1;
> 			switch (adapter->link_speed) {
> 			case SPEED_10:
> -				txb2b = false;
> 				adapter->tx_timeout_factor = 16;
> 				break;
> 			case SPEED_100:
> -				txb2b = false;
> 				/* maybe add some timeout factor ? */
> 				break;
> 			}
>

-- 
Hisashi T Fujinaka - htodd@twofifty.com
BSEE + BSChem + BAEnglish + MSCS + $2.50 = coffee

^ permalink raw reply

* Re: [patch net-next v3 04/17] net: introduce generic switch devices support
From: Scott Feldman @ 2014-11-27  5:58 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Thomas Graf, Jiri Pirko, Netdev, David S. Miller,
	nhorman@tuxdriver.com, Andy Gospodarek, dborkman@redhat.com,
	ogerlitz@mellanox.com, jesse@nicira.com, pshelar@nicira.com,
	azhou@nicira.com, ben@decadent.org.uk, stephen@networkplumber.org,
	Kirsher, Jeffrey T, vyasevic@redhat.com, Cong Wang,
	Fastabend, John R, Eric Dumazet, Florian Fainelli, Roopa Prabhu,
	John Linville
In-Reply-To: <5475BB53.3070200@mojatatu.com>

On Wed, Nov 26, 2014 at 1:36 AM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 11/25/14 23:18, Scott Feldman wrote:
>>
>> On Tue, Nov 25, 2014 at 5:33 PM, Jamal Hadi Salim <jhs@mojatatu.com>
>> wrote:
>
>
>>
>> You have a pointer to the kernel driver for that HW?
>
>
> I wasn't sure if that was a passive aggressive move there to
> question what I am claiming? (Only Canadians are allowed to be
> passive aggressive, Scott.) To answer your question: no, the
> code is currently littered with vendor SDK, unfortunately (as you
> would know!).

Drats, I was hoping there might be Open Source here.  I'm actually not
familiar with Netronome offerings.  I went to their web page and all
their Docs downloads require registration, so I should have guessed
same-old-same-old.  But you teased us with it, so I thought I would
ask.  Sorry for the trouble XOXOXOXO.  I'm not Canadian, as far as I
know.

> But hopefully if we get these changes in correctly it would
> not be hard to show the driver working fully in the kernel.
> There are definitely a few other pieces of hardware that are
> making me come back here and invest time and effort in these
> long discussions.

You have access to the inside scoop.  We don't.  Ok, I don't.  We
(think we) know what the traditional L2/L3 and OVS-style flow stuff
looks like, but you know more, and you can't show us in code, so it's
frustrating.  Not your fault.  Just continue to guide us and give some
disclaimer when we're close to some proprietary knowledge that is
relevant to the discussion.


>> Can you show how
>> you're using Linux tc netlink msg in kernel to program HW?  I'd like
>> to see the in-kernel API.
>>
>
> Let's do the L2/port thing first. But yes, I am using Linux tc in
> kernel.
>
> cheers,
> jamal

^ permalink raw reply

* Re: [PATCH v2 11/19] selftests/memory-hotplug: add install target to enable installing test
From: Masami Hiramatsu @ 2014-11-27  5:49 UTC (permalink / raw)
  To: Shuah Khan
  Cc: gregkh, akpm, mmarek, davem, keescook, tranmanphong, dh.herrmann,
	hughd, bobby.prani, ebiederm, serge.hallyn, linux-kbuild,
	linux-kernel, linux-api, netdev, yrl.pp-manager.tt@hitachi.com
In-Reply-To: <8cdb2abb6eaa794801548e04ee6b9f403778e126.1415735831.git.shuahkh@osg.samsung.com>

(2014/11/12 5:27), Shuah Khan wrote:
> Add a new make target to enable installing the test. This target
> installs the test in the kselftest install location and adds a command
> to the kselftest script to run the test. The install target can be run
> only from the top level source dir.
> 
> Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
> ---
>  tools/testing/selftests/memory-hotplug/Makefile    |  17 +-
>  .../selftests/memory-hotplug/mem-on-off-test.sh    | 238 +++++++++++++++++++++
>  .../selftests/memory-hotplug/on-off-test.sh        | 238 ---------------------
>  3 files changed, 253 insertions(+), 240 deletions(-)
>  create mode 100644 tools/testing/selftests/memory-hotplug/mem-on-off-test.sh
>  delete mode 100644 tools/testing/selftests/memory-hotplug/on-off-test.sh
> 
> diff --git a/tools/testing/selftests/memory-hotplug/Makefile b/tools/testing/selftests/memory-hotplug/Makefile
> index d46b8d4..8921631 100644
> --- a/tools/testing/selftests/memory-hotplug/Makefile
> +++ b/tools/testing/selftests/memory-hotplug/Makefile
> @@ -1,9 +1,22 @@
> +TEST_STR=/bin/bash ./mem-on-off-test.sh -r 2 || echo memory-hotplug selftests: [FAIL]
> +
>  all:
>  
> +install:
> +ifdef INSTALL_KSFT_PATH
> +	install ./mem-on-off-test.sh $(INSTALL_KSFT_PATH)/mem-on-off-test.sh
> +	@echo echo Start memory hotplug test .... >> $(KSELFTEST)
> +	@echo "$(TEST_STR)" >> $(KSELFTEST) >> $(KSELFTEST)
> +	@echo echo End memory hotplug test .... >> $(KSELFTEST)
> +	@echo echo ============================== >> $(KSELFTEST)
> +else
> +	@echo Run make kselftest_install in top level source directory
> +endif

I saw this pattern repeated many times in this series.
Can we make it a macro and include it instead of repeating this code?

Thank you,

-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com



^ permalink raw reply

* Re: [PATCH v2 03/19] selftests: add install target to enable installing selftests
From: Masami Hiramatsu @ 2014-11-27  5:45 UTC (permalink / raw)
  To: Shuah Khan
  Cc: gregkh, akpm, mmarek, davem, keescook, tranmanphong, dh.herrmann,
	hughd, bobby.prani, ebiederm, serge.hallyn, linux-kbuild,
	linux-kernel, linux-api, netdev, yrl.pp-manager.tt@hitachi.com
In-Reply-To: <5e33b40696debde24fd04f6711e795a8860b7aee.1415735831.git.shuahkh@osg.samsung.com>

(2014/11/12 5:27), Shuah Khan wrote:
> Add a new make target to enable installing selftests. This
> new target will call install targets for the tests that are
> specified in INSTALL_TARGETS. During install, a script is
> generated to run tests that are installed. This script will
> be installed in the selftest install directory. Individual
> test Makefiles are changed to add to the script. This will
> allow new tests to add install and run test commands to the
> generated kselftest script. run_tests target runs the
> generated kselftest script to run tests when it is initiated
> from "make kselftest" from top level source directory.
> 
> Approach:
> 
> make kselftest_target:
> -- exports kselftest INSTALL_KSFT_PATH
>    default $(INSTALL_MOD_PATH)/lib/kselftest/$(KERNELRELEASE)
> -- exports path for kselftest.sh
> -- runs selftests make install target:
> 
> selftests make install target
> -- creates kselftest.sh script in install dir
> -- runs install targets for all INSTALL_TARGETS
>    (Note: ftrace and powerpc aren't included in INSTALL_TARGETS,
>           to not add more content to patch v1 series. This work
>           will happen soon. In this series these two targets are
>           run after running the generated kselftest script, without
>           any regression in the way these tests are run with
>           "make kselftest" prior to this work.)
> -- install target can be run only from top level source dir.
> 
> Individual test make install targets:
> -- install test programs and/or scripts in install dir
> -- append to the kselftest.sh file to add commands to run test
> -- install target can be run only from top level source dir.
> 
> Adds the following new ways to initiate selftests:
> -- Installing and running kselftest from install directory
>    by running  "make kselftest"
> -- Running kselftest script from install directory
> 
> Maintains the following ways to run tests:
> -- make -C tools/testing/selftests run_tests
> -- make -C tools/testing/selftests TARGETS=target run_tests
>    Ability to specify targets: e.g. TARGETS=net
> -- make run_tests from tools/testing/selftests
> -- make run_tests from individual test directories:
>    e.g: make run_tests in tools/testing/selftests/breakpoints
> 
> Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
> ---
>  tools/testing/selftests/Makefile | 31 ++++++++++++++++++++++++++++++-
>  1 file changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
> index 45f145c..b9bdc1d 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -16,6 +16,10 @@ TARGETS += sysctl
>  TARGETS += firmware
>  TARGETS += ftrace
>  
> +INSTALL_TARGETS = breakpoints cpu-hotplug efivarfs firmware ipc
> +INSTALL_TARGETS += kcmp memfd memory-hotplug mqueue mount net
> +INSTALL_TARGETS += ptrace sysctl timers user vm
> +
>  TARGETS_HOTPLUG = cpu-hotplug
>  TARGETS_HOTPLUG += memory-hotplug
>  

I think KSELFTEST itself should be defined here, since that is not
a parameter.

> @@ -24,10 +28,35 @@ all:
>  		make -C $$TARGET; \
>  	done;
>  
> -run_tests: all
> +install:
> +ifdef INSTALL_KSFT_PATH
> +	make all
> +	@echo #!/bin/sh\n# Kselftest Run Tests .... >> $(KSELFTEST)
> +	@echo # This file is generated during kselftest_install >> $(KSELFTEST)
> +	@echo # Please don't change it !!\n  >> $(KSELFTEST)
> +	@echo echo ============================== >> $(KSELFTEST)
> +	for TARGET in $(INSTALL_TARGETS); do \
> +		echo Installing $$TARGET; \
> +		make -C $$TARGET install; \

Please pass O= option and others here.

> +	done;
> +	chmod +x $(KSELFTEST)
> +else
> +	@echo Run make kselftest_install in top level source directory
> +endif
> +
> +run_tests:
> +ifdef INSTALL_KSFT_PATH
> +	@cd $(INSTALL_KSFT_PATH); ./kselftest.sh; cd -

We'd better use some macro instead of ./kselftest.sh?

Thank you,

> +# TODO: include ftrace and powerpc in install targets
> +	for TARGET in ftrace powerpc; do \
> +		make -C $$TARGET run_tests; \
> +	done;
> +else
> +	make all
>  	for TARGET in $(TARGETS); do \
>  		make -C $$TARGET run_tests; \
>  	done;
> +endif
>  
>  hotplug:
>  	for TARGET in $(TARGETS_HOTPLUG); do \
> 


-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com



^ permalink raw reply

* [PATCH net-next 5/6] samples: bpf: trivial eBPF program in C
From: Alexei Starovoitov @ 2014-11-27  5:42 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet, linux-api, netdev,
	linux-kernel
In-Reply-To: <1417066951-1999-1-git-send-email-ast@plumgrid.com>

this example does the same task as previous socket example
in assembler, but this one does it in C.

eBPF program in kernel does:
    int index = load_byte(skb, 14 + 9); /* ip_proto */
    long *value;

    value = bpf_map_lookup_elem(&my_map, &index);
    if (value)
        __sync_fetch_and_add(value, 1);

Corresponding user space reads map[tcp], map[udp], map[icmp]
and prints protocol stats every second
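
The counting logic above can be tried out in plain userspace C. This is
only a sketch: proto_count and lookup_slot stand in for the array map and
bpf_map_lookup_elem(), and the length check is added here because a
userspace buffer has no verifier protecting it.

```c
#include <stddef.h>
#include <stdint.h>

#define IP_PROTO_OFF	(14 + 9)	/* Ethernet header + ip protocol byte */
#define NUM_SLOTS	256

/* Stand-in for the 256-entry array map. */
static long proto_count[NUM_SLOTS];

/* Stand-in for bpf_map_lookup_elem(): NULL when the key is out of range. */
static long *lookup_slot(uint32_t index)
{
	return index < NUM_SLOTS ? &proto_count[index] : NULL;
}

/* Count one frame under its IP protocol number. */
static void count_packet(const uint8_t *frame, size_t len)
{
	long *value;

	if (len <= IP_PROTO_OFF)
		return;	/* frame too short to hold the protocol byte */

	value = lookup_slot(frame[IP_PROTO_OFF]);
	if (value)
		__sync_fetch_and_add(value, 1);
}
```

After feeding it frames, proto_count[6], proto_count[17] and
proto_count[1] play the role of map[tcp], map[udp] and map[icmp].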

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 samples/bpf/Makefile       |   14 +++++++++++++
 samples/bpf/libbpf.h       |    2 +-
 samples/bpf/sockex1_kern.c |   23 +++++++++++++++++++++
 samples/bpf/sockex1_user.c |   49 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 87 insertions(+), 1 deletion(-)
 create mode 100644 samples/bpf/sockex1_kern.c
 create mode 100644 samples/bpf/sockex1_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index f46d3492d032..770d145186c3 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -4,12 +4,26 @@ obj- := dummy.o
 # List of programs to build
 hostprogs-y := test_verifier test_maps
 hostprogs-y += sock_example
+hostprogs-y += sockex1
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
 sock_example-objs := sock_example.o libbpf.o
+sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
+always += sockex1_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
+
+HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
+HOSTLOADLIBES_sockex1 += -lelf
+
+# point this to your LLVM backend with bpf support
+LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
+
+%.o: %.c
+	clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
+		-D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
+		-O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index cc62ad4d95de..58c5fe1bdba1 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -15,7 +15,7 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 		  const struct bpf_insn *insns, int insn_len,
 		  const char *license);
 
-#define LOG_BUF_SIZE 8192
+#define LOG_BUF_SIZE 65536
 extern char bpf_log_buf[LOG_BUF_SIZE];
 
 /* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
diff --git a/samples/bpf/sockex1_kern.c b/samples/bpf/sockex1_kern.c
new file mode 100644
index 000000000000..e662779467de
--- /dev/null
+++ b/samples/bpf/sockex1_kern.c
@@ -0,0 +1,23 @@
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") my_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 256,
+};
+
+SEC("socket1")
+int bpf_prog1(struct sk_buff *skb)
+{
+	int index = load_byte(skb, 14 + 9);
+	long *value;
+
+	value = bpf_map_lookup_elem(&my_map, &index);
+	if (value)
+		__sync_fetch_and_add(value, 1);
+
+	return 0;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/sockex1_user.c b/samples/bpf/sockex1_user.c
new file mode 100644
index 000000000000..34a443ff3831
--- /dev/null
+++ b/samples/bpf/sockex1_user.c
@@ -0,0 +1,49 @@
+#include <stdio.h>
+#include <assert.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+#include <unistd.h>
+#include <arpa/inet.h>
+
+int main(int ac, char **argv)
+{
+	char filename[256];
+	FILE *f;
+	int i, sock;
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	sock = open_raw_sock("lo");
+
+	assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
+			  sizeof(prog_fd[0])) == 0);
+
+	f = popen("ping -c5 localhost", "r");
+	(void) f;
+
+	for (i = 0; i < 5; i++) {
+		long long tcp_cnt, udp_cnt, icmp_cnt;
+		int key;
+
+		key = IPPROTO_TCP;
+		assert(bpf_lookup_elem(map_fd[0], &key, &tcp_cnt) == 0);
+
+		key = IPPROTO_UDP;
+		assert(bpf_lookup_elem(map_fd[0], &key, &udp_cnt) == 0);
+
+		key = IPPROTO_ICMP;
+		assert(bpf_lookup_elem(map_fd[0], &key, &icmp_cnt) == 0);
+
+		printf("TCP %lld UDP %lld ICMP %lld packets\n",
+		       tcp_cnt, udp_cnt, icmp_cnt);
+		sleep(1);
+	}
+
+	return 0;
+}
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH net-next 4/6] samples: bpf: elf_bpf file loader
From: Alexei Starovoitov @ 2014-11-27  5:42 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet, linux-api, netdev,
	linux-kernel
In-Reply-To: <1417066951-1999-1-git-send-email-ast@plumgrid.com>

simple .o parser and loader using BPF syscall.
.o is a standard ELF generated by LLVM backend

It parses elf file compiled by llvm .c->.o
- parses 'maps' section and creates maps via BPF syscall
- parses 'license' section and passes it to syscall
- parses elf relocations for BPF maps and adjusts BPF_LD_IMM64 insns
  by storing map_fd into insn->imm and marking such insns as BPF_PSEUDO_MAP_FD
- loads eBPF programs via BPF syscall

One ELF file can contain multiple BPF programs.

int load_bpf_file(char *path);
populates prog_fd[] and map_fd[] with FDs received from bpf syscall

bpf_helpers.h - helper functions available to eBPF programs written in C
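
The map relocation step can be sketched without libelf. The types below
(toy_insn, toy_map_def) are simplified stand-ins for the definitions in
<linux/bpf.h> and bpf_helpers.h, and the bounds checks are additions made
for this sketch. The loader in the patch does not perform them.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct bpf_insn. */
struct toy_insn {
	uint8_t code;
	uint8_t dst_reg:4;
	uint8_t src_reg:4;
	int16_t off;
	int32_t imm;
};

/* Simplified stand-in for struct bpf_map_def. */
struct toy_map_def {
	unsigned int type;
	unsigned int key_size;
	unsigned int value_size;
	unsigned int max_entries;
};

#define TOY_LD_IMM64		0x18	/* BPF_LD | BPF_IMM | BPF_DW */
#define TOY_PSEUDO_MAP_FD	1

/* Apply one relocation: r_offset is the byte offset of the instruction in
 * the program section, sym_value the byte offset of the map definition in
 * the 'maps' section. Returns 0 on success, -1 on a bad relocation.
 */
static int toy_apply_relo(struct toy_insn *insns, size_t n_insns,
			  uint64_t r_offset, uint64_t sym_value,
			  const int *map_fds, size_t n_maps)
{
	size_t insn_idx = r_offset / sizeof(struct toy_insn);
	size_t map_idx = sym_value / sizeof(struct toy_map_def);

	if (insn_idx >= n_insns || map_idx >= n_maps)
		return -1;
	if (insns[insn_idx].code != TOY_LD_IMM64)
		return -1;	/* only 64-bit immediate loads carry map refs */

	insns[insn_idx].src_reg = TOY_PSEUDO_MAP_FD;
	insns[insn_idx].imm = map_fds[map_idx];
	return 0;
}
```

The symbol value divided by the map definition size selects the map, and
the relocation offset divided by the instruction size selects the
instruction to patch.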

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
These helpers and the loader are done as a separate patch so that the eBPF C
examples (that follow in the next patches) can focus on demonstrating
programming of eBPF in restricted C.
---
 samples/bpf/bpf_helpers.h |   40 +++++++++
 samples/bpf/bpf_load.c    |  203 +++++++++++++++++++++++++++++++++++++++++++++
 samples/bpf/bpf_load.h    |   24 ++++++
 3 files changed, 267 insertions(+)
 create mode 100644 samples/bpf/bpf_helpers.h
 create mode 100644 samples/bpf/bpf_load.c
 create mode 100644 samples/bpf/bpf_load.h

diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
new file mode 100644
index 000000000000..ca0333146006
--- /dev/null
+++ b/samples/bpf/bpf_helpers.h
@@ -0,0 +1,40 @@
+#ifndef __BPF_HELPERS_H
+#define __BPF_HELPERS_H
+
+/* helper macro to place programs, maps, license in
+ * different sections in elf_bpf file. Section names
+ * are interpreted by elf_bpf loader
+ */
+#define SEC(NAME) __attribute__((section(NAME), used))
+
+/* helper functions called from eBPF programs written in C */
+static void *(*bpf_map_lookup_elem)(void *map, void *key) =
+	(void *) BPF_FUNC_map_lookup_elem;
+static int (*bpf_map_update_elem)(void *map, void *key, void *value,
+				  unsigned long long flags) =
+	(void *) BPF_FUNC_map_update_elem;
+static int (*bpf_map_delete_elem)(void *map, void *key) =
+	(void *) BPF_FUNC_map_delete_elem;
+
+/* llvm builtin functions that eBPF C program may use to
+ * emit BPF_LD_ABS and BPF_LD_IND instructions
+ */
+struct sk_buff;
+unsigned long long load_byte(void *skb,
+			     unsigned long long off) asm("llvm.bpf.load.byte");
+unsigned long long load_half(void *skb,
+			     unsigned long long off) asm("llvm.bpf.load.half");
+unsigned long long load_word(void *skb,
+			     unsigned long long off) asm("llvm.bpf.load.word");
+
+/* a helper structure used by eBPF C program
+ * to describe map attributes to elf_bpf loader
+ */
+struct bpf_map_def {
+	unsigned int type;
+	unsigned int key_size;
+	unsigned int value_size;
+	unsigned int max_entries;
+};
+
+#endif
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
new file mode 100644
index 000000000000..1831d236382b
--- /dev/null
+++ b/samples/bpf/bpf_load.c
@@ -0,0 +1,203 @@
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <libelf.h>
+#include <gelf.h>
+#include <errno.h>
+#include <unistd.h>
+#include <string.h>
+#include <stdbool.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include "libbpf.h"
+#include "bpf_helpers.h"
+#include "bpf_load.h"
+
+static char license[128];
+static bool processed_sec[128];
+int map_fd[MAX_MAPS];
+int prog_fd[MAX_PROGS];
+int prog_cnt;
+
+static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
+{
+	int fd;
+	bool is_socket = strncmp(event, "socket", 6) == 0;
+
+	if (!is_socket)
+		/* tracing events tbd */
+		return -1;
+
+	fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER,
+			   prog, size, license);
+
+	if (fd < 0) {
+		printf("bpf_prog_load() err=%d\n%s", errno, bpf_log_buf);
+		return -1;
+	}
+
+	prog_fd[prog_cnt++] = fd;
+
+	return 0;
+}
+
+static int load_maps(struct bpf_map_def *maps, int len)
+{
+	int i;
+
+	for (i = 0; i < len / sizeof(struct bpf_map_def); i++) {
+
+		map_fd[i] = bpf_create_map(maps[i].type,
+					   maps[i].key_size,
+					   maps[i].value_size,
+					   maps[i].max_entries);
+		if (map_fd[i] < 0)
+			return 1;
+	}
+	return 0;
+}
+
+static int get_sec(Elf *elf, int i, GElf_Ehdr *ehdr, char **shname,
+		   GElf_Shdr *shdr, Elf_Data **data)
+{
+	Elf_Scn *scn;
+
+	scn = elf_getscn(elf, i);
+	if (!scn)
+		return 1;
+
+	if (gelf_getshdr(scn, shdr) != shdr)
+		return 2;
+
+	*shname = elf_strptr(elf, ehdr->e_shstrndx, shdr->sh_name);
+	if (!*shname || !shdr->sh_size)
+		return 3;
+
+	*data = elf_getdata(scn, 0);
+	if (!*data || elf_getdata(scn, *data) != NULL)
+		return 4;
+
+	return 0;
+}
+
+static int parse_relo_and_apply(Elf_Data *data, Elf_Data *symbols,
+				GElf_Shdr *shdr, struct bpf_insn *insn)
+{
+	int i, nrels;
+
+	nrels = shdr->sh_size / shdr->sh_entsize;
+
+	for (i = 0; i < nrels; i++) {
+		GElf_Sym sym;
+		GElf_Rel rel;
+		unsigned int insn_idx;
+
+		gelf_getrel(data, i, &rel);
+
+		insn_idx = rel.r_offset / sizeof(struct bpf_insn);
+
+		gelf_getsym(symbols, GELF_R_SYM(rel.r_info), &sym);
+
+		if (insn[insn_idx].code != (BPF_LD | BPF_IMM | BPF_DW)) {
+			printf("invalid relo for insn[%d].code 0x%x\n",
+			       insn_idx, insn[insn_idx].code);
+			return 1;
+		}
+		insn[insn_idx].src_reg = BPF_PSEUDO_MAP_FD;
+		insn[insn_idx].imm = map_fd[sym.st_value / sizeof(struct bpf_map_def)];
+	}
+
+	return 0;
+}
+
+int load_bpf_file(char *path)
+{
+	int fd, i;
+	Elf *elf;
+	GElf_Ehdr ehdr;
+	GElf_Shdr shdr, shdr_prog;
+	Elf_Data *data, *data_prog, *symbols = NULL;
+	char *shname, *shname_prog;
+
+	if (elf_version(EV_CURRENT) == EV_NONE)
+		return 1;
+
+	fd = open(path, O_RDONLY, 0);
+	if (fd < 0)
+		return 1;
+
+	elf = elf_begin(fd, ELF_C_READ, NULL);
+
+	if (!elf)
+		return 1;
+
+	if (gelf_getehdr(elf, &ehdr) != &ehdr)
+		return 1;
+
+	/* scan over all elf sections to get license and map info */
+	for (i = 1; i < ehdr.e_shnum; i++) {
+
+		if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
+			continue;
+
+		if (0) /* helpful for llvm debugging */
+			printf("section %d:%s data %p size %zd link %d flags %d\n",
+			       i, shname, data->d_buf, data->d_size,
+			       shdr.sh_link, (int) shdr.sh_flags);
+
+		if (strcmp(shname, "license") == 0) {
+			processed_sec[i] = true;
+			memcpy(license, data->d_buf, data->d_size);
+		} else if (strcmp(shname, "maps") == 0) {
+			processed_sec[i] = true;
+			if (load_maps(data->d_buf, data->d_size))
+				return 1;
+		} else if (shdr.sh_type == SHT_SYMTAB) {
+			symbols = data;
+		}
+	}
+
+	/* load programs that need map fixup (relocations) */
+	for (i = 1; i < ehdr.e_shnum; i++) {
+
+		if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
+			continue;
+		if (shdr.sh_type == SHT_REL) {
+			struct bpf_insn *insns;
+
+			if (get_sec(elf, shdr.sh_info, &ehdr, &shname_prog,
+				    &shdr_prog, &data_prog))
+				continue;
+
+			insns = (struct bpf_insn *) data_prog->d_buf;
+
+			processed_sec[shdr.sh_info] = true;
+			processed_sec[i] = true;
+
+			if (parse_relo_and_apply(data, symbols, &shdr, insns))
+				continue;
+
+			if (memcmp(shname_prog, "events/", 7) == 0 ||
+			    memcmp(shname_prog, "socket", 6) == 0)
+				load_and_attach(shname_prog, insns, data_prog->d_size);
+		}
+	}
+
+	/* load programs that don't use maps */
+	for (i = 1; i < ehdr.e_shnum; i++) {
+
+		if (processed_sec[i])
+			continue;
+
+		if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
+			continue;
+
+		if (memcmp(shname, "events/", 7) == 0 ||
+		    memcmp(shname, "socket", 6) == 0)
+			load_and_attach(shname, data->d_buf, data->d_size);
+	}
+
+	close(fd);
+	return 0;
+}
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
new file mode 100644
index 000000000000..27789a34f5e6
--- /dev/null
+++ b/samples/bpf/bpf_load.h
@@ -0,0 +1,24 @@
+#ifndef __BPF_LOAD_H
+#define __BPF_LOAD_H
+
+#define MAX_MAPS 32
+#define MAX_PROGS 32
+
+extern int map_fd[MAX_MAPS];
+extern int prog_fd[MAX_PROGS];
+
+/* parses elf file compiled by llvm .c->.o
+ * . parses 'maps' section and creates maps via BPF syscall
+ * . parses 'license' section and passes it to syscall
+ * . parses elf relocations for BPF maps and adjusts BPF_LD_IMM64 insns by
+ *   storing map_fd into insn->imm and marking such insns as BPF_PSEUDO_MAP_FD
+ * . loads eBPF programs via BPF syscall
+ *
+ * One ELF file can contain multiple BPF programs which will be loaded
+ * and their FDs stored in prog_fd array
+ *
+ * returns zero on success
+ */
+int load_bpf_file(char *path);
+
+#endif
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH net-next 3/6] samples: bpf: example of stateful socket filtering
From: Alexei Starovoitov @ 2014-11-27  5:42 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet, linux-api, netdev,
	linux-kernel
In-Reply-To: <1417066951-1999-1-git-send-email-ast@plumgrid.com>

this socket filter example does:
- creates arraymap in kernel with key 4 bytes and value 8 bytes

- loads eBPF program:
  r0 = skb[14 + 9]; // load one byte of ip->proto
  *(u32*)(fp - 4) = r0;
  value = bpf_map_lookup_elem(map_fd, fp - 4);
  if (value)
       (*(u64*)value) += 1;

- attaches this program to raw socket

- every second user space reads map[tcp], map[udp], map[icmp] to see
  how many packets of given protocol were seen on loopback interface
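
For reference, the 14 + 9 offset used in the load can be derived from the
header layouts. The structs below are written out by hand for this sketch
(on Linux they correspond to ETH_HLEN and the protocol field of struct
iphdr), and the calculation assumes an untagged Ethernet frame carrying
IPv4.

```c
#include <stddef.h>
#include <stdint.h>

/* Hand-written Ethernet II header: 6 + 6 + 2 = 14 bytes. */
struct toy_ethhdr {
	uint8_t  dst[6];
	uint8_t  src[6];
	uint16_t proto;
};

/* Hand-written IPv4 header without options (20 bytes). */
struct toy_iphdr {
	uint8_t  ver_ihl;
	uint8_t  tos;
	uint16_t tot_len;
	uint16_t id;
	uint16_t frag_off;
	uint8_t  ttl;
	uint8_t  protocol;
	uint16_t check;
	uint32_t saddr;
	uint32_t daddr;
};

/* Byte offset of the IP protocol field from the start of the frame. */
static size_t ip_proto_offset(void)
{
	return sizeof(struct toy_ethhdr) +
	       offsetof(struct toy_iphdr, protocol);
}

/* Read the protocol byte; the caller must ensure the frame is long enough. */
static uint8_t frame_ip_proto(const uint8_t *frame)
{
	return frame[ip_proto_offset()];
}
```

A VLAN tag or a non-IPv4 payload would shift or invalidate this offset,
which the simple example does not try to handle.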

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 samples/bpf/Makefile       |    2 +
 samples/bpf/libbpf.c       |   28 +++++++++++++
 samples/bpf/libbpf.h       |   13 ++++++
 samples/bpf/sock_example.c |   97 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 140 insertions(+)
 create mode 100644 samples/bpf/sock_example.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 0718d9ce4619..f46d3492d032 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -3,9 +3,11 @@ obj- := dummy.o
 
 # List of programs to build
 hostprogs-y := test_verifier test_maps
+hostprogs-y += sock_example
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
+sock_example-objs := sock_example.o libbpf.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 17bb520eb57f..46d50b7ddf79 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -7,6 +7,10 @@
 #include <linux/netlink.h>
 #include <linux/bpf.h>
 #include <errno.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <linux/if_packet.h>
+#include <arpa/inet.h>
 #include "libbpf.h"
 
 static __u64 ptr_to_u64(void *ptr)
@@ -93,3 +97,27 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 
 	return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
 }
+
+int open_raw_sock(const char *name)
+{
+	struct sockaddr_ll sll;
+	int sock;
+
+	sock = socket(PF_PACKET, SOCK_RAW | SOCK_NONBLOCK | SOCK_CLOEXEC, htons(ETH_P_ALL));
+	if (sock < 0) {
+		printf("cannot create raw socket\n");
+		return -1;
+	}
+
+	memset(&sll, 0, sizeof(sll));
+	sll.sll_family = AF_PACKET;
+	sll.sll_ifindex = if_nametoindex(name);
+	sll.sll_protocol = htons(ETH_P_ALL);
+	if (bind(sock, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
+		printf("bind to %s: %s\n", name, strerror(errno));
+		close(sock);
+		return -1;
+	}
+
+	return sock;
+}
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index f8678e5f48bf..cc62ad4d95de 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -99,6 +99,16 @@ extern char bpf_log_buf[LOG_BUF_SIZE];
 	BPF_LD_IMM64_RAW(DST, BPF_PSEUDO_MAP_FD, MAP_FD)
 
 
+/* Direct packet access, R0 = *(uint *) (skb->data + imm32) */
+
+#define BPF_LD_ABS(SIZE, IMM)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS,	\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
 /* Memory load, dst_reg = *(uint *) (src_reg + off16) */
 
 #define BPF_LDX_MEM(SIZE, DST, SRC, OFF)			\
@@ -169,4 +179,7 @@ extern char bpf_log_buf[LOG_BUF_SIZE];
 		.off   = 0,					\
 		.imm   = 0 })
 
+/* create RAW socket and bind to interface 'name' */
+int open_raw_sock(const char *name);
+
 #endif
diff --git a/samples/bpf/sock_example.c b/samples/bpf/sock_example.c
new file mode 100644
index 000000000000..d74b58523458
--- /dev/null
+++ b/samples/bpf/sock_example.c
@@ -0,0 +1,97 @@
+/* eBPF example program:
+ * - creates arraymap in kernel with key 4 bytes and value 8 bytes
+ *
+ * - loads eBPF program:
+ *   r0 = skb[14 + 9]; // load one byte of ip->proto
+ *   *(u32*)(fp - 4) = r0;
+ *   value = bpf_map_lookup_elem(map_fd, fp - 4);
+ *   if (value)
+ *        (*(u64*)value) += 1;
+ *
+ * - attaches this program to a raw socket bound to lo
+ *
+ * - every second user space reads map[tcp], map[udp], map[icmp] to see
+ *   how many packets of given protocol were seen on lo
+ */
+#include <stdio.h>
+#include <unistd.h>
+#include <assert.h>
+#include <linux/bpf.h>
+#include <string.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <sys/socket.h>
+#include <arpa/inet.h>
+#include "libbpf.h"
+
+static int test_sock(void)
+{
+	int sock = -1, map_fd, prog_fd, i, key;
+	long long value = 0, tcp_cnt, udp_cnt, icmp_cnt;
+
+	map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key), sizeof(value),
+				256);
+	if (map_fd < 0) {
+		printf("failed to create map '%s'\n", strerror(errno));
+		goto cleanup;
+	}
+
+	struct bpf_insn prog[] = {
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+		BPF_LD_ABS(BPF_B, 14 + 9 /* R0 = ip->proto */),
+		BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
+		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
+		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+		BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
+		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+		BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
+		BPF_EXIT_INSN(),
+	};
+
+	prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog, sizeof(prog),
+				"GPL");
+	if (prog_fd < 0) {
+		printf("failed to load prog '%s'\n", strerror(errno));
+		goto cleanup;
+	}
+
+	sock = open_raw_sock("lo");
+
+	if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd,
+		       sizeof(prog_fd)) < 0) {
+		printf("setsockopt %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	for (i = 0; i < 10; i++) {
+		key = IPPROTO_TCP;
+		assert(bpf_lookup_elem(map_fd, &key, &tcp_cnt) == 0);
+
+		key = IPPROTO_UDP;
+		assert(bpf_lookup_elem(map_fd, &key, &udp_cnt) == 0);
+
+		key = IPPROTO_ICMP;
+		assert(bpf_lookup_elem(map_fd, &key, &icmp_cnt) == 0);
+
+		printf("TCP %lld UDP %lld ICMP %lld packets\n",
+		       tcp_cnt, udp_cnt, icmp_cnt);
+		sleep(1);
+	}
+
+cleanup:
+	/* maps, programs, raw sockets will auto cleanup on process exit */
+	return 0;
+}
+
+int main(void)
+{
+	FILE *f;
+
+	f = popen("ping -c5 localhost", "r");
+	(void)f;
+
+	return test_sock();
+}
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH net-next 6/6] samples: bpf: large eBPF program in C
From: Alexei Starovoitov @ 2014-11-27  5:42 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet, linux-api, netdev,
	linux-kernel
In-Reply-To: <1417066951-1999-1-git-send-email-ast@plumgrid.com>

sockex2_kern.c is a purposefully large eBPF program in C.
llvm compiles ~200 lines of C code into ~300 eBPF instructions.

It's similar to __skb_flow_dissect() to demonstrate that complex packet parsing
can be done by eBPF.
Then it uses (struct flow_keys)->dst IP address (or hash of ipv6 dst) to keep
stats of number of packets per IP.
User space loads eBPF program, attaches it to loopback interface and prints
dest_ip->#packets stats every second.
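
As a rough illustration of the "hash of ipv6 dst" mentioned above, the
following userspace sketch folds a 16-byte address into 32 bits by XORing
its four words, which is what ipv6_addr_hash() in sockex2_kern.c does with
load_word(). The helper names load_be32() and ipv6_addr_fold() are
hypothetical, and the sketch assumes load_word() yields the packet word in
cpu endianness.

```c
#include <stdint.h>

/* Read a 32-bit big-endian word and return it in host order. */
static uint32_t load_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Fold a 128-bit IPv6 address into a 32-bit map key by XORing
 * its four 32-bit words.
 */
static uint32_t ipv6_addr_fold(const uint8_t addr[16])
{
	return load_be32(addr) ^ load_be32(addr + 4) ^
	       load_be32(addr + 8) ^ load_be32(addr + 12);
}
```

An XOR fold is cheap but collides easily (any address made of four equal
words folds to zero), so the per-IP counts for IPv6 are approximate.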

Usage:
$sudo samples/bpf/sockex2
ip 127.0.0.1 count 19
ip 127.0.0.1 count 178115
ip 127.0.0.1 count 369437
ip 127.0.0.1 count 559841
ip 127.0.0.1 count 750539

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 samples/bpf/Makefile       |    4 +
 samples/bpf/sockex2_kern.c |  215 ++++++++++++++++++++++++++++++++++++++++++++
 samples/bpf/sockex2_user.c |   44 +++++++++
 3 files changed, 263 insertions(+)
 create mode 100644 samples/bpf/sockex2_kern.c
 create mode 100644 samples/bpf/sockex2_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 770d145186c3..b5b3600dcdf5 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -5,20 +5,24 @@ obj- := dummy.o
 hostprogs-y := test_verifier test_maps
 hostprogs-y += sock_example
 hostprogs-y += sockex1
+hostprogs-y += sockex2
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
 sock_example-objs := sock_example.o libbpf.o
 sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
+sockex2-objs := bpf_load.o libbpf.o sockex2_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
 always += sockex1_kern.o
+always += sockex2_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
 HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
 HOSTLOADLIBES_sockex1 += -lelf
+HOSTLOADLIBES_sockex2 += -lelf
 
 # point this to your LLVM backend with bpf support
 LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/sockex2_kern.c b/samples/bpf/sockex2_kern.c
new file mode 100644
index 000000000000..6f0135f0f217
--- /dev/null
+++ b/samples/bpf/sockex2_kern.c
@@ -0,0 +1,215 @@
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+#include <uapi/linux/in.h>
+#include <uapi/linux/if.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/ip.h>
+#include <uapi/linux/ipv6.h>
+#include <uapi/linux/if_tunnel.h>
+#define IP_MF		0x2000
+#define IP_OFFSET	0x1FFF
+
+struct vlan_hdr {
+	__be16 h_vlan_TCI;
+	__be16 h_vlan_encapsulated_proto;
+};
+
+struct flow_keys {
+	__be32 src;
+	__be32 dst;
+	union {
+		__be32 ports;
+		__be16 port16[2];
+	};
+	__u16 thoff;
+	__u8 ip_proto;
+};
+
+static inline int proto_ports_offset(__u64 proto)
+{
+	switch (proto) {
+	case IPPROTO_TCP:
+	case IPPROTO_UDP:
+	case IPPROTO_DCCP:
+	case IPPROTO_ESP:
+	case IPPROTO_SCTP:
+	case IPPROTO_UDPLITE:
+		return 0;
+	case IPPROTO_AH:
+		return 4;
+	default:
+		return 0;
+	}
+}
+
+static inline int ip_is_fragment(struct sk_buff *ctx, __u64 nhoff)
+{
+	return load_half(ctx, nhoff + offsetof(struct iphdr, frag_off))
+		& (IP_MF | IP_OFFSET);
+}
+
+static inline __u32 ipv6_addr_hash(struct sk_buff *ctx, __u64 off)
+{
+	__u64 w0 = load_word(ctx, off);
+	__u64 w1 = load_word(ctx, off + 4);
+	__u64 w2 = load_word(ctx, off + 8);
+	__u64 w3 = load_word(ctx, off + 12);
+
+	return (__u32)(w0 ^ w1 ^ w2 ^ w3);
+}
+
+static inline __u64 parse_ip(struct sk_buff *skb, __u64 nhoff, __u64 *ip_proto,
+			     struct flow_keys *flow)
+{
+	__u64 verlen;
+
+	if (unlikely(ip_is_fragment(skb, nhoff)))
+		*ip_proto = 0;
+	else
+		*ip_proto = load_byte(skb, nhoff + offsetof(struct iphdr, protocol));
+
+	if (*ip_proto != IPPROTO_GRE) {
+		flow->src = load_word(skb, nhoff + offsetof(struct iphdr, saddr));
+		flow->dst = load_word(skb, nhoff + offsetof(struct iphdr, daddr));
+	}
+
+	verlen = load_byte(skb, nhoff + 0/*offsetof(struct iphdr, ihl)*/);
+	if (likely(verlen == 0x45))
+		nhoff += 20;
+	else
+		nhoff += (verlen & 0xF) << 2;
+
+	return nhoff;
+}
+
+static inline __u64 parse_ipv6(struct sk_buff *skb, __u64 nhoff, __u64 *ip_proto,
+			       struct flow_keys *flow)
+{
+	*ip_proto = load_byte(skb,
+			      nhoff + offsetof(struct ipv6hdr, nexthdr));
+	flow->src = ipv6_addr_hash(skb,
+				   nhoff + offsetof(struct ipv6hdr, saddr));
+	flow->dst = ipv6_addr_hash(skb,
+				   nhoff + offsetof(struct ipv6hdr, daddr));
+	nhoff += sizeof(struct ipv6hdr);
+
+	return nhoff;
+}
+
+static inline bool flow_dissector(struct sk_buff *skb, struct flow_keys *flow)
+{
+	__u64 nhoff = ETH_HLEN;
+	__u64 ip_proto;
+	__u64 proto = load_half(skb, 12);
+	int poff;
+
+	if (proto == ETH_P_8021AD) {
+		proto = load_half(skb, nhoff + offsetof(struct vlan_hdr,
+							h_vlan_encapsulated_proto));
+		nhoff += sizeof(struct vlan_hdr);
+	}
+
+	if (proto == ETH_P_8021Q) {
+		proto = load_half(skb, nhoff + offsetof(struct vlan_hdr,
+							h_vlan_encapsulated_proto));
+		nhoff += sizeof(struct vlan_hdr);
+	}
+
+	if (likely(proto == ETH_P_IP))
+		nhoff = parse_ip(skb, nhoff, &ip_proto, flow);
+	else if (proto == ETH_P_IPV6)
+		nhoff = parse_ipv6(skb, nhoff, &ip_proto, flow);
+	else
+		return false;
+
+	switch (ip_proto) {
+	case IPPROTO_GRE: {
+		struct gre_hdr {
+			__be16 flags;
+			__be16 proto;
+		};
+
+		__u64 gre_flags = load_half(skb,
+					    nhoff + offsetof(struct gre_hdr, flags));
+		__u64 gre_proto = load_half(skb,
+					    nhoff + offsetof(struct gre_hdr, proto));
+
+		if (gre_flags & (GRE_VERSION|GRE_ROUTING))
+			break;
+
+		proto = gre_proto;
+		nhoff += 4;
+		if (gre_flags & GRE_CSUM)
+			nhoff += 4;
+		if (gre_flags & GRE_KEY)
+			nhoff += 4;
+		if (gre_flags & GRE_SEQ)
+			nhoff += 4;
+
+		if (proto == ETH_P_8021Q) {
+			proto = load_half(skb,
+					  nhoff + offsetof(struct vlan_hdr,
+							   h_vlan_encapsulated_proto));
+			nhoff += sizeof(struct vlan_hdr);
+		}
+
+		if (proto == ETH_P_IP)
+			nhoff = parse_ip(skb, nhoff, &ip_proto, flow);
+		else if (proto == ETH_P_IPV6)
+			nhoff = parse_ipv6(skb, nhoff, &ip_proto, flow);
+		else
+			return false;
+		break;
+	}
+	case IPPROTO_IPIP:
+		nhoff = parse_ip(skb, nhoff, &ip_proto, flow);
+		break;
+	case IPPROTO_IPV6:
+		nhoff = parse_ipv6(skb, nhoff, &ip_proto, flow);
+		break;
+	default:
+		break;
+	}
+
+	flow->ip_proto = ip_proto;
+	poff = proto_ports_offset(ip_proto);
+	if (poff >= 0) {
+		nhoff += poff;
+		flow->ports = load_word(skb, nhoff);
+	}
+
+	flow->thoff = (__u16) nhoff;
+
+	return true;
+}
+
+struct bpf_map_def SEC("maps") hash_map = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(__be32),
+	.value_size = sizeof(long),
+	.max_entries = 1024,
+};
+
+SEC("socket2")
+int bpf_prog2(struct sk_buff *skb)
+{
+	struct flow_keys flow;
+	long *value;
+	u32 key;
+
+	if (!flow_dissector(skb, &flow))
+		return 0;
+
+	key = flow.dst;
+	value = bpf_map_lookup_elem(&hash_map, &key);
+	if (value) {
+		__sync_fetch_and_add(value, 1);
+	} else {
+		long val = 1;
+
+		bpf_map_update_elem(&hash_map, &key, &val, BPF_ANY);
+	}
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/sockex2_user.c b/samples/bpf/sockex2_user.c
new file mode 100644
index 000000000000..d2d5f5a790d3
--- /dev/null
+++ b/samples/bpf/sockex2_user.c
@@ -0,0 +1,44 @@
+#include <stdio.h>
+#include <assert.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+#include <unistd.h>
+#include <arpa/inet.h>
+
+int main(int ac, char **argv)
+{
+	char filename[256];
+	FILE *f;
+	int i, sock;
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	sock = open_raw_sock("lo");
+
+	assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
+			  sizeof(prog_fd[0])) == 0);
+
+	f = popen("ping -c5 localhost", "r");
+	(void) f;
+
+	for (i = 0; i < 5; i++) {
+		int key = 0, next_key;
+		long long value;
+
+		while (bpf_get_next_key(map_fd[0], &key, &next_key) == 0) {
+			bpf_lookup_elem(map_fd[0], &next_key, &value);
+			printf("ip %s count %lld\n",
+			       inet_ntoa((struct in_addr){htonl(next_key)}),
+			       value);
+			key = next_key;
+		}
+		sleep(1);
+	}
+	return 0;
+}
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH net-next 2/6] net: sock: allow eBPF programs to be attached to sockets
From: Alexei Starovoitov @ 2014-11-27  5:42 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet, linux-api, netdev,
	linux-kernel
In-Reply-To: <1417066951-1999-1-git-send-email-ast@plumgrid.com>

introduce new setsockopt() command:

setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd))

where prog_fd was received from syscall bpf(BPF_PROG_LOAD, attr, ...)
and attr->prog_type == BPF_PROG_TYPE_SOCKET_FILTER

setsockopt() calls bpf_prog_get() which increments refcnt of the program,
so it doesn't get unloaded while socket is using the program.

The same eBPF program can be attached to multiple sockets.

When the user task exits, its sockets are closed automatically; that calls
sk_filter_uncharge(), which decrements the refcnt of the eBPF program.
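
The reference counting described above can be modelled in a few lines of
userspace C. This is only a sketch under assumptions: prog_model,
sock_model and the three helpers are hypothetical names, not kernel
symbols, and the model ignores locking and RCU entirely. It shows one
program shared by two sockets and freed only after the last holder lets go.

```c
#include <stddef.h>

/* Hypothetical userspace model; names are not the kernel's. */
struct prog_model {
	int refcnt;	/* one reference per holder: the fd plus each socket */
	int freed;	/* set once the last reference is dropped */
};

struct sock_model {
	struct prog_model *filter;
};

/* Drop one reference; the program goes away with the last one. */
static void prog_model_put(struct prog_model *p)
{
	if (--p->refcnt == 0)
		p->freed = 1;
}

/* Attach: take a reference, then release the one held for any old filter. */
static void sock_model_attach(struct sock_model *sk, struct prog_model *p)
{
	struct prog_model *old = sk->filter;

	p->refcnt++;
	sk->filter = p;
	if (old)
		prog_model_put(old);
}

/* Close: the socket gives up its reference. */
static void sock_model_close(struct sock_model *sk)
{
	if (sk->filter) {
		prog_model_put(sk->filter);
		sk->filter = NULL;
	}
}
```

In this model, closing the program fd while sockets still hold the program
only lowers the count; the program stays usable until the last socket closes.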

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
Note, I'm not happy about 'ifdef', but 'select or depend BPF_SYSCALL' will
make tinification folks cringe, so use ifdef until native eBPF use cases
become widespread.
---
 arch/alpha/include/uapi/asm/socket.h   |    3 +
 arch/avr32/include/uapi/asm/socket.h   |    3 +
 arch/cris/include/uapi/asm/socket.h    |    3 +
 arch/frv/include/uapi/asm/socket.h     |    3 +
 arch/ia64/include/uapi/asm/socket.h    |    3 +
 arch/m32r/include/uapi/asm/socket.h    |    3 +
 arch/mips/include/uapi/asm/socket.h    |    3 +
 arch/mn10300/include/uapi/asm/socket.h |    3 +
 arch/parisc/include/uapi/asm/socket.h  |    3 +
 arch/powerpc/include/uapi/asm/socket.h |    3 +
 arch/s390/include/uapi/asm/socket.h    |    3 +
 arch/sparc/include/uapi/asm/socket.h   |    3 +
 arch/xtensa/include/uapi/asm/socket.h  |    3 +
 include/linux/bpf.h                    |    4 ++
 include/linux/filter.h                 |    1 +
 include/uapi/asm-generic/socket.h      |    3 +
 net/core/filter.c                      |   97 +++++++++++++++++++++++++++++++-
 net/core/sock.c                        |   13 +++++
 18 files changed, 155 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index e2fe0700b3b4..9a20821b111c 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -89,4 +89,7 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/avr32/include/uapi/asm/socket.h b/arch/avr32/include/uapi/asm/socket.h
index 92121b0f5b98..2b65ed6b277c 100644
--- a/arch/avr32/include/uapi/asm/socket.h
+++ b/arch/avr32/include/uapi/asm/socket.h
@@ -82,4 +82,7 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* _UAPI__ASM_AVR32_SOCKET_H */
diff --git a/arch/cris/include/uapi/asm/socket.h b/arch/cris/include/uapi/asm/socket.h
index 60f60f5b9b35..e2503d9f1869 100644
--- a/arch/cris/include/uapi/asm/socket.h
+++ b/arch/cris/include/uapi/asm/socket.h
@@ -84,6 +84,9 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* _ASM_SOCKET_H */
 
 
diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
index 2c6890209ea6..4823ad125578 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -82,5 +82,8 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* _ASM_SOCKET_H */
 
diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index 09a93fb566f6..59be3d87f86d 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -91,4 +91,7 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h
index e8589819c274..7bc4cb273856 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -82,4 +82,7 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 2e9ee8c55a10..dec3c850f36b 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -100,4 +100,7 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
index f3492e8c9f70..cab7d6d50051 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -82,4 +82,7 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index 7984a1cab3da..a5cd40cd8ee1 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -81,4 +81,7 @@
 
 #define SO_INCOMING_CPU		0x402A
 
+#define SO_ATTACH_BPF		0x402B
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/powerpc/include/uapi/asm/socket.h b/arch/powerpc/include/uapi/asm/socket.h
index 3474e4ef166d..c046666038f8 100644
--- a/arch/powerpc/include/uapi/asm/socket.h
+++ b/arch/powerpc/include/uapi/asm/socket.h
@@ -89,4 +89,7 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif	/* _ASM_POWERPC_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index 8457636c33e1..296942d56e6a 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -88,4 +88,7 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index 4a8003a94163..e6a16c40be5f 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -78,6 +78,9 @@
 
 #define SO_INCOMING_CPU		0x0033
 
+#define SO_ATTACH_BPF		0x0034
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION		0x5001
 #define SO_SECURITY_ENCRYPTION_TRANSPORT	0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
index c46f6a696849..4120af086160 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -93,4 +93,7 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif	/* _XTENSA_SOCKET_H */
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 75e94eaa228b..bbfceb756452 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -128,7 +128,11 @@ struct bpf_prog_aux {
 	struct work_struct work;
 };
 
+#ifdef CONFIG_BPF_SYSCALL
 void bpf_prog_put(struct bpf_prog *prog);
+#else
+static inline void bpf_prog_put(struct bpf_prog *prog) {}
+#endif
 struct bpf_prog *bpf_prog_get(u32 ufd);
 /* verify correctness of eBPF program */
 int bpf_check(struct bpf_prog *fp, union bpf_attr *attr);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index ca95abd2bed1..caac2087a4d5 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -381,6 +381,7 @@ int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog);
 void bpf_prog_destroy(struct bpf_prog *fp);
 
 int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
+int sk_attach_bpf(u32 ufd, struct sock *sk);
 int sk_detach_filter(struct sock *sk);
 
 int bpf_check_classic(const struct sock_filter *filter, unsigned int flen);
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index f541ccefd4ac..5c15c2a5c123 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -84,4 +84,7 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/net/core/filter.c b/net/core/filter.c
index 647b12265e18..8cc3c03078b3 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -44,6 +44,7 @@
 #include <linux/ratelimit.h>
 #include <linux/seccomp.h>
 #include <linux/if_vlan.h>
+#include <linux/bpf.h>
 
 /**
  *	sk_filter - run a packet through a socket filter
@@ -813,8 +814,12 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
 
 static void __bpf_prog_release(struct bpf_prog *prog)
 {
-	bpf_release_orig_filter(prog);
-	bpf_prog_free(prog);
+	if (prog->aux->prog_type == BPF_PROG_TYPE_SOCKET_FILTER) {
+		bpf_prog_put(prog);
+	} else {
+		bpf_release_orig_filter(prog);
+		bpf_prog_free(prog);
+	}
 }
 
 static void __sk_filter_release(struct sk_filter *fp)
@@ -1088,6 +1093,94 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(sk_attach_filter);
 
+#ifdef CONFIG_BPF_SYSCALL
+int sk_attach_bpf(u32 ufd, struct sock *sk)
+{
+	struct sk_filter *fp, *old_fp;
+	struct bpf_prog *prog;
+
+	if (sock_flag(sk, SOCK_FILTER_LOCKED))
+		return -EPERM;
+
+	prog = bpf_prog_get(ufd);
+	if (!prog)
+		return -EINVAL;
+
+	if (prog->aux->prog_type != BPF_PROG_TYPE_SOCKET_FILTER) {
+		/* valid fd, but invalid program type */
+		bpf_prog_put(prog);
+		return -EINVAL;
+	}
+
+	fp = kmalloc(sizeof(*fp), GFP_KERNEL);
+	if (!fp) {
+		bpf_prog_put(prog);
+		return -ENOMEM;
+	}
+	fp->prog = prog;
+
+	atomic_set(&fp->refcnt, 0);
+
+	if (!sk_filter_charge(sk, fp)) {
+		__sk_filter_release(fp);
+		return -ENOMEM;
+	}
+
+	old_fp = rcu_dereference_protected(sk->sk_filter,
+					   sock_owned_by_user(sk));
+	rcu_assign_pointer(sk->sk_filter, fp);
+
+	if (old_fp)
+		sk_filter_uncharge(sk, old_fp);
+
+	return 0;
+}
+
+/* allow socket filters to call
+ * bpf_map_lookup_elem(), bpf_map_update_elem(), bpf_map_delete_elem()
+ */
+static const struct bpf_func_proto *sock_filter_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_map_lookup_elem:
+		return &bpf_map_lookup_elem_proto;
+	case BPF_FUNC_map_update_elem:
+		return &bpf_map_update_elem_proto;
+	case BPF_FUNC_map_delete_elem:
+		return &bpf_map_delete_elem_proto;
+	default:
+		return NULL;
+	}
+}
+
+static bool sock_filter_is_valid_access(int off, int size, enum bpf_access_type type)
+{
+	/* skb fields cannot be accessed yet */
+	return false;
+}
+
+static struct bpf_verifier_ops sock_filter_ops = {
+	.get_func_proto = sock_filter_func_proto,
+	.is_valid_access = sock_filter_is_valid_access,
+};
+
+static struct bpf_prog_type_list tl = {
+	.ops = &sock_filter_ops,
+	.type = BPF_PROG_TYPE_SOCKET_FILTER,
+};
+
+static int __init register_sock_filter_ops(void)
+{
+	bpf_register_prog_type(&tl);
+	return 0;
+}
+late_initcall(register_sock_filter_ops);
+#else
+int sk_attach_bpf(u32 ufd, struct sock *sk)
+{
+	return -EOPNOTSUPP;
+}
+#endif
 int sk_detach_filter(struct sock *sk)
 {
 	int ret = -ENOENT;
diff --git a/net/core/sock.c b/net/core/sock.c
index 0725cf0cb685..9a56b2000c3f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -888,6 +888,19 @@ set_rcvbuf:
 		}
 		break;
 
+	case SO_ATTACH_BPF:
+		ret = -EINVAL;
+		if (optlen == sizeof(u32)) {
+			u32 ufd;
+
+			ret = -EFAULT;
+			if (copy_from_user(&ufd, optval, sizeof(ufd)))
+				break;
+
+			ret = sk_attach_bpf(ufd, sk);
+		}
+		break;
+
 	case SO_DETACH_FILTER:
 		ret = sk_detach_filter(sk);
 		break;
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH net-next 1/6] bpf: verifier: add checks for BPF_ABS | BPF_IND instructions
From: Alexei Starovoitov @ 2014-11-27  5:42 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet, linux-api, netdev,
	linux-kernel
In-Reply-To: <1417066951-1999-1-git-send-email-ast@plumgrid.com>

introduce program type BPF_PROG_TYPE_SOCKET_FILTER that is used
for attaching programs to sockets where ctx == skb.

add verifier checks for ABS/IND instructions which can only be seen
in socket filters, therefore the check:
  if (env->prog->aux->prog_type != BPF_PROG_TYPE_SOCKET_FILTER)
    verbose("BPF_LD_ABS|IND instructions are only allowed in socket filters\n");
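
To make the rules easier to see at a glance, here is a simplified userspace
model of the static checks that check_ld_abs() in the diff below performs.
It is a sketch, not the verifier: insn_model, the enums and
ld_abs_allowed() are hypothetical names, register state tracking is reduced
to a single flag for "R6 holds the ctx pointer", and the readability check
on the source register for BPF_IND is left out.

```c
#include <stdint.h>

/* Hypothetical, reduced stand-ins for the verifier's types. */
enum prog_type_model { PROG_UNSPEC, PROG_SOCKET_FILTER };
enum ld_mode_model { LD_ABS, LD_IND };

struct insn_model {
	uint8_t dst_reg;
	uint8_t src_reg;
	int16_t off;
};

/* Returns 1 if the instruction passes the modelled checks, 0 otherwise. */
static int ld_abs_allowed(enum prog_type_model type, enum ld_mode_model mode,
			  const struct insn_model *insn, int r6_is_ctx)
{
	/* only socket filters may use LD_ABS/LD_IND */
	if (type != PROG_SOCKET_FILTER)
		return 0;

	/* reserved fields must be zero; src_reg is reserved for ABS only */
	if (insn->dst_reg != 0 || insn->off != 0)
		return 0;
	if (mode == LD_ABS && insn->src_reg != 0)
		return 0;

	/* implicit input: R6 must still point to the skb */
	if (!r6_is_ctx)
		return 0;

	return 1;
}
```

The last check is why the assembler example in patch 3 starts by copying R1
into R6: the ctx pointer has to be in R6 before the first LD_ABS.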

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 include/uapi/linux/bpf.h |    1 +
 kernel/bpf/verifier.c    |   70 ++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 69 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4a3d0f84f178..45da7ec7d274 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -117,6 +117,7 @@ enum bpf_map_type {
 
 enum bpf_prog_type {
 	BPF_PROG_TYPE_UNSPEC,
+	BPF_PROG_TYPE_SOCKET_FILTER,
 };
 
 /* flags for BPF_MAP_UPDATE_ELEM command */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b6a1f7c14a67..a28e09c7825d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1172,6 +1172,70 @@ static int check_ld_imm(struct verifier_env *env, struct bpf_insn *insn)
 	return 0;
 }
 
+/* verify safety of LD_ABS|LD_IND instructions:
+ * - they can only appear in the programs where ctx == skb
+ * - since they are wrappers of function calls, they scratch R1-R5 registers,
+ *   preserve R6-R9, and store return value into R0
+ *
+ * Implicit input:
+ *   ctx == skb == R6 == CTX
+ *
+ * Explicit input:
+ *   SRC == any register
+ *   IMM == 32-bit immediate
+ *
+ * Output:
+ *   R0 - 8/16/32-bit skb data converted to cpu endianness
+ */
+static int check_ld_abs(struct verifier_env *env, struct bpf_insn *insn)
+{
+	struct reg_state *regs = env->cur_state.regs;
+	u8 mode = BPF_MODE(insn->code);
+	struct reg_state *reg;
+	int i, err;
+
+	if (env->prog->aux->prog_type != BPF_PROG_TYPE_SOCKET_FILTER) {
+		verbose("BPF_LD_ABS|IND instructions are only allowed in socket filters\n");
+		return -EINVAL;
+	}
+
+	if (insn->dst_reg != BPF_REG_0 || insn->off != 0 ||
+	    (mode == BPF_ABS && insn->src_reg != BPF_REG_0)) {
+		verbose("BPF_LD_ABS uses reserved fields\n");
+		return -EINVAL;
+	}
+
+	/* check whether implicit source operand (register R6) is readable */
+	err = check_reg_arg(regs, BPF_REG_6, SRC_OP);
+	if (err)
+		return err;
+
+	if (regs[BPF_REG_6].type != PTR_TO_CTX) {
+		verbose("at the time of BPF_LD_ABS|IND R6 != pointer to skb\n");
+		return -EINVAL;
+	}
+
+	if (mode == BPF_IND) {
+		/* check explicit source operand */
+		err = check_reg_arg(regs, insn->src_reg, SRC_OP);
+		if (err)
+			return err;
+	}
+
+	/* reset caller saved regs to unreadable */
+	for (i = 0; i < CALLER_SAVED_REGS; i++) {
+		reg = regs + caller_saved[i];
+		reg->type = NOT_INIT;
+		reg->imm = 0;
+	}
+
+	/* mark destination R0 register as readable, since it contains
+	 * the value fetched from the packet
+	 */
+	regs[BPF_REG_0].type = UNKNOWN_VALUE;
+	return 0;
+}
+
 /* non-recursive DFS pseudo code
  * 1  procedure DFS-iterative(G,v):
  * 2      label v as discovered
@@ -1677,8 +1741,10 @@ process_bpf_exit:
 			u8 mode = BPF_MODE(insn->code);
 
 			if (mode == BPF_ABS || mode == BPF_IND) {
-				verbose("LD_ABS is not supported yet\n");
-				return -EINVAL;
+				err = check_ld_abs(env, insn);
+				if (err)
+					return err;
+
 			} else if (mode == BPF_IMM) {
 				err = check_ld_imm(env, insn);
 				if (err)
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH net-next 0/6] allow eBPF programs to be attached to sockets
From: Alexei Starovoitov @ 2014-11-27  5:42 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet,
	linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Introduce BPF_PROG_TYPE_SOCKET_FILTER type of eBPF programs that can be
attached to sockets with setsockopt().
Allow such programs to access maps via lookup/update/delete helpers.

This feature was previewed by bpf manpage in commit b4fc1a460f30("Merge branch 'bpf-next'")
Now it can actually run.

1st patch adds LD_ABS/LD_IND instruction verification and
2nd patch adds new setsockopt() flag.
Patches 3-6 are examples in assembler and in C.

Though native eBPF programs are way more powerful than classic filters
(attachable through similar setsockopt() call), they don't have skb field
accessors yet: fields like skb->pkt_type and skb->dev->ifindex are not accessible.
There are several ways to achieve that; they will come in the next set of patches.
So in this set native eBPF programs can only read data from the packet and
access maps.

The most powerful example is sockex2_kern.c from patch 6 where ~200 lines of C
are compiled into ~300 eBPF instructions.
It shows how quite complex packet parsing can be done.

LLVM used to build examples is at https://github.com/iovisor/llvm
which is a fork of llvm trunk that I'm cleaning up for upstreaming.

Alexei Starovoitov (6):
  bpf: verifier: add checks for BPF_ABS | BPF_IND instructions
  net: sock: allow eBPF programs to be attached to sockets
  samples: bpf: example of stateful socket filtering
  samples: bpf: elf_bpf file loader
  samples: bpf: trivial eBPF program in C
  samples: bpf: large eBPF program in C

 arch/alpha/include/uapi/asm/socket.h   |    3 +
 arch/avr32/include/uapi/asm/socket.h   |    3 +
 arch/cris/include/uapi/asm/socket.h    |    3 +
 arch/frv/include/uapi/asm/socket.h     |    3 +
 arch/ia64/include/uapi/asm/socket.h    |    3 +
 arch/m32r/include/uapi/asm/socket.h    |    3 +
 arch/mips/include/uapi/asm/socket.h    |    3 +
 arch/mn10300/include/uapi/asm/socket.h |    3 +
 arch/parisc/include/uapi/asm/socket.h  |    3 +
 arch/powerpc/include/uapi/asm/socket.h |    3 +
 arch/s390/include/uapi/asm/socket.h    |    3 +
 arch/sparc/include/uapi/asm/socket.h   |    3 +
 arch/xtensa/include/uapi/asm/socket.h  |    3 +
 include/linux/bpf.h                    |    4 +
 include/linux/filter.h                 |    1 +
 include/uapi/asm-generic/socket.h      |    3 +
 include/uapi/linux/bpf.h               |    1 +
 kernel/bpf/verifier.c                  |   70 ++++++++++-
 net/core/filter.c                      |   97 +++++++++++++-
 net/core/sock.c                        |   13 ++
 samples/bpf/Makefile                   |   20 +++
 samples/bpf/bpf_helpers.h              |   40 ++++++
 samples/bpf/bpf_load.c                 |  203 ++++++++++++++++++++++++++++++
 samples/bpf/bpf_load.h                 |   24 ++++
 samples/bpf/libbpf.c                   |   28 +++++
 samples/bpf/libbpf.h                   |   15 ++-
 samples/bpf/sock_example.c             |   97 ++++++++++++++
 samples/bpf/sockex1_kern.c             |   23 ++++
 samples/bpf/sockex1_user.c             |   49 ++++++++
 samples/bpf/sockex2_kern.c             |  215 ++++++++++++++++++++++++++++++++
 samples/bpf/sockex2_user.c             |   44 +++++++
 31 files changed, 981 insertions(+), 5 deletions(-)
 create mode 100644 samples/bpf/bpf_helpers.h
 create mode 100644 samples/bpf/bpf_load.c
 create mode 100644 samples/bpf/bpf_load.h
 create mode 100644 samples/bpf/sock_example.c
 create mode 100644 samples/bpf/sockex1_kern.c
 create mode 100644 samples/bpf/sockex1_user.c
 create mode 100644 samples/bpf/sockex2_kern.c
 create mode 100644 samples/bpf/sockex2_user.c

-- 
1.7.9.5

^ permalink raw reply

* Re: [PATCH v2 02/19] kbuild: kselftest_install - add a new make target to install selftests
From: Masami Hiramatsu @ 2014-11-27  5:32 UTC (permalink / raw)
  To: Shuah Khan
  Cc: gregkh, akpm, mmarek, davem, keescook, tranmanphong, dh.herrmann,
	hughd, bobby.prani, ebiederm, serge.hallyn, linux-kbuild,
	linux-kernel, linux-api, netdev
In-Reply-To: <a2344d4df903d673afe1631118f40917f773cc9a.1415735831.git.shuahkh@osg.samsung.com>

(2014/11/12 5:27), Shuah Khan wrote:
> Add a new make target to install kernel selftests.
> This new target will build and install selftests. kselftest
> target now depends on kselftest_install and runs the generated
> kselftest script to reduce duplicate work and for common look
> and feel when running tests.
> 
> Approach:
> 
> make kselftest_target:

kselftest_install?

> -- exports kselftest INSTALL_KSFT_PATH
>    default $(INSTALL_MOD_PATH)/lib/kselftest/$(KERNELRELEASE)
> -- exports path for kselftest.sh
> -- runs selftests make install target:

This direction is OK with me.

BTW, I've found another way to build the selftests from the Makefile.
Actually, you can do

make -C tools/ selftest

There are also selftest_install and selftest_clean targets (but they
currently have a bug and don't work anyway).

I think we'd better do a subdir make instead of adding these targets.
This means that "make kselftest*" should be an alias of "make -C tools/ selftest*".

Also, I'd like to request passing some options such as O=$(objtree)
so that we can build test kmodules in selftests.

Thank you,


-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply

