Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-19 20:32 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Siwei Liu, Samudrala, Sridhar, Alexander Duyck, virtio-dev,
	aaron.f.brown, Jiri Pirko, Jakub Kicinski, Netdev, qemu-devel,
	virtualization, konrad.wilk, boris.ostrovsky, Joao Martins,
	Venu Busireddy, vijay.balakrishna
In-Reply-To: <20180619125453.2d2dfb2d.cohuck@redhat.com>

On Tue, Jun 19, 2018 at 12:54:53PM +0200, Cornelia Huck wrote:
> Sorry about dragging mainframes into this, but this will only work for
> homogenous device coupling, not for heterogenous. Consider my vfio-pci
> + virtio-net-ccw example again: The guest cannot find out that the two
> belong together by checking some group ID, it has to either use the MAC
> or some needs-to-be-architectured property.
> 
> Alternatively, we could propose that mechanism as pci-only, which means
> we can rely on mechanisms that won't necessarily work on non-pci
> transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> through a network card anytime in the near future, due to the nature of
> network cards currently in use on s390.)

That's what it boils down to, yes.  If there's need to have this for
non-pci devices, then we should put it in config space.
Cornelia, what do you think?

-- 
MST

^ permalink raw reply

* Re: iproute2 won't compile without AF_VSOCK
From: David Ahern @ 2018-06-19 20:29 UTC (permalink / raw)
  To: Stephen Hemminger, Steve Wise; +Cc: netdev
In-Reply-To: <f8c68944-57b6-18a6-aa50-c019baa2a63b@gmail.com>

On 6/19/18 2:27 PM, David Ahern wrote:
> On 6/19/18 9:47 AM, Stephen Hemminger wrote:
>> On Tue, 19 Jun 2018 10:17:45 -0500
>> Steve Wise <swise@opengridcomputing.com> wrote:
>>
>>> Hey David,
>>>
>>> I'm trying to compile the latest iproute2 on an RHEL-7.3 distro, and it
>>> fails to compile because AF_VSOCK is not defined.  Should this
>>> functionality be a configure option to disable it on older distros?
>>>
>>>
>>> Thanks,
>>>
>>> Steve.
>>>
>>> ----
>>>
>>> misc
>>>     CC       ss.o
>>> ss.c:301:27: error: ‘AF_VSOCK’ undeclared here (not in a function)
>>>    .families = FAMILY_MASK(AF_VSOCK),
>>>                            ^
>>> ss.c:252:46: note: in definition of macro ‘FAMILY_MASK’
>>>  #define FAMILY_MASK(family) ((uint64_t)1 << (family))
>>>                                               ^
>>> ss.c:334:2: error: array index in initializer not of integer type
>>>   [AF_VSOCK] = {
>>>   ^
>>> ss.c:334:2: error: (near initialization for ‘default_afs’)
>>> make[1]: *** [ss.o] Error 1
>>> make: *** [all] Error 2
>>>
>>
>> Probably should just add an #ifdef to takeout that if not present
>>
> 
> Most userspace tools have a compat header for cases like this.
> 
> #ifndef AF_VSOCK
> #define AF_VSOCK 	40
> #endif
> 

Add the above to include//utils.h; AF_MPLS is already there.

^ permalink raw reply

* Re: iproute2 won't compile without AF_VSOCK
From: David Ahern @ 2018-06-19 20:27 UTC (permalink / raw)
  To: Stephen Hemminger, Steve Wise; +Cc: netdev
In-Reply-To: <20180619084732.0de6d75d@xeon-e3>

On 6/19/18 9:47 AM, Stephen Hemminger wrote:
> On Tue, 19 Jun 2018 10:17:45 -0500
> Steve Wise <swise@opengridcomputing.com> wrote:
> 
>> Hey David,
>>
>> I'm trying to compile the latest iproute2 on an RHEL-7.3 distro, and it
>> fails to compile because AF_VSOCK is not defined.  Should this
>> functionality be a configure option to disable it on older distros?
>>
>>
>> Thanks,
>>
>> Steve.
>>
>> ----
>>
>> misc
>>     CC       ss.o
>> ss.c:301:27: error: ‘AF_VSOCK’ undeclared here (not in a function)
>>    .families = FAMILY_MASK(AF_VSOCK),
>>                            ^
>> ss.c:252:46: note: in definition of macro ‘FAMILY_MASK’
>>  #define FAMILY_MASK(family) ((uint64_t)1 << (family))
>>                                               ^
>> ss.c:334:2: error: array index in initializer not of integer type
>>   [AF_VSOCK] = {
>>   ^
>> ss.c:334:2: error: (near initialization for ‘default_afs’)
>> make[1]: *** [ss.o] Error 1
>> make: *** [all] Error 2
>>
> 
> Probably should just add an #ifdef to takeout that if not present
> 

Most userspace tools have a compat header for cases like this.

#ifndef AF_VSOCK
#define AF_VSOCK 	40
#endif

^ permalink raw reply

* Re: [PATCH bpf-net] bpf: Change bpf_fib_lookup to return lookup status
From: David Ahern @ 2018-06-19 20:16 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: dsahern, netdev, borkmann, ast, davem
In-Reply-To: <20180619163644.gwdrauesfeq6vc5v@kafai-mbp.dhcp.thefacebook.com>

On 6/19/18 10:36 AM, Martin KaFai Lau wrote:
> On Tue, Jun 19, 2018 at 09:34:28AM -0600, David Ahern wrote:
>> On 6/19/18 9:25 AM, Martin KaFai Lau wrote:
>>> On Mon, Jun 18, 2018 at 03:35:25PM -0600, David Ahern wrote:
>>>> On 6/18/18 2:55 PM, Martin KaFai Lau wrote:
>>>>>> 	/* rc > 0 case */
>>>>>> 	switch(rc) {
>>>>>> 	case BPF_FIB_LKUP_RET_BLACKHOLE:
>>>>>> 	case BPF_FIB_LKUP_RET_UNREACHABLE:
>>>>>> 	case BPF_FIB_LKUP_RET_PROHIBIT:
>>>>>> 		return XDP_DROP;
>>>>>> 	}
>>>>>>
>>>>>> For the others it becomes a question of do we share why the stack needs
>>>>>> to be involved? Maybe the program wants to collect stats to show traffic
>>>>>> patterns that can be improved (BPF_FIB_LKUP_RET_FRAG_NEEDED) or support
>>>>>> in the kernel needs to be improved (BPF_FIB_LKUP_RET_UNSUPP_LWT) or an
>>>>>> interface is misconfigured (BPF_FIB_LKUP_RET_FWD_DISABLED).
>>>>> Thanks for the explanation.
>>>>>
>>>>> Agree on the bpf able to collect stats will be useful.
>>>>>
>>>>> I am wondering, if a new BPF_FIB_LKUP_RET_XYZ is added later,
>>>>> how may the old xdp_prog work/not-work?  As of now, the return value
>>>>> is straight forward, FWD, PASS (to stack) or DROP (error).
>>>>> With this change, the xdp_prog needs to match/switch() the
>>>>> BPF_FIB_LKUP_RET_* to at least PASS and DROP.
>>>>
>>>> IMO, programs should only call XDP_DROP for known reasons - like the 3
>>>> above. Anything else punt to the stack.
>>>>
>>>> If a new RET_XYZ comes along:
>>>> 1. the new XYZ is a new ACL response where the packet is to be dropped.
>>>> If the program does not understand XYZ and punts to the stack
>>>> (recommendation), then a second lookup is done during normal packet
>>>> processing and the stack drops it.
>>>>
>>>> 2. the new XYZ is a new path in the kernel that is unsupported with
>>>> respect to XDP forwarding, nothing new for the program to do.
>>>>
>>>> Either way I would expect stats on BPF_FIB_LKUP_RET_* to give a hint to
>>>> the program writer.
>>>>
>>>> Worst case of punting packets to the stack for any rc != 0 means the
>>>> stack is doing 2 lookups - 1 in XDP based on its lookup parameters and 1
>>>> in normal stack processing - to handle the packet.
>>> Instead of having the xdp_prog to follow the meaning of what RET_SYZ is,
>>> should the bpf_*_fib_lookup() return value be kept as is such that
>>> the xdp_prog is clear what to do.  The reason can be returned in
>>> the 'struct bpf_fib_lookup'.  The number of reasons can be extended.
>>> If the xdp_prog does not understand a reason, it still will not
>>> affect its decision because the return value is clear.
>>> I think the situation here is similar to regular syscall which usually
>>> uses -1 to clearly states error and errno to spells out the reason.
>>>
>>
>> I did consider returning the status in struct bpf_fib_lookup. However,
>> it is 64 bytes and can not be extended without a big performance
>> penalty, so the only option there is to make an existing entry a union
>> the most logical of which is the ifindex. It seemed odd to me to have
>> the result by hidden in the struct as a union on ifindex and returning
>> the egress index from the function:
>>
>> @@ -2625,7 +2636,11 @@ struct bpf_fib_lookup {
>>
>>         /* total length of packet from network header - used for MTU
>> check */
>>         __u16   tot_len;
>> -       __u32   ifindex;  /* L3 device index for lookup */
>> +
>> +       union {
>> +               __u32   ifindex;  /* input: L3 device index for lookup */
>> +               __u32   result;   /* output: one of BPF_FIB_LKUP_RET_* */
>> +       };
>>
>>
>> It seemed more natural to have ifindex stay ifindex and only change
>> value on return:
>>
>> @@ -2625,7 +2639,11 @@ struct bpf_fib_lookup {
>>
>>  	/* total length of packet from network header - used for MTU check */
>>  	__u16	tot_len;
>> -	__u32	ifindex;  /* L3 device index for lookup */
>> +
>> +	/* input: L3 device index for lookup
>> +	 * output: nexthop device index from FIB lookup
>> +	 */
>> +	__u32	ifindex;
>>
>>  	union {
>>  		/* inputs to lookup */
>>
>>
>> From a program's perspective:
>>
>> rc < 0  -- program is passing incorrect data
>> rc == 0 -- packet can be forwarded
>> rc > 0  -- packet can not be forwarded.
>>
>> BPF programs are not required to track the LKUP_RET values any more than
>> a function returning multiple negative values - the caller just checks
>> rc < 0 means failure. If the program cares it can look at specific
>> values of rc to see the specific value.
>>
>> The same applies with the LKUP_RET values - they are there to provide
>> insight into why the packet is not forwarded directly if the program
>> cares to know why.
> hmm...ic. My concern is, the prog can interpret rc > 0 (in this patch) to be
> drop vs pass (although we can advise them in bpf.h to always pass if it does
> not understand a rc but it is not a strong contract),  it may catch people
> a surprise if a xdp_prog suddenly drops everything when running in a
> newer kernel where the upper stack can actually handle it.
> 
> while the current behavior (i.e. before this patch, rc == 0) is always pass
> to the stack.
> 
> I think at least comments should be put in the enum such that
> the xdp/tc_prog should expect the enum could be extended later, so
> the suggested behavior should be a pass for unknown LKUP_RET and let
> the stack to decide.
> 

All APIs with enums have the inherent quality that more can be added.
Nothing about rc > 0 says it is ok to drop the packet and nothing in the
documentation says it is ok to drop the packet. The program author needs
to look at the extra information provided by the rc. Specific values are
a hint that yes the packet can be dropped; others merely say 'packet
needs help from the stack' with a reason why it needs help.

My intention is to allow the XDP program to understand FIB based ACLs.
To that end a fib lookup result of blackhole, unreachable and prohibit
needs to be returned to the xdp program with the effective summary of
"can drop in the xdp program".

If the other return codes are going to cause confusion then less shorten
the list:

enum {
	BPF_FIB_LKUP_RET_SUCCESS,      /* packet is to be forwareded */
	BPF_FIB_LKUP_RET_CAN_DROP,     /* XDP program can drop the packet */
	BPF_FIB_LKUP_RET_NEED_STACK,   /* packet needs full stack assist */
};

But, that still does not solve the problem of rc > 0 means xdp program
can drop the packet, but then that is not the intention of rc > 0. The
intention is only "here's more information about why this packet can not
be forwarded at this layer"

^ permalink raw reply

* Re: [PATCH net] net/sched: act_ife: fix recursive lock and idr leak
From: Cong Wang @ 2018-06-19 20:13 UTC (permalink / raw)
  To: Davide Caratti
  Cc: Jamal Hadi Salim, David S. Miller,
	Linux Kernel Network Developers
In-Reply-To: <40b45f70ef007b222b36a4676174e597c41d697f.1529415169.git.dcaratti@redhat.com>

On Tue, Jun 19, 2018 at 6:39 AM, Davide Caratti <dcaratti@redhat.com> wrote:
> a recursive lock warning [1] can be observed with the following script,
>
>  # $TC actions add action ife encode allow prio pass index 42
>  IFE type 0xED3E
>  # $TC actions replace action ife encode allow tcindex pass index 42
>
> in case the kernel was unable to run the last command (e.g. because of
> the impossibility to load 'act_meta_skbtcindex'). For a similar reason,
> the kernel can leak idr in the error path of tcf_ife_init(), because
> tcf_idr_release() is not called after successful idr reservation:
>
>  # $TC actions add action ife encode allow tcindex index 47
>  IFE type 0xED3E
>  RTNETLINK answers: No such file or directory
>  We have an error talking to the kernel
>  # $TC actions add action ife encode allow tcindex index 47
>  IFE type 0xED3E
>  RTNETLINK answers: No space left on device
>  We have an error talking to the kernel
>  # $TC actions add action ife encode use mark 7 type 0xfefe pass index 47
>  IFE type 0xFEFE
>  RTNETLINK answers: No space left on device
>  We have an error talking to the kernel
>
> Since tcfa_lock is already taken when the action is being edited, a call
> to tcf_idr_release() wrongly makes tcf_idr_cleanup() take the same lock
> again. On the other hand, tcf_idr_release() needs to be called in the
> error path of tcf_ife_init(), to undo the last tcf_idr_create() invocation.
> Fix both problems in tcf_ife_init().
> Since the cleanup() routine can now be called when ife->params is NULL,
> also add a NULL pointer check to avoid calling kfree_rcu(NULL, rcu).

Acked-by: Cong Wang <xiyou.wangcong@gmail.com>

^ permalink raw reply

* Re: [PATCH] Revert "net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends"
From: Eric Dumazet @ 2018-06-19 20:10 UTC (permalink / raw)
  To: Andreas Schwab, Eric Dumazet
  Cc: Mathieu Malaterre, David S. Miller, Eric Dumazet, LKML,
	Christophe LEROY, Meelis Roos, netdev, linuxppc-dev
In-Reply-To: <87po0mvbgl.fsf@igel.home>



On 06/19/2018 12:10 PM, Andreas Schwab wrote:
> On Jun 18 2018, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> DUMP_PREFIX_ADDRESS might give us more information (say alignment problem, or crossing page boundaries)
> 
> DUMP_PREFIX_ADDRESS is useless for that purpose.
> 
> Here are some samples of broken csums:
> 
> [  853.849225] sungem: sungem wrong csum : 9886/07be, len 94 bytes, c0000001fa187e02
> [  853.849232] raw data: 00000000: 00 0d 93 43 81 62 18 d6 c7 51 b8 1c 08 00 45 10  ...C.b...Q....E.
> [  853.849235] raw data: 00000010: 00 4c cb a0 40 00 40 11 d9 97 c0 a8 0a 01 c0 a8  .L..@.@.........
> [  853.849237] raw data: 00000020: 0a 07 00 7b 00 7b 00 38 69 e1 1c 03 0c f7 00 00  ...{.{.8i.......
> [  853.849240] raw data: 00000030: 08 f0 00 00 15 f0 c0 35 67 67 de d3 ca c9 d9 5b  .......5gg.....[
> [  853.849242] raw data: 00000040: 1f ff de d3 d2 86 8f 67 fa f2 de d3 d2 86 8f 38  .......g.......8
> [  853.849244] raw data: 00000050: 2f ff de d3 d2 86 8f 3b ff ff d1 93 bc 50        /......;.....P

Thanks.

4 bytes in excess.

Might be the FCS, and it does not look like provided csum has a relation with it.

For some reason FCS stripping was disabled by :

commit 3e32011d4da6424b3bc65b1e1a047e30ac9882c7
Author: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date:   Mon May 19 09:39:11 2003 -0700

    [SUNGEM]: Updates from PowerPC people.
    
    Support more chips and split out all the complex PHY
    handling into a seperate file.


Since this NIC never had CHECKSUM_COMPLETE support (since we have to trim each skb,
thus were forcing ip_summed to CHECKSUM_NONE) we probably should remove it and be happy.

Unless you guys find a way to let the NIC strip the FCS, and double check the csum is a real csum ;)

diff --git a/drivers/net/ethernet/sun/sungem.c b/drivers/net/ethernet/sun/sungem.c
index 7a16d40a72d13cf1d522e8a3a396c826fe76f9b9..85439308375e95c3854e4a1561697d69ec85399b 100644
--- a/drivers/net/ethernet/sun/sungem.c
+++ b/drivers/net/ethernet/sun/sungem.c
@@ -760,7 +760,6 @@ static int gem_rx(struct gem *gp, int work_to_do)
        struct net_device *dev = gp->dev;
        int entry, drops, work_done = 0;
        u32 done;
-       __sum16 csum;
 
        if (netif_msg_rx_status(gp))
                printk(KERN_DEBUG "%s: rx interrupt, done: %d, rx_new: %d\n",
@@ -855,9 +854,6 @@ static int gem_rx(struct gem *gp, int work_to_do)
                        skb = copy_skb;
                }
 
-               csum = (__force __sum16)htons((status & RXDCTRL_TCPCSUM) ^ 0xffff);
-               skb->csum = csum_unfold(csum);
-               skb->ip_summed = CHECKSUM_COMPLETE;
                skb->protocol = eth_type_trans(skb, gp->dev);
 
                napi_gro_receive(&gp->napi, skb);

^ permalink raw reply related

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-19 20:09 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Samudrala, Sridhar, Alexander Duyck, virtio-dev, aaron.f.brown,
	Jiri Pirko, Michael S. Tsirkin, Jakub Kicinski, Netdev,
	qemu-devel, virtualization, konrad.wilk, boris.ostrovsky,
	Joao Martins, Venu Busireddy, vijay.balakrishna
In-Reply-To: <20180619125453.2d2dfb2d.cohuck@redhat.com>

On Tue, Jun 19, 2018 at 3:54 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> On Fri, 15 Jun 2018 10:06:07 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Fri, Jun 15, 2018 at 4:48 AM, Cornelia Huck <cohuck@redhat.com> wrote:
>> > On Thu, 14 Jun 2018 18:57:11 -0700
>> > Siwei Liu <loseweigh@gmail.com> wrote:
>> >
>> >> Thank you for sharing your thoughts, Cornelia. With questions below, I
>> >> think you raised really good points, some of which I don't have answer
>> >> yet and would also like to explore here.
>> >>
>> >> First off, I don't want to push the discussion to the extreme at this
>> >> point, or sell anything about having QEMU manage everything
>> >> automatically. Don't get me wrong, it's not there yet. Let's don't
>> >> assume we are tied to a specific or concerte solution. I think the key
>> >> for our discussion might be to define or refine the boundary between
>> >> VM and guest,  e.g. what each layer is expected to control and manage
>> >> exactly.
>> >>
>> >> In my view, there might be possibly 3 different options to represent
>> >> the failover device conceipt to QEMU and libvirt (or any upper layer
>> >> software):
>> >>
>> >> a. Seperate device: in this model, virtio and passthough remains
>> >> separate devices just as today. QEMU exposes the standby feature bit
>> >> for virtio, and publish status/event around the negotiation process of
>> >> this feature bit for libvirt to react upon. Since Libvirt has the
>> >> pairing relationship itself, maybe through MAC address or something
>> >> else, it can control the presence of primary by hot plugging or
>> >> unplugging the passthrough device, although it has to work tightly
>> >> with virtio's feature negotation process. Not just for migration but
>> >> also various corner scenarios (driver/feature ok, device reset,
>> >> reboot, legacy guest etc) along virtio's feature negotiation.
>> >
>> > Yes, that one has obvious tie-ins to virtio's modus operandi.
>> >
>> >>
>> >> b. Coupled device: in this model, virtio and passthough devices are
>> >> weakly coupled using some group ID, i.e. QEMU match the passthough
>> >> device for a standby virtio instance by comparing the group ID value
>> >> present behind each device's bridge. Libvirt provides QEMU the group
>> >> ID for both type of devices, and only deals with hot plug for
>> >> migration, by checking some migration status exposed (e.g. the feature
>> >> negotiation status on the virtio device) by QEMU. QEMU manages the
>> >> visibility of the primary in guest along virtio's feature negotiation
>> >> process.
>> >
>> > I'm a bit confused here. What, exactly, ties the two devices together?
>>
>> The group UUID. Since QEMU VFIO dvice does not have insight of MAC
>> address (which it doesn't have to), the association between VFIO
>> passthrough and standby must be specificed for QEMU to understand the
>> relationship with this model. Note, standby feature is no longer
>> required to be exposed under this model.
>
> Isn't that a bit limiting, though?
>
> With this model, you can probably tie a vfio-pci device and a
> virtio-net-pci device together. But this will fail if you have
> different transports: Consider tying together a vfio-pci device and a
> virtio-net-ccw device on s390, for example. The standby feature bit is
> on the virtio-net level and should not have any dependency on the
> transport used.

Probably we'd limit the support for grouping to virtio-net-pci device
and vfio-pci device only. For virtio-net-pci, as you might see with
Venu's patch, we store the group UUID on the config space of
virtio-pci, which is only applicable to PCI transport.

If virtio-net-ccw needs to support the same, I think similar grouping
interface should be defined on the VirtIO CCW transport. I think the
current implementation of the Linux failover driver assumes that it's
SR-IOV VF with same MAC address which the virtio-net-pci needs to pair
with, and that the PV path is on same PF without needing to update
network of the port-MAC association change. If we need to extend the
grouping mechanism to virtio-net-ccw, it has to pass such failover
mode to virtio driver specifically through some other option I guess.

>
>>
>> > If libvirt already has the knowledge that it should manage the two as a
>> > couple, why do we need the group id (or something else for other
>> > architectures)? (Maybe I'm simply missing something because I'm not
>> > that familiar with pci.)
>>
>> The idea is to have QEMU control the visibility and enumeration order
>> of the passthrough VFIO for the failover scenario. Hotplug can be one
>> way to achieve it, and perhaps there's other way around also. The
>> group ID is not just for QEMU to couple devices, it's also helpful to
>> guest too as grouping using MAC address is just not safe.
>
> Sorry about dragging mainframes into this, but this will only work for
> homogenous device coupling, not for heterogenous. Consider my vfio-pci
> + virtio-net-ccw example again: The guest cannot find out that the two
> belong together by checking some group ID, it has to either use the MAC
> or some needs-to-be-architectured property.
>
> Alternatively, we could propose that mechanism as pci-only, which means
> we can rely on mechanisms that won't necessarily work on non-pci
> transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> through a network card anytime in the near future, due to the nature of
> network cards currently in use on s390.)

Yes, let's do this just for PCI transport (homogenous) for now.

>
>>
>> >
>> >>
>> >> c. Fully combined device: in this model, virtio and passthough devices
>> >> are viewed as a single VM interface altogther. QEMU not just controls
>> >> the visibility of the primary in guest, but can also manage the
>> >> exposure of the passthrough for migratability. It can be like that
>> >> libvirt supplies the group ID to QEMU. Or libvirt does not even have
>> >> to provide group ID for grouping the two devices, if just one single
>> >> combined device is exposed by QEMU. In either case, QEMU manages all
>> >> aspect of such internal construct, including virtio feature
>> >> negotiation, presence of the primary, and live migration.
>> >
>> > Same question as above.
>> >
>> >>
>> >> It looks like to me that, in your opinion, you seem to prefer go with
>> >> (a). While I'm actually okay with either (b) or (c). Do I understand
>> >> your point correctly?
>> >
>> > I'm not yet preferring anything, as I'm still trying to understand how
>> > this works :) I hope we can arrive at a model that covers the use case
>> > and that is also flexible enough to be extended to other platforms.
>> >
>> >>
>> >> The reason that I feel that (a) might not be ideal, just as Michael
>> >> alluded to (quoting below), is that as management stack, it really
>> >> doesn't need to care about the detailed process of feature negotiation
>> >> (if we view the guest presence of the primary as part of feature
>> >> negotiation at an extended level not just virtio). All it needs to be
>> >> done is to hand in the required devices to QEMU and that's all. Why do
>> >> we need to addd various hooks, events for whichever happens internally
>> >> within the guest?
>> >>
>> >> ''
>> >> Primary device is added with a special "primary-failover" flag.
>> >> A virtual machine is then initialized with just a standby virtio
>> >> device. Primary is not yet added.
>> >>
>> >> Later QEMU detects that guest driver device set DRIVER_OK.
>> >> It then exposes the primary device to the guest, and triggers
>> >> a device addition event (hot-plug event) for it.
>> >>
>> >> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
>> >> to remove the primary driver.  In particular, if QEMU detects guest
>> >> re-initialization (e.g. by detecting guest reset) it immediately removes
>> >> the primary device.
>> >> ''
>> >>
>> >> and,
>> >>
>> >> ''
>> >> management just wants to give the primary to guest and later take it back,
>> >> it really does not care about the details of the process,
>> >> so I don't see what does pushing it up the stack buy you.
>> >>
>> >> So I don't think it *needs* to be done in libvirt. It probably can if you
>> >> add a bunch of hooks so it knows whenever vm reboots, driver binds and
>> >> unbinds from device, and can check that backup flag was set.
>> >> If you are pushing for a setup like that please get a buy-in
>> >> from libvirt maintainers or better write a patch.
>> >> ''
>> >
>> > This actually seems to mean the opposite to me: We need to know what
>> > the guest is doing and when, as it directly drives what we need to do
>> > with the devices. If we switch to a visibility vs a hotplug model (see
>> > the other mail), we might be able to handle that part within qemu.
>>
>> In the model of (b), I think it essentially turns hotplug to one of
>> mechanisms for QEMU to control the visibility. The libvirt can still
>> manage the hotplug of individual devices during live migration or in
>> normal situation to hot add/remove devices. Though the visibility of
>> the VFIO is under the controll of QEMU, and it's possible that the hot
>> add/remove request does not involve actual hot plug activity in guest
>> at all.
>
> That depends on how you model visibility, I guess. You'll probably want
> to stop traffic flowing through one or the other of the cards; would
> link down or similar be enough for the virtio device?

I'm not sure if it is a good idea. The guest user will see two devices
with same MAC but one of them is down. Do you expect user to use it or
not? And since the guest is going to be migrated, we need to unplug a
broken VF from guest before migrating, why do we bother plugging in
this useless VF at the first place?


Thanks,
-Siwei
>
>>
>> In the model of (c), the hotplug semantics of the combined device
>> would mean differently - it would end up with devices plugged in or
>> out altogther. To make this work, we either have to build a brand new
>> bond-like QEMU device consist of virtio and VFIO internally, or need
>> to have some abstraction in place for libvirt to manipulate the
>> combined device (and prohibit libvirt from operating on individual
>> internal device directly). Note with this model the group ID doesn't
>> even need to get exposed to libvirt, just imagine libvirt to supply
>> all options required to configure two regular virtio-net and VFIO
>> devices for a single device object, and QEMU will deal with the
>> device's visibility and enumeration, such when to hot plug VFIO device
>> in to or out from the guest.
>>
>> It might be complicated to implement (c) though.
>
> I think (c) would be even more complicated for heterogenous setups.

^ permalink raw reply

* Re: [PATCH net] net/sched: act_ife: preserve the action control in case of error
From: Cong Wang @ 2018-06-19 19:59 UTC (permalink / raw)
  To: Davide Caratti
  Cc: Jamal Hadi Salim, David S. Miller,
	Linux Kernel Network Developers
In-Reply-To: <717d0682de4fb24ed6818f2bf264202d22e1e8be.1529415179.git.dcaratti@redhat.com>

On Tue, Jun 19, 2018 at 6:45 AM, Davide Caratti <dcaratti@redhat.com> wrote:
> in the following script
>
>  # tc actions add action ife encode allow prio pass index 42
>  # tc actions replace action ife encode allow tcindex drop index 42
>
> the action control should remain equal to 'pass', if the kernel failed
> to replace the TC action. Pospone the assignment of the action control,
> to ensure it is not overwritten in the error path of tcf_ife_init().
>
> Fixes: ef6980b6becb ("introduce IFE action")
> Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Acked-by: Cong Wang <xiyou.wangcong@gmail.com>

^ permalink raw reply

* Re: [PATCH] dt-bindings: Fix unbalanced quotation marks
From: Florian Fainelli @ 2018-06-19 19:49 UTC (permalink / raw)
  To: Jonathan Neuschäfer, devicetree
  Cc: Mark Rutland, linux-mips, Ulf Hansson, Madalin Bucur, James Hogan,
	Linus Walleij, alsa-devel, linux-kernel, Thierry Reding,
	Alexandre Torgue, linux-samsung-soc, Kevin Hilman,
	Rafał Miłecki, Krzysztof Kozlowski, Jonathan Hunter,
	Kukjin Kim, linux-input, Philipp Zabel, Jason Cooper, linux-pm,
	Marc Zyngier
In-Reply-To: <20180617143127.11421-1-j.neuschaefer@gmx.net>

On 06/17/2018 07:31 AM, Jonathan Neuschäfer wrote:
> Multiple binding documents have various forms of unbalanced quotation
> marks. Fix them.
> 
> Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
> ---

[snip]

>  Documentation/devicetree/bindings/mips/brcm/soc.txt             | 2 +-

Acked-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian

^ permalink raw reply

* Re: [PATCH V2] brcmfmac: stop watchdog before detach and free everything
From: Andy Shevchenko @ 2018-06-19 19:37 UTC (permalink / raw)
  To: Arend van Spriel
  Cc: Kalle Valo, Michael Trimarchi, Franky Lin, Hante Meuleman,
	Chi-Hsien Lin, Wright Feng, David S. Miller,
	Pieter-Paul Giesberts, Ian Molton, open list:TI WILINK WIRELES...,
	open list:BROADCOM BRCM80211 IEEE802.11n WIRELESS DRIVER,
	brcm80211-dev-list, netdev, Linux Kernel Mailing List
In-Reply-To: <5B29590C.3040800@broadcom.com>

On Tue, Jun 19, 2018 at 10:27 PM, Arend van Spriel
<arend.vanspriel@broadcom.com> wrote:
> On 6/19/2018 9:06 PM, Kalle Valo wrote:
>>
>> Arend van Spriel <arend.vanspriel@broadcom.com> writes:
>>
>>> On 5/30/2018 11:06 AM, Michael Trimarchi wrote:
>>>>
>>>> Using built-in in kernel image without a firmware in filesystem
>>>> or in the kernel image can lead to a kernel NULL pointer deference.
>>>> Watchdog need to be stopped in brcmf_sdio_remove
>>>>
>>>> The system is going down NOW!
>>>> [ 1348.110759] Unable to handle kernel NULL pointer dereference at
>>>> virtual address 000002f8
>>>> Sent SIGTERM to all processes
>>>> [ 1348.121412] Mem abort info:
>>>> [ 1348.126962]   ESR = 0x96000004
>>>> [ 1348.130023]   Exception class = DABT (current EL), IL = 32 bits
>>>> [ 1348.135948]   SET = 0, FnV = 0
>>>> [ 1348.138997]   EA = 0, S1PTW = 0
>>>> [ 1348.142154] Data abort info:
>>>> [ 1348.145045]   ISV = 0, ISS = 0x00000004
>>>> [ 1348.148884]   CM = 0, WnR = 0
>>>> [ 1348.151861] user pgtable: 4k pages, 48-bit VAs, pgdp =
>>>> (____ptrval____)
>>>> [ 1348.158475] [00000000000002f8] pgd=0000000000000000
>>>> [ 1348.163364] Internal error: Oops: 96000004 [#1] PREEMPT SMP
>>>> [ 1348.168927] Modules linked in: ipv6
>>>> [ 1348.172421] CPU: 3 PID: 1421 Comm: brcmf_wdog/mmc0 Not tainted
>>>> 4.17.0-rc5-next-20180517 #18
>>>> [ 1348.180757] Hardware name: Amarula A64-Relic (DT)
>>>> [ 1348.185455] pstate: 60000005 (nZCv daif -PAN -UAO)
>>>> [ 1348.190251] pc : brcmf_sdiod_freezer_count+0x0/0x20
>>>> [ 1348.195124] lr : brcmf_sdio_watchdog_thread+0x64/0x290
>>>> [ 1348.200253] sp : ffff00000b85be30
>>>> [ 1348.203561] x29: ffff00000b85be30 x28: 0000000000000000
>>>> [ 1348.208868] x27: ffff00000b6cb918 x26: ffff80003b990638
>>>> [ 1348.214176] x25: ffff0000087b1a20 x24: ffff80003b94f800
>>>> [ 1348.219483] x23: ffff000008e620c8 x22: ffff000008f0b660
>>>> [ 1348.224790] x21: ffff000008c6a858 x20: 00000000fffffe00
>>>> [ 1348.230097] x19: ffff80003b94f800 x18: 0000000000000001
>>>> [ 1348.235404] x17: 0000ffffab2e8a74 x16: ffff0000080d7de8
>>>> [ 1348.240711] x15: 0000000000000000 x14: 0000000000000400
>>>> [ 1348.246018] x13: 0000000000000400 x12: 0000000000000001
>>>> [ 1348.251324] x11: 00000000000002c4 x10: 0000000000000a10
>>>> [ 1348.256631] x9 : ffff00000b85bc40 x8 : ffff80003be11870
>>>> [ 1348.261937] x7 : ffff80003dfc7308 x6 : 000000078ff08b55
>>>> [ 1348.267243] x5 : 00000139e1058400 x4 : 0000000000000000
>>>> [ 1348.272550] x3 : dead000000000100 x2 : 958f2788d6618100
>>>> [ 1348.277856] x1 : 00000000fffffe00 x0 : 0000000000000000
>>>
>>>
>>> Forgot about this one.
>>>
>>> Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com>
>>
>>
>> Should this go to 4.18?
>
>
> It is a bit of a corner case, but yeah let's do that.
>
> Regards,
> Arend

Guys, sorry, didn't have time to test. Do it ASAP.

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply

* Re: [PATCH V2] brcmfmac: stop watchdog before detach and free everything
From: Arend van Spriel @ 2018-06-19 19:27 UTC (permalink / raw)
  To: Kalle Valo
  Cc: Michael Trimarchi, Franky Lin, Hante Meuleman, Chi-Hsien Lin,
	Wright Feng, David S. Miller, Pieter-Paul Giesberts, Ian Molton,
	linux-wireless, brcm80211-dev-list.pdl, brcm80211-dev-list,
	netdev, linux-kernel
In-Reply-To: <87fu1izjcs.fsf@kamboji.qca.qualcomm.com>

On 6/19/2018 9:06 PM, Kalle Valo wrote:
> Arend van Spriel <arend.vanspriel@broadcom.com> writes:
>
>> On 5/30/2018 11:06 AM, Michael Trimarchi wrote:
>>> Using built-in in kernel image without a firmware in filesystem
>>> or in the kernel image can lead to a kernel NULL pointer deference.
>>> Watchdog need to be stopped in brcmf_sdio_remove
>>>
>>> The system is going down NOW!
>>> [ 1348.110759] Unable to handle kernel NULL pointer dereference at
>>> virtual address 000002f8
>>> Sent SIGTERM to all processes
>>> [ 1348.121412] Mem abort info:
>>> [ 1348.126962]   ESR = 0x96000004
>>> [ 1348.130023]   Exception class = DABT (current EL), IL = 32 bits
>>> [ 1348.135948]   SET = 0, FnV = 0
>>> [ 1348.138997]   EA = 0, S1PTW = 0
>>> [ 1348.142154] Data abort info:
>>> [ 1348.145045]   ISV = 0, ISS = 0x00000004
>>> [ 1348.148884]   CM = 0, WnR = 0
>>> [ 1348.151861] user pgtable: 4k pages, 48-bit VAs, pgdp = (____ptrval____)
>>> [ 1348.158475] [00000000000002f8] pgd=0000000000000000
>>> [ 1348.163364] Internal error: Oops: 96000004 [#1] PREEMPT SMP
>>> [ 1348.168927] Modules linked in: ipv6
>>> [ 1348.172421] CPU: 3 PID: 1421 Comm: brcmf_wdog/mmc0 Not tainted
>>> 4.17.0-rc5-next-20180517 #18
>>> [ 1348.180757] Hardware name: Amarula A64-Relic (DT)
>>> [ 1348.185455] pstate: 60000005 (nZCv daif -PAN -UAO)
>>> [ 1348.190251] pc : brcmf_sdiod_freezer_count+0x0/0x20
>>> [ 1348.195124] lr : brcmf_sdio_watchdog_thread+0x64/0x290
>>> [ 1348.200253] sp : ffff00000b85be30
>>> [ 1348.203561] x29: ffff00000b85be30 x28: 0000000000000000
>>> [ 1348.208868] x27: ffff00000b6cb918 x26: ffff80003b990638
>>> [ 1348.214176] x25: ffff0000087b1a20 x24: ffff80003b94f800
>>> [ 1348.219483] x23: ffff000008e620c8 x22: ffff000008f0b660
>>> [ 1348.224790] x21: ffff000008c6a858 x20: 00000000fffffe00
>>> [ 1348.230097] x19: ffff80003b94f800 x18: 0000000000000001
>>> [ 1348.235404] x17: 0000ffffab2e8a74 x16: ffff0000080d7de8
>>> [ 1348.240711] x15: 0000000000000000 x14: 0000000000000400
>>> [ 1348.246018] x13: 0000000000000400 x12: 0000000000000001
>>> [ 1348.251324] x11: 00000000000002c4 x10: 0000000000000a10
>>> [ 1348.256631] x9 : ffff00000b85bc40 x8 : ffff80003be11870
>>> [ 1348.261937] x7 : ffff80003dfc7308 x6 : 000000078ff08b55
>>> [ 1348.267243] x5 : 00000139e1058400 x4 : 0000000000000000
>>> [ 1348.272550] x3 : dead000000000100 x2 : 958f2788d6618100
>>> [ 1348.277856] x1 : 00000000fffffe00 x0 : 0000000000000000
>>
>> Forgot about this one.
>>
>> Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com>
>
> Should this go to 4.18?

It is a bit of a corner case, but yeah let's do that.

Regards,
Arend

^ permalink raw reply

* Re: [PATCH] Revert "net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends"
From: Andreas Schwab @ 2018-06-19 19:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mathieu Malaterre, David S. Miller, Eric Dumazet, LKML,
	Christophe LEROY, Meelis Roos, netdev, linuxppc-dev
In-Reply-To: <dd6f13bc-c5f2-85f8-c08d-837bc024fc7c@gmail.com>

On Jun 18 2018, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> DUMP_PREFIX_ADDRESS might give us more information (say alignment problem, or crossing page boundaries)

DUMP_PREFIX_ADDRESS is useless for that purpose.

Here are some samples of broken csums:

[  853.849225] sungem: sungem wrong csum : 9886/07be, len 94 bytes, c0000001fa187e02
[  853.849232] raw data: 00000000: 00 0d 93 43 81 62 18 d6 c7 51 b8 1c 08 00 45 10  ...C.b...Q....E.
[  853.849235] raw data: 00000010: 00 4c cb a0 40 00 40 11 d9 97 c0 a8 0a 01 c0 a8  .L..@.@.........
[  853.849237] raw data: 00000020: 0a 07 00 7b 00 7b 00 38 69 e1 1c 03 0c f7 00 00  ...{.{.8i.......
[  853.849240] raw data: 00000030: 08 f0 00 00 15 f0 c0 35 67 67 de d3 ca c9 d9 5b  .......5gg.....[
[  853.849242] raw data: 00000040: 1f ff de d3 d2 86 8f 67 fa f2 de d3 d2 86 8f 38  .......g.......8
[  853.849244] raw data: 00000050: 2f ff de d3 d2 86 8f 3b ff ff d1 93 bc 50        /......;.....P

[  857.883052] sungem: sungem wrong csum : dbb4/c48f, len 94 bytes, c0000001fa185882
[  857.883058] raw data: 00000000: 00 0d 93 43 81 62 18 d6 c7 51 b8 1c 08 00 45 00  ...C.b...Q....E.
[  857.883070] raw data: 00000010: 00 4c a1 97 40 00 3a 11 ce ed d9 5b 2c 11 c0 a8  .L..@.:....[,...
[  857.883080] raw data: 00000020: 0a 07 00 7b 00 7b 00 38 14 4b 24 02 06 ea 00 00  ...{.{.8.K$.....
[  857.883085] raw data: 00000030: 00 0b 00 00 02 99 c0 a8 64 09 de d3 d2 5a 36 e4  ........d....Z6.
[  857.883090] raw data: 00000040: bc f5 de d3 d2 8a 8f 2c 17 44 de d3 d2 8a 93 8b  .......,.D......
[  857.883094] raw data: 00000050: d7 b7 de d3 d2 8a 93 97 69 6e 39 7b d2 5a        ........in9{.Z

[  858.124689] sungem: sungem wrong csum : 1f4f/02d0, len 118 bytes, c0000001fa185602
[  858.124700] raw data: 00000000: 00 0d 93 43 81 62 d4 3d 7e 4c 48 b7 86 dd 61 01  ...C.b.=~LH...a.
[  858.124705] raw data: 00000010: 1e b1 00 3c 06 40 20 01 0a 62 17 11 88 01 00 00  ...<.@ ..b......
[  858.124709] raw data: 00000020: 00 00 00 00 0a 38 20 01 0a 62 17 11 88 01 00 00  .....8 ..b......
[  858.124714] raw data: 00000030: 00 00 00 00 00 07 94 b4 00 16 86 f5 29 e8 36 cb  ............).6.
[  858.124718] raw data: 00000040: 50 49 80 18 05 93 9a 53 00 00 01 01 08 0a 58 b2  PI.....S......X.
[  858.124723] raw data: 00000050: de 54 61 5f 2f 3c 00 00 00 10 cc 08 55 f7 da 21  .Ta_/<......U..!
[  858.124727] raw data: 00000060: f4 60 0a 6b 3c aa b9 b3 7e 61 10 b8 c2 be 9a 0b  .`.k<...~a......
[  858.124731] raw data: 00000070: c7 e9 5b 97 1b ac                                ..[...

[  858.126522] sungem: sungem wrong csum : 0836/19e9, len 90 bytes, c0000001fa185382
[  858.126530] raw data: 00000000: 00 0d 93 43 81 62 d4 3d 7e 4c 48 b7 86 dd 61 01  ...C.b.=~LH...a.
[  858.126532] raw data: 00000010: 1e b1 00 20 06 40 20 01 0a 62 17 11 88 01 00 00  ... .@ ..b......
[  858.126535] raw data: 00000020: 00 00 00 00 0a 38 20 01 0a 62 17 11 88 01 00 00  .....8 ..b......
[  858.126537] raw data: 00000030: 00 00 00 00 00 07 94 b4 00 16 86 f5 2a 04 36 cb  ............*.6.
[  858.126540] raw data: 00000040: 50 65 80 10 05 93 3e 56 00 00 01 01 08 0a 58 b2  Pe....>V......X.
[  858.126542] raw data: 00000050: de 56 61 5f 30 4d 1d 58 42 d2                    .Va_0M.XB.

[  858.131559] sungem: sungem wrong csum : 5891/c98d, len 90 bytes, c0000001fa185102
[  858.131567] raw data: 00000000: 00 0d 93 43 81 62 d4 3d 7e 4c 48 b7 86 dd 61 01  ...C.b.=~LH...a.
[  858.131570] raw data: 00000010: 1e b1 00 20 06 40 20 01 0a 62 17 11 88 01 00 00  ... .@ ..b......
[  858.131572] raw data: 00000020: 00 00 00 00 0a 38 20 01 0a 62 17 11 88 01 00 00  .....8 ..b......
[  858.131574] raw data: 00000030: 00 00 00 00 00 07 94 b4 00 16 86 f5 2a 04 36 cb  ............*.6.
[  858.131577] raw data: 00000040: 50 a1 80 10 05 93 3e 10 00 00 01 01 08 0a 58 b2  P.....>.......X.
[  858.131579] raw data: 00000050: de 5b 61 5f 30 52 3f ea 70 9b                    .[a_0R?.p.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply

* Re: [PATCH V2] brcmfmac: stop watchdog before detach and free everything
From: Kalle Valo @ 2018-06-19 19:06 UTC (permalink / raw)
  To: Arend van Spriel
  Cc: Michael Trimarchi, Franky Lin, Hante Meuleman, Chi-Hsien Lin,
	Wright Feng, David S. Miller, Pieter-Paul Giesberts, Ian Molton,
	linux-wireless, brcm80211-dev-list.pdl, brcm80211-dev-list,
	netdev, linux-kernel
In-Reply-To: <5B294920.5050909@broadcom.com>

Arend van Spriel <arend.vanspriel@broadcom.com> writes:

> On 5/30/2018 11:06 AM, Michael Trimarchi wrote:
>> Using built-in in kernel image without a firmware in filesystem
>> or in the kernel image can lead to a kernel NULL pointer deference.
>> Watchdog need to be stopped in brcmf_sdio_remove
>>
>> The system is going down NOW!
>> [ 1348.110759] Unable to handle kernel NULL pointer dereference at
>> virtual address 000002f8
>> Sent SIGTERM to all processes
>> [ 1348.121412] Mem abort info:
>> [ 1348.126962]   ESR = 0x96000004
>> [ 1348.130023]   Exception class = DABT (current EL), IL = 32 bits
>> [ 1348.135948]   SET = 0, FnV = 0
>> [ 1348.138997]   EA = 0, S1PTW = 0
>> [ 1348.142154] Data abort info:
>> [ 1348.145045]   ISV = 0, ISS = 0x00000004
>> [ 1348.148884]   CM = 0, WnR = 0
>> [ 1348.151861] user pgtable: 4k pages, 48-bit VAs, pgdp = (____ptrval____)
>> [ 1348.158475] [00000000000002f8] pgd=0000000000000000
>> [ 1348.163364] Internal error: Oops: 96000004 [#1] PREEMPT SMP
>> [ 1348.168927] Modules linked in: ipv6
>> [ 1348.172421] CPU: 3 PID: 1421 Comm: brcmf_wdog/mmc0 Not tainted
>> 4.17.0-rc5-next-20180517 #18
>> [ 1348.180757] Hardware name: Amarula A64-Relic (DT)
>> [ 1348.185455] pstate: 60000005 (nZCv daif -PAN -UAO)
>> [ 1348.190251] pc : brcmf_sdiod_freezer_count+0x0/0x20
>> [ 1348.195124] lr : brcmf_sdio_watchdog_thread+0x64/0x290
>> [ 1348.200253] sp : ffff00000b85be30
>> [ 1348.203561] x29: ffff00000b85be30 x28: 0000000000000000
>> [ 1348.208868] x27: ffff00000b6cb918 x26: ffff80003b990638
>> [ 1348.214176] x25: ffff0000087b1a20 x24: ffff80003b94f800
>> [ 1348.219483] x23: ffff000008e620c8 x22: ffff000008f0b660
>> [ 1348.224790] x21: ffff000008c6a858 x20: 00000000fffffe00
>> [ 1348.230097] x19: ffff80003b94f800 x18: 0000000000000001
>> [ 1348.235404] x17: 0000ffffab2e8a74 x16: ffff0000080d7de8
>> [ 1348.240711] x15: 0000000000000000 x14: 0000000000000400
>> [ 1348.246018] x13: 0000000000000400 x12: 0000000000000001
>> [ 1348.251324] x11: 00000000000002c4 x10: 0000000000000a10
>> [ 1348.256631] x9 : ffff00000b85bc40 x8 : ffff80003be11870
>> [ 1348.261937] x7 : ffff80003dfc7308 x6 : 000000078ff08b55
>> [ 1348.267243] x5 : 00000139e1058400 x4 : 0000000000000000
>> [ 1348.272550] x3 : dead000000000100 x2 : 958f2788d6618100
>> [ 1348.277856] x1 : 00000000fffffe00 x0 : 0000000000000000
>
> Forgot about this one.
>
> Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com>

Should this go to 4.18?

-- 
Kalle Valo

^ permalink raw reply

* Re: [PATCH] ucc_geth: Add BQL support
From: Joakim Tjernlund @ 2018-06-19 19:06 UTC (permalink / raw)
  To: leoyang.li@nxp.com, dave.taht@gmail.com; +Cc: netdev@vger.kernel.org
In-Reply-To: <CAA93jw4fk1zCqSJq3pZ2uNKd8N=foChF56O1TX5At1yi3yPd5Q@mail.gmail.com>

On Tue, 2018-06-19 at 11:37 -0700, Dave Taht wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe.
> 
> 
> very happy to see this. is there a specific chip or devboard this runs on?

This driver is for MPC83xx family SOCs(possibly others as well) on our custom boards, used in 
our telecom product.

You are actually the reason I impl. this :)

 Jocke

> 
> On Tue, Jun 19, 2018 at 11:24 AM, Li Yang <leoyang.li@nxp.com> wrote:
> > On Tue, Jun 19, 2018 at 11:30 AM, Joakim Tjernlund
> > <joakim.tjernlund@infinera.com> wrote:
> > > Signed-off-by: Joakim Tjernlund <joakim.tjernlund@infinera.com>
> > 
> > Acked-by: Li Yang <leoyang.li@nxp.com>
> > 
> > > ---
> > >  drivers/net/ethernet/freescale/ucc_geth.c | 7 ++++++-
> > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
> > > index f77ba9fa257b..6c99a9af6647 100644
> > > --- a/drivers/net/ethernet/freescale/ucc_geth.c
> > > +++ b/drivers/net/ethernet/freescale/ucc_geth.c
> > > @@ -3096,6 +3096,7 @@ static int ucc_geth_start_xmit(struct sk_buff *skb, struct net_device *dev)
> > > 
> > >         ugeth_vdbg("%s: IN", __func__);
> > > 
> > > +       netdev_sent_queue(dev, skb->len);
> > >         spin_lock_irqsave(&ugeth->lock, flags);
> > > 
> > >         dev->stats.tx_bytes += skb->len;
> > > @@ -3242,6 +3243,8 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
> > >         struct ucc_geth_private *ugeth = netdev_priv(dev);
> > >         u8 __iomem *bd;         /* BD pointer */
> > >         u32 bd_status;
> > > +       int howmany = 0;
> > > +       unsigned int bytes_sent = 0;
> > > 
> > >         bd = ugeth->confBd[txQ];
> > >         bd_status = in_be32((u32 __iomem *)bd);
> > > @@ -3257,7 +3260,8 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
> > >                 skb = ugeth->tx_skbuff[txQ][ugeth->skb_dirtytx[txQ]];
> > >                 if (!skb)
> > >                         break;
> > > -
> > > +               howmany++;
> > > +               bytes_sent += skb->len;
> > >                 dev->stats.tx_packets++;
> > > 
> > >                 dev_consume_skb_any(skb);
> > > @@ -3279,6 +3283,7 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
> > >                 bd_status = in_be32((u32 __iomem *)bd);
> > >         }
> > >         ugeth->confBd[txQ] = bd;
> > > +       netdev_completed_queue(dev, howmany, bytes_sent);
> > >         return 0;
> > >  }
> > > 
> > > --
> > > 2.13.6
> > > 
> 
> 
> 
> --
> 
> Dave Täht
> CEO, TekLibre, LLC
> http://www.teklibre.com
> Tel: 1-669-226-2619

^ permalink raw reply

* Re: [PATCH] ucc_geth: Add BQL support
From: Dave Taht @ 2018-06-19 18:37 UTC (permalink / raw)
  To: Li Yang; +Cc: Joakim Tjernlund, Netdev
In-Reply-To: <CADRPPNRWh=L2DhiWbPn4z5MXzwLutps_B-xZT9Rp4R5B0VMgzQ@mail.gmail.com>

very happy to see this. is there a specific chip or devboard this runs on?

On Tue, Jun 19, 2018 at 11:24 AM, Li Yang <leoyang.li@nxp.com> wrote:
> On Tue, Jun 19, 2018 at 11:30 AM, Joakim Tjernlund
> <joakim.tjernlund@infinera.com> wrote:
>> Signed-off-by: Joakim Tjernlund <joakim.tjernlund@infinera.com>
>
> Acked-by: Li Yang <leoyang.li@nxp.com>
>
>> ---
>>  drivers/net/ethernet/freescale/ucc_geth.c | 7 ++++++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
>> index f77ba9fa257b..6c99a9af6647 100644
>> --- a/drivers/net/ethernet/freescale/ucc_geth.c
>> +++ b/drivers/net/ethernet/freescale/ucc_geth.c
>> @@ -3096,6 +3096,7 @@ static int ucc_geth_start_xmit(struct sk_buff *skb, struct net_device *dev)
>>
>>         ugeth_vdbg("%s: IN", __func__);
>>
>> +       netdev_sent_queue(dev, skb->len);
>>         spin_lock_irqsave(&ugeth->lock, flags);
>>
>>         dev->stats.tx_bytes += skb->len;
>> @@ -3242,6 +3243,8 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
>>         struct ucc_geth_private *ugeth = netdev_priv(dev);
>>         u8 __iomem *bd;         /* BD pointer */
>>         u32 bd_status;
>> +       int howmany = 0;
>> +       unsigned int bytes_sent = 0;
>>
>>         bd = ugeth->confBd[txQ];
>>         bd_status = in_be32((u32 __iomem *)bd);
>> @@ -3257,7 +3260,8 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
>>                 skb = ugeth->tx_skbuff[txQ][ugeth->skb_dirtytx[txQ]];
>>                 if (!skb)
>>                         break;
>> -
>> +               howmany++;
>> +               bytes_sent += skb->len;
>>                 dev->stats.tx_packets++;
>>
>>                 dev_consume_skb_any(skb);
>> @@ -3279,6 +3283,7 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
>>                 bd_status = in_be32((u32 __iomem *)bd);
>>         }
>>         ugeth->confBd[txQ] = bd;
>> +       netdev_completed_queue(dev, howmany, bytes_sent);
>>         return 0;
>>  }
>>
>> --
>> 2.13.6
>>



-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619

^ permalink raw reply

* Re: [PATCH] ucc_geth: Add BQL support
From: Li Yang @ 2018-06-19 18:24 UTC (permalink / raw)
  To: Joakim Tjernlund; +Cc: Netdev
In-Reply-To: <20180619163036.20578-1-joakim.tjernlund@infinera.com>

On Tue, Jun 19, 2018 at 11:30 AM, Joakim Tjernlund
<joakim.tjernlund@infinera.com> wrote:
> Signed-off-by: Joakim Tjernlund <joakim.tjernlund@infinera.com>

Acked-by: Li Yang <leoyang.li@nxp.com>

> ---
>  drivers/net/ethernet/freescale/ucc_geth.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
> index f77ba9fa257b..6c99a9af6647 100644
> --- a/drivers/net/ethernet/freescale/ucc_geth.c
> +++ b/drivers/net/ethernet/freescale/ucc_geth.c
> @@ -3096,6 +3096,7 @@ static int ucc_geth_start_xmit(struct sk_buff *skb, struct net_device *dev)
>
>         ugeth_vdbg("%s: IN", __func__);
>
> +       netdev_sent_queue(dev, skb->len);
>         spin_lock_irqsave(&ugeth->lock, flags);
>
>         dev->stats.tx_bytes += skb->len;
> @@ -3242,6 +3243,8 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
>         struct ucc_geth_private *ugeth = netdev_priv(dev);
>         u8 __iomem *bd;         /* BD pointer */
>         u32 bd_status;
> +       int howmany = 0;
> +       unsigned int bytes_sent = 0;
>
>         bd = ugeth->confBd[txQ];
>         bd_status = in_be32((u32 __iomem *)bd);
> @@ -3257,7 +3260,8 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
>                 skb = ugeth->tx_skbuff[txQ][ugeth->skb_dirtytx[txQ]];
>                 if (!skb)
>                         break;
> -
> +               howmany++;
> +               bytes_sent += skb->len;
>                 dev->stats.tx_packets++;
>
>                 dev_consume_skb_any(skb);
> @@ -3279,6 +3283,7 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
>                 bd_status = in_be32((u32 __iomem *)bd);
>         }
>         ugeth->confBd[txQ] = bd;
> +       netdev_completed_queue(dev, howmany, bytes_sent);
>         return 0;
>  }
>
> --
> 2.13.6
>

^ permalink raw reply

* Re: [PATCH] dt-bindings: Fix unbalanced quotation marks
From: Dmitry Torokhov @ 2018-06-19 18:23 UTC (permalink / raw)
  To: Jonathan Neuschäfer
  Cc: devicetree, Kukjin Kim, Krzysztof Kozlowski, Rob Herring,
	Mark Rutland, Linus Walleij, Thomas Gleixner, Jason Cooper,
	Marc Zyngier, Thierry Reding, Jonathan Hunter, Maxime Coquelin,
	Alexandre Torgue, Hauke Mehrtens, Rafał Miłecki,
	Ralf Baechle, Paul Burton, James Hogan, Madalin Bucur
In-Reply-To: <20180617143127.11421-1-j.neuschaefer@gmx.net>

On Sun, Jun 17, 2018 at 04:31:18PM +0200, Jonathan Neuschäfer wrote:
> diff --git a/Documentation/devicetree/bindings/input/touchscreen/hideep.txt b/Documentation/devicetree/bindings/input/touchscreen/hideep.txt
> index 121d9b7c79a2..1063c30d53f7 100644
> --- a/Documentation/devicetree/bindings/input/touchscreen/hideep.txt
> +++ b/Documentation/devicetree/bindings/input/touchscreen/hideep.txt
> @@ -32,7 +32,7 @@ i2c@00000000 {
>  		reg = <0x6c>;
>  		interrupt-parent = <&gpx1>;
>  		interrupts = <2 IRQ_TYPE_LEVEL_LOW>;
> -		vdd-supply = <&ldo15_reg>";
> +		vdd-supply = <&ldo15_reg>;
>  		vid-supply = <&ldo18_reg>;
>  		reset-gpios = <&gpx1 5 0>;
>  		touchscreen-size-x = <1080>;

Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

-- 
Dmitry

^ permalink raw reply

* Re: [PATCH V2] brcmfmac: stop watchdog before detach and free everything
From: Arend van Spriel @ 2018-06-19 18:19 UTC (permalink / raw)
  To: Michael Trimarchi
  Cc: Franky Lin, Hante Meuleman, Chi-Hsien Lin, Wright Feng,
	Kalle Valo, David S. Miller, Pieter-Paul Giesberts, Ian Molton,
	linux-wireless, brcm80211-dev-list.pdl, brcm80211-dev-list,
	netdev, linux-kernel
In-Reply-To: <20180530090633.GA15390@panicking>

On 5/30/2018 11:06 AM, Michael Trimarchi wrote:
> Using built-in in kernel image without a firmware in filesystem
> or in the kernel image can lead to a kernel NULL pointer deference.
> Watchdog need to be stopped in brcmf_sdio_remove
>
> The system is going down NOW!
> [ 1348.110759] Unable to handle kernel NULL pointer dereference at virtual address 000002f8
> Sent SIGTERM to all processes
> [ 1348.121412] Mem abort info:
> [ 1348.126962]   ESR = 0x96000004
> [ 1348.130023]   Exception class = DABT (current EL), IL = 32 bits
> [ 1348.135948]   SET = 0, FnV = 0
> [ 1348.138997]   EA = 0, S1PTW = 0
> [ 1348.142154] Data abort info:
> [ 1348.145045]   ISV = 0, ISS = 0x00000004
> [ 1348.148884]   CM = 0, WnR = 0
> [ 1348.151861] user pgtable: 4k pages, 48-bit VAs, pgdp = (____ptrval____)
> [ 1348.158475] [00000000000002f8] pgd=0000000000000000
> [ 1348.163364] Internal error: Oops: 96000004 [#1] PREEMPT SMP
> [ 1348.168927] Modules linked in: ipv6
> [ 1348.172421] CPU: 3 PID: 1421 Comm: brcmf_wdog/mmc0 Not tainted 4.17.0-rc5-next-20180517 #18
> [ 1348.180757] Hardware name: Amarula A64-Relic (DT)
> [ 1348.185455] pstate: 60000005 (nZCv daif -PAN -UAO)
> [ 1348.190251] pc : brcmf_sdiod_freezer_count+0x0/0x20
> [ 1348.195124] lr : brcmf_sdio_watchdog_thread+0x64/0x290
> [ 1348.200253] sp : ffff00000b85be30
> [ 1348.203561] x29: ffff00000b85be30 x28: 0000000000000000
> [ 1348.208868] x27: ffff00000b6cb918 x26: ffff80003b990638
> [ 1348.214176] x25: ffff0000087b1a20 x24: ffff80003b94f800
> [ 1348.219483] x23: ffff000008e620c8 x22: ffff000008f0b660
> [ 1348.224790] x21: ffff000008c6a858 x20: 00000000fffffe00
> [ 1348.230097] x19: ffff80003b94f800 x18: 0000000000000001
> [ 1348.235404] x17: 0000ffffab2e8a74 x16: ffff0000080d7de8
> [ 1348.240711] x15: 0000000000000000 x14: 0000000000000400
> [ 1348.246018] x13: 0000000000000400 x12: 0000000000000001
> [ 1348.251324] x11: 00000000000002c4 x10: 0000000000000a10
> [ 1348.256631] x9 : ffff00000b85bc40 x8 : ffff80003be11870
> [ 1348.261937] x7 : ffff80003dfc7308 x6 : 000000078ff08b55
> [ 1348.267243] x5 : 00000139e1058400 x4 : 0000000000000000
> [ 1348.272550] x3 : dead000000000100 x2 : 958f2788d6618100
> [ 1348.277856] x1 : 00000000fffffe00 x0 : 0000000000000000

Forgot about this one.

Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com>
> Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
> ---
>   drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c | 7 +++++++
>   1 file changed, 7 insertions(+)
>
> diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c
> index 412a05b..061f69d 100644
> --- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c
> +++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c
> @@ -4294,6 +4294,13 @@ void brcmf_sdio_remove(struct brcmf_sdio *bus)
>   	brcmf_dbg(TRACE, "Enter\n");
>
>   	if (bus) {
> +		/* Stop watchdog task */
> +		if (bus->watchdog_tsk) {
> +			send_sig(SIGTERM, bus->watchdog_tsk, 1);
> +			kthread_stop(bus->watchdog_tsk);
> +			bus->watchdog_tsk = NULL;
> +		}
> +
>   		/* De-register interrupt handler */
>   		brcmf_sdiod_intr_unregister(bus->sdiodev);
>
>

^ permalink raw reply

* Re: [net RFC] net/mlx4_en: Use frag stride in crossing page boundary condition
From: Saeed Mahameed @ 2018-06-19 18:05 UTC (permalink / raw)
  To: eric.dumazet@gmail.com, kafai@fb.com, Tariq Toukan
  Cc: netdev@vger.kernel.org, edumazet@google.com
In-Reply-To: <1ddecaaa-9613-03ba-d761-a4d3410c4f7d@gmail.com>

On Thu, 2018-06-14 at 16:49 -0700, Eric Dumazet wrote:
> 
> On 06/14/2018 02:04 PM, Saeed Mahameed wrote:
> 
> > I was looking at the code without my fix :)
> > 
> > with my fix:
> > release = frags->page_offset + frag_info->frag_stride > PAGE_SIZE;
> > 
> > for XDP: frag_info->frag_stride is PAGE_SIZE, so release will
> > always be
> > true regardless of PAGE_SIZE.
> > 
> > So i guess i didn't quite understand your PowerPC concern.. can you
> > elaborate ?
> > 
> 
> So your maths with PAGE_SIZE=65536 and MTU 9000
> 
> frag_stride is about 9344
> 
> So if the last chunk of the page has 9100 bytes, we wont be able to
> use it, while really we should be able to use it.
> 
> 

this is only true for XDP setup, for non XDP max stride_size can only
be around ~3k and only for mtu > ~6k

For XDP setup you suggested:
-               priv->frag_info[0].frag_size = eff_mtu;
+               priv->frag_info[0].frag_size = PAGE_SIZE;

currently the condition is:

release = frags->page_offset + frag_info->frag_size > PAGE_SIZE;

so my solution and yours have the same problem you described above.

the problem is not with the initial values or with stride/farg size
math, it just that in XDP we shouldn't reuse at ALL. I agree with you
that we need to optimize and maybe for PAGE_SIZE > 8k we need to allow
XDP setup to reuses. but for now there is a data corruption to handle.

^ permalink raw reply

* [RFC v2 PATCH 3/4] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
From: Tushar Dave @ 2018-06-19 18:00 UTC (permalink / raw)
  To: ast, daniel, davem, jakub.kicinski, quentin.monnet, jiong.wang,
	guro, sandipan, john.fastabend, kafai, rdna, brakmo, netdev, acme,
	sowmini.varadhan
In-Reply-To: <1529431217-5264-1-git-send-email-tushar.n.dave@oracle.com>

Add a sample program that shows how socksg program is used and attached
to socket filter. The kernel sample program deals with struct
scatterlist that is passed as bpf context.

When run in server mode, the sample RDS program opens PF_RDS socket,
attaches eBPF program to RDS socket which then uses bpf_sg_next
helper along with bpf tail calls to retrieve packet data contained in
struct scatterlist form.

To ease testing, RDS client functionality is also added so that users
can generate RDS packet.

Server:
[root@lab71 bpf]# ./rds_filter -s 192.168.3.71 -t tcp
running server in a loop
transport tcp
server bound to address: 192.168.3.71 port 4000
server listening on 192.168.3.71

Client:
[root@lab70 bpf]# ./rds_filter -s 192.168.3.71 -c 192.168.3.70 -t tcp
transport tcp
client bound to address: 192.168.3.70 port 25278
client sending 8192 byte message  from 192.168.3.70 to 192.168.3.71 on
port 25278
payload contains:30 31 32 33 34 35 36 37 38 39 ...

Server output:
192.168.3.71 received a packet from 192.168.3.71 of len 8192 cmsg len 0,
on port 25278
payload contains:30 31 32 33 34 35 36 37 38 39 ...
server listening on 192.168.3.71

BPF program output:
[root@lab71]# cat /sys/kernel/debug/tracing/trace_pipe
          <idle>-0     [007] ..s.   525.994894: 0: Print first 6 bytes from sg element
          <idle>-0     [007] ..s.   525.994897: 0: First sg element:
          <idle>-0     [007] ..s.   525.994899: 0: 30 31 32
          <idle>-0     [007] ..s.   525.994900: 0: 33 34 35
          <idle>-0     [007] ..s.   525.994901: 0: next sg element:
          <idle>-0     [007] ..s.   525.994902: 0: a8 a9 aa
          <idle>-0     [007] ..s.   525.994903: 0: ab ac ad
          <idle>-0     [007] ..s.   525.994904: 0: next sg element:
          <idle>-0     [007] ..s.   525.994905: 0: 50 51 52
          <idle>-0     [007] ..s.   525.994905: 0: 53 54 55
          <idle>-0     [007] ..s.   525.994906: 0: next sg element:
          <idle>-0     [007] ..s.   525.994907: 0: f8 f9 fa
          <idle>-0     [007] ..s.   525.994907: 0: fb fc fd
          <idle>-0     [007] ..s.   525.994908: 0: next sg element:
          <idle>-0     [007] ..s.   525.994909: 0: a0 a1 a2
          <idle>-0     [007] ..s.   525.994909: 0: a3 a4 a5
          <idle>-0     [007] ..s.   525.994910: 0: next sg element:
          <idle>-0     [007] ..s.   525.994911: 0: 48 49 4a
          <idle>-0     [007] ..s.   525.994911: 0: 4b 4c 4d
          <idle>-0     [007] ..s.   525.994912: 0: no more sg element

Similary specifying '-t ib' will run this on IB link.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 samples/bpf/Makefile          |   3 +
 samples/bpf/rds_filter_kern.c |  78 ++++++++++
 samples/bpf/rds_filter_user.c | 339 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 420 insertions(+)
 create mode 100644 samples/bpf/rds_filter_kern.c
 create mode 100644 samples/bpf/rds_filter_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 1303af1..5de238b 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -52,6 +52,7 @@ hostprogs-y += xdp_adjust_tail
 hostprogs-y += xdpsock
 hostprogs-y += xdp_fwd
 hostprogs-y += task_fd_query
+hostprogs-y += rds_filter
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -107,6 +108,7 @@ xdp_adjust_tail-objs := xdp_adjust_tail_user.o
 xdpsock-objs := bpf_load.o xdpsock_user.o
 xdp_fwd-objs := bpf_load.o xdp_fwd_user.o
 task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
+rds_filter-objs := bpf_load.o rds_filter_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -163,6 +165,7 @@ always += xdp_adjust_tail_kern.o
 always += xdpsock_kern.o
 always += xdp_fwd_kern.o
 always += task_fd_query_kern.o
+always += rds_filter_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/rds_filter_kern.c b/samples/bpf/rds_filter_kern.c
new file mode 100644
index 0000000..8fe3d3c
--- /dev/null
+++ b/samples/bpf/rds_filter_kern.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/filter.h>
+#include <linux/ptrace.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include <linux/rds.h>
+#include "bpf_helpers.h"
+
+#define PROG(F) SEC("socksg/"__stringify(F)) int bpf_func_##F
+
+#define bpf_printk(fmt, ...)				\
+({							\
+	char ____fmt[] = fmt;				\
+	bpf_trace_printk(____fmt, sizeof(____fmt),	\
+			##__VA_ARGS__);			\
+})
+
+struct bpf_map_def SEC("maps") jmp_table = {
+	.type = BPF_MAP_TYPE_PROG_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(u32),
+	.max_entries = 2,
+};
+
+#define SG1 1
+
+static inline void dump_sg(struct sg_filter_md *sg)
+{
+	void *data = (void *)(long) sg->data;
+	void *data_end = (void *)(long) sg->data_end;
+	unsigned char *d;
+
+	if (data + 8 > data_end)
+		return;
+
+	d = (unsigned char *)data;
+	bpf_printk("%x %x %x\n", d[0], d[1], d[2]);
+	bpf_printk("%x %x %x\n", d[3], d[4], d[5]);
+
+	return;
+
+}
+
+static void sg_dispatcher(struct sg_filter_md *sg)
+{
+	int ret;
+
+	ret = bpf_sg_next(sg);
+	if (ret == -ENODATA) {
+		bpf_printk("no more sg element\n");
+		return;
+	}
+
+	/* We use same function to walk sg list */
+	bpf_tail_call(sg, &jmp_table, 1);
+}
+
+/* walk sg list */
+PROG(SG1)(struct sg_filter_md *sg)
+{
+	bpf_printk("next sg element:\n");
+	dump_sg(sg);
+	sg_dispatcher(sg);
+	return 0;
+}
+
+SEC("socksg/0")
+int main_prog(struct sg_filter_md *sg)
+{
+	bpf_printk("Print first 6 bytes from sg element\n");
+	bpf_printk("First sg element:\n");
+	dump_sg(sg);
+	sg_dispatcher(sg);
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/rds_filter_user.c b/samples/bpf/rds_filter_user.c
new file mode 100644
index 0000000..1165f1e
--- /dev/null
+++ b/samples/bpf/rds_filter_user.c
@@ -0,0 +1,339 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <arpa/inet.h>
+#include <assert.h>
+#include "bpf_load.h"
+#include <getopt.h>
+#include <errno.h>
+#include <netinet/in.h>
+#include <limits.h>
+#include <linux/sockios.h>
+#include <linux/rds.h>
+#include <linux/errqueue.h>
+#include <linux/bpf.h>
+#include <strings.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <string.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+
+#define TESTPORT	4000
+#define BUFSIZE		8192
+
+int transport = -1;
+
+static int str2trans(const char *trans)
+{
+	if (strcmp(trans, "tcp") == 0)
+		return RDS_TRANS_TCP;
+	if (strcmp(trans, "ib") == 0)
+		return RDS_TRANS_IB;
+	return (RDS_TRANS_NONE);
+}
+
+static const char *trans2str(int trans)
+{
+	switch (trans) {
+	case RDS_TRANS_TCP:
+		return ("tcp");
+	case RDS_TRANS_IB:
+		return ("ib");
+	case RDS_TRANS_NONE:
+		return ("none");
+	default:
+		return ("unknown");
+	}
+}
+
+static int gettransport(int sock)
+{
+	int err;
+	char val;
+	socklen_t len = sizeof(int);
+
+	err = getsockopt(sock, SOL_RDS, SO_RDS_TRANSPORT,
+			 (char *)&val, &len);
+	if (err < 0) {
+		fprintf(stderr, "%s: getsockopt %s\n",
+			__func__, strerror(errno));
+		return err;
+	}
+	return (int)val;
+}
+
+static int settransport(int sock, int transport)
+{
+	int err;
+
+	err = setsockopt(sock, SOL_RDS, SO_RDS_TRANSPORT,
+			 (char *)&transport, sizeof(transport));
+	if (err < 0) {
+		fprintf(stderr, "could not set transport %s, %s\n",
+			trans2str(transport), strerror(errno));
+	}
+	return err;
+}
+
+static void print_sock_local_info(int fd, char *str, struct sockaddr_in *ret)
+{
+	socklen_t sin_size = sizeof(struct sockaddr_in);
+	struct sockaddr_in sin;
+	int err;
+
+	err = getsockname(fd, (struct sockaddr *)&sin, &sin_size);
+	if (err < 0) {
+		fprintf(stderr, "%s getsockname %s\n",
+			__func__, strerror(errno));
+		return;
+	}
+	printf("%s address: %s port %d\n",
+		(str ? str : ""), inet_ntoa(sin.sin_addr), ntohs(sin.sin_port));
+
+	if (ret != NULL)
+		*ret = sin;
+}
+
+static void print_payload(char *buf)
+{
+	int i;
+
+	printf("payload contains:");
+	for (i = 0; i < 10; i++)
+		printf("%x ", buf[i]);
+	printf("...\n");
+}
+
+static void server(char *address, in_port_t port)
+{
+	struct sockaddr_in sin, din;
+	struct msghdr msg;
+	struct iovec *iov;
+	int rc, sock;
+	char *buf;
+
+	buf = calloc(BUFSIZE, sizeof(char));
+	if (!buf) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		return;
+	}
+
+	sock = socket(PF_RDS, SOCK_SEQPACKET, 0);
+	if (sock < 0) {
+		fprintf(stderr, "%s: socket %s\n", __func__, strerror(errno));
+		goto out;
+	}
+	if (settransport(sock, transport) < 0)
+		goto out;
+
+	printf("transport %s\n", trans2str(gettransport(sock)));
+
+	memset(&sin, 0, sizeof(sin));
+	sin.sin_family = AF_INET;
+	sin.sin_addr.s_addr = inet_addr(address);
+	sin.sin_port = htons(port);
+
+	rc = bind(sock, (struct sockaddr *)&sin, sizeof(sin));
+	if (rc < 0) {
+		fprintf(stderr, "%s: bind %s\n", __func__, strerror(errno));
+		goto out;
+	}
+
+	/* attach bpf prog */
+	assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd[1],
+			  sizeof(prog_fd[0])) == 0);
+
+	print_sock_local_info(sock, "server bound to", NULL);
+
+	iov = calloc(1, sizeof(struct iovec));
+	if (!iov) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		goto out;
+	}
+
+	while (1) {
+		memset(buf, 0, BUFSIZE);
+		iov[0].iov_base = buf;
+		iov[0].iov_len = BUFSIZE;
+
+		memset(&msg, 0, sizeof(msg));
+		msg.msg_name = &din;
+		msg.msg_namelen = sizeof(din);
+		msg.msg_iov = iov;
+		msg.msg_iovlen = 1;
+
+		printf("server listening on %s\n", inet_ntoa(sin.sin_addr));
+
+		rc = recvmsg(sock, &msg, 0);
+		if (rc < 0) {
+			fprintf(stderr, "%s: recvmsg %s\n",
+				__func__, strerror(errno));
+			break;
+		}
+
+		printf("%s received a packet from %s of len %d cmsg len %d, on port %d\n",
+			inet_ntoa(sin.sin_addr),
+			inet_ntoa(din.sin_addr),
+			(uint32_t) iov[0].iov_len,
+			(uint32_t) msg.msg_controllen,
+			ntohs(din.sin_port));
+
+		print_payload(buf);
+	}
+	free(iov);
+out:
+	free(buf);
+}
+
+static void create_message(char *buf)
+{
+	unsigned int i;
+
+	for (i = 0; i < BUFSIZE; i++) {
+		buf[i] = i + 0x30;
+	}
+}
+
+static int build_rds_packet(struct msghdr *msg, char *buf)
+{
+	struct iovec *iov;
+
+	iov = calloc(1, sizeof(struct iovec));
+	if (!iov) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		return -1;
+	}
+
+	msg->msg_iov = iov;
+	msg->msg_iovlen = 1;
+
+	iov[0].iov_base = buf;
+	iov[0].iov_len = BUFSIZE * sizeof(char);
+
+	return 0;
+}
+
+static void client(char *localaddr, char *remoteaddr, in_port_t server_port)
+{
+	struct sockaddr_in sin, din;
+	struct msghdr msg;
+	int rc, sock;
+	char *buf;
+
+	buf = calloc(BUFSIZE, sizeof(char));
+	if (!buf) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		return;
+	}
+
+	create_message(buf);
+
+	sock = socket(PF_RDS, SOCK_SEQPACKET, 0);
+	if (sock < 0) {
+		fprintf(stderr, "%s: socket %s\n", __func__, strerror(errno));
+		goto out;
+	}
+
+	if (settransport(sock, transport) < 0)
+		goto out;
+
+	printf("transport %s\n", trans2str(gettransport(sock)));
+
+	memset(&sin, 0, sizeof(sin));
+	sin.sin_family = AF_INET;
+	sin.sin_addr.s_addr = inet_addr(localaddr);
+	sin.sin_port = 0;
+
+	rc = bind(sock, (struct sockaddr *)&sin, sizeof(sin));
+	if (rc < 0) {
+		fprintf(stderr, "%s: bind %s\n", __func__, strerror(errno));
+		goto out;
+	}
+	print_sock_local_info(sock, "client bound to",  &sin);
+
+	memset(&msg, 0, sizeof(msg));
+	msg.msg_name = &din;
+	msg.msg_namelen = sizeof(din);
+
+	memset(&din, 0, sizeof(din));
+	din.sin_family = AF_INET;
+	din.sin_addr.s_addr = inet_addr(remoteaddr);
+	din.sin_port = htons(server_port);
+
+	rc = build_rds_packet(&msg, buf);
+	if (rc < 0)
+		goto out;
+
+	printf("client sending %d byte message from %s to %s on port %d\n",
+		(uint32_t) msg.msg_iov->iov_len, localaddr,
+		remoteaddr, ntohs(sin.sin_port));
+
+	rc = sendmsg(sock, &msg, 0);
+	if (rc < 0)
+		fprintf(stderr, "%s: sendmsg %s\n", __func__, strerror(errno));
+
+	print_payload(buf);
+
+	if (msg.msg_control)
+		free(msg.msg_control);
+	if (msg.msg_iov)
+		free(msg.msg_iov);
+out:
+	free(buf);
+
+	return;
+}
+
+static void usage(char *progname)
+{
+	fprintf(stderr, "Usage %s [-s srvaddr] [-c clientaddr] [-t transport]"
+		"\n", progname);
+}
+
+int main(int argc, char **argv)
+{
+	in_port_t server_port = TESTPORT;
+	char *serveraddr = NULL;
+	char *clientaddr = NULL;
+	char filename[256];
+	int opt;
+
+	while ((opt = getopt(argc, argv, "s:c:t:")) != -1) {
+		switch (opt) {
+		case 's':
+			serveraddr = optarg;
+			break;
+		case 'c':
+			clientaddr = optarg;
+			break;
+		case 't':
+			transport = str2trans(optarg);
+			if (transport == RDS_TRANS_NONE) {
+				fprintf(stderr,
+					"unknown transport %s\n", optarg);
+					usage(argv[0]);
+					return (-1);
+			}
+			break;
+		default:
+			usage(argv[0]);
+			return 1;
+		}
+	}
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(filename)) {
+		fprintf(stderr, "Error: load_bpf_file %s", bpf_log_buf);
+		return 1;
+	}
+
+	if (serveraddr && !clientaddr) {
+		printf("running server in a loop\n");
+		server(serveraddr, server_port);
+	} else if (serveraddr && clientaddr) {
+		client(clientaddr, serveraddr, server_port);
+	}
+
+	return 0;
+}
-- 
1.8.3.1

^ permalink raw reply related

* [RFC v2 PATCH 4/4] rds: invoke socket sg filter attached to rds socket
From: Tushar Dave @ 2018-06-19 18:00 UTC (permalink / raw)
  To: ast, daniel, davem, jakub.kicinski, quentin.monnet, jiong.wang,
	guro, sandipan, john.fastabend, kafai, rdna, brakmo, netdev, acme,
	sowmini.varadhan
In-Reply-To: <1529431217-5264-1-git-send-email-tushar.n.dave@oracle.com>

RDS module sits on top of TCP (rds_tcp) and IB (rds_rdma), so messages
arrive in form of skb (over TCP) and scatterlist (over IB/RDMA).
However, because socket filter only deal with skb (e.g. struct skb as
bpf context) we can only use socket filter for rds_tcp and not for
rds_rdma.

Considering one filtering solution for RDS, it seems that the common
denominator between sk_buff and scatterlist is scatterlist. Therefore,
this patch converts skb to sgvec and invoke sg_filter_run for
rds_tcp and simply invoke sg_filter_run for IB/rds_rdma.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/rds/ib.c       |  1 +
 net/rds/ib.h       |  1 +
 net/rds/ib_recv.c  | 12 ++++++++++++
 net/rds/rds.h      |  2 ++
 net/rds/recv.c     | 16 ++++++++++++++++
 net/rds/tcp.c      |  2 ++
 net/rds/tcp.h      |  2 ++
 net/rds/tcp_recv.c | 38 ++++++++++++++++++++++++++++++++++++++
 8 files changed, 74 insertions(+)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index 02deee2..3027832 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -421,6 +421,7 @@ struct rds_transport rds_ib_transport = {
 	.conn_path_shutdown	= rds_ib_conn_path_shutdown,
 	.inc_copy_to_user	= rds_ib_inc_copy_to_user,
 	.inc_free		= rds_ib_inc_free,
+	.inc_to_sg_get		= rds_ib_inc_to_sg_get,
 	.cm_initiate_connect	= rds_ib_cm_initiate_connect,
 	.cm_handle_connect	= rds_ib_cm_handle_connect,
 	.cm_connect_complete	= rds_ib_cm_connect_complete,
diff --git a/net/rds/ib.h b/net/rds/ib.h
index a6f4d7d..699b5b9b 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -375,6 +375,7 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn,
 void rds_ib_recv_free_caches(struct rds_ib_connection *ic);
 void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp);
 void rds_ib_inc_free(struct rds_incoming *inc);
+int rds_ib_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg);
 int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to);
 void rds_ib_recv_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc,
 			     struct rds_ib_ack_state *state);
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index b4e421a..62be497 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -219,6 +219,18 @@ void rds_ib_inc_free(struct rds_incoming *inc)
 	rds_ib_recv_cache_put(&ibinc->ii_cache_entry, &ic->i_cache_incs);
 }
 
+int rds_ib_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg)
+{
+	struct rds_ib_incoming *ibinc;
+	struct rds_page_frag *frag;
+
+	ibinc = container_of(inc, struct rds_ib_incoming, ii_inc);
+	frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item);
+	*sg =  &frag->f_sg;
+
+	return 0;
+}
+
 static void rds_ib_recv_clear_one(struct rds_ib_connection *ic,
 				  struct rds_ib_recv_work *recv)
 {
diff --git a/net/rds/rds.h b/net/rds/rds.h
index b04c333..f5ea833 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -528,6 +528,8 @@ struct rds_transport {
 	int (*recv_path)(struct rds_conn_path *cp);
 	int (*inc_copy_to_user)(struct rds_incoming *inc, struct iov_iter *to);
 	void (*inc_free)(struct rds_incoming *inc);
+	int (*inc_to_sg_get)(struct rds_incoming *inc, struct scatterlist **sg);
+	void (*inc_to_sg_put)(struct scatterlist **sg);
 
 	int (*cm_handle_connect)(struct rdma_cm_id *cm_id,
 				 struct rdma_cm_event *event);
diff --git a/net/rds/recv.c b/net/rds/recv.c
index dc67458..e0c5b4c 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -286,6 +286,7 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr,
 	struct sock *sk;
 	unsigned long flags;
 	struct rds_conn_path *cp;
+	struct sk_filter *filter;
 
 	inc->i_conn = conn;
 	inc->i_rx_jiffies = jiffies;
@@ -369,6 +370,21 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr,
 	/* We can be racing with rds_release() which marks the socket dead. */
 	sk = rds_rs_to_sk(rs);
 
+	rcu_read_lock();
+	filter = rcu_dereference(sk->sk_filter);
+	if (filter) {
+		if (conn->c_trans->inc_to_sg_get) {
+			struct scatterlist *sg;
+
+			if (conn->c_trans->inc_to_sg_get(inc, &sg) == 0) {
+				sg_filter_run(sk, sg);
+				if (conn->c_trans->inc_to_sg_put)
+					conn->c_trans->inc_to_sg_put(&sg);
+			}
+		}
+	}
+	rcu_read_unlock();
+
 	/* serialize with rds_release -> sock_orphan */
 	write_lock_irqsave(&rs->rs_recv_lock, flags);
 	if (!sock_flag(sk, SOCK_DEAD)) {
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 351a284..b431854 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -376,6 +376,8 @@ struct rds_transport rds_tcp_transport = {
 	.conn_path_shutdown	= rds_tcp_conn_path_shutdown,
 	.inc_copy_to_user	= rds_tcp_inc_copy_to_user,
 	.inc_free		= rds_tcp_inc_free,
+	.inc_to_sg_get		= rds_tcp_inc_to_sg_get,
+	.inc_to_sg_put		= rds_tcp_inc_to_sg_put,
 	.stats_info_copy	= rds_tcp_stats_info_copy,
 	.exit			= rds_tcp_exit,
 	.t_owner		= THIS_MODULE,
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index c6fa080..466bdb9 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -82,6 +82,8 @@ void rds_tcp_restore_callbacks(struct socket *sock,
 int rds_tcp_recv_path(struct rds_conn_path *cp);
 void rds_tcp_inc_free(struct rds_incoming *inc);
 int rds_tcp_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to);
+int rds_tcp_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg);
+void rds_tcp_inc_to_sg_put(struct scatterlist **sg);
 
 /* tcp_send.c */
 void rds_tcp_xmit_path_prepare(struct rds_conn_path *cp);
diff --git a/net/rds/tcp_recv.c b/net/rds/tcp_recv.c
index b9fbd2e..ce62712 100644
--- a/net/rds/tcp_recv.c
+++ b/net/rds/tcp_recv.c
@@ -56,6 +56,44 @@ void rds_tcp_inc_free(struct rds_incoming *inc)
 	kmem_cache_free(rds_tcp_incoming_slab, tinc);
 }
 
+#define MAX_SG 17
+int rds_tcp_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg)
+{
+	struct scatterlist *sg_list;
+	struct rds_tcp_incoming *tinc;
+	struct sk_buff *skb;
+	int num_sg = 0;
+
+	tinc = container_of(inc, struct rds_tcp_incoming, ti_inc);
+
+	/* For now we are assuming that the max sg elements we need is MAX_SG.
+	 * To determine actual number of sg elements we need to traverse the
+	 * skb queue e.g.
+	 *
+	 * skb_queue_walk(&tinc->ti_skb_list, skb) {
+	 *	num_sg += skb_shinfo(skb)->nr_frags + 1;
+	 * }
+	 */
+	sg_list = kzalloc(sizeof(*sg_list) * MAX_SG, GFP_KERNEL);
+	if (!sg_list)
+		return -ENOMEM;
+
+	sg_init_table(sg_list, MAX_SG);
+	skb_queue_walk(&tinc->ti_skb_list, skb) {
+		num_sg += skb_to_sgvec_nomark(skb, &sg_list[num_sg], 0,
+					      skb->len);
+	}
+	sg_mark_end(&sg_list[num_sg - 1]);
+	*sg = sg_list;
+
+	return 0;
+}
+
+void rds_tcp_inc_to_sg_put(struct scatterlist **sg)
+{
+	kfree(*sg);
+}
+
 /*
  * this is pretty lame, but, whatever.
  */
-- 
1.8.3.1

^ permalink raw reply related

* [RFC v2 PATCH 2/4] ebpf: Add sg_filter_run and sg helper
From: Tushar Dave @ 2018-06-19 18:00 UTC (permalink / raw)
  To: ast, daniel, davem, jakub.kicinski, quentin.monnet, jiong.wang,
	guro, sandipan, john.fastabend, kafai, rdna, brakmo, netdev, acme,
	sowmini.varadhan
In-Reply-To: <1529431217-5264-1-git-send-email-tushar.n.dave@oracle.com>

When sg_filter_run() is invoked it runs the attached eBPF
SOCKET_SG_FILTER program which deals with struct scatterlist.

In addition, this patch also adds bpf_sg_next helper function that
allows users to retrieve the next sg element from sg list.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/linux/filter.h                    |  2 +
 include/uapi/linux/bpf.h                  | 10 ++++-
 net/core/filter.c                         | 72 +++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h            | 10 ++++-
 tools/testing/selftests/bpf/bpf_helpers.h |  3 ++
 5 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 71618b1..d176402 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1072,4 +1072,6 @@ struct bpf_sock_ops_kern {
 					 */
 };
 
+int sg_filter_run(struct sock *sk, struct scatterlist *sg);
+
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ef0a7b6..036432b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2076,6 +2076,13 @@ struct bpf_stack_build_id {
  * 	Return
  * 		A 64-bit integer containing the current cgroup id based
  * 		on the cgroup within which the current task is running.
+ *
+ * int bpf_sg_next(struct bpf_scatterlist *sg)
+ *	Description
+ *		This helper allows user to retrieve next sg element from
+ *		sg list.
+ *	Return
+ *		Returns 0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2158,7 +2165,8 @@ struct bpf_stack_build_id {
 	FN(rc_repeat),			\
 	FN(rc_keydown),			\
 	FN(skb_cgroup_id),		\
-	FN(get_current_cgroup_id),
+	FN(get_current_cgroup_id),	\
+	FN(sg_next),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 8f67942..702ff5b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -121,6 +121,53 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
 }
 EXPORT_SYMBOL(sk_filter_trim_cap);
 
+int sg_filter_run(struct sock *sk, struct scatterlist *sg)
+{
+	struct sk_filter *filter;
+	int err;
+
+	rcu_read_lock();
+	filter = rcu_dereference(sk->sk_filter);
+	if (filter) {
+		struct bpf_scatterlist bpfsg;
+		int num_sg;
+
+		if (!sg) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		num_sg = sg_nents(sg);
+		if (num_sg <= 0) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		/* We store a reference  to the sg list so it can later used by
+		 * eBPF helpers to retrieve the next sg element.
+		 */
+		bpfsg.num_sg = num_sg;
+		bpfsg.cur_sg = 0;
+		bpfsg.sg = sg;
+
+		/* For the first sg element, we store the pkt access pointers
+		 * into start and end so eBPF program can have pkt access using
+		 * data and data_end. The pkt access for subsequent element of
+		 * sg list is possible when eBPF program invokes bpf_sg_next
+		 * which takes care of setting start and end to the correct sg
+		 * element.
+		 */
+		bpfsg.start = sg_virt(sg);
+		bpfsg.end = bpfsg.start + sg->length;
+		BPF_PROG_RUN(filter->prog, &bpfsg);
+	}
+out:
+	rcu_read_unlock();
+
+	return err;
+}
+EXPORT_SYMBOL(sg_filter_run);
+
 BPF_CALL_1(bpf_skb_get_pay_offset, struct sk_buff *, skb)
 {
 	return skb_get_poff(skb);
@@ -3753,6 +3800,29 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
 	.arg1_type      = ARG_PTR_TO_CTX,
 };
 
+BPF_CALL_1(bpf_sg_next, struct bpf_scatterlist *, bpfsg)
+{
+	struct scatterlist *sg = bpfsg->sg;
+	int cur_sg = bpfsg->cur_sg;
+
+	cur_sg++;
+	if (cur_sg >= bpfsg->num_sg)
+		return -ENODATA;
+
+	bpfsg->cur_sg = cur_sg;
+	bpfsg->start = sg_virt(&sg[cur_sg]);
+	bpfsg->end = bpfsg->start + sg[cur_sg].length;
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_sg_next_proto = {
+	.func		= bpf_sg_next,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+};
+
 BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 	   int, level, int, optname, char *, optval, int, optlen)
 {
@@ -4720,6 +4790,8 @@ bool bpf_helper_changes_pkt_data(void *func)
 socksg_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
+	case BPF_FUNC_sg_next:
+		return &bpf_sg_next_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index c87ae16..a298498 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2076,6 +2076,13 @@ struct bpf_stack_build_id {
  * 	Return
  * 		A 64-bit integer containing the current cgroup id based
  * 		on the cgroup within which the current task is running.
+ *
+ * int bpf_sg_next(struct bpf_scatterlist *sg)
+ *	Description
+ *		This helper allows user to retrieve next sg element from
+ *		sg list.
+ *	Return
+ *		Returns 0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2158,7 +2165,8 @@ struct bpf_stack_build_id {
 	FN(rc_repeat),			\
 	FN(rc_keydown),			\
 	FN(skb_cgroup_id),		\
-	FN(get_current_cgroup_id),
+	FN(get_current_cgroup_id),	\
+	FN(sg_next),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index f2f28b6..1997ba2 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -133,6 +133,9 @@ static int (*bpf_rc_keydown)(void *ctx, unsigned int protocol,
 	(void *) BPF_FUNC_rc_keydown;
 static unsigned long long (*bpf_get_current_cgroup_id)(void) =
 	(void *) BPF_FUNC_get_current_cgroup_id;
+static unsigned long long (*bpf_sg_next)(void *ctx) =
+	(void *) BPF_FUNC_sg_next;
+
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
1.8.3.1

^ permalink raw reply related

* [RFC v2 PATCH 1/4] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
From: Tushar Dave @ 2018-06-19 18:00 UTC (permalink / raw)
  To: ast, daniel, davem, jakub.kicinski, quentin.monnet, jiong.wang,
	guro, sandipan, john.fastabend, kafai, rdna, brakmo, netdev, acme,
	sowmini.varadhan
In-Reply-To: <1529431217-5264-1-git-send-email-tushar.n.dave@oracle.com>

Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which uses the
existing socket filter infrastructure for bpf program attach and load.
SOCKET_SG_FILTER eBPF program receives struct scatterlist as bpf context
contrast to SOCKET_FILTER which deals with struct skb. This is useful
for kernel entities that don't have skb to represent packet data but
want to run eBPF socket filter on packet data that is in form of struct
scatterlist e.g. IB/RDMA

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/linux/bpf_types.h      |  1 +
 include/linux/filter.h         |  8 +++++
 include/uapi/linux/bpf.h       |  7 ++++
 kernel/bpf/syscall.c           |  1 +
 kernel/bpf/verifier.c          |  1 +
 net/core/filter.c              | 77 ++++++++++++++++++++++++++++++++++++++++--
 samples/bpf/bpf_load.c         | 11 ++++--
 tools/bpf/bpftool/prog.c       |  1 +
 tools/include/uapi/linux/bpf.h |  7 ++++
 tools/lib/bpf/libbpf.c         |  3 ++
 tools/lib/bpf/libbpf.h         |  2 ++
 11 files changed, 114 insertions(+), 5 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index c5700c2..f8b4b56 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -16,6 +16,7 @@
 BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SOCKET_SG_FILTER, socksg_filter)
 #endif
 #ifdef CONFIG_BPF_EVENTS
 BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 45fc0f5..71618b1 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -517,6 +517,14 @@ struct bpf_skb_data_end {
 	void *data_end;
 };
 
+struct bpf_scatterlist {
+	struct scatterlist *sg;
+	void *start;
+	void *end;
+	int cur_sg;
+	int num_sg;
+};
+
 struct sk_msg_buff {
 	void *data;
 	void *data_end;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 59b19b6..ef0a7b6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -144,6 +144,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
 	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 	BPF_PROG_TYPE_LIRC_MODE2,
+	BPF_PROG_TYPE_SOCKET_SG_FILTER,
 };
 
 enum bpf_attach_type {
@@ -2358,6 +2359,12 @@ enum sk_action {
 	SK_PASS,
 };
 
+/* use accessible scatterlist */
+struct sg_filter_md {
+	void *data; /* sg_virt(sg) */
+	void *data_end; /* sg_virt(sg) + sg->length */
+};
+
 /* user accessible metadata for SK_MSG packet hook, new fields must
  * be added to the end of this structure
  */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0fa2062..74193a8 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1300,6 +1300,7 @@ static int bpf_prog_load(union bpf_attr *attr)
 
 	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
 	    type != BPF_PROG_TYPE_CGROUP_SKB &&
+	    type != BPF_PROG_TYPE_SOCKET_SG_FILTER &&
 	    !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d6403b5..a00d3eb 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1320,6 +1320,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
 	case BPF_PROG_TYPE_LWT_XMIT:
 	case BPF_PROG_TYPE_SK_SKB:
 	case BPF_PROG_TYPE_SK_MSG:
+	case BPF_PROG_TYPE_SOCKET_SG_FILTER:
 		if (meta)
 			return meta->pkt_access;
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 3d9ba7e..8f67942 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1130,7 +1130,8 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
 
 static void __bpf_prog_release(struct bpf_prog *prog)
 {
-	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER) {
+	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER ||
+	    prog->type == BPF_PROG_TYPE_SOCKET_SG_FILTER) {
 		bpf_prog_put(prog);
 	} else {
 		bpf_release_orig_filter(prog);
@@ -1551,10 +1552,16 @@ int sk_reuseport_attach_filter(struct sock_fprog *fprog, struct sock *sk)
 
 static struct bpf_prog *__get_bpf(u32 ufd, struct sock *sk)
 {
+	struct bpf_prog *prog;
+
 	if (sock_flag(sk, SOCK_FILTER_LOCKED))
 		return ERR_PTR(-EPERM);
 
-	return bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
+	prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
+	if (IS_ERR(prog))
+		prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_SG_FILTER);
+
+	return prog;
 }
 
 int sk_attach_bpf(u32 ufd, struct sock *sk)
@@ -4710,6 +4717,15 @@ bool bpf_helper_changes_pkt_data(void *func)
 }
 
 static const struct bpf_func_proto *
+socksg_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	default:
+		return bpf_base_func_proto(func_id);
+	}
+}
+
+static const struct bpf_func_proto *
 tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
@@ -5037,6 +5053,30 @@ static bool sk_filter_is_valid_access(int off, int size,
 	return bpf_skb_is_valid_access(off, size, type, prog, info);
 }
 
+static bool socksg_filter_is_valid_access(int off, int size,
+					  enum bpf_access_type type,
+					  const struct bpf_prog *prog,
+					  struct bpf_insn_access_aux *info)
+{
+	switch (off) {
+	case offsetof(struct sg_filter_md, data):
+		info->reg_type = PTR_TO_PACKET;
+		break;
+	case offsetof(struct sg_filter_md, data_end):
+		info->reg_type = PTR_TO_PACKET_END;
+		break;
+	}
+
+	if (off < 0 || off >= sizeof(struct sg_filter_md))
+		return false;
+	if (off % size != 0)
+		return false;
+	if (size != sizeof(__u64))
+		return false;
+
+	return true;
+}
+
 static bool lwt_is_valid_access(int off, int size,
 				enum bpf_access_type type,
 				const struct bpf_prog *prog,
@@ -6516,6 +6556,30 @@ static u32 sk_skb_convert_ctx_access(enum bpf_access_type type,
 	return insn - insn_buf;
 }
 
+static u32 socksg_filter_convert_ctx_access(enum bpf_access_type type,
+					    const struct bpf_insn *si,
+					    struct bpf_insn *insn_buf,
+					    struct bpf_prog *prog,
+					    u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (si->off) {
+	case offsetof(struct sg_filter_md, data):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_scatterlist, start),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_scatterlist, start));
+		break;
+	case offsetof(struct sg_filter_md, data_end):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_scatterlist, end),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_scatterlist, end));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
 static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
 				     const struct bpf_insn *si,
 				     struct bpf_insn *insn_buf,
@@ -6654,6 +6718,15 @@ static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
 	.test_run		= bpf_prog_test_run_skb,
 };
 
+const struct bpf_verifier_ops socksg_filter_verifier_ops = {
+	.get_func_proto         = socksg_filter_func_proto,
+	.is_valid_access        = socksg_filter_is_valid_access,
+	.convert_ctx_access     = socksg_filter_convert_ctx_access,
+};
+
+const struct bpf_prog_ops socksg_filter_prog_ops = {
+};
+
 const struct bpf_verifier_ops tc_cls_act_verifier_ops = {
 	.get_func_proto		= tc_cls_act_func_proto,
 	.is_valid_access	= tc_cls_act_is_valid_access,
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 89161c9..15c355e 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -69,6 +69,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	bool is_sockops = strncmp(event, "sockops", 7) == 0;
 	bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0;
 	bool is_sk_msg = strncmp(event, "sk_msg", 6) == 0;
+	bool is_socksg = strncmp(event, "socksg", 6) == 0;
+
 	size_t insns_cnt = size / sizeof(struct bpf_insn);
 	enum bpf_prog_type prog_type;
 	char buf[256];
@@ -102,6 +104,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 		prog_type = BPF_PROG_TYPE_SK_SKB;
 	} else if (is_sk_msg) {
 		prog_type = BPF_PROG_TYPE_SK_MSG;
+	} else if (is_socksg) {
+		prog_type = BPF_PROG_TYPE_SOCKET_SG_FILTER;
 	} else {
 		printf("Unknown event '%s'\n", event);
 		return -1;
@@ -119,8 +123,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk)
 		return 0;
 
-	if (is_socket || is_sockops || is_sk_skb || is_sk_msg) {
-		if (is_socket)
+	if (is_socket || is_sockops || is_sk_skb || is_sk_msg || is_socksg) {
+		if (is_socket || is_socksg)
 			event += 6;
 		else
 			event += 7;
@@ -624,7 +628,8 @@ static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map)
 		    memcmp(shname, "cgroup/", 7) == 0 ||
 		    memcmp(shname, "sockops", 7) == 0 ||
 		    memcmp(shname, "sk_skb", 6) == 0 ||
-		    memcmp(shname, "sk_msg", 6) == 0) {
+		    memcmp(shname, "sk_msg", 6) == 0 ||
+		    memcmp(shname, "socksg", 6) == 0) {
 			ret = load_and_attach(shname, data->d_buf,
 					      data->d_size);
 			if (ret != 0)
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index a4f4352..06b2fef 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -72,6 +72,7 @@
 	[BPF_PROG_TYPE_RAW_TRACEPOINT]	= "raw_tracepoint",
 	[BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
 	[BPF_PROG_TYPE_LIRC_MODE2]	= "lirc_mode2",
+	[BPF_PROG_TYPE_SOCKET_SG_FILTER] = "socket_sg_filter",
 };
 
 static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index e0b0678..c87ae16 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -144,6 +144,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
 	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 	BPF_PROG_TYPE_LIRC_MODE2,
+	BPF_PROG_TYPE_SOCKET_SG_FILTER,
 };
 
 enum bpf_attach_type {
@@ -2358,6 +2359,12 @@ enum sk_action {
 	SK_PASS,
 };
 
+/* use accessible scatterlist */
+struct sg_filter_md {
+	void *data; /* sg_virt(sg) */
+	void *data_end; /* sg_virt(sg) + sg->length */
+};
+
 /* user accessible metadata for SK_MSG packet hook, new fields must
  * be added to the end of this structure
  */
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index a1e96b5..7628278 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1463,6 +1463,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_SK_MSG:
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 	case BPF_PROG_TYPE_LIRC_MODE2:
+	case BPF_PROG_TYPE_SOCKET_SG_FILTER:
 		return false;
 	case BPF_PROG_TYPE_UNSPEC:
 	case BPF_PROG_TYPE_KPROBE:
@@ -1998,6 +1999,7 @@ static bool bpf_program__is_type(struct bpf_program *prog,
 BPF_PROG_TYPE_FNS(raw_tracepoint, BPF_PROG_TYPE_RAW_TRACEPOINT);
 BPF_PROG_TYPE_FNS(xdp, BPF_PROG_TYPE_XDP);
 BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT);
+BPF_PROG_TYPE_FNS(socket_sg_filter, BPF_PROG_TYPE_SOCKET_SG_FILTER);
 
 void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 					   enum bpf_attach_type type)
@@ -2048,6 +2050,7 @@ void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 	BPF_SA_PROG_SEC("cgroup/sendmsg6", BPF_CGROUP_UDP6_SENDMSG),
 	BPF_S_PROG_SEC("cgroup/post_bind4", BPF_CGROUP_INET4_POST_BIND),
 	BPF_S_PROG_SEC("cgroup/post_bind6", BPF_CGROUP_INET6_POST_BIND),
+	BPF_PROG_SEC("socksg",          BPF_PROG_TYPE_SOCKET_SG_FILTER),
 };
 
 #undef BPF_PROG_SEC
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 0997653..3be165b 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -195,6 +195,7 @@ int bpf_program__set_prep(struct bpf_program *prog, int nr_instance,
 void bpf_program__set_type(struct bpf_program *prog, enum bpf_prog_type type);
 void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 					   enum bpf_attach_type type);
+int bpf_program__set_socket_sg_filter(struct bpf_program *prog);
 
 bool bpf_program__is_socket_filter(struct bpf_program *prog);
 bool bpf_program__is_tracepoint(struct bpf_program *prog);
@@ -204,6 +205,7 @@ void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 bool bpf_program__is_sched_act(struct bpf_program *prog);
 bool bpf_program__is_xdp(struct bpf_program *prog);
 bool bpf_program__is_perf_event(struct bpf_program *prog);
+bool bpf_program__is_socket_sg_filter(struct bpf_program *prog);
 
 /*
  * No need for __attribute__((packed)), all members of 'bpf_map_def'
-- 
1.8.3.1

^ permalink raw reply related

* [RFC v2 PATCH 0/4] eBPF and struct scatterlist
From: Tushar Dave @ 2018-06-19 18:00 UTC (permalink / raw)
  To: ast, daniel, davem, jakub.kicinski, quentin.monnet, jiong.wang,
	guro, sandipan, john.fastabend, kafai, rdna, brakmo, netdev, acme,
	sowmini.varadhan

This follows up on https://patchwork.ozlabs.org/cover/927050/
where the review feedback was to use bpf_skb_load_bytes() to deal with
linear and non-linear skbs. While that feedback is valid and correct,
the motivation for this work is to allow eBPF based firewalling for
kernel modules that do not always get their packet as an sk_buff from
their downlink drivers. One such instance of this use-case is RDS, which
can be run both over IB (driver RDMA's a scatterlist to the RDS module)
or over TCP (TCP passes an sk_buff to the RDS module)

This RFC (call it v2) uses exiting socket filter infrastructure and
extend it with new eBPF program type that deals with struct scatterlist.
For RDS, the integrated approach treats the scatterlist as the common
denominator, and allows the application to write a filter for processing
a scatterlist.

Details:
Patch 1 adds new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which
uses the existing socket filter infrastructure for bpf program attach
and load. eBPF program of type BPF_PROG_TYPE_SOCKET_SG_FILTER receives
struct scatterlist as bpf context contrast to
BPF_PROG_TYPE_SOCKET_FILTER which deals with struct skb. This new eBPF
program type allow socket filter to run on packet data that is in form
form of struct scatterlist.

Patch 2 adds functionality to run BPF_PROG_TYPE_SOCKET_SG_FILTER socket
filter program. A bpf helpers bpf_sg_next() is also added so users can
retrieve sg elements from scatterlist.

Patch 3 adds socket filter eBPF sample program that uses patch 1 and
patch 2. The sample program opens an rds socket, attach ebpf program
(socksg i.e. BPF_PROG_TYPE_SOCKET_SG_FILTER) to rds socket and uses
bpf_sg_next helper to look into sg. For a test, current ebpf program
only prints first few bytes from each elements of sg list.

Finally, patch 4 allows rds_recv_incoming to invoke socket filter
program which deals with scatterlist.

Thanks.

-Tushar

Tushar Dave (4):
  eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
  ebpf: Add sg_filter_run and sg helper
  ebpf: Add sample ebpf program for SOCKET_SG_FILTER
  rds: invoke socket sg filter attached to rds socket

 include/linux/bpf_types.h                 |   1 +
 include/linux/filter.h                    |  10 +
 include/uapi/linux/bpf.h                  |  17 +-
 kernel/bpf/syscall.c                      |   1 +
 kernel/bpf/verifier.c                     |   1 +
 net/core/filter.c                         | 149 ++++++++++++-
 net/rds/ib.c                              |   1 +
 net/rds/ib.h                              |   1 +
 net/rds/ib_recv.c                         |  12 ++
 net/rds/rds.h                             |   2 +
 net/rds/recv.c                            |  16 ++
 net/rds/tcp.c                             |   2 +
 net/rds/tcp.h                             |   2 +
 net/rds/tcp_recv.c                        |  38 ++++
 samples/bpf/Makefile                      |   3 +
 samples/bpf/bpf_load.c                    |  11 +-
 samples/bpf/rds_filter_kern.c             |  78 +++++++
 samples/bpf/rds_filter_user.c             | 339 ++++++++++++++++++++++++++++++
 tools/bpf/bpftool/prog.c                  |   1 +
 tools/include/uapi/linux/bpf.h            |  17 +-
 tools/lib/bpf/libbpf.c                    |   3 +
 tools/lib/bpf/libbpf.h                    |   2 +
 tools/testing/selftests/bpf/bpf_helpers.h |   3 +
 23 files changed, 703 insertions(+), 7 deletions(-)
 create mode 100644 samples/bpf/rds_filter_kern.c
 create mode 100644 samples/bpf/rds_filter_user.c

-- 
1.8.3.1

^ permalink raw reply

* Re: [PATCH 1/3] net: ethernet: fix suspend/resume in davinci_emac
From: Lukas Wunner @ 2018-06-19 18:00 UTC (permalink / raw)
  To: Bartosz Golaszewski
  Cc: Grygorii Strashko, David S . Miller, Florian Fainelli,
	Dan Carpenter, Ivan Khoronzhuk, Rob Herring, Kevin Hilman,
	David Lechner, Sekhar Nori, Andrew Lunn, linux-omap, netdev,
	linux-kernel, Bartosz Golaszewski, stable
In-Reply-To: <20180619160950.6283-2-brgl@bgdev.pl>

On Tue, Jun 19, 2018 at 06:09:48PM +0200, Bartosz Golaszewski wrote:
> From: Bartosz Golaszewski <bgolaszewski@baylibre.com>
> 
> This patch reverts commit 3243ff2a05ec ("net: ethernet: davinci_emac:
> Deduplicate bus_find_device() by name matching") and adds a comment
> which should stop anyone from reintroducing the same "fix" in the future.
> 
> We can't use bus_find_device_by_name() here because the device name is
> not guaranteed to be 'davinci_mdio'. On some systems it can be
> 'davinci_mdio.0' so we need to use strncmp() against the first part of
> the string to correctly match it.
> 
> Fixes: 3243ff2a05ec ("net: ethernet: davinci_emac: Deduplicate bus_find_device() by name matching")
> Cc: stable@vger.kernel.org
> Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>

This is still

Acked-by: Lukas Wunner <lukas@wunner.de>

given that my patch which is reverted here seems to have been incorrect.
Feel free to keep the ack if you respin in response to Florian Fainelli's
comments.

Thanks,

Lukas

> ---
>  drivers/net/ethernet/ti/davinci_emac.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ti/davinci_emac.c b/drivers/net/ethernet/ti/davinci_emac.c
> index 06d7c9e4dcda..a1a6445b5a7e 100644
> --- a/drivers/net/ethernet/ti/davinci_emac.c
> +++ b/drivers/net/ethernet/ti/davinci_emac.c
> @@ -1385,6 +1385,11 @@ static int emac_devioctl(struct net_device *ndev, struct ifreq *ifrq, int cmd)
>  		return -EOPNOTSUPP;
>  }
>  
> +static int match_first_device(struct device *dev, void *data)
> +{
> +	return !strncmp(dev_name(dev), "davinci_mdio", 12);
> +}
> +
>  /**
>   * emac_dev_open - EMAC device open
>   * @ndev: The DaVinci EMAC network adapter
> @@ -1484,8 +1489,14 @@ static int emac_dev_open(struct net_device *ndev)
>  
>  	/* use the first phy on the bus if pdata did not give us a phy id */
>  	if (!phydev && !priv->phy_id) {
> -		phy = bus_find_device_by_name(&mdio_bus_type, NULL,
> -					      "davinci_mdio");
> +		/* NOTE: we can't use bus_find_device_by_name() here because
> +		 * the device name is not guaranteed to be 'davinci_mdio'. On
> +		 * some systems it can be 'davinci_mdio.0' so we need to use
> +		 * strncmp() against the first part of the string to correctly
> +		 * match it.
> +		 */
> +		phy = bus_find_device(&mdio_bus_type, NULL, NULL,
> +				      match_first_device);
>  		if (phy) {
>  			priv->phy_id = dev_name(phy);
>  			if (!priv->phy_id || !*priv->phy_id)
> -- 
> 2.17.1
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox