Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Jakub Kicinski @ 2018-06-22 21:49 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Okash Khawaja, Daniel Borkmann, Alexei Starovoitov, Yonghong Song,
	Quentin Monnet, David S. Miller, netdev, kernel-team,
	linux-kernel
In-Reply-To: <20180622142743.2b890d0f@cakuba.netronome.com>

On Fri, 22 Jun 2018 14:27:43 -0700, Jakub Kicinski wrote:
> BTF in JSON is very useful, and will help people who writes simple
> orchestration/scripts based on bpftool *a* *lot*.  I really appreciate
> this addition to bpftool and will start using it myself as soon as it
> lands.  I'm not sure why the reluctance to slightly change the output
> format?

Ohh, maybe that's the misunderstanding, you only implemented JSON so
you wanted it to be as readable and clean as possible.  Hence the hex
output and cutting out the old cruft!  That perspective makes sense!
But I think we should keep JSON for machines (but including BTF
formatted values) and let's make the plain text output nice and clean,
agreed.

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-22 21:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Cornelia Huck, Samudrala, Sridhar, Alexander Duyck, virtio-dev,
	aaron.f.brown, Jiri Pirko, Jakub Kicinski, Netdev, qemu-devel,
	virtualization, konrad.wilk, boris.ostrovsky, Joao Martins,
	Venu Busireddy, vijay.balakrishna
In-Reply-To: <20180623002628-mutt-send-email-mst@kernel.org>

On Fri, Jun 22, 2018 at 2:29 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Jun 22, 2018 at 01:03:04PM -0700, Siwei Liu wrote:
>> On Fri, Jun 22, 2018 at 1:00 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>> > On Thu, Jun 21, 2018 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> On Thu, Jun 21, 2018 at 06:21:55PM -0700, Siwei Liu wrote:
>> >>> On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck <cohuck@redhat.com> wrote:
>> >>> > On Wed, 20 Jun 2018 22:48:58 +0300
>> >>> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >>> >
>> >>> >> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
>> >>> >> > In any case, I'm not sure anymore why we'd want the extra uuid.
>> >>> >>
>> >>> >> It's mostly so we can have e.g. multiple devices with same MAC
>> >>> >> (which some people seem to want in order to then use
>> >>> >> then with different containers).
>> >>> >>
>> >>> >> But it is also handy for when you assign a PF, since then you
>> >>> >> can't set the MAC.
>> >>> >>
>> >>> >
>> >>> > OK, so what about the following:
>> >>> >
>> >>> > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>> >>> >   that we have a new uuid field in the virtio-net config space
>> >>> > - in QEMU, add a property for virtio-net that allows to specify a uuid,
>> >>> >   offer VIRTIO_NET_F_STANDBY_UUID if set
>> >>> > - when configuring, set the property to the group UUID of the vfio-pci
>> >>> >   device
>> >>>
>> >>> If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
>> >>> to still expose UUID in the config space on virtio-pci?
>> >>
>> >>
>> >> Yes but guest is not supposed to read it.
>> >>
>> >>> I'm not even sure if it's sane to expose group UUID on the PCI bridge
>> >>> where the corresponding vfio-pci device attached to for a guest which
>> >>> doesn't support the feature (legacy).
>> >>>
>> >>> -Siwei
>> >>
>> >> Yes but you won't add the primary behind such a bridge.
>> >
>> > I assume the UUID feature is a new one besides the exiting
>> > VIRTIO_NET_F_STANDBY feature, where I think the current proposal is,
>> > if UUID feature is present and supported by guest, we'll do pairing
>> > based on UUID; if UUID feature present
>>                                  ^^^^^^^  is NOT present
>
> Let's go back for a bit.
>
> What is wrong with matching by mac?
>
> 1. Does not allow multiple NICs with same mac
> 2. Requires MAC to be programmed by host in the PT device
>    (which is often possible with VFs but isn't possible with all PT
>    devices)

Might not neccessarily be something wrong, but it's very limited to
prohibit the MAC of VF from changing when enslaved by failover. The
same as you indicated on the PT device. I suspect the same is
prohibited on even virtio-net and failover is because of the same
limitation.

>
> Both issues are up to host: if host needs them it needs the UUID
> feature, if not MAC matching is sufficient.

I know. What I'm saying is that we rely on STANDBY feature to control
the visibility of the primary, not the UUID feature.

-Siwei

>
>
>> > or not supported by guest,
>> > we'll still plug in the VF (if guest supports the _F_STANDBY feature)
>> > but the pairing will be fallback to using MAC address. Are you saying
>> > you don't even want to plug in the primary when the UUID feature is
>> > not supported by guest? Does the feature negotiation UUID have to
>> > depend on STANDBY being set, or the other way around? I thought that
>> > just the absence of STANDBY will prevent primary to get exposed to the
>> > guest.
>> >
>> > -Siwei
>> >
>> >>
>> >>>
>> >>> > - in the guest, use the uuid from the virtio-net device's config space
>> >>> >   if applicable; else, fall back to matching by MAC as done today
>

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-22 21:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Jakub Kicinski, Samudrala, Sridhar, Cornelia Huck, qemu-devel,
	virtualization, Venu Busireddy, Netdev, boris.ostrovsky,
	aaron.f.brown, Joao Martins
In-Reply-To: <20180623003022-mutt-send-email-mst@kernel.org>

On Fri, Jun 22, 2018 at 2:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Jun 22, 2018 at 01:21:45PM -0700, Siwei Liu wrote:
>> On Fri, Jun 22, 2018 at 12:05 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:
>> >> On Thu, 21 Jun 2018 21:20:13 +0300
>> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >>
>> >> > On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:
>> >> > > OK, so what about the following:
>> >> > >
>> >> > > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>> >> > >   that we have a new uuid field in the virtio-net config space
>> >> > > - in QEMU, add a property for virtio-net that allows to specify a uuid,
>> >> > >   offer VIRTIO_NET_F_STANDBY_UUID if set
>> >> > > - when configuring, set the property to the group UUID of the vfio-pci
>> >> > >   device
>> >> > > - in the guest, use the uuid from the virtio-net device's config space
>> >> > >   if applicable; else, fall back to matching by MAC as done today
>> >> > >
>> >> > > That should work for all virtio transports.
>> >> >
>> >> > True. I'm a bit unhappy that it's virtio net specific though
>> >> > since down the road I expect we'll have a very similar feature
>> >> > for scsi (and maybe others).
>> >> >
>> >> > But we do not have a way to have fields that are portable
>> >> > both across devices and transports, and I think it would
>> >> > be a useful addition. How would this work though? Any idea?
>> >>
>> >> Can we introduce some kind of device-independent config space area?
>> >> Pushing back the device-specific config space by a certain value if the
>> >> appropriate feature is negotiated and use that for things like the uuid?
>> >
>> > So config moves back and forth?
>> > Reminds me of the msi vector mess we had with pci.
>> > I'd rather have every transport add a new config.
>> >
>> >> But regardless of that, I'm not sure whether extending this approach to
>> >> other device types is the way to go. Tying together two different
>> >> devices is creating complicated situations at least in the hypervisor
>> >> (even if it's fairly straightforward in the guest). [I have not come
>> >> around again to look at the "how to handle visibility in QEMU"
>> >> questions due to lack of cycles, sorry about that.]
>> >>
>> >> So, what's the goal of this approach? Only to allow migration with
>> >> vfio-pci, or also to plug in a faster device and use it instead of an
>> >> already attached paravirtualized device?
>> >
>> > These are two sides of the same coin, I think the second approach
>> > is closer to what we are doing here.
>>
>> I'm leaning towards being conservative to keep the potential of doing
>> both. It's the vfio migration itself that has to be addessed, not to
>> make every virtio device to have an accelerated datapath, right?
>>
>> -Siwei
>
> I think that the approach we took (signal configuration
> through standby) assumes standby always present and
> primary appearing and disappearing.

I can imagine that's still true even for 1-netdev model. As long as we
can hide the lower devices.

That's what I said "to keep the potential to support both" models. But
we should not go further along to assume the other aspect ie. to get
PV accelerated whenever possible, but not to get VF migrated whenever
possible. That's essetially a big diveregence of the priority for the
goal we'd want to pursue.

-Siwei

>
> Anything else isn't well supported and I'm not sure we
> should complicate code too much to support it.
>
>>
>> >
>> >> What about migration of vfio devices that are not easily replaced by a
>> >> paravirtualized device? I'm thinking of vfio-ccw, where our main (and
>> >> currently only) supported device is dasd (disks) -- which can do a lot
>> >> of specialized things that virtio-blk does not support (and should not
>> >> or even cannot support).
>> >
>> > But maybe virtio-scsi can?
>> >
>> >> Would it be more helpful to focus on generic
>> >> migration support for vfio instead of going about it device by device?
>> >>
>> >> This network device approach already seems far along, so it makes sense
>> >> to continue with it. But I'm not sure whether we want to spend time and
>> >> energy on that for other device types rather than working on a general
>> >> solution for vfio migration.
>> >
>> > I'm inclined to say finalizing this feature would be a good first step
>> > and will teach us how we can move forward.
>> >
>> > --
>> > MST

^ permalink raw reply

* napi: SCHED flag set/clear
From: Krishna Chaitanya @ 2018-06-22 22:22 UTC (permalink / raw)
  To: netdev

Hi,

While debugging a NAPI related issue (calling napi_disable() twice
leading to deadlock),
I have come across below query. The issue is not documented anywhere
(except for Ben grear's  ath10k patch)

Below are my expectations:

napi_disable(): waits for SCHED (ignoring other flags)
napi_enable(): set SCHED flag

but in reality,

napi_disable: test and set SCHED flag in a while() loop?
napi_enable: clear SCHED flag

This is a bit counter-intuitive, SCHED generally means pollable (enabled).

Am i missing something?

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-22 22:25 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Cornelia Huck, Samudrala, Sridhar, Alexander Duyck, virtio-dev,
	aaron.f.brown, Jiri Pirko, Jakub Kicinski, Netdev, qemu-devel,
	virtualization, konrad.wilk, boris.ostrovsky, Joao Martins,
	Venu Busireddy, vijay.balakrishna
In-Reply-To: <CADGSJ20td7b6He6S1PBSJjCq=bNvgqMGbOjiho2eeQm2pgpK3g@mail.gmail.com>

On Fri, Jun 22, 2018 at 02:51:11PM -0700, Siwei Liu wrote:
> On Fri, Jun 22, 2018 at 2:29 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Fri, Jun 22, 2018 at 01:03:04PM -0700, Siwei Liu wrote:
> >> On Fri, Jun 22, 2018 at 1:00 PM, Siwei Liu <loseweigh@gmail.com> wrote:
> >> > On Thu, Jun 21, 2018 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> >> On Thu, Jun 21, 2018 at 06:21:55PM -0700, Siwei Liu wrote:
> >> >>> On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> >> >>> > On Wed, 20 Jun 2018 22:48:58 +0300
> >> >>> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >> >>> >
> >> >>> >> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
> >> >>> >> > In any case, I'm not sure anymore why we'd want the extra uuid.
> >> >>> >>
> >> >>> >> It's mostly so we can have e.g. multiple devices with same MAC
> >> >>> >> (which some people seem to want in order to then use
> >> >>> >> then with different containers).
> >> >>> >>
> >> >>> >> But it is also handy for when you assign a PF, since then you
> >> >>> >> can't set the MAC.
> >> >>> >>
> >> >>> >
> >> >>> > OK, so what about the following:
> >> >>> >
> >> >>> > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
> >> >>> >   that we have a new uuid field in the virtio-net config space
> >> >>> > - in QEMU, add a property for virtio-net that allows to specify a uuid,
> >> >>> >   offer VIRTIO_NET_F_STANDBY_UUID if set
> >> >>> > - when configuring, set the property to the group UUID of the vfio-pci
> >> >>> >   device
> >> >>>
> >> >>> If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
> >> >>> to still expose UUID in the config space on virtio-pci?
> >> >>
> >> >>
> >> >> Yes but guest is not supposed to read it.
> >> >>
> >> >>> I'm not even sure if it's sane to expose group UUID on the PCI bridge
> >> >>> where the corresponding vfio-pci device attached to for a guest which
> >> >>> doesn't support the feature (legacy).
> >> >>>
> >> >>> -Siwei
> >> >>
> >> >> Yes but you won't add the primary behind such a bridge.
> >> >
> >> > I assume the UUID feature is a new one besides the exiting
> >> > VIRTIO_NET_F_STANDBY feature, where I think the current proposal is,
> >> > if UUID feature is present and supported by guest, we'll do pairing
> >> > based on UUID; if UUID feature present
> >>                                  ^^^^^^^  is NOT present
> >
> > Let's go back for a bit.
> >
> > What is wrong with matching by mac?
> >
> > 1. Does not allow multiple NICs with same mac
> > 2. Requires MAC to be programmed by host in the PT device
> >    (which is often possible with VFs but isn't possible with all PT
> >    devices)
> 
> Might not neccessarily be something wrong, but it's very limited to
> prohibit the MAC of VF from changing when enslaved by failover.

You mean guest changing MAC? I'm not sure why we prohibit that.


> The
> same as you indicated on the PT device. I suspect the same is
> prohibited on even virtio-net and failover is because of the same
> limitation.
> 
> >
> > Both issues are up to host: if host needs them it needs the UUID
> > feature, if not MAC matching is sufficient.
> 
> I know. What I'm saying is that we rely on STANDBY feature to control
> the visibility of the primary, not the UUID feature.
> 
> -Siwei

And what I'm saying is that it's up to a host policy.

> >
> >
> >> > or not supported by guest,
> >> > we'll still plug in the VF (if guest supports the _F_STANDBY feature)
> >> > but the pairing will be fallback to using MAC address. Are you saying
> >> > you don't even want to plug in the primary when the UUID feature is
> >> > not supported by guest? Does the feature negotiation UUID have to
> >> > depend on STANDBY being set, or the other way around? I thought that
> >> > just the absence of STANDBY will prevent primary to get exposed to the
> >> > guest.
> >> >
> >> > -Siwei
> >> >
> >> >>
> >> >>>
> >> >>> > - in the guest, use the uuid from the virtio-net device's config space
> >> >>> >   if applicable; else, fall back to matching by MAC as done today
> >

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-22 22:33 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Jakub Kicinski, Samudrala, Sridhar, Cornelia Huck, qemu-devel,
	virtualization, Venu Busireddy, Netdev, boris.ostrovsky,
	aaron.f.brown, Joao Martins
In-Reply-To: <CADGSJ22oFhP955cuTmMWeSt0UOsG5K1A--c34QUSMx5z3Up2SA@mail.gmail.com>

On Fri, Jun 22, 2018 at 02:57:39PM -0700, Siwei Liu wrote:
> On Fri, Jun 22, 2018 at 2:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Fri, Jun 22, 2018 at 01:21:45PM -0700, Siwei Liu wrote:
> >> On Fri, Jun 22, 2018 at 12:05 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:
> >> >> On Thu, 21 Jun 2018 21:20:13 +0300
> >> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >> >>
> >> >> > On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:
> >> >> > > OK, so what about the following:
> >> >> > >
> >> >> > > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
> >> >> > >   that we have a new uuid field in the virtio-net config space
> >> >> > > - in QEMU, add a property for virtio-net that allows to specify a uuid,
> >> >> > >   offer VIRTIO_NET_F_STANDBY_UUID if set
> >> >> > > - when configuring, set the property to the group UUID of the vfio-pci
> >> >> > >   device
> >> >> > > - in the guest, use the uuid from the virtio-net device's config space
> >> >> > >   if applicable; else, fall back to matching by MAC as done today
> >> >> > >
> >> >> > > That should work for all virtio transports.
> >> >> >
> >> >> > True. I'm a bit unhappy that it's virtio net specific though
> >> >> > since down the road I expect we'll have a very similar feature
> >> >> > for scsi (and maybe others).
> >> >> >
> >> >> > But we do not have a way to have fields that are portable
> >> >> > both across devices and transports, and I think it would
> >> >> > be a useful addition. How would this work though? Any idea?
> >> >>
> >> >> Can we introduce some kind of device-independent config space area?
> >> >> Pushing back the device-specific config space by a certain value if the
> >> >> appropriate feature is negotiated and use that for things like the uuid?
> >> >
> >> > So config moves back and forth?
> >> > Reminds me of the msi vector mess we had with pci.
> >> > I'd rather have every transport add a new config.
> >> >
> >> >> But regardless of that, I'm not sure whether extending this approach to
> >> >> other device types is the way to go. Tying together two different
> >> >> devices is creating complicated situations at least in the hypervisor
> >> >> (even if it's fairly straightforward in the guest). [I have not come
> >> >> around again to look at the "how to handle visibility in QEMU"
> >> >> questions due to lack of cycles, sorry about that.]
> >> >>
> >> >> So, what's the goal of this approach? Only to allow migration with
> >> >> vfio-pci, or also to plug in a faster device and use it instead of an
> >> >> already attached paravirtualized device?
> >> >
> >> > These are two sides of the same coin, I think the second approach
> >> > is closer to what we are doing here.
> >>
> >> I'm leaning towards being conservative to keep the potential of doing
> >> both. It's the vfio migration itself that has to be addessed, not to
> >> make every virtio device to have an accelerated datapath, right?
> >>
> >> -Siwei
> >
> > I think that the approach we took (signal configuration
> > through standby) assumes standby always present and
> > primary appearing and disappearing.
> 
> I can imagine that's still true even for 1-netdev model. As long as we
> can hide the lower devices.
> 
> That's what I said "to keep the potential to support both" models. But
> we should not go further along to assume the other aspect ie. to get
> PV accelerated whenever possible, but not to get VF migrated whenever
> possible. That's essetially a big diveregence of the priority for the
> goal we'd want to pursue.
> 
> -Siwei

I suspect the diveregence will be lost on most users though
simply because they don't even care about vfio. They just
want things to go fast.

Rephrasing requirements in terms of acceleration actually
made things implementable. So I'm not in a hurry to try
and go back to asking for a migrateable vfio - very
easy to get stuck there.

> >
> > Anything else isn't well supported and I'm not sure we
> > should complicate code too much to support it.
> >
> >>
> >> >
> >> >> What about migration of vfio devices that are not easily replaced by a
> >> >> paravirtualized device? I'm thinking of vfio-ccw, where our main (and
> >> >> currently only) supported device is dasd (disks) -- which can do a lot
> >> >> of specialized things that virtio-blk does not support (and should not
> >> >> or even cannot support).
> >> >
> >> > But maybe virtio-scsi can?
> >> >
> >> >> Would it be more helpful to focus on generic
> >> >> migration support for vfio instead of going about it device by device?
> >> >>
> >> >> This network device approach already seems far along, so it makes sense
> >> >> to continue with it. But I'm not sure whether we want to spend time and
> >> >> energy on that for other device types rather than working on a general
> >> >> solution for vfio migration.
> >> >
> >> > I'm inclined to say finalizing this feature would be a good first step
> >> > and will teach us how we can move forward.
> >> >
> >> > --
> >> > MST

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Okash Khawaja @ 2018-06-22 22:48 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Martin KaFai Lau, Daniel Borkmann, Alexei Starovoitov,
	Yonghong Song, Quentin Monnet, David S. Miller, netdev,
	kernel-team, linux-kernel
In-Reply-To: <20180622142743.2b890d0f@cakuba.netronome.com>

On Fri, Jun 22, 2018 at 02:27:43PM -0700, Jakub Kicinski wrote:

[...]

> > > should just use the BTF, I'm not sure we should format indexes for
> > > arrays nicely or not :S
> > >   
> > > > The key could also be a struct (e.g. a struct to describe ip:port).
> > > > Can you suggest how the "key_struct" will look like?  
> > > 
> > > Hm.  I think in my mind it has only been a struct but that's not true :S
> > > So the struct in the name makes very limited sense now.
> > > 
> > > Should we do:
> > > "formatted" : {
> > > 	"value" : XXX
> > > }
> > > 
> > > Where
> > > XXX == plain int for integers, e.g. "value":0
> > > XXX == array for arrays, e.g. "value":[1,2,3,4]
> > > XXX == object for objects, e.g. "value":{"field":XXX, "field2":XXX}  
> > It is exactly how this patch is using json's {}, [] and int. ;)
> 
> Great, then just wrap that in a "formatted" object instead of
> redefining fields and we're good?
Please don't let two formats of same data in a single ouput. Overloading
same output with multple formats suggests a confused design, IMO. It
will confuse users too.

> 
> > but other than that, it does not have to be json.
> > In the next spin, lets stop calling this output json to avoid wrong
> > user's expection and I also don't want the future readability
> > improvements to be limited by that.  Lets call it something
> > like plain text output with BTF.
> 
> I don't understand.  We are discussing JSON output here.  The example we
> are commenting on is output of:
> 
> $ sudo bpftool map dump -p id 14
> 
> That -p means JSON.  Nobody objects to plain test output changes.  I
> actually didn't realize you haven't implemented plain text in this
> series, we should have both.
> 
> > How about:
> > When "bpftool map dump id 1" is used, it will print the BTF plaintext output
> > if a map has BTF available.  If not, it will print the existing
> > plaintext output.  That should solve the concern about the user may not
> > know BTF is available.
> > 
> > This ascii output is for human.  The script should not parse the ascii output
> > because it is silly not to use the binary ABI (like what this patch is using)
> > which does not suffer backward compat issue.
> 
> What binary ABI?
Reading binary data and linking it with BTF information.

> I'm also not 100% sure what this patch is doing as it
> adds bunch of code in new files that never gets called:
Please take a look at the patch then :) Code in these files does get
called.

We seem to be conflating two different - and conflicting - intents here.
First is progam-readability of the output and second is human
readability of the output. I believe both are important. Let's leave
existing output for programs to read. That way we can try to keep it
backward compatible as well as JSON compatible. Let's dedicate the new
BTF format for humans to read. That way, we don't have to worry about
backward compatibility or JSON compatibility.

Let me know if this is not clear. If we agree on above then I think we
can move towards a solution.

Thanks,
Okash

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Martin KaFai Lau @ 2018-06-22 22:54 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Okash Khawaja, Daniel Borkmann, Alexei Starovoitov, Yonghong Song,
	Quentin Monnet, David S. Miller, netdev, kernel-team,
	linux-kernel
In-Reply-To: <20180622142743.2b890d0f@cakuba.netronome.com>

On Fri, Jun 22, 2018 at 02:27:43PM -0700, Jakub Kicinski wrote:
> On Fri, 22 Jun 2018 13:58:52 -0700, Martin KaFai Lau wrote:
> > On Fri, Jun 22, 2018 at 11:40:32AM -0700, Jakub Kicinski wrote:
> > > On Thu, 21 Jun 2018 18:20:52 -0700, Martin KaFai Lau wrote:  
> > > > On Thu, Jun 21, 2018 at 05:25:23PM -0700, Jakub Kicinski wrote:  
> > > > > On Thu, 21 Jun 2018 16:58:15 -0700, Martin KaFai Lau wrote:    
> > > > > > On Thu, Jun 21, 2018 at 04:07:19PM -0700, Jakub Kicinski wrote:    
> > > > > > > On Thu, 21 Jun 2018 15:51:17 -0700, Martin KaFai Lau wrote:      
> > > > > > > > On Thu, Jun 21, 2018 at 02:59:35PM -0700, Jakub Kicinski wrote:      
> > > > > > > > > On Wed, 20 Jun 2018 13:30:53 -0700, Okash Khawaja wrote:        
> > > > > > > > > > $ sudo bpftool map dump -p id 14
> > > > > > > > > > [{
> > > > > > > > > >         "key": 0
> > > > > > > > > >     },{
> > > > > > > > > >         "value": {
> > > > > > > > > >             "m": 1,
> > > > > > > > > >             "n": 2,
> > > > > > > > > >             "o": "c",
> > > > > > > > > >             "p": [15,16,17,18,15,16,17,18
> > > > > > > > > >             ],
> > > > > > > > > >             "q": [[25,26,27,28,25,26,27,28
> > > > > > > > > >                 ],[35,36,37,38,35,36,37,38
> > > > > > > > > >                 ],[45,46,47,48,45,46,47,48
> > > > > > > > > >                 ],[55,56,57,58,55,56,57,58
> > > > > > > > > >                 ]
> > > > > > > > > >             ],
> > > > > > > > > >             "r": 1,
> > > > > > > > > >             "s": 0x7ffff6f70568,
> > > > > > > > > >             "t": {
> > > > > > > > > >                 "x": 5,
> > > > > > > > > >                 "y": 10
> > > > > > > > > >             },
> > > > > > > > > >             "u": 100,
> > > > > > > > > >             "v": 20,
> > > > > > > > > >             "w1": 0x7,
> > > > > > > > > >             "w2": 0x3
> > > > > > > > > >         }
> > > > > > > > > >     }
> > > > > > > > > > ]        
> > > > > > > > > 
> > > > > > > > > I don't think this format is okay, JSON output is an API you shouldn't
> > > > > > > > > break.  You can change the non-JSON output whatever way you like, but
> > > > > > > > > JSON must remain backwards compatible.
> > > > > > > > > 
> > > > > > > > > The dump today has object per entry, e.g.:
> > > > > > > > > 
> > > > > > > > > {
> > > > > > > > >         "key":["0x00","0x00","0x00","0x00",
> > > > > > > > >         ],
> > > > > > > > >         "value": ["0x02","0x00","0x00","0x00","0x00","0x00","0x00","0x00"
> > > > > > > > >         ]
> > > > > > > > > }
> > > > > > > > > 
> > > > > > > > > This format must remain, you may only augment it with new fields.  E.g.:
> > > > > > > > > 
> > > > > > > > > {
> > > > > > > > >         "key":["0x00","0x00","0x00","0x00",
> > > > > > > > >         ],
> > > > > > > > > 	"key_struct":{
> > > > > > > > > 		"index":0
> > > > > > > > > 	},    
> > > > Got a few questions.
> > > > 
> > > > When we support hashtab later, the key could be int
> > > > but reusing the name as "index" is weird.  
> > > 
> > > Ugh, yes, naturally.  I just typed that out without thinking, so for
> > > array maps there is usually no BTF info?...  For hashes obviously we  
> > The key of array map also has BTF info which must be an int.
> 
> Perfect.
> 
> > > should just use the BTF, I'm not sure we should format indexes for
> > > arrays nicely or not :S
> > >   
> > > > The key could also be a struct (e.g. a struct to describe ip:port).
> > > > Can you suggest how the "key_struct" will look like?  
> > > 
> > > Hm.  I think in my mind it has only been a struct but that's not true :S
> > > So the struct in the name makes very limited sense now.
> > > 
> > > Should we do:
> > > "formatted" : {
> > > 	"value" : XXX
> > > }
> > > 
> > > Where
> > > XXX == plain int for integers, e.g. "value":0
> > > XXX == array for arrays, e.g. "value":[1,2,3,4]
> > > XXX == object for objects, e.g. "value":{"field":XXX, "field2":XXX}  
> > It is exactly how this patch is using json's {}, [] and int. ;)
> 
> Great, then just wrap that in a "formatted" object instead of
> redefining fields and we're good?
> 
> > but other than that, it does not have to be json.
> > In the next spin, lets stop calling this output json to avoid wrong
> > user's expection and I also don't want the future readability
> > improvements to be limited by that.  Lets call it something
> > like plain text output with BTF.
> 
> I don't understand.  We are discussing JSON output here.  The example we
> are commenting on is output of:
> 
> $ sudo bpftool map dump -p id 14
> 
> That -p means JSON.  Nobody objects to plain test output changes.  I
> actually didn't realize you haven't implemented plain text in this
> series, we should have both.
> 
> > How about:
> > When "bpftool map dump id 1" is used, it will print the BTF plaintext output
> > if a map has BTF available.  If not, it will print the existing
> > plaintext output.  That should solve the concern about the user may not
> > know BTF is available.
> > 
> > This ascii output is for human.  The script should not parse the ascii output
> > because it is silly not to use the binary ABI (like what this patch is using)
> > which does not suffer backward compat issue.
> 
> What binary ABI?  I'm also not 100% sure what this patch is doing as it
> adds bunch of code in new files that never gets called:
I meant the BTF format, the new kernel API to get BTF and how it is used
on map data are stable.

Yes, there is new codes to get and consume the new BTF format but so does
any new kernel API.  and it should not drive everybody to parse ascii.

> 
>  tools/bpf/bpftool/btf_dumper.c |  247 +++++++++++++++++++++++++++++++++++++++++
>  tools/bpf/bpftool/btf_dumper.h |   18 ++
>  2 files changed, 265 insertions(+)
> 
> > The existing "-j" can be used to dump the map's binary data
> > for remote debugging purpose.  The BTF is in one of the elf section of
> > the bpf_prog.o.
> 
> > > > > > > > >         "value": ["0x02","0x00","0x00","0x00","0x00","0x00","0x00","0x00"
> > > > > > > > >         ],
> > > > > > > > > 	"value_struct":{
> > > > > > > > > 		"src_ip":2,    
> > > > If for the same map the user changes the "src_ip" to an array of int[4]
> > > > later (e.g. to support ipv6), it will become "src_ip": [1, 2, 3, 4].
> > > > Is it breaking backward compat?
> > > > i.e.
> > > > struct five_tuples {
> > > > -	int src_ip;
> > > > +	int src_ip[4];
> > > > /* ... */
> > > > };  
> > > 
> > > Well, it is breaking backward compat, but it's the program doing it,
> > > not bpftool :)  BTF changes so does the output.  
> > As we see, the key/value's btf-output is inherently not backward compat.
> > Hence, "-j" and "-p" will stay as is.  The whole existing json will
> > be backward compat instead of only partly backward compat.
> 
> No.  There is a difference between user of a facility changing their
> input and kernel/libraries providing different output in response to
> that, and the libraries suddenly changing the output on their own.
> 
> Your example is like saying if user started using IPv6 addresses
> instead of IPv4 the netlink attributes in dumps will be different so
> kernel didn't keep backwards compat.  While what you're doing is more
> equivalent to dropping support for old ioctl interfaces because there
> is a better mechanism now.
Sorry, I don't follow this.  I don't see netlink suffer json issue like
the one on "key" and "value".

All I can grasp is, the json should normally be backward compat but now
we are saying anything added by btf-output is an exception because
the script parsing it will treat it differently than "key" and "value"

> 
> BTF in JSON is very useful, and will help people who writes simple
> orchestration/scripts based on bpftool *a* *lot*.  I really appreciate
Can you share what the script will do?  I want to understand why
it cannot directly use the BTF format and the map data.

> this addition to bpftool and will start using it myself as soon as it
> lands.  I'm not sure why the reluctance to slightly change the output
> format?
The initial change argument is because the json has to be backward compat.

Then we show that btf-output is inherently not backward compat, so
printing it in json does not make sense at all.

However, now it is saying part of it does not have to be backward compat.

I am fine putting it under "formatted" for "-j" or "-p" if that is the
case, other than the double output is still confusing.  Lets wait for
Okash's input.

At the same time, the same output will be used as the default plaintext
output when BTF is available.  Then the plaintext BTF output
will not be limited by the json restrictions when we want
to improve human readability later.  Apparently, the
improvements on plaintext will not be always applicable
to json output.

^ permalink raw reply

* arm64 regression bisected to "bpf: reject any prog that failed read-only lock"
From: dann frazier @ 2018-06-22 23:09 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: linux-kernel, netdev, Martin KaFai Lau, linux-arm-kernel,
	Alexei Starovoitov

fyi, I'm seeing a regression that I've bisected down to:
  9facc336876f bpf: reject any prog that failed read-only lock

A snippet of the boot messages are below. I'll be out of pocket until
Monday, so apologies in advance for delays in responding to follow-ups.

[   28.544142] Internal error: SP/PC alignment exception: 8a000000 [#21] SMP
[   28.550918] Modules linked in: btrfs zstd_compress hid_generic raid10 raid456 usbhid async_raid6_recov async_memcpy async_pq async_xor hid async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear marvell hibmc_drm aes_ce_blk ttm aes_ce_cipher drm_kms_helper crc32_ce crct10dif_ce syscopyarea ghash_ce qla2xxx sysfillrect sha2_ce sysimgblt fb_sys_fops sha256_arm64 sha1_ce hns3 mpt3sas nvme_fc hisi_sas_v3_hw nvme_fabrics drm ixgbe hisi_sas_main scsi_transport_fc hclge libsas ahci raid_class hnae3 libahci scsi_transport_sas mdio aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
[   28.602352] CPU: 19 PID: 1423 Comm: systemd-udevd Tainted: G      D           4.18.0-rc1+ #19
[   28.610861] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 EC UEFI Nemo 2.0 RC0 - B308 (V0.38) 06/02/2018
[   28.620674] pstate: 40400009 (nZcv daif +PAN -UAO)
[   28.625451] pc : 0xffff00000327f857
[   28.628928] lr : sk_filter_trim_cap+0x8c/0x1b8
[   28.633358] sp : ffff00002b1bba80
[   28.636659] x29: ffff00002b1bba80 x28: 0000000000000001 
[   28.641958] x27: ffff803f71e25800 x26: ffff803f6b16d000 
[   28.647257] x25: 0000000000000000 x24: 0000000000000000 
[   28.652555] x23: 0000000000000001 x22: ffff00002ab06000 
[   28.657854] x21: 0000000000000000 x20: ffff0000095c8000 
[   28.663152] x19: ffff803f73c7a700 x18: 0000fffff53ce450 
[   28.668451] x17: 0000ffffb96575f8 x16: ffff0000089c1c38 
[   28.673750] x15: 0000000000000020 x14: 5953425553003331 
[   28.679049] x13: 3a303a332d796870 x12: 2f7968705f736173 
[   28.684347] x11: 2f33313a303a332d x10: 00000000000000db 
[   28.689646] x9 : ffff00002b1bbd50 x8 : ffff803f73c7a700 
[   28.694945] x7 : ffff803f73c7be00 x6 : 0000000000001900 
[   28.700243] x5 : 0000000000000183 x4 : ffff803f80ccf9d0 
[   28.705542] x3 : 0000000000000000 x2 : ffff00000327f857 
[   28.710840] x1 : ffff00002ab06038 x0 : ffff803f73c7a700 
[   28.716140] Process systemd-udevd (pid: 1423, stack limit = 0x(____ptrval____))
[   28.723434] Call trace:
[   28.725867]  0xffff00000327f857
[   28.728995]  netlink_broadcast_filtered+0x234/0x418
[   28.733860]  netlink_sendmsg+0x354/0x360
[   28.737769]  sock_sendmsg+0x4c/0x68
[   28.741244]  ___sys_sendmsg+0x2ac/0x2f0
[   28.745067]  __sys_sendmsg+0x7c/0xd0
[   28.748629]  sys_sendmsg+0x38/0x48
[   28.752017]  el0_svc_naked+0x30/0x34
[   28.755580] Code: 202000d4 202000d4 202000d4 202000d4 (a9bf7bfd) 
[   28.761660] ---[ end trace 4451ed1a5b395c65 ]---
[   28.766267] Internal error: SP/PC alignment exception: 8a000000 [#22] SMP
[   28.773044] Modules linked in: btrfs zstd_compress hid_generic raid10 raid456 usbhid async_raid6_recov async_memcpy async_pq async_xor hid async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear marvell hibmc_drm aes_ce_blk ttm aes_ce_cipher drm_kms_helper crc32_ce crct10dif_ce syscopyarea ghash_ce qla2xxx sysfillrect sha2_ce sysimgblt fb_sys_fops sha256_arm64 sha1_ce hns3 mpt3sas nvme_fc hisi_sas_v3_hw nvme_fabrics drm ixgbe hisi_sas_main scsi_transport_fc hclge libsas ahci raid_class hnae3 libahci scsi_transport_sas mdio aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
[   28.824485] CPU: 20 PID: 1424 Comm: systemd-udevd Tainted: G      D           4.18.0-rc1+ #19
[   28.832994] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 EC UEFI Nemo 2.0 RC0 - B308 (V0.38) 06/02/2018
[   28.842805] pstate: 40400009 (nZcv daif +PAN -UAO)
[   28.847583] pc : 0xffff00000327f857
[   28.851060] lr : sk_filter_trim_cap+0x8c/0x1b8
[   28.855490] sp : ffff00002b1c3a80
[   28.858791] x29: ffff00002b1c3a80 x28: 0000000000000001 
[   28.864090] x27: ffff803f71e25800 x26: ffff803f6b169000 
[   28.869388] x25: 0000000000000000 x24: 0000000000000000 
[   28.874687] x23: 0000000000000001 x22: ffff00002ab06000 
[   28.879985] x21: 0000000000000000 x20: ffff0000095c8000 
[   28.885284] x19: ffff803f742c0900 x18: 0000fffff53ce450 
[   28.890583] x17: 0000ffffb96575f8 x16: ffff0000089c1c38 
[   28.895881] x15: 0000000000000020 x14: 5953425553003431 
[   28.901180] x13: 3a303a332d796870 x12: 2f7968705f736173 
[   28.906478] x11: 2f34313a303a332d x10: 00000000000000db 
[   28.911777] x9 : ffff00002b1c3d50 x8 : ffff803f742c0900 
[   28.917075] x7 : ffff803f742c0c00 x6 : 0000000000000500 
[   28.922374] x5 : 0000000000000045 x4 : ffff803f80ce89d0 
[   28.927673] x3 : 0000000000000000 x2 : ffff00000327f857 
[   28.932971] x1 : ffff00002ab06038 x0 : ffff803f742c0900 
[   28.938271] Process systemd-udevd (pid: 1424, stack limit = 0x(____ptrval____))
[   28.945565] Call trace:
[   28.947998]  0xffff00000327f857
[   28.951127]  netlink_broadcast_filtered+0x234/0x418
[   28.955991]  netlink_sendmsg+0x354/0x360
[   28.959901]  sock_sendmsg+0x4c/0x68
[   28.963377]  ___sys_sendmsg+0x2ac/0x2f0
[   28.967199]  __sys_sendmsg+0x7c/0xd0
[   28.970761]  sys_sendmsg+0x38/0x48
[   28.974150]  el0_svc_naked+0x30/0x34
[   28.977713] Code: 202000d4 202000d4 202000d4 202000d4 (a9bf7bfd) 
[   28.983793] ---[ end trace 4451ed1a5b395c66 ]---
[   28.988399] Internal error: SP/PC alignment exception: 8a000000 [#23] SMP
[   28.995175] Modules linked in: btrfs zstd_compress hid_generic raid10 raid456 usbhid async_raid6_recov async_memcpy async_pq async_xor hid async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear marvell hibmc_drm aes_ce_blk ttm aes_ce_cipher drm_kms_helper crc32_ce crct10dif_ce syscopyarea ghash_ce qla2xxx sysfillrect sha2_ce sysimgblt fb_sys_fops sha256_arm64 sha1_ce hns3 mpt3sas nvme_fc hisi_sas_v3_hw nvme_fabrics drm ixgbe hisi_sas_main scsi_transport_fc hclge libsas ahci raid_class hnae3 libahci scsi_transport_sas mdio aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
[   29.046612] CPU: 22 PID: 1425 Comm: systemd-udevd Tainted: G      D           4.18.0-rc1+ #19
[   29.055121] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 EC UEFI Nemo 2.0 RC0 - B308 (V0.38) 06/02/2018
[   29.064932] pstate: 40400009 (nZcv daif +PAN -UAO)
[   29.069710] pc : 0xffff00000327f857
[   29.073187] lr : sk_filter_trim_cap+0x8c/0x1b8
[   29.077616] sp : ffff00002b1cba80
[   29.080917] x29: ffff00002b1cba80 x28: 0000000000000001 
[   29.086216] x27: ffff803f71e25800 x26: ffff803f6b16b000 
[   29.091515] x25: 0000000000000000 x24: 0000000000000000 
[   29.096813] x23: 0000000000000001 x22: ffff00002ab06000 
[   29.102112] x21: 0000000000000000 x20: ffff0000095c8000 
[   29.107411] x19: ffff803f74362c00 x18: 0000fffff53ce450 
[   29.112709] x17: 0000ffffb96575f8 x16: ffff0000089c1c38 
[   29.118008] x15: 0000000000000020 x14: 5953425553003531 
[   29.123306] x13: 3a303a332d796870 x12: 2f7968705f736173 
[   29.128605] x11: 2f35313a303a332d x10: 00000000000000db 
[   29.133904] x9 : ffff00002b1cbd50 x8 : ffff803f74362c00 
[   29.139202] x7 : ffff803f74363900 x6 : 0000000000001500 
[   29.144501] x5 : 00000000000000c2 x4 : ffff803f80d1a9d0 
[   29.149799] x3 : 0000000000000000 x2 : ffff00000327f857 
[   29.155098] x1 : ffff00002ab06038 x0 : ffff803f74362c00 
[   29.160397] Process systemd-udevd (pid: 1425, stack limit = 0x(____ptrval____))
[   29.167691] Call trace:
[   29.170125]  0xffff00000327f857
[   29.173254]  netlink_broadcast_filtered+0x234/0x418
[   29.178118]  netlink_sendmsg+0x354/0x360
[   29.182027]  sock_sendmsg+0x4c/0x68
[   29.185503]  ___sys_sendmsg+0x2ac/0x2f0
[   29.189325]  __sys_sendmsg+0x7c/0xd0
[   29.192888]  sys_sendmsg+0x38/0x48
[   29.196276]  el0_svc_naked+0x30/0x34
[   29.199839] Code: 202000d4 202000d4 202000d4 202000d4 (a9bf7bfd) 
[   29.205919] ---[ end trace 4451ed1a5b395c67 ]---

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Martin KaFai Lau @ 2018-06-22 23:19 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Okash Khawaja, Daniel Borkmann, Alexei Starovoitov, Yonghong Song,
	Quentin Monnet, David S. Miller, netdev, kernel-team,
	linux-kernel
In-Reply-To: <20180622144952.353d50c0@cakuba.netronome.com>

On Fri, Jun 22, 2018 at 02:49:52PM -0700, Jakub Kicinski wrote:
> On Fri, 22 Jun 2018 14:27:43 -0700, Jakub Kicinski wrote:
> > BTF in JSON is very useful, and will help people who writes simple
> > orchestration/scripts based on bpftool *a* *lot*.  I really appreciate
> > this addition to bpftool and will start using it myself as soon as it
> > lands.  I'm not sure why the reluctance to slightly change the output
> > format?
> 
> Ohh, maybe that's the misunderstanding, you only implemented JSON so
> you wanted it to be as readable and clean as possible.  Hence the hex
> output and cutting out the old cruft!  That perspective makes sense!
> But I think we should keep JSON for machines (but including BTF
> formatted values) and let's make the plain text output nice and clean,
> agreed.
Right, it is what my earlier comment meant on "this ascii output is
for human".  We merely call it json because we are reusing
the json's meaning on {}, [] and int since it fits nicely
on what we want to achieve, readability.  Other than that,
it does not have to follow other json's requirements.
We can call it whatever except json to avoid wrong
user expectation.  Putting it under "-j"/"-p" was a mistake.
Hence, I said this patch belongs to the 'plaintext" output.

If we really needed a similar output under "-j" or "-p", things
had to be changed and it belongs to a separate patch.

^ permalink raw reply

* [PATCH net-next] netns: get more entropy from net_hash_mix()
From: Eric Dumazet @ 2018-06-22 23:27 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet

struct net are effectively allocated from order-1 pages on x86,
with one object per slab, meaning that the 13 low order bits
of their addresses are zero.

Once shifted by L1_CACHE_SHIFT, this leaves 7 zero-bits,
meaning that net_hash_mix() does not help spreading
objects on various hash tables.

For example, TCP listen table has 32 buckets, meaning that
all netns use the same bucket for port 80 or port 443.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
---
 include/net/netns/hash.h | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/include/net/netns/hash.h b/include/net/netns/hash.h
index 24c78183a4c262e086c29cde1e61b7f397d8bab0..16a842456189f2fc1a3363685b5dd4310a32b2b8 100644
--- a/include/net/netns/hash.h
+++ b/include/net/netns/hash.h
@@ -9,12 +9,7 @@ struct net;
 static inline u32 net_hash_mix(const struct net *net)
 {
 #ifdef CONFIG_NET_NS
-	/*
-	 * shift this right to eliminate bits, that are
-	 * always zeroed
-	 */
-
-	return (u32)(((unsigned long)net) >> L1_CACHE_SHIFT);
+	return (u32)(((unsigned long)net) >> ilog2(sizeof(*net)));
 #else
 	return 0;
 #endif
-- 
2.18.0.rc2.346.g013aa6912e-goog

^ permalink raw reply related

* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Jakub Kicinski @ 2018-06-22 23:32 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Okash Khawaja, Daniel Borkmann, Alexei Starovoitov, Yonghong Song,
	Quentin Monnet, David S. Miller, netdev, kernel-team,
	linux-kernel
In-Reply-To: <20180622225408.jor7lpvsksnwiiec@kafai-mbp.dhcp.thefacebook.com>

On Fri, 22 Jun 2018 15:54:08 -0700, Martin KaFai Lau wrote:
> > > > > > > > > >         "value": ["0x02","0x00","0x00","0x00","0x00","0x00","0x00","0x00"
> > > > > > > > > >         ],
> > > > > > > > > > 	"value_struct":{
> > > > > > > > > > 		"src_ip":2,      
> > > > > If for the same map the user changes the "src_ip" to an array of int[4]
> > > > > later (e.g. to support ipv6), it will become "src_ip": [1, 2, 3, 4].
> > > > > Is it breaking backward compat?
> > > > > i.e.
> > > > > struct five_tuples {
> > > > > -	int src_ip;
> > > > > +	int src_ip[4];
> > > > > /* ... */
> > > > > };    
> > > > 
> > > > Well, it is breaking backward compat, but it's the program doing it,
> > > > not bpftool :)  BTF changes so does the output.    
> > > As we see, the key/value's btf-output is inherently not backward compat.
> > > Hence, "-j" and "-p" will stay as is.  The whole existing json will
> > > be backward compat instead of only partly backward compat.  
> > 
> > No.  There is a difference between user of a facility changing their
> > input and kernel/libraries providing different output in response to
> > that, and the libraries suddenly changing the output on their own.
> > 
> > Your example is like saying if user started using IPv6 addresses
> > instead of IPv4 the netlink attributes in dumps will be different so
> > kernel didn't keep backwards compat.  While what you're doing is more
> > equivalent to dropping support for old ioctl interfaces because there
> > is a better mechanism now.  
> Sorry, I don't follow this.  I don't see netlink suffer json issue like
> the one on "key" and "value".
> 
> All I can grasp is, the json should normally be backward compat but now
> we are saying anything added by btf-output is an exception because
> the script parsing it will treat it differently than "key" and "value"

Backward compatibility means that if I run *the same* program against
different kernels/libraries it continues to work.  If someone decides
to upgrade their program to work with IPv6 (which was your example)
obviously there is no way system as a whole will look 1:1 the same.

> > BTF in JSON is very useful, and will help people who writes simple
> > orchestration/scripts based on bpftool *a* *lot*.  I really appreciate  
> Can you share what the script will do?  I want to understand why
> it cannot directly use the BTF format and the map data.

Think about a python script which wants to read a counter in a map.
Right now it would have to get the BTF, find out which bytes are the
counter, then convert the bytes into a larger int.  With JSON BTF if
just does entry["formatted"]["value"]["counter"].

Real life example from my test code (conversion of 3 element counter
array):

def str2int(strtab):
    inttab = []
    for i in strtab:
        inttab.append(int(i, 16))
    ba = bytearray(inttab)
    if len(strtab) == 4:
        fmt = "I"
    elif len(strtab) == 8:
        fmt = "Q"
    else:
        raise Exception("String array of len %d can't be unpacked to an int" %
                        (len(strtab)))
    return struct.unpack(fmt, ba)[0]

def convert(elems, idx):
    val = []
    for i in range(3):
        part = elems[idx]["value"][i * length:(i + 1) * length]
        val.append(str2int(part))
    return val

With BTF it would be:

	elems[idx]["formatted"]["value"]

Which is fairly awesome.

> > this addition to bpftool and will start using it myself as soon as it
> > lands.  I'm not sure why the reluctance to slightly change the output
> > format?  
> The initial change argument is because the json has to be backward compat.
> 
> Then we show that btf-output is inherently not backward compat, so
> printing it in json does not make sense at all.
> 
> However, now it is saying part of it does not have to be backward compat.

Compatibility of "formatted" member is defined as -> fields broken out
according to BTF.  So it is backward compatible.  The definition of
"value" member is -> an array of unfortunately formatted array of
ugly hex strings :(

> I am fine putting it under "formatted" for "-j" or "-p" if that is the
> case, other than the double output is still confusing.  Lets wait for
> Okash's input.
>
> At the same time, the same output will be used as the default plaintext
> output when BTF is available.  Then the plaintext BTF output
> will not be limited by the json restrictions when we want
> to improve human readability later.  Apparently, the
> improvements on plaintext will not be always applicable
> to json output.

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-22 23:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Cornelia Huck, Samudrala, Sridhar, Alexander Duyck, virtio-dev,
	aaron.f.brown, Jiri Pirko, Jakub Kicinski, Netdev, qemu-devel,
	virtualization, konrad.wilk, boris.ostrovsky, Joao Martins,
	Venu Busireddy, vijay.balakrishna
In-Reply-To: <20180623012406-mutt-send-email-mst@kernel.org>

On Fri, Jun 22, 2018 at 3:25 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Jun 22, 2018 at 02:51:11PM -0700, Siwei Liu wrote:
>> On Fri, Jun 22, 2018 at 2:29 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Fri, Jun 22, 2018 at 01:03:04PM -0700, Siwei Liu wrote:
>> >> On Fri, Jun 22, 2018 at 1:00 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>> >> > On Thu, Jun 21, 2018 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> >> On Thu, Jun 21, 2018 at 06:21:55PM -0700, Siwei Liu wrote:
>> >> >>> On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck <cohuck@redhat.com> wrote:
>> >> >>> > On Wed, 20 Jun 2018 22:48:58 +0300
>> >> >>> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >> >>> >
>> >> >>> >> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
>> >> >>> >> > In any case, I'm not sure anymore why we'd want the extra uuid.
>> >> >>> >>
>> >> >>> >> It's mostly so we can have e.g. multiple devices with same MAC
>> >> >>> >> (which some people seem to want in order to then use
>> >> >>> >> then with different containers).
>> >> >>> >>
>> >> >>> >> But it is also handy for when you assign a PF, since then you
>> >> >>> >> can't set the MAC.
>> >> >>> >>
>> >> >>> >
>> >> >>> > OK, so what about the following:
>> >> >>> >
>> >> >>> > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>> >> >>> >   that we have a new uuid field in the virtio-net config space
>> >> >>> > - in QEMU, add a property for virtio-net that allows to specify a uuid,
>> >> >>> >   offer VIRTIO_NET_F_STANDBY_UUID if set
>> >> >>> > - when configuring, set the property to the group UUID of the vfio-pci
>> >> >>> >   device
>> >> >>>
>> >> >>> If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
>> >> >>> to still expose UUID in the config space on virtio-pci?
>> >> >>
>> >> >>
>> >> >> Yes but guest is not supposed to read it.
>> >> >>
>> >> >>> I'm not even sure if it's sane to expose group UUID on the PCI bridge
>> >> >>> where the corresponding vfio-pci device attached to for a guest which
>> >> >>> doesn't support the feature (legacy).
>> >> >>>
>> >> >>> -Siwei
>> >> >>
>> >> >> Yes but you won't add the primary behind such a bridge.
>> >> >
>> >> > I assume the UUID feature is a new one besides the exiting
>> >> > VIRTIO_NET_F_STANDBY feature, where I think the current proposal is,
>> >> > if UUID feature is present and supported by guest, we'll do pairing
>> >> > based on UUID; if UUID feature present
>> >>                                  ^^^^^^^  is NOT present
>> >
>> > Let's go back for a bit.
>> >
>> > What is wrong with matching by mac?
>> >
>> > 1. Does not allow multiple NICs with same mac
>> > 2. Requires MAC to be programmed by host in the PT device
>> >    (which is often possible with VFs but isn't possible with all PT
>> >    devices)
>>
>> Might not neccessarily be something wrong, but it's very limited to
>> prohibit the MAC of VF from changing when enslaved by failover.
>
> You mean guest changing MAC? I'm not sure why we prohibit that.

I think Sridhar and Jiri might be better person to answer it. My
impression was that sync'ing the MAC address change between all 3
devices is challenging, as the failover driver uses MAC address to
match net_device internally.

>
>
>> The
>> same as you indicated on the PT device. I suspect the same is
>> prohibited on even virtio-net and failover is because of the same
>> limitation.
>>
>> >
>> > Both issues are up to host: if host needs them it needs the UUID
>> > feature, if not MAC matching is sufficient.
>>
>> I know. What I'm saying is that we rely on STANDBY feature to control
>> the visibility of the primary, not the UUID feature.
>>
>> -Siwei
>
> And what I'm saying is that it's up to a host policy.

So lets say host policy explicitly requires UUID but the guest does
not support it. The guest could be a legacy guest with no STANDBY
support, or a relevately recent guest with STANDBY support but without
the UUID support. In either case, since host asked for UUID matching
explicitly, maybe because it requires different NIC using same MAC, so
it has to override the negotiation result of STANDBY, even if STANDBY
is set and supported? That will end up with inter-feature dependency
over STANDBY, as you see.

-Siwei

>
>> >
>> >
>> >> > or not supported by guest,
>> >> > we'll still plug in the VF (if guest supports the _F_STANDBY feature)
>> >> > but the pairing will be fallback to using MAC address. Are you saying
>> >> > you don't even want to plug in the primary when the UUID feature is
>> >> > not supported by guest? Does the feature negotiation UUID have to
>> >> > depend on STANDBY being set, or the other way around? I thought that
>> >> > just the absence of STANDBY will prevent primary to get exposed to the
>> >> > guest.
>> >> >
>> >> > -Siwei
>> >> >
>> >> >>
>> >> >>>
>> >> >>> > - in the guest, use the uuid from the virtio-net device's config space
>> >> >>> >   if applicable; else, fall back to matching by MAC as done today
>> >

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Jakub Kicinski @ 2018-06-22 23:40 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Okash Khawaja, Daniel Borkmann, Alexei Starovoitov, Yonghong Song,
	Quentin Monnet, David S. Miller, netdev, kernel-team,
	linux-kernel
In-Reply-To: <20180622231940.fvkxfqccvyf5uewk@kafai-mbp.dhcp.thefacebook.com>

On Fri, 22 Jun 2018 16:19:40 -0700, Martin KaFai Lau wrote:
> On Fri, Jun 22, 2018 at 02:49:52PM -0700, Jakub Kicinski wrote:
> > On Fri, 22 Jun 2018 14:27:43 -0700, Jakub Kicinski wrote:  
> > > BTF in JSON is very useful, and will help people who writes simple
> > > orchestration/scripts based on bpftool *a* *lot*.  I really appreciate
> > > this addition to bpftool and will start using it myself as soon as it
> > > lands.  I'm not sure why the reluctance to slightly change the output
> > > format?  
> > 
> > Ohh, maybe that's the misunderstanding, you only implemented JSON so
> > you wanted it to be as readable and clean as possible.  Hence the hex
> > output and cutting out the old cruft!  That perspective makes sense!
> > But I think we should keep JSON for machines (but including BTF
> > formatted values) and let's make the plain text output nice and clean,
> > agreed.  
> Right, it is what my earlier comment meant on "this ascii output is
> for human".  We merely call it json because we are reusing
> the json's meaning on {}, [] and int since it fits nicely
> on what we want to achieve, readability.  Other than that,
> it does not have to follow other json's requirements.
> We can call it whatever except json to avoid wrong
> user expectation.  Putting it under "-j"/"-p" was a mistake.
> Hence, I said this patch belongs to the 'plaintext" output.

Yes, that were the confusion came from, right.  I'm personally not sold
on JSON as "human readable" but I'm very far from a UI guru so up to
you :)  

I think it may be good to intentionally do non-JSON things, like use
hex integers, don't put quotes around strings, or always add a comma
after value, so people can't use it as JSON even if they try.

Basic printer is trivial to write, I'm concerned that the reuse of JSON
writer to create a user-readable output will bite us on the posterior
later on...

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Martin KaFai Lau @ 2018-06-22 23:58 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Okash Khawaja, Daniel Borkmann, Alexei Starovoitov, Yonghong Song,
	Quentin Monnet, David S. Miller, netdev, kernel-team,
	linux-kernel
In-Reply-To: <20180622164048.6283b469@cakuba.netronome.com>

On Fri, Jun 22, 2018 at 04:40:48PM -0700, Jakub Kicinski wrote:
> On Fri, 22 Jun 2018 16:19:40 -0700, Martin KaFai Lau wrote:
> > On Fri, Jun 22, 2018 at 02:49:52PM -0700, Jakub Kicinski wrote:
> > > On Fri, 22 Jun 2018 14:27:43 -0700, Jakub Kicinski wrote:  
> > > > BTF in JSON is very useful, and will help people who writes simple
> > > > orchestration/scripts based on bpftool *a* *lot*.  I really appreciate
> > > > this addition to bpftool and will start using it myself as soon as it
> > > > lands.  I'm not sure why the reluctance to slightly change the output
> > > > format?  
> > > 
> > > Ohh, maybe that's the misunderstanding, you only implemented JSON so
> > > you wanted it to be as readable and clean as possible.  Hence the hex
> > > output and cutting out the old cruft!  That perspective makes sense!
> > > But I think we should keep JSON for machines (but including BTF
> > > formatted values) and let's make the plain text output nice and clean,
> > > agreed.  
> > Right, it is what my earlier comment meant on "this ascii output is
> > for human".  We merely call it json because we are reusing
> > the json's meaning on {}, [] and int since it fits nicely
> > on what we want to achieve, readability.  Other than that,
> > it does not have to follow other json's requirements.
> > We can call it whatever except json to avoid wrong
> > user expectation.  Putting it under "-j"/"-p" was a mistake.
> > Hence, I said this patch belongs to the 'plaintext" output.
> 
> Yes, that were the confusion came from, right.  I'm personally not sold
> on JSON as "human readable" but I'm very far from a UI guru so up to
> you :)  
We have thought of a few options but nothing else shine in term
of saving the verbosity on spelling out the word like "it is a struct"
, "it is an array"... and at the same time trying to use some short form
that people are already familiar with.

> 
> I think it may be good to intentionally do non-JSON things, like use
> hex integers, don't put quotes around strings, or always add a comma
> after value, so people can't use it as JSON even if they try.
Good point.

> 
> Basic printer is trivial to write, I'm concerned that the reuse of JSON
> writer to create a user-readable output will bite us on the posterior
> later on...
If we can think of some better way later, we can also provide another
plaintext output.  I don't want to over design it at this point.  Let
see how it goes.

^ permalink raw reply

* Re: [PATCH] infiniband: i40iw, nes: don't use wall time for TCP sequence numbers
From: Shiraz Saleem @ 2018-06-23  0:01 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Latif, Faisal, Doug Ledford, Jason Gunthorpe, David S. Miller,
	y2038@lists.linaro.org, netdev@vger.kernel.org, Orosco, Henry,
	Nikolova, Tatyana E, Ismail, Mustafa, Jia-Ju Bai, Yuval Shaia,
	Bart Van Assche, Kees Cook, Reshetova, Elena,
	linux-rdma@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20180618144443.137068-1-arnd@arndb.de>

On Mon, Jun 18, 2018 at 08:44:17AM -0600, Arnd Bergmann wrote:
> The nes infiniband driver uses current_kernel_time() to get a nanosecond
> granunarity timestamp to initialize its tcp sequence counters. This is
> one of only a few remaining users of that deprecated function, so we
> should try to get rid of it.
> 
> Aside from using a deprecated API, there are several problems I see here:
> 
> - Using a CLOCK_REALTIME based time source makes it predictable in
>   case the time base is synchronized.
> - Using a coarse timestamp means it only gets updated once per jiffie,
>   making it even more predictable in order to avoid having to access
>   the hardware clock source
> - The upper 2 bits are always zero because the nanoseconds are at most
>   999999999.
> 
> For the Linux TCP implementation, we use secure_tcp_seq(), which appears
> to be appropriate here as well, and solves all the above problems.
> 
> I'm doing the same change in both versions of the nes driver, with
> i40iw being a later copy of the same code.
> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---

Thanks Arnd for the patch!

[...]

> @@ -2164,7 +2165,6 @@ static struct i40iw_cm_node *i40iw_make_cm_node(
>  				   struct i40iw_cm_listener *listener)
>  {
>  	struct i40iw_cm_node *cm_node;
> -	struct timespec ts;
>  	int oldarpindex;
>  	int arpindex;
>  	struct net_device *netdev = iwdev->netdev;
> @@ -2214,8 +2214,10 @@ static struct i40iw_cm_node *i40iw_make_cm_node(
>  	cm_node->tcp_cntxt.rcv_wscale = I40IW_CM_DEFAULT_RCV_WND_SCALE;
>  	cm_node->tcp_cntxt.rcv_wnd =
>  			I40IW_CM_DEFAULT_RCV_WND_SCALED >> I40IW_CM_DEFAULT_RCV_WND_SCALE;
> -	ts = current_kernel_time();
> -	cm_node->tcp_cntxt.loc_seq_num = ts.tv_nsec;
> +	cm_node->tcp_cntxt.loc_seq_num = secure_tcp_seq(htonl(cm_node->loc_addr[0]),
> +							htonl(cm_node->rem_addr[0]),
> +							htons(cm_node->loc_port),
> +							htons(cm_node->rem_port));

Should we not be using secure_tcpv6_seq() when we are ipv6?

Shiraz

>  	cm_node->tcp_cntxt.mss = (cm_node->ipv4) ? (iwdev->vsi.mtu - I40IW_MTU_TO_MSS_IPV4) :
>  				 (iwdev->vsi.mtu - I40IW_MTU_TO_MSS_IPV6);
>  
> 

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-23  0:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Cornelia Huck, Samudrala, Sridhar, Alexander Duyck, virtio-dev,
	aaron.f.brown, Jiri Pirko, Jakub Kicinski, Netdev, qemu-devel,
	virtualization, konrad.wilk, boris.ostrovsky, Joao Martins,
	Venu Busireddy, vijay.balakrishna
In-Reply-To: <20180623012934-mutt-send-email-mst@kernel.org>

On Fri, Jun 22, 2018 at 3:33 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Jun 22, 2018 at 02:57:39PM -0700, Siwei Liu wrote:
>> On Fri, Jun 22, 2018 at 2:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Fri, Jun 22, 2018 at 01:21:45PM -0700, Siwei Liu wrote:
>> >> On Fri, Jun 22, 2018 at 12:05 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> > On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:
>> >> >> On Thu, 21 Jun 2018 21:20:13 +0300
>> >> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >> >>
>> >> >> > On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:
>> >> >> > > OK, so what about the following:
>> >> >> > >
>> >> >> > > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>> >> >> > >   that we have a new uuid field in the virtio-net config space
>> >> >> > > - in QEMU, add a property for virtio-net that allows to specify a uuid,
>> >> >> > >   offer VIRTIO_NET_F_STANDBY_UUID if set
>> >> >> > > - when configuring, set the property to the group UUID of the vfio-pci
>> >> >> > >   device
>> >> >> > > - in the guest, use the uuid from the virtio-net device's config space
>> >> >> > >   if applicable; else, fall back to matching by MAC as done today
>> >> >> > >
>> >> >> > > That should work for all virtio transports.
>> >> >> >
>> >> >> > True. I'm a bit unhappy that it's virtio net specific though
>> >> >> > since down the road I expect we'll have a very similar feature
>> >> >> > for scsi (and maybe others).
>> >> >> >
>> >> >> > But we do not have a way to have fields that are portable
>> >> >> > both across devices and transports, and I think it would
>> >> >> > be a useful addition. How would this work though? Any idea?
>> >> >>
>> >> >> Can we introduce some kind of device-independent config space area?
>> >> >> Pushing back the device-specific config space by a certain value if the
>> >> >> appropriate feature is negotiated and use that for things like the uuid?
>> >> >
>> >> > So config moves back and forth?
>> >> > Reminds me of the msi vector mess we had with pci.
>> >> > I'd rather have every transport add a new config.
>> >> >
>> >> >> But regardless of that, I'm not sure whether extending this approach to
>> >> >> other device types is the way to go. Tying together two different
>> >> >> devices is creating complicated situations at least in the hypervisor
>> >> >> (even if it's fairly straightforward in the guest). [I have not come
>> >> >> around again to look at the "how to handle visibility in QEMU"
>> >> >> questions due to lack of cycles, sorry about that.]
>> >> >>
>> >> >> So, what's the goal of this approach? Only to allow migration with
>> >> >> vfio-pci, or also to plug in a faster device and use it instead of an
>> >> >> already attached paravirtualized device?
>> >> >
>> >> > These are two sides of the same coin, I think the second approach
>> >> > is closer to what we are doing here.
>> >>
>> >> I'm leaning towards being conservative to keep the potential of doing
>> >> both. It's the vfio migration itself that has to be addessed, not to
>> >> make every virtio device to have an accelerated datapath, right?
>> >>
>> >> -Siwei
>> >
>> > I think that the approach we took (signal configuration
>> > through standby) assumes standby always present and
>> > primary appearing and disappearing.
>>
>> I can imagine that's still true even for 1-netdev model. As long as we
>> can hide the lower devices.
>>
>> That's what I said "to keep the potential to support both" models. But
>> we should not go further along to assume the other aspect ie. to get
>> PV accelerated whenever possible, but not to get VF migrated whenever
>> possible. That's essetially a big diveregence of the priority for the
>> goal we'd want to pursue.
>>
>> -Siwei
>
> I suspect the diveregence will be lost on most users though
> simply because they don't even care about vfio. They just
> want things to go fast.

Like Jason said, VF isn't faster than virtio-net in all cases. It
depends on the workload and performance metrics: throughput, latency,
or packet per second.

>
> Rephrasing requirements in terms of acceleration actually
> made things implementable. So I'm not in a hurry to try
> and go back to asking for a migrateable vfio - very
> easy to get stuck there.

Understood. When we get all ethtool_ops exposed on the failover
interface, we'll get back to attack migrateable vfio with the 1-netdev
model.

-Siwei


>
>> >
>> > Anything else isn't well supported and I'm not sure we
>> > should complicate code too much to support it.
>> >
>> >>
>> >> >
>> >> >> What about migration of vfio devices that are not easily replaced by a
>> >> >> paravirtualized device? I'm thinking of vfio-ccw, where our main (and
>> >> >> currently only) supported device is dasd (disks) -- which can do a lot
>> >> >> of specialized things that virtio-blk does not support (and should not
>> >> >> or even cannot support).
>> >> >
>> >> > But maybe virtio-scsi can?
>> >> >
>> >> >> Would it be more helpful to focus on generic
>> >> >> migration support for vfio instead of going about it device by device?
>> >> >>
>> >> >> This network device approach already seems far along, so it makes sense
>> >> >> to continue with it. But I'm not sure whether we want to spend time and
>> >> >> energy on that for other device types rather than working on a general
>> >> >> solution for vfio migration.
>> >> >
>> >> > I'm inclined to say finalizing this feature would be a good first step
>> >> > and will teach us how we can move forward.
>> >> >
>> >> > --
>> >> > MST

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-23  0:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Jakub Kicinski, Samudrala, Sridhar, Cornelia Huck, qemu-devel,
	virtualization, Venu Busireddy, Netdev, boris.ostrovsky,
	aaron.f.brown, Joao Martins
In-Reply-To: <CADGSJ211pWKYnuuzcgUqJZiHU+=7H3oQSpp1TnE=+EBjkdmCSQ@mail.gmail.com>

On Fri, Jun 22, 2018 at 4:40 PM, Siwei Liu <loseweigh@gmail.com> wrote:
> On Fri, Jun 22, 2018 at 3:25 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Fri, Jun 22, 2018 at 02:51:11PM -0700, Siwei Liu wrote:
>>> On Fri, Jun 22, 2018 at 2:29 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> > On Fri, Jun 22, 2018 at 01:03:04PM -0700, Siwei Liu wrote:
>>> >> On Fri, Jun 22, 2018 at 1:00 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>>> >> > On Thu, Jun 21, 2018 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> >> >> On Thu, Jun 21, 2018 at 06:21:55PM -0700, Siwei Liu wrote:
>>> >> >>> On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck <cohuck@redhat.com> wrote:
>>> >> >>> > On Wed, 20 Jun 2018 22:48:58 +0300
>>> >> >>> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>> >> >>> >
>>> >> >>> >> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
>>> >> >>> >> > In any case, I'm not sure anymore why we'd want the extra uuid.
>>> >> >>> >>
>>> >> >>> >> It's mostly so we can have e.g. multiple devices with same MAC
>>> >> >>> >> (which some people seem to want in order to then use
>>> >> >>> >> then with different containers).
>>> >> >>> >>
>>> >> >>> >> But it is also handy for when you assign a PF, since then you
>>> >> >>> >> can't set the MAC.
>>> >> >>> >>
>>> >> >>> >
>>> >> >>> > OK, so what about the following:
>>> >> >>> >
>>> >> >>> > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>>> >> >>> >   that we have a new uuid field in the virtio-net config space
>>> >> >>> > - in QEMU, add a property for virtio-net that allows to specify a uuid,
>>> >> >>> >   offer VIRTIO_NET_F_STANDBY_UUID if set
>>> >> >>> > - when configuring, set the property to the group UUID of the vfio-pci
>>> >> >>> >   device
>>> >> >>>
>>> >> >>> If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
>>> >> >>> to still expose UUID in the config space on virtio-pci?
>>> >> >>
>>> >> >>
>>> >> >> Yes but guest is not supposed to read it.
>>> >> >>
>>> >> >>> I'm not even sure if it's sane to expose group UUID on the PCI bridge
>>> >> >>> where the corresponding vfio-pci device attached to for a guest which
>>> >> >>> doesn't support the feature (legacy).
>>> >> >>>
>>> >> >>> -Siwei
>>> >> >>
>>> >> >> Yes but you won't add the primary behind such a bridge.
>>> >> >
>>> >> > I assume the UUID feature is a new one besides the exiting
>>> >> > VIRTIO_NET_F_STANDBY feature, where I think the current proposal is,
>>> >> > if UUID feature is present and supported by guest, we'll do pairing
>>> >> > based on UUID; if UUID feature present
>>> >>                                  ^^^^^^^  is NOT present
>>> >
>>> > Let's go back for a bit.
>>> >
>>> > What is wrong with matching by mac?
>>> >
>>> > 1. Does not allow multiple NICs with same mac
>>> > 2. Requires MAC to be programmed by host in the PT device
>>> >    (which is often possible with VFs but isn't possible with all PT
>>> >    devices)
>>>
>>> Might not neccessarily be something wrong, but it's very limited to
>>> prohibit the MAC of VF from changing when enslaved by failover.
>>
>> You mean guest changing MAC? I'm not sure why we prohibit that.
>
> I think Sridhar and Jiri might be better person to answer it. My
> impression was that sync'ing the MAC address change between all 3
> devices is challenging, as the failover driver uses MAC address to
> match net_device internally.
>
>>
>>
>>> The
>>> same as you indicated on the PT device. I suspect the same is
>>> prohibited on even virtio-net and failover is because of the same
>>> limitation.
>>>
>>> >
>>> > Both issues are up to host: if host needs them it needs the UUID
>>> > feature, if not MAC matching is sufficient.
>>>
>>> I know. What I'm saying is that we rely on STANDBY feature to control
>>> the visibility of the primary, not the UUID feature.
>>>
>>> -Siwei
>>
>> And what I'm saying is that it's up to a host policy.
>
> So lets say host policy explicitly requires UUID but the guest does
> not support it. The guest could be a legacy guest with no STANDBY
> support, or a relevately recent guest with STANDBY support but without
> the UUID support. In either case, since host asked for UUID matching
> explicitly, maybe because it requires different NIC using same MAC, so
> it has to override the negotiation result of STANDBY, even if STANDBY
> is set and supported? That will end up with inter-feature dependency
> over STANDBY, as you see.

I forgot to mention the above has the assumption that we expose both
STANDBY and UUID feature to QEMU user. In fact, as we're going towards
not exposing the STANDBY feature directly to user, UUID may be always
required to enable STANDBY. If not, how do we make sure QEMU can
control the visibility of primary device? Something to be confirmed
before implementing the code.

-Siwei

>
> -Siwei
>
>>
>>> >
>>> >
>>> >> > or not supported by guest,
>>> >> > we'll still plug in the VF (if guest supports the _F_STANDBY feature)
>>> >> > but the pairing will be fallback to using MAC address. Are you saying
>>> >> > you don't even want to plug in the primary when the UUID feature is
>>> >> > not supported by guest? Does the feature negotiation UUID have to
>>> >> > depend on STANDBY being set, or the other way around? I thought that
>>> >> > just the absence of STANDBY will prevent primary to get exposed to the
>>> >> > guest.
>>> >> >
>>> >> > -Siwei
>>> >> >
>>> >> >>
>>> >> >>>
>>> >> >>> > - in the guest, use the uuid from the virtio-net device's config space
>>> >> >>> >   if applicable; else, fall back to matching by MAC as done today
>>> >

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Martin KaFai Lau @ 2018-06-23  0:26 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Okash Khawaja, Daniel Borkmann, Alexei Starovoitov, Yonghong Song,
	Quentin Monnet, David S. Miller, netdev, kernel-team,
	linux-kernel
In-Reply-To: <20180622163200.20564ec4@cakuba.netronome.com>

On Fri, Jun 22, 2018 at 04:32:00PM -0700, Jakub Kicinski wrote:
> On Fri, 22 Jun 2018 15:54:08 -0700, Martin KaFai Lau wrote:
> > > > > > > > > > >         "value": ["0x02","0x00","0x00","0x00","0x00","0x00","0x00","0x00"
> > > > > > > > > > >         ],
> > > > > > > > > > > 	"value_struct":{
> > > > > > > > > > > 		"src_ip":2,      
> > > > > > If for the same map the user changes the "src_ip" to an array of int[4]
> > > > > > later (e.g. to support ipv6), it will become "src_ip": [1, 2, 3, 4].
> > > > > > Is it breaking backward compat?
> > > > > > i.e.
> > > > > > struct five_tuples {
> > > > > > -	int src_ip;
> > > > > > +	int src_ip[4];
> > > > > > /* ... */
> > > > > > };    
> > > > > 
> > > > > Well, it is breaking backward compat, but it's the program doing it,
> > > > > not bpftool :)  BTF changes so does the output.    
> > > > As we see, the key/value's btf-output is inherently not backward compat.
> > > > Hence, "-j" and "-p" will stay as is.  The whole existing json will
> > > > be backward compat instead of only partly backward compat.  
> > > 
> > > No.  There is a difference between user of a facility changing their
> > > input and kernel/libraries providing different output in response to
> > > that, and the libraries suddenly changing the output on their own.
> > > 
> > > Your example is like saying if user started using IPv6 addresses
> > > instead of IPv4 the netlink attributes in dumps will be different so
> > > kernel didn't keep backwards compat.  While what you're doing is more
> > > equivalent to dropping support for old ioctl interfaces because there
> > > is a better mechanism now.  
> > Sorry, I don't follow this.  I don't see netlink suffer json issue like
> > the one on "key" and "value".
> > 
> > All I can grasp is, the json should normally be backward compat but now
> > we are saying anything added by btf-output is an exception because
> > the script parsing it will treat it differently than "key" and "value"
> 
> Backward compatibility means that if I run *the same* program against
> different kernels/libraries it continues to work.  If someone decides
> to upgrade their program to work with IPv6 (which was your example)
> obviously there is no way system as a whole will look 1:1 the same.
> 
> > > BTF in JSON is very useful, and will help people who writes simple
> > > orchestration/scripts based on bpftool *a* *lot*.  I really appreciate  
> > Can you share what the script will do?  I want to understand why
> > it cannot directly use the BTF format and the map data.
> 
> Think about a python script which wants to read a counter in a map.
> Right now it would have to get the BTF, find out which bytes are the
> counter, then convert the bytes into a larger int.  With JSON BTF if
> just does entry["formatted"]["value"]["counter"].
> 
> Real life example from my test code (conversion of 3 element counter
> array):
> 
> def str2int(strtab):
>     inttab = []
>     for i in strtab:
>         inttab.append(int(i, 16))
>     ba = bytearray(inttab)
>     if len(strtab) == 4:
>         fmt = "I"
>     elif len(strtab) == 8:
>         fmt = "Q"
>     else:
>         raise Exception("String array of len %d can't be unpacked to an int" %
>                         (len(strtab)))
>     return struct.unpack(fmt, ba)[0]
> 
> def convert(elems, idx):
>     val = []
>     for i in range(3):
>         part = elems[idx]["value"][i * length:(i + 1) * length]
>         val.append(str2int(part))
>     return val
> 
> With BTF it would be:
> 
> 	elems[idx]["formatted"]["value"]
> 
> Which is fairly awesome.
Thanks for the example.  Agree that with BTF, things are easier in general.

btw, what more awesome is,
#> bpftool map find id 100 key 1
{
	"counter_x": 1,
	"counter_y": 10
}

> 
> > > this addition to bpftool and will start using it myself as soon as it
> > > lands.  I'm not sure why the reluctance to slightly change the output
> > > format?  
> > The initial change argument is because the json has to be backward compat.
> > 
> > Then we show that btf-output is inherently not backward compat, so
> > printing it in json does not make sense at all.
> > 
> > However, now it is saying part of it does not have to be backward compat.
> 
> Compatibility of "formatted" member is defined as -> fields broken out
> according to BTF.  So it is backward compatible.  The definition of
> "value" member is -> an array of unfortunately formatted array of
> ugly hex strings :(
> 
> > I am fine putting it under "formatted" for "-j" or "-p" if that is the
> > case, other than the double output is still confusing.  Lets wait for
> > Okash's input.
> >
> > At the same time, the same output will be used as the default plaintext
> > output when BTF is available.  Then the plaintext BTF output
> > will not be limited by the json restrictions when we want
> > to improve human readability later.  Apparently, the
> > improvements on plaintext will not be always applicable
> > to json output.
> 

^ permalink raw reply

* [PATCH net-next 1/4] selftests: rtnetlink: clear the return code at start of ipsec test
From: Shannon Nelson @ 2018-06-23  0:31 UTC (permalink / raw)
  To: davem, netdev, jakub.kicinski; +Cc: anders.roxell, linux-kselftest
In-Reply-To: <1529713898-9200-1-git-send-email-shannon.nelson@oracle.com>

Following the custom from the other functions, clear the global
ret code before starting the test so as to not have previously
failed tests cause us to think this test has failed.

Reported-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 tools/testing/selftests/net/rtnetlink.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index b33a371..261a981 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -522,6 +522,8 @@ kci_test_macsec()
 #-------------------------------------------------------------------
 kci_test_ipsec()
 {
+	ret=0
+
 	# find an ip address on this machine and make up a destination
 	srcip=`ip -o addr | awk '/inet / { print $4; }' | grep -v "^127" | head -1 | cut -f1 -d/`
 	net=`echo $srcip | cut -f1-3 -d.`
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 2/4] selftests: rtnetlink: use dummydev as a test device
From: Shannon Nelson @ 2018-06-23  0:31 UTC (permalink / raw)
  To: davem, netdev, jakub.kicinski; +Cc: anders.roxell, linux-kselftest
In-Reply-To: <1529713898-9200-1-git-send-email-shannon.nelson@oracle.com>

We really shouldn't mess with local system settings, so let's
use the already created dummy device instead for ipsec testing.
Oh, and let's put the temp file into a proper directory.

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 tools/testing/selftests/net/rtnetlink.sh | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index 261a981..15948cf 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -523,21 +523,19 @@ kci_test_macsec()
 kci_test_ipsec()
 {
 	ret=0
-
-	# find an ip address on this machine and make up a destination
-	srcip=`ip -o addr | awk '/inet / { print $4; }' | grep -v "^127" | head -1 | cut -f1 -d/`
-	net=`echo $srcip | cut -f1-3 -d.`
-	base=`echo $srcip | cut -f4 -d.`
-	dstip="$net."`expr $base + 1`
-
 	algo="aead rfc4106(gcm(aes)) 0x3132333435363738393031323334353664636261 128"
+	srcip=192.168.123.1
+	dstip=192.168.123.2
+	spi=7
+
+	ip addr add $srcip dev $devdummy
 
 	# flush to be sure there's nothing configured
 	ip x s flush ; ip x p flush
 	check_err $?
 
 	# start the monitor in the background
-	tmpfile=`mktemp ipsectestXXX`
+	tmpfile=`mktemp /var/run/ipsectestXXX`
 	mpid=`(ip x m > $tmpfile & echo $!) 2>/dev/null`
 	sleep 0.2
 
@@ -601,6 +599,7 @@ kci_test_ipsec()
 	check_err $?
 	ip x p flush
 	check_err $?
+	ip addr del $srcip/32 dev $devdummy
 
 	if [ $ret -ne 0 ]; then
 		echo "FAIL: ipsec"
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 0/4] Updates for ipsec selftests
From: Shannon Nelson @ 2018-06-23  0:31 UTC (permalink / raw)
  To: davem, netdev, jakub.kicinski; +Cc: anders.roxell, linux-kselftest

Fix up the existing ipsec selftest and add tests for
the ipsec offload driver API.

Shannon Nelson (4):
  selftests: rtnetlink: clear the return code at start of ipsec test
  selftests: rtnetlink: use dummydev as a test device
  netdevsim: add ipsec offload testing
  selftests: rtnetlink: add ipsec offload API test

 drivers/net/netdevsim/Makefile           |   4 +
 drivers/net/netdevsim/ipsec.c            | 345 +++++++++++++++++++++++++++++++
 drivers/net/netdevsim/netdev.c           |   7 +
 drivers/net/netdevsim/netdevsim.h        |  37 ++++
 tools/testing/selftests/net/rtnetlink.sh | 132 +++++++++++-
 5 files changed, 518 insertions(+), 7 deletions(-)
 create mode 100644 drivers/net/netdevsim/ipsec.c

-- 
2.7.4

^ permalink raw reply

* [PATCH net-next 3/4] netdevsim: add ipsec offload testing
From: Shannon Nelson @ 2018-06-23  0:31 UTC (permalink / raw)
  To: davem, netdev, jakub.kicinski; +Cc: anders.roxell, linux-kselftest
In-Reply-To: <1529713898-9200-1-git-send-email-shannon.nelson@oracle.com>

Implement the IPsec/XFRM offload API for testing.

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 drivers/net/netdevsim/Makefile    |   4 +
 drivers/net/netdevsim/ipsec.c     | 345 ++++++++++++++++++++++++++++++++++++++
 drivers/net/netdevsim/netdev.c    |   7 +
 drivers/net/netdevsim/netdevsim.h |  37 ++++
 4 files changed, 393 insertions(+)
 create mode 100644 drivers/net/netdevsim/ipsec.c

diff --git a/drivers/net/netdevsim/Makefile b/drivers/net/netdevsim/Makefile
index 449b2a1..0fee1d0 100644
--- a/drivers/net/netdevsim/Makefile
+++ b/drivers/net/netdevsim/Makefile
@@ -13,3 +13,7 @@ endif
 ifneq ($(CONFIG_NET_DEVLINK),)
 netdevsim-objs += devlink.o fib.o
 endif
+
+ifneq ($(CONFIG_XFRM_OFFLOAD),)
+netdevsim-objs += ipsec.o
+endif
diff --git a/drivers/net/netdevsim/ipsec.c b/drivers/net/netdevsim/ipsec.c
new file mode 100644
index 0000000..ad64266
--- /dev/null
+++ b/drivers/net/netdevsim/ipsec.c
@@ -0,0 +1,345 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2018 Oracle and/or its affiliates. All rights reserved. */
+
+#include <net/xfrm.h>
+#include <crypto/aead.h>
+#include <linux/debugfs.h>
+#include "netdevsim.h"
+
+#define NSIM_IPSEC_AUTH_BITS	128
+
+/**
+ * nsim_ipsec_dbg_read - read for ipsec data
+ * @filp: the opened file
+ * @buffer: where to write the data for the user to read
+ * @count: the size of the user's buffer
+ * @ppos: file position offset
+ **/
+static ssize_t nsim_dbg_netdev_ops_read(struct file *filp,
+					char __user *buffer,
+					size_t count, loff_t *ppos)
+{
+	struct netdevsim *ns = filp->private_data;
+	struct nsim_ipsec *ipsec = &ns->ipsec;
+	size_t bufsize;
+	char *buf, *p;
+	int len;
+	int i;
+
+	/* don't allow partial reads */
+	if (*ppos != 0)
+		return 0;
+
+	/* the buffer needed is
+	 * (num SAs * 3 lines each * ~60 bytes per line) + one more line
+	 */
+	bufsize = (ipsec->count * 4 * 60) + 60;
+	buf = kzalloc(bufsize, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	p = buf;
+	p += snprintf(p, bufsize - (p - buf),
+		      "SA count=%u tx=%u\n",
+		      ipsec->count, ipsec->tx);
+
+	for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) {
+		struct nsim_sa *sap = &ipsec->sa[i];
+
+		if (!sap->used)
+			continue;
+
+		p += snprintf(p, bufsize - (p - buf),
+			      "sa[%i] %cx ipaddr=0x%08x %08x %08x %08x\n",
+			      i, (sap->rx ? 'r' : 't'), sap->ipaddr[0],
+			      sap->ipaddr[1], sap->ipaddr[2], sap->ipaddr[3]);
+		p += snprintf(p, bufsize - (p - buf),
+			      "sa[%i]    spi=0x%08x proto=0x%x salt=0x%08x crypt=%d\n",
+			      i, be32_to_cpu(sap->xs->id.spi),
+			      sap->xs->id.proto, sap->salt, sap->crypt);
+		p += snprintf(p, bufsize - (p - buf),
+			      "sa[%i]    key=0x%08x %08x %08x %08x\n",
+			      i, sap->key[0], sap->key[1],
+			      sap->key[2], sap->key[3]);
+	}
+
+	len = simple_read_from_buffer(buffer, count, ppos, buf, p - buf);
+
+	kfree(buf);
+	return len;
+}
+
+static const struct file_operations ipsec_dbg_fops = {
+	.owner = THIS_MODULE,
+	.open = simple_open,
+	.read = nsim_dbg_netdev_ops_read,
+};
+
+/**
+ * nsim_ipsec_find_empty_idx - find the first unused security parameter index
+ * @ipsec: pointer to ipsec struct
+ **/
+static int nsim_ipsec_find_empty_idx(struct nsim_ipsec *ipsec)
+{
+	u32 i;
+
+	if (ipsec->count == NSIM_IPSEC_MAX_SA_COUNT)
+		return -ENOSPC;
+
+	/* search sa table */
+	for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) {
+		if (!ipsec->sa[i].used)
+			return i;
+	}
+
+	return -ENOSPC;
+}
+
+/**
+ * nsim_ipsec_parse_proto_keys - find the key and salt based on the protocol
+ * @xs: pointer to xfrm_state struct
+ * @mykey: pointer to key array to populate
+ * @mysalt: pointer to salt value to populate
+ *
+ * This copies the protocol keys and salt to our own data tables.  The
+ * 82599 family only supports the one algorithm.
+ **/
+static int nsim_ipsec_parse_proto_keys(struct xfrm_state *xs,
+				       u32 *mykey, u32 *mysalt)
+{
+	struct net_device *dev = xs->xso.dev;
+	unsigned char *key_data;
+	char *alg_name = NULL;
+	const char aes_gcm_name[] = "rfc4106(gcm(aes))";
+	int key_len;
+
+	if (!xs->aead) {
+		netdev_err(dev, "Unsupported IPsec algorithm\n");
+		return -EINVAL;
+	}
+
+	if (xs->aead->alg_icv_len != NSIM_IPSEC_AUTH_BITS) {
+		netdev_err(dev, "IPsec offload requires %d bit authentication\n",
+			   NSIM_IPSEC_AUTH_BITS);
+		return -EINVAL;
+	}
+
+	key_data = &xs->aead->alg_key[0];
+	key_len = xs->aead->alg_key_len;
+	alg_name = xs->aead->alg_name;
+
+	if (strcmp(alg_name, aes_gcm_name)) {
+		netdev_err(dev, "Unsupported IPsec algorithm - please use %s\n",
+			   aes_gcm_name);
+		return -EINVAL;
+	}
+
+	/* The key bytes come down in a bigendian array of bytes, so
+	 * we don't need to do any byteswapping.
+	 * 160 accounts for 16 byte key and 4 byte salt
+	 */
+	if (key_len > 128) {
+		*mysalt = ((u32 *)key_data)[4];
+	} else if (key_len == 128) {
+		*mysalt = 0;
+	} else {
+		netdev_err(dev, "IPsec hw offload only supports 128 bit keys with optional 32 bit salt\n");
+		return -EINVAL;
+	}
+	memcpy(mykey, key_data, 16);
+
+	return 0;
+}
+
+/**
+ * nsim_ipsec_add_sa - program device with a security association
+ * @xs: pointer to transformer state struct
+ **/
+static int nsim_ipsec_add_sa(struct xfrm_state *xs)
+{
+	struct net_device *dev = xs->xso.dev;
+	struct netdevsim *ns = netdev_priv(dev);
+	struct nsim_ipsec *ipsec = &ns->ipsec;
+	struct nsim_sa sa;
+	u16 sa_idx;
+	int ret;
+
+	if (xs->id.proto != IPPROTO_ESP && xs->id.proto != IPPROTO_AH) {
+		netdev_err(dev, "Unsupported protocol 0x%04x for ipsec offload\n",
+			   xs->id.proto);
+		return -EINVAL;
+	}
+
+	if (xs->calg) {
+		netdev_err(dev, "Compression offload not supported\n");
+		return -EINVAL;
+	}
+
+	/* find the first unused index */
+	ret = nsim_ipsec_find_empty_idx(ipsec);
+	if (ret < 0) {
+		netdev_err(dev, "No space for SA in Rx table!\n");
+		return ret;
+	}
+	sa_idx = (u16)ret;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.used = true;
+	sa.xs = xs;
+
+	if (sa.xs->id.proto & IPPROTO_ESP)
+		sa.crypt = xs->ealg || xs->aead;
+
+	/* get the key and salt */
+	ret = nsim_ipsec_parse_proto_keys(xs, sa.key, &sa.salt);
+	if (ret) {
+		netdev_err(dev, "Failed to get key data for SA table\n");
+		return ret;
+	}
+
+	if (xs->xso.flags & XFRM_OFFLOAD_INBOUND) {
+		sa.rx = true;
+
+		if (xs->props.family == AF_INET6)
+			memcpy(sa.ipaddr, &xs->id.daddr.a6, 16);
+		else
+			memcpy(&sa.ipaddr[3], &xs->id.daddr.a4, 4);
+	}
+
+	/* the preparations worked, so save the info */
+	memcpy(&ipsec->sa[sa_idx], &sa, sizeof(sa));
+
+	/* the XFRM stack doesn't like offload_handle == 0,
+	 * so add a bitflag in case our array index is 0
+	 */
+	xs->xso.offload_handle = sa_idx | NSIM_IPSEC_VALID;
+	ipsec->count++;
+
+	return 0;
+}
+
+/**
+ * nsim_ipsec_del_sa - clear out this specific SA
+ * @xs: pointer to transformer state struct
+ **/
+static void nsim_ipsec_del_sa(struct xfrm_state *xs)
+{
+	struct net_device *dev = xs->xso.dev;
+	struct netdevsim *ns = netdev_priv(dev);
+	struct nsim_ipsec *ipsec = &ns->ipsec;
+	u16 sa_idx;
+
+	sa_idx = xs->xso.offload_handle & ~NSIM_IPSEC_VALID;
+	if (!ipsec->sa[sa_idx].used) {
+		netdev_err(dev, "Invalid SA for delete sa_idx=%d\n", sa_idx);
+		return;
+	}
+
+	memset(&ipsec->sa[sa_idx], 0, sizeof(struct nsim_sa));
+	ipsec->count--;
+}
+
+/**
+ * nsim_ipsec_offload_ok - can this packet use the xfrm hw offload
+ * @skb: current data packet
+ * @xs: pointer to transformer state struct
+ **/
+static bool nsim_ipsec_offload_ok(struct sk_buff *skb, struct xfrm_state *xs)
+{
+	struct net_device *dev = xs->xso.dev;
+	struct netdevsim *ns = netdev_priv(dev);
+	struct nsim_ipsec *ipsec = &ns->ipsec;
+
+	ipsec->ok++;
+
+	return true;
+}
+
+static const struct xfrmdev_ops nsim_xfrmdev_ops = {
+	.xdo_dev_state_add = nsim_ipsec_add_sa,
+	.xdo_dev_state_delete = nsim_ipsec_del_sa,
+	.xdo_dev_offload_ok = nsim_ipsec_offload_ok,
+};
+
+/**
+ * nsim_ipsec_tx - check Tx packet for ipsec offload
+ * @ns: pointer to ns structure
+ * @skb: current data packet
+ **/
+int nsim_ipsec_tx(struct netdevsim *ns, struct sk_buff *skb)
+{
+	struct nsim_ipsec *ipsec = &ns->ipsec;
+	struct xfrm_state *xs;
+	struct nsim_sa *tsa;
+	u32 sa_idx;
+
+	/* do we even need to check this packet? */
+	if (!skb->sp)
+		return 1;
+
+	if (unlikely(!skb->sp->len)) {
+		netdev_err(ns->netdev, "%s: no xfrm state len = %d\n",
+			   __func__, skb->sp->len);
+		return 0;
+	}
+
+	xs = xfrm_input_state(skb);
+	if (unlikely(!xs)) {
+		netdev_err(ns->netdev, "%s: no xfrm_input_state() xs = %p\n",
+			   __func__, xs);
+		return 0;
+	}
+
+	sa_idx = xs->xso.offload_handle & ~NSIM_IPSEC_VALID;
+	if (unlikely(sa_idx > NSIM_IPSEC_MAX_SA_COUNT)) {
+		netdev_err(ns->netdev, "%s: bad sa_idx=%d max=%d\n",
+			   __func__, sa_idx, NSIM_IPSEC_MAX_SA_COUNT);
+		return 0;
+	}
+
+	tsa = &ipsec->sa[sa_idx];
+	if (unlikely(!tsa->used)) {
+		netdev_err(ns->netdev, "%s: unused sa_idx=%d\n",
+			   __func__, sa_idx);
+		return 0;
+	}
+
+	if (xs->id.proto != IPPROTO_ESP && xs->id.proto != IPPROTO_AH) {
+		netdev_err(ns->netdev, "%s: unexpected proto=%d\n",
+			   __func__, xs->id.proto);
+		return 0;
+	}
+
+	ipsec->tx++;
+
+	return 1;
+}
+
+/**
+ * nsim_ipsec_init - initialize security registers for IPSec operation
+ * @ns: board private structure
+ **/
+void nsim_ipsec_init(struct netdevsim *ns)
+{
+	ns->netdev->xfrmdev_ops = &nsim_xfrmdev_ops;
+
+#define NSIM_ESP_FEATURES	(NETIF_F_HW_ESP | \
+				 NETIF_F_HW_ESP_TX_CSUM | \
+				 NETIF_F_GSO_ESP)
+
+	ns->netdev->features |= NSIM_ESP_FEATURES;
+	ns->netdev->hw_enc_features |= NSIM_ESP_FEATURES;
+
+	ns->ipsec.pfile = debugfs_create_file("ipsec", 0400, ns->ddir, ns,
+					      &ipsec_dbg_fops);
+}
+
+void nsim_ipsec_teardown(struct netdevsim *ns)
+{
+	struct nsim_ipsec *ipsec = &ns->ipsec;
+
+	if (ipsec->count)
+		netdev_err(ns->netdev, "%s: tearing down IPsec offload with %d SAs left\n",
+			   __func__, ipsec->count);
+	debugfs_remove_recursive(ipsec->pfile);
+}
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index ec68f38..6ce8604d 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -171,6 +171,8 @@ static int nsim_init(struct net_device *dev)
 	if (err)
 		goto err_unreg_dev;
 
+	nsim_ipsec_init(ns);
+
 	return 0;
 
 err_unreg_dev:
@@ -186,6 +188,7 @@ static void nsim_uninit(struct net_device *dev)
 {
 	struct netdevsim *ns = netdev_priv(dev);
 
+	nsim_ipsec_teardown(ns);
 	nsim_devlink_teardown(ns);
 	debugfs_remove_recursive(ns->ddir);
 	nsim_bpf_uninit(ns);
@@ -203,11 +206,15 @@ static netdev_tx_t nsim_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct netdevsim *ns = netdev_priv(dev);
 
+	if (!nsim_ipsec_tx(ns, skb))
+		goto out;
+
 	u64_stats_update_begin(&ns->syncp);
 	ns->tx_packets++;
 	ns->tx_bytes += skb->len;
 	u64_stats_update_end(&ns->syncp);
 
+out:
 	dev_kfree_skb(skb);
 
 	return NETDEV_TX_OK;
diff --git a/drivers/net/netdevsim/netdevsim.h b/drivers/net/netdevsim/netdevsim.h
index 3a8581a..1708dee 100644
--- a/drivers/net/netdevsim/netdevsim.h
+++ b/drivers/net/netdevsim/netdevsim.h
@@ -29,6 +29,29 @@ struct bpf_prog;
 struct dentry;
 struct nsim_vf_config;
 
+#if IS_ENABLED(CONFIG_XFRM_OFFLOAD)
+#define NSIM_IPSEC_MAX_SA_COUNT		33
+#define NSIM_IPSEC_VALID		BIT(31)
+
+struct nsim_sa {
+	struct xfrm_state *xs;
+	__be32 ipaddr[4];
+	u32 key[4];
+	u32 salt;
+	bool used;
+	bool crypt;
+	bool rx;
+};
+
+struct nsim_ipsec {
+	struct nsim_sa sa[NSIM_IPSEC_MAX_SA_COUNT];
+	struct dentry *pfile;
+	u32 count;
+	u32 tx;
+	u32 ok;
+};
+#endif
+
 struct netdevsim {
 	struct net_device *netdev;
 
@@ -67,6 +90,9 @@ struct netdevsim {
 #if IS_ENABLED(CONFIG_NET_DEVLINK)
 	struct devlink *devlink;
 #endif
+#if IS_ENABLED(CONFIG_XFRM_OFFLOAD)
+	struct nsim_ipsec ipsec;
+#endif
 };
 
 extern struct dentry *nsim_ddir;
@@ -147,6 +173,17 @@ static inline void nsim_devlink_exit(void)
 }
 #endif
 
+#if IS_ENABLED(CONFIG_XFRM_OFFLOAD)
+void nsim_ipsec_init(struct netdevsim *ns);
+void nsim_ipsec_teardown(struct netdevsim *ns);
+int nsim_ipsec_tx(struct netdevsim *ns, struct sk_buff *skb);
+#else
+static inline void nsim_ipsec_init(struct netdevsim *ns) {};
+static inline void nsim_ipsec_teardown(struct netdevsim *ns) {};
+static inline int nsim_ipsec_tx(struct netdevsim *ns, struct sk_buff *skb)
+								{ return 1; };
+#endif
+
 static inline struct netdevsim *to_nsim(struct device *ptr)
 {
 	return container_of(ptr, struct netdevsim, dev);
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 4/4] selftests: rtnetlink: add ipsec offload API test
From: Shannon Nelson @ 2018-06-23  0:31 UTC (permalink / raw)
  To: davem, netdev, jakub.kicinski; +Cc: anders.roxell, linux-kselftest
In-Reply-To: <1529713898-9200-1-git-send-email-shannon.nelson@oracle.com>

Using the netdevsim as a device for testing, try out the XFRM commands
for setting up IPsec hardware offloads.

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 tools/testing/selftests/net/rtnetlink.sh | 114 +++++++++++++++++++++++++++++++
 1 file changed, 114 insertions(+)

diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index 15948cf..9e1a82e 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -608,6 +608,119 @@ kci_test_ipsec()
 	echo "PASS: ipsec"
 }
 
+#-------------------------------------------------------------------
+# Example commands
+#   ip x s add proto esp src 14.0.0.52 dst 14.0.0.70 \
+#            spi 0x07 mode transport reqid 0x07 replay-window 32 \
+#            aead 'rfc4106(gcm(aes))' 1234567890123456dcba 128 \
+#            sel src 14.0.0.52/24 dst 14.0.0.70/24
+#            offload dev sim1 dir out
+#   ip x p add dir out src 14.0.0.52/24 dst 14.0.0.70/24 \
+#            tmpl proto esp src 14.0.0.52 dst 14.0.0.70 \
+#            spi 0x07 mode transport reqid 0x07
+#
+#-------------------------------------------------------------------
+kci_test_ipsec_offload()
+{
+	ret=0
+	algo="aead rfc4106(gcm(aes)) 0x3132333435363738393031323334353664636261 128"
+	srcip=192.168.123.3
+	dstip=192.168.123.4
+	dev=simx1
+	sysfsd=/sys/kernel/debug/netdevsim/$dev
+	sysfsf=$sysfsd/ipsec
+
+	# setup netdevsim since dummydev doesn't have offload support
+	modprobe netdevsim
+	check_err $?
+	if [ $ret -ne 0 ]; then
+		echo "FAIL: ipsec_offload can't load netdevsim"
+		return 1
+	fi
+
+	ip link add $dev type netdevsim
+	ip addr add $srcip dev $dev
+	ip link set $dev up
+	if [ ! -d $sysfsd ] ; then
+		echo "FAIL: ipsec_offload can't create device $dev"
+		return 1
+	fi
+	if [ ! -f $sysfsf ] ; then
+		echo "FAIL: ipsec_offload netdevsim doesn't support IPsec offload"
+		return 1
+	fi
+
+	# flush to be sure there's nothing configured
+	ip x s flush ; ip x p flush
+
+	# create offloaded SAs, both in and out
+	ip x p add dir out src $srcip/24 dst $dstip/24 \
+	    tmpl proto esp src $srcip dst $dstip spi 9 \
+	    mode transport reqid 42
+	check_err $?
+	ip x p add dir out src $dstip/24 dst $srcip/24 \
+	    tmpl proto esp src $dstip dst $srcip spi 9 \
+	    mode transport reqid 42
+	check_err $?
+
+	ip x s add proto esp src $srcip dst $dstip spi 9 \
+	    mode transport reqid 42 $algo sel src $srcip/24 dst $dstip/24 \
+	    offload dev $dev dir out
+	check_err $?
+	ip x s add proto esp src $dstip dst $srcip spi 9 \
+	    mode transport reqid 42 $algo sel src $dstip/24 dst $srcip/24 \
+	    offload dev $dev dir in
+	check_err $?
+	if [ $ret -ne 0 ]; then
+		echo "FAIL: ipsec_offload can't create SA"
+		return 1
+	fi
+
+	# does offload show up in ip output
+	lines=`ip x s list | grep -c "crypto offload parameters: dev $dev dir"`
+	if [ $lines -ne 2 ] ; then
+		echo "FAIL: ipsec_offload SA offload missing from list output"
+		check_err 1
+	fi
+
+	# use ping to exercise the Tx path
+	ping -I $dev -c 3 -W 1 -i 0 $dstip >/dev/null
+
+	# does driver have correct offload info
+	diff $sysfsf - << EOF
+SA count=2 tx=3
+sa[0] tx ipaddr=0x00000000 00000000 00000000 00000000
+sa[0]    spi=0x00000009 proto=0x32 salt=0x61626364 crypt=1
+sa[0]    key=0x34333231 38373635 32313039 36353433
+sa[1] rx ipaddr=0x00000000 00000000 00000000 037ba8c0
+sa[1]    spi=0x00000009 proto=0x32 salt=0x61626364 crypt=1
+sa[1]    key=0x34333231 38373635 32313039 36353433
+EOF
+	if [ $? -ne 0 ] ; then
+		echo "FAIL: ipsec_offload incorrect driver data"
+		check_err 1
+	fi
+
+	# does offload get removed from driver
+	ip x s flush
+	ip x p flush
+	lines=`grep -c "SA count=0" $sysfsf`
+	if [ $lines -ne 1 ] ; then
+		echo "FAIL: ipsec_offload SA not removed from driver"
+		check_err 1
+	fi
+
+	# clean up any leftovers
+	ip link del $dev
+	rmmod netdevsim
+
+	if [ $ret -ne 0 ]; then
+		echo "FAIL: ipsec_offload"
+		return 1
+	fi
+	echo "PASS: ipsec_offload"
+}
+
 kci_test_gretap()
 {
 	testns="testns"
@@ -862,6 +975,7 @@ kci_test_rtnl()
 	kci_test_encap
 	kci_test_macsec
 	kci_test_ipsec
+	kci_test_ipsec_offload
 
 	kci_del_dummy
 }
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH net] vhost_net: validate sock before trying to put its fd
From: David Miller @ 2018-06-23  1:24 UTC (permalink / raw)
  To: jasowang; +Cc: kvm, mst, netdev, linux-kernel, virtualization, dan.carpenter
In-Reply-To: <1529557891-4828-1-git-send-email-jasowang@redhat.com>

From: Jason Wang <jasowang@redhat.com>
Date: Thu, 21 Jun 2018 13:11:31 +0800

> Sock will be NULL if we pass -1 to vhost_net_set_backend(), but when
> we meet errors during ubuf allocation, the code does not check for
> NULL before calling sockfd_put(), this will lead NULL
> dereferencing. Fixing by checking sock pointer before.
> 
> Fixes: bab632d69ee4 ("vhost: vhost TX zero-copy support")
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Applied and queued up for -stable, thanks Jason.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox