Netdev List
 help / color / mirror / Atom feed
* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-15  2:34 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, qemu-devel, virtualization, Siwei Liu, Netdev,
	aaron.f.brown
In-Reply-To: <20180614120231.0a72bd5f.cohuck@redhat.com>

On Thu, Jun 14, 2018 at 12:02:31PM +0200, Cornelia Huck wrote:
> So, do you know from the outset that there will be such a coupled
> device? I.e., is it a property of the VM definition?
> 
> Can there be a 'prepared' virtio-net device that presents the STANDBY
> feature even if there currently is no vfio-handled device available --
> but making it possible to simply hotplug that device later?
> Should it be possible to add a virtio/vfio pair later on?

Yes, that's the plan, more or less.

> > >>> I think Qemu should check if guest virtio-net supports this feature and
> > >>> provide a mechanism for
> > >>> an upper layer indicating if the STANDBY feature is successfully
> > >>> negotiated or not.
> > >>> The upper layer can then decide if it should hot plug a VF with the same
> > >>> MAC and manage the 2 links.
> > >>> If VF is successfully hot plugged, virtio-net link should be disabled.  

BTW I see no reason to do this last part. The role of the
standby device is to be always there.

> > >>
> > >> Did you even talk to upper layer management about it?
> > >> Just list the steps they need to do and you will see
> > >> that's a lot of machinery to manage by the upper layer.
> > >>
> > >> What do we gain in flexibility? As far as I can see the
> > >> only gain is some resources saved for legacy VMs.
> > >>
> > >> That's not a lot as tenant of the upper layer probably already has
> > >> at least a hunch that it's a new guest otherwise
> > >> why bother specifying the feature at all - you
> > >> save even more resources without it.
> > >>  
> > >
> > > I am not all that familiar with how Qemu manages network devices. If we can
> > > do all the
> > > required management of the primary/standby devices within Qemu, that is
> > > definitely a better
> > > approach without upper layer involvement.  
> > 
> > Right. I would imagine in the extreme case the upper layer doesn't
> > have to be involved at all if QEMU manages all hot plug/unplug logic.
> > The management tool can supply passthrough device and virtio with the
> > same group UUID, QEMU auto-manages the presence of the primary, and
> > hot plug the device as needed before or after the migration.
> 
> I do not really see how you can manage that kind of stuff in QEMU only.

So right now failover is limited to pci passthrough devices only.
The idea is to realize the vfio device but not expose it
to guest. Have a separate command to expose it to guest.
Hotunplug would also hide it from guest but not unrealize it.

This will help ensure that e.g. on migration failure we can
re-expose the device without risk of running out of resources.
-- 
MST

^ permalink raw reply

* Re: [PATCH] optoe: driver to read/write SFP/QSFP EEPROMs
From: Don Bollinger @ 2018-06-15  2:26 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Tom Lendacky, Arnd Bergmann, Greg Kroah-Hartman, linux-kernel,
	brandon_chuang, wally_wang, roy_lee, rick_burchett, quentin.chang,
	steven.noble, jeffrey.townsend, scotte, roopa, David Ahern,
	luke.williams, Guohan Lu, Russell King, netdev@vger.kernel.org
In-Reply-To: <20180612181109.GD12251@lunn.ch>

On Tue, Jun 12, 2018 at 08:11:09PM +0200, Andrew Lunn wrote:
> > There's an SFP driver under drivers/net/phy.  Can that driver be extended
> > to provide this support?  Adding Russel King who developed sfp.c, as well
> > at the netdev mailing list.
> 
> I agree, the current SFP code should be used.
> 
> My observations seem to be there are two different ways {Q}SFP are used:
> 
> 1) The Linux kernel has full control, as assumed by the devlink/SFP
> frame work. We parse the SFP data to find the capabilities of the SFP
> and use it to program the MAC to use the correct mode. The MAC can be
> a NIC, but it can also be a switch. DSA is gaining support for
> PHYLINK, so SFP modules should just work with most switches which DSA
> support.  And there is no reason a plain switchdev switch can not use
> PHYLINK.
> 
> 2) Firmware is in control of the PHY layer, but there is a wish to
> expose some of the data which is available via i2c from the {Q}SFP to
> linux.
> 
> It appears this optoe supports this second case. It does not appear to
> support any in kernel API to actually make use of the SFP data in the
> kernel.
> 
> We should not be duplicating code. We should share the SFP code for
> both use cases above. There is also a Linux standard API for getting
> access to this information. ethtool -m/--module-info. Anything which
> is exporting {Q}SFP data needs to use this API.
> 
>    Andrew

Actually this is better described by a third use case.  The target
switches are PHY-less (see various designs at
www.compute.org/wiki/Networking/SpecsAndDesigns). The AS5712 for example
says "The AS5712-54X is a PHY-Less design with the SFP+ and QSFP+
connections directly attaching to the Serdes interfaces of the Broadcom
BCM56854 720G Trident 2 switching silicon..."

The electrical controls of the {Q}SFP devices (TxDisable for example)
are organized in a platform dependent way, through CPLD devices, and
managed by a platform specific CPLD driver.

The i2c bus is muxed from the CPU to all of the {Q}SFP devices, which
are set up as standard linux i2c devices
(/sys/bus/i2c/devices/i2c-xxxx).

There is no MDIO bus between the CPU and the {Q}SFP devices.

> 2) Firmware is in control of the PHY layer, but there is a wish to
> expose some of the data which is available via i2c from the {Q}SFP to
> linux.

So the switch silicon is in control of the PHY layer.  The platform
driver is in control of the electrical interfaces.  And the EEPROM data
is available via I2C.

And, there isn't actually 'a wish to expose' the EEPROM data to linux
(the kernel).  It turns out that none of the NOS partners I'm working
with use that data *in the kernel*.  It is all managed from user space.

More generally, I think sfp.c and optoe are not actually trying to
accomplish the same thing at all.  sfp.c combines all three functions
(PHY, electrical control, EEPROM access).  optoe is only providing EEPROM
access, and only to user space.  This is a real need in the white box
switch environment, and is not met by sfp.c.  optoe isn't better, sfp.c
isn't better, they're just different.

Finally, sfp.c does not recognize that SFP devices have data beyond 512
bytes, accessible via a page register.  It also does not recognize QSFP
devices at all.  QSFP devices have only 256 bytes accessible (one i2c
address) before switching to paged access for the remaining data.  The
first design requirement for optoe was to access all the available
pages, because there is information and controls that we (optics
vendors) want to make available to network management applications.

If sfp.c creates a standard linux i2c client for each SFP device, it
should be possible to create an optoe managed device 'under' sfp.c to
provide access to the full EEPROM address space:
  # echo optoe2 0x50 >/sys/bus/i2c/devices/i2c-xx/new_device
This might prove useful to user space consumers of that data.  We could
also easily add a kernel API (eg the nvmem framework) to optoe to provide
kernel access.  In other words, sfp.c could assign EEPROM management to
optoe, while managing the electrical interfaces.  (This is actually
pretty close to how the platfom drivers work in the switch environment.)
sfp.c would get SFP page support and QSFP EEPROM access 'for free'.

>                       There is also a Linux standard API for getting
> access to this information. ethtool -m/--module-info. Anything which
> is exporting {Q}SFP data needs to use this API.

optoe simply provides direct access from user space to the full EEPROM
data.  There is more data there than ethtool knows about, and in some
devices there are proprietary registers that ethtool will never know
about.  optoe does not interpret any of the EEPROM content (except the
bare minimum to access pages correctly).  optoe also does not get in the
way of ethtool.  It could prove to be a handy way for ethtool to access
new EEPROM fields in the future.  QSFP-DD/OSFP are coming soon, they
will have a different (incompatible) set of new fields to be decoded.

Bottom Line:  sfp.c is not a useful starting point for the switch
environment I'm working in.  The underlying hardware architecture is
quite different.  optoe is not a competing alternative.  Its only
function is to provide user-space access to the EEPROM data in {Q}SFP
devices.

Don

^ permalink raw reply

* Re: [PATCH 0/3] Use sbitmap instead of percpu_ida
From: Martin K. Petersen @ 2018-06-15  2:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Juergen Gross, Jens Axboe, kvm, linux-scsi, netdev, linux-usb,
	linux-kernel, virtualization, target-devel, qla2xxx-upstream,
	linux1394-devel, Kent Overstreet
In-Reply-To: <20180612190545.10781-1-willy@infradead.org>


Matthew,

> Removing the percpu_ida code nets over 400 lines of removal.  It's not
> as spectacular as deleting an entire architecture, but it's still a
> worthy reduction in lines of code.

Since most of the changes are in scsi or target, should I take this
series through my tree?

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

* Re: [PATCH] rds: avoid unenecessary cong_update in loop transport
From: David Miller @ 2018-06-15  2:02 UTC (permalink / raw)
  To: santosh.shilimkar; +Cc: netdev
In-Reply-To: <1529002354-16849-1-git-send-email-santosh.shilimkar@oracle.com>

From: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Date: Thu, 14 Jun 2018 11:52:34 -0700

> Loop transport which is self loopback, remote port congestion
> update isn't relevant. Infact the xmit path already ignores it.
> Receive path needs to do the same.
> 
> Reported-by: syzbot+4c20b3866171ce8441d2@syzkaller.appspotmail.com
> Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net-next 1/1] tc-testing: initial version of tunnel_key unit tests
From: David Miller @ 2018-06-15  2:01 UTC (permalink / raw)
  To: kleib; +Cc: netdev, jhs, xiyou.wangcong, jiri, lucasb
In-Reply-To: <1528999542-15621-1-git-send-email-kleib@mojatatu.com>

From: Keara Leibovitz <kleib@mojatatu.com>
Date: Thu, 14 Jun 2018 14:05:42 -0400

> Signed-off-by: Keara Leibovitz <kleib@mojatatu.com>

Please resubmit when net-next opens back up.

^ permalink raw reply

* Re: [PATCH] net: cxgb3: add error handling for sysfs_create_group
From: David Miller @ 2018-06-15  2:00 UTC (permalink / raw)
  To: jiazhouyang09; +Cc: santosh, netdev, linux-kernel
In-Reply-To: <1528984571-53320-1-git-send-email-jiazhouyang09@gmail.com>

From: Zhouyang Jia <jiazhouyang09@gmail.com>
Date: Thu, 14 Jun 2018 21:56:11 +0800

> diff --git a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c
> index 2edfdbd..73d6aa9 100644
> --- a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c
> +++ b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c
> @@ -3362,6 +3362,10 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>  
>  	err = sysfs_create_group(&adapter->port[0]->dev.kobj,
>  				 &cxgb3_attr_group);
> +	if (err) {
> +		dev_err(&pdev->dev, "cannot create sysfs group\n");
> +		goto out_free_dev;
> +	}

You have to do more than this to cleanup.  For example, you have to
change the LED state back, as it has just been set to indicate that
the interface is operational.

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-15  1:57 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
	Netdev, aaron.f.brown
In-Reply-To: <20180614120231.0a72bd5f.cohuck@redhat.com>

Thank you for sharing your thoughts, Cornelia. With questions below, I
think you raised really good points, some of which I don't have answer
yet and would also like to explore here.

First off, I don't want to push the discussion to the extreme at this
point, or sell anything about having QEMU manage everything
automatically. Don't get me wrong, it's not there yet. Let's don't
assume we are tied to a specific or concerte solution. I think the key
for our discussion might be to define or refine the boundary between
VM and guest,  e.g. what each layer is expected to control and manage
exactly.

In my view, there might be possibly 3 different options to represent
the failover device conceipt to QEMU and libvirt (or any upper layer
software):

a. Seperate device: in this model, virtio and passthough remains
separate devices just as today. QEMU exposes the standby feature bit
for virtio, and publish status/event around the negotiation process of
this feature bit for libvirt to react upon. Since Libvirt has the
pairing relationship itself, maybe through MAC address or something
else, it can control the presence of primary by hot plugging or
unplugging the passthrough device, although it has to work tightly
with virtio's feature negotation process. Not just for migration but
also various corner scenarios (driver/feature ok, device reset,
reboot, legacy guest etc) along virtio's feature negotiation.

b. Coupled device: in this model, virtio and passthough devices are
weakly coupled using some group ID, i.e. QEMU match the passthough
device for a standby virtio instance by comparing the group ID value
present behind each device's bridge. Libvirt provides QEMU the group
ID for both type of devices, and only deals with hot plug for
migration, by checking some migration status exposed (e.g. the feature
negotiation status on the virtio device) by QEMU. QEMU manages the
visibility of the primary in guest along virtio's feature negotiation
process.

c. Fully combined device: in this model, virtio and passthough devices
are viewed as a single VM interface altogther. QEMU not just controls
the visibility of the primary in guest, but can also manage the
exposure of the passthrough for migratability. It can be like that
libvirt supplies the group ID to QEMU. Or libvirt does not even have
to provide group ID for grouping the two devices, if just one single
combined device is exposed by QEMU. In either case, QEMU manages all
aspect of such internal construct, including virtio feature
negotiation, presence of the primary, and live migration.

It looks like to me that, in your opinion, you seem to prefer go with
(a). While I'm actually okay with either (b) or (c). Do I understand
your point correctly?

The reason that I feel that (a) might not be ideal, just as Michael
alluded to (quoting below), is that as management stack, it really
doesn't need to care about the detailed process of feature negotiation
(if we view the guest presence of the primary as part of feature
negotiation at an extended level not just virtio). All it needs to be
done is to hand in the required devices to QEMU and that's all. Why do
we need to addd various hooks, events for whichever happens internally
within the guest?

''
Primary device is added with a special "primary-failover" flag.
A virtual machine is then initialized with just a standby virtio
device. Primary is not yet added.

Later QEMU detects that guest driver device set DRIVER_OK.
It then exposes the primary device to the guest, and triggers
a device addition event (hot-plug event) for it.

If QEMU detects guest driver removal, it initiates a hot-unplug sequence
to remove the primary driver.  In particular, if QEMU detects guest
re-initialization (e.g. by detecting guest reset) it immediately removes
the primary device.
''

and,

''
management just wants to give the primary to guest and later take it back,
it really does not care about the details of the process,
so I don't see what does pushing it up the stack buy you.

So I don't think it *needs* to be done in libvirt. It probably can if you
add a bunch of hooks so it knows whenever vm reboots, driver binds and
unbinds from device, and can check that backup flag was set.
If you are pushing for a setup like that please get a buy-in
from libvirt maintainers or better write a patch.
''

Let me know if my clarifications sound clear to you now.


Thanks,
-Siwei


On Thu, Jun 14, 2018 at 3:02 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> I've been pointed to this discussion (which I had missed previously)
> and I'm getting a headache. Let me first summarize how I understand how
> this feature is supposed to work, then I'll respond to some individual
> points.
>
> The basic idea is to enable guests to migrate seamlessly, while still
> making it possible for them to use a passed-through device for more
> performance etc. The means to do so is to hook a virtio-net device
> together with a network device passed through via vfio. The
> vfio-handled device is there for performance, the virtio device for
> migratability. We have a new virtio feature bit for that which needs to
> be negotiated for that 'combined' device to be available. We have to
> consider two cases:
>
> - Older guests that do not support the new feature bit. We presume that
>   those guests will be confused if they get two network devices with
>   the same MAC, so the idea is to not show them the vfio-handled device
>   at all.
> - Guests that negotiate the feature bit. We only know positively that
>   they (a) know the feature bit and (b) are prepared to handle the
>   consequences of negotiating it after they set the FEATURES_OK bit.
>   This is therefore the earliest point in time that the vfio-handled
>   device should be visible or usable for the guest.
>
> On Wed, 13 Jun 2018 18:02:01 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Tue, Jun 12, 2018 at 5:08 PM, Samudrala, Sridhar
>> <sridhar.samudrala@intel.com> wrote:
>> > On 6/12/2018 4:34 AM, Michael S. Tsirkin wrote:
>> >>
>> >> On Mon, Jun 11, 2018 at 10:02:45PM -0700, Samudrala, Sridhar wrote:
>> >>>
>> >>> On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote:
>> >>>>
>> >>>> On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
>> >>>>>
>> >>>>> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
>> >>>>>>
>> >>>>>> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
>> >>>>>>>
>> >>>>>>> This feature bit can be used by hypervisor to indicate virtio_net
>> >>>>>>> device to
>> >>>>>>> act as a standby for another device with the same MAC address.
>> >>>>>>>
>> >>>>>>> I tested this with a small change to the patch to mark the STANDBY
>> >>>>>>> feature 'true'
>> >>>>>>> by default as i am using libvirt to start the VMs.
>> >>>>>>> Is there a way to pass the newly added feature bit 'standby' to qemu
>> >>>>>>> via libvirt
>> >>>>>>> XML file?
>> >>>>>>>
>> >>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> >>>>>>
>> >>>>>> So I do not think we can commit to this interface: we
>> >>>>>> really need to control visibility of the primary device.
>> >>>>>
>> >>>>> The problem is legacy guest won't use primary device at all if we do
>> >>>>> this.
>> >>>>
>> >>>> And that's by design - I think it's the only way to ensure the
>> >>>> legacy guest isn't confused.
>> >>>
>> >>> Yes. I think so. But i am not sure if Qemu is the right place to control
>> >>> the visibility
>> >>> of the primary device. The primary device may not be specified as an
>> >>> argument to Qemu. It
>> >>> may be plugged in later.
>> >>> The cloud service provider is providing a feature that enables low
>> >>> latency datapath and live
>> >>> migration capability.
>> >>> A tenant can use this feature only if he is running a VM that has
>> >>> virtio-net with failover support.
>
> So, do you know from the outset that there will be such a coupled
> device? I.e., is it a property of the VM definition?
>
> Can there be a 'prepared' virtio-net device that presents the STANDBY
> feature even if there currently is no vfio-handled device available --
> but making it possible to simply hotplug that device later?
>
> Should it be possible to add a virtio/vfio pair later on?
>
>> >>
>> >> Well live migration is there already. The new feature is low latency
>> >> data path.
>> >
>> >
>> > we get live migration with just virtio.  But I meant live migration with VF
>> > as
>> > primary device.
>> >
>> >>
>> >> And it's the guest that needs failover support not the VM.
>> >
>> >
>> > Isn't guest and VM synonymous?
>
> I think we need to be really careful to not mix up the two: The VM
> contains the definitions, but it is up to the guest how it uses them.
>
>> >
>> >
>> >>
>> >>
>> >>> I think Qemu should check if guest virtio-net supports this feature and
>> >>> provide a mechanism for
>> >>> an upper layer indicating if the STANDBY feature is successfully
>> >>> negotiated or not.
>> >>> The upper layer can then decide if it should hot plug a VF with the same
>> >>> MAC and manage the 2 links.
>> >>> If VF is successfully hot plugged, virtio-net link should be disabled.
>> >>
>> >> Did you even talk to upper layer management about it?
>> >> Just list the steps they need to do and you will see
>> >> that's a lot of machinery to manage by the upper layer.
>> >>
>> >> What do we gain in flexibility? As far as I can see the
>> >> only gain is some resources saved for legacy VMs.
>> >>
>> >> That's not a lot as tenant of the upper layer probably already has
>> >> at least a hunch that it's a new guest otherwise
>> >> why bother specifying the feature at all - you
>> >> save even more resources without it.
>> >>
>> >
>> > I am not all that familiar with how Qemu manages network devices. If we can
>> > do all the
>> > required management of the primary/standby devices within Qemu, that is
>> > definitely a better
>> > approach without upper layer involvement.
>>
>> Right. I would imagine in the extreme case the upper layer doesn't
>> have to be involved at all if QEMU manages all hot plug/unplug logic.
>> The management tool can supply passthrough device and virtio with the
>> same group UUID, QEMU auto-manages the presence of the primary, and
>> hot plug the device as needed before or after the migration.
>
> I do not really see how you can manage that kind of stuff in QEMU only.
> Have you talked to some libvirt folks? (And I'm not sure what you refer
> to with 'group UUID'?)
>
> Also, I think you need to make a distinction between hotplugging a
> device and making it visible to the guest. What does 'hotplugging' mean
> here? Adding it to the VM definition? Would it be enough to have the
> vfio-based device not operational until the virtio feature bit has been
> negotiated?
>
> What happens if the guest does not use the vfio-based device after it
> has been made available? Will you still disable the virtio-net link?
> (All that link handling definitely sounds like a task for libvirt or
> the like.)
>
> Regarding hot(un)plugging during migration, I think you also need to
> keep in mind that different architectures/busses have different
> semantics there. Something that works if there's an unplug handshake may
> not work on a platform with surprise removal.
>
> Have you considered guest agents? All of this is punching through
> several layers, and I'm not sure if that is a good idea.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCHv2 net-next] sctp: add support for SCTP_REUSE_PORT sockopt
From: David Miller @ 2018-06-15  1:54 UTC (permalink / raw)
  To: lucien.xin; +Cc: netdev, linux-sctp, marcelo.leitner, nhorman, tuexen
In-Reply-To: <26fda4cd22604d88a68a58c2b007231984e5f4f0.1528947888.git.lucien.xin@gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Thu, 14 Jun 2018 11:44:48 +0800

> This feature is actually already supported by sk->sk_reuse which can be
> set by socket level opt SO_REUSEADDR. But it's not working exactly as
> RFC6458 demands in section 8.1.27, like:
> 
>   - This option only supports one-to-one style SCTP sockets
>   - This socket option must not be used after calling bind()
>     or sctp_bindx().
> 
> Besides, SCTP_REUSE_PORT sockopt should be provided for user's programs.
> Otherwise, the programs with SCTP_REUSE_PORT from other systems will not
> work in linux.
> 
> To separate it from the socket level version, this patch adds 'reuse' in
> sctp_sock and it works pretty much as sk->sk_reuse, but with some extra
> setup limitations that are needed when it is being enabled.
> 
> "It should be noted that the behavior of the socket-level socket option
> to reuse ports and/or addresses for SCTP sockets is unspecified", so it
> leaves SO_REUSEADDR as is for the compatibility.
> 
> Note that the name SCTP_REUSE_PORT is kind of confusing, it is identical
> to SO_REUSEADDR with some extra restriction, so here it uses 'reuse' in
> sctp_sock instead of 'reuseport'. As for sk->sk_reuseport support for
> SCTP, it will be added in another patch.
> 
> Thanks to Neil to make this clear.
> 
> v1->v2:
>   - add sctp_sk->reuse to separate it from the socket level version.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Please resubmit this when net-next opens back up.

Thank you.

^ permalink raw reply

* Re: [PATCH bpf-net] selftests/bpf: delete xfrm tunnel when test exits.
From: Daniel Borkmann @ 2018-06-15  1:33 UTC (permalink / raw)
  To: Martin KaFai Lau, William Tu; +Cc: netdev, anders.roxell
In-Reply-To: <6d4000e4-dc27-23a4-d50e-be2b8b0555e1@iogearbox.net>

On 06/15/2018 02:05 AM, Daniel Borkmann wrote:
> On 06/15/2018 12:30 AM, Martin KaFai Lau wrote:
>> On Thu, Jun 14, 2018 at 05:01:06AM -0700, William Tu wrote:
>>> Make the printting of bpf xfrm tunnel better and
>>> cleanup xfrm state and policy when xfrm test finishes.
>> LGTM.  The subject tag actually meant s/bpf-net/bpf-next/?
>>
>> It makes sense to be in bpf-next but I think bpf-next is still closed.
>> Please repost later.
> 
> But given this fixes up missing cleanup of xfrm policy/state that was
> added earlier as part of the test, I think bpf would be fine. (Subject
> is a bit confusing indeed either bpf resp. net tree or bpf-next was
> meant.)

Ok, took it in, thanks William!

^ permalink raw reply

* RE: [PATCH net-next] hv_netvsc: Fix the variable sizes in ipsecv2 and rsc offload
From: Haiyang Zhang @ 2018-06-15  1:31 UTC (permalink / raw)
  To: David Miller
  Cc: netdev@vger.kernel.org, KY Srinivasan, Stephen Hemminger,
	olaf@aepfle.de, vkuznets@redhat.com, devel@linuxdriverproject.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20180614.170014.1767744227867675403.davem@davemloft.net>



> -----Original Message-----
> From: David Miller <davem@davemloft.net>
> Sent: Thursday, June 14, 2018 8:00 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>; haiyangz@linuxonhyperv.com
> Cc: netdev@vger.kernel.org; KY Srinivasan <kys@microsoft.com>; Stephen
> Hemminger <sthemmin@microsoft.com>; olaf@aepfle.de;
> vkuznets@redhat.com; devel@linuxdriverproject.org; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH net-next] hv_netvsc: Fix the variable sizes in ipsecv2 and
> rsc offload
> 
> 
> Bug fixes should be targetted at net, not net-next.  Furthermore, net-next is
> closed.

Thanks for the reminder. I just sent out the patch to "net" tree.

Thanks,
- Haiyang

^ permalink raw reply

* [PATCH net] hv_netvsc: Fix the variable sizes in ipsecv2 and rsc offload
From: Haiyang Zhang @ 2018-06-15  1:29 UTC (permalink / raw)
  To: davem, netdev
  Cc: haiyangz, kys, sthemmin, olaf, vkuznets, devel, linux-kernel

From: Haiyang Zhang <haiyangz@microsoft.com>

These fields in struct ndis_ipsecv2_offload and struct ndis_rsc_offload
are one byte according to the specs. This patch defines them with the
right size. These structs are not in use right now, but will be used soon.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/hyperv/hyperv_net.h | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index d31c0cd329a1..1a924b867b07 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -1277,17 +1277,17 @@ struct ndis_lsov2_offload {
 
 struct ndis_ipsecv2_offload {
 	u32	encap;
-	u16	ip6;
-	u16	ip4opt;
-	u16	ip6ext;
-	u16	ah;
-	u16	esp;
-	u16	ah_esp;
-	u16	xport;
-	u16	tun;
-	u16	xport_tun;
-	u16	lso;
-	u16	extseq;
+	u8	ip6;
+	u8	ip4opt;
+	u8	ip6ext;
+	u8	ah;
+	u8	esp;
+	u8	ah_esp;
+	u8	xport;
+	u8	tun;
+	u8	xport_tun;
+	u8	lso;
+	u8	extseq;
 	u32	udp_esp;
 	u32	auth;
 	u32	crypto;
@@ -1295,8 +1295,8 @@ struct ndis_ipsecv2_offload {
 };
 
 struct ndis_rsc_offload {
-	u16	ip4;
-	u16	ip6;
+	u8	ip4;
+	u8	ip6;
 };
 
 struct ndis_encap_offload {
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH 1/1] selftest: check tunnel type more accurately
From: Daniel Borkmann @ 2018-06-15  1:25 UTC (permalink / raw)
  To: Y Song, Wang Jian; +Cc: Alexei Starovoitov, Shuah Khan, netdev
In-Reply-To: <CAH3MdRWEdGBGmZ=HNd8P49m53jaNAszCwdcv3YEJ62-kHrYnZQ@mail.gmail.com>

On 06/13/2018 06:53 PM, Y Song wrote:
> On Wed, Jun 13, 2018 at 5:03 AM, Wang Jian <jianjian.wang1@gmail.com> wrote:
>> Grep tunnel type directly to make sure 'ip' command supports it.
>>
>> Signed-off-by: Jian Wang <jianjian.wang1@gmail.com>
> 
> Acked-by: Yonghong Song <yhs@fb.com>

Applied to bpf, thanks Jian!

^ permalink raw reply

* Re: [PATCH bpf 0/2] bpf: fix the load time reporting and make offload test more resilient
From: Daniel Borkmann @ 2018-06-15  1:15 UTC (permalink / raw)
  To: Jakub Kicinski, alexei.starovoitov; +Cc: netdev, oss-drivers
In-Reply-To: <20180614180656.14550-1-jakub.kicinski@netronome.com>

On 06/14/2018 08:06 PM, Jakub Kicinski wrote:
> Hi!
> 
> This small series allows test_offload.py selftest to run on modern
> distributions which may create BPF programs for cgroups at boot,
> like Ubuntu 18.04.  We still expect the program list to not be
> altered by any other agent while the test is running, but no longer
> depend on there being no BPF programs at all at the start.
> 
> Fixing the test revealed a small problem with bpftool, which doesn't
> report the program load time very accurately.  Because nanoseconds
> were not taken into account reported load time would fluctuate by
> 1 second.  First patch of the series takes care of fixing that.
> 
> Jakub Kicinski (2):
>   tools: bpftool: improve accuracy of load time
>   selftests/bpf: test offloads even with BPF programs present
> 
>  tools/bpf/bpftool/prog.c                    |  4 +++-
>  tools/testing/selftests/bpf/test_offload.py | 12 ++++++++++--
>  2 files changed, 13 insertions(+), 3 deletions(-)
> 

Applied to bpf, thanks Jakub!

^ permalink raw reply

* [PATCH net 1/2] tls: fix use-after-free in tls_push_record
From: Daniel Borkmann @ 2018-06-15  1:07 UTC (permalink / raw)
  To: davem; +Cc: davejwatson, netdev, Daniel Borkmann
In-Reply-To: <20180615010746.3099-1-daniel@iogearbox.net>

syzkaller managed to trigger a use-after-free in tls like the
following:

  BUG: KASAN: use-after-free in tls_push_record.constprop.15+0x6a2/0x810 [tls]
  Write of size 1 at addr ffff88037aa08000 by task a.out/2317

  CPU: 3 PID: 2317 Comm: a.out Not tainted 4.17.0+ #144
  Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET47W (1.21 ) 11/28/2016
  Call Trace:
   dump_stack+0x71/0xab
   print_address_description+0x6a/0x280
   kasan_report+0x258/0x380
   ? tls_push_record.constprop.15+0x6a2/0x810 [tls]
   tls_push_record.constprop.15+0x6a2/0x810 [tls]
   tls_sw_push_pending_record+0x2e/0x40 [tls]
   tls_sk_proto_close+0x3fe/0x710 [tls]
   ? tcp_check_oom+0x4c0/0x4c0
   ? tls_write_space+0x260/0x260 [tls]
   ? kmem_cache_free+0x88/0x1f0
   inet_release+0xd6/0x1b0
   __sock_release+0xc0/0x240
   sock_close+0x11/0x20
   __fput+0x22d/0x660
   task_work_run+0x114/0x1a0
   do_exit+0x71a/0x2780
   ? mm_update_next_owner+0x650/0x650
   ? handle_mm_fault+0x2f5/0x5f0
   ? __do_page_fault+0x44f/0xa50
   ? mm_fault_error+0x2d0/0x2d0
   do_group_exit+0xde/0x300
   __x64_sys_exit_group+0x3a/0x50
   do_syscall_64+0x9a/0x300
   ? page_fault+0x8/0x30
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happened through fault injection where aead_req allocation in
tls_do_encryption() eventually failed and we returned -ENOMEM from
the function. Turns out that the use-after-free is triggered from
tls_sw_sendmsg() in the second tls_push_record(). The error then
triggers a jump to waiting for memory in sk_stream_wait_memory()
resp. returning immediately in case of MSG_DONTWAIT. What follows is
the trim_both_sgl(sk, orig_size), which drops elements from the sg
list added via tls_sw_sendmsg(). Now the use-after-free gets triggered
when the socket is being closed, where tls_sk_proto_close() callback
is invoked. The tls_complete_pending_work() will figure that there's
a pending closed tls record to be flushed and thus calls into the
tls_push_pending_closed_record() from there. ctx->push_pending_record()
is called from the latter, which is the tls_sw_push_pending_record()
from sw path. This again calls into tls_push_record(). And here the
tls_fill_prepend() will panic since the buffer address has been freed
earlier via trim_both_sgl(). One way to fix it is to move the aead
request allocation out of tls_do_encryption() early into tls_push_record().
This means we don't prep the tls header and advance state to the
TLS_PENDING_CLOSED_RECORD before allocation which could potentially
fail happened. That fixes the issue on my side.

Fixes: 3c4d7559159b ("tls: kernel TLS support")
Reported-by: syzbot+5c74af81c547738e1684@syzkaller.appspotmail.com
Reported-by: syzbot+709f2810a6a05f11d4d3@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Dave Watson <davejwatson@fb.com>
---
 net/tls/tls_sw.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 34895b7..2945a3b 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -191,18 +191,12 @@ static void tls_free_both_sg(struct sock *sk)
 }
 
 static int tls_do_encryption(struct tls_context *tls_ctx,
-			     struct tls_sw_context_tx *ctx, size_t data_len,
-			     gfp_t flags)
+			     struct tls_sw_context_tx *ctx,
+			     struct aead_request *aead_req,
+			     size_t data_len)
 {
-	unsigned int req_size = sizeof(struct aead_request) +
-		crypto_aead_reqsize(ctx->aead_send);
-	struct aead_request *aead_req;
 	int rc;
 
-	aead_req = kzalloc(req_size, flags);
-	if (!aead_req)
-		return -ENOMEM;
-
 	ctx->sg_encrypted_data[0].offset += tls_ctx->tx.prepend_size;
 	ctx->sg_encrypted_data[0].length -= tls_ctx->tx.prepend_size;
 
@@ -219,7 +213,6 @@ static int tls_do_encryption(struct tls_context *tls_ctx,
 	ctx->sg_encrypted_data[0].offset -= tls_ctx->tx.prepend_size;
 	ctx->sg_encrypted_data[0].length += tls_ctx->tx.prepend_size;
 
-	kfree(aead_req);
 	return rc;
 }
 
@@ -228,8 +221,14 @@ static int tls_push_record(struct sock *sk, int flags,
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
 	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
+	struct aead_request *req;
 	int rc;
 
+	req = kzalloc(sizeof(struct aead_request) +
+		      crypto_aead_reqsize(ctx->aead_send), sk->sk_allocation);
+	if (!req)
+		return -ENOMEM;
+
 	sg_mark_end(ctx->sg_plaintext_data + ctx->sg_plaintext_num_elem - 1);
 	sg_mark_end(ctx->sg_encrypted_data + ctx->sg_encrypted_num_elem - 1);
 
@@ -245,15 +244,14 @@ static int tls_push_record(struct sock *sk, int flags,
 	tls_ctx->pending_open_record_frags = 0;
 	set_bit(TLS_PENDING_CLOSED_RECORD, &tls_ctx->flags);
 
-	rc = tls_do_encryption(tls_ctx, ctx, ctx->sg_plaintext_size,
-			       sk->sk_allocation);
+	rc = tls_do_encryption(tls_ctx, ctx, req, ctx->sg_plaintext_size);
 	if (rc < 0) {
 		/* If we are called from write_space and
 		 * we fail, we need to set this SOCK_NOSPACE
 		 * to trigger another write_space in the future.
 		 */
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
-		return rc;
+		goto out_req;
 	}
 
 	free_sg(sk, ctx->sg_plaintext_data, &ctx->sg_plaintext_num_elem,
@@ -268,6 +266,8 @@ static int tls_push_record(struct sock *sk, int flags,
 		tls_err_abort(sk, EBADMSG);
 
 	tls_advance_record_sn(sk, &tls_ctx->tx);
+out_req:
+	kfree(req);
 	return rc;
 }
 
-- 
2.9.5

^ permalink raw reply related

* [PATCH net 2/2] tls: fix waitall behavior in tls_sw_recvmsg
From: Daniel Borkmann @ 2018-06-15  1:07 UTC (permalink / raw)
  To: davem; +Cc: davejwatson, netdev, Daniel Borkmann
In-Reply-To: <20180615010746.3099-1-daniel@iogearbox.net>

Current behavior in tls_sw_recvmsg() is to wait for incoming tls
messages and copy up to exactly len bytes of data that the user
provided. This is problematic in the sense that i) if no packet
is currently queued in strparser we keep waiting until one has been
processed and pushed into tls receive layer for tls_wait_data() to
wake up and push the decrypted bits to user space. Given after
tls decryption, we're back at streaming data, use sock_rcvlowat()
hint from tcp socket instead. Retain current behavior with MSG_WAITALL
flag and otherwise use the hint target for breaking the loop and
returning to application. This is done if currently no ctx->recv_pkt
is ready, otherwise continue to process it from our strparser
backlog.

Fixes: c46234ebb4d1 ("tls: RX path for ktls")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Dave Watson <davejwatson@fb.com>
---
 net/tls/tls_sw.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 2945a3b..f127fac 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -754,7 +754,7 @@ int tls_sw_recvmsg(struct sock *sk,
 	struct sk_buff *skb;
 	ssize_t copied = 0;
 	bool cmsg = false;
-	int err = 0;
+	int target, err = 0;
 	long timeo;
 
 	flags |= nonblock;
@@ -764,6 +764,7 @@ int tls_sw_recvmsg(struct sock *sk,
 
 	lock_sock(sk);
 
+	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
 	timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
 	do {
 		bool zc = false;
@@ -856,6 +857,9 @@ int tls_sw_recvmsg(struct sock *sk,
 					goto recv_end;
 			}
 		}
+		/* If we have a new message from strparser, continue now. */
+		if (copied >= target && !ctx->recv_pkt)
+			break;
 	} while (len);
 
 recv_end:
-- 
2.9.5

^ permalink raw reply related

* [PATCH net 0/2] Two tls fixes
From: Daniel Borkmann @ 2018-06-15  1:07 UTC (permalink / raw)
  To: davem; +Cc: davejwatson, netdev, Daniel Borkmann

First one is syzkaller trigered uaf and second one noticed
while writing test code with tls ulp. For details please see
individual patches.

Thanks!

Daniel Borkmann (2):
  tls: fix use-after-free in tls_push_record
  tls: fix waitall behavior in tls_sw_recvmsg

 net/tls/tls_sw.c | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

-- 
2.9.5

^ permalink raw reply

* Re: [PATCH v2 2/3] bpfilter: include bpfilter_umh in assembly instead of using objcopy
From: Alexei Starovoitov @ 2018-06-15  0:47 UTC (permalink / raw)
  To: Masahiro Yamada
  Cc: netdev, Alexei Starovoitov, David S . Miller, Arnd Bergmann,
	Geert Uytterhoeven, linux-kernel, YueHaibing, Daniel Borkmann
In-Reply-To: <1528987172-19810-3-git-send-email-yamada.masahiro@socionext.com>

On Thu, Jun 14, 2018 at 11:39:31PM +0900, Masahiro Yamada wrote:
> What we want here is to embed a user-space program into the kernel.
> Instead of the complex ELF magic, let's simply wrap it in the assembly
> with the '.incbin' directive.
> 
> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
> ---
> 
> Changes in v2:
>   - Rebase
> 
>  net/bpfilter/Makefile            | 15 ++-------------
>  net/bpfilter/bpfilter_kern.c     | 11 +++++------
>  net/bpfilter/bpfilter_umh_blob.S |  7 +++++++
>  3 files changed, 14 insertions(+), 19 deletions(-)
>  create mode 100644 net/bpfilter/bpfilter_umh_blob.S
> 
> diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
> index e0bbe75..39c6980 100644
> --- a/net/bpfilter/Makefile
> +++ b/net/bpfilter/Makefile
> @@ -15,18 +15,7 @@ ifeq ($(CONFIG_BPFILTER_UMH), y)
>  HOSTLDFLAGS += -static
>  endif
>  
> -# a bit of elf magic to convert bpfilter_umh binary into a binary blob
> -# inside bpfilter_umh.o elf file referenced by
> -# _binary_net_bpfilter_bpfilter_umh_start symbol
> -# which bpfilter_kern.c passes further into umh blob loader at run-time
> -quiet_cmd_copy_umh = GEN $@
> -      cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
> -      $(OBJCOPY) -I binary -O `$(OBJDUMP) -f $<|grep format|cut -d' ' -f8` \
> -      -B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \
> -      --rename-section .data=.init.rodata $< $@
> -
> -$(obj)/bpfilter_umh.o: $(obj)/bpfilter_umh
> -	$(call cmd,copy_umh)
> +$(obj)/bpfilter_umh_blob.o: $(obj)/bpfilter_umh
>  
>  obj-$(CONFIG_BPFILTER_UMH) += bpfilter.o
> -bpfilter-objs += bpfilter_kern.o bpfilter_umh.o
> +bpfilter-objs += bpfilter_kern.o bpfilter_umh_blob.o
> diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
> index 0952257..6de3ae5 100644
> --- a/net/bpfilter/bpfilter_kern.c
> +++ b/net/bpfilter/bpfilter_kern.c
> @@ -10,11 +10,8 @@
>  #include <linux/file.h>
>  #include "msgfmt.h"
>  
> -#define UMH_start _binary_net_bpfilter_bpfilter_umh_start
> -#define UMH_end _binary_net_bpfilter_bpfilter_umh_end
> -
> -extern char UMH_start;
> -extern char UMH_end;
> +extern char bpfilter_umh_start;
> +extern char bpfilter_umh_end;
>  
>  static struct umh_info info;
>  /* since ip_getsockopt() can run in parallel, serialize access to umh */
> @@ -93,7 +90,9 @@ static int __init load_umh(void)
>  	int err;
>  
>  	/* fork usermode process */
> -	err = fork_usermode_blob(&UMH_start, &UMH_end - &UMH_start, &info);
> +	err = fork_usermode_blob(&bpfilter_umh_end,
> +				 &bpfilter_umh_end - &bpfilter_umh_start,
> +				 &info);
>  	if (err)
>  		return err;
>  	pr_info("Loaded bpfilter_umh pid %d\n", info.pid);
> diff --git a/net/bpfilter/bpfilter_umh_blob.S b/net/bpfilter/bpfilter_umh_blob.S
> new file mode 100644
> index 0000000..40311d1
> --- /dev/null
> +++ b/net/bpfilter/bpfilter_umh_blob.S
> @@ -0,0 +1,7 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +	.section .init.rodata, "a"
> +	.global bpfilter_umh_start
> +bpfilter_umh_start:
> +	.incbin "net/bpfilter/bpfilter_umh"
> +	.global bpfilter_umh_end
> +bpfilter_umh_end:

for some reason it doesn't work.
fork_usermode_blob() returns ENOEXEC
You should be able to test it simply running 'iptables -L'.
Without this patch you should see:
[   12.696937] bpfilter: Loaded bpfilter_umh pid 225
Started bpfilter

where first line comes from kernel module and second from umh.

^ permalink raw reply

* [PATCH bpf 2/2] bpf: reject any prog that failed read-only lock
From: Daniel Borkmann @ 2018-06-15  0:30 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann
In-Reply-To: <20180615003048.3219-1-daniel@iogearbox.net>

We currently lock any JITed image as read-only via bpf_jit_binary_lock_ro()
as well as the BPF image as read-only through bpf_prog_lock_ro(). In
the case any of these would fail we throw a WARN_ON_ONCE() in order to
yell loudly to the log. Perhaps, to some extend, this may be comparable
to an allocation where __GFP_NOWARN is explicitly not set.

Added via 65869a47f348 ("bpf: improve read-only handling"), this behavior
is slightly different compared to any of the other in-kernel set_memory_ro()
users who do not check the return code of set_memory_ro() and friends /at
all/ (e.g. in the case of module_enable_ro() / module_disable_ro()). Given
in BPF this is mandatory hardening step, we want to know whether there
are any issues that would leave both BPF data writable. So it happens
that syzkaller enabled fault injection and it triggered memory allocation
failure deep inside x86's change_page_attr_set_clr() which was triggered
from set_memory_ro().

Now, there are two options: i) leaving everything as is, and ii) reworking
the image locking code in order to have a final checkpoint out of the
central bpf_prog_select_runtime() which probes whether any of the calls
during prog setup weren't successful, and then bailing out with an error.
Option ii) is a better approach since this additional paranoia avoids
altogether leaving any potential W+X pages from BPF side in the system.
Therefore, lets be strict about it, and reject programs in such unlikely
occasion. While testing I noticed also that one bpf_prog_lock_ro()
call was missing on the outer dummy prog in case of calls, e.g. in the
destructor we call bpf_prog_free_deferred() on the main prog where we
try to bpf_prog_unlock_free() the program, and since we go via
bpf_prog_select_runtime() do that as well.

Reported-by: syzbot+3b889862e65a98317058@syzkaller.appspotmail.com
Reported-by: syzbot+9e762b52dd17e616a7a5@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 include/linux/filter.h | 60 ++++++++++++++++++++++++++++++++------------------
 kernel/bpf/core.c      | 55 +++++++++++++++++++++++++++++++++++++++------
 kernel/bpf/syscall.c   |  4 +---
 3 files changed, 87 insertions(+), 32 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 297c56f..108f981 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -469,7 +469,8 @@ struct sock_fprog_kern {
 };
 
 struct bpf_binary_header {
-	unsigned int pages;
+	u16 pages;
+	u16 locked:1;
 	u8 image[];
 };
 
@@ -671,15 +672,18 @@ bpf_ctx_narrow_access_ok(u32 off, u32 size, u32 size_default)
 
 #define bpf_classic_proglen(fprog) (fprog->len * sizeof(fprog->filter[0]))
 
-#ifdef CONFIG_ARCH_HAS_SET_MEMORY
 static inline void bpf_prog_lock_ro(struct bpf_prog *fp)
 {
+#ifdef CONFIG_ARCH_HAS_SET_MEMORY
 	fp->locked = 1;
-	WARN_ON_ONCE(set_memory_ro((unsigned long)fp, fp->pages));
+	if (set_memory_ro((unsigned long)fp, fp->pages))
+		fp->locked = 0;
+#endif
 }
 
 static inline void bpf_prog_unlock_ro(struct bpf_prog *fp)
 {
+#ifdef CONFIG_ARCH_HAS_SET_MEMORY
 	if (fp->locked) {
 		WARN_ON_ONCE(set_memory_rw((unsigned long)fp, fp->pages));
 		/* In case set_memory_rw() fails, we want to be the first
@@ -687,34 +691,30 @@ static inline void bpf_prog_unlock_ro(struct bpf_prog *fp)
 		 */
 		fp->locked = 0;
 	}
+#endif
 }
 
 static inline void bpf_jit_binary_lock_ro(struct bpf_binary_header *hdr)
 {
-	WARN_ON_ONCE(set_memory_ro((unsigned long)hdr, hdr->pages));
-}
-
-static inline void bpf_jit_binary_unlock_ro(struct bpf_binary_header *hdr)
-{
-	WARN_ON_ONCE(set_memory_rw((unsigned long)hdr, hdr->pages));
-}
-#else
-static inline void bpf_prog_lock_ro(struct bpf_prog *fp)
-{
-}
-
-static inline void bpf_prog_unlock_ro(struct bpf_prog *fp)
-{
-}
-
-static inline void bpf_jit_binary_lock_ro(struct bpf_binary_header *hdr)
-{
+#ifdef CONFIG_ARCH_HAS_SET_MEMORY
+	hdr->locked = 1;
+	if (set_memory_ro((unsigned long)hdr, hdr->pages))
+		hdr->locked = 0;
+#endif
 }
 
 static inline void bpf_jit_binary_unlock_ro(struct bpf_binary_header *hdr)
 {
+#ifdef CONFIG_ARCH_HAS_SET_MEMORY
+	if (hdr->locked) {
+		WARN_ON_ONCE(set_memory_rw((unsigned long)hdr, hdr->pages));
+		/* In case set_memory_rw() fails, we want to be the first
+		 * to crash here instead of some random place later on.
+		 */
+		hdr->locked = 0;
+	}
+#endif
 }
-#endif /* CONFIG_ARCH_HAS_SET_MEMORY */
 
 static inline struct bpf_binary_header *
 bpf_jit_binary_hdr(const struct bpf_prog *fp)
@@ -725,6 +725,22 @@ bpf_jit_binary_hdr(const struct bpf_prog *fp)
 	return (void *)addr;
 }
 
+#ifdef CONFIG_ARCH_HAS_SET_MEMORY
+static inline int bpf_prog_check_pages_ro_single(const struct bpf_prog *fp)
+{
+	if (!fp->locked)
+		return -ENOLCK;
+	if (fp->jited) {
+		const struct bpf_binary_header *hdr = bpf_jit_binary_hdr(fp);
+
+		if (!hdr->locked)
+			return -ENOLCK;
+	}
+
+	return 0;
+}
+#endif
+
 int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap);
 static inline int sk_filter(struct sock *sk, struct sk_buff *skb)
 {
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 1061968..a9e6c04 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -598,6 +598,8 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 	bpf_fill_ill_insns(hdr, size);
 
 	hdr->pages = size / PAGE_SIZE;
+	hdr->locked = 0;
+
 	hole = min_t(unsigned int, size - (proglen + sizeof(*hdr)),
 		     PAGE_SIZE - sizeof(*hdr));
 	start = (get_random_int() % hole) & ~(alignment - 1);
@@ -1448,6 +1450,33 @@ static int bpf_check_tail_call(const struct bpf_prog *fp)
 	return 0;
 }
 
+static int bpf_prog_check_pages_ro_locked(const struct bpf_prog *fp)
+{
+#ifdef CONFIG_ARCH_HAS_SET_MEMORY
+	int i, err;
+
+	for (i = 0; i < fp->aux->func_cnt; i++) {
+		err = bpf_prog_check_pages_ro_single(fp->aux->func[i]);
+		if (err)
+			return err;
+	}
+
+	return bpf_prog_check_pages_ro_single(fp);
+#endif
+	return 0;
+}
+
+static void bpf_prog_select_func(struct bpf_prog *fp)
+{
+#ifndef CONFIG_BPF_JIT_ALWAYS_ON
+	u32 stack_depth = max_t(u32, fp->aux->stack_depth, 1);
+
+	fp->bpf_func = interpreters[(round_up(stack_depth, 32) / 32) - 1];
+#else
+	fp->bpf_func = __bpf_prog_ret0_warn;
+#endif
+}
+
 /**
  *	bpf_prog_select_runtime - select exec runtime for BPF program
  *	@fp: bpf_prog populated with internal BPF program
@@ -1458,13 +1487,13 @@ static int bpf_check_tail_call(const struct bpf_prog *fp)
  */
 struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
 {
-#ifndef CONFIG_BPF_JIT_ALWAYS_ON
-	u32 stack_depth = max_t(u32, fp->aux->stack_depth, 1);
+	/* In case of BPF to BPF calls, verifier did all the prep
+	 * work with regards to JITing, etc.
+	 */
+	if (fp->bpf_func)
+		goto finalize;
 
-	fp->bpf_func = interpreters[(round_up(stack_depth, 32) / 32) - 1];
-#else
-	fp->bpf_func = __bpf_prog_ret0_warn;
-#endif
+	bpf_prog_select_func(fp);
 
 	/* eBPF JITs can rewrite the program in case constant
 	 * blinding is active. However, in case of error during
@@ -1485,6 +1514,8 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
 		if (*err)
 			return fp;
 	}
+
+finalize:
 	bpf_prog_lock_ro(fp);
 
 	/* The tail call compatibility check can only be done at
@@ -1493,7 +1524,17 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
 	 * all eBPF JITs might immediately support all features.
 	 */
 	*err = bpf_check_tail_call(fp);
-
+	if (*err)
+		return fp;
+
+	/* Checkpoint: at this point onwards any cBPF -> eBPF or
+	 * native eBPF program is read-only. If we failed to change
+	 * the page attributes (e.g. allocation failure from
+	 * splitting large pages), then reject the whole program
+	 * in order to guarantee not ending up with any W+X pages
+	 * from BPF side in kernel.
+	 */
+	*err = bpf_prog_check_pages_ro_locked(fp);
 	return fp;
 }
 EXPORT_SYMBOL_GPL(bpf_prog_select_runtime);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0f62692..35dc466 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1353,9 +1353,7 @@ static int bpf_prog_load(union bpf_attr *attr)
 	if (err < 0)
 		goto free_used_maps;
 
-	/* eBPF program is ready to be JITed */
-	if (!prog->bpf_func)
-		prog = bpf_prog_select_runtime(prog, &err);
+	prog = bpf_prog_select_runtime(prog, &err);
 	if (err < 0)
 		goto free_used_maps;
 
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf 0/2] Two bpf fixes
From: Daniel Borkmann @ 2018-06-15  0:30 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann

First one is a panic I ran into while testing the second
one where we got several syzkaller reports. Series here
fixes both.

Thanks!

Daniel Borkmann (2):
  bpf: fix panic in prog load calls cleanup
  bpf: reject any prog that failed read-only lock

 include/linux/filter.h | 63 +++++++++++++++++++++++++++++----------------
 kernel/bpf/core.c      | 69 +++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/bpf/syscall.c   | 12 +++------
 3 files changed, 106 insertions(+), 38 deletions(-)

-- 
2.9.5

^ permalink raw reply

* [PATCH bpf 1/2] bpf: fix panic in prog load calls cleanup
From: Daniel Borkmann @ 2018-06-15  0:30 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann
In-Reply-To: <20180615003048.3219-1-daniel@iogearbox.net>

While testing I found that when hitting error path in bpf_prog_load()
where we jump to free_used_maps and prog contained BPF to BPF calls
that were JITed earlier, then we never clean up the bpf_prog_kallsyms_add()
done under jit_subprogs(). Add proper API to make BPF kallsyms deletion
more clear and fix that.

Fixes: 1c2a088a6626 ("bpf: x64: add JIT support for multi-function programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 include/linux/filter.h |  3 +++
 kernel/bpf/core.c      | 14 ++++++++++++++
 kernel/bpf/syscall.c   |  8 ++------
 3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 45fc0f5..297c56f 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -961,6 +961,9 @@ static inline void bpf_prog_kallsyms_del(struct bpf_prog *fp)
 }
 #endif /* CONFIG_BPF_JIT */
 
+void bpf_prog_kallsyms_del_subprogs(struct bpf_prog *fp);
+void bpf_prog_kallsyms_del_all(struct bpf_prog *fp);
+
 #define BPF_ANC		BIT(15)
 
 static inline bool bpf_needs_clear_a(const struct sock_filter *first)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 9f14937..1061968 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -350,6 +350,20 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 	return prog_adj;
 }
 
+void bpf_prog_kallsyms_del_subprogs(struct bpf_prog *fp)
+{
+	int i;
+
+	for (i = 0; i < fp->aux->func_cnt; i++)
+		bpf_prog_kallsyms_del(fp->aux->func[i]);
+}
+
+void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
+{
+	bpf_prog_kallsyms_del_subprogs(fp);
+	bpf_prog_kallsyms_del(fp);
+}
+
 #ifdef CONFIG_BPF_JIT
 /* All BPF JIT sysctl knobs here. */
 int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0fa2062..0f62692 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1034,14 +1034,9 @@ static void __bpf_prog_put_rcu(struct rcu_head *rcu)
 static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock)
 {
 	if (atomic_dec_and_test(&prog->aux->refcnt)) {
-		int i;
-
 		/* bpf_prog_free_id() must be called first */
 		bpf_prog_free_id(prog, do_idr_lock);
-
-		for (i = 0; i < prog->aux->func_cnt; i++)
-			bpf_prog_kallsyms_del(prog->aux->func[i]);
-		bpf_prog_kallsyms_del(prog);
+		bpf_prog_kallsyms_del_all(prog);
 
 		call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
 	}
@@ -1384,6 +1379,7 @@ static int bpf_prog_load(union bpf_attr *attr)
 	return err;
 
 free_used_maps:
+	bpf_prog_kallsyms_del_subprogs(prog);
 	free_used_maps(prog->aux);
 free_prog:
 	bpf_prog_uncharge_memlock(prog);
-- 
2.9.5

^ permalink raw reply related

* Re: [bpf PATCH v2 2/6] bpf: sockmap only allow ESTABLISHED sock state
From: Martin KaFai Lau @ 2018-06-15  0:18 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20180614164451.24994.31096.stgit@john-Precision-Tower-5810>

On Thu, Jun 14, 2018 at 09:44:52AM -0700, John Fastabend wrote:
> Per the note in the TLS ULP (which is actually a generic statement
> regarding ULPs)
> 
>  /* The TLS ulp is currently supported only for TCP sockets
>   * in ESTABLISHED state.
>   * Supporting sockets in LISTEN state will require us
>   * to modify the accept implementation to clone rather then
>   * share the ulp context.
>   */
Can you explain how that apply to bpf_tcp ulp?

My understanding is the "ulp context" referred in TLS ulp is
the tls_context stored in icsk_ulp_data but I don't see bpf_tcp's
ulp is using icsk_ulp_data.

Others LGTM.

> 
> After this patch we only allow socks that are in ESTABLISHED state or
> are being added via a sock_ops event that is transitioning into an
> ESTABLISHED state. By allowing sock_ops events we allow users to
> manage sockmaps directly from sock ops programs. The two supported
> sock_ops ops are BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB and
> BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB.
> 
> >From the userspace BPF update API the sock lock is also taken now
> to ensure we don't race with state changes after the ESTABLISHED
> check. The BPF program sock ops hook already has the sock lock
> taken.
> 
> Also tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
> 'netperf -H [IPv4]'.
> 
> Reported-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  0 files changed
> 
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index f6dd4cd..f1ab52d 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -1976,13 +1976,20 @@ static int sock_map_update_elem(struct bpf_map *map,
>  		return -EINVAL;
>  	}
>  
> +	lock_sock(skops.sk);
> +	/* ULPs are currently supported only for TCP sockets in ESTABLISHED
> +	 * state.
> +	 */
>  	if (skops.sk->sk_type != SOCK_STREAM ||
> -	    skops.sk->sk_protocol != IPPROTO_TCP) {
> -		fput(socket->file);
> -		return -EOPNOTSUPP;
> +	    skops.sk->sk_protocol != IPPROTO_TCP ||
> +	    skops.sk->sk_state != TCP_ESTABLISHED) {
> +		err = -EOPNOTSUPP;
> +		goto out;
>  	}
>  
>  	err = sock_map_ctx_update_elem(&skops, map, key, flags);
> +out:
> +	release_sock(skops.sk);
>  	fput(socket->file);
>  	return err;
>  }
> @@ -2247,10 +2254,6 @@ static int sock_hash_ctx_update_elem(struct bpf_sock_ops_kern *skops,
>  
>  	sock = skops->sk;
>  
> -	if (sock->sk_type != SOCK_STREAM ||
> -	    sock->sk_protocol != IPPROTO_TCP)
> -		return -EOPNOTSUPP;
> -
>  	if (unlikely(map_flags > BPF_EXIST))
>  		return -EINVAL;
>  
> @@ -2338,7 +2341,20 @@ static int sock_hash_update_elem(struct bpf_map *map,
>  		return -EINVAL;
>  	}
>  
> +	lock_sock(skops.sk);
> +	/* ULPs are currently supported only for TCP sockets in ESTABLISHED
> +	 * state.
> +	 */
> +	if (skops.sk->sk_type != SOCK_STREAM ||
> +	    skops.sk->sk_protocol != IPPROTO_TCP ||
> +	    skops.sk->sk_state != TCP_ESTABLISHED) {
> +		err = -EOPNOTSUPP;
> +		goto out;
> +	}
> +
>  	err = sock_hash_ctx_update_elem(&skops, map, key, flags);
> +out:
> +	release_sock(skops.sk);
>  	fput(socket->file);
>  	return err;
>  }
> @@ -2423,10 +2439,19 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key)
>  	.map_delete_elem = sock_hash_delete_elem,
>  };
>  
> +static bool bpf_is_valid_sock(struct bpf_sock_ops_kern *ops)
> +{
> +	return ops->op == BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB ||
> +	       ops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB;
> +}
> +
>  BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,
>  	   struct bpf_map *, map, void *, key, u64, flags)
>  {
>  	WARN_ON_ONCE(!rcu_read_lock_held());
> +
> +	if (!bpf_is_valid_sock(bpf_sock))
> +		return -EOPNOTSUPP;
>  	return sock_map_ctx_update_elem(bpf_sock, map, key, flags);
>  }
>  
> @@ -2445,6 +2470,9 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key)
>  	   struct bpf_map *, map, void *, key, u64, flags)
>  {
>  	WARN_ON_ONCE(!rcu_read_lock_held());
> +
> +	if (!bpf_is_valid_sock(bpf_sock))
> +		return -EOPNOTSUPP;
>  	return sock_hash_ctx_update_elem(bpf_sock, map, key, flags);
>  }
>  
> 

^ permalink raw reply

* Re: [PATCH net 0/4] l2tp: pppol2tp_connect() fixes
From: David Miller @ 2018-06-15  0:10 UTC (permalink / raw)
  To: g.nault; +Cc: netdev, jchapman
In-Reply-To: <cover.1528887257.git.g.nault@alphalink.fr>

From: Guillaume Nault <g.nault@alphalink.fr>
Date: Wed, 13 Jun 2018 15:09:16 +0200

> This series fixes a few remaining issues with pppol2tp_connect().
> 
> It doesn't try to prevent invalid configurations that have no effect on
> kernel's reliability. That would be work for a future patch set.
> 
> Patch 2 is the most important as it avoids an invalid pointer
> dereference crashing the kernel. It depends on patch 1 for correctly
> identifying L2TP session types.
> 
> Patches 3 and 4 avoid creating stale tunnels and sessions.

Series applied, thank you.

^ permalink raw reply

* Re: [PATCH net-next] tcp: add SNMP counter for zero-window drops
From: David Miller @ 2018-06-15  0:09 UTC (permalink / raw)
  To: laoar.shao; +Cc: edumazet, netdev, linux-kernel
In-Reply-To: <1528889908-15980-1-git-send-email-laoar.shao@gmail.com>

From: Yafang Shao <laoar.shao@gmail.com>
Date: Wed, 13 Jun 2018 07:38:28 -0400

> It will be helpful if we could display the drops due to zero window or no
> enough window space.
> So a new SNMP MIB entry is added to track this behavior.
> This entry is named LINUX_MIB_TCPZEROWINDOWDROP and published in
> /proc/net/netstat in TcpExt line as TCPZeroWindowDrop.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>

net-next is closed, please resubmit this when net-next opens
back up.

Thank you.

^ permalink raw reply

* Re: [PATCH 0/4] emaclite bug fixes and code cleanup
From: David Miller @ 2018-06-15  0:08 UTC (permalink / raw)
  To: radhey.shyam.pandey; +Cc: michal.simek, netdev, linux-arm-kernel, linux-kernel
In-Reply-To: <1528871719-1681-1-git-send-email-radhey.shyam.pandey@xilinx.com>

From: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
Date: Wed, 13 Jun 2018 12:05:15 +0530

> This patch series fixes bug in emaclite remove and mdio_setup routines.
> It does minor code cleanup.

Series applied.

^ permalink raw reply

* Re: [PATCH v1 net-next] stmmac: fix DMA channel hang in half-duplex mode
From: David Miller @ 2018-06-15  0:06 UTC (permalink / raw)
  To: vbhadram; +Cc: peppe.cavallaro, alexandre.torgue, joabreu, netdev, narayanr
In-Reply-To: <1528864248-2246-1-git-send-email-vbhadram@nvidia.com>

From: Bhadram Varka <vbhadram@nvidia.com>
Date: Wed, 13 Jun 2018 10:00:48 +0530

> HW does not support Half-duplex mode in multi-queue
> scenario. Fix it by not advertising the Half-Duplex
> mode if multi-queue enabled.
> 
> Signed-off-by: Bhadram Varka <vbhadram@nvidia.com>

Bug fixes should be submitted against net, not net-next.  And
net-next is closed for submissions at this time.

Thanks.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox