Netdev List

* Re: [PATCH net] KEYS: DNS: fix parsing multiple options
From: Simon Horman @ 2018-06-11 18:08 UTC
  To: Eric Biggers
  Cc: netdev, David S . Miller, keyrings, David Howells, Wang Lei,
	Eric Biggers
In-Reply-To: <20180611175742.GA33284@gmail.com>

On Mon, Jun 11, 2018 at 10:57:42AM -0700, Eric Biggers wrote:
> Hi Simon,
> 
> On Mon, Jun 11, 2018 at 11:40:23AM +0200, Simon Horman wrote:
> > On Fri, Jun 08, 2018 at 09:20:37AM -0700, Eric Biggers wrote:
> > > From: Eric Biggers <ebiggers@google.com>
> > > 
> > > My recent fix for dns_resolver_preparse() printing very long strings was
> > > incomplete, as shown by syzbot which still managed to hit the
> > > WARN_ONCE() in set_precision() by adding a crafted "dns_resolver" key:
> > > 
> > >     precision 50001 too large
> > >     WARNING: CPU: 7 PID: 864 at lib/vsprintf.c:2164 vsnprintf+0x48a/0x5a0
> > > 
> > > The bug this time isn't just a printing bug, but also a logical error
> > > when multiple options ("#"-separated strings) are given in the key
> > > payload.  Specifically, when separating an option string into name and
> > > value, if there is no value then the name is incorrectly considered to
> > > end at the end of the key payload, rather than the end of the current
> > > option.  This bypasses validation of the option length, and also means
> > > that specifying multiple options is broken -- which presumably has gone
> > > unnoticed as there is currently only one valid option anyway.
> > > 
> > > Fix it by correctly calculating the length of the option name.
> > > 
> > > Reproducer:
> > > 
> > >     perl -e 'print "#A#", "\x00" x 50000' | keyctl padd dns_resolver desc @s
> > > 
> > > Fixes: 4a2d789267e0 ("DNS: If the DNS server returns an error, allow that to be cached [ver #2]")
> > > Signed-off-by: Eric Biggers <ebiggers@google.com>
> > > ---
> > >  net/dns_resolver/dns_key.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/net/dns_resolver/dns_key.c b/net/dns_resolver/dns_key.c
> > > index 40c851693f77e..d448823d4d2ed 100644
> > > --- a/net/dns_resolver/dns_key.c
> > > +++ b/net/dns_resolver/dns_key.c
> > > @@ -97,7 +97,7 @@ dns_resolver_preparse(struct key_preparsed_payload *prep)
> > >  				return -EINVAL;
> > >  			}
> > >  
> > > -			eq = memchr(opt, '=', opt_len) ?: end;
> > > +			eq = memchr(opt, '=', opt_len) ?: next_opt;
> > >  			opt_nlen = eq - opt;
> > >  			eq++;
> > 
> > It seems risky to advance eq++ in the case where the value is empty.
> > It is then not pointing to the value, but it may be accessed twice further on
> > in this loop.
> > 
> 
> Sure, that's a separate existing issue though, and it must be checked that the
> value is present before using it anyway, which the code already does, so it's
> not a "real" bug.  I think I'll keep this patch simple and leave that part as-is
> for now.

Thanks Eric, I was reflecting on that too. I agree that your patch resolves
a problem without introducing a new one.

Reviewed-by: Simon Horman <simon.horman@netronome.com>
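
For readers following the fix, here is a minimal userspace sketch of the name-length calculation discussed above. It is not the kernel code: the helper names option_name_len() and option_name_len_old() are hypothetical, and standard C is used in place of the GNU "?:" shorthand from the patch. It only illustrates why falling back to next_opt rather than end keeps a value-less option name bounded by its own option.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not in the kernel): return the length of the option
 * name in [opt, next_opt), where next_opt points at the '#' that ends this
 * option, or at the end of the payload if this is the last option.
 */
static size_t option_name_len(const char *opt, const char *next_opt)
{
	const char *eq = memchr(opt, '=', (size_t)(next_opt - opt));

	/* No '=': the name ends where this option ends. */
	if (!eq)
		eq = next_opt;
	return (size_t)(eq - opt);
}

/* The pre-fix behaviour, for comparison: fall back to the payload end. */
static size_t option_name_len_old(const char *opt, const char *next_opt,
				  const char *end)
{
	const char *eq = memchr(opt, '=', (size_t)(next_opt - opt));

	/* No '=': the name is (wrongly) taken to run to the payload end. */
	if (!eq)
		eq = end;
	return (size_t)(eq - opt);
}
```

With a payload shaped like the reproducer ("#A#" followed by NUL bytes), the first helper reports a one-byte name for "A", while the old fallback reports a name that runs to the end of the payload, which is how the oversized precision reached the printing code.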

* Re: [PATCH 4/6] arcnet: com20020: bindings for smsc com20020
From: Rob Herring @ 2018-06-11 18:09 UTC
  To: Andrea Greco
  Cc: davem, tobin, Andrea Greco, Mark Rutland, netdev, devicetree,
	linux-kernel
In-Reply-To: <20180611142705.20849-1-andrea.greco.gapmilano@gmail.com>

On Mon, Jun 11, 2018 at 04:27:01PM +0200, Andrea Greco wrote:
> From: Andrea Greco <a.greco@4sigma.it>
> 
> Add devicetree bindings for smsc com20020
> 
> Signed-off-by: Andrea Greco <a.greco@4sigma.it>
> ---
>  .../devicetree/bindings/net/smsc-com20020.txt       | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/smsc-com20020.txt
> 
> diff --git a/Documentation/devicetree/bindings/net/smsc-com20020.txt b/Documentation/devicetree/bindings/net/smsc-com20020.txt
> new file mode 100644
> index 000000000000..660a4a751f29
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/smsc-com20020.txt
> @@ -0,0 +1,21 @@
> +SMSC com20020 Arcnet network controller
> +
> +Required property:
> +- timeout-ns: Arcnet bus timeout, Idle Time (328000 - 20500)
> +- bus-speed-bps: Arcnet bus speed (10000000 - 156250)
> +- smsc,xtal-mhz: External oscillator frequency
> +- smsc,backplane-enabled: Controller use backplane mode
> +- reset-gpios: Chip reset pin
> +- interrupts: Should contain controller interrupt
> +
> +arcnet@28000000 {
> +	compatible = "smsc,com20020";
> +
> +	timeout-ns = <20500>;
> +	bus-speed-hz = <10000000>;

You have hz here and bps above.

> +	smsc,xtal-mhz = <20>;
> +	smsc,backplane-enabled;
> +
> +	reset-gpios = <&gpio3 21 GPIO_ACTIVE_LOW>;
> +	interrupts = <&gpio2 10 GPIO_ACTIVE_LOW>;
> +};
> -- 
> 2.14.4
> 

* Re: [PATCH net] failover: eliminate callback hell
From: Michael S. Tsirkin @ 2018-06-11 18:10 UTC
  To: Stephen Hemminger
  Cc: kys, haiyangz, davem, sridhar.samudrala, netdev,
	Stephen Hemminger
In-Reply-To: <20180605034231.31610-1-sthemmin@microsoft.com>

On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote:
>   * Set permanent and current address of net_failover device
>     to match the primary.
> 
>   * Carrier should be marked off before registering
>     the net_failover device.

Sridhar, do we want to address this?
If yes, could you please take a look at addressing these
meanwhile, while we keep arguing about making API changes?

-- 
MST

* Re: [PATCH] Bluetooth: hci_bcm: Configure SCO routing automatically
From: Marcel Holtmann @ 2018-06-11 18:19 UTC
  To: Rob Herring
  Cc: Attila Tőkés, David S. Miller, Mark Rutland,
	Johan Hedberg, Artiom Vaskov, netdev, devicetree,
	linux-kernel@vger.kernel.org, open list:BLUETOOTH DRIVERS
In-Reply-To: <CAL_JsqL2StWgA9gkL9i3o85-C3TNixBLmbTXzFQW8bts4phJqQ@mail.gmail.com>

Hi Rob,

>>>>>> Added support to automatically configure the SCO packet routing at the
>>>>>> device setup. The SCO packets are used with the HSP / HFP profiles, but in
>>>>>> some devices (ex. CYW43438) they are routed to a PCM output by default. This
>>>>>> change allows sending the vendor specific HCI command to configure the SCO
>>>>>> routing. The parameters of the command are loaded from the device tree.
>>>>> 
>>>>> Please wrap your commit msg.
>>>> 
>>>> 
>>>> Sure.
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Signed-off-by: Attila Tőkés <attitokes@gmail.com>
>>>>>> ---
>>>>>> .../bindings/net/broadcom-bluetooth.txt       |  7 ++
>>>>> 
>>>>> Please split bindings to separate patch.
>>>> 
>>>> 
>>>> Ok, I will split this in two.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> drivers/bluetooth/hci_bcm.c                   | 72 +++++++++++++++++++
>>>>>> 2 files changed, 79 insertions(+)
>>>>>> 
>>>>>> diff --git
>>>>>> a/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
>>>>>> b/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
>>>>>> index 4194ff7e..aea3a094 100644
>>>>>> --- a/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
>>>>>> +++ b/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
>>>>>> @@ -21,6 +21,12 @@ Optional properties:
>>>>>> - clocks: clock specifier if external clock provided to the controller
>>>>>> - clock-names: should be "extclk"
>>>>>> 
>>>>>> + SCO routing parameters:
>>>>>> + - sco-routing: 0-3 (PCM, Transport, Codec, I2S)
>>>>>> + - pcm-interface-rate: 0-4 (128 Kbps - 2048 Kbps)
>>>>>> + - pcm-frame-type: 0 (short), 1 (long)
>>>>>> + - pcm-sync-mode: 0 (slave), 1 (master)
>>>>>> + - pcm-clock-mode: 0 (slave), 1 (master)
>>>>> 
>>>>> Are these Broadcom specific? Properties need either vendor prefix or
>>>>> to be documented in a common location. I think these look like the
>>>>> latter.
>>>> 
>>>> 
>>>> These will be used as parameters of a vendor specific (Broadcom/Cypress)
>>>> command configuring the SCO packet routing. See the Write_SCO_PCM_Int_Param
>>>> command from: http://www.cypress.com/file/298311/download.
>>> 
>>> The DT should just describe how the h/w is hooked-up. What the s/w has
>>> to do based on that is the driver's problem which is certainly
>>> vendor/chip specific, but that is all irrelevant to the binding.
>>> 
>>>> What would be the property names with a Broadcom / Cypress vendor prefix?
>>>> 
>>>>   brcm,sco-routing
>>>>   brcm,pcm-interface-rate
>>>>   brcm,pcm-frame-type
>>>>   brcm,pcm-sync-mode
>>>>   brcm,pcm-clock-mode
>>>> 
>>>> ?
>>> 
>>> Yes.
>> 
>> we can do this. However all pcm-* are optional if you switch the HCI transport. And sco-routing should default to HCI if that is not present. Meaning a driver should actively try to change this. Nevertheless, it would be good if a driver reads the current settings.
>> 
>> In theory we could make sco-routing generic, but so many vendors have different modes, that we better keep this vendor specific.
> 
> Even if vendor specific, the properties for not HCI transport case are
> still incomplete IMO.
> 
> By modes, you mean PCM vs. I2S and all the flavors of timings you can
> have within those or something else? For the former, that's often
> going to be a process of solving what each end supports and if that
> doesn't work, then IIRC we already have properties for setting
> modes/timing. All the same issues exist with audio codecs and this is
> really not any different.

this is what Broadcom uses to configure their PCM transport. So I think for now, we make them brcm, specific and see how that goes. We can always generalize them later if enough chip manufacturers provide support for it.

Regards

Marcel

* Re: [v3, 03/10] dt-binding: ptp_qoriq: add DPAA FMan support
From: Rob Herring @ 2018-06-11 18:25 UTC
  To: Yangbo Lu
  Cc: netdev, madalin.bucur, Richard Cochran, Shawn Guo,
	David S . Miller, devicetree, linuxppc-dev, linux-arm-kernel,
	linux-kernel
In-Reply-To: <20180607092050.46128-4-yangbo.lu@nxp.com>

On Thu, Jun 07, 2018 at 05:20:43PM +0800, Yangbo Lu wrote:
> This patch is to add bindings description for DPAA
> FMan 1588 timer, and also remove its description in
> fsl-fman dt-bindings document.
> 
> Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
> ---
> Changes for v2:
> 	- None.
> Changes for v3:
> 	- None.
> ---
>  Documentation/devicetree/bindings/net/fsl-fman.txt |   25 +-------------------
>  .../devicetree/bindings/ptp/ptp-qoriq.txt          |   15 +++++++++--
>  2 files changed, 13 insertions(+), 27 deletions(-)

Reviewed-by: Rob Herring <robh@kernel.org>

* Re: [PULL] vhost: cleanups and fixes
From: Linus Torvalds @ 2018-06-11 18:32 UTC
  To: Michael S. Tsirkin
  Cc: KVM list, Network Development, Linux Kernel Mailing List,
	Bjorn Andersson, Andrew Morton, virtualization
In-Reply-To: <20180611192353-mutt-send-email-mst@kernel.org>

On Mon, Jun 11, 2018 at 9:24 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
>       virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

Is this really a good idea?

Plus it seems entirely broken.

The report_pfn_range() callback is done under the zone lock, but
virtio_balloon_send_free_pages() (which is the only callback used that
I can find) does add_one_sg(), which does virtqueue_add_inbuf(vq, &sg,
1, vq, GFP_KERNEL);

So now we apparently do a GFP_KERNEL allocation inside the mm zone
lock, which is broken on just _so_ many levels.

Pulled and then unpulled again.

Either somebody needs to explain why I'm wrong and you can re-submit
this, or this kind of garbage needs to go away.

I do *not* want to be in the situation where I pull stuff from the
virtio people that adds completely broken core VM functionality.

Because if I'm in that situation, I will stop pulling from you guys.
Seriously. You have *no* place sending me broken shit that is outside
the virtio layer.

Subsystems that break core MM will get shunned. You just aren't
important enough to allow you breaking core VM.

                Linus

* Re: [PATCH] optoe: driver to read/write SFP/QSFP EEPROMs
From: Tom Lendacky @ 2018-06-11 18:33 UTC
  To: Don Bollinger, Arnd Bergmann, Greg Kroah-Hartman, linux-kernel
  Cc: brandon_chuang, wally_wang, roy_lee, rick_burchett, quentin.chang,
	jeffrey.townsend, scotte, roopa, David Ahern, luke.williams,
	Guohan Lu, Russell King, netdev@vger.kernel.org
In-Reply-To: <20180611042515.ml6zbcmz6dlvjmrp@thebollingers.org>

On 6/10/2018 11:25 PM, Don Bollinger wrote:
> optoe is an i2c based driver that supports read/write access to all
> the pages (tables) of MSA standard SFP and similar devices (conforming
> to the SFF-8472 spec) and MSA standard QSFP and similar devices
> (conforming to the SFF-8436 spec).
> 
> These devices provide identification, operational status and control
> registers via an EEPROM model. These devices support one or 3 fixed
> pages (128 bytes) of data, and one page that is selected via a page
> register on the first fixed page.  Thus the driver's main task is
> to map these pages onto a simple linear address space for user space
> management applications.  See the driver code for a detailed layout.
> 
> EEPROM data is accessible via a bin_attribute file called 'eeprom',
> e.g. /sys/bus/i2c/devices/24-0050/eeprom.
> 
> Signed-off-by: Don Bollinger <don@thebollingers.org>
> ---
> 
> Why should this driver be in the Linux kernel?  SFP and QSFP devices plug
> into switches to convert electrical to optical signals and drive the
> optical signal over fiber optic cables.  They provide status and control
> registers through an i2c interface similar to other EEPROMs.  However,
> they have a paging mechanism that is unique, which requires a different
> driver from (for example) at24.  Various drivers have been developed for
> this purpose, but none of them supports both SFP and QSFP, provides both
> read and write access, and accesses all 256 architected pages.  optoe does
> all of these.
> 
> optoe has been adopted and is shipping to customers as a base module,
> available to all platforms (switches) and used by multiple vendors and
> platforms on both ONL (Open Network Linux) and SONiC (Microsoft's
> 'Software for Open Networking in the Cloud').
> 
> This patch has been built on the latest staging-testing kernel.  It has
> been built and tested with SFP and QSFP devices on an ARM platform with a 4.9
> kernel, and an x86 switch with a 3.16 kernel.  This patch should install
> and build clean on any kernel from 3.16 up to the latest (as of 6/10/2018).
> 
> 
>  Documentation/misc-devices/optoe.txt |   56 ++
>  drivers/misc/eeprom/Kconfig          |   18 +
>  drivers/misc/eeprom/Makefile         |    1 +
>  drivers/misc/eeprom/optoe.c          | 1141 ++++++++++++++++++++++++++++++++++
>  4 files changed, 1216 insertions(+)
>  create mode 100644 Documentation/misc-devices/optoe.txt
>  create mode 100644 drivers/misc/eeprom/optoe.c
> 

There's an SFP driver under drivers/net/phy.  Can that driver be extended
to provide this support?  Adding Russell King, who developed sfp.c, as well
as the netdev mailing list.

Thanks,
Tom
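
The patch description quoted above says the driver's main task is to map the paged EEPROM onto a linear address space. As a rough illustration, the userspace sketch below restates that mapping, following the layout comments in the quoted patch. It is not the driver code: struct paged_addr, translate_offset() and PAGE_SIZE_BYTES are hypothetical names, and the client index (0 for I2C address 0x50, 1 for 0x51) stands in for the driver's i2c_client pointers.

```c
#define PAGE_SIZE_BYTES 128	/* mirrors OPTOE_PAGE_SIZE in the patch */

enum dev_class { ONE_ADDR = 1, TWO_ADDR = 2 };	/* QSFP-like, SFP-like */

/* Hypothetical result type: where a linear offset lands on the device. */
struct paged_addr {
	int client;		/* 0 = I2C 0x50, 1 = I2C 0x51 */
	unsigned int page;	/* value for the page select register */
	unsigned int offset;	/* 0-255 within the selected client */
};

static struct paged_addr translate_offset(enum dev_class cls,
					  unsigned long linear)
{
	struct paged_addr a = { 0, 0, 0 };

	/* SFP style: everything past the first 256 bytes lives at 0x51. */
	if (cls == TWO_ADDR && linear > 255) {
		a.client = 1;
		linear -= 256;
	}

	/* Lower half (0-127) is never paged; the page value is irrelevant. */
	if (linear < PAGE_SIZE_BYTES) {
		a.offset = (unsigned int)linear;
		return a;
	}

	/* Upper half: pick the page, then place the offset in 128-255. */
	a.page = (unsigned int)(linear / PAGE_SIZE_BYTES) - 1;
	a.offset = PAGE_SIZE_BYTES + (unsigned int)(linear % PAGE_SIZE_BYTES);
	return a;
}
```

For a QSFP-style device, linear offset 256 maps to page 1 at offset 128 of address 0x50; for an SFP-style device, linear offset 512 maps to page 1 at offset 128 of address 0x51, which agrees with the "(n/128)-3" rule in the driver comments.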

> diff --git a/Documentation/misc-devices/optoe.txt b/Documentation/misc-devices/optoe.txt
> new file mode 100644
> index 000000000000..496134940147
> --- /dev/null
> +++ b/Documentation/misc-devices/optoe.txt
> @@ -0,0 +1,56 @@
> +optoe driver
> +
> +Author Don Bollinger (don@thebollingers.org)
> +
> +Optoe is an i2c based driver that supports read/write access to all
> +the pages (tables) of MSA standard SFP and similar devices (conforming
> +to the SFF-8472 spec) and MSA standard QSFP and similar devices
> +(conforming to the SFF-8436 spec).
> +
> +i2c based optoelectronic transceivers (SFP, QSFP, etc) provide identification,
> +operational status, and control registers via an EEPROM model.  Unlike the
> +EEPROMs that at24 supports, these devices access data beyond byte 256 via
> +a page select register, which must be managed by the driver.  See the driver
> +code for a detailed explanation of how the linear address space provided
> +by the driver maps to the paged address space provided by the devices.
> +
> +The EEPROM data is accessible via a bin_attribute file called 'eeprom',
> +e.g. /sys/bus/i2c/devices/24-0050/eeprom
> +
> +This driver also reports the port number for each device, via a sysfs
> +attribute: 'port_name'.  This is a read/write attribute.  It should be
> +explicitly set as part of system initialization, ideally at the same time
> +the device is instantiated.  Write an appropriate port name (any string, up
> +to 19 characters) to initialize.  If not initialized explicitly, all ports
> +will have the port_name of 'unitialized'.  Alternatively, if the driver is
> +called with platform_data, the port_name will be read from eeprom_data->label
> +(if the EEPROM CLASS driver is configured) or from platform_data.port_name.
> +
> +This driver can be instantiated with 'new_device', per the convention
> +described in Documentation/i2c/instantiating-devices.  It wants one of
> +two possible device identifiers.  Use 'optoe1' to indicate this is a device
> +with just one i2c address (all QSFP type devices).  Use 'optoe2' to indicate
> +this is a device with two i2c addresses (all SFP type devices).
> +
> +Example:
> +# echo optoe1 0x50 > /sys/bus/i2c/devices/i2c-64/new_device
> +# echo port54 > /sys/bus/i2c/devices/i2c-64/port_name
> +
> +This will add a QSFP type device to i2c bus i2c-64, and name it 'port54'
> +
> +Example:
> +# echo optoe2 0x50 > /sys/bus/i2c/devices/i2c-11/new_device
> +# echo port1 > /sys/bus/i2c/devices/i2c-11/port_name
> +
> +This will add an SFP type device to i2c bus i2c-11, and name it 'port1'
> +
> +The second parameter to new_device is an i2c address, and MUST be 0x50 for
> +this driver to work properly.  This is part of the spec for these devices.
> +(It is not necessary to create a device at 0x51 for SFP type devices, the
> +driver does that automatically.)
> +
> +Note that SFP type and QSFP type devices are not plug-compatible.  The
> +driver expects the correct ID for each port (each i2c device).  It does
> +not check because the port will often be empty, and the only way to check
> +is to interrogate the device.  Incorrect choice of ID will lead to correct
> +data being reported for the first 256 bytes, incorrect data after that.
> diff --git a/drivers/misc/eeprom/Kconfig b/drivers/misc/eeprom/Kconfig
> index 68a1ac929917..9a08e12756ee 100644
> --- a/drivers/misc/eeprom/Kconfig
> +++ b/drivers/misc/eeprom/Kconfig
> @@ -111,4 +111,22 @@ config EEPROM_IDT_89HPESX
>  	  This driver can also be built as a module. If so, the module
>  	  will be called idt_89hpesx.
>  
> +config EEPROM_OPTOE
> +	tristate "read/write access to SFP* & QSFP* EEPROMs"
> +	depends on I2C && SYSFS
> +	help
> +	  If you say yes here you get support for read and write access to
> +	  the EEPROM of SFP and QSFP type optical and copper transceivers.
> +	  Includes all devices which conform to the sff-8436 and sff-8472
> +	  spec including SFP, SFP+, SFP28, SFP-DWDM, QSFP, QSFP+, QSFP28
> +	  or later.  These devices are usually found in network switches.
> +
> +	  This driver only manages read/write access to the EEPROM, all
> +	  other features should be accessed via i2c-dev.
> +
> +	  This driver can also be built as a module. If so, the module
> +	  will be called optoe.
> +
> +	  If unsure, say N.
> +
>  endmenu
> diff --git a/drivers/misc/eeprom/Makefile b/drivers/misc/eeprom/Makefile
> index 2aab60ef3e3e..00288d669017 100644
> --- a/drivers/misc/eeprom/Makefile
> +++ b/drivers/misc/eeprom/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_EEPROM_93CX6)	+= eeprom_93cx6.o
>  obj-$(CONFIG_EEPROM_93XX46)	+= eeprom_93xx46.o
>  obj-$(CONFIG_EEPROM_DIGSY_MTC_CFG) += digsy_mtc_eeprom.o
>  obj-$(CONFIG_EEPROM_IDT_89HPESX) += idt_89hpesx.o
> +obj-$(CONFIG_EEPROM_OPTOE)      += optoe.o
> diff --git a/drivers/misc/eeprom/optoe.c b/drivers/misc/eeprom/optoe.c
> new file mode 100644
> index 000000000000..7cdf1a0a5299
> --- /dev/null
> +++ b/drivers/misc/eeprom/optoe.c
> @@ -0,0 +1,1141 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * optoe.c - A driver to read and write the EEPROM on optical transceivers
> + * (SFP, QSFP and similar I2C based devices)
> + *
> + * Copyright (C) 2014 Cumulus networks Inc.
> + * Copyright (C) 2017 Finisar Corp.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +/*
> + *	Description:
> + *	a) Optical transceiver EEPROM read/write transactions are just like
> + *		the at24 eeproms managed by the at24.c i2c driver
> + *	b) The register/memory layout is up to 256 128 byte pages defined by
> + *		a "pages valid" register and switched via a "page select"
> + *		register as explained in below diagram.
> + *	c) 256 bytes are mapped at a time. 'Lower page 00h' is the first 128
> + *	        bytes of address space, and always references the same
> + *	        location, independent of the page select register.
> + *	        All mapped pages are mapped into the upper 128 bytes
> + *	        (offset 128-255) of the i2c address.
> + *	d) Devices with one I2C address (eg QSFP) use I2C address 0x50
> + *		(A0h in the spec), and map all pages in the upper 128 bytes
> + *		of that address.
> + *	e) Devices with two I2C addresses (eg SFP) have 256 bytes of data
> + *		at I2C address 0x50, and 256 bytes of data at I2C address
> + *		0x51 (A2h in the spec).  Page selection and paged access
> + *		only apply to this second I2C address (0x51).
> + *	f) The address space is presented, by the driver, as a linear
> + *	        address space.  For devices with one I2C client at address
> + *	        0x50 (eg QSFP), offset 0-127 are in the lower
> + *	        half of address 50/A0h/client[0].  Offset 128-255 are in
> + *	        page 0, 256-383 are page 1, etc.  More generally, offset
> + *	        'n' resides in page (n/128)-1.  ('page -1' is the lower
> + *	        half, offset 0-127).
> + *	g) For devices with two I2C clients at address 0x50 and 0x51 (eg SFP),
> + *		the address space places offset 0-127 in the lower
> + *	        half of 50/A0/client[0], offset 128-255 in the upper
> + *	        half.  Offset 256-383 is in the lower half of 51/A2/client[1].
> + *	        Offset 384-511 is in page 0, in the upper half of 51/A2/...
> + *	        Offset 512-639 is in page 1, in the upper half of 51/A2/...
> + *	        Offset 'n' is in page (n/128)-3 (for n > 383)
> + *
> + *	                    One I2c addressed (eg QSFP) Memory Map
> + *
> + *	                    2-Wire Serial Address: 1010000x
> + *
> + *	                    Lower Page 00h (128 bytes)
> + *	                    =====================
> + *	                   |                     |
> + *	                   |                     |
> + *	                   |                     |
> + *	                   |                     |
> + *	                   |                     |
> + *	                   |                     |
> + *	                   |                     |
> + *	                   |                     |
> + *	                   |                     |
> + *	                   |                     |
> + *	                   |Page Select Byte(127)|
> + *	                    =====================
> + *	                              |
> + *	                              |
> + *	                              |
> + *	                              |
> + *	                              V
> + *	     ------------------------------------------------------------
> + *	    |                 |                  |                       |
> + *	    |                 |                  |                       |
> + *	    |                 |                  |                       |
> + *	    |                 |                  |                       |
> + *	    |                 |                  |                       |
> + *	    |                 |                  |                       |
> + *	    |                 |                  |                       |
> + *	    |                 |                  |                       |
> + *	    |                 |                  |                       |
> + *	    V                 V                  V                       V
> + *	 ------------   --------------      ---------------     --------------
> + *	|            | |              |    |               |   |              |
> + *	|   Upper    | |     Upper    |    |     Upper     |   |    Upper     |
> + *	|  Page 00h  | |    Page 01h  |    |    Page 02h   |   |   Page 03h   |
> + *	|            | |   (Optional) |    |   (Optional)  |   |  (Optional   |
> + *	|            | |              |    |               |   |   for Cable  |
> + *	|            | |              |    |               |   |  Assemblies) |
> + *	|    ID      | |     AST      |    |      User     |   |              |
> + *	|  Fields    | |    Table     |    |   EEPROM Data |   |              |
> + *	|            | |              |    |               |   |              |
> + *	|            | |              |    |               |   |              |
> + *	|            | |              |    |               |   |              |
> + *	 ------------   --------------      ---------------     --------------
> + *
> + * The SFF 8436 (QSFP) spec only defines the 4 pages described above.
> + * In anticipation of future applications and devices, this driver
> + * supports access to the full architected range, 256 pages.
> + *
> + **/
> +
> +/* #define DEBUG 1 */
> +
> +#undef EEPROM_CLASS
> +#ifdef CONFIG_EEPROM_CLASS
> +#define EEPROM_CLASS
> +#endif
> +#ifdef CONFIG_EEPROM_CLASS_MODULE
> +#define EEPROM_CLASS
> +#endif
> +
> +#include <linux/kernel.h>
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/delay.h>
> +#include <linux/mutex.h>
> +#include <linux/sysfs.h>
> +#include <linux/jiffies.h>
> +#include <linux/i2c.h>
> +
> +#ifdef EEPROM_CLASS
> +#include <linux/eeprom_class.h>
> +#endif
> +
> +#include <linux/types.h>
> +
> +/* The maximum length of a port name */
> +#define MAX_PORT_NAME_LEN 20
> +
> +struct optoe_platform_data {
> +	u32		byte_len;		/* size (sum of all addr) */
> +	u16		page_size;		/* for writes */
> +	u8		flags;
> +	void		*dummy1;		/* backward compatibility */
> +	void		*dummy2;		/* backward compatibility */
> +
> +#ifdef EEPROM_CLASS
> +	struct eeprom_platform_data *eeprom_data;
> +#endif
> +	char port_name[MAX_PORT_NAME_LEN];
> +};
> +
> +/* fundamental unit of addressing for EEPROM */
> +#define OPTOE_PAGE_SIZE 128
> +/*
> + * Single address devices (eg QSFP) have 256 pages, plus the unpaged
> + * low 128 bytes.  If the device does not support paging, it is
> + * only 2 'pages' long.
> + */
> +#define OPTOE_ARCH_PAGES 256
> +#define ONE_ADDR_EEPROM_SIZE ((1 + OPTOE_ARCH_PAGES) * OPTOE_PAGE_SIZE)
> +#define ONE_ADDR_EEPROM_UNPAGED_SIZE (2 * OPTOE_PAGE_SIZE)
> +/*
> + * Dual address devices (eg SFP) have 256 pages, plus the unpaged
> + * low 128 bytes, plus 256 bytes at 0x50.  If the device does not
> + * support paging, it is 4 'pages' long.
> + */
> +#define TWO_ADDR_EEPROM_SIZE ((3 + OPTOE_ARCH_PAGES) * OPTOE_PAGE_SIZE)
> +#define TWO_ADDR_EEPROM_UNPAGED_SIZE (4 * OPTOE_PAGE_SIZE)
> +#define TWO_ADDR_NO_0X51_SIZE (2 * OPTOE_PAGE_SIZE)
> +
> +/* a few constants to find our way around the EEPROM */
> +#define OPTOE_PAGE_SELECT_REG   0x7F
> +#define ONE_ADDR_PAGEABLE_REG 0x02
> +#define ONE_ADDR_NOT_PAGEABLE BIT(2)
> +#define TWO_ADDR_PAGEABLE_REG 0x40
> +#define TWO_ADDR_PAGEABLE BIT(4)
> +#define TWO_ADDR_0X51_REG 92
> +#define TWO_ADDR_0X51_SUPP BIT(6)
> +#define OPTOE_ID_REG 0
> +#define OPTOE_READ_OP 0
> +#define OPTOE_WRITE_OP 1
> +#define OPTOE_EOF 0  /* used for access beyond end of device */
> +
> +struct optoe_data {
> +	struct optoe_platform_data chip;
> +	int use_smbus;
> +	char port_name[MAX_PORT_NAME_LEN];
> +
> +	/*
> +	 * Lock protects against activities from other Linux tasks,
> +	 * but not from changes by other I2C masters.
> +	 */
> +	struct mutex lock;
> +	struct bin_attribute bin;
> +	struct attribute_group attr_group;
> +
> +	u8 *writebuf;
> +	unsigned int write_max;
> +
> +	unsigned int num_addresses;
> +
> +#ifdef EEPROM_CLASS
> +	struct eeprom_device *eeprom_dev;
> +#endif
> +
> +	/* dev_class: ONE_ADDR (QSFP) or TWO_ADDR (SFP) */
> +	int dev_class;
> +
> +	struct i2c_client *client[2];
> +};
> +
> +/*
> + * This parameter is to help this driver avoid blocking other drivers out
> + * of I2C for potentially troublesome amounts of time. With a 100 kHz I2C
> + * clock, one 256 byte read takes about 1/43 second which is excessive;
> + * but the 1/170 second it takes at 400 kHz may be quite reasonable; and
> + * at 1 MHz (Fm+) a 1/430 second delay could easily be invisible.
> + *
> + * This value is forced to be a power of two so that writes align on pages.
> + */
> +static unsigned int io_limit = OPTOE_PAGE_SIZE;
> +
> +/*
> + * specs often allow 5 msec for a page write, sometimes 20 msec;
> + * it's important to recover from write timeouts.
> + */
> +static unsigned int write_timeout = 25;
> +
> +/*
> + * flags to distinguish one-address (QSFP family) from two-address (SFP family)
> + * If the family is not known, figure it out when the device is accessed
> + */
> +#define ONE_ADDR 1
> +#define TWO_ADDR 2
> +
> +static const struct i2c_device_id optoe_ids[] = {
> +	{ "optoe1", ONE_ADDR },
> +	{ "optoe2", TWO_ADDR },
> +	{ "sff8436", ONE_ADDR },
> +	{ "24c04", TWO_ADDR },
> +	{ /* END OF LIST */ }
> +};
> +MODULE_DEVICE_TABLE(i2c, optoe_ids);
> +
> +/*-------------------------------------------------------------------------*/
> +/*
> + * This routine computes the addressing information to be used for
> + * a given r/w request.
> + *
> + * Task is to calculate the client (0 = i2c addr 50, 1 = i2c addr 51),
> + * the page, and the offset.
> + *
> + * Handles both single address (eg QSFP) and two address (eg SFP).
> + *     For SFP, offset 0-255 are on client[0], >255 is on client[1]
> + *     Offset 256-383 are on the lower half of client[1]
> + *     Pages are accessible on the upper half of client[1].
> + *     Offset >383 are in 128 byte pages mapped into the upper half
> + *
> + *     For QSFP, all offsets are on client[0]
> + *     offset 0-127 are on the lower half of client[0] (no paging)
> + *     Pages are accessible on the upper half of client[0].
> + *     Offset >127 are in 128 byte pages mapped into the upper half
> + *
> + *     Callers must not read/write beyond the end of a client or a page
> + *     without recomputing the client/page.  Hence offset (within page)
> + *     plus length must be less than or equal to 128.  (Note that this
> + *     routine does not have access to the length of the call, hence
> + *     cannot do the validity check.)
> + *
> + * Offset within Lower Page 00h and Upper Page 00h are not recomputed
> + */
> +
> +static uint8_t optoe_translate_offset(struct optoe_data *optoe,
> +				      loff_t *offset,
> +				      struct i2c_client **client)
> +{
> +	unsigned int page = 0;
> +
> +	*client = optoe->client[0];
> +
> +	/* if SFP style, offset > 255, shift to i2c addr 0x51 */
> +	if (optoe->dev_class == TWO_ADDR) {
> +		if (*offset > 255) {
> +			/* like QSFP, but shifted to client[1] */
> +			*client = optoe->client[1];
> +			*offset -= 256;
> +		}
> +	}
> +
> +	/*
> +	 * if offset is in the range 0-127...
> +	 * page doesn't matter (using lower half), return 0.
> +	 * offset is already correct (don't add 128 to get to paged area)
> +	 */
> +	if (*offset < OPTOE_PAGE_SIZE)
> +		return page;
> +
> +	/* note, page will never be negative since *offset >= 128 */
> +	page = (*offset >> 7) - 1;
> +	/* 0x80 places the offset in the top half, offset is last 7 bits */
> +	*offset = OPTOE_PAGE_SIZE + (*offset & 0x7f);
> +
> +	return page;  /* note also returning client and offset */
> +}
> +
> +static ssize_t optoe_eeprom_read(struct optoe_data *optoe,
> +				 struct i2c_client *client,
> +				 char *buf, unsigned int offset, size_t count)
> +{
> +	struct i2c_msg msg[2];
> +	u8 msgbuf[2];
> +	unsigned long timeout, read_time;
> +	int status, i;
> +
> +	memset(msg, 0, sizeof(msg));
> +
> +	switch (optoe->use_smbus) {
> +	case I2C_SMBUS_I2C_BLOCK_DATA:
> +		/* smaller EEPROMs can work given some SMBus extension calls */
> +		if (count > I2C_SMBUS_BLOCK_MAX)
> +			count = I2C_SMBUS_BLOCK_MAX;
> +		break;
> +	case I2C_SMBUS_WORD_DATA:
> +		/* Check for odd length transaction */
> +		count = (count == 1) ? 1 : 2;
> +		break;
> +	case I2C_SMBUS_BYTE_DATA:
> +		count = 1;
> +		break;
> +	default:
> +		/*
> +		 * When we have a better choice than SMBus calls, use a
> +		 * combined I2C message. Write address; then read up to
> +		 * io_limit data bytes.  msgbuf is u8 and will cast to our
> +		 * needs.
> +		 */
> +		i = 0;
> +		msgbuf[i++] = offset;
> +
> +		msg[0].addr = client->addr;
> +		msg[0].buf = msgbuf;
> +		msg[0].len = i;
> +
> +		msg[1].addr = client->addr;
> +		msg[1].flags = I2C_M_RD;
> +		msg[1].buf = buf;
> +		msg[1].len = count;
> +	}
> +
> +	/*
> +	 * Reads fail if the previous write didn't complete yet. We may
> +	 * loop a few times until this one succeeds, waiting at least
> +	 * long enough for one entire page write to work.
> +	 */
> +	timeout = jiffies + msecs_to_jiffies(write_timeout);
> +	do {
> +		read_time = jiffies;
> +
> +		switch (optoe->use_smbus) {
> +		case I2C_SMBUS_I2C_BLOCK_DATA:
> +			status = i2c_smbus_read_i2c_block_data(client, offset,
> +							       count, buf);
> +			break;
> +		case I2C_SMBUS_WORD_DATA:
> +			status = i2c_smbus_read_word_data(client, offset);
> +			if (status >= 0) {
> +				buf[0] = status & 0xff;
> +				if (count == 2)
> +					buf[1] = status >> 8;
> +				status = count;
> +			}
> +			break;
> +		case I2C_SMBUS_BYTE_DATA:
> +			status = i2c_smbus_read_byte_data(client, offset);
> +			if (status >= 0) {
> +				buf[0] = status;
> +				status = count;
> +			}
> +			break;
> +		default:
> +			status = i2c_transfer(client->adapter, msg, 2);
> +			if (status == 2)
> +				status = count;
> +		}
> +
> +		dev_dbg(&client->dev, "eeprom read %zu@%d --> %d (%ld)\n",
> +			count, offset, status, jiffies);
> +
> +		if (status == count)  /* happy path */
> +			return count;
> +
> +		if (status == -ENXIO) /* no module present */
> +			return status;
> +
> +		/* REVISIT: at HZ=100, this is sloooow */
> +		usleep_range(1000, 2000);
> +	} while (time_before(read_time, timeout));
> +
> +	return -ETIMEDOUT;
> +}
> +
> +static ssize_t optoe_eeprom_write(struct optoe_data *optoe,
> +				  struct i2c_client *client,
> +				  const char *buf,
> +				  unsigned int offset, size_t count)
> +{
> +	struct i2c_msg msg;
> +	ssize_t status;
> +	unsigned long timeout, write_time;
> +	unsigned int next_page_start;
> +	int i = 0;
> +	u16 writeword;
> +
> +	/* write max is at most a page
> +	 * (In this driver, write_max is actually one byte!)
> +	 */
> +	if (count > optoe->write_max)
> +		count = optoe->write_max;
> +
> +	/* shorten count if necessary to avoid crossing page boundary */
> +	next_page_start = roundup(offset + 1, OPTOE_PAGE_SIZE);
> +	if (offset + count > next_page_start)
> +		count = next_page_start - offset;
> +
> +	switch (optoe->use_smbus) {
> +	case I2C_SMBUS_I2C_BLOCK_DATA:
> +		/* smaller EEPROMs can work given some SMBus extension calls */
> +		if (count > I2C_SMBUS_BLOCK_MAX)
> +			count = I2C_SMBUS_BLOCK_MAX;
> +		break;
> +	case I2C_SMBUS_WORD_DATA:
> +		/* Check for odd length transaction */
> +		count = (count == 1) ? 1 : 2;
> +		break;
> +	case I2C_SMBUS_BYTE_DATA:
> +		count = 1;
> +		break;
> +	default:
> +		/* If we'll use I2C calls for I/O, set up the message */
> +		msg.addr = client->addr;
> +		msg.flags = 0;
> +
> +		/* msg.buf is u8 and casts will mask the values */
> +		msg.buf = optoe->writebuf;
> +
> +		msg.buf[i++] = offset;
> +		memcpy(&msg.buf[i], buf, count);
> +		msg.len = i + count;
> +		break;
> +	}
> +
> +	/*
> +	 * Writes fail if the previous write didn't complete yet. We may
> +	 * loop a few times until this one succeeds, waiting at least
> +	 * long enough for one entire page write to work.
> +	 */
> +	timeout = jiffies + msecs_to_jiffies(write_timeout);
> +	do {
> +		write_time = jiffies;
> +
> +		switch (optoe->use_smbus) {
> +		case I2C_SMBUS_I2C_BLOCK_DATA:
> +			status = i2c_smbus_write_i2c_block_data(client,
> +								offset,
> +								count,
> +								buf);
> +			if (status == 0)
> +				status = count;
> +			break;
> +		case I2C_SMBUS_WORD_DATA:
> +			if (count == 2) {
> +				writeword = (buf[1] << 8) | buf[0];
> +				status = i2c_smbus_write_word_data(client,
> +								   offset,
> +								   writeword);
> +			} else {
> +				/* count = 1 */
> +				status = i2c_smbus_write_byte_data(client,
> +								   offset,
> +								   buf[0]);
> +			}
> +			if (status == 0)
> +				status = count;
> +			break;
> +		case I2C_SMBUS_BYTE_DATA:
> +			status = i2c_smbus_write_byte_data(client, offset,
> +							   buf[0]);
> +			if (status == 0)
> +				status = count;
> +			break;
> +		default:
> +			status = i2c_transfer(client->adapter, &msg, 1);
> +			if (status == 1)
> +				status = count;
> +			break;
> +		}
> +
> +		dev_dbg(&client->dev, "eeprom write %zu@%d --> %ld (%lu)\n",
> +			count, offset, (long int)status, jiffies);
> +
> +		if (status == count)
> +			return count;
> +
> +		/* REVISIT: at HZ=100, this is sloooow */
> +		usleep_range(1000, 2000);
> +	} while (time_before(write_time, timeout));
> +
> +	return -ETIMEDOUT;
> +}
> +
> +static ssize_t optoe_eeprom_update_client(struct optoe_data *optoe,
> +					  char *buf, loff_t off,
> +					  size_t count, int opcode)
> +{
> +	struct i2c_client *client;
> +	ssize_t retval = 0;
> +	u8 page = 0;
> +	loff_t phy_offset = off;
> +	int ret = 0;
> +
> +	page = optoe_translate_offset(optoe, &phy_offset, &client);
> +	dev_dbg(&client->dev,
> +		"%s off %lld  page:%d phy_offset:%lld, count:%ld, opcode:%d\n",
> +		__func__, off, page, phy_offset, (long int)count, opcode);
> +	if (page > 0) {
> +		ret = optoe_eeprom_write(optoe, client, &page,
> +					 OPTOE_PAGE_SELECT_REG, 1);
> +		if (ret < 0) {
> +			dev_dbg(&client->dev,
> +				"Write page register for page %d failed ret:%d!\n",
> +					page, ret);
> +			return ret;
> +		}
> +	}
> +
> +	while (count) {
> +		ssize_t	status;
> +
> +		if (opcode == OPTOE_READ_OP) {
> +			status =  optoe_eeprom_read(optoe, client, buf,
> +						    phy_offset, count);
> +		} else {
> +			status =  optoe_eeprom_write(optoe, client, buf,
> +						     phy_offset, count);
> +		}
> +		if (status <= 0) {
> +			if (retval == 0)
> +				retval = status;
> +			break;
> +		}
> +		buf += status;
> +		phy_offset += status;
> +		count -= status;
> +		retval += status;
> +	}
> +
> +	if (page > 0) {
> +		/* return the page register to page 0 (why?) */
> +		page = 0;
> +		ret = optoe_eeprom_write(optoe, client, &page,
> +					 OPTOE_PAGE_SELECT_REG, 1);
> +		if (ret < 0) {
> +			dev_err(&client->dev,
> +				"Restore page register to 0 failed:%d!\n", ret);
> +			/* error only if nothing has been transferred */
> +			if (retval == 0)
> +				retval = ret;
> +		}
> +	}
> +	return retval;
> +}
> +
> +/*
> + * Figure out if this access is within the range of supported pages.
> + * Note this is called on every access because we don't know if the
> + * module has been replaced since the last call.
> + * If/when modules support more pages, this is the routine to update
> + * to validate and allow access to additional pages.
> + *
> + * Returns updated len for this access:
> + *     - entire access is legal, original len is returned.
> + *     - access begins legal but is too long, len is truncated to fit.
> + *     - initial offset exceeds supported pages, return OPTOE_EOF (zero)
> + */
> +static ssize_t optoe_page_legal(struct optoe_data *optoe,
> +				loff_t off, size_t len)
> +{
> +	struct i2c_client *client = optoe->client[0];
> +	u8 regval;
> +	int status;
> +	size_t maxlen;
> +
> +	if (off < 0)
> +		return -EINVAL;
> +	if (optoe->dev_class == TWO_ADDR) {
> +		/* SFP case */
> +		/* if only using addr 0x50 (first 256 bytes) we're good */
> +		if ((off + len) <= TWO_ADDR_NO_0X51_SIZE)
> +			return len;
> +		/* if offset exceeds possible pages, we're not good */
> +		if (off >= TWO_ADDR_EEPROM_SIZE)
> +			return OPTOE_EOF;
> +		/* in between, are pages supported? */
> +		status = optoe_eeprom_read(optoe, client, &regval,
> +					   TWO_ADDR_PAGEABLE_REG, 1);
> +		if (status < 0)
> +			return status;  /* error out (no module?) */
> +		if (regval & TWO_ADDR_PAGEABLE) {
> +			/* Pages supported, trim len to the end of pages */
> +			maxlen = TWO_ADDR_EEPROM_SIZE - off;
> +		} else {
> +			/* pages not supported, trim len to unpaged size */
> +			if (off >= TWO_ADDR_EEPROM_UNPAGED_SIZE)
> +				return OPTOE_EOF;
> +
> +			/* will be accessing addr 0x51, is that supported? */
> +			/* byte 92, bit 6 implies DDM support, 0x51 support */
> +			status = optoe_eeprom_read(optoe, client, &regval,
> +						   TWO_ADDR_0X51_REG, 1);
> +			if (status < 0)
> +				return status;
> +			if (regval & TWO_ADDR_0X51_SUPP) {
> +				/* addr 0x51 is OK */
> +				maxlen = TWO_ADDR_EEPROM_UNPAGED_SIZE - off;
> +			} else {
> +				/* addr 0x51 NOT supported, trim to 256 max */
> +				if (off >= TWO_ADDR_NO_0X51_SIZE)
> +					return OPTOE_EOF;
> +				maxlen = TWO_ADDR_NO_0X51_SIZE - off;
> +			}
> +		}
> +		len = (len > maxlen) ? maxlen : len;
> +		dev_dbg(&client->dev,
> +			"page_legal, SFP, off %lld len %ld\n",
> +			off, (long int)len);
> +	} else {
> +		/* QSFP case */
> +		/* if no pages needed, we're good */
> +		if ((off + len) <= ONE_ADDR_EEPROM_UNPAGED_SIZE)
> +			return len;
> +		/* if offset exceeds possible pages, we're not good */
> +		if (off >= ONE_ADDR_EEPROM_SIZE)
> +			return OPTOE_EOF;
> +		/* in between, are pages supported? */
> +		status = optoe_eeprom_read(optoe, client, &regval,
> +					   ONE_ADDR_PAGEABLE_REG, 1);
> +		if (status < 0)
> +			return status;  /* error out (no module?) */
> +		if (regval & ONE_ADDR_NOT_PAGEABLE) {
> +			/* pages not supported, trim len to unpaged size */
> +			if (off >= ONE_ADDR_EEPROM_UNPAGED_SIZE)
> +				return OPTOE_EOF;
> +			maxlen = ONE_ADDR_EEPROM_UNPAGED_SIZE - off;
> +		} else {
> +			/* Pages supported, trim len to the end of pages */
> +			maxlen = ONE_ADDR_EEPROM_SIZE - off;
> +		}
> +		len = (len > maxlen) ? maxlen : len;
> +		dev_dbg(&client->dev,
> +			"page_legal, QSFP, off %lld len %ld\n",
> +			off, (long int)len);
> +	}
> +	return len;
> +}
> +
> +static ssize_t optoe_read_write(struct optoe_data *optoe,
> +				char *buf, loff_t off,
> +				size_t len, int opcode)
> +{
> +	struct i2c_client *client = optoe->client[0];
> +	int chunk;
> +	int status = 0;
> +	ssize_t retval;
> +	size_t pending_len = 0, chunk_len = 0;
> +	loff_t chunk_offset = 0, chunk_start_offset = 0;
> +
> +	dev_dbg(&client->dev,
> +		"%s: off %lld  len:%ld, opcode:%s\n",
> +		__func__, off, (long int)len,
> +		(opcode == OPTOE_READ_OP) ? "r" : "w");
> +	if (unlikely(!len))
> +		return len;
> +
> +	/*
> +	 * Read data from chip, protecting against concurrent updates
> +	 * from this host, but not from other I2C masters.
> +	 */
> +	mutex_lock(&optoe->lock);
> +
> +	/*
> +	 * Confirm this access fits within the device supported addr range
> +	 */
> +	status = optoe_page_legal(optoe, off, len);
> +	if (status == OPTOE_EOF || status < 0) {
> +		mutex_unlock(&optoe->lock);
> +		return status;
> +	}
> +	len = status;
> +
> +	/*
> +	 * For each (128 byte) chunk involved in this request, issue a
> +	 * separate call to optoe_eeprom_update_client(), to
> +	 * ensure that each access recalculates the client/page
> +	 * and writes the page register as needed.
> +	 * Note that chunk to page mapping is confusing, is different for
> +	 * QSFP and SFP, and never needs to be done.  Don't try!
> +	 */
> +	pending_len = len; /* amount remaining to transfer */
> +	retval = 0;  /* amount transferred */
> +	for (chunk = off >> 7; chunk <= (off + len - 1) >> 7; chunk++) {
> +		/*
> +		 * Compute the offset and number of bytes to be read/write
> +		 *
> +		 * 1. start at offset 0 (within the chunk), and read/write
> +		 *    the entire chunk
> +		 * 2. start at offset 0 (within the chunk) and read/write less
> +		 *    than entire chunk
> +		 * 3. start at an offset not equal to 0 and read/write the rest
> +		 *    of the chunk
> +		 * 4. start at an offset not equal to 0 and read/write less than
> +		 *    (end of chunk - offset)
> +		 */
> +		chunk_start_offset = chunk * OPTOE_PAGE_SIZE;
> +
> +		if (chunk_start_offset < off) {
> +			chunk_offset = off;
> +			if ((off + pending_len) < (chunk_start_offset +
> +					OPTOE_PAGE_SIZE))
> +				chunk_len = pending_len;
> +			else
> +				chunk_len = chunk_start_offset +
> +					    OPTOE_PAGE_SIZE - off;
> +		} else {
> +			chunk_offset = chunk_start_offset;
> +			if (pending_len > OPTOE_PAGE_SIZE)
> +				chunk_len = OPTOE_PAGE_SIZE;
> +			else
> +				chunk_len = pending_len;
> +		}
> +
> +		dev_dbg(&client->dev,
> +			"sff_r/w: off %lld, len %ld, chunk_start_offset %lld, chunk_offset %lld, chunk_len %ld, pending_len %ld\n",
> +			off, (long int)len, chunk_start_offset, chunk_offset,
> +			(long int)chunk_len, (long int)pending_len);
> +
> +		/*
> +		 * note: chunk_offset is from the start of the EEPROM,
> +		 * not the start of the chunk
> +		 */
> +		status = optoe_eeprom_update_client(optoe, buf, chunk_offset,
> +						    chunk_len, opcode);
> +		if (status != chunk_len) {
> +			/* This is another 'no device present' path */
> +			dev_dbg(&client->dev,
> +			"o_u_c: chunk %d c_offset %lld c_len %ld failed %d!\n",
> +			chunk, chunk_offset, (long int)chunk_len, status);
> +			if (status > 0)
> +				retval += status;
> +			if (retval == 0)
> +				retval = status;
> +			break;
> +		}
> +		buf += status;
> +		pending_len -= status;
> +		retval += status;
> +	}
> +	mutex_unlock(&optoe->lock);
> +
> +	return retval;
> +}
> +
> +static ssize_t optoe_bin_read(struct file *filp, struct kobject *kobj,
> +			      struct bin_attribute *attr,
> +			      char *buf, loff_t off, size_t count)
> +{
> +	struct i2c_client *client = to_i2c_client(container_of(kobj,
> +				struct device, kobj));
> +	struct optoe_data *optoe = i2c_get_clientdata(client);
> +
> +	return optoe_read_write(optoe, buf, off, count, OPTOE_READ_OP);
> +}
> +
> +static ssize_t optoe_bin_write(struct file *filp, struct kobject *kobj,
> +			       struct bin_attribute *attr,
> +			       char *buf, loff_t off, size_t count)
> +{
> +	struct i2c_client *client = to_i2c_client(container_of(kobj,
> +				struct device, kobj));
> +	struct optoe_data *optoe = i2c_get_clientdata(client);
> +
> +	return optoe_read_write(optoe, buf, off, count, OPTOE_WRITE_OP);
> +}
> +
> +static int optoe_remove(struct i2c_client *client)
> +{
> +	struct optoe_data *optoe;
> +
> +	optoe = i2c_get_clientdata(client);
> +	sysfs_remove_group(&client->dev.kobj, &optoe->attr_group);
> +	sysfs_remove_bin_file(&client->dev.kobj, &optoe->bin);
> +
> +	if (optoe->num_addresses == 2)
> +		i2c_unregister_device(optoe->client[1]);
> +
> +#ifdef EEPROM_CLASS
> +	eeprom_device_unregister(optoe->eeprom_dev);
> +#endif
> +
> +	kfree(optoe->writebuf);
> +	kfree(optoe);
> +	return 0;
> +}
> +
> +static ssize_t dev_class_show(struct device *dev,
> +			      struct device_attribute *dattr, char *buf)
> +{
> +	struct i2c_client *client = to_i2c_client(dev);
> +	struct optoe_data *optoe = i2c_get_clientdata(client);
> +	ssize_t count;
> +
> +	mutex_lock(&optoe->lock);
> +	count = sprintf(buf, "%d\n", optoe->dev_class);
> +	mutex_unlock(&optoe->lock);
> +
> +	return count;
> +}
> +
> +static ssize_t dev_class_store(struct device *dev,
> +			       struct device_attribute *attr,
> +			       const char *buf, size_t count)
> +{
> +	struct i2c_client *client = to_i2c_client(dev);
> +	struct optoe_data *optoe = i2c_get_clientdata(client);
> +	int dev_class;
> +
> +	/*
> +	 * dev_class is actually the number of i2c addresses used, thus
> +	 * legal values are "1" (QSFP class) and "2" (SFP class)
> +	 */
> +
> +	if (kstrtoint(buf, 0, &dev_class) != 0 ||
> +	    dev_class < 1 || dev_class > 2)
> +		return -EINVAL;
> +
> +	mutex_lock(&optoe->lock);
> +	optoe->dev_class = dev_class;
> +	mutex_unlock(&optoe->lock);
> +
> +	return count;
> +}
> +
> +/*
> + * if using the EEPROM CLASS driver, we don't report a port_name,
> + * the EEPROM CLASS driver handles that.  Hence all this code is
> + * only compiled if we are NOT using the EEPROM CLASS driver.
> + */
> +#ifndef EEPROM_CLASS
> +
> +static ssize_t port_name_show(struct device *dev,
> +			      struct device_attribute *dattr, char *buf)
> +{
> +	struct i2c_client *client = to_i2c_client(dev);
> +	struct optoe_data *optoe = i2c_get_clientdata(client);
> +	ssize_t count;
> +
> +	mutex_lock(&optoe->lock);
> +	count = sprintf(buf, "%s\n", optoe->port_name);
> +	mutex_unlock(&optoe->lock);
> +
> +	return count;
> +}
> +
> +static ssize_t port_name_store(struct device *dev,
> +			       struct device_attribute *attr,
> +			       const char *buf, size_t count)
> +{
> +	struct i2c_client *client = to_i2c_client(dev);
> +	struct optoe_data *optoe = i2c_get_clientdata(client);
> +	char port_name[MAX_PORT_NAME_LEN];
> +
> +	/* no checking, this value is not used except by port_name_show */
> +
> +	if (sscanf(buf, "%19s", port_name) != 1)
> +		return -EINVAL;
> +
> +	mutex_lock(&optoe->lock);
> +	strcpy(optoe->port_name, port_name);
> +	mutex_unlock(&optoe->lock);
> +
> +	return count;
> +}
> +
> +static DEVICE_ATTR_RW(port_name);
> +#endif  /* if NOT defined EEPROM_CLASS, the common case */
> +
> +static DEVICE_ATTR_RW(dev_class);
> +
> +static struct attribute *optoe_attrs[] = {
> +#ifndef EEPROM_CLASS
> +	&dev_attr_port_name.attr,
> +#endif
> +	&dev_attr_dev_class.attr,
> +	NULL,
> +};
> +
> +static struct attribute_group optoe_attr_group = {
> +	.attrs = optoe_attrs,
> +};
> +
> +static int optoe_probe(struct i2c_client *client,
> +		       const struct i2c_device_id *id)
> +{
> +	int err;
> +	int use_smbus = 0;
> +	struct optoe_platform_data chip;
> +	struct optoe_data *optoe;
> +	int num_addresses;
> +	char port_name[MAX_PORT_NAME_LEN];
> +
> +	if (client->addr != 0x50) {
> +		dev_dbg(&client->dev, "probe, bad i2c addr: 0x%x\n",
> +			client->addr);
> +		err = -EINVAL;
> +		goto exit;
> +	}
> +
> +	if (client->dev.platform_data) {
> +		chip = *(struct optoe_platform_data *)client->dev.platform_data;
> +		/* take the port name from the supplied platform data */
> +#ifdef EEPROM_CLASS
> +		strncpy(port_name, chip.eeprom_data->label, MAX_PORT_NAME_LEN);
> +#else
> +		memcpy(port_name, chip.port_name, MAX_PORT_NAME_LEN);
> +#endif
> +		dev_dbg(&client->dev,
> +			"probe, chip provided, flags:0x%x; name: %s\n",
> +			chip.flags, client->name);
> +	} else {
> +		if (!id->driver_data) {
> +			err = -ENODEV;
> +			goto exit;
> +		}
> +		dev_dbg(&client->dev, "probe, building chip\n");
> +		strcpy(port_name, "uninitialized");
> +		chip.flags = 0;
> +#ifdef EEPROM_CLASS
> +		chip.eeprom_data = NULL;
> +#endif
> +	}
> +
> +	/* Use I2C operations unless we're stuck with SMBus extensions. */
> +	if (!i2c_check_functionality(client->adapter, I2C_FUNC_I2C)) {
> +		if (i2c_check_functionality(client->adapter,
> +				I2C_FUNC_SMBUS_READ_I2C_BLOCK)) {
> +			use_smbus = I2C_SMBUS_I2C_BLOCK_DATA;
> +		} else if (i2c_check_functionality(client->adapter,
> +				I2C_FUNC_SMBUS_READ_WORD_DATA)) {
> +			use_smbus = I2C_SMBUS_WORD_DATA;
> +		} else if (i2c_check_functionality(client->adapter,
> +				I2C_FUNC_SMBUS_READ_BYTE_DATA)) {
> +			use_smbus = I2C_SMBUS_BYTE_DATA;
> +		} else {
> +			err = -EPFNOSUPPORT;
> +			goto exit;
> +		}
> +	}
> +
> +	optoe = kzalloc(sizeof(*optoe), GFP_KERNEL);
> +	if (!optoe) {
> +		err = -ENOMEM;
> +		goto exit;
> +	}
> +
> +	mutex_init(&optoe->lock);
> +
> +	/* determine whether this is a one-address or two-address module */
> +	if ((strcmp(client->name, "optoe1") == 0) ||
> +	    (strcmp(client->name, "sff8436") == 0)) {
> +		/* one-address (eg QSFP) family */
> +		optoe->dev_class = ONE_ADDR;
> +		chip.byte_len = ONE_ADDR_EEPROM_SIZE;
> +		num_addresses = 1;
> +	} else if ((strcmp(client->name, "optoe2") == 0) ||
> +		   (strcmp(client->name, "24c04") == 0)) {
> +		/* SFP family */
> +		optoe->dev_class = TWO_ADDR;
> +		chip.byte_len = TWO_ADDR_EEPROM_SIZE;
> +		num_addresses = 2;
> +	} else {     /* those were the only two choices */
> +		err = -EINVAL;
> +		goto exit_kfree;
> +	}
> +
> +	dev_dbg(&client->dev, "dev_class: %d\n", optoe->dev_class);
> +	optoe->use_smbus = use_smbus;
> +	optoe->chip = chip;
> +	optoe->num_addresses = num_addresses;
> +	memcpy(optoe->port_name, port_name, MAX_PORT_NAME_LEN);
> +
> +	/*
> +	 * Export the EEPROM bytes through sysfs, since that's convenient.
> +	 * By default, only root should see the data (maybe passwords etc)
> +	 */
> +	sysfs_bin_attr_init(&optoe->bin);
> +	optoe->bin.attr.name = "eeprom";
> +	optoe->bin.attr.mode = 0444;
> +	optoe->bin.read = optoe_bin_read;
> +	optoe->bin.size = chip.byte_len;
> +
> +	if (!use_smbus ||
> +	    i2c_check_functionality(client->adapter,
> +				    I2C_FUNC_SMBUS_WRITE_I2C_BLOCK) ||
> +	    i2c_check_functionality(client->adapter,
> +				    I2C_FUNC_SMBUS_WRITE_WORD_DATA) ||
> +	    i2c_check_functionality(client->adapter,
> +				    I2C_FUNC_SMBUS_WRITE_BYTE_DATA)) {
> +		/*
> +		 * NOTE: AN-2079
> +		 * Finisar recommends that the host implement 1 byte writes
> +		 * only since this module only supports 32 byte page boundaries.
> +		 * 2 byte writes are acceptable for PE and Vout changes per
> +		 * Application Note AN-2071.
> +		 */
> +		unsigned int write_max = 1;
> +
> +		optoe->bin.write = optoe_bin_write;
> +		optoe->bin.attr.mode |= 0200;
> +
> +		if (write_max > io_limit)
> +			write_max = io_limit;
> +		if (use_smbus && write_max > I2C_SMBUS_BLOCK_MAX)
> +			write_max = I2C_SMBUS_BLOCK_MAX;
> +		optoe->write_max = write_max;
> +
> +		/* buffer (data + address at the beginning) */
> +		optoe->writebuf = kmalloc(write_max + 2, GFP_KERNEL);
> +		if (!optoe->writebuf) {
> +			err = -ENOMEM;
> +			goto exit_kfree;
> +		}
> +	} else {
> +		dev_warn(&client->dev,
> +			 "cannot write due to controller restrictions.\n");
> +	}
> +
> +	optoe->client[0] = client;
> +
> +	/* SFF-8472 spec requires that the second I2C address be 0x51 */
> +	if (num_addresses == 2) {
> +		optoe->client[1] = i2c_new_dummy(client->adapter, 0x51);
> +		if (!optoe->client[1]) {
> +			dev_err(&client->dev, "address 0x51 unavailable\n");
> +			err = -EADDRINUSE;
> +			goto err_struct;
> +		}
> +	}
> +
> +	/* create the sysfs eeprom file */
> +	err = sysfs_create_bin_file(&client->dev.kobj, &optoe->bin);
> +	if (err)
> +		goto err_struct;
> +
> +	optoe->attr_group = optoe_attr_group;
> +
> +	err = sysfs_create_group(&client->dev.kobj, &optoe->attr_group);
> +	if (err) {
> +		dev_err(&client->dev, "failed to create sysfs attribute group.\n");
> +		goto err_struct;
> +	}
> +
> +#ifdef EEPROM_CLASS
> +	optoe->eeprom_dev = eeprom_device_register(&client->dev,
> +						   chip.eeprom_data);
> +	if (IS_ERR(optoe->eeprom_dev)) {
> +		dev_err(&client->dev, "error registering eeprom device.\n");
> +		err = PTR_ERR(optoe->eeprom_dev);
> +		goto err_sysfs_cleanup;
> +	}
> +#endif
> +
> +	i2c_set_clientdata(client, optoe);
> +
> +	dev_info(&client->dev, "%zu byte %s EEPROM, %s\n",
> +		 optoe->bin.size, client->name,
> +		 optoe->bin.write ? "read/write" : "read-only");
> +
> +	if (use_smbus == I2C_SMBUS_WORD_DATA ||
> +	    use_smbus == I2C_SMBUS_BYTE_DATA) {
> +		dev_notice(&client->dev,
> +			"Falling back to %s reads, performance will suffer\n",
> +			use_smbus == I2C_SMBUS_WORD_DATA ? "word" : "byte");
> +	}
> +
> +	return 0;
> +
> +#ifdef EEPROM_CLASS
> +err_sysfs_cleanup:
> +	sysfs_remove_group(&client->dev.kobj, &optoe->attr_group);
> +	sysfs_remove_bin_file(&client->dev.kobj, &optoe->bin);
> +#endif
> +
> +err_struct:
> +	if (num_addresses == 2) {
> +		if (optoe->client[1])
> +			i2c_unregister_device(optoe->client[1]);
> +	}
> +
> +	kfree(optoe->writebuf);
> +exit_kfree:
> +	kfree(optoe);
> +exit:
> +	dev_dbg(&client->dev, "probe error %d\n", err);
> +
> +	return err;
> +}
> +
> +/*-------------------------------------------------------------------------*/
> +
> +static struct i2c_driver optoe_driver = {
> +	.driver = {
> +		.name = "optoe",
> +		.owner = THIS_MODULE,
> +	},
> +	.probe = optoe_probe,
> +	.remove = optoe_remove,
> +	.id_table = optoe_ids,
> +};
> +
> +static int __init optoe_init(void)
> +{
> +	if (!io_limit) {
> +		pr_err("optoe: io_limit must not be 0!\n");
> +		return -EINVAL;
> +	}
> +
> +	io_limit = rounddown_pow_of_two(io_limit);
> +	return i2c_add_driver(&optoe_driver);
> +}
> +module_init(optoe_init);
> +
> +static void __exit optoe_exit(void)
> +{
> +	i2c_del_driver(&optoe_driver);
> +}
> +module_exit(optoe_exit);
> +
> +MODULE_DESCRIPTION("Driver for optical transceiver (SFP, QSFP, ...) EEPROMs");
> +MODULE_AUTHOR("DON BOLLINGER <don@thebollingers.org>");
> +MODULE_LICENSE("GPL");
> 
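As a reading aid for the quoted optoe_translate_offset(), here is a minimal
user-space sketch of the same linear-offset mapping. It is not the driver
code: PAGE_BYTES, struct xlate and translate() are hypothetical names chosen
for illustration, and the model assumes only what the quoted comment states
(128-byte pages, a second client at 0x51 for two-address modules).

```c
#include <assert.h>

#define PAGE_BYTES 128	/* hypothetical stand-in for OPTOE_PAGE_SIZE */

enum dev_class { ONE_ADDR = 1, TWO_ADDR = 2 };

/* Result of translating a linear EEPROM offset (illustrative type). */
struct xlate {
	int client;		/* 0 = i2c addr 0x50, 1 = i2c addr 0x51 */
	unsigned int page;	/* value for the page select register */
	unsigned int offset;	/* 0..255 within the selected client */
};

static struct xlate translate(enum dev_class cls, long linear)
{
	struct xlate x = { 0, 0, 0 };

	assert(linear >= 0);

	/* Two-address (SFP style): offsets above 255 live on client 1. */
	if (cls == TWO_ADDR && linear > 255) {
		x.client = 1;
		linear -= 256;
	}

	/* Lower half: no paging, offset is used as-is. */
	if (linear < PAGE_BYTES) {
		x.offset = (unsigned int)linear;
		return x;
	}

	/* Upper half: each 128-byte block selects one page. */
	x.page = (unsigned int)(linear >> 7) - 1;
	x.offset = PAGE_BYTES + (unsigned int)(linear & 0x7f);
	return x;
}
```

For example, under these assumptions linear offset 300 on a one-address
module maps to page 1, offset 172, while linear offset 256 on a two-address
module maps to client 1, page 0, offset 0.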

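The chunk loop in the quoted optoe_read_write() is meant to split a request
so that no single transfer crosses a 128-byte boundary. A minimal sketch of
that splitting rule, written as plain user-space C, might look like the
following. CHUNK_BYTES, chunk_len_at() and count_chunks() are hypothetical
names for illustration and do no I/O.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_BYTES 128	/* hypothetical stand-in for OPTOE_PAGE_SIZE */

/*
 * Bytes that may be transferred starting at linear position pos without
 * crossing the next 128-byte boundary, capped at the bytes remaining.
 */
static size_t chunk_len_at(long pos, size_t remaining)
{
	size_t room;

	assert(pos >= 0);
	room = CHUNK_BYTES - (size_t)(pos % CHUNK_BYTES);
	return remaining < room ? remaining : room;
}

/* Walk a request the way the loop would and count the transfers. */
static int count_chunks(long off, size_t len)
{
	int n = 0;

	while (len > 0) {
		size_t step = chunk_len_at(off, len);

		off += (long)step;
		len -= step;
		n++;
	}
	return n;
}
```

With these definitions, a 100-byte request at offset 200 is split into a
56-byte transfer (up to offset 255) followed by a 44-byte transfer starting
at offset 256.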
^ permalink raw reply

* Re: net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets
From: Andrei Vagin @ 2018-06-11 18:35 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: David S . Miller, Eric Dumazet, Linux NetDev, Pavel Emelyanov
In-Reply-To: <CAHo-OoycbdoMO7aRW23-0B+Ev7Ow=YXy3uHmrx7FOKf2PXc4hA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1872 bytes --]

Cc: Pavel

On Fri, Jun 08, 2018 at 03:07:30AM -0700, Maciej Żenczykowski wrote:
> I think we probably need to make sk->sk_reuse back into a boolean.
> (ie. eliminate SK_FORCE_REUSE)
> 
> Then add a new tcp/udp sk->ignore_bind_conflicts boolean setting...
> (ie. not just for tcp, but sol_socket)  [or perhaps SO_REPAIR,
> sk->repair or something]
> 
> What I'm not certain of is exactly what sorts of conflicts it should ignore...
> all?  probably not, still seems utterly wrong to allow creation of 2 connected
> tcp sockets with identical 5-tuples.

It is required when we are restoring i_b_c sockets on the server side.  In
this case, they all have the same source address as the listening socket.

To restore these sockets, we need to be able to create a listening socket
and all i_b_c sockets and bind them all to the same source address.

BTW: Here is an example of how tcp_repair works:
https://github.com/avagin/tcp-repair/blob/master/tcp-constructor.c

> 
> Would it only ignore conflicts against other i_b_c sockets?
> ie. set it on all sockets as we're repairing, then clear it on them
> all once we're done?

TCP_REPAIR (which sets SK_FORCE_REUSE) is used to restore only i_b_c
sockets. SK_FORCE_REUSE is needed to ignore bind conflicts for repaired
sockets. It ignores conflicts against other i_b_c and listening sockets.

The current idea is that CRIU will restore listening sockets first, and
then it will restore i_b_c sockets.

Please take a look at the attached patch.
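To make the intended behaviour concrete, here is a heavily simplified
user-space model of the two flags. It is only a sketch, not kernel code:
struct sock_model, set_repair() and may_bind() are hypothetical names, and
the conflict rule ignores addresses, SO_REUSEPORT and the bind-bucket fast
paths. It assumes the usual SO_REUSEADDR rule (both sockets must set it and
the current owner must not be listening).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of a socket's reuse state. */
struct sock_model {
	bool reuse;		/* models sk_reuse (SO_REUSEADDR) */
	bool force_reuse;	/* models sk_force_reuse (repair mode) */
	bool listening;		/* models sk_state == TCP_LISTEN */
};

/* Models TCP_REPAIR on/off with the split flags: reuse is untouched. */
static void set_repair(struct sock_model *sk, bool on)
{
	assert(sk != NULL);
	sk->force_reuse = on;
}

/* True if sk may bind to an address/port already held by owner. */
static bool may_bind(const struct sock_model *sk,
		     const struct sock_model *owner)
{
	if (sk->force_reuse)
		return true;	/* repaired sockets ignore conflicts */
	return sk->reuse && owner->reuse && !owner->listening;
}
```

In this model a socket with SO_REUSEADDR set can be bound next to a
listening socket only while it is in repair mode, and leaving repair mode
does not clear the SO_REUSEADDR value that was set beforehand.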

> 
> and ignore all the fast caching when checking conflicts for an i_b_c socket?
> 
> For CRIU is it safe to assume we're restoring an entire namespace into
> a new namespace?

No. It isn't. CRIU can restore processes in an existing network namespace.

> 
> Could we perhaps instead allow a new namespace to ignore bind conflicts until
> we flip it into enforcing mode?

No, we could not.


[-- Attachment #2: 0001-net-split-sk_reuse-into-sk_reuse-and-sk_force_reuse.patch --]
[-- Type: text/plain, Size: 4034 bytes --]

From 990baa56993827ae6f4441cf078eddf73389d6ee Mon Sep 17 00:00:00 2001
From: Andrei Vagin <avagin@openvz.org>
Date: Fri, 8 Jun 2018 23:27:46 -0700
Subject: [PATCH] net: split sk_reuse into sk_reuse and sk_force_reuse

Currently sk_reuse can have three values: SK_NO_REUSE, SK_CAN_REUSE,
SK_FORCE_REUSE. SK_CAN_REUSE is set by SO_REUSEADDR.  SK_FORCE_REUSE is
used to ignore bind conflicts for sockets in the repair mode.

This patch makes sk->sk_reuse back into a boolean and adds
sk->sk_force_reuse to track SK_FORCE_REUSE separately.

Recently there were changes that prohibit changing
SO_REUSEADDR/SO_REUSEPORT on bound sockets, and now it is impossible to
set the original values of these parameters for restored (repaired) sockets.

With the introduced changes, the tcp_repair mode doesn't affect sk_reuse, so
it is possible to set its value before switching a socket into the
repair mode.

Fixes: f396922d862a ("net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets")
Signed-off-by: Andrei Vagin <avagin@openvz.org>
---
 include/net/sock.h              | 13 ++++---------
 net/ipv4/inet_connection_sock.c |  2 +-
 net/ipv4/tcp.c                  |  4 ++--
 3 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index b3b75419eafe..8ad19286ab9e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -130,6 +130,7 @@ typedef __u64 __bitwise __addrpair;
  *	@skc_family: network address family
  *	@skc_state: Connection state
  *	@skc_reuse: %SO_REUSEADDR setting
+ *	@skc_force_reuse: ignore bind conflicts
  *	@skc_reuseport: %SO_REUSEPORT setting
  *	@skc_bound_dev_if: bound device index if != 0
  *	@skc_bind_node: bind hash linkage for various protocol lookup tables
@@ -174,7 +175,8 @@ struct sock_common {
 
 	unsigned short		skc_family;
 	volatile unsigned char	skc_state;
-	unsigned char		skc_reuse:4;
+	unsigned char		skc_reuse:1;
+	unsigned char		skc_force_reuse:1;
 	unsigned char		skc_reuseport:1;
 	unsigned char		skc_ipv6only:1;
 	unsigned char		skc_net_refcnt:1;
@@ -339,6 +341,7 @@ struct sock {
 #define sk_family		__sk_common.skc_family
 #define sk_state		__sk_common.skc_state
 #define sk_reuse		__sk_common.skc_reuse
+#define sk_force_reuse		__sk_common.skc_force_reuse
 #define sk_reuseport		__sk_common.skc_reuseport
 #define sk_ipv6only		__sk_common.skc_ipv6only
 #define sk_net_refcnt		__sk_common.skc_net_refcnt
@@ -502,16 +505,8 @@ enum sk_pacing {
 #define rcu_dereference_sk_user_data(sk)	rcu_dereference(__sk_user_data((sk)))
 #define rcu_assign_sk_user_data(sk, ptr)	rcu_assign_pointer(__sk_user_data((sk)), ptr)
 
-/*
- * SK_CAN_REUSE and SK_NO_REUSE on a socket mean that the socket is OK
- * or not whether his port will be reused by someone else. SK_FORCE_REUSE
- * on a socket means that the socket will reuse everybody else's port
- * without looking at the other's sk_reuse value.
- */
-
 #define SK_NO_REUSE	0
 #define SK_CAN_REUSE	1
-#define SK_FORCE_REUSE	2
 
 int sk_set_peek_off(struct sock *sk, int val);
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 33a88e045efd..2ac1c591b60c 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -306,7 +306,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 		goto fail_unlock;
 tb_found:
 	if (!hlist_empty(&tb->owners)) {
-		if (sk->sk_reuse == SK_FORCE_REUSE)
+		if (sk->sk_force_reuse)
 			goto success;
 
 		if ((tb->fastreuse > 0 && reuse) ||
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 2741953adaba..70bfdd5a2fc4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2810,11 +2810,11 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 			err = -EPERM;
 		else if (val == 1) {
 			tp->repair = 1;
-			sk->sk_reuse = SK_FORCE_REUSE;
+			sk->sk_force_reuse = 1;
 			tp->repair_queue = TCP_NO_QUEUE;
 		} else if (val == 0) {
 			tp->repair = 0;
-			sk->sk_reuse = SK_NO_REUSE;
+			sk->sk_force_reuse = 0;
 			tcp_send_window_probe(sk);
 		} else
 			err = -EINVAL;
-- 
2.17.0


^ permalink raw reply related

* Re: [PULL] vhost: cleanups and fixes
From: Linus Torvalds @ 2018-06-11 18:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: KVM list, Network Development, Linux Kernel Mailing List,
	Bjorn Andersson, Andrew Morton, virtualization
In-Reply-To: <CA+55aFzrPgnd7hRPrkeV+jX-MSwOZf7T4wKxz66Lk4oub3PZsw@mail.gmail.com>

On Mon, Jun 11, 2018 at 11:32 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So now we apparently do a GFP_KERNEL allocation inside the mm zone
> lock, which is broken on just _so_ many levels.

Oh, I see the comment about how it doesn't actually do an allocation
at all because it's a single-entry.

Still too damn ugly to live, and much too fragile. No way in hell do
we even _hint_ at a GFP_KERNEL when we're inside a context that can't
do any memory allocation at all.

Plus I'm not convinced it's a "no allocation" path even despite that
comment, because it also does a "dma_map_page()" etc, which can cause
allocations to do the dma mapping thing afaik. No?

Maybe there's some reason why that doesn't happen either, but
basically this whole callchain looks *way* too complicated to be used
under a core VM spinlock.

                Linus

^ permalink raw reply

* [bpf PATCH] bpf: selftest fix for sockmap
From: John Fastabend @ 2018-06-11 18:47 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev

In selftest test_maps the sockmap test case attempts to add a socket
in listening state to the sockmap. This is no longer a valid operation
so it fails as expected. However, the test wrongly reports this as an
error now. Fix the test to avoid adding sockets in listening state.

Fixes: 945ae430aa44 ("bpf: sockmap only allow ESTABLISHED sock state")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 tools/testing/selftests/bpf/test_maps.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index 6c25334..9fed5f0 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -564,7 +564,7 @@ static void test_sockmap(int tasks, void *data)
 	}
 
 	/* Test update without programs */
-	for (i = 0; i < 6; i++) {
+	for (i = 2; i < 6; i++) {
 		err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY);
 		if (err) {
 			printf("Failed noprog update sockmap '%i:%i'\n",
@@ -727,7 +727,7 @@ static void test_sockmap(int tasks, void *data)
 	}
 
 	/* Test map update elem afterwards fd lives in fd and map_fd */
-	for (i = 0; i < 6; i++) {
+	for (i = 2; i < 6; i++) {
 		err = bpf_map_update_elem(map_fd_rx, &i, &sfd[i], BPF_ANY);
 		if (err) {
 			printf("Failed map_fd_rx update sockmap %i '%i:%i'\n",
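
Both loops now start at index 2, presumably because the first two
entries of sfd[] are the listening sockets that the commit message says
can no longer be added. A minimal userspace sketch of how such a state
could be detected explicitly, assuming a Linux system where
SO_ACCEPTCONN is available; the helper name is_listening() is
hypothetical and is not part of test_maps.c:

```c
#include <sys/socket.h>

/* Hypothetical helper, not part of test_maps.c.
 * Returns 1 if fd is a listening socket, 0 if it is not,
 * and -1 if getsockopt() fails (errno is set by the call).
 */
static int is_listening(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, SOL_SOCKET, SO_ACCEPTCONN, &val, &len) < 0)
		return -1;

	return val != 0;
}
```

A test could call such a helper before bpf_map_update_elem() and skip
sockets for which it returns 1, rather than hard-coding the start index.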

^ permalink raw reply related

* Re: [PATCH net] failover: eliminate callback hell
From: Siwei Liu @ 2018-06-11 18:56 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stephen Hemminger, Michael S. Tsirkin, Jiri Pirko, kys, haiyangz,
	David Miller, Samudrala, Sridhar, Netdev, Stephen Hemminger
In-Reply-To: <20180608182933.5a8150e7@cakuba.netronome.com>

On Fri, Jun 8, 2018 at 6:29 PM, Jakub Kicinski <kubakici@wp.pl> wrote:
> On Fri, 8 Jun 2018 16:44:12 -0700, Siwei Liu wrote:
>> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> >> that flag, as well as the 1-netdev model, is to have a means to
>> >> inherit the interface name from the VF, and to eliminate playing hacks
>> >> around renaming devices, customizing udev rules and et al. Why
>> >> inheriting VF's name important? To allow existing config/setup around
>> >> VF continues to work across kernel feature upgrade. Most of network
>> >> config files in all distros are based on interface names. Few are MAC
>> >> address based but making lower slaves hidden would cover the rest. And
>> >> most importantly, preserving the same level of user experience as
>> >> using raw VF interface once getting all ndo_ops and ethtool_ops
>> >> exposed. This is essential to realize transparent live migration that
>> >> users dont have to learn and be aware of the undertaken.
>> >
>> > Inheriting the VF name will fail in the migration scenario.
>> > It is perfectly reasonable to migrate a guest to another machine where
>> > the VF PCI address is different. And since current udev/systemd model
>> > is to base network device name off of PCI address, the device will change
>> > name when guest is migrated.
>> >
>> The scenario of having VF on a different PCI address on post migration
>> is essentially equal to plugging in a new NIC. Why it has to pair with
>> the original PV? A separate PV device should be in place to pair the
>> new VF.
>
> IMHO it may be a better idea to look at the VF as acceleration for the
> PV rather than PV a migration vehicle from the VF.  Hence we should

I'm basically talking about two use cases, not about solutions or
implementations specifically. As said, the one I'm looking into needs
to migrate a pre-failover VF setup to the 1-netdev failover model in a
transparent manner. There's no point in switching the PCI address back
and forth in the backend to set where to bind the PV or the VF, as you
have no way to predict which guest kernel will be running until it's
fully loaded. Supporting a VF at a new location binding to an existing
PV might be nice, but it's not directly relevant to those who need only
migration itself and not this side feature.

Having said that, while I somewhat agree both use cases should each
have their own place in the picture, I don't think judging one better
than the other is logical IMHO.

> continue to follow the naming of PV, like the current implementation
> does implicitly by linking to PV's struct device.

The current implementation may only work with new userspace; even so,
the eth0/eth0nsby naming is not consistently persisted due to races in
bus probing. The naming part should be fixed.

-Siwei

^ permalink raw reply

* Re: net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets
From: Andrei Vagin @ 2018-06-11 18:57 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: Maciej Żenczykowski, David S . Miller, Eric Dumazet, netdev
In-Reply-To: <20180603174705.51802-1-zenczykowski@gmail.com>

On Sun, Jun 03, 2018 at 10:47:05AM -0700, Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski <maze@google.com>
> 
> It is not safe to do so because such sockets are already in the
> hash tables and changing these options can result in invalidating
> the tb->fastreuse(port) caching.
> 
> This can have later far reaching consequences wrt. bind conflict checks
> which rely on these caches (for optimization purposes).
> 
> Not to mention that you can currently end up with two identical
> non-reuseport listening sockets bound to the same local ip:port
> by clearing reuseport on them after they've already both been bound.
> 
> There is unfortunately no EISBOUND error or anything similar,
> and EISCONN seems to be misleading for a bound-but-not-connected
> socket, so use EUCLEAN 'Structure needs cleaning' which AFAICT
> is the closest you can get to meaning 'socket in bad state'.
> (although perhaps EINVAL wouldn't be a bad choice either?)
> 
> This does unfortunately run the risk of breaking buggy
> userspace programs...
> 
> Signed-off-by: Maciej Żenczykowski <maze@google.com>
> Cc: Eric Dumazet <edumazet@google.com>
> 
> Change-Id: I77c2b3429b2fdf42671eee0fa7a8ba721c94963b
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> ---
>  net/core/sock.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 435a0ba85e52..feca4c98f8a0 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -728,9 +728,22 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
>  			sock_valbool_flag(sk, SOCK_DBG, valbool);
>  		break;
>  	case SO_REUSEADDR:
> -		sk->sk_reuse = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
> +		val = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
> +		if ((sk->sk_family == PF_INET || sk->sk_family == PF_INET6) &&
> +		    inet_sk(sk)->inet_num &&
> +		    (sk->sk_reuse != val)) {
> +			ret = (sk->sk_state == TCP_ESTABLISHED) ? -EISCONN : -EUCLEAN;

There are a few more states, like TCP_LAST_ACK and TCP_CLOSE_WAIT, which
mean that a socket is connected.

Actually, I don't see any reason to return two different values here.

> +			break;
> +		}
> +		sk->sk_reuse = val;
>  		break;
>  	case SO_REUSEPORT:
> +		if ((sk->sk_family == PF_INET || sk->sk_family == PF_INET6) &&
> +		    inet_sk(sk)->inet_num &&
> +		    (sk->sk_reuseport != valbool)) {
> +			ret = (sk->sk_state == TCP_ESTABLISHED) ? -EISCONN : -EUCLEAN;
> +			break;
> +		}
>  		sk->sk_reuseport = valbool;
>  		break;
>  	case SO_TYPE:
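
The suggestion above, that one return value would do, can be sketched
as a standalone check. This is only an illustration under assumptions:
reuse_change_check() and its parameters are hypothetical names, it
models just the bound-socket condition from the quoted hunk, and the
choice of -EUCLEAN simply follows the patch (the commit message itself
floats -EINVAL as an alternative).

```c
#include <errno.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value; fallback for non-Linux builds */
#endif

/* Hypothetical userspace model of the check in the quoted hunk.
 * bound_port: local port the socket is bound to, 0 if unbound
 *             (stands in for inet_sk(sk)->inet_num)
 * cur, req:   current and requested option values
 * Returns 0 if the change may proceed, or a single error code for
 * every bound socket, regardless of its TCP state.
 */
static int reuse_change_check(unsigned short bound_port, int cur, int req)
{
	if (bound_port && cur != req)
		return -EUCLEAN;

	return 0;
}
```

With a single error code there is no need to enumerate which TCP states
count as "connected", which sidesteps the TCP_LAST_ACK and
TCP_CLOSE_WAIT question entirely.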

^ permalink raw reply

* Re: [Xen-devel] [PATCH] xen/netfront: raise max number of slots in xennet_get_responses()
From: Boris Ostrovsky @ 2018-06-11 18:59 UTC (permalink / raw)
  To: Juergen Gross, linux-kernel, xen-devel, netdev; +Cc: davem
In-Reply-To: <20180611075742.1691-1-jgross@suse.com>

On 06/11/2018 03:57 AM, Juergen Gross wrote:
> The max number of slots used in xennet_get_responses() is set to
> MAX_SKB_FRAGS + (rx->status <= RX_COPY_THRESHOLD).
>
> In old kernel-xen MAX_SKB_FRAGS was 18, while nowadays it is 17. This
> difference is resulting in frequent messages "too many slots" and a
> reduced network throughput for some workloads (factor 10 below that of
> a kernel-xen based guest).
>
> Replacing MAX_SKB_FRAGS by XEN_NETIF_NR_SLOTS_MIN for calculation of
> the max number of slots to use solves that problem (tests showed no
> more messages "too many slots" and throughput was as high as with the
> kernel-xen based guest system).
>
> Signed-off-by: Juergen Gross <jgross@suse.com>

Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

I wonder also whether netfront_tx_slot_available() is meant to be

return (queue->tx.req_prod_pvt - queue->tx.rsp_cons) <
                (NET_TX_RING_SIZE - XEN_NETIF_NR_SLOTS_MIN - 1);

which is the same numeric value but provides a more accurate description
of what is being tested.

-boris


> ---
>  drivers/net/xen-netfront.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index 679da1abd73c..ba411005d829 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -790,7 +790,7 @@ static int xennet_get_responses(struct netfront_queue *queue,
>  	RING_IDX cons = queue->rx.rsp_cons;
>  	struct sk_buff *skb = xennet_get_rx_skb(queue, cons);
>  	grant_ref_t ref = xennet_get_rx_ref(queue, cons);
> -	int max = MAX_SKB_FRAGS + (rx->status <= RX_COPY_THRESHOLD);
> +	int max = XEN_NETIF_NR_SLOTS_MIN + (rx->status <= RX_COPY_THRESHOLD);
>  	int slots = 1;
>  	int err = 0;
>  	unsigned long ret;
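
The remark above that the two forms of the TX bound are numerically
equal can be checked with a small standalone sketch. The constants are
assumptions: MAX_SKB_FRAGS of 17 comes from the commit message, while
XEN_NETIF_NR_SLOTS_MIN of 18, NET_TX_RING_SIZE of 256, and the
"MAX_SKB_FRAGS - 2" form of the existing check are not stated in this
thread and should be verified against the tree. The function names are
hypothetical.

```c
#include <stdbool.h>

/* Assumed values for illustration only; see the note above. */
#define MAX_SKB_FRAGS		17
#define XEN_NETIF_NR_SLOTS_MIN	18
#define NET_TX_RING_SIZE	256

/* Assumed shape of the existing check. */
static bool tx_slot_available_current(unsigned int prod, unsigned int cons)
{
	return (prod - cons) < (NET_TX_RING_SIZE - MAX_SKB_FRAGS - 2);
}

/* The form suggested in the review. */
static bool tx_slot_available_proposed(unsigned int prod, unsigned int cons)
{
	return (prod - cons) <
		(NET_TX_RING_SIZE - XEN_NETIF_NR_SLOTS_MIN - 1);
}
```

Under these assumed values both thresholds come out to 237, so the two
functions agree for every producer/consumer pair, including after the
unsigned ring indices wrap around.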

^ permalink raw reply

* Re: [PATCH] Bluetooth: hci_bcm: Configure SCO routing automatically
From: Rob Herring @ 2018-06-11 19:05 UTC (permalink / raw)
  To: Marcel Holtmann
  Cc: Attila Tőkés, David S. Miller, Mark Rutland,
	Johan Hedberg, Artiom Vaskov, netdev, devicetree,
	linux-kernel@vger.kernel.org, open list:BLUETOOTH DRIVERS
In-Reply-To: <D56466C0-2312-4F0E-AB00-66EF4BEECB1A@holtmann.org>

On Mon, Jun 11, 2018 at 12:19 PM, Marcel Holtmann <marcel@holtmann.org> wrote:
> Hi Rob,
>
>>>>>>> Added support to automatically configure the SCO packet routing at the
>>>>>>> device setup. The SCO packets are used with the HSP / HFP profiles, but in
>>>>>>> some devices (ex. CYW43438) they are routed to a PCM output by default. This
>>>>>>> change allows sending the vendor specific HCI command to configure the SCO
>>>>>>> routing. The parameters of the command are loaded from the device tree.
>>>>>>
>>>>>> Please wrap your commit msg.
>>>>>
>>>>>
>>>>> Sure.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Signed-off-by: Attila Tőkés <attitokes@gmail.com>
>>>>>>> ---
>>>>>>> .../bindings/net/broadcom-bluetooth.txt       |  7 ++
>>>>>>
>>>>>> Please split bindings to separate patch.
>>>>>
>>>>>
>>>>> Ok, I will split this in two.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> drivers/bluetooth/hci_bcm.c                   | 72 +++++++++++++++++++
>>>>>>> 2 files changed, 79 insertions(+)
>>>>>>>
>>>>>>> diff --git
>>>>>>> a/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
>>>>>>> b/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
>>>>>>> index 4194ff7e..aea3a094 100644
>>>>>>> --- a/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
>>>>>>> +++ b/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
>>>>>>> @@ -21,6 +21,12 @@ Optional properties:
>>>>>>> - clocks: clock specifier if external clock provided to the controller
>>>>>>> - clock-names: should be "extclk"
>>>>>>>
>>>>>>> + SCO routing parameters:
>>>>>>> + - sco-routing: 0-3 (PCM, Transport, Codec, I2S)
>>>>>>> + - pcm-interface-rate: 0-4 (128 Kbps - 2048 Kbps)
>>>>>>> + - pcm-frame-type: 0 (short), 1 (long)
>>>>>>> + - pcm-sync-mode: 0 (slave), 1 (master)
>>>>>>> + - pcm-clock-mode: 0 (slave), 1 (master)
>>>>>>
>>>>>> Are these Broadcom specific? Properties need either vendor prefix or
>>>>>> to be documented in a common location. I think these look like the
>>>>>> latter.
>>>>>
>>>>>
>>>>> These will be used as parameters of a vendor specific (Broadcom/Cypress)
>>>>> command configuring the SCO packet routing. See the Write_SCO_PCM_Int_Param
>>>>> command from: http://www.cypress.com/file/298311/download.
>>>>
>>>> The DT should just describe how the h/w is hooked-up. What the s/w has
>>>> to do based on that is the driver's problem which is certainly
>>>> vendor/chip specific, but that is all irrelevant to the binding.
>>>>
>>>>> What would be the property names with a Broadcom / Cypress vendor prefix?
>>>>>
>>>>>   brcm,sco-routing
>>>>>   brcm,pcm-interface-rate
>>>>>   brcm,pcm-frame-type
>>>>>   brcm,pcm-sync-mode
>>>>>   brcm,pcm-clock-mode
>>>>>
>>>>> ?
>>>>
>>>> Yes.
>>>
>>> we can do this. However all pcm-* are optional if you switch the HCI transport. And sco-routing should default to HCI if that is not present. Meaning a driver should actively trying to change this. Nevertheless, it would be good if a driver reads the current settings.
>>>
>>> In theory we could make sco-routing generic, but so many vendors have different modes, that we better keep this vendor specific.
>>
>> Even if vendor specific, the properties for not HCI transport case are
>> still incomplete IMO.
>>
>> By modes, you mean PCM vs. I2S and all the flavors of timings you can
>> have within those or something else? For the former, that's often
>> going to be a process of solving what each end support and if that
>> doesn't work, then IIRC we already have properties for setting
>> modes/timing. All the same issues exist with audio codecs and this is
>> really not any different.
>
> this is what Broadcom uses to configure their PCM transport. So I think for now, we make them brcm, specific and see how that goes. We can always generalize them later if enough chip manufactures provide support for it.

We already have properties doing the same thing defined in
Documentation/devicetree/bindings/sound/simple-card.txt. Use and
extend that. We don't need new properties, especially for something
that is not complete. For example, if I have 2 host ports (every SoC
has at least 2), how do I indicate which one is connected to BT?

Rob

^ permalink raw reply

* Re: [PATCH net] failover: eliminate callback hell
From: Siwei Liu @ 2018-06-11 19:23 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Michael S. Tsirkin, Jiri Pirko, kys, haiyangz, David Miller,
	Samudrala, Sridhar, Netdev, Stephen Hemminger
In-Reply-To: <20180611082253.631219cf@xeon-e3>

On Mon, Jun 11, 2018 at 8:22 AM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Fri, 8 Jun 2018 17:42:21 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Fri, Jun 8, 2018 at 5:02 PM, Stephen Hemminger
>> <stephen@networkplumber.org> wrote:
>> > On Fri, 8 Jun 2018 16:44:12 -0700
>> > Siwei Liu <loseweigh@gmail.com> wrote:
>> >
>> >> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
>> >> <stephen@networkplumber.org> wrote:
>> >> > On Fri, 8 Jun 2018 15:25:59 -0700
>> >> > Siwei Liu <loseweigh@gmail.com> wrote:
>> >> >
>> >> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
>> >> >> <stephen@networkplumber.org> wrote:
>> >> >> > On Wed, 6 Jun 2018 15:30:27 +0300
>> >> >> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >> >> >
>> >> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> >> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:
>> >> >> >> > >The net failover should be a simple library, not a virtual
>> >> >> >> > >object with function callbacks (see callback hell).
>> >> >> >> >
>> >> >> >> > Why just a library? It should do a common things. I think it should be a
>> >> >> >> > virtual object. Looks like your patch again splits the common
>> >> >> >> > functionality into multiple drivers. That is kind of backwards attitude.
>> >> >> >> > I don't get it. We should rather focus on fixing the mess the
>> >> >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> >> >> >> > model.
>> >> >> >>
>> >> >> >> So it seems that at least one benefit for netvsc would be better
>> >> >> >> handling of renames.
>> >> >> >>
>> >> >> >> Question is how can this change to 3-netdev happen?  Stephen is
>> >> >> >> concerned about risk of breaking some userspace.
>> >> >> >>
>> >> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> >> >> >> address, and you said then "why not use existing network namespaces
>> >> >> >> rather than inventing a new abstraction". So how about it then? Do you
>> >> >> >> want to find a way to use namespaces to hide the PV device for netvsc
>> >> >> >> compatibility?
>> >> >> >>
>> >> >> >
>> >> >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
>> >> >> > startups that all demand eth0 always be present. And VF may come and go.
>> >> >> > After this history, there is a strong motivation not to change how kernel
>> >> >> > behaves. Switching to 3 device model would be perceived as breaking
>> >> >> > existing userspace.
>> >> >> >
>> >> >> > With virtio you can  work it out with the distro's yourself.
>> >> >> > There is no pre-existing semantics to deal with.
>> >> >> >
>> >> >> > For the virtio, I don't see the need for IFF_HIDDEN.
>> >> >>
>> >> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> >> >> that flag, as well as the 1-netdev model, is to have a means to
>> >> >> inherit the interface name from the VF, and to eliminate playing hacks
>> >> >> around renaming devices, customizing udev rules and et al. Why
>> >> >> inheriting VF's name important? To allow existing config/setup around
>> >> >> VF continues to work across kernel feature upgrade. Most of network
>> >> >> config files in all distros are based on interface names. Few are MAC
>> >> >> address based but making lower slaves hidden would cover the rest. And
>> >> >> most importantly, preserving the same level of user experience as
>> >> >> using raw VF interface once getting all ndo_ops and ethtool_ops
>> >> >> exposed. This is essential to realize transparent live migration that
>> >> >> users dont have to learn and be aware of the undertaken.
>> >> >
>> >> > Inheriting the VF name will fail in the migration scenario.
>> >> > It is perfectly reasonable to migrate a guest to another machine where
>> >> > the VF PCI address is different. And since current udev/systemd model
>> >> > is to base network device name off of PCI address, the device will change
>> >> > name when guest is migrated.
>> >> >
>> >> The scenario of having VF on a different PCI address on post migration
>> >> is essentially equal to plugging in a new NIC. Why it has to pair with
>> >> the original PV? A separate PV device should be in place to pair the
>> >> new VF.
>> >
>> > The host only guarantees that the PV device will be on the same network.
>> > It does not make any PCI guarantees. The way Windows works is to find
>> > the device based on "serial number" which is an Hyper-V specific attribute
>> > of PCI devices.
>> >
>> > I considered naming off of serial number but that won't work for the
>> > case where PV device is present first and VF arrives later. The serial
>> > number is attribute of VF, not the PV which is there first.
>>
>> I assume the PV can get that information ahead of time before VF
>> arrives? Without it how do you match the device when you see a VF
>> coming with some serial number? Is it possible for PV to get the
>> matching SN even earlier during probe time? Or it has to depend on the
>> presence of vPCI bridge to generate this SN?
>
>
>
> NO. the PV device does not know ahead of time and there are scenario
> where the serial and PCI info can change when it does arrive. These
> are test cases (not something people usually do). Example on WS2016:
>   Guest configured with two or more vswitches and NICs.
>   SR-IOV is not enabled
>
> Later:
>   On Hyper-V console (or Powershell command line) on host SR-IOV
>   is enabled on the second NIC.
>
>   The guest will be notified of new PCI device; the "serial number"
>   will be 1.
>
> If same process is repeated but in this case the first NIC has
> SR-IOV enabled, it will get serial # 1.
>
>
> I agree with Jakub. What you are proposing is backwards. The VF
> must be thought of as a dependent of PV device not vice/versa.

I didn't insist on netvsc moving to the same 1-netdev model, did I? I
understand Hyper-V has its specific design that's hard to get around.

All I said is that transparent live migration and the 1-netdev model
should work for passthrough with virtio as a helper under QEMU. As I
recall, the initial intent was to use virtio as a migration helper
rather than having the VF as an acceleration path. The latter, as far
as I know, is Hyper-V's point of view. I don't know where those side
features come from, or why doing live migration religiously is
backwards.

-Siwei

>
>> >
>> > Your ideas about having the PCI information of the VF form the name
>> > of the failover device have the same problem. The PV device may
>> > be the only one present on boot.
>>
>> Yeah, this is a chicken-egg problem indeed, and that was the reason
>> why I supply the BDF info for PV to name the master interface.
>> However, the ACPI PCI slot needs to depend on the PCI bus enumeration
>> so that can't be predictable.  Would it make sense to only rename when
>> the first time a matching VF appears and PV interface isn't brought
>> up, then the failover master would always stick to the name
>> afterwards? I think it should cover most scenarios as it's usually
>> during boot time (dracut) the VF first appears and the PV interface at
>> the time then shouldn't have been configured yet.
>>
>> -Siwei
>>
>> >
>> >
>> >> > On Azure, the VF maybe removed (by host) at any time and then later
>> >> > reattached. There is no guarantee that VF will show back up at
>> >> > the same synthetic PCI address. It will likely have a different
>> >> > PCI domain value.
>> >>
>> >> This is something QEMU can do and make sure the PCI address is
>> >> consistent after migration.
>> >>
>> >> -Siwei
>> >
>

^ permalink raw reply

* Re: [PATCH net] failover: eliminate callback hell
From: Samudrala, Sridhar @ 2018-06-11 19:34 UTC (permalink / raw)
  To: Michael S. Tsirkin, Stephen Hemminger
  Cc: kys, haiyangz, davem, netdev, Stephen Hemminger
In-Reply-To: <20180611210819-mutt-send-email-mst@kernel.org>

On 6/11/2018 11:10 AM, Michael S. Tsirkin wrote:
> On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote:
>>    * Set permanent and current address of net_failover device
>>      to match the primary.
>>
>>    * Carrier should be marked off before registering device
>>      the net_failover device.
> Sridhar, do we want to address this?
> If yes, could you please take a look at addressing these
> meanwhile, while we keep arguing about making API changes?

Sure. I will submit patches to address these issues raised by Stephen.

^ permalink raw reply

* Re: [PATCH] iwlwifi: pcie: make array prop static, shrinks object size
From: Joe Perches @ 2018-06-11 19:40 UTC (permalink / raw)
  To: Colin King, Greg Kroah-Hartman, Johannes Berg, Emmanuel Grumbach,
	Luca Coelho, Intel Linux Wireless, Kalle Valo, David S . Miller,
	linux-wireless, netdev
  Cc: kernel-janitors, linux-kernel
In-Reply-To: <20180611171537.14597-1-colin.king@canonical.com>

(adding Greg KH)

On Mon, 2018-06-11 at 18:15 +0100, Colin King wrote:
> From: Colin Ian King <colin.king@canonical.com>
> 
> Don't populate the read-only array 'prop' on the stack but
> instead make it static. Makes the object code smaller by 20 bytes:
> 
> Before:
>    text    data     bss     dec     hex filename
>   71659   14614     576   86849   15341 trans.o
> 
> After:
>    text    data     bss     dec     hex filename
>   71479   14774     576   86829   1532d trans.o
> 
> (gcc version 7.3.0 x86_64)
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> ---
>  drivers/net/wireless/intel/iwlwifi/pcie/trans.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
> index 7229991ae70d..c4626ebe5da1 100644
> --- a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
> +++ b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
> @@ -1946,7 +1946,7 @@ static void iwl_trans_pcie_removal_wk(struct work_struct *wk)
>  	struct iwl_trans_pcie_removal *removal =
>  		container_of(wk, struct iwl_trans_pcie_removal, work);
>  	struct pci_dev *pdev = removal->pdev;
> -	char *prop[] = {"EVENT=INACCESSIBLE", NULL};
> +	static char *prop[] = {"EVENT=INACCESSIBLE", NULL};
>  
>  	dev_err(&pdev->dev, "Device gone - attempting removal\n");
>  	kobject_uevent_env(&pdev->dev.kobj, KOBJ_CHANGE, prop);

Now what is happening is that prop is being reloaded
each invocation with the constant addresses of the strings.

It seems the prototype and function for kobject_uevent_env
should change as well to avoid this.

Perhaps this should become:
---
 drivers/net/wireless/intel/iwlwifi/pcie/trans.c | 2 +-
 include/linux/kobject.h                         | 2 +-
 lib/kobject_uevent.c                            | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
index 7229991ae70d..6668a8aad22e 100644
--- a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
+++ b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
@@ -1946,7 +1946,7 @@ static void iwl_trans_pcie_removal_wk(struct work_struct *wk)
 	struct iwl_trans_pcie_removal *removal =
 		container_of(wk, struct iwl_trans_pcie_removal, work);
 	struct pci_dev *pdev = removal->pdev;
-	char *prop[] = {"EVENT=INACCESSIBLE", NULL};
+	static const char * const prop[] = {"EVENT=INACCESSIBLE", NULL};
 
 	dev_err(&pdev->dev, "Device gone - attempting removal\n");
 	kobject_uevent_env(&pdev->dev.kobj, KOBJ_CHANGE, prop);
diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index 7f6f93c3df9c..9f5cf553dd1e 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -217,7 +217,7 @@ extern struct kobject *firmware_kobj;
 
 int kobject_uevent(struct kobject *kobj, enum kobject_action action);
 int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
-			char *envp[]);
+			const char * const envp[]);
 int kobject_synth_uevent(struct kobject *kobj, const char *buf, size_t count);
 
 __printf(2, 3)
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 63d0816ab23b..9107989a0cc8 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -452,7 +452,7 @@ static void zap_modalias_env(struct kobj_uevent_env *env)
  * corresponding error when it fails.
  */
 int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
-		       char *envp_ext[])
+		       const char * const envp_ext[])
 {
 	struct kobj_uevent_env *env;
 	const char *action_string = kobject_actions[action];
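
The effect of the "static const char * const" form proposed above can
be illustrated outside the kernel. In this sketch count_env() is a
hypothetical stand-in for a consumer such as kobject_uevent_env() that
only reads the array. With both levels of const, neither the pointers
nor the strings can be modified, so the compiler is free to place the
whole table in read-only data instead of building it at run time.

```c
#include <stddef.h>

/* Hypothetical read-only consumer of a NULL-terminated string table. */
static size_t count_env(const char * const envp[])
{
	size_t n = 0;

	while (envp[n])
		n++;

	return n;
}

/* Mirrors the shape of the call site in the patch above. */
static size_t removal_env_count(void)
{
	/* Both the pointer array and the strings it points to are const. */
	static const char * const prop[] = { "EVENT=INACCESSIBLE", NULL };

	return count_env(prop);
}
```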

^ permalink raw reply related

* [PATCH net 1/3] hv_netvsc: drop common code until callback model fixed
From: Stephen Hemminger @ 2018-06-11 19:44 UTC (permalink / raw)
  To: kys, haiyangz, sthemmin; +Cc: devel, netdev
In-Reply-To: <20180611194456.8268-1-sthemmin@microsoft.com>

The callback model of handling network failover is not suitable
in the current form.
  1. It was merged without addressing all the review feedback.
  2. It was merged without approval of any of the netvsc maintainers.
  3. Design discussion on how to handle PV/VF fallback is still
     not complete.
  4. IMHO the code model using callbacks is trying to make
     something common which isn't.

Revert the netvsc specific changes for now. Does not impact ongoing
development of failover model for virtio.
Revisit this after simpler library-based failover kernel
routines are extracted.

This reverts
commit 9c6ffbacdb57 ("hv_netvsc: fix error return code in netvsc_probe()")
and
commit 1ff78076d8dd ("netvsc: refactor notifier/event handling code to use the failover framework")

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
---
 drivers/net/hyperv/Kconfig      |   1 -
 drivers/net/hyperv/hyperv_net.h |   2 -
 drivers/net/hyperv/netvsc_drv.c | 224 +++++++++++++++++++++++---------
 3 files changed, 165 insertions(+), 62 deletions(-)

diff --git a/drivers/net/hyperv/Kconfig b/drivers/net/hyperv/Kconfig
index 23a2d145813a..0765d5f61714 100644
--- a/drivers/net/hyperv/Kconfig
+++ b/drivers/net/hyperv/Kconfig
@@ -2,6 +2,5 @@ config HYPERV_NET
 	tristate "Microsoft Hyper-V virtual network driver"
 	depends on HYPERV
 	select UCS2_STRING
-	select FAILOVER
 	help
 	  Select this option to enable the Hyper-V virtual network driver.
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 99d8e7398a5b..1be34d2e3563 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -932,8 +932,6 @@ struct net_device_context {
 	u32 vf_alloc;
 	/* Serial number of the VF to team with */
 	u32 vf_serial;
-
-	struct failover *failover;
 };
 
 /* Per channel data */
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 8eec156418ea..2d4370c94b6e 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -43,7 +43,6 @@
 #include <net/pkt_sched.h>
 #include <net/checksum.h>
 #include <net/ip6_checksum.h>
-#include <net/failover.h>
 
 #include "hyperv_net.h"
 
@@ -1782,6 +1781,46 @@ static void netvsc_link_change(struct work_struct *w)
 	rtnl_unlock();
 }
 
+static struct net_device *get_netvsc_bymac(const u8 *mac)
+{
+	struct net_device *dev;
+
+	ASSERT_RTNL();
+
+	for_each_netdev(&init_net, dev) {
+		if (dev->netdev_ops != &device_ops)
+			continue;	/* not a netvsc device */
+
+		if (ether_addr_equal(mac, dev->perm_addr))
+			return dev;
+	}
+
+	return NULL;
+}
+
+static struct net_device *get_netvsc_byref(struct net_device *vf_netdev)
+{
+	struct net_device *dev;
+
+	ASSERT_RTNL();
+
+	for_each_netdev(&init_net, dev) {
+		struct net_device_context *net_device_ctx;
+
+		if (dev->netdev_ops != &device_ops)
+			continue;	/* not a netvsc device */
+
+		net_device_ctx = netdev_priv(dev);
+		if (!rtnl_dereference(net_device_ctx->nvdev))
+			continue;	/* device is removed */
+
+		if (rtnl_dereference(net_device_ctx->vf_netdev) == vf_netdev)
+			return dev;	/* a match */
+	}
+
+	return NULL;
+}
+
 /* Called when VF is injecting data into network stack.
  * Change the associated network device from VF to netvsc.
  * note: already called with rcu_read_lock
@@ -1804,6 +1843,46 @@ static rx_handler_result_t netvsc_vf_handle_frame(struct sk_buff **pskb)
 	return RX_HANDLER_ANOTHER;
 }
 
+static int netvsc_vf_join(struct net_device *vf_netdev,
+			  struct net_device *ndev)
+{
+	struct net_device_context *ndev_ctx = netdev_priv(ndev);
+	int ret;
+
+	ret = netdev_rx_handler_register(vf_netdev,
+					 netvsc_vf_handle_frame, ndev);
+	if (ret != 0) {
+		netdev_err(vf_netdev,
+			   "can not register netvsc VF receive handler (err = %d)\n",
+			   ret);
+		goto rx_handler_failed;
+	}
+
+	ret = netdev_master_upper_dev_link(vf_netdev, ndev,
+					   NULL, NULL, NULL);
+	if (ret != 0) {
+		netdev_err(vf_netdev,
+			   "can not set master device %s (err = %d)\n",
+			   ndev->name, ret);
+		goto upper_link_failed;
+	}
+
+	/* set slave flag before open to prevent IPv6 addrconf */
+	vf_netdev->flags |= IFF_SLAVE;
+
+	schedule_delayed_work(&ndev_ctx->vf_takeover, VF_TAKEOVER_INT);
+
+	call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
+
+	netdev_info(vf_netdev, "joined to %s\n", ndev->name);
+	return 0;
+
+upper_link_failed:
+	netdev_rx_handler_unregister(vf_netdev);
+rx_handler_failed:
+	return ret;
+}
+
 static void __netvsc_vf_setup(struct net_device *ndev,
 			      struct net_device *vf_netdev)
 {
@@ -1854,95 +1933,85 @@ static void netvsc_vf_setup(struct work_struct *w)
 	rtnl_unlock();
 }
 
-static int netvsc_pre_register_vf(struct net_device *vf_netdev,
-				  struct net_device *ndev)
+static int netvsc_register_vf(struct net_device *vf_netdev)
 {
+	struct net_device *ndev;
 	struct net_device_context *net_device_ctx;
 	struct netvsc_device *netvsc_dev;
 
+	if (vf_netdev->addr_len != ETH_ALEN)
+		return NOTIFY_DONE;
+
+	/*
+	 * We will use the MAC address to locate the synthetic interface to
+	 * associate with the VF interface. If we don't find a matching
+	 * synthetic interface, move on.
+	 */
+	ndev = get_netvsc_bymac(vf_netdev->perm_addr);
+	if (!ndev)
+		return NOTIFY_DONE;
+
 	net_device_ctx = netdev_priv(ndev);
 	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
 	if (!netvsc_dev || rtnl_dereference(net_device_ctx->vf_netdev))
-		return -ENODEV;
-
-	return 0;
-}
-
-static int netvsc_register_vf(struct net_device *vf_netdev,
-			      struct net_device *ndev)
-{
-	struct net_device_context *ndev_ctx = netdev_priv(ndev);
-
-	/* set slave flag before open to prevent IPv6 addrconf */
-	vf_netdev->flags |= IFF_SLAVE;
+		return NOTIFY_DONE;
 
-	schedule_delayed_work(&ndev_ctx->vf_takeover, VF_TAKEOVER_INT);
+	if (netvsc_vf_join(vf_netdev, ndev) != 0)
+		return NOTIFY_DONE;
 
-	call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
-
-	netdev_info(vf_netdev, "joined to %s\n", ndev->name);
+	netdev_info(ndev, "VF registering: %s\n", vf_netdev->name);
 
 	dev_hold(vf_netdev);
-	rcu_assign_pointer(ndev_ctx->vf_netdev, vf_netdev);
-
-	return 0;
+	rcu_assign_pointer(net_device_ctx->vf_netdev, vf_netdev);
+	return NOTIFY_OK;
 }
 
 /* VF up/down change detected, schedule to change data path */
-static int netvsc_vf_changed(struct net_device *vf_netdev,
-			     struct net_device *ndev)
+static int netvsc_vf_changed(struct net_device *vf_netdev)
 {
 	struct net_device_context *net_device_ctx;
 	struct netvsc_device *netvsc_dev;
+	struct net_device *ndev;
 	bool vf_is_up = netif_running(vf_netdev);
 
+	ndev = get_netvsc_byref(vf_netdev);
+	if (!ndev)
+		return NOTIFY_DONE;
+
 	net_device_ctx = netdev_priv(ndev);
 	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
 	if (!netvsc_dev)
-		return -ENODEV;
+		return NOTIFY_DONE;
 
 	netvsc_switch_datapath(ndev, vf_is_up);
 	netdev_info(ndev, "Data path switched %s VF: %s\n",
 		    vf_is_up ? "to" : "from", vf_netdev->name);
 
-	return 0;
+	return NOTIFY_OK;
 }
 
-static int netvsc_pre_unregister_vf(struct net_device *vf_netdev,
-				    struct net_device *ndev)
+static int netvsc_unregister_vf(struct net_device *vf_netdev)
 {
+	struct net_device *ndev;
 	struct net_device_context *net_device_ctx;
 
-	net_device_ctx = netdev_priv(ndev);
-	cancel_delayed_work_sync(&net_device_ctx->vf_takeover);
-
-	return 0;
-}
-
-static int netvsc_unregister_vf(struct net_device *vf_netdev,
-				struct net_device *ndev)
-{
-	struct net_device_context *net_device_ctx;
+	ndev = get_netvsc_byref(vf_netdev);
+	if (!ndev)
+		return NOTIFY_DONE;
 
 	net_device_ctx = netdev_priv(ndev);
+	cancel_delayed_work_sync(&net_device_ctx->vf_takeover);
 
 	netdev_info(ndev, "VF unregistering: %s\n", vf_netdev->name);
 
+	netdev_rx_handler_unregister(vf_netdev);
+	netdev_upper_dev_unlink(vf_netdev, ndev);
 	RCU_INIT_POINTER(net_device_ctx->vf_netdev, NULL);
 	dev_put(vf_netdev);
 
-	return 0;
+	return NOTIFY_OK;
 }
 
-static struct failover_ops netvsc_failover_ops = {
-	.slave_pre_register	= netvsc_pre_register_vf,
-	.slave_register		= netvsc_register_vf,
-	.slave_pre_unregister	= netvsc_pre_unregister_vf,
-	.slave_unregister	= netvsc_unregister_vf,
-	.slave_link_change	= netvsc_vf_changed,
-	.slave_handle_frame	= netvsc_vf_handle_frame,
-};
-
 static int netvsc_probe(struct hv_device *dev,
 			const struct hv_vmbus_device_id *dev_id)
 {
@@ -2032,16 +2101,8 @@ static int netvsc_probe(struct hv_device *dev,
 		goto register_failed;
 	}
 
-	net_device_ctx->failover = failover_register(net, &netvsc_failover_ops);
-	if (IS_ERR(net_device_ctx->failover)) {
-		ret = PTR_ERR(net_device_ctx->failover);
-		goto err_failover;
-	}
-
 	return ret;
 
-err_failover:
-	unregister_netdev(net);
 register_failed:
 	rndis_filter_device_remove(dev, nvdev);
 rndis_failed:
@@ -2082,15 +2143,13 @@ static int netvsc_remove(struct hv_device *dev)
 	rtnl_lock();
 	vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
 	if (vf_netdev)
-		failover_slave_unregister(vf_netdev);
+		netvsc_unregister_vf(vf_netdev);
 
 	if (nvdev)
 		rndis_filter_device_remove(dev, nvdev);
 
 	unregister_netdevice(net);
 
-	failover_unregister(ndev_ctx->failover);
-
 	rtnl_unlock();
 	rcu_read_unlock();
 
@@ -2117,8 +2176,54 @@ static struct  hv_driver netvsc_drv = {
 	.remove = netvsc_remove,
 };
 
+/*
+ * On Hyper-V, every VF interface is matched with a corresponding
+ * synthetic interface. The synthetic interface is presented first
+ * to the guest. When the corresponding VF instance is registered,
+ * we will take care of switching the data path.
+ */
+static int netvsc_netdev_event(struct notifier_block *this,
+			       unsigned long event, void *ptr)
+{
+	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
+
+	/* Skip our own events */
+	if (event_dev->netdev_ops == &device_ops)
+		return NOTIFY_DONE;
+
+	/* Avoid non-Ethernet type devices */
+	if (event_dev->type != ARPHRD_ETHER)
+		return NOTIFY_DONE;
+
+	/* Avoid Vlan dev with same MAC registering as VF */
+	if (is_vlan_dev(event_dev))
+		return NOTIFY_DONE;
+
+	/* Avoid Bonding master dev with same MAC registering as VF */
+	if ((event_dev->priv_flags & IFF_BONDING) &&
+	    (event_dev->flags & IFF_MASTER))
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_REGISTER:
+		return netvsc_register_vf(event_dev);
+	case NETDEV_UNREGISTER:
+		return netvsc_unregister_vf(event_dev);
+	case NETDEV_UP:
+	case NETDEV_DOWN:
+		return netvsc_vf_changed(event_dev);
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
+static struct notifier_block netvsc_netdev_notifier = {
+	.notifier_call = netvsc_netdev_event,
+};
+
 static void __exit netvsc_drv_exit(void)
 {
+	unregister_netdevice_notifier(&netvsc_netdev_notifier);
 	vmbus_driver_unregister(&netvsc_drv);
 }
 
@@ -2138,6 +2243,7 @@ static int __init netvsc_drv_init(void)
 	if (ret)
 		return ret;
 
+	register_netdevice_notifier(&netvsc_netdev_notifier);
 	return 0;
 }
 
-- 
2.17.1

^ permalink raw reply related

* [PATCH net 2/3] hv_netvsc: fix network namespace issues with VF support
From: Stephen Hemminger @ 2018-06-11 19:44 UTC (permalink / raw)
  To: kys, haiyangz, sthemmin; +Cc: devel, netdev
In-Reply-To: <20180611194456.8268-1-sthemmin@microsoft.com>

When finding the parent netvsc device, the search needs to be across
all netvsc device instances (independent of network namespace).

Find parent device of VF using upper_dev_get routine which
searches only adjacent list.

Fixes: e8ff40d4bff1 ("hv_netvsc: improve VF device matching")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>

netns aware byref
---
 drivers/net/hyperv/hyperv_net.h |  2 ++
 drivers/net/hyperv/netvsc_drv.c | 43 +++++++++++++++------------------
 2 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 1be34d2e3563..c0b3f3c125d4 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -902,6 +902,8 @@ struct net_device_context {
 	struct hv_device *device_ctx;
 	/* netvsc_device */
 	struct netvsc_device __rcu *nvdev;
+	/* list of netvsc net_devices */
+	struct list_head list;
 	/* reconfigure work */
 	struct delayed_work dwork;
 	/* last reconfig time */
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 2d4370c94b6e..8cb21e013d1d 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -69,6 +69,8 @@ static int debug = -1;
 module_param(debug, int, 0444);
 MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
 
+static LIST_HEAD(netvsc_dev_list);
+
 static void netvsc_change_rx_flags(struct net_device *net, int change)
 {
 	struct net_device_context *ndev_ctx = netdev_priv(net);
@@ -1783,13 +1785,10 @@ static void netvsc_link_change(struct work_struct *w)
 
 static struct net_device *get_netvsc_bymac(const u8 *mac)
 {
-	struct net_device *dev;
-
-	ASSERT_RTNL();
+	struct net_device_context *ndev_ctx;
 
-	for_each_netdev(&init_net, dev) {
-		if (dev->netdev_ops != &device_ops)
-			continue;	/* not a netvsc device */
+	list_for_each_entry(ndev_ctx, &netvsc_dev_list, list) {
+		struct net_device *dev = hv_get_drvdata(ndev_ctx->device_ctx);
 
 		if (ether_addr_equal(mac, dev->perm_addr))
 			return dev;
@@ -1800,25 +1799,18 @@ static struct net_device *get_netvsc_bymac(const u8 *mac)
 
 static struct net_device *get_netvsc_byref(struct net_device *vf_netdev)
 {
+	struct net_device_context *net_device_ctx;
 	struct net_device *dev;
 
-	ASSERT_RTNL();
-
-	for_each_netdev(&init_net, dev) {
-		struct net_device_context *net_device_ctx;
+	dev = netdev_master_upper_dev_get(vf_netdev);
+	if (!dev || dev->netdev_ops != &device_ops)
+		return NULL;	/* not a netvsc device */
 
-		if (dev->netdev_ops != &device_ops)
-			continue;	/* not a netvsc device */
+	net_device_ctx = netdev_priv(dev);
+	if (!rtnl_dereference(net_device_ctx->nvdev))
+		return NULL;	/* device is removed */
 
-		net_device_ctx = netdev_priv(dev);
-		if (!rtnl_dereference(net_device_ctx->nvdev))
-			continue;	/* device is removed */
-
-		if (rtnl_dereference(net_device_ctx->vf_netdev) == vf_netdev)
-			return dev;	/* a match */
-	}
-
-	return NULL;
+	return dev;
 }
 
 /* Called when VF is injecting data into network stack.
@@ -2095,15 +2087,19 @@ static int netvsc_probe(struct hv_device *dev,
 	else
 		net->max_mtu = ETH_DATA_LEN;
 
-	ret = register_netdev(net);
+	rtnl_lock();
+	ret = register_netdevice(net);
 	if (ret != 0) {
 		pr_err("Unable to register netdev.\n");
 		goto register_failed;
 	}
 
-	return ret;
+	list_add(&net_device_ctx->list, &netvsc_dev_list);
+	rtnl_unlock();
+	return 0;
 
 register_failed:
+	rtnl_unlock();
 	rndis_filter_device_remove(dev, nvdev);
 rndis_failed:
 	free_percpu(net_device_ctx->vf_stats);
@@ -2149,6 +2145,7 @@ static int netvsc_remove(struct hv_device *dev)
 		rndis_filter_device_remove(dev, nvdev);
 
 	unregister_netdevice(net);
+	list_del(&ndev_ctx->list);
 
 	rtnl_unlock();
 	rcu_read_unlock();
-- 
2.17.1
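
As a rough illustration of why the lookup in this patch moves from a
per-namespace walk to a driver-private list, here is a small userspace
sketch in C. It is not the driver code: struct model_dev,
model_dev_add(), find_bymac_in_ns() and find_bymac_any_ns() are
hypothetical names, and the integer netns_id merely stands in for a
network namespace.

```c
#include <stddef.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical stand-in for a synthetic net device; not the kernel type. */
struct model_dev {
	const char *name;
	int netns_id;				/* models the owning namespace */
	unsigned char perm_addr[ETH_ALEN];
	struct model_dev *next;			/* driver-private list linkage */
};

/* Driver-global list: every synthetic device, whatever its namespace. */
struct model_dev *model_dev_list;

void model_dev_add(struct model_dev *dev)
{
	dev->next = model_dev_list;
	model_dev_list = dev;
}

/* Models a search that only sees the devices of one namespace. */
struct model_dev *find_bymac_in_ns(int netns_id, const unsigned char *mac)
{
	struct model_dev *dev;

	for (dev = model_dev_list; dev; dev = dev->next) {
		if (dev->netns_id != netns_id)
			continue;	/* invisible from this namespace */
		if (memcmp(mac, dev->perm_addr, ETH_ALEN) == 0)
			return dev;
	}
	return NULL;
}

/* Models the namespace-independent search over the private list. */
struct model_dev *find_bymac_any_ns(const unsigned char *mac)
{
	struct model_dev *dev;

	for (dev = model_dev_list; dev; dev = dev->next) {
		if (memcmp(mac, dev->perm_addr, ETH_ALEN) == 0)
			return dev;
	}
	return NULL;
}
```

With one synthetic device registered under namespace 7, a search
limited to namespace 0 finds nothing, while the namespace-independent
search returns the device.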

^ permalink raw reply related

* [PATCH net 0/3] hv_netvsc: notification and namespace fixes
From: Stephen Hemminger @ 2018-06-11 19:44 UTC (permalink / raw)
  To: kys, haiyangz, sthemmin; +Cc: devel, netdev

This set of patches addresses two sets of fixes. First, it backs out
the common callback model, which was merged in net-next without
completing all the review feedback or getting maintainer approval.

Then it fixes the transparent VF management code to handle network
namespaces.

Stephen Hemminger (3):
  hv_netvsc: drop common code until callback model fixed
  hv_netvsc: fix network namespace issues with VF support
  hv_netvsc: move VF to same namespace as netvsc device

 drivers/net/hyperv/Kconfig      |   1 -
 drivers/net/hyperv/hyperv_net.h |   4 +-
 drivers/net/hyperv/netvsc_drv.c | 242 ++++++++++++++++++++++++--------
 3 files changed, 184 insertions(+), 63 deletions(-)

-- 
2.17.1

^ permalink raw reply

* [PATCH net 3/3] hv_netvsc: move VF to same namespace as netvsc device
From: Stephen Hemminger @ 2018-06-11 19:44 UTC (permalink / raw)
  To: kys, haiyangz, sthemmin; +Cc: devel, netdev
In-Reply-To: <20180611194456.8268-1-sthemmin@microsoft.com>

When the VF is added, the paravirtual device is already present
and may have been moved to another network namespace. For example,
in some environments the management interface is put in another
net namespace.

The VF should get moved to where the netvsc device is when the
VF is discovered. The user can move it later (if desired).

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
---
 drivers/net/hyperv/netvsc_drv.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 8cb21e013d1d..bf1b845c1147 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -1930,6 +1930,7 @@ static int netvsc_register_vf(struct net_device *vf_netdev)
 	struct net_device *ndev;
 	struct net_device_context *net_device_ctx;
 	struct netvsc_device *netvsc_dev;
+	int ret;
 
 	if (vf_netdev->addr_len != ETH_ALEN)
 		return NOTIFY_DONE;
@@ -1948,11 +1949,29 @@ static int netvsc_register_vf(struct net_device *vf_netdev)
 	if (!netvsc_dev || rtnl_dereference(net_device_ctx->vf_netdev))
 		return NOTIFY_DONE;
 
-	if (netvsc_vf_join(vf_netdev, ndev) != 0)
+	/* if synthetic interface is in a different namespace,
+	 * then move the VF to that namespace; join will be
+	 * done again in that context.
+	 */
+	if (!net_eq(dev_net(ndev), dev_net(vf_netdev))) {
+		ret = dev_change_net_namespace(vf_netdev,
+					       dev_net(ndev), "eth%d");
+		if (ret)
+			netdev_err(vf_netdev,
+				   "could not move to same namespace as %s: %d\n",
+				   ndev->name, ret);
+		else
+			netdev_info(vf_netdev,
+				    "VF moved to namespace with: %s\n",
+				    ndev->name);
 		return NOTIFY_DONE;
+	}
 
 	netdev_info(ndev, "VF registering: %s\n", vf_netdev->name);
 
+	if (netvsc_vf_join(vf_netdev, ndev) != 0)
+		return NOTIFY_DONE;
+
 	dev_hold(vf_netdev);
 	rcu_assign_pointer(net_device_ctx->vf_netdev, vf_netdev);
 	return NOTIFY_OK;
-- 
2.17.1
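
The two-step flow described in the commit message (move first, join on
the register event that follows in the new namespace) can be sketched
in userspace C as below. This is only a model under stated assumptions:
struct ns_model_dev, model_register_vf() and the MODEL_NOTIFY_*
constants are hypothetical, and the assignment to netns_id stands in
for the namespace move performed by dev_change_net_namespace().

```c
#include <stddef.h>

/* Hypothetical return codes modelling NOTIFY_DONE / NOTIFY_OK. */
enum { MODEL_NOTIFY_DONE = 0, MODEL_NOTIFY_OK = 1 };

/* Hypothetical device model; netns_id stands in for a namespace. */
struct ns_model_dev {
	int netns_id;
	struct ns_model_dev *vf;	/* attached VF, if any */
};

/*
 * Models the register path: if the VF sits in a different namespace
 * than the synthetic device, move it and stop; the join is left to
 * the register event that follows in the new namespace.
 */
int model_register_vf(struct ns_model_dev *synth, struct ns_model_dev *vf)
{
	if (synth->vf)
		return MODEL_NOTIFY_DONE;	/* already has a VF */

	if (vf->netns_id != synth->netns_id) {
		vf->netns_id = synth->netns_id;	/* models the move */
		return MODEL_NOTIFY_DONE;	/* join happens next time */
	}

	synth->vf = vf;				/* models the join */
	return MODEL_NOTIFY_OK;
}
```

Calling model_register_vf() twice for a VF that starts in the wrong
namespace models the sequence: the first call only moves the VF, and
the second call attaches it.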

^ permalink raw reply related

* RE: locking in wimax/i2400m
From: Perez-Gonzalez, Inaky @ 2018-06-11 19:56 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-wimax
  Cc: netdev@vger.kernel.org, tglx@linutronix.de
In-Reply-To: <20180611160528.be37vniefy4est6m@linutronix.de>

Hello Sebastian

Thanks for checking this out,

    SA> I tried to figure out if the URB-completion handler uses any
    SA> locking and stumbled here.

    SA> i2400m_pm_notifier() is called from process context. This
    SA> function invokes i2400m_fw_cache() + i2400m_fw_uncache(). Both
    SA> functions do spin_lock(&i2400m->rx_lock); while in other
    SA> places (say i2400mu_rxd()) it does
    SA> spin_lock_irqsave(&i2400m->rx_lock, flags);

    SA> So what do I miss? Is this lock never used in interrupt
    SA> context and lockdep didn't complain or did nobody try suspend
    SA> with this driver before?  From what I can tell
    SA> i2400m_dev_bootstrap() has the same locking problem.

I don't remember ever getting to try suspend in the driver, so that
might be the case. That said, I haven't touched this code in years, nor
kept up with current locking best practices, so I'll let others chime
in, Thomas probably being one of the key ones.

^ permalink raw reply

* RE: [net] fq_codel: fix NULL pointer deref in fq_codel_reset
From: Keller, Jacob E @ 2018-06-11 19:57 UTC (permalink / raw)
  To: Kirsher, Jeffrey T, davem@davemloft.net
  Cc: netdev@vger.kernel.org, nhorman@redhat.com, sassmann@redhat.com,
	jogreene@redhat.com, Eric Dumazet
In-Reply-To: <20180611170011.7200-1-jeffrey.t.kirsher@intel.com>

> -----Original Message-----
> From: Kirsher, Jeffrey T
> Sent: Monday, June 11, 2018 10:00 AM
> To: davem@davemloft.net
> Cc: Keller, Jacob E <jacob.e.keller@intel.com>; netdev@vger.kernel.org;
> nhorman@redhat.com; sassmann@redhat.com; jogreene@redhat.com; Eric
> Dumazet <edumazet@google.com>; Kirsher, Jeffrey T
> <jeffrey.t.kirsher@intel.com>
> Subject: [net] fq_codel: fix NULL pointer deref in fq_codel_reset
> 
> From: Jacob Keller <jacob.e.keller@intel.com>
> 
> The function qdisc_create_dftl attempts to create a default qdisc. If
> this fails, it calls qdisc_destroy when cleaning up. The qdisc_destroy
> function calls the ->reset op on the qdisc.
> 
> In the case of sch_fq_codel.c, this function will panic when the qdisc
> wasn't properly initialized:
> 
>    kernel: BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000008
>    kernel: IP: fq_codel_reset+0x58/0xd0 [sch_fq_codel]
>    kernel: PGD 0 P4D 0
>    kernel: Oops: 0000 [#1] SMP PTI
>    kernel: Modules linked in: i40iw i40e(OE) xt_CHECKSUM iptable_mangle
> ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc
> devlink ebtable_filter ebtables ip6table_filter ip6_tables rpcrdma ib_isert
> iscsi_target_mod sunrpc ib_iser libiscsi scsi_transport_iscsi ib_srpt
> target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm
> ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_rapl sb_edac
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate iTCO_wdt
> iTCO_vendor_support intel_uncore ib_core intel_rapl_perf mei_me mei joydev
> i2c_i801 lpc_ich ioatdma shpchp wmi sch_fq_codel xfs libcrc32c mgag200 ixgbe
> drm_kms_helper isci ttm firewire_ohci
>    kernel:  mdio drm igb libsas crc32c_intel firewire_core ptp pps_core
> scsi_transport_sas crc_itu_t dca i2c_algo_bit ipmi_si ipmi_devintf
> ipmi_msghandler [last unloaded: i40e]
>    kernel: CPU: 10 PID: 4219 Comm: ip Tainted: G           OE    4.16.13custom-fq-
> codel-test+ #3
>    kernel: Hardware name: Intel Corporation S2600CO/S2600CO, BIOS
> SE5C600.86B.02.05.0004.051120151007 05/11/2015
>    kernel: RIP: 0010:fq_codel_reset+0x58/0xd0 [sch_fq_codel]
>    kernel: RSP: 0018:ffffbfbf4c1fb620 EFLAGS: 00010246
>    kernel: RAX: 0000000000000400 RBX: 0000000000000000 RCX: 00000000000005b9
>    kernel: RDX: 0000000000000000 RSI: ffff9d03264a60c0 RDI: ffff9cfd17b31c00
>    kernel: RBP: 0000000000000001 R08: 00000000000260c0 R09: ffffffffb679c3e9
>    kernel: R10: fffff1dab06a0e80 R11: ffff9cfd163af800 R12: ffff9cfd17b31c00
>    kernel: R13: 0000000000000001 R14: ffff9cfd153de600 R15: 0000000000000001
>    kernel: FS:  00007fdec2f92800(0000) GS:ffff9d0326480000(0000)
> knlGS:0000000000000000
>    kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    kernel: CR2: 0000000000000008 CR3: 0000000c1956a006 CR4: 00000000000606e0
>    kernel: Call Trace:
>    kernel:  qdisc_destroy+0x56/0x140
>    kernel:  qdisc_create_dflt+0x8b/0xb0
>    kernel:  mq_init+0xc1/0xf0
>    kernel:  qdisc_create_dflt+0x5a/0xb0
>    kernel:  dev_activate+0x205/0x230
>    kernel:  __dev_open+0xf5/0x160
>    kernel:  __dev_change_flags+0x1a3/0x210
>    kernel:  dev_change_flags+0x21/0x60
>    kernel:  do_setlink+0x660/0xdf0
>    kernel:  ? down_trylock+0x25/0x30
>    kernel:  ? xfs_buf_trylock+0x1a/0xd0 [xfs]
>    kernel:  ? rtnl_newlink+0x816/0x990
>    kernel:  ? _xfs_buf_find+0x327/0x580 [xfs]
>    kernel:  ? _cond_resched+0x15/0x30
>    kernel:  ? kmem_cache_alloc+0x20/0x1b0
>    kernel:  ? rtnetlink_rcv_msg+0x200/0x2f0
>    kernel:  ? rtnl_calcit.isra.30+0x100/0x100
>    kernel:  ? netlink_rcv_skb+0x4c/0x120
>    kernel:  ? netlink_unicast+0x19e/0x260
>    kernel:  ? netlink_sendmsg+0x1ff/0x3c0
>    kernel:  ? sock_sendmsg+0x36/0x40
>    kernel:  ? ___sys_sendmsg+0x295/0x2f0
>    kernel:  ? ebitmap_cmp+0x6d/0x90
>    kernel:  ? dev_get_by_name_rcu+0x73/0x90
>    kernel:  ? skb_dequeue+0x52/0x60
>    kernel:  ? __inode_wait_for_writeback+0x7f/0xf0
>    kernel:  ? bit_waitqueue+0x30/0x30
>    kernel:  ? fsnotify_grab_connector+0x3c/0x60
>    kernel:  ? __sys_sendmsg+0x51/0x90
>    kernel:  ? do_syscall_64+0x74/0x180
>    kernel:  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>    kernel: Code: 00 00 48 89 87 00 02 00 00 8b 87 a0 01 00 00 85 c0 0f 84 84 00 00 00
> 31 ed 48 63 dd 83 c5 01 48 c1 e3 06 49 03 9c 24 90 01 00 00 <48> 8b 73 08 48 8b 3b e8
> 6c 9a 4f f6 48 8d 43 10 48 c7 03 00 00
>    kernel: RIP: fq_codel_reset+0x58/0xd0 [sch_fq_codel] RSP: ffffbfbf4c1fb620
>    kernel: CR2: 0000000000000008
>    kernel: ---[ end trace e81a62bede66274e ]---
> 
> This occurs because if fq_codel_init fails, it has left the private data
> in an incomplete state. For example, if tcf_block_get fails, (as in the
> above panic), then q->flows and q->backlogs will be NULL. Thus they will
> cause NULL pointer access when attempting to reset them in
> fq_codel_reset.
> 
> We could mitigate some of these issues by changing fq_codel_init to more
> explicitly cleanup after itself when failing. For example, we could
> ensure that q->flowcnt was set to 0 so that the loop over each flow in
> fq_codel_reset would not trigger. However, this would not prevent a NULL
> pointer dereference when attempting to memset the q->backlogs.
> 
> Instead, just add a NULL check prior to attempting to reset these
> fields.
> 
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
>  net/sched/sch_fq_codel.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
> index 22fa13cf5d8b..1658c314ee40 100644
> --- a/net/sched/sch_fq_codel.c
> +++ b/net/sched/sch_fq_codel.c
> @@ -352,14 +352,17 @@ static void fq_codel_reset(struct Qdisc *sch)
> 
>  	INIT_LIST_HEAD(&q->new_flows);
>  	INIT_LIST_HEAD(&q->old_flows);
> -	for (i = 0; i < q->flows_cnt; i++) {
> -		struct fq_codel_flow *flow = q->flows + i;
> +	if (q->flows) {
> +		for (i = 0; i < q->flows_cnt; i++) {
> +			struct fq_codel_flow *flow = q->flows + i;
> 
> -		fq_codel_flow_purge(flow);
> -		INIT_LIST_HEAD(&flow->flowchain);
> -		codel_vars_init(&flow->cvars);
> +			fq_codel_flow_purge(flow);
> +			INIT_LIST_HEAD(&flow->flowchain);
> +			codel_vars_init(&flow->cvars);
> +		}
>  	}
> -	memset(q->backlogs, 0, q->flows_cnt * sizeof(u32));
> +	if (q->backlogs)
> +		memset(q->backlogs, 0, q->flows_cnt * sizeof(u32));

I'm open to alternative suggestions for fixing this. I think Eric suggested that maybe we should just remove the ->reset() call from qdisc_destroy?

I don't really understand enough about this code, so if someone has a better suggestion please feel free to suggest it.

Thanks,
Jake

>  	sch->q.qlen = 0;
>  	sch->qstats.backlog = 0;
>  	q->memory_usage = 0;
> --
> 2.17.1

^ permalink raw reply

* Re: [net] fq_codel: fix NULL pointer deref in fq_codel_reset
From: Cong Wang @ 2018-06-11 20:02 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: David Miller, Jacob Keller, Linux Kernel Network Developers,
	nhorman, sassmann, jogreene, Eric Dumazet
In-Reply-To: <20180611170011.7200-1-jeffrey.t.kirsher@intel.com>

On Mon, Jun 11, 2018 at 10:00 AM, Jeff Kirsher
<jeffrey.t.kirsher@intel.com> wrote:
>
> We could mitigate some of these issues by changing fq_codel_init to more
> explicitly cleanup after itself when failing. For example, we could
> ensure that q->flowcnt was set to 0 so that the loop over each flow in
> fq_codel_reset would not trigger. However, this would not prevent a NULL
> pointer dereference when attempting to memset the q->backlogs.

Are you saying memset(ptr, 0, 0) is not nop?? :-/

Making q->flows_cnt 0 is simpler and easier to understand.
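
For readers following the thread, the guarded reset from the posted
patch can be modelled in userspace C roughly as follows. This is a
sketch, not the sch_fq_codel.c code: struct model_sched, struct
model_flow and model_reset_guarded() are hypothetical names, and the
per-flow work is reduced to clearing one counter. The alternative
raised above, zeroing the flow count when init fails, is not modelled
here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-flow state; the real flow holds much more. */
struct model_flow {
	int backlog_pkts;
};

/* Hypothetical private data; pointers stay NULL if init failed early. */
struct model_sched {
	struct model_flow *flows;
	unsigned int *backlogs;
	unsigned int flows_cnt;
};

/*
 * Reset that tolerates a partially initialized state by checking each
 * pointer before use. Returns the number of flows actually reset.
 */
unsigned int model_reset_guarded(struct model_sched *q)
{
	unsigned int i, reset = 0;

	if (q->flows) {
		for (i = 0; i < q->flows_cnt; i++) {
			q->flows[i].backlog_pkts = 0;
			reset++;
		}
	}
	if (q->backlogs)
		memset(q->backlogs, 0, q->flows_cnt * sizeof(unsigned int));

	return reset;
}
```

With both pointers NULL and a non-zero flow count (the state the
commit message describes after a failed init), the function touches
nothing and returns 0; with valid arrays it clears every entry.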

^ permalink raw reply

* Re: 4.17.0-10146-gf0dc7f9c6dd9: hw csum failure on powerpc+sungem
From: Mathieu Malaterre @ 2018-06-11 20:20 UTC (permalink / raw)
  To: Meelis Roos; +Cc: netdev, Mauro Carvalho Chehab, linuxppc-dev, LKML
In-Reply-To: <alpine.LRH.2.21.1806111352330.17091@math.ut.ee>

Hi Meelis,

On Mon, Jun 11, 2018 at 1:21 PM Meelis Roos <mroos@linux.ee> wrote:
>
> I am seeing this on PowerMac G4 with sungem ethernet driver. 4.17 was
> OK, 4.17.0-10146-gf0dc7f9c6dd9 is problematic.

Same here.

> [  140.518664] eth0: hw csum failure
> [  140.518699] CPU: 0 PID: 1237 Comm: postconf Not tainted 4.17.0-10146-gf0dc7f9c6dd9 #83
> [  140.518707] Call Trace:
> [  140.518734] [effefd90] [c03d6db8] __skb_checksum_complete+0xd8/0xdc (unreliable)
> [  140.518759] [effefdb0] [c04c1284] icmpv6_rcv+0x248/0x4ec
> [  140.518775] [effefdd0] [c049a448] ip6_input_finish.constprop.0+0x11c/0x5f4
> [  140.518786] [effefe10] [c049b1c0] ip6_mc_input+0xcc/0x100
> [  140.518807] [effefe20] [c03e110c] __netif_receive_skb_core+0x310/0x944
> [  140.518820] [effefe70] [c03e76ec] napi_gro_receive+0xd0/0xe8
> [  140.518845] [effefe80] [f3e1f66c] gem_poll+0x618/0x1274 [sungem]
> [  140.518856] [effeff30] [c03e6f0c] net_rx_action+0x198/0x374
> [  140.518872] [effeff90] [c0501a88] __do_softirq+0x120/0x278
> [  140.518890] [effeffe0] [c0036188] irq_exit+0xd8/0xdc
> [  140.518908] [effefff0] [c000f478] call_do_irq+0x24/0x3c
> [  140.518925] [d05a5d30] [c0007120] do_IRQ+0x74/0xf0
> [  140.518941] [d05a5d50] [c0012474] ret_from_except+0x0/0x14
> [  140.518960] --- interrupt: 501 at copy_page+0x40/0x90
>                    LR = copy_user_page+0x18/0x30
> [  140.518973] [d05a5e10] [d058cd80] 0xd058cd80 (unreliable)
> [  140.518989] [d05a5e20] [c00fa2bc] wp_page_copy+0xec/0x654
> [  140.519002] [d05a5e60] [c00fd3a4] do_wp_page+0xa8/0x5b4
> [  140.519013] [d05a5e90] [c00fe934] handle_mm_fault+0x564/0xa84
> [  140.519025] [d05a5f00] [c0016230] do_page_fault+0x1bc/0x7e8
> [  140.519037] [d05a5f40] [c0012300] handle_page_fault+0x14/0x40
> [  140.519048] --- interrupt: 301 at 0xb78b6864
>                    LR = 0xb78b6c54
>

When I run a git bisect, it reports:

$ git bisect good
3036bc45364f98515a2c446d7fac2c34dcfbeff4 is the first bad commit

Could you also check on your side, please?

> --
> Meelis Roos (mroos@linux.ee)

^ permalink raw reply
