* Re: [PATCH] optoe: driver to read/write SFP/QSFP EEPROMs
From: Tom Lendacky @ 2018-06-11 18:33 UTC (permalink / raw)
To: Don Bollinger, Arnd Bergmann, Greg Kroah-Hartman, linux-kernel
Cc: brandon_chuang, wally_wang, roy_lee, rick_burchett, quentin.chang,
jeffrey.townsend, scotte, roopa, David Ahern, luke.williams,
Guohan Lu, Russell King, netdev@vger.kernel.org
In-Reply-To: <20180611042515.ml6zbcmz6dlvjmrp@thebollingers.org>
On 6/10/2018 11:25 PM, Don Bollinger wrote:
> optoe is an i2c based driver that supports read/write access to all
> the pages (tables) of MSA standard SFP and similar devices (conforming
> to the SFF-8472 spec) and MSA standard QSFP and similar devices
> (conforming to the SFF-8436 spec).
>
> These devices provide identification, operational status and control
> registers via an EEPROM model. These devices support one or three fixed
> pages (128 bytes) of data, and one page that is selected via a page
> register on the first fixed page. Thus the driver's main task is
> to map these pages onto a simple linear address space for user space
> management applications. See the driver code for a detailed layout.
>
> EEPROM data is accessible via a bin_attribute file called 'eeprom',
> e.g. /sys/bus/i2c/devices/24-0050/eeprom.
>
> Signed-off-by: Don Bollinger <don@thebollingers.org>
> ---
>
> Why should this driver be in the Linux kernel? SFP and QSFP devices plug
> into switches to convert electrical to optical signals and drive the
> optical signal over fiber optic cables. They provide status and control
> registers through an i2c interface similar to other EEPROMs. However,
> they have a paging mechanism that is unique, which requires a different
> driver from (for example) at24. Various drivers have been developed for
> this purpose, but none of them supports both SFP and QSFP, provides both
> read and write access, and accesses all 256 architected pages. optoe does
> all of these.
>
> optoe has been adopted and is shipping to customers as a base module,
> available to all platforms (switches) and used by multiple vendors and
> platforms on both ONL (Open Network Linux) and SONiC (Microsoft's
> 'Software for Open Networking in the Cloud').
>
> This patch has been built on the latest staging-testing kernel. It has
> been built and tested with SFP and QSFP devices on an ARM platform with a 4.9
> kernel, and an x86 switch with a 3.16 kernel. This patch should install
> and build clean on any kernel from 3.16 up to the latest (as of 6/10/2018).
>
>
> Documentation/misc-devices/optoe.txt | 56 ++
> drivers/misc/eeprom/Kconfig | 18 +
> drivers/misc/eeprom/Makefile | 1 +
> drivers/misc/eeprom/optoe.c | 1141 ++++++++++++++++++++++++++++++++++
> 4 files changed, 1216 insertions(+)
> create mode 100644 Documentation/misc-devices/optoe.txt
> create mode 100644 drivers/misc/eeprom/optoe.c
>
There's an SFP driver under drivers/net/phy. Can that driver be extended
to provide this support? Adding Russell King, who developed sfp.c, as well
as the netdev mailing list.
Thanks,
Tom
> diff --git a/Documentation/misc-devices/optoe.txt b/Documentation/misc-devices/optoe.txt
> new file mode 100644
> index 000000000000..496134940147
> --- /dev/null
> +++ b/Documentation/misc-devices/optoe.txt
> @@ -0,0 +1,56 @@
> +optoe driver
> +
> +Author Don Bollinger (don@thebollingers.org)
> +
> +Optoe is an i2c based driver that supports read/write access to all
> +the pages (tables) of MSA standard SFP and similar devices (conforming
> +to the SFF-8472 spec) and MSA standard QSFP and similar devices
> +(conforming to the SFF-8436 spec).
> +
> +i2c based optoelectronic transceivers (SFP, QSFP, etc) provide identification,
> +operational status, and control registers via an EEPROM model. Unlike the
> +EEPROMs that at24 supports, these devices access data beyond byte 256 via
> +a page select register, which must be managed by the driver. See the driver
> +code for a detailed explanation of how the linear address space provided
> +by the driver maps to the paged address space provided by the devices.
> +
> +The EEPROM data is accessible via a bin_attribute file called 'eeprom',
> +e.g. /sys/bus/i2c/devices/24-0050/eeprom
> +
> +This driver also reports the port number for each device, via a sysfs
> +attribute: 'port_name'. This is a read/write attribute. It should be
> +explicitly set as part of system initialization, ideally at the same time
> +the device is instantiated. Write an appropriate port name (any string, up
> +to 19 characters) to initialize. If not initialized explicitly, all ports
> +will have the port_name of 'unitialized'. Alternatively, if the driver is
> +called with platform_data, the port_name will be read from eeprom_data->label
> +(if the EEPROM CLASS driver is configured) or from platform_data.port_name.
> +
> +This driver can be instantiated with 'new_device', per the convention
> +described in Documentation/i2c/instantiating-devices. It wants one of
> +two possible device identifiers. Use 'optoe1' to indicate this is a device
> +with just one i2c address (all QSFP type devices). Use 'optoe2' to indicate
> +this is a device with two i2c addresses (all SFP type devices).
> +
> +Example:
> +# echo optoe1 0x50 > /sys/bus/i2c/devices/i2c-64/new_device
> +# echo port54 > /sys/bus/i2c/devices/64-0050/port_name
> +
> +This will add a QSFP type device to i2c bus i2c-64, and name it 'port54'
> +
> +Example:
> +# echo optoe2 0x50 > /sys/bus/i2c/devices/i2c-11/new_device
> +# echo port1 > /sys/bus/i2c/devices/11-0050/port_name
> +
> +This will add an SFP type device to i2c bus i2c-11, and name it 'port1'
> +
> +The second parameter to new_device is an i2c address, and MUST be 0x50 for
> +this driver to work properly. This is part of the spec for these devices.
> +(It is not necessary to create a device at 0x51 for SFP type devices, the
> +driver does that automatically.)
> +
> +Note that SFP type and QSFP type devices are not plug-compatible. The
> +driver expects the correct ID for each port (each i2c device). It does
> +not check because the port will often be empty, and the only way to check
> +is to interrogate the device. Incorrect choice of ID will lead to correct
> +data being reported for the first 256 bytes, incorrect data after that.
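For anyone reading along, here is a rough user-space sketch of how a (page, byte) register address from the SFF specs might be turned into an offset in the 'eeprom' file, following the layout described in the driver comments. The helper names optoe_linear_offset() and optoe_read_reg() are hypothetical and not part of the patch, and I have only reasoned through this against the description, not run it on hardware.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_BYTES 128

/*
 * Hypothetical helper, not part of the patch.
 * two_addr: 0 for QSFP-style (optoe1), 1 for SFP-style (optoe2).
 * second:   SFP only, 1 if the register lives at i2c address 0x51.
 * page:     page select value (ignored when byte < 128 or on 0x50 of SFP).
 * byte:     register address 0-255 as given in the SFF spec.
 */
static long optoe_linear_offset(int two_addr, int second, int page, int byte)
{
	long base = 0;

	assert(byte >= 0 && byte < 2 * PAGE_BYTES);
	assert(page >= 0 && page < 256);

	if (two_addr) {
		if (!second)
			return byte;	/* all 256 bytes at 0x50 are unpaged */
		base = 2 * PAGE_BYTES;	/* 0x51 starts at linear offset 256 */
	}
	if (byte < PAGE_BYTES)
		return base + byte;	/* lower half is never paged */
	return base + (long)(page + 1) * PAGE_BYTES + (byte - PAGE_BYTES);
}

/* Read one register through the sysfs file; 0 on success, -1 on error. */
static int optoe_read_reg(const char *path, long off, uint8_t *val)
{
	int fd = open(path, O_RDONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = pread(fd, val, 1, (off_t)off);
	close(fd);
	return n == 1 ? 0 : -1;
}
```

With the path from the documentation above, byte 128 of page 3 on a QSFP would then be read with optoe_read_reg("/sys/bus/i2c/devices/24-0050/eeprom", optoe_linear_offset(0, 0, 3, 128), &val).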
> diff --git a/drivers/misc/eeprom/Kconfig b/drivers/misc/eeprom/Kconfig
> index 68a1ac929917..9a08e12756ee 100644
> --- a/drivers/misc/eeprom/Kconfig
> +++ b/drivers/misc/eeprom/Kconfig
> @@ -111,4 +111,22 @@ config EEPROM_IDT_89HPESX
> This driver can also be built as a module. If so, the module
> will be called idt_89hpesx.
>
> +config EEPROM_OPTOE
> + tristate "read/write access to SFP* & QSFP* EEPROMs"
> + depends on I2C && SYSFS
> + help
> + If you say yes here you get support for read and write access to
> + the EEPROM of SFP and QSFP type optical and copper transceivers.
> + Includes all devices which conform to the sff-8436 and sff-8472
> + spec including SFP, SFP+, SFP28, SFP-DWDM, QSFP, QSFP+, QSFP28
> + or later. These devices are usually found in network switches.
> +
> + This driver only manages read/write access to the EEPROM, all
> + other features should be accessed via i2c-dev.
> +
> + This driver can also be built as a module. If so, the module
> + will be called optoe.
> +
> + If unsure, say N.
> +
> endmenu
> diff --git a/drivers/misc/eeprom/Makefile b/drivers/misc/eeprom/Makefile
> index 2aab60ef3e3e..00288d669017 100644
> --- a/drivers/misc/eeprom/Makefile
> +++ b/drivers/misc/eeprom/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_EEPROM_93CX6) += eeprom_93cx6.o
> obj-$(CONFIG_EEPROM_93XX46) += eeprom_93xx46.o
> obj-$(CONFIG_EEPROM_DIGSY_MTC_CFG) += digsy_mtc_eeprom.o
> obj-$(CONFIG_EEPROM_IDT_89HPESX) += idt_89hpesx.o
> +obj-$(CONFIG_EEPROM_OPTOE) += optoe.o
> diff --git a/drivers/misc/eeprom/optoe.c b/drivers/misc/eeprom/optoe.c
> new file mode 100644
> index 000000000000..7cdf1a0a5299
> --- /dev/null
> +++ b/drivers/misc/eeprom/optoe.c
> @@ -0,0 +1,1141 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * optoe.c - A driver to read and write the EEPROM on optical transceivers
> + * (SFP, QSFP and similar I2C based devices)
> + *
> + * Copyright (C) 2014 Cumulus networks Inc.
> + * Copyright (C) 2017 Finisar Corp.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +/*
> + * Description:
> + * a) Optical transceiver EEPROM read/write transactions are just like
> + * the at24 eeproms managed by the at24.c i2c driver
> + * b) The register/memory layout is up to 256 pages of 128 bytes,
> + * defined by a "pages valid" register and switched via a
> + * "page select" register as explained in the diagram below.
> + * c) 256 bytes are mapped at a time. 'Lower page 00h' is the first 128
> + * bytes of address space, and always references the same
> + * location, independent of the page select register.
> + * All mapped pages are mapped into the upper 128 bytes
> + * (offset 128-255) of the i2c address.
> + * d) Devices with one I2C address (eg QSFP) use I2C address 0x50
> + * (A0h in the spec), and map all pages in the upper 128 bytes
> + * of that address.
> + * e) Devices with two I2C addresses (eg SFP) have 256 bytes of data
> + * at I2C address 0x50, and 256 bytes of data at I2C address
> + * 0x51 (A2h in the spec). Page selection and paged access
> + * only apply to this second I2C address (0x51).
> + * f) The address space is presented, by the driver, as a linear
> + * address space. For devices with one I2C client at address
> + * 0x50 (eg QSFP), offset 0-127 are in the lower
> + * half of address 50/A0h/client[0]. Offset 128-255 are in
> + * page 0, 256-383 are page 1, etc. More generally, offset
> + * 'n' resides in page (n/128)-1. ('page -1' is the lower
> + * half, offset 0-127).
> + * g) For devices with two I2C clients at address 0x50 and 0x51 (eg SFP),
> + * the address space places offset 0-127 in the lower
> + * half of 50/A0/client[0], offset 128-255 in the upper
> + * half. Offset 256-383 is in the lower half of 51/A2/client[1].
> + * Offset 384-511 is in page 0, in the upper half of 51/A2/...
> + * Offset 512-639 is in page 1, in the upper half of 51/A2/...
> + * Offset 'n' is in page (n/128)-3 (for n > 383)
> + *
> + * One I2C address (eg QSFP) Memory Map
> + *
> + * 2-Wire Serial Address: 1010000x
> + *
> + * Lower Page 00h (128 bytes)
> + * =====================
> + * | |
> + * | |
> + * | |
> + * | |
> + * | |
> + * | |
> + * | |
> + * | |
> + * | |
> + * | |
> + * |Page Select Byte(127)|
> + * =====================
> + * |
> + * |
> + * |
> + * |
> + * V
> + * ------------------------------------------------------------
> + * | | | |
> + * | | | |
> + * | | | |
> + * | | | |
> + * | | | |
> + * | | | |
> + * | | | |
> + * | | | |
> + * | | | |
> + * V V V V
> + * ------------ -------------- --------------- --------------
> + * | | | | | | | |
> + * | Upper | | Upper | | Upper | | Upper |
> + * | Page 00h | | Page 01h | | Page 02h | | Page 03h |
> + * | | | (Optional) | | (Optional) | | (Optional |
> + * | | | | | | | for Cable |
> + * | | | | | | | Assemblies) |
> + * | ID | | AST | | User | | |
> + * | Fields | | Table | | EEPROM Data | | |
> + * | | | | | | | |
> + * | | | | | | | |
> + * | | | | | | | |
> + * ------------ -------------- --------------- --------------
> + *
> + * The SFF 8436 (QSFP) spec only defines the 4 pages described above.
> + * In anticipation of future applications and devices, this driver
> + * supports access to the full architected range, 256 pages.
> + *
> + **/
> +
> +/* #define DEBUG 1 */
> +
> +#undef EEPROM_CLASS
> +#ifdef CONFIG_EEPROM_CLASS
> +#define EEPROM_CLASS
> +#endif
> +#ifdef CONFIG_EEPROM_CLASS_MODULE
> +#define EEPROM_CLASS
> +#endif
> +
> +#include <linux/kernel.h>
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/delay.h>
> +#include <linux/mutex.h>
> +#include <linux/sysfs.h>
> +#include <linux/jiffies.h>
> +#include <linux/i2c.h>
> +
> +#ifdef EEPROM_CLASS
> +#include <linux/eeprom_class.h>
> +#endif
> +
> +#include <linux/types.h>
> +
> +/* The maximum length of a port name */
> +#define MAX_PORT_NAME_LEN 20
> +
> +struct optoe_platform_data {
> + u32 byte_len; /* size (sum of all addr) */
> + u16 page_size; /* for writes */
> + u8 flags;
> + void *dummy1; /* backward compatibility */
> + void *dummy2; /* backward compatibility */
> +
> +#ifdef EEPROM_CLASS
> + struct eeprom_platform_data *eeprom_data;
> +#endif
> + char port_name[MAX_PORT_NAME_LEN];
> +};
> +
> +/* fundamental unit of addressing for EEPROM */
> +#define OPTOE_PAGE_SIZE 128
> +/*
> + * Single address devices (eg QSFP) have 256 pages, plus the unpaged
> + * low 128 bytes. If the device does not support paging, it is
> + * only 2 'pages' long.
> + */
> +#define OPTOE_ARCH_PAGES 256
> +#define ONE_ADDR_EEPROM_SIZE ((1 + OPTOE_ARCH_PAGES) * OPTOE_PAGE_SIZE)
> +#define ONE_ADDR_EEPROM_UNPAGED_SIZE (2 * OPTOE_PAGE_SIZE)
> +/*
> + * Dual address devices (eg SFP) have 256 pages, plus the unpaged
> + * low 128 bytes, plus 256 bytes at 0x50. If the device does not
> + * support paging, it is 4 'pages' long.
> + */
> +#define TWO_ADDR_EEPROM_SIZE ((3 + OPTOE_ARCH_PAGES) * OPTOE_PAGE_SIZE)
> +#define TWO_ADDR_EEPROM_UNPAGED_SIZE (4 * OPTOE_PAGE_SIZE)
> +#define TWO_ADDR_NO_0X51_SIZE (2 * OPTOE_PAGE_SIZE)
> +
> +/* a few constants to find our way around the EEPROM */
> +#define OPTOE_PAGE_SELECT_REG 0x7F
> +#define ONE_ADDR_PAGEABLE_REG 0x02
> +#define ONE_ADDR_NOT_PAGEABLE BIT(2)
> +#define TWO_ADDR_PAGEABLE_REG 0x40
> +#define TWO_ADDR_PAGEABLE BIT(4)
> +#define TWO_ADDR_0X51_REG 92
> +#define TWO_ADDR_0X51_SUPP BIT(6)
> +#define OPTOE_ID_REG 0
> +#define OPTOE_READ_OP 0
> +#define OPTOE_WRITE_OP 1
> +#define OPTOE_EOF 0 /* used for access beyond end of device */
> +
> +struct optoe_data {
> + struct optoe_platform_data chip;
> + int use_smbus;
> + char port_name[MAX_PORT_NAME_LEN];
> +
> + /*
> + * Lock protects against activities from other Linux tasks,
> + * but not from changes by other I2C masters.
> + */
> + struct mutex lock;
> + struct bin_attribute bin;
> + struct attribute_group attr_group;
> +
> + u8 *writebuf;
> + unsigned int write_max;
> +
> + unsigned int num_addresses;
> +
> +#ifdef EEPROM_CLASS
> + struct eeprom_device *eeprom_dev;
> +#endif
> +
> + /* dev_class: ONE_ADDR (QSFP) or TWO_ADDR (SFP) */
> + int dev_class;
> +
> + struct i2c_client *client[2];
> +};
> +
> +/*
> + * This parameter is to help this driver avoid blocking other drivers out
> + * of I2C for potentially troublesome amounts of time. With a 100 kHz I2C
> + * clock, one 256 byte read takes about 1/43 second which is excessive;
> + * but the 1/170 second it takes at 400 kHz may be quite reasonable; and
> + * at 1 MHz (Fm+) a 1/430 second delay could easily be invisible.
> + *
> + * This value is forced to be a power of two so that writes align on pages.
> + */
> +static unsigned int io_limit = OPTOE_PAGE_SIZE;
> +
> +/*
> + * specs often allow 5 msec for a page write, sometimes 20 msec;
> + * it's important to recover from write timeouts.
> + */
> +static unsigned int write_timeout = 25;
> +
> +/*
> + * flags to distinguish one-address (QSFP family) from two-address (SFP family)
> + * If the family is not known, figure it out when the device is accessed
> + */
> +#define ONE_ADDR 1
> +#define TWO_ADDR 2
> +
> +static const struct i2c_device_id optoe_ids[] = {
> + { "optoe1", ONE_ADDR },
> + { "optoe2", TWO_ADDR },
> + { "sff8436", ONE_ADDR },
> + { "24c04", TWO_ADDR },
> + { /* END OF LIST */ }
> +};
> +MODULE_DEVICE_TABLE(i2c, optoe_ids);
> +
> +/*-------------------------------------------------------------------------*/
> +/*
> + * This routine computes the addressing information to be used for
> + * a given r/w request.
> + *
> + * Task is to calculate the client (0 = i2c addr 50, 1 = i2c addr 51),
> + * the page, and the offset.
> + *
> + * Handles both single address (eg QSFP) and two address (eg SFP).
> + * For SFP, offset 0-255 are on client[0], >255 is on client[1]
> + * Offset 256-383 are on the lower half of client[1]
> + * Pages are accessible on the upper half of client[1].
> + * Offset >383 are in 128 byte pages mapped into the upper half
> + *
> + * For QSFP, all offsets are on client[0]
> + * offset 0-127 are on the lower half of client[0] (no paging)
> + * Pages are accessible on the upper half of client[0].
> + * Offset >127 are in 128 byte pages mapped into the upper half
> + *
> + * Callers must not read/write beyond the end of a client or a page
> + * without recomputing the client/page. Hence offset (within page)
> + * plus length must be less than or equal to 128. (Note that this
> + * routine does not have access to the length of the call, hence
> + * cannot do the validity check.)
> + *
> + * Offset within Lower Page 00h and Upper Page 00h are not recomputed
> + */
> +
> +static uint8_t optoe_translate_offset(struct optoe_data *optoe,
> + loff_t *offset,
> + struct i2c_client **client)
> +{
> + unsigned int page = 0;
> +
> + *client = optoe->client[0];
> +
> + /* if SFP style, offset > 255, shift to i2c addr 0x51 */
> + if (optoe->dev_class == TWO_ADDR) {
> + if (*offset > 255) {
> + /* like QSFP, but shifted to client[1] */
> + *client = optoe->client[1];
> + *offset -= 256;
> + }
> + }
> +
> + /*
> + * if offset is in the range 0-127...
> + * page doesn't matter (using lower half), return 0.
> + * offset is already correct (don't add 128 to get to paged area)
> + */
> + if (*offset < OPTOE_PAGE_SIZE)
> + return page;
> +
> + /* note, page will never be negative since *offset >= 128 */
> + page = (*offset >> 7) - 1;
> + /* 0x80 places the offset in the top half, offset is last 7 bits */
> + *offset = OPTOE_PAGE_SIZE + (*offset & 0x7f);
> +
> + return page; /* note also returning client and offset */
> +}
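To convince myself of the boundary cases in optoe_translate_offset(), I find it helps to have a user-space copy of just the arithmetic. This is only a sketch: struct xlate and translate() are names I made up, and the kernel types are replaced with plain C ones.

```c
#include <assert.h>

#define PAGE_BYTES 128

struct xlate {
	int client;		/* 0 = i2c addr 0x50, 1 = i2c addr 0x51 */
	unsigned int page;	/* value for the page select register */
	unsigned int off;	/* byte offset within that client, 0-255 */
};

/* Arithmetic only; two_addr is 1 for SFP-style, 0 for QSFP-style. */
static struct xlate translate(int two_addr, long off)
{
	struct xlate x = { 0, 0, 0 };

	assert(off >= 0);

	/* SFP style: anything past the first 256 bytes lives at 0x51 */
	if (two_addr && off > 255) {
		x.client = 1;
		off -= 256;
	}

	/* lower half: page select is irrelevant, offset used as is */
	if (off < PAGE_BYTES) {
		x.off = (unsigned int)off;
		return x;
	}

	/* upper half: 128-byte window selected by the page register */
	x.page = (unsigned int)(off >> 7) - 1;
	x.off = PAGE_BYTES + (unsigned int)(off & 0x7f);
	return x;
}
```

The interesting inputs are 127/128 on either device type and 255/256/383/384 on the two-address type, since those are where the client or the page changes.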
> +
> +static ssize_t optoe_eeprom_read(struct optoe_data *optoe,
> + struct i2c_client *client,
> + char *buf, unsigned int offset, size_t count)
> +{
> + struct i2c_msg msg[2];
> + u8 msgbuf[2];
> + unsigned long timeout, read_time;
> + int status, i;
> +
> + memset(msg, 0, sizeof(msg));
> +
> + switch (optoe->use_smbus) {
> + case I2C_SMBUS_I2C_BLOCK_DATA:
> + /*smaller eeproms can work given some SMBus extension calls */
> + if (count > I2C_SMBUS_BLOCK_MAX)
> + count = I2C_SMBUS_BLOCK_MAX;
> + break;
> + case I2C_SMBUS_WORD_DATA:
> + /* Check for odd length transaction */
> + count = (count == 1) ? 1 : 2;
> + break;
> + case I2C_SMBUS_BYTE_DATA:
> + count = 1;
> + break;
> + default:
> + /*
> + * When we have a better choice than SMBus calls, use a
> + * combined I2C message. Write address; then read up to
> + * io_limit data bytes. msgbuf is u8 and will cast to our
> + * needs.
> + */
> + i = 0;
> + msgbuf[i++] = offset;
> +
> + msg[0].addr = client->addr;
> + msg[0].buf = msgbuf;
> + msg[0].len = i;
> +
> + msg[1].addr = client->addr;
> + msg[1].flags = I2C_M_RD;
> + msg[1].buf = buf;
> + msg[1].len = count;
> + }
> +
> + /*
> + * Reads fail if the previous write didn't complete yet. We may
> + * loop a few times until this one succeeds, waiting at least
> + * long enough for one entire page write to work.
> + */
> + timeout = jiffies + msecs_to_jiffies(write_timeout);
> + do {
> + read_time = jiffies;
> +
> + switch (optoe->use_smbus) {
> + case I2C_SMBUS_I2C_BLOCK_DATA:
> + status = i2c_smbus_read_i2c_block_data(client, offset,
> + count, buf);
> + break;
> + case I2C_SMBUS_WORD_DATA:
> + status = i2c_smbus_read_word_data(client, offset);
> + if (status >= 0) {
> + buf[0] = status & 0xff;
> + if (count == 2)
> + buf[1] = status >> 8;
> + status = count;
> + }
> + break;
> + case I2C_SMBUS_BYTE_DATA:
> + status = i2c_smbus_read_byte_data(client, offset);
> + if (status >= 0) {
> + buf[0] = status;
> + status = count;
> + }
> + break;
> + default:
> + status = i2c_transfer(client->adapter, msg, 2);
> + if (status == 2)
> + status = count;
> + }
> +
> + dev_dbg(&client->dev, "eeprom read %zu@%u --> %d (%lu)\n",
> + count, offset, status, jiffies);
> +
> + if (status == count) /* happy path */
> + return count;
> +
> + if (status == -ENXIO) /* no module present */
> + return status;
> +
> + /* REVISIT: at HZ=100, this is sloooow */
> + usleep_range(1000, 2000);
> + } while (time_before(read_time, timeout));
> +
> + return -ETIMEDOUT;
> +}
> +
> +static ssize_t optoe_eeprom_write(struct optoe_data *optoe,
> + struct i2c_client *client,
> + const char *buf,
> + unsigned int offset, size_t count)
> +{
> + struct i2c_msg msg;
> + ssize_t status;
> + unsigned long timeout, write_time;
> + unsigned int next_page_start;
> + int i = 0;
> + u16 writeword;
> +
> + /* write max is at most a page
> + * (In this driver, write_max is actually one byte!)
> + */
> + if (count > optoe->write_max)
> + count = optoe->write_max;
> +
> + /* shorten count if necessary to avoid crossing page boundary */
> + next_page_start = roundup(offset + 1, OPTOE_PAGE_SIZE);
> + if (offset + count > next_page_start)
> + count = next_page_start - offset;
> +
> + switch (optoe->use_smbus) {
> + case I2C_SMBUS_I2C_BLOCK_DATA:
> + /*smaller eeproms can work given some SMBus extension calls */
> + if (count > I2C_SMBUS_BLOCK_MAX)
> + count = I2C_SMBUS_BLOCK_MAX;
> + break;
> + case I2C_SMBUS_WORD_DATA:
> + /* Check for odd length transaction */
> + count = (count == 1) ? 1 : 2;
> + break;
> + case I2C_SMBUS_BYTE_DATA:
> + count = 1;
> + break;
> + default:
> + /* If we'll use I2C calls for I/O, set up the message */
> + msg.addr = client->addr;
> + msg.flags = 0;
> +
> + /* msg.buf is u8 and casts will mask the values */
> + msg.buf = optoe->writebuf;
> +
> + msg.buf[i++] = offset;
> + memcpy(&msg.buf[i], buf, count);
> + msg.len = i + count;
> + break;
> + }
> +
> + /*
> + * Writes fail if the previous write didn't complete yet. We may
> + * loop a few times until this one succeeds, waiting at least
> + * long enough for one entire page write to work.
> + */
> + timeout = jiffies + msecs_to_jiffies(write_timeout);
> + do {
> + write_time = jiffies;
> +
> + switch (optoe->use_smbus) {
> + case I2C_SMBUS_I2C_BLOCK_DATA:
> + status = i2c_smbus_write_i2c_block_data(client,
> + offset,
> + count,
> + buf);
> + if (status == 0)
> + status = count;
> + break;
> + case I2C_SMBUS_WORD_DATA:
> + if (count == 2) {
> + writeword = (buf[1] << 8) | buf[0];
> + status = i2c_smbus_write_word_data(client,
> + offset,
> + writeword);
> + } else {
> + /* count = 1 */
> + status = i2c_smbus_write_byte_data(client,
> + offset,
> + buf[0]);
> + }
> + if (status == 0)
> + status = count;
> + break;
> + case I2C_SMBUS_BYTE_DATA:
> + status = i2c_smbus_write_byte_data(client, offset,
> + buf[0]);
> + if (status == 0)
> + status = count;
> + break;
> + default:
> + status = i2c_transfer(client->adapter, &msg, 1);
> + if (status == 1)
> + status = count;
> + break;
> + }
> +
> + dev_dbg(&client->dev, "eeprom write %zu@%d --> %ld (%lu)\n",
> + count, offset, (long int)status, jiffies);
> +
> + if (status == count)
> + return count;
> +
> + /* REVISIT: at HZ=100, this is sloooow */
> + usleep_range(1000, 2000);
> + } while (time_before(write_time, timeout));
> +
> + return -ETIMEDOUT;
> +}
> +
> +static ssize_t optoe_eeprom_update_client(struct optoe_data *optoe,
> + char *buf, loff_t off,
> + size_t count, int opcode)
> +{
> + struct i2c_client *client;
> + ssize_t retval = 0;
> + u8 page = 0;
> + loff_t phy_offset = off;
> + int ret = 0;
> +
> + page = optoe_translate_offset(optoe, &phy_offset, &client);
> + dev_dbg(&client->dev,
> + "%s off %lld page:%d phy_offset:%lld, count:%ld, opcode:%d\n",
> + __func__, off, page, phy_offset, (long int)count, opcode);
> + if (page > 0) {
> + ret = optoe_eeprom_write(optoe, client, &page,
> + OPTOE_PAGE_SELECT_REG, 1);
> + if (ret < 0) {
> + dev_dbg(&client->dev,
> + "Write page register for page %d failed ret:%d!\n",
> + page, ret);
> + return ret;
> + }
> + }
> +
> + while (count) {
> + ssize_t status;
> +
> + if (opcode == OPTOE_READ_OP) {
> + status = optoe_eeprom_read(optoe, client, buf,
> + phy_offset, count);
> + } else {
> + status = optoe_eeprom_write(optoe, client, buf,
> + phy_offset, count);
> + }
> + if (status <= 0) {
> + if (retval == 0)
> + retval = status;
> + break;
> + }
> + buf += status;
> + phy_offset += status;
> + count -= status;
> + retval += status;
> + }
> +
> + if (page > 0) {
> + /* return the page register to page 0 (why?) */
> + page = 0;
> + ret = optoe_eeprom_write(optoe, client, &page,
> + OPTOE_PAGE_SELECT_REG, 1);
> + if (ret < 0) {
> + dev_err(&client->dev,
> + "Restore page register to 0 failed:%d!\n", ret);
> + /* error only if nothing has been transferred */
> + if (retval == 0)
> + retval = ret;
> + }
> + }
> + return retval;
> +}
> +
> +/*
> + * Figure out if this access is within the range of supported pages.
> + * Note this is called on every access because we don't know if the
> + * module has been replaced since the last call.
> + * If/when modules support more pages, this is the routine to update
> + * to validate and allow access to additional pages.
> + *
> + * Returns updated len for this access:
> + * - entire access is legal, original len is returned.
> + * - access begins legal but is too long, len is truncated to fit.
> + * - initial offset exceeds supported pages, return OPTOE_EOF (zero)
> + */
> +static ssize_t optoe_page_legal(struct optoe_data *optoe,
> + loff_t off, size_t len)
> +{
> + struct i2c_client *client = optoe->client[0];
> + u8 regval;
> + int status;
> + size_t maxlen;
> +
> + if (off < 0)
> + return -EINVAL;
> + if (optoe->dev_class == TWO_ADDR) {
> + /* SFP case */
> + /* if only using addr 0x50 (first 256 bytes) we're good */
> + if ((off + len) <= TWO_ADDR_NO_0X51_SIZE)
> + return len;
> + /* if offset exceeds possible pages, we're not good */
> + if (off >= TWO_ADDR_EEPROM_SIZE)
> + return OPTOE_EOF;
> + /* in between, are pages supported? */
> + status = optoe_eeprom_read(optoe, client, &regval,
> + TWO_ADDR_PAGEABLE_REG, 1);
> + if (status < 0)
> + return status; /* error out (no module?) */
> + if (regval & TWO_ADDR_PAGEABLE) {
> + /* Pages supported, trim len to the end of pages */
> + maxlen = TWO_ADDR_EEPROM_SIZE - off;
> + } else {
> + /* pages not supported, trim len to unpaged size */
> + if (off >= TWO_ADDR_EEPROM_UNPAGED_SIZE)
> + return OPTOE_EOF;
> +
> + /* will be accessing addr 0x51, is that supported? */
> + /* byte 92, bit 6 implies DDM support, 0x51 support */
> + status = optoe_eeprom_read(optoe, client, &regval,
> + TWO_ADDR_0X51_REG, 1);
> + if (status < 0)
> + return status;
> + if (regval & TWO_ADDR_0X51_SUPP) {
> + /* addr 0x51 is OK */
> + maxlen = TWO_ADDR_EEPROM_UNPAGED_SIZE - off;
> + } else {
> + /* addr 0x51 NOT supported, trim to 256 max */
> + if (off >= TWO_ADDR_NO_0X51_SIZE)
> + return OPTOE_EOF;
> + maxlen = TWO_ADDR_NO_0X51_SIZE - off;
> + }
> + }
> + len = (len > maxlen) ? maxlen : len;
> + dev_dbg(&client->dev,
> + "page_legal, SFP, off %lld len %ld\n",
> + off, (long int)len);
> + } else {
> + /* QSFP case */
> + /* if no pages needed, we're good */
> + if ((off + len) <= ONE_ADDR_EEPROM_UNPAGED_SIZE)
> + return len;
> + /* if offset exceeds possible pages, we're not good */
> + if (off >= ONE_ADDR_EEPROM_SIZE)
> + return OPTOE_EOF;
> + /* in between, are pages supported? */
> + status = optoe_eeprom_read(optoe, client, &regval,
> + ONE_ADDR_PAGEABLE_REG, 1);
> + if (status < 0)
> + return status; /* error out (no module?) */
> + if (regval & ONE_ADDR_NOT_PAGEABLE) {
> + /* pages not supported, trim len to unpaged size */
> + if (off >= ONE_ADDR_EEPROM_UNPAGED_SIZE)
> + return OPTOE_EOF;
> + maxlen = ONE_ADDR_EEPROM_UNPAGED_SIZE - off;
> + } else {
> + /* Pages supported, trim len to the end of pages */
> + maxlen = ONE_ADDR_EEPROM_SIZE - off;
> + }
> + len = (len > maxlen) ? maxlen : len;
> + dev_dbg(&client->dev,
> + "page_legal, QSFP, off %lld len %ld\n",
> + off, (long int)len);
> + }
> + return len;
> +}
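The trimming that optoe_page_legal() performs comes down to clamping (off, len) against whichever device size applies. A small sketch of that arithmetic, assuming the size constants defined earlier in the patch; clamp_len() is a hypothetical name and the register reads that select the size are left out.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BYTES	128
#define ARCH_PAGES	256
#define QSFP_SIZE	((1 + ARCH_PAGES) * PAGE_BYTES)	/* 32896 bytes */
#define SFP_SIZE	((3 + ARCH_PAGES) * PAGE_BYTES)	/* 33152 bytes */

/*
 * Returns how many bytes may be transferred starting at off:
 * len if the whole access fits, a shorter length if it runs past
 * the end, or 0 (EOF) if off is already beyond the device.
 */
static size_t clamp_len(size_t off, size_t len, size_t dev_size)
{
	if (off >= dev_size)
		return 0;
	if (len > dev_size - off)
		return dev_size - off;
	return len;
}
```

For a module that reports no paging support, dev_size would be 256 (QSFP) or 512 (SFP), or 256 for an SFP without 0x51, rather than the full sizes above.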
> +
> +static ssize_t optoe_read_write(struct optoe_data *optoe,
> + char *buf, loff_t off,
> + size_t len, int opcode)
> +{
> + struct i2c_client *client = optoe->client[0];
> + int chunk;
> + int status = 0;
> + ssize_t retval;
> + size_t pending_len = 0, chunk_len = 0;
> + loff_t chunk_offset = 0, chunk_start_offset = 0;
> +
> + dev_dbg(&client->dev,
> + "%s: off %lld len:%ld, opcode:%s\n",
> + __func__, off, (long int)len,
> + (opcode == OPTOE_READ_OP) ? "r" : "w");
> + if (unlikely(!len))
> + return len;
> +
> + /*
> + * Read data from chip, protecting against concurrent updates
> + * from this host, but not from other I2C masters.
> + */
> + mutex_lock(&optoe->lock);
> +
> + /*
> + * Confirm this access fits within the device supported addr range
> + */
> + status = optoe_page_legal(optoe, off, len);
> + if (status == OPTOE_EOF || status < 0) {
> + mutex_unlock(&optoe->lock);
> + return status;
> + }
> + len = status;
> +
> + /*
> + * For each (128 byte) chunk involved in this request, issue a
> + * separate call to optoe_eeprom_update_client(), to
> + * ensure that each access recalculates the client/page
> + * and writes the page register as needed.
> + * Note that chunk to page mapping is confusing, is different for
> + * QSFP and SFP, and never needs to be done. Don't try!
> + */
> + pending_len = len; /* amount remaining to transfer */
> + retval = 0; /* amount transferred */
> + for (chunk = off >> 7; chunk <= (off + len - 1) >> 7; chunk++) {
> + /*
> + * Compute the offset and number of bytes to be read/write
> + *
> + * 1. start at offset 0 (within the chunk), and read/write
> + * the entire chunk
> + * 2. start at offset 0 (within the chunk) and read/write less
> + * than entire chunk
> + * 3. start at an offset not equal to 0 and read/write the rest
> + * of the chunk
> + * 4. start at an offset not equal to 0 and read/write less than
> + * (end of chunk - offset)
> + */
> + chunk_start_offset = chunk * OPTOE_PAGE_SIZE;
> +
> + if (chunk_start_offset < off) {
> + chunk_offset = off;
> + if ((off + pending_len) < (chunk_start_offset +
> + OPTOE_PAGE_SIZE))
> + chunk_len = pending_len;
> + else
> + chunk_len = chunk_start_offset + OPTOE_PAGE_SIZE - off;
> + } else {
> + chunk_offset = chunk_start_offset;
> + if (pending_len > OPTOE_PAGE_SIZE)
> + chunk_len = OPTOE_PAGE_SIZE;
> + else
> + chunk_len = pending_len;
> + }
> +
> + dev_dbg(&client->dev,
> + "sff_r/w: off %lld, len %ld, chunk_start_offset %lld, chunk_offset %lld, chunk_len %ld, pending_len %ld\n",
> + off, (long int)len, chunk_start_offset, chunk_offset,
> + (long int)chunk_len, (long int)pending_len);
> +
> + /*
> + * note: chunk_offset is from the start of the EEPROM,
> + * not the start of the chunk
> + */
> + status = optoe_eeprom_update_client(optoe, buf, chunk_offset,
> + chunk_len, opcode);
> + if (status != chunk_len) {
> + /* This is another 'no device present' path */
> + dev_dbg(&client->dev,
> + "o_u_c: chunk %d c_offset %lld c_len %ld failed %d!\n",
> + chunk, chunk_offset, (long int)chunk_len, status);
> + if (status > 0)
> + retval += status;
> + if (retval == 0)
> + retval = status;
> + break;
> + }
> + buf += status;
> + pending_len -= status;
> + retval += status;
> + }
> + mutex_unlock(&optoe->lock);
> +
> + return retval;
> +}
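The four cases listed in the chunking comment can also be expressed as one "bytes left until the next 128 byte boundary" computation. The following is a sketch of that alternative formulation, not the loop in the patch; struct chunk and split_chunks() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BYTES 128

struct chunk {
	size_t off;	/* offset from the start of the EEPROM */
	size_t len;	/* bytes in this chunk, never crossing a boundary */
};

/*
 * Fill out[] with up to max chunks covering [off, off + len), split at
 * every 128 byte boundary. Returns the number of chunks written.
 */
static size_t split_chunks(size_t off, size_t len,
			   struct chunk *out, size_t max)
{
	size_t n = 0;

	while (len && n < max) {
		/* room left before the next 128 byte boundary */
		size_t room = PAGE_BYTES - (off % PAGE_BYTES);
		size_t take = len < room ? len : room;

		out[n].off = off;
		out[n].len = take;
		n++;
		off += take;
		len -= take;
	}
	return n;
}
```

A request for 100 bytes at offset 200, for example, comes out as 56 bytes at 200 followed by 44 bytes at 256, so each piece can have its client and page recomputed independently.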
> +
> +static ssize_t optoe_bin_read(struct file *filp, struct kobject *kobj,
> + struct bin_attribute *attr,
> + char *buf, loff_t off, size_t count)
> +{
> + struct i2c_client *client = to_i2c_client(container_of(kobj,
> + struct device, kobj));
> + struct optoe_data *optoe = i2c_get_clientdata(client);
> +
> + return optoe_read_write(optoe, buf, off, count, OPTOE_READ_OP);
> +}
> +
> +static ssize_t optoe_bin_write(struct file *filp, struct kobject *kobj,
> + struct bin_attribute *attr,
> + char *buf, loff_t off, size_t count)
> +{
> + struct i2c_client *client = to_i2c_client(container_of(kobj,
> + struct device, kobj));
> + struct optoe_data *optoe = i2c_get_clientdata(client);
> +
> + return optoe_read_write(optoe, buf, off, count, OPTOE_WRITE_OP);
> +}
> +
> +static int optoe_remove(struct i2c_client *client)
> +{
> + struct optoe_data *optoe;
> +
> + optoe = i2c_get_clientdata(client);
> + sysfs_remove_group(&client->dev.kobj, &optoe->attr_group);
> + sysfs_remove_bin_file(&client->dev.kobj, &optoe->bin);
> +
> + if (optoe->num_addresses == 2)
> + i2c_unregister_device(optoe->client[1]);
> +
> +#ifdef EEPROM_CLASS
> + eeprom_device_unregister(optoe->eeprom_dev);
> +#endif
> +
> + kfree(optoe->writebuf);
> + kfree(optoe);
> + return 0;
> +}
> +
> +static ssize_t dev_class_show(struct device *dev,
> + struct device_attribute *dattr, char *buf)
> +{
> + struct i2c_client *client = to_i2c_client(dev);
> + struct optoe_data *optoe = i2c_get_clientdata(client);
> + ssize_t count;
> +
> + mutex_lock(&optoe->lock);
> + count = sprintf(buf, "%d\n", optoe->dev_class);
> + mutex_unlock(&optoe->lock);
> +
> + return count;
> +}
> +
> +static ssize_t dev_class_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + struct i2c_client *client = to_i2c_client(dev);
> + struct optoe_data *optoe = i2c_get_clientdata(client);
> + int dev_class;
> +
> + /*
> + * dev_class is actually the number of i2c addresses used, thus
> + * legal values are "1" (QSFP class) and "2" (SFP class)
> + */
> +
> + if (kstrtoint(buf, 0, &dev_class) != 0 ||
> + dev_class < 1 || dev_class > 2)
> + return -EINVAL;
> +
> + mutex_lock(&optoe->lock);
> + optoe->dev_class = dev_class;
> + mutex_unlock(&optoe->lock);
> +
> + return count;
> +}
> +
> +/*
> + * if using the EEPROM CLASS driver, we don't report a port_name,
> + * the EEPROM CLASS driver handles that. Hence all this code is
> + * only compiled if we are NOT using the EEPROM CLASS driver.
> + */
> +#ifndef EEPROM_CLASS
> +
> +static ssize_t port_name_show(struct device *dev,
> + struct device_attribute *dattr, char *buf)
> +{
> + struct i2c_client *client = to_i2c_client(dev);
> + struct optoe_data *optoe = i2c_get_clientdata(client);
> + ssize_t count;
> +
> + mutex_lock(&optoe->lock);
> + count = sprintf(buf, "%s\n", optoe->port_name);
> + mutex_unlock(&optoe->lock);
> +
> + return count;
> +}
> +
> +static ssize_t port_name_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + struct i2c_client *client = to_i2c_client(dev);
> + struct optoe_data *optoe = i2c_get_clientdata(client);
> + char port_name[MAX_PORT_NAME_LEN];
> +
> + /* no checking, this value is not used except by port_name_show */
> +
> + if (sscanf(buf, "%19s", port_name) != 1)
> + return -EINVAL;
> +
> + mutex_lock(&optoe->lock);
> + strcpy(optoe->port_name, port_name);
> + mutex_unlock(&optoe->lock);
> +
> + return count;
> +}
> +
> +static DEVICE_ATTR_RW(port_name);
> +#endif /* if NOT defined EEPROM_CLASS, the common case */
> +
> +static DEVICE_ATTR_RW(dev_class);
> +
> +static struct attribute *optoe_attrs[] = {
> +#ifndef EEPROM_CLASS
> + &dev_attr_port_name.attr,
> +#endif
> + &dev_attr_dev_class.attr,
> + NULL,
> +};
> +
> +static struct attribute_group optoe_attr_group = {
> + .attrs = optoe_attrs,
> +};
> +
> +static int optoe_probe(struct i2c_client *client,
> + const struct i2c_device_id *id)
> +{
> + int err;
> + int use_smbus = 0;
> + struct optoe_platform_data chip;
> + struct optoe_data *optoe;
> + int num_addresses;
> + char port_name[MAX_PORT_NAME_LEN];
> +
> + if (client->addr != 0x50) {
> + dev_dbg(&client->dev, "probe, bad i2c addr: 0x%x\n",
> + client->addr);
> + err = -EINVAL;
> + goto exit;
> + }
> +
> + if (client->dev.platform_data) {
> + chip = *(struct optoe_platform_data *)client->dev.platform_data;
> + /* take the port name from the supplied platform data */
> +#ifdef EEPROM_CLASS
> + strncpy(port_name, chip.eeprom_data->label, MAX_PORT_NAME_LEN);
> +#else
> + memcpy(port_name, chip.port_name, MAX_PORT_NAME_LEN);
> +#endif
> + dev_dbg(&client->dev,
> + "probe, chip provided, flags:0x%x; name: %s\n",
> + chip.flags, client->name);
> + } else {
> + if (!id->driver_data) {
> + err = -ENODEV;
> + goto exit;
> + }
> + dev_dbg(&client->dev, "probe, building chip\n");
> + strcpy(port_name, "unitialized");
> + chip.flags = 0;
> +#ifdef EEPROM_CLASS
> + chip.eeprom_data = NULL;
> +#endif
> + }
> +
> + /* Use I2C operations unless we're stuck with SMBus extensions. */
> + if (!i2c_check_functionality(client->adapter, I2C_FUNC_I2C)) {
> + if (i2c_check_functionality(client->adapter,
> + I2C_FUNC_SMBUS_READ_I2C_BLOCK)) {
> + use_smbus = I2C_SMBUS_I2C_BLOCK_DATA;
> + } else if (i2c_check_functionality(client->adapter,
> + I2C_FUNC_SMBUS_READ_WORD_DATA)) {
> + use_smbus = I2C_SMBUS_WORD_DATA;
> + } else if (i2c_check_functionality(client->adapter,
> + I2C_FUNC_SMBUS_READ_BYTE_DATA)) {
> + use_smbus = I2C_SMBUS_BYTE_DATA;
> + } else {
> + err = -EPFNOSUPPORT;
> + goto exit;
> + }
> + }
> +
> + optoe = kzalloc(sizeof(*optoe), GFP_KERNEL);
> + if (!optoe) {
> + err = -ENOMEM;
> + goto exit;
> + }
> +
> + mutex_init(&optoe->lock);
> +
> + /* determine whether this is a one-address or two-address module */
> + if ((strcmp(client->name, "optoe1") == 0) ||
> + (strcmp(client->name, "sff8436") == 0)) {
> + /* one-address (eg QSFP) family */
> + optoe->dev_class = ONE_ADDR;
> + chip.byte_len = ONE_ADDR_EEPROM_SIZE;
> + num_addresses = 1;
> + } else if ((strcmp(client->name, "optoe2") == 0) ||
> + (strcmp(client->name, "24c04") == 0)) {
> + /* SFP family */
> + optoe->dev_class = TWO_ADDR;
> + chip.byte_len = TWO_ADDR_EEPROM_SIZE;
> + num_addresses = 2;
> + } else { /* those were the only two choices */
> + err = -EINVAL;
> + goto exit;
> + }
> +
> + dev_dbg(&client->dev, "dev_class: %d\n", optoe->dev_class);
> + optoe->use_smbus = use_smbus;
> + optoe->chip = chip;
> + optoe->num_addresses = num_addresses;
> + memcpy(optoe->port_name, port_name, MAX_PORT_NAME_LEN);
> +
> + /*
> + * Export the EEPROM bytes through sysfs, since that's convenient.
> + * By default, only root should see the data (maybe passwords etc)
> + */
> + sysfs_bin_attr_init(&optoe->bin);
> + optoe->bin.attr.name = "eeprom";
> + optoe->bin.attr.mode = 0444;
> + optoe->bin.read = optoe_bin_read;
> + optoe->bin.size = chip.byte_len;
> +
> + if (!use_smbus ||
> + i2c_check_functionality(client->adapter,
> + I2C_FUNC_SMBUS_WRITE_I2C_BLOCK) ||
> + i2c_check_functionality(client->adapter,
> + I2C_FUNC_SMBUS_WRITE_WORD_DATA) ||
> + i2c_check_functionality(client->adapter,
> + I2C_FUNC_SMBUS_WRITE_BYTE_DATA)) {
> + /*
> + * NOTE: AN-2079
> + * Finisar recommends that the host implement 1 byte writes
> + * only since this module only supports 32 byte page boundaries.
> + * 2 byte writes are acceptable for PE and Vout changes per
> + * Application Note AN-2071.
> + */
> + unsigned int write_max = 1;
> +
> + optoe->bin.write = optoe_bin_write;
> + optoe->bin.attr.mode |= 0200;
> +
> + if (write_max > io_limit)
> + write_max = io_limit;
> + if (use_smbus && write_max > I2C_SMBUS_BLOCK_MAX)
> + write_max = I2C_SMBUS_BLOCK_MAX;
> + optoe->write_max = write_max;
> +
> + /* buffer (data + address at the beginning) */
> + optoe->writebuf = kmalloc(write_max + 2, GFP_KERNEL);
> + if (!optoe->writebuf) {
> + err = -ENOMEM;
> + goto exit_kfree;
> + }
> + } else {
> + dev_warn(&client->dev,
> + "cannot write due to controller restrictions.");
> + }
> +
> + optoe->client[0] = client;
> +
> + /* SFF-8472 spec requires that the second I2C address be 0x51 */
> + if (num_addresses == 2) {
> + optoe->client[1] = i2c_new_dummy(client->adapter, 0x51);
> + if (!optoe->client[1]) {
> + dev_err(&client->dev, "address 0x51 unavailable\n");
> + err = -EADDRINUSE;
> + goto err_struct;
> + }
> + }
> +
> + /* create the sysfs eeprom file */
> + err = sysfs_create_bin_file(&client->dev.kobj, &optoe->bin);
> + if (err)
> + goto err_struct;
> +
> + optoe->attr_group = optoe_attr_group;
> +
> + err = sysfs_create_group(&client->dev.kobj, &optoe->attr_group);
> + if (err) {
> + dev_err(&client->dev, "failed to create sysfs attribute group.\n");
> + goto err_struct;
> + }
> +
> +#ifdef EEPROM_CLASS
> + optoe->eeprom_dev = eeprom_device_register(&client->dev,
> + chip.eeprom_data);
> + if (IS_ERR(optoe->eeprom_dev)) {
> + dev_err(&client->dev, "error registering eeprom device.\n");
> + err = PTR_ERR(optoe->eeprom_dev);
> + goto err_sysfs_cleanup;
> + }
> +#endif
> +
> + i2c_set_clientdata(client, optoe);
> +
> + dev_info(&client->dev, "%zu byte %s EEPROM, %s\n",
> + optoe->bin.size, client->name,
> + optoe->bin.write ? "read/write" : "read-only");
> +
> + if (use_smbus == I2C_SMBUS_WORD_DATA ||
> + use_smbus == I2C_SMBUS_BYTE_DATA) {
> + dev_notice(&client->dev,
> + "Falling back to %s reads, performance will suffer\n",
> + use_smbus == I2C_SMBUS_WORD_DATA ? "word" : "byte");
> + }
> +
> + return 0;
> +
> +#ifdef EEPROM_CLASS
> +err_sysfs_cleanup:
> + sysfs_remove_group(&client->dev.kobj, &optoe->attr_group);
> + sysfs_remove_bin_file(&client->dev.kobj, &optoe->bin);
> +#endif
> +
> +err_struct:
> + if (num_addresses == 2) {
> + if (optoe->client[1])
> + i2c_unregister_device(optoe->client[1]);
> + }
> +
> + kfree(optoe->writebuf);
> +exit_kfree:
> + kfree(optoe);
> +exit:
> + dev_dbg(&client->dev, "probe error %d\n", err);
> +
> + return err;
> +}
> +
> +/*-------------------------------------------------------------------------*/
> +
> +static struct i2c_driver optoe_driver = {
> + .driver = {
> + .name = "optoe",
> + .owner = THIS_MODULE,
> + },
> + .probe = optoe_probe,
> + .remove = optoe_remove,
> + .id_table = optoe_ids,
> +};
> +
> +static int __init optoe_init(void)
> +{
> + if (!io_limit) {
> + pr_err("optoe: io_limit must not be 0!\n");
> + return -EINVAL;
> + }
> +
> + io_limit = rounddown_pow_of_two(io_limit);
> + return i2c_add_driver(&optoe_driver);
> +}
> +module_init(optoe_init);
> +
> +static void __exit optoe_exit(void)
> +{
> + i2c_del_driver(&optoe_driver);
> +}
> +module_exit(optoe_exit);
> +
> +MODULE_DESCRIPTION("Driver for optical transceiver (SFP, QSFP, ...) EEPROMs");
> +MODULE_AUTHOR("DON BOLLINGER <don@thebollingers.org>");
> +MODULE_LICENSE("GPL");
>
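For readers following the chunking loop in optoe_read_write() above, here is a minimal userspace sketch of the idea: a linear (offset, length) request is cut into pieces that never cross a 128-byte page boundary. It is an illustration only, not the patch's code. The names PAGE_SIZE_BYTES, next_chunk_len() and count_chunks() are hypothetical, and the 128-byte page size is taken from the patch description. The sketch measures the room left in a page from the start of the page that contains the offset, which keeps the length positive for offsets beyond the first page.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed to correspond to OPTOE_PAGE_SIZE (128-byte pages). */
#define PAGE_SIZE_BYTES 128

/*
 * Hypothetical helper: length of the next chunk starting at 'off'
 * that does not cross a page boundary.
 */
static size_t next_chunk_len(long long off, size_t pending_len)
{
	size_t room;

	assert(off >= 0);
	/* Bytes left between 'off' and the end of the page containing it. */
	room = PAGE_SIZE_BYTES - (size_t)(off % PAGE_SIZE_BYTES);
	return pending_len < room ? pending_len : room;
}

/* Hypothetical helper: how many chunks a request is split into. */
static int count_chunks(long long off, size_t len)
{
	int n = 0;

	while (len > 0) {
		size_t c = next_chunk_len(off, len);

		off += (long long)c;
		len -= c;
		n++;
	}
	return n;
}
```

With these helpers, a 300-byte request at offset 130 is split into chunks of 126, 128 and 46 bytes.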
* Re: [PULL] vhost: cleanups and fixes
From: Linus Torvalds @ 2018-06-11 18:32 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: KVM list, Network Development, Linux Kernel Mailing List,
Bjorn Andersson, Andrew Morton, virtualization
In-Reply-To: <20180611192353-mutt-send-email-mst@kernel.org>
On Mon, Jun 11, 2018 at 9:24 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
Is this really a good idea?
Plus it seems entirely broken.
The report_pfn_range() callback is done under the zone lock, but
virtio_balloon_send_free_pages() (which is the only callback used that
I can find) does add_one_sg(), which does virtqueue_add_inbuf(vq, &sg,
1, vq, GFP_KERNEL);
So now we apparently do a GFP_KERNEL allocation inside the mm zone
lock, which is broken on just _so_ many levels.
Pulled and then unpulled again.
Either somebody needs to explain why I'm wrong and you can re-submit
this, or this kind of garbage needs to go away.
I do *not* want to be in the situation where I pull stuff from the
virtio people that adds completely broken core VM functionality.
Because if I'm in that situation, I will stop pulling from you guys.
Seriously. You have *no* place sending me broken shit that is outside
the virtio layer.
Subsystems that break core MM will get shunned. You just aren't
important enough to allow you breaking core VM.
Linus
* Re: [v3, 03/10] dt-binding: ptp_qoriq: add DPAA FMan support
From: Rob Herring @ 2018-06-11 18:25 UTC (permalink / raw)
To: Yangbo Lu
Cc: netdev, madalin.bucur, Richard Cochran, Shawn Guo,
David S . Miller, devicetree, linuxppc-dev, linux-arm-kernel,
linux-kernel
In-Reply-To: <20180607092050.46128-4-yangbo.lu@nxp.com>
On Thu, Jun 07, 2018 at 05:20:43PM +0800, Yangbo Lu wrote:
> This patch is to add bindings description for DPAA
> FMan 1588 timer, and also remove its description in
> fsl-fman dt-bindings document.
>
> Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
> ---
> Changes for v2:
> - None.
> Changes for v3:
> - None.
> ---
> Documentation/devicetree/bindings/net/fsl-fman.txt | 25 +-------------------
> .../devicetree/bindings/ptp/ptp-qoriq.txt | 15 +++++++++--
> 2 files changed, 13 insertions(+), 27 deletions(-)
Reviewed-by: Rob Herring <robh@kernel.org>
* Re: [PATCH] Bluetooth: hci_bcm: Configure SCO routing automatically
From: Marcel Holtmann @ 2018-06-11 18:19 UTC (permalink / raw)
To: Rob Herring
Cc: Attila Tőkés, David S. Miller, Mark Rutland,
Johan Hedberg, Artiom Vaskov, netdev, devicetree,
linux-kernel@vger.kernel.org, open list:BLUETOOTH DRIVERS
In-Reply-To: <CAL_JsqL2StWgA9gkL9i3o85-C3TNixBLmbTXzFQW8bts4phJqQ@mail.gmail.com>
Hi Rob,
>>>>>> Added support to automatically configure the SCO packet routing at the
>>>>>> device setup. The SCO packets are used with the HSP / HFP profiles, but in
>>>>>> some devices (ex. CYW43438) they are routed to a PCM output by default. This
>>>>>> change allows sending the vendor specific HCI command to configure the SCO
>>>>>> routing. The parameters of the command are loaded from the device tree.
>>>>>
>>>>> Please wrap your commit msg.
>>>>
>>>>
>>>> Sure.
>>>>>
>>>>>
>>>>>>
>>>>>> Signed-off-by: Attila Tőkés <attitokes@gmail.com>
>>>>>> ---
>>>>>> .../bindings/net/broadcom-bluetooth.txt | 7 ++
>>>>>
>>>>> Please split bindings to separate patch.
>>>>
>>>>
>>>> Ok, I will split this in two.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> drivers/bluetooth/hci_bcm.c | 72 +++++++++++++++++++
>>>>>> 2 files changed, 79 insertions(+)
>>>>>>
>>>>>> diff --git
>>>>>> a/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
>>>>>> b/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
>>>>>> index 4194ff7e..aea3a094 100644
>>>>>> --- a/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
>>>>>> +++ b/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
>>>>>> @@ -21,6 +21,12 @@ Optional properties:
>>>>>> - clocks: clock specifier if external clock provided to the controller
>>>>>> - clock-names: should be "extclk"
>>>>>>
>>>>>> + SCO routing parameters:
>>>>>> + - sco-routing: 0-3 (PCM, Transport, Codec, I2S)
>>>>>> + - pcm-interface-rate: 0-4 (128 Kbps - 2048 Kbps)
>>>>>> + - pcm-frame-type: 0 (short), 1 (long)
>>>>>> + - pcm-sync-mode: 0 (slave), 1 (master)
>>>>>> + - pcm-clock-mode: 0 (slave), 1 (master)
>>>>>
>>>>> Are these Broadcom specific? Properties need either vendor prefix or
>>>>> to be documented in a common location. I think these look like the
>>>>> latter.
>>>>
>>>>
>>>> These will be used as parameters of a vendor specific (Broadcom/Cypress)
>>>> command configuring the SCO packet routing. See the Write_SCO_PCM_Int_Param
>>>> command from: http://www.cypress.com/file/298311/download.
>>>
>>> The DT should just describe how the h/w is hooked-up. What the s/w has
>>> to do based on that is the driver's problem which is certainly
>>> vendor/chip specific, but that is all irrelevant to the binding.
>>>
>>>> What would be the property names with a Broadcom / Cypress vendor prefix?
>>>>
>>>> brcm,sco-routing
>>>> brcm,pcm-interface-rate
>>>> brcm,pcm-frame-type
>>>> brcm,pcm-sync-mode
>>>> brcm,pcm-clock-mode
>>>>
>>>> ?
>>>
>>> Yes.
>>
>> we can do this. However all pcm-* are optional if you switch the HCI transport. And sco-routing should default to HCI if that is not present. Meaning a driver should actively trying to change this. Nevertheless, it would be good if a driver reads the current settings.
>>
>> In theory we could make sco-routing generic, but so many vendors have different modes, that we better keep this vendor specific.
>
> Even if vendor specific, the properties for not HCI transport case are
> still incomplete IMO.
>
> By modes, you mean PCM vs. I2S and all the flavors of timings you can
> have within those or something else? For the former, that's often
> going to be a process of solving what each end support and if that
> doesn't work, then IIRC we already have properties for setting
> modes/timing. All the same issues exist with audio codecs and this is
> really not any different.
This is what Broadcom uses to configure their PCM transport. So I think for now, we give them the brcm, vendor prefix and see how that goes. We can always generalize them later if enough chip manufacturers provide support for it.
Regards
Marcel
* Re: [PATCH net] failover: eliminate callback hell
From: Michael S. Tsirkin @ 2018-06-11 18:10 UTC (permalink / raw)
To: Stephen Hemminger
Cc: kys, haiyangz, davem, sridhar.samudrala, netdev,
Stephen Hemminger
In-Reply-To: <20180605034231.31610-1-sthemmin@microsoft.com>
On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote:
> * Set permanent and current address of net_failover device
> to match the primary.
>
> * Carrier should be marked off before registering device
> the net_failover device.
Sridhar, do we want to address this?
If yes, could you please take a look at addressing these
meanwhile, while we keep arguing about making API changes?
--
MST
* Re: [PATCH 4/6] arcnet: com20020: bindings for smsc com20020
From: Rob Herring @ 2018-06-11 18:09 UTC (permalink / raw)
To: Andrea Greco
Cc: davem, tobin, Andrea Greco, Mark Rutland, netdev, devicetree,
linux-kernel
In-Reply-To: <20180611142705.20849-1-andrea.greco.gapmilano@gmail.com>
On Mon, Jun 11, 2018 at 04:27:01PM +0200, Andrea Greco wrote:
> From: Andrea Greco <a.greco@4sigma.it>
>
> Add devicetree bindings for smsc com20020
>
> Signed-off-by: Andrea Greco <a.greco@4sigma.it>
> ---
> .../devicetree/bindings/net/smsc-com20020.txt | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
> create mode 100644 Documentation/devicetree/bindings/net/smsc-com20020.txt
>
> diff --git a/Documentation/devicetree/bindings/net/smsc-com20020.txt b/Documentation/devicetree/bindings/net/smsc-com20020.txt
> new file mode 100644
> index 000000000000..660a4a751f29
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/smsc-com20020.txt
> @@ -0,0 +1,21 @@
> +SMSC com20020 Arcnet network controller
> +
> +Required property:
> +- timeout-ns: Arcnet bus timeout, Idle Time (328000 - 20500)
> +- bus-speed-bps: Arcnet bus speed (10000000 - 156250)
> +- smsc,xtal-mhz: External oscillator frequency
> +- smsc,backplane-enabled: Controller use backplane mode
> +- reset-gpios: Chip reset pin
> +- interrupts: Should contain controller interrupt
> +
> +arcnet@28000000 {
> + compatible = "smsc,com20020";
> +
> + timeout-ns = <20500>;
> + bus-speed-hz = <10000000>;
You have hz here and bps above.
> + smsc,xtal-mhz = <20>;
> + smsc,backplane-enabled;
> +
> + reset-gpios = <&gpio3 21 GPIO_ACTIVE_LOW>;
> + interrupts = <&gpio2 10 GPIO_ACTIVE_LOW>;
> +};
> --
> 2.14.4
>
* Re: [PATCH net] KEYS: DNS: fix parsing multiple options
From: Simon Horman @ 2018-06-11 18:08 UTC (permalink / raw)
To: Eric Biggers
Cc: netdev, David S . Miller, keyrings, David Howells, Wang Lei,
Eric Biggers
In-Reply-To: <20180611175742.GA33284@gmail.com>
On Mon, Jun 11, 2018 at 10:57:42AM -0700, Eric Biggers wrote:
> Hi Simon,
>
> On Mon, Jun 11, 2018 at 11:40:23AM +0200, Simon Horman wrote:
> > On Fri, Jun 08, 2018 at 09:20:37AM -0700, Eric Biggers wrote:
> > > From: Eric Biggers <ebiggers@google.com>
> > >
> > > My recent fix for dns_resolver_preparse() printing very long strings was
> > > incomplete, as shown by syzbot which still managed to hit the
> > > WARN_ONCE() in set_precision() by adding a crafted "dns_resolver" key:
> > >
> > > precision 50001 too large
> > > WARNING: CPU: 7 PID: 864 at lib/vsprintf.c:2164 vsnprintf+0x48a/0x5a0
> > >
> > > The bug this time isn't just a printing bug, but also a logical error
> > > when multiple options ("#"-separated strings) are given in the key
> > > payload. Specifically, when separating an option string into name and
> > > value, if there is no value then the name is incorrectly considered to
> > > end at the end of the key payload, rather than the end of the current
> > > option. This bypasses validation of the option length, and also means
> > > that specifying multiple options is broken -- which presumably has gone
> > > unnoticed as there is currently only one valid option anyway.
> > >
> > > Fix it by correctly calculating the length of the option name.
> > >
> > > Reproducer:
> > >
> > > perl -e 'print "#A#", "\x00" x 50000' | keyctl padd dns_resolver desc @s
> > >
> > > Fixes: 4a2d789267e0 ("DNS: If the DNS server returns an error, allow that to be cached [ver #2]")
> > > Signed-off-by: Eric Biggers <ebiggers@google.com>
> > > ---
> > > net/dns_resolver/dns_key.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/net/dns_resolver/dns_key.c b/net/dns_resolver/dns_key.c
> > > index 40c851693f77e..d448823d4d2ed 100644
> > > --- a/net/dns_resolver/dns_key.c
> > > +++ b/net/dns_resolver/dns_key.c
> > > @@ -97,7 +97,7 @@ dns_resolver_preparse(struct key_preparsed_payload *prep)
> > > return -EINVAL;
> > > }
> > >
> > > - eq = memchr(opt, '=', opt_len) ?: end;
> > > + eq = memchr(opt, '=', opt_len) ?: next_opt;
> > > opt_nlen = eq - opt;
> > > eq++;
> >
> > > It seems risky to advance eq++ in the case where the value is empty.
> > > It is then not pointing to the value, but it may be accessed twice further on
> > in this loop.
> >
>
> Sure, that's a separate existing issue though, and it must be checked that the
> value is present before using it anyway, which the code already does, so it's
> not a "real" bug. I think I'll keep this patch simple and leave that part as-is
> for now.
Thanks Eric, I was reflecting on that too. I agree that your patch resolves
a problem without introducing a new one.
Reviewed-by: Simon Horman <simon.horman@netronome.com>
* Re: [PATCH net] failover: eliminate callback hell
From: Michael S. Tsirkin @ 2018-06-11 18:07 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Samudrala, Sridhar, kys, haiyangz, davem, netdev,
Stephen Hemminger
In-Reply-To: <20180606152121.597a89ec@xeon-e3>
On Wed, Jun 06, 2018 at 03:21:21PM -0700, Stephen Hemminger wrote:
> > >
> > > I think you need to get rid of triggering off of the name change.
> >
> > Worth considering down the road since it's a bit of a hack but it's one
> > we won't have trouble supporting, unlike the delayed uplink.
>
> You can't depend on userspace doing rename.
There's only so much we can do to add new functionality to old
userspace. You can always just use PV and all will work.
--
MST
* Re: [PATCH net] KEYS: DNS: fix parsing multiple options
From: Eric Biggers @ 2018-06-11 17:57 UTC (permalink / raw)
To: Simon Horman
Cc: netdev, David S . Miller, keyrings, David Howells, Wang Lei,
Eric Biggers
In-Reply-To: <20180611094022.gvrefejktxzw44i7@netronome.com>
Hi Simon,
On Mon, Jun 11, 2018 at 11:40:23AM +0200, Simon Horman wrote:
> On Fri, Jun 08, 2018 at 09:20:37AM -0700, Eric Biggers wrote:
> > From: Eric Biggers <ebiggers@google.com>
> >
> > My recent fix for dns_resolver_preparse() printing very long strings was
> > incomplete, as shown by syzbot which still managed to hit the
> > WARN_ONCE() in set_precision() by adding a crafted "dns_resolver" key:
> >
> > precision 50001 too large
> > WARNING: CPU: 7 PID: 864 at lib/vsprintf.c:2164 vsnprintf+0x48a/0x5a0
> >
> > The bug this time isn't just a printing bug, but also a logical error
> > when multiple options ("#"-separated strings) are given in the key
> > payload. Specifically, when separating an option string into name and
> > value, if there is no value then the name is incorrectly considered to
> > end at the end of the key payload, rather than the end of the current
> > option. This bypasses validation of the option length, and also means
> > that specifying multiple options is broken -- which presumably has gone
> > unnoticed as there is currently only one valid option anyway.
> >
> > Fix it by correctly calculating the length of the option name.
> >
> > Reproducer:
> >
> > perl -e 'print "#A#", "\x00" x 50000' | keyctl padd dns_resolver desc @s
> >
> > Fixes: 4a2d789267e0 ("DNS: If the DNS server returns an error, allow that to be cached [ver #2]")
> > Signed-off-by: Eric Biggers <ebiggers@google.com>
> > ---
> > net/dns_resolver/dns_key.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/dns_resolver/dns_key.c b/net/dns_resolver/dns_key.c
> > index 40c851693f77e..d448823d4d2ed 100644
> > --- a/net/dns_resolver/dns_key.c
> > +++ b/net/dns_resolver/dns_key.c
> > @@ -97,7 +97,7 @@ dns_resolver_preparse(struct key_preparsed_payload *prep)
> > return -EINVAL;
> > }
> >
> > - eq = memchr(opt, '=', opt_len) ?: end;
> > + eq = memchr(opt, '=', opt_len) ?: next_opt;
> > opt_nlen = eq - opt;
> > eq++;
>
> It seems risky to advance eq++ in the case where the value is empty.
> It is then not pointing to the value, but it may be accessed twice further on
> in this loop.
>
Sure, that's a separate existing issue though, and it must be checked that the
value is present before using it anyway, which the code already does, so it's
not a "real" bug. I think I'll keep this patch simple and leave that part as-is
for now.
Eric
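To make the point about option boundaries concrete, here is a minimal userspace sketch of the parsing rule discussed in this thread: when an option has no '=', its name must end at the end of the current option (next_opt), not at the end of the whole payload. This is an illustration in standard C, not the kernel code. struct opt_span, find_next_opt() and split_option() are hypothetical names, and the sketch reports the missing value explicitly instead of advancing a pointer past the option.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
struct opt_span {
	size_t name_len;	/* bytes in the option name */
	int has_value;		/* 1 if '=' was found inside this option */
	size_t value_len;	/* bytes after '=', 0 if there is no value */
};

/* End of the option starting at 'opt': the next '#', else 'end'. */
static const char *find_next_opt(const char *opt, const char *end)
{
	const char *hash = NULL;

	if (end > opt)
		hash = memchr(opt, '#', (size_t)(end - opt));
	return hash ? hash : end;
}

/* Split one option occupying [opt, next_opt) into name and value. */
static struct opt_span split_option(const char *opt, const char *next_opt)
{
	struct opt_span s = { 0, 0, 0 };
	size_t opt_len = (size_t)(next_opt - opt);
	const char *eq = opt_len ? memchr(opt, '=', opt_len) : NULL;

	if (eq) {
		s.name_len = (size_t)(eq - opt);
		s.has_value = 1;
		s.value_len = (size_t)(next_opt - (eq + 1));
	} else {
		/* No value: the name stops at the option boundary. */
		s.name_len = opt_len;
	}
	return s;
}
```

For an option string such as "A" followed by '#' and a long run of NUL bytes (the shape of the reproducer), the name length computed this way is 1 rather than the length of the remaining payload.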
* Re: [PATCH] Bluetooth: hci_bcm: Configure SCO routing automatically
From: Rob Herring @ 2018-06-11 17:54 UTC (permalink / raw)
To: Marcel Holtmann
Cc: Attila Tőkés, David S. Miller, Mark Rutland,
Johan Hedberg, Artiom Vaskov, netdev, devicetree,
linux-kernel@vger.kernel.org, open list:BLUETOOTH DRIVERS
In-Reply-To: <4F0D6729-AE77-47D4-9765-FBC44181D4DE@holtmann.org>
On Mon, Jun 11, 2018 at 9:47 AM, Marcel Holtmann <marcel@holtmann.org> wrote:
> Hi Rob,
>
>>>>> Added support to automatically configure the SCO packet routing at the
>>>>> device setup. The SCO packets are used with the HSP / HFP profiles, but in
>>>>> some devices (ex. CYW43438) they are routed to a PCM output by default. This
>>>>> change allows sending the vendor specific HCI command to configure the SCO
>>>>> routing. The parameters of the command are loaded from the device tree.
>>>>
>>>> Please wrap your commit msg.
>>>
>>>
>>> Sure.
>>>>
>>>>
>>>>>
>>>>> Signed-off-by: Attila Tőkés <attitokes@gmail.com>
>>>>> ---
>>>>> .../bindings/net/broadcom-bluetooth.txt | 7 ++
>>>>
>>>> Please split bindings to separate patch.
>>>
>>>
>>> Ok, I will split this in two.
>>>>
>>>>
>>>>
>>>>
>>>>> drivers/bluetooth/hci_bcm.c | 72 +++++++++++++++++++
>>>>> 2 files changed, 79 insertions(+)
>>>>>
>>>>> diff --git
>>>>> a/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
>>>>> b/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
>>>>> index 4194ff7e..aea3a094 100644
>>>>> --- a/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
>>>>> +++ b/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
>>>>> @@ -21,6 +21,12 @@ Optional properties:
>>>>> - clocks: clock specifier if external clock provided to the controller
>>>>> - clock-names: should be "extclk"
>>>>>
>>>>> + SCO routing parameters:
>>>>> + - sco-routing: 0-3 (PCM, Transport, Codec, I2S)
>>>>> + - pcm-interface-rate: 0-4 (128 Kbps - 2048 Kbps)
>>>>> + - pcm-frame-type: 0 (short), 1 (long)
>>>>> + - pcm-sync-mode: 0 (slave), 1 (master)
>>>>> + - pcm-clock-mode: 0 (slave), 1 (master)
>>>>
>>>> Are these Broadcom specific? Properties need either vendor prefix or
>>>> to be documented in a common location. I think these look like the
>>>> latter.
>>>
>>>
>>> These will be used as parameters of a vendor specific (Broadcom/Cypress)
>>> command configuring the SCO packet routing. See the Write_SCO_PCM_Int_Param
>>> command from: http://www.cypress.com/file/298311/download.
>>
>> The DT should just describe how the h/w is hooked-up. What the s/w has
>> to do based on that is the driver's problem which is certainly
>> vendor/chip specific, but that is all irrelevant to the binding.
>>
>>> What would be the property names with a Broadcom / Cypress vendor prefix?
>>>
>>> brcm,sco-routing
>>> brcm,pcm-interface-rate
>>> brcm,pcm-frame-type
>>> brcm,pcm-sync-mode
>>> brcm,pcm-clock-mode
>>>
>>> ?
>>
>> Yes.
>
> we can do this. However all pcm-* are optional if you switch the HCI transport. And sco-routing should default to HCI if that is not present. Meaning a driver should actively trying to change this. Nevertheless, it would be good if a driver reads the current settings.
>
> In theory we could make sco-routing generic, but so many vendors have different modes, that we better keep this vendor specific.
Even if vendor specific, the properties for not HCI transport case are
still incomplete IMO.
By modes, you mean PCM vs. I2S and all the flavors of timings you can
have within those or something else? For the former, that's often
going to be a process of solving what each end support and if that
doesn't work, then IIRC we already have properties for setting
modes/timing. All the same issues exist with audio codecs and this is
really not any different.
Rob
* [PATCH] net: stmmac: dwmac-meson8b: Fix an error handling path in 'meson8b_dwmac_probe()'
From: Christophe JAILLET @ 2018-06-11 17:52 UTC (permalink / raw)
To: peppe.cavallaro, alexandre.torgue, joabreu, davem, carlo, khilman
Cc: netdev, linux-arm-kernel, linux-amlogic, linux-kernel,
kernel-janitors, Christophe JAILLET
If 'of_device_get_match_data()' fails, we need to release some resources as
done in the other error handling path of this function.
Fixes: efacb568c962 ("net: stmmac: dwmac-meson: extend phy mode setting")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
---
drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
index 4ff231df7322..c5979569fd60 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
@@ -334,9 +334,10 @@ static int meson8b_dwmac_probe(struct platform_device *pdev)
dwmac->data = (const struct meson8b_dwmac_data *)
of_device_get_match_data(&pdev->dev);
- if (!dwmac->data)
- return -EINVAL;
-
+ if (!dwmac->data) {
+ ret = -EINVAL;
+ goto err_remove_config_dt;
+ }
res = platform_get_resource(pdev, IORESOURCE_MEM, 1);
dwmac->regs = devm_ioremap_resource(&pdev->dev, res);
if (IS_ERR(dwmac->regs)) {
--
2.17.0
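The pattern behind this fix can be sketched in plain userspace C: once a resource has been acquired, every later failure must leave through the unwind label instead of returning directly. This is an illustration under stated assumptions, not the driver code. struct probe_ctx, remove_config(), sketch_probe() and the released counter are hypothetical stand-ins for the probe's platform data and its cleanup.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the resources held by the probe. */
struct probe_ctx {
	void *config;		/* plays the role of the parsed platform data */
	const void *data;	/* plays the role of the match data */
};

static int released;		/* counts cleanup calls, for illustration */

static void remove_config(struct probe_ctx *ctx)
{
	free(ctx->config);
	ctx->config = NULL;
	released++;
}

static int sketch_probe(struct probe_ctx *ctx, const void *match_data)
{
	int ret;

	ctx->config = malloc(16);
	if (!ctx->config)
		return -ENOMEM;	/* nothing held yet, a plain return is fine */

	ctx->data = match_data;
	if (!ctx->data) {
		ret = -EINVAL;
		goto err_remove_config;	/* config is held, so unwind it */
	}

	return 0;

err_remove_config:
	remove_config(ctx);
	return ret;
}
```

Calling sketch_probe() with a NULL match_data returns -EINVAL and releases the config exactly once, which mirrors what the patch does for the missing match data case.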
* [jkirsher/next-queue PATCH 7/7] net: allow fallback function to pass netdev
From: Alexander Duyck @ 2018-06-11 17:41 UTC (permalink / raw)
To: netdev, intel-wired-lan, jeffrey.t.kirsher
In-Reply-To: <20180611173003.41352.25621.stgit@ahduyck-green-test.jf.intel.com>
For most of these calls we can just pass NULL through to the fallback
function as the sb_dev. The only cases where we cannot are the cases where
we might be dealing with either an upper device or a driver that would
have configured things to support an sb_dev itself.
The only driver that has any significant change in this patchset should be
ixgbe as we can drop the redundant functionality that existed in both the
ndo_select_queue function and the fallback function that was passed through
to us.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
drivers/net/ethernet/amazon/ena/ena_netdev.c | 2 +-
drivers/net/ethernet/broadcom/bcmsysport.c | 4 ++--
drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 3 ++-
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 2 +-
drivers/net/ethernet/hisilicon/hns/hns_enet.c | 2 +-
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 ++--
drivers/net/ethernet/mellanox/mlx4/en_tx.c | 4 ++--
drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 2 +-
drivers/net/hyperv/netvsc_drv.c | 2 +-
drivers/net/net_failover.c | 2 +-
drivers/net/xen-netback/interface.c | 2 +-
include/linux/netdevice.h | 3 ++-
net/core/dev.c | 12 +++---------
net/packet/af_packet.c | 7 +------
14 files changed, 21 insertions(+), 30 deletions(-)
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index e3befb1..c673ac2 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2224,7 +2224,7 @@ static u16 ena_select_queue(struct net_device *dev, struct sk_buff *skb,
if (skb_rx_queue_recorded(skb))
qid = skb_get_rx_queue(skb);
else
- qid = fallback(dev, skb);
+ qid = fallback(dev, skb, NULL);
return qid;
}
diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c
index 32f548e..eb890c4 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -2116,7 +2116,7 @@ static u16 bcm_sysport_select_queue(struct net_device *dev, struct sk_buff *skb,
unsigned int q, port;
if (!netdev_uses_dsa(dev))
- return fallback(dev, skb);
+ return fallback(dev, skb, NULL);
/* DSA tagging layer will have configured the correct queue */
q = BRCM_TAG_GET_QUEUE(queue);
@@ -2124,7 +2124,7 @@ static u16 bcm_sysport_select_queue(struct net_device *dev, struct sk_buff *skb,
tx_ring = priv->ring_map[q + port * priv->per_port_num_tx_queues];
if (unlikely(!tx_ring))
- return fallback(dev, skb);
+ return fallback(dev, skb, NULL);
return tx_ring->index;
}
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 969dcc9..7a1b99f 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -1928,7 +1928,8 @@ u16 bnx2x_select_queue(struct net_device *dev, struct sk_buff *skb,
}
/* select a non-FCoE queue */
- return fallback(dev, skb) % (BNX2X_NUM_ETH_QUEUES(bp) * bp->max_cos);
+ return fallback(dev, skb, NULL) %
+ (BNX2X_NUM_ETH_QUEUES(bp) * bp->max_cos);
}
void bnx2x_set_num_queues(struct bnx2x *bp)
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 8de3039..380931d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -972,7 +972,7 @@ static u16 cxgb_select_queue(struct net_device *dev, struct sk_buff *skb,
return txq;
}
- return fallback(dev, skb) % dev->real_num_tx_queues;
+ return fallback(dev, skb, NULL) % dev->real_num_tx_queues;
}
static int closest_timer(const struct sge *s, int time)
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index c36a231..8327254 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -2033,7 +2033,7 @@ static void hns_nic_get_stats64(struct net_device *ndev,
is_multicast_ether_addr(eth_hdr->h_dest))
return 0;
else
- return fallback(ndev, skb);
+ return fallback(ndev, skb, NULL);
}
static const struct net_device_ops hns_nic_netdev_ops = {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 5d9867e..eef64d0 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8248,11 +8248,11 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
case htons(ETH_P_FIP):
adapter = netdev_priv(dev);
- if (adapter->flags & IXGBE_FLAG_FCOE_ENABLED)
+ if (!sb_dev && (adapter->flags & IXGBE_FLAG_FCOE_ENABLED))
break;
/* fall through */
default:
- return fallback(dev, skb);
+ return fallback(dev, skb, sb_dev);
}
f = &adapter->ring_feature[RING_F_FCOE];
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index df29966..1857ee0 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -695,9 +695,9 @@ u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb,
u16 rings_p_up = priv->num_tx_rings_p_up;
if (netdev_get_num_tc(dev))
- return fallback(dev, skb);
+ return fallback(dev, skb, NULL);
- return fallback(dev, skb) % rings_p_up;
+ return fallback(dev, skb, NULL) % rings_p_up;
}
static void mlx4_bf_copy(void __iomem *dst, const void *src,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 0119e86..88c0c85 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -115,7 +115,7 @@ u16 mlx5e_select_queue(struct net_device *dev, struct sk_buff *skb,
select_queue_fallback_t fallback)
{
struct mlx5e_priv *priv = netdev_priv(dev);
- int channel_ix = fallback(dev, skb);
+ int channel_ix = fallback(dev, skb, NULL);
u16 num_channels;
int up = 0;
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 0a01572..5bc32e7 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -344,7 +344,7 @@ static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb,
txq = vf_ops->ndo_select_queue(vf_netdev, skb,
sb_dev, fallback);
else
- txq = fallback(vf_netdev, skb);
+ txq = fallback(vf_netdev, skb, NULL);
/* Record the queue selected by VF so that it can be
* used for common case where VF has more queues than
diff --git a/drivers/net/net_failover.c b/drivers/net/net_failover.c
index b2dc2e7..6f3d143 100644
--- a/drivers/net/net_failover.c
+++ b/drivers/net/net_failover.c
@@ -131,7 +131,7 @@ static u16 net_failover_select_queue(struct net_device *dev,
txq = ops->ndo_select_queue(primary_dev, skb,
sb_dev, fallback);
else
- txq = fallback(primary_dev, skb);
+ txq = fallback(primary_dev, skb, NULL);
qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index 19c4c58..92274c2 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -155,7 +155,7 @@ static u16 xenvif_select_queue(struct net_device *dev, struct sk_buff *skb,
unsigned int size = vif->hash.size;
if (vif->hash.alg == XEN_NETIF_CTRL_HASH_ALGORITHM_NONE)
- return fallback(dev, skb) % dev->real_num_tx_queues;
+ return fallback(dev, skb, NULL) % dev->real_num_tx_queues;
xenvif_set_skb_hash(vif, skb);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e366f42..73c2fcf 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -782,7 +782,8 @@ static inline bool netdev_phys_item_id_same(struct netdev_phys_item_id *a,
}
typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
- struct sk_buff *skb);
+ struct sk_buff *skb,
+ struct net_device *sb_dev);
enum tc_setup_type {
TC_SETUP_QDISC_MQPRIO,
diff --git a/net/core/dev.c b/net/core/dev.c
index 514dcec..960d01f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3519,8 +3519,8 @@ u16 dev_pick_tx_cpu_id(struct net_device *dev, struct sk_buff *skb,
}
EXPORT_SYMBOL(dev_pick_tx_cpu_id);
-static u16 ___netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
- struct net_device *sb_dev)
+static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
+ struct net_device *sb_dev)
{
struct sock *sk = skb->sk;
int queue_index = sk_tx_queue_get(sk);
@@ -3545,12 +3545,6 @@ static u16 ___netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
return queue_index;
}
-static u16 __netdev_pick_tx(struct net_device *dev,
- struct sk_buff *skb)
-{
- return ___netdev_pick_tx(dev, skb, NULL);
-}
-
struct netdev_queue *netdev_pick_tx(struct net_device *dev,
struct sk_buff *skb,
struct net_device *sb_dev)
@@ -3571,7 +3565,7 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
queue_index = ops->ndo_select_queue(dev, skb, sb_dev,
__netdev_pick_tx);
else
- queue_index = ___netdev_pick_tx(dev, skb, sb_dev);
+ queue_index = __netdev_pick_tx(dev, skb, sb_dev);
queue_index = netdev_cap_txqueue(dev, queue_index);
}
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 24b6e60..ad7097c 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -275,11 +275,6 @@ static bool packet_use_direct_xmit(const struct packet_sock *po)
return po->xmit == packet_direct_xmit;
}
-static u16 packet_fallback(struct net_device *dev, struct sk_buff *skb)
-{
- return dev_pick_tx_cpu_id(dev, skb, NULL);
-}
-
static u16 packet_pick_tx_queue(struct sk_buff *skb)
{
struct net_device *dev = skb->dev;
@@ -288,7 +283,7 @@ static u16 packet_pick_tx_queue(struct sk_buff *skb)
if (ops->ndo_select_queue) {
queue_index = ops->ndo_select_queue(dev, skb, NULL,
- packet_fallback);
+ dev_pick_tx_cpu_id);
queue_index = netdev_cap_txqueue(dev, queue_index);
} else {
queue_index = dev_pick_tx_cpu_id(dev, skb, NULL);
* [jkirsher/next-queue PATCH 6/7] net: allow ndo_select_queue to pass netdev
From: Alexander Duyck @ 2018-06-11 17:41 UTC (permalink / raw)
To: netdev, intel-wired-lan, jeffrey.t.kirsher
In-Reply-To: <20180611173003.41352.25621.stgit@ahduyck-green-test.jf.intel.com>
This patch makes it so that instead of passing a void pointer as the
accel_priv we pass a net_device pointer as sb_dev. Making this
change allows us to pass the subordinate device through to the fallback
function eventually so that we can keep the actual code in the
ndo_select_queue call as focused as possible on the exception cases.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
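As a rough illustration of what the typed argument buys, the userspace
sketch below models a handler loosely patterned on the ixgbe hunk in this
patch. The structures are cut-down stand-ins, the tc field on the skb is a
simplification (the kernel derives the traffic class from skb->priority),
and demo_scale(), demo_fallback() and demo_select_queue() are hypothetical
names. The fallback keeps its two-argument form, as it does at this point
in the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_TC 16

struct netdev_tc_txq {
	uint16_t count;
	uint16_t offset;
};

struct sk_buff {
	uint32_t hash;
	uint8_t tc;	/* simplification: traffic class stored directly */
};

struct net_device {
	uint16_t real_num_tx_queues;
	struct netdev_tc_txq tc_to_txq[DEMO_MAX_TC];
};

/* Fallback type as it stands at this point in the series: no sb_dev yet. */
typedef uint16_t (*select_queue_fallback_t)(struct net_device *dev,
					    struct sk_buff *skb);

/* Map a 32-bit value onto [0, range) without a division. */
static uint32_t demo_scale(uint32_t val, uint32_t range)
{
	return (uint32_t)(((uint64_t)val * range) >> 32);
}

/* Hypothetical fallback: plain hash over all of dev's queues. */
static uint16_t demo_fallback(struct net_device *dev, struct sk_buff *skb)
{
	assert(dev->real_num_tx_queues > 0);
	return (uint16_t)(skb->hash % dev->real_num_tx_queues);
}

/*
 * With sb_dev typed as struct net_device *, the handler can use it
 * directly; the void *accel_priv version needed a local conversion first.
 */
static uint16_t demo_select_queue(struct net_device *dev, struct sk_buff *skb,
				  struct net_device *sb_dev,
				  select_queue_fallback_t fallback)
{
	if (sb_dev) {
		const struct netdev_tc_txq *map;

		assert(skb->tc < DEMO_MAX_TC);
		map = &sb_dev->tc_to_txq[skb->tc];
		return (uint16_t)(map->offset +
				  demo_scale(skb->hash, map->count));
	}

	return fallback(dev, skb);
}
```

Passing anything other than a net_device pointer as the third argument now
draws a compiler diagnostic, which the void pointer version could not give.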
drivers/infiniband/hw/hfi1/vnic_main.c | 2 +-
drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c | 4 ++--
drivers/net/bonding/bond_main.c | 3 ++-
drivers/net/ethernet/amazon/ena/ena_netdev.c | 3 ++-
drivers/net/ethernet/broadcom/bcmsysport.c | 2 +-
drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 3 ++-
drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h | 3 ++-
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 3 ++-
drivers/net/ethernet/hisilicon/hns/hns_enet.c | 3 ++-
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 7 ++++---
drivers/net/ethernet/mellanox/mlx4/en_tx.c | 3 ++-
drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 3 ++-
drivers/net/ethernet/mellanox/mlx5/core/en.h | 3 ++-
drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 3 ++-
drivers/net/ethernet/renesas/ravb_main.c | 3 ++-
drivers/net/ethernet/sun/ldmvsw.c | 3 ++-
drivers/net/ethernet/sun/sunvnet.c | 3 ++-
drivers/net/hyperv/netvsc_drv.c | 4 ++--
drivers/net/net_failover.c | 5 +++--
drivers/net/team/team.c | 3 ++-
drivers/net/tun.c | 3 ++-
drivers/net/wireless/marvell/mwifiex/main.c | 3 ++-
drivers/net/xen-netback/interface.c | 2 +-
drivers/net/xen-netfront.c | 3 ++-
drivers/staging/rtl8188eu/os_dep/os_intfs.c | 3 ++-
drivers/staging/rtl8723bs/os_dep/os_intfs.c | 7 +++----
include/linux/netdevice.h | 11 +++++++----
net/core/dev.c | 6 ++++--
net/mac80211/iface.c | 4 ++--
net/packet/af_packet.c | 9 +++++++--
30 files changed, 73 insertions(+), 44 deletions(-)
diff --git a/drivers/infiniband/hw/hfi1/vnic_main.c b/drivers/infiniband/hw/hfi1/vnic_main.c
index 5d65582..616fc9b 100644
--- a/drivers/infiniband/hw/hfi1/vnic_main.c
+++ b/drivers/infiniband/hw/hfi1/vnic_main.c
@@ -423,7 +423,7 @@ static netdev_tx_t hfi1_netdev_start_xmit(struct sk_buff *skb,
static u16 hfi1_vnic_select_queue(struct net_device *netdev,
struct sk_buff *skb,
- void *accel_priv,
+ struct net_device *sb_dev,
select_queue_fallback_t fallback)
{
struct hfi1_vnic_vport_info *vinfo = opa_vnic_dev_priv(netdev);
diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c b/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c
index 0c8aec6..6155878 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c
@@ -95,7 +95,7 @@ static netdev_tx_t opa_netdev_start_xmit(struct sk_buff *skb,
}
static u16 opa_vnic_select_queue(struct net_device *netdev, struct sk_buff *skb,
- void *accel_priv,
+ struct net_device *sb_dev,
select_queue_fallback_t fallback)
{
struct opa_vnic_adapter *adapter = opa_vnic_priv(netdev);
@@ -107,7 +107,7 @@ static u16 opa_vnic_select_queue(struct net_device *netdev, struct sk_buff *skb,
mdata->entropy = opa_vnic_calc_entropy(skb);
mdata->vl = opa_vnic_get_vl(adapter, skb);
rc = adapter->rn_ops->ndo_select_queue(netdev, skb,
- accel_priv, fallback);
+ sb_dev, fallback);
skb_pull(skb, sizeof(*mdata));
return rc;
}
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index bd53a71..e33f689 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4094,7 +4094,8 @@ static inline int bond_slave_override(struct bonding *bond,
static u16 bond_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback)
+ struct net_device *sb_dev,
+ select_queue_fallback_t fallback)
{
/* This helper function exists to help dev_pick_tx get the correct
* destination queue. Using a helper function skips a call to
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index f2af87d..e3befb1 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2213,7 +2213,8 @@ static void ena_netpoll(struct net_device *netdev)
#endif /* CONFIG_NET_POLL_CONTROLLER */
static u16 ena_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback)
+ struct net_device *sb_dev,
+ select_queue_fallback_t fallback)
{
u16 qid;
/* we suspect that this is good for in--kernel network services that
diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c
index d5fca2e..32f548e 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -2107,7 +2107,7 @@ static int bcm_sysport_stop(struct net_device *dev)
};
static u16 bcm_sysport_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv,
+ struct net_device *sb_dev,
select_queue_fallback_t fallback)
{
struct bcm_sysport_priv *priv = netdev_priv(dev);
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 8cd73ff..969dcc9 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -1905,7 +1905,8 @@ void bnx2x_netif_stop(struct bnx2x *bp, int disable_hw)
}
u16 bnx2x_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback)
+ struct net_device *sb_dev,
+ select_queue_fallback_t fallback)
{
struct bnx2x *bp = netdev_priv(dev);
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
index a8ce5c5..0e508e5 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
@@ -497,7 +497,8 @@ int bnx2x_set_vf_vlan(struct net_device *netdev, int vf, u16 vlan, u8 qos,
/* select_queue callback */
u16 bnx2x_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback);
+ struct net_device *sb_dev,
+ select_queue_fallback_t fallback);
static inline void bnx2x_update_rx_prod(struct bnx2x *bp,
struct bnx2x_fastpath *fp,
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 35cb3ae..8de3039 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -929,7 +929,8 @@ static int setup_sge_queues(struct adapter *adap)
}
static u16 cxgb_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback)
+ struct net_device *sb_dev,
+ select_queue_fallback_t fallback)
{
int txq;
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index 1ccb644..c36a231 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -2022,7 +2022,8 @@ static void hns_nic_get_stats64(struct net_device *ndev,
static u16
hns_nic_select_queue(struct net_device *ndev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback)
+ struct net_device *sb_dev,
+ select_queue_fallback_t fallback)
{
struct ethhdr *eth_hdr = (struct ethhdr *)skb->data;
struct hns_nic_priv *priv = netdev_priv(ndev);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 053a54c..5d9867e 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8221,15 +8221,16 @@ static void ixgbe_atr(struct ixgbe_ring *ring,
#ifdef IXGBE_FCOE
static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback)
+ struct net_device *sb_dev,
+ select_queue_fallback_t fallback)
{
struct ixgbe_adapter *adapter;
struct ixgbe_ring_feature *f;
int txq;
- if (accel_priv) {
+ if (sb_dev) {
u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
- struct net_device *vdev = accel_priv;
+ struct net_device *vdev = sb_dev;
txq = vdev->tc_to_txq[tc].offset;
txq += reciprocal_scale(skb_get_hash(skb),
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 0227786..df29966 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -688,7 +688,8 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc,
}
u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback)
+ struct net_device *sb_dev,
+ select_queue_fallback_t fallback)
{
struct mlx4_en_priv *priv = netdev_priv(dev);
u16 rings_p_up = priv->num_tx_rings_p_up;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index ace6545..c3228b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -699,7 +699,8 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
void mlx4_en_tx_irq(struct mlx4_cq *mcq);
u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback);
+ struct net_device *sb_dev,
+ select_queue_fallback_t fallback);
netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev);
netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
struct mlx4_en_rx_alloc *frame,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index eb9eb7a..df2d1e8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -866,7 +866,8 @@ struct mlx5e_profile {
void mlx5e_build_ptys2ethtool_map(void);
u16 mlx5e_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback);
+ struct net_device *sb_dev,
+ select_queue_fallback_t fallback);
netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev);
netdev_tx_t mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
struct mlx5e_tx_wqe *wqe, u16 pi);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f29deb4..0119e86 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -111,7 +111,8 @@ static inline int mlx5e_get_dscp_up(struct mlx5e_priv *priv, struct sk_buff *skb
#endif
u16 mlx5e_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback)
+ struct net_device *sb_dev,
+ select_queue_fallback_t fallback)
{
struct mlx5e_priv *priv = netdev_priv(dev);
int channel_ix = fallback(dev, skb);
diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index 68f1221..4a7f54c 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -1656,7 +1656,8 @@ static netdev_tx_t ravb_start_xmit(struct sk_buff *skb, struct net_device *ndev)
}
static u16 ravb_select_queue(struct net_device *ndev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback)
+ struct net_device *sb_dev,
+ select_queue_fallback_t fallback)
{
/* If skb needs TX timestamp, it is handled in network control queue */
return (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) ? RAVB_NC :
diff --git a/drivers/net/ethernet/sun/ldmvsw.c b/drivers/net/ethernet/sun/ldmvsw.c
index a5dd627..d42f47f 100644
--- a/drivers/net/ethernet/sun/ldmvsw.c
+++ b/drivers/net/ethernet/sun/ldmvsw.c
@@ -101,7 +101,8 @@ static struct vnet_port *vsw_tx_port_find(struct sk_buff *skb,
}
static u16 vsw_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback)
+ struct net_device *sb_dev,
+ select_queue_fallback_t fallback)
{
struct vnet_port *port = netdev_priv(dev);
diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
index a94f504..12539b3 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -234,7 +234,8 @@ static struct vnet_port *vnet_tx_port_find(struct sk_buff *skb,
}
static u16 vnet_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback)
+ struct net_device *sb_dev,
+ select_queue_fallback_t fallback)
{
struct vnet *vp = netdev_priv(dev);
struct vnet_port *port = __tx_port_find(vp, skb);
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 7b18a8c..0a01572 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -328,7 +328,7 @@ static u16 netvsc_pick_tx(struct net_device *ndev, struct sk_buff *skb)
}
static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb,
- void *accel_priv,
+ struct net_device *sb_dev,
select_queue_fallback_t fallback)
{
struct net_device_context *ndc = netdev_priv(ndev);
@@ -342,7 +342,7 @@ static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb,
if (vf_ops->ndo_select_queue)
txq = vf_ops->ndo_select_queue(vf_netdev, skb,
- accel_priv, fallback);
+ sb_dev, fallback);
else
txq = fallback(vf_netdev, skb);
diff --git a/drivers/net/net_failover.c b/drivers/net/net_failover.c
index 83f7420..b2dc2e7 100644
--- a/drivers/net/net_failover.c
+++ b/drivers/net/net_failover.c
@@ -115,7 +115,8 @@ static netdev_tx_t net_failover_start_xmit(struct sk_buff *skb,
}
static u16 net_failover_select_queue(struct net_device *dev,
- struct sk_buff *skb, void *accel_priv,
+ struct sk_buff *skb,
+ struct net_device *sb_dev,
select_queue_fallback_t fallback)
{
struct net_failover_info *nfo_info = netdev_priv(dev);
@@ -128,7 +129,7 @@ static u16 net_failover_select_queue(struct net_device *dev,
if (ops->ndo_select_queue)
txq = ops->ndo_select_queue(primary_dev, skb,
- accel_priv, fallback);
+ sb_dev, fallback);
else
txq = fallback(primary_dev, skb);
diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index 8863fa0..b704051 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -1706,7 +1706,8 @@ static netdev_tx_t team_xmit(struct sk_buff *skb, struct net_device *dev)
}
static u16 team_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback)
+ struct net_device *sb_dev,
+ select_queue_fallback_t fallback)
{
/*
* This helper function exists to help dev_pick_tx get the correct
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a192a01..76f0f41 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -607,7 +607,8 @@ static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
}
static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback)
+ struct net_device *sb_dev,
+ select_queue_fallback_t fallback)
{
struct tun_struct *tun = netdev_priv(dev);
u16 ret;
diff --git a/drivers/net/wireless/marvell/mwifiex/main.c b/drivers/net/wireless/marvell/mwifiex/main.c
index 510f6b8..fa3e8dd 100644
--- a/drivers/net/wireless/marvell/mwifiex/main.c
+++ b/drivers/net/wireless/marvell/mwifiex/main.c
@@ -1279,7 +1279,8 @@ static struct net_device_stats *mwifiex_get_stats(struct net_device *dev)
static u16
mwifiex_netdev_select_wmm_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback)
+ struct net_device *sb_dev,
+ select_queue_fallback_t fallback)
{
skb->priority = cfg80211_classify8021d(skb, NULL);
return mwifiex_1d_to_wmm_queue[skb->priority];
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index 78ebe49..19c4c58 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -148,7 +148,7 @@ void xenvif_wake_queue(struct xenvif_queue *queue)
}
static u16 xenvif_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv,
+ struct net_device *sb_dev,
select_queue_fallback_t fallback)
{
struct xenvif *vif = netdev_priv(dev);
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 679da1a..3c21a8f 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -545,7 +545,8 @@ static int xennet_count_skb_slots(struct sk_buff *skb)
}
static u16 xennet_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback)
+ struct net_device *sb_dev,
+ select_queue_fallback_t fallback)
{
unsigned int num_queues = dev->real_num_tx_queues;
u32 hash;
diff --git a/drivers/staging/rtl8188eu/os_dep/os_intfs.c b/drivers/staging/rtl8188eu/os_dep/os_intfs.c
index add1ba0..38e85c8 100644
--- a/drivers/staging/rtl8188eu/os_dep/os_intfs.c
+++ b/drivers/staging/rtl8188eu/os_dep/os_intfs.c
@@ -253,7 +253,8 @@ static unsigned int rtw_classify8021d(struct sk_buff *skb)
}
static u16 rtw_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback)
+ struct net_device *sb_dev,
+ select_queue_fallback_t fallback)
{
struct adapter *padapter = rtw_netdev_priv(dev);
struct mlme_priv *pmlmepriv = &padapter->mlmepriv;
diff --git a/drivers/staging/rtl8723bs/os_dep/os_intfs.c b/drivers/staging/rtl8723bs/os_dep/os_intfs.c
index ace68f0..1816423 100644
--- a/drivers/staging/rtl8723bs/os_dep/os_intfs.c
+++ b/drivers/staging/rtl8723bs/os_dep/os_intfs.c
@@ -403,10 +403,9 @@ static unsigned int rtw_classify8021d(struct sk_buff *skb)
}
-static u16 rtw_select_queue(struct net_device *dev, struct sk_buff *skb
- , void *accel_priv
- , select_queue_fallback_t fallback
-)
+static u16 rtw_select_queue(struct net_device *dev, struct sk_buff *skb,
+ struct net_device *sb_dev,
+ select_queue_fallback_t fallback)
{
struct adapter *padapter = rtw_netdev_priv(dev);
struct mlme_priv *pmlmepriv = &padapter->mlmepriv;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f277149..e366f42 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -945,7 +945,8 @@ struct dev_ifalias {
* those the driver believes to be appropriate.
*
* u16 (*ndo_select_queue)(struct net_device *dev, struct sk_buff *skb,
- * void *accel_priv, select_queue_fallback_t fallback);
+ * struct net_device *sb_dev,
+ * select_queue_fallback_t fallback);
* Called to decide which queue to use when device supports multiple
* transmit queues.
*
@@ -1217,7 +1218,7 @@ struct net_device_ops {
netdev_features_t features);
u16 (*ndo_select_queue)(struct net_device *dev,
struct sk_buff *skb,
- void *accel_priv,
+ struct net_device *sb_dev,
select_queue_fallback_t fallback);
void (*ndo_change_rx_flags)(struct net_device *dev,
int flags);
@@ -2551,8 +2552,10 @@ struct net_device *__dev_get_by_flags(struct net *net, unsigned short flags,
void dev_close_many(struct list_head *head, bool unlink);
void dev_disable_lro(struct net_device *dev);
int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *newskb);
-u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb);
-u16 dev_pick_tx_cpu_id(struct net_device *dev, struct sk_buff *skb);
+u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb,
+ struct net_device *sb_dev);
+u16 dev_pick_tx_cpu_id(struct net_device *dev, struct sk_buff *skb,
+ struct net_device *sb_dev);
int dev_queue_xmit(struct sk_buff *skb);
int dev_queue_xmit_accel(struct sk_buff *skb, struct net_device *sb_dev);
int dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
diff --git a/net/core/dev.c b/net/core/dev.c
index d746fdd..514dcec 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3505,13 +3505,15 @@ static inline int get_xps_queue(struct net_device *dev,
#endif
}
-u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb)
+u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb,
+ struct net_device *sb_dev)
{
return 0;
}
EXPORT_SYMBOL(dev_pick_tx_zero);
-u16 dev_pick_tx_cpu_id(struct net_device *dev, struct sk_buff *skb)
+u16 dev_pick_tx_cpu_id(struct net_device *dev, struct sk_buff *skb,
+ struct net_device *sb_dev)
{
return (u16)raw_smp_processor_id() % dev->real_num_tx_queues;
}
diff --git a/net/mac80211/iface.c b/net/mac80211/iface.c
index 555e389..5e6cf2c 100644
--- a/net/mac80211/iface.c
+++ b/net/mac80211/iface.c
@@ -1130,7 +1130,7 @@ static void ieee80211_uninit(struct net_device *dev)
static u16 ieee80211_netdev_select_queue(struct net_device *dev,
struct sk_buff *skb,
- void *accel_priv,
+ struct net_device *sb_dev,
select_queue_fallback_t fallback)
{
return ieee80211_select_queue(IEEE80211_DEV_TO_SUB_IF(dev), skb);
@@ -1176,7 +1176,7 @@ static u16 ieee80211_netdev_select_queue(struct net_device *dev,
static u16 ieee80211_monitor_select_queue(struct net_device *dev,
struct sk_buff *skb,
- void *accel_priv,
+ struct net_device *sb_dev,
select_queue_fallback_t fallback)
{
struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev);
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 015a18d..24b6e60 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -275,6 +275,11 @@ static bool packet_use_direct_xmit(const struct packet_sock *po)
return po->xmit == packet_direct_xmit;
}
+static u16 packet_fallback(struct net_device *dev, struct sk_buff *skb)
+{
+ return dev_pick_tx_cpu_id(dev, skb, NULL);
+}
+
static u16 packet_pick_tx_queue(struct sk_buff *skb)
{
struct net_device *dev = skb->dev;
@@ -283,10 +288,10 @@ static u16 packet_pick_tx_queue(struct sk_buff *skb)
if (ops->ndo_select_queue) {
queue_index = ops->ndo_select_queue(dev, skb, NULL,
- dev_pick_tx_cpu_id);
+ packet_fallback);
queue_index = netdev_cap_txqueue(dev, queue_index);
} else {
- queue_index = dev_pick_tx_cpu_id(dev, skb);
+ queue_index = dev_pick_tx_cpu_id(dev, skb, NULL);
}
return queue_index;
* [jkirsher/next-queue PATCH 5/7] net: Add generic ndo_select_queue functions
From: Alexander Duyck @ 2018-06-11 17:41 UTC (permalink / raw)
To: netdev, intel-wired-lan, jeffrey.t.kirsher
In-Reply-To: <20180611173003.41352.25621.stgit@ahduyck-green-test.jf.intel.com>
This patch adds a generic version of the ndo_select_queue functions for
either returning 0 or selecting a queue based on the processor ID. This is
generally meant to just reduce the number of functions we have to change
in the future when we have to deal with ndo_select_queue changes.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
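The idea can be sketched in userspace as follows: two shared helpers take
the place of the per-driver copies, and each driver's ops table simply
points at one of them. Everything here is hypothetical and simplified. The
callback type is reduced to two arguments for brevity and is not the
kernel's ndo_select_queue signature, demo_current_cpu stands in for
raw_smp_processor_id(), and the demo_* names do not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel structures. */
struct sk_buff {
	int unused;
};

struct net_device {
	uint16_t real_num_tx_queues;
};

/* Reduced, hypothetical callback type; not the kernel's full signature. */
typedef uint16_t (*demo_select_queue_t)(struct net_device *dev,
					struct sk_buff *skb);

struct demo_netdev_ops {
	demo_select_queue_t select_queue;
};

/* Stand-in for raw_smp_processor_id(); set by the caller in this sketch. */
static unsigned int demo_current_cpu;

/* Shared helper for drivers that only ever use the first queue. */
static uint16_t demo_pick_tx_zero(struct net_device *dev, struct sk_buff *skb)
{
	(void)dev;
	(void)skb;
	return 0;
}

/* Shared helper that picks a queue from the current CPU number. */
static uint16_t demo_pick_tx_cpu_id(struct net_device *dev,
				    struct sk_buff *skb)
{
	(void)skb;
	assert(dev->real_num_tx_queues > 0);
	return (uint16_t)demo_current_cpu % dev->real_num_tx_queues;
}

/* Drivers reference a shared helper instead of defining their own copy. */
static const struct demo_netdev_ops single_queue_ops = {
	.select_queue = demo_pick_tx_zero,
};

static const struct demo_netdev_ops per_cpu_ops = {
	.select_queue = demo_pick_tx_cpu_id,
};
```

With the helpers shared, a later change to the callback signature touches
two functions rather than one per driver.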
drivers/net/ethernet/lantiq_etop.c | 10 +---------
drivers/net/ethernet/ti/netcp_core.c | 9 +--------
drivers/staging/netlogic/xlr_net.c | 9 +--------
include/linux/netdevice.h | 2 ++
net/core/dev.c | 12 ++++++++++++
net/packet/af_packet.c | 9 ++-------
6 files changed, 19 insertions(+), 32 deletions(-)
diff --git a/drivers/net/ethernet/lantiq_etop.c b/drivers/net/ethernet/lantiq_etop.c
index afc8100..7a637b5 100644
--- a/drivers/net/ethernet/lantiq_etop.c
+++ b/drivers/net/ethernet/lantiq_etop.c
@@ -563,14 +563,6 @@ struct ltq_etop_priv {
spin_unlock_irqrestore(&priv->lock, flags);
}
-static u16
-ltq_etop_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback)
-{
- /* we are currently only using the first queue */
- return 0;
-}
-
static int
ltq_etop_init(struct net_device *dev)
{
@@ -641,7 +633,7 @@ struct ltq_etop_priv {
.ndo_set_mac_address = ltq_etop_set_mac_address,
.ndo_validate_addr = eth_validate_addr,
.ndo_set_rx_mode = ltq_etop_set_multicast_list,
- .ndo_select_queue = ltq_etop_select_queue,
+ .ndo_select_queue = dev_pick_tx_zero,
.ndo_init = ltq_etop_init,
.ndo_tx_timeout = ltq_etop_tx_timeout,
};
diff --git a/drivers/net/ethernet/ti/netcp_core.c b/drivers/net/ethernet/ti/netcp_core.c
index e40aa3e..2c455bd 100644
--- a/drivers/net/ethernet/ti/netcp_core.c
+++ b/drivers/net/ethernet/ti/netcp_core.c
@@ -1889,13 +1889,6 @@ static int netcp_rx_kill_vid(struct net_device *ndev, __be16 proto, u16 vid)
return err;
}
-static u16 netcp_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv,
- select_queue_fallback_t fallback)
-{
- return 0;
-}
-
static int netcp_setup_tc(struct net_device *dev, enum tc_setup_type type,
void *type_data)
{
@@ -1972,7 +1965,7 @@ static int netcp_setup_tc(struct net_device *dev, enum tc_setup_type type,
.ndo_vlan_rx_add_vid = netcp_rx_add_vid,
.ndo_vlan_rx_kill_vid = netcp_rx_kill_vid,
.ndo_tx_timeout = netcp_ndo_tx_timeout,
- .ndo_select_queue = netcp_select_queue,
+ .ndo_select_queue = dev_pick_tx_zero,
.ndo_setup_tc = netcp_setup_tc,
};
diff --git a/drivers/staging/netlogic/xlr_net.c b/drivers/staging/netlogic/xlr_net.c
index e461168..4e6611e 100644
--- a/drivers/staging/netlogic/xlr_net.c
+++ b/drivers/staging/netlogic/xlr_net.c
@@ -290,13 +290,6 @@ static netdev_tx_t xlr_net_start_xmit(struct sk_buff *skb,
return NETDEV_TX_OK;
}
-static u16 xlr_net_select_queue(struct net_device *ndev, struct sk_buff *skb,
- void *accel_priv,
- select_queue_fallback_t fallback)
-{
- return (u16)smp_processor_id();
-}
-
static void xlr_hw_set_mac_addr(struct net_device *ndev)
{
struct xlr_net_priv *priv = netdev_priv(ndev);
@@ -403,7 +396,7 @@ static void xlr_stats(struct net_device *ndev, struct rtnl_link_stats64 *stats)
.ndo_open = xlr_net_open,
.ndo_stop = xlr_net_stop,
.ndo_start_xmit = xlr_net_start_xmit,
- .ndo_select_queue = xlr_net_select_queue,
+ .ndo_select_queue = dev_pick_tx_cpu_id,
.ndo_set_mac_address = xlr_net_set_mac_addr,
.ndo_set_rx_mode = xlr_set_rx_mode,
.ndo_get_stats64 = xlr_stats,
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 91b3ca9..f277149 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2551,6 +2551,8 @@ struct net_device *__dev_get_by_flags(struct net *net, unsigned short flags,
void dev_close_many(struct list_head *head, bool unlink);
void dev_disable_lro(struct net_device *dev);
int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *newskb);
+u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb);
+u16 dev_pick_tx_cpu_id(struct net_device *dev, struct sk_buff *skb);
int dev_queue_xmit(struct sk_buff *skb);
int dev_queue_xmit_accel(struct sk_buff *skb, struct net_device *sb_dev);
int dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
diff --git a/net/core/dev.c b/net/core/dev.c
index 2249294..d746fdd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3505,6 +3505,18 @@ static inline int get_xps_queue(struct net_device *dev,
#endif
}
+u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb)
+{
+ return 0;
+}
+EXPORT_SYMBOL(dev_pick_tx_zero);
+
+u16 dev_pick_tx_cpu_id(struct net_device *dev, struct sk_buff *skb)
+{
+ return (u16)raw_smp_processor_id() % dev->real_num_tx_queues;
+}
+EXPORT_SYMBOL(dev_pick_tx_cpu_id);
+
static u16 ___netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
struct net_device *sb_dev)
{
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index ee01856..015a18d 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -275,11 +275,6 @@ static bool packet_use_direct_xmit(const struct packet_sock *po)
return po->xmit == packet_direct_xmit;
}
-static u16 __packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
-{
- return (u16) raw_smp_processor_id() % dev->real_num_tx_queues;
-}
-
static u16 packet_pick_tx_queue(struct sk_buff *skb)
{
struct net_device *dev = skb->dev;
@@ -288,10 +283,10 @@ static u16 packet_pick_tx_queue(struct sk_buff *skb)
if (ops->ndo_select_queue) {
queue_index = ops->ndo_select_queue(dev, skb, NULL,
- __packet_pick_tx_queue);
+ dev_pick_tx_cpu_id);
queue_index = netdev_cap_txqueue(dev, queue_index);
} else {
- queue_index = __packet_pick_tx_queue(dev, skb);
+ queue_index = dev_pick_tx_cpu_id(dev, skb);
}
return queue_index;
* [jkirsher/next-queue PATCH 4/7] net: Add support for subordinate traffic classes to netdev_pick_tx
From: Alexander Duyck @ 2018-06-11 17:41 UTC (permalink / raw)
To: netdev, intel-wired-lan, jeffrey.t.kirsher
In-Reply-To: <20180611173003.41352.25621.stgit@ahduyck-green-test.jf.intel.com>
This change makes it so that we can support the concept of subordinate
device traffic classes to the core networking code. In doing this we can
start pulling out the driver specific bits needed to support selecting a
queue based on an upper device.
The solution as it currently stands is only partially implemented. I have
the start of some XPS bits in here, but I would still need to allow for
configuration of the XPS maps on the queues reserved for the subordinate
devices. For now I am using the reference to the sb_dev XPS map just as a
way to skip the lookup of the lower device XPS map, as that would result
in the wrong queue being picked.
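A minimal userspace sketch of the hash path in skb_tx_hash() once the
subordinate device's tc_to_txq entry supplies the queue range. The
sketch_* names and the reduced struct are assumptions for illustration;
the scaling arithmetic follows the kernel's reciprocal_scale(), and the
rx-queue-recorded branch is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one tc_to_txq entry of the sb_dev. */
struct sketch_tc_txq {
	uint16_t count;		/* number of queues in this class */
	uint16_t offset;	/* first queue of this class on the lower dev */
};

/* Map a 32-bit value onto [0, ep_ro) without a division. */
static uint32_t sketch_reciprocal_scale(uint32_t val, uint32_t ep_ro)
{
	return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
}

/* Pick a queue inside the range that the sb_dev entry describes. */
static uint16_t sketch_tx_hash(const struct sketch_tc_txq *map, uint32_t hash)
{
	assert(map->count > 0);
	return (uint16_t)(sketch_reciprocal_scale(hash, map->count) +
			  map->offset);
}
```

For an entry with count 4 and offset 8 the result always lands in queues
8 through 11, which is why the lookup has to use the sb_dev entry rather
than the lower device's own map.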
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 19 +++-----
drivers/net/macvlan.c | 10 +---
include/linux/netdevice.h | 4 +-
net/core/dev.c | 57 +++++++++++++++----------
4 files changed, 45 insertions(+), 45 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 6e27848..053a54c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8219,20 +8219,17 @@ static void ixgbe_atr(struct ixgbe_ring *ring,
input, common, ring->queue_index);
}
+#ifdef IXGBE_FCOE
static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
void *accel_priv, select_queue_fallback_t fallback)
{
- struct ixgbe_fwd_adapter *fwd_adapter = accel_priv;
-#ifdef IXGBE_FCOE
struct ixgbe_adapter *adapter;
struct ixgbe_ring_feature *f;
-#endif
int txq;
- if (fwd_adapter) {
- u8 tc = netdev_get_num_tc(dev) ?
- netdev_get_prio_tc_map(dev, skb->priority) : 0;
- struct net_device *vdev = fwd_adapter->netdev;
+ if (accel_priv) {
+ u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+ struct net_device *vdev = accel_priv;
txq = vdev->tc_to_txq[tc].offset;
txq += reciprocal_scale(skb_get_hash(skb),
@@ -8241,8 +8238,6 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
return txq;
}
-#ifdef IXGBE_FCOE
-
/*
* only execute the code below if protocol is FCoE
* or FIP and we have FCoE enabled on the adapter
@@ -8268,11 +8263,9 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
txq -= f->indices;
return txq + f->offset;
-#else
- return fallback(dev, skb);
-#endif
}
+#endif
static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
struct xdp_frame *xdpf)
{
@@ -10076,7 +10069,6 @@ static int ixgbe_xdp_xmit(struct net_device *dev, int n,
.ndo_open = ixgbe_open,
.ndo_stop = ixgbe_close,
.ndo_start_xmit = ixgbe_xmit_frame,
- .ndo_select_queue = ixgbe_select_queue,
.ndo_set_rx_mode = ixgbe_set_rx_mode,
.ndo_validate_addr = eth_validate_addr,
.ndo_set_mac_address = ixgbe_set_mac,
@@ -10099,6 +10091,7 @@ static int ixgbe_xdp_xmit(struct net_device *dev, int n,
.ndo_poll_controller = ixgbe_netpoll,
#endif
#ifdef IXGBE_FCOE
+ .ndo_select_queue = ixgbe_select_queue,
.ndo_fcoe_ddp_setup = ixgbe_fcoe_ddp_get,
.ndo_fcoe_ddp_target = ixgbe_fcoe_ddp_target,
.ndo_fcoe_ddp_done = ixgbe_fcoe_ddp_put,
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index adde8fc..401e1d1 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -514,7 +514,6 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
const struct macvlan_dev *vlan = netdev_priv(dev);
const struct macvlan_port *port = vlan->port;
const struct macvlan_dev *dest;
- void *accel_priv = NULL;
if (vlan->mode == MACVLAN_MODE_BRIDGE) {
const struct ethhdr *eth = (void *)skb->data;
@@ -533,15 +532,10 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
return NET_XMIT_SUCCESS;
}
}
-
- /* For packets that are non-multicast and not bridged we will pass
- * the necessary information so that the lowerdev can distinguish
- * the source of the packets via the accel_priv value.
- */
- accel_priv = vlan->accel_priv;
xmit_world:
skb->dev = vlan->lowerdev;
- return dev_queue_xmit_accel(skb, accel_priv);
+ return dev_queue_xmit_accel(skb,
+ netdev_get_sb_channel(dev) ? dev : NULL);
}
static inline netdev_tx_t macvlan_netpoll_send_skb(struct macvlan_dev *vlan, struct sk_buff *skb)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 41b4660..91b3ca9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2090,7 +2090,7 @@ static inline void netdev_for_each_tx_queue(struct net_device *dev,
struct netdev_queue *netdev_pick_tx(struct net_device *dev,
struct sk_buff *skb,
- void *accel_priv);
+ struct net_device *sb_dev);
/* returns the headroom that the master device needs to take in account
* when forwarding to this dev
@@ -2552,7 +2552,7 @@ struct net_device *__dev_get_by_flags(struct net *net, unsigned short flags,
void dev_disable_lro(struct net_device *dev);
int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *newskb);
int dev_queue_xmit(struct sk_buff *skb);
-int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv);
+int dev_queue_xmit_accel(struct sk_buff *skb, struct net_device *sb_dev);
int dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
int register_netdevice(struct net_device *dev);
void unregister_netdevice_queue(struct net_device *dev, struct list_head *head);
diff --git a/net/core/dev.c b/net/core/dev.c
index 27fe4f2..2249294 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2704,24 +2704,26 @@ void netif_device_attach(struct net_device *dev)
* Returns a Tx hash based on the given packet descriptor a Tx queues' number
* to be used as a distribution range.
*/
-static u16 skb_tx_hash(const struct net_device *dev, struct sk_buff *skb)
+static u16 skb_tx_hash(const struct net_device *dev,
+ const struct net_device *sb_dev,
+ struct sk_buff *skb)
{
u32 hash;
u16 qoffset = 0;
u16 qcount = dev->real_num_tx_queues;
+ if (dev->num_tc) {
+ u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+
+ qoffset = sb_dev->tc_to_txq[tc].offset;
+ qcount = sb_dev->tc_to_txq[tc].count;
+ }
+
if (skb_rx_queue_recorded(skb)) {
hash = skb_get_rx_queue(skb);
while (unlikely(hash >= qcount))
hash -= qcount;
- return hash;
- }
-
- if (dev->num_tc) {
- u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
-
- qoffset = dev->tc_to_txq[tc].offset;
- qcount = dev->tc_to_txq[tc].count;
+ return hash + qoffset;
}
return (u16) reciprocal_scale(skb_get_hash(skb), qcount) + qoffset;
@@ -3465,7 +3467,9 @@ int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
}
#endif /* CONFIG_NET_EGRESS */
-static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
+static inline int get_xps_queue(struct net_device *dev,
+ struct net_device *sb_dev,
+ struct sk_buff *skb)
{
#ifdef CONFIG_XPS
struct xps_dev_maps *dev_maps;
@@ -3473,7 +3477,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
int queue_index = -1;
rcu_read_lock();
- dev_maps = rcu_dereference(dev->xps_maps);
+ dev_maps = rcu_dereference(sb_dev->xps_maps);
if (dev_maps) {
unsigned int tci = skb->sender_cpu - 1;
@@ -3501,17 +3505,20 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
#endif
}
-static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
+static u16 ___netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
+ struct net_device *sb_dev)
{
struct sock *sk = skb->sk;
int queue_index = sk_tx_queue_get(sk);
+ sb_dev = sb_dev ? : dev;
+
if (queue_index < 0 || skb->ooo_okay ||
queue_index >= dev->real_num_tx_queues) {
- int new_index = get_xps_queue(dev, skb);
+ int new_index = get_xps_queue(dev, sb_dev, skb);
if (new_index < 0)
- new_index = skb_tx_hash(dev, skb);
+ new_index = skb_tx_hash(dev, sb_dev, skb);
if (queue_index != new_index && sk &&
sk_fullsock(sk) &&
@@ -3524,9 +3531,15 @@ static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
return queue_index;
}
+static u16 __netdev_pick_tx(struct net_device *dev,
+ struct sk_buff *skb)
+{
+ return ___netdev_pick_tx(dev, skb, NULL);
+}
+
struct netdev_queue *netdev_pick_tx(struct net_device *dev,
struct sk_buff *skb,
- void *accel_priv)
+ struct net_device *sb_dev)
{
int queue_index = 0;
@@ -3541,10 +3554,10 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
const struct net_device_ops *ops = dev->netdev_ops;
if (ops->ndo_select_queue)
- queue_index = ops->ndo_select_queue(dev, skb, accel_priv,
+ queue_index = ops->ndo_select_queue(dev, skb, sb_dev,
__netdev_pick_tx);
else
- queue_index = __netdev_pick_tx(dev, skb);
+ queue_index = ___netdev_pick_tx(dev, skb, sb_dev);
queue_index = netdev_cap_txqueue(dev, queue_index);
}
@@ -3556,7 +3569,7 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
/**
* __dev_queue_xmit - transmit a buffer
* @skb: buffer to transmit
- * @accel_priv: private data used for L2 forwarding offload
+ * @sb_dev: subordinate device used for L2 forwarding offload
*
* Queue a buffer for transmission to a network device. The caller must
* have set the device and priority and built the buffer before calling
@@ -3579,7 +3592,7 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
* the BH enable code must have IRQs enabled so that it will not deadlock.
* --BLG
*/
-static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
+static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
{
struct net_device *dev = skb->dev;
struct netdev_queue *txq;
@@ -3618,7 +3631,7 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
else
skb_dst_force(skb);
- txq = netdev_pick_tx(dev, skb, accel_priv);
+ txq = netdev_pick_tx(dev, skb, sb_dev);
q = rcu_dereference_bh(txq->qdisc);
trace_net_dev_queue(skb);
@@ -3692,9 +3705,9 @@ int dev_queue_xmit(struct sk_buff *skb)
}
EXPORT_SYMBOL(dev_queue_xmit);
-int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv)
+int dev_queue_xmit_accel(struct sk_buff *skb, struct net_device *sb_dev)
{
- return __dev_queue_xmit(skb, accel_priv);
+ return __dev_queue_xmit(skb, sb_dev);
}
EXPORT_SYMBOL(dev_queue_xmit_accel);
* [jkirsher/next-queue PATCH 3/7] ixgbe: Add code to populate and use macvlan tc to Tx queue map
From: Alexander Duyck @ 2018-06-11 17:41 UTC (permalink / raw)
To: netdev, intel-wired-lan, jeffrey.t.kirsher
In-Reply-To: <20180611173003.41352.25621.stgit@ahduyck-green-test.jf.intel.com>
This patch makes it so that we use the tc_to_txq mapping in the macvlan
device in order to select the Tx queue for outgoing packets.
The idea here is to try and move away from using ixgbe_select_queue and to
come up with a generic way to make this work for devices going forward. By
encoding this information in the netdev, this can be used generically as a
solution for similar setups.
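A minimal userspace sketch of the queue layout that ixgbe_fwd_ring_up()
records in the macvlan's tc_to_txq table: each traffic class gets rss_i
queues starting at baseq + rss_i * tc. The sketch_* names and
SKETCH_MAX_TC are assumptions for illustration; the real code goes
through netdev_bind_sb_channel_queue(), which also validates the range
and tags each Tx queue with its sb_dev.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_TC 16	/* assumed table size for this sketch */

/* Hypothetical stand-in for one tc_to_txq entry. */
struct sketch_tc_txq {
	uint16_t count;
	uint16_t offset;
};

/* Record one queue range per traffic class for a macvlan pool. */
static void sketch_bind_pool(struct sketch_tc_txq map[SKETCH_MAX_TC],
			     int num_tc, uint16_t rss_i, uint16_t baseq)
{
	assert(num_tc >= 0 && num_tc <= SKETCH_MAX_TC);

	for (int i = 0; i < num_tc; i++) {
		map[i].count = rss_i;
		map[i].offset = (uint16_t)(baseq + rss_i * i);
	}
}
```

With two traffic classes, rss_i of 4 and baseq of 8, class 0 covers
queues 8 to 11 and class 1 covers queues 12 to 15.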
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 44 ++++++++++++++++++++++---
1 file changed, 38 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index fc23e36..6e27848 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -5271,6 +5271,8 @@ static void ixgbe_clean_rx_ring(struct ixgbe_ring *rx_ring)
static int ixgbe_fwd_ring_up(struct ixgbe_adapter *adapter,
struct ixgbe_fwd_adapter *accel)
{
+ u16 rss_i = adapter->ring_feature[RING_F_RSS].indices;
+ int num_tc = netdev_get_num_tc(adapter->netdev);
struct net_device *vdev = accel->netdev;
int i, baseq, err;
@@ -5282,6 +5284,11 @@ static int ixgbe_fwd_ring_up(struct ixgbe_adapter *adapter,
accel->rx_base_queue = baseq;
accel->tx_base_queue = baseq;
+ /* record configuration for macvlan interface in vdev */
+ for (i = 0; i < num_tc; i++)
+ netdev_bind_sb_channel_queue(adapter->netdev, vdev,
+ i, rss_i, baseq + (rss_i * i));
+
for (i = 0; i < adapter->num_rx_queues_per_pool; i++)
adapter->rx_ring[baseq + i]->netdev = vdev;
@@ -5306,6 +5313,10 @@ static int ixgbe_fwd_ring_up(struct ixgbe_adapter *adapter,
netdev_err(vdev, "L2FW offload disabled due to L2 filter error\n");
+ /* unbind the queues and drop the subordinate channel config */
+ netdev_unbind_sb_channel(adapter->netdev, vdev);
+ netdev_set_sb_channel(vdev, 0);
+
clear_bit(accel->pool, adapter->fwd_bitmask);
kfree(accel);
@@ -8212,18 +8223,22 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
void *accel_priv, select_queue_fallback_t fallback)
{
struct ixgbe_fwd_adapter *fwd_adapter = accel_priv;
- struct ixgbe_adapter *adapter;
- int txq;
#ifdef IXGBE_FCOE
+ struct ixgbe_adapter *adapter;
struct ixgbe_ring_feature *f;
#endif
+ int txq;
if (fwd_adapter) {
- adapter = netdev_priv(dev);
- txq = reciprocal_scale(skb_get_hash(skb),
- adapter->num_rx_queues_per_pool);
+ u8 tc = netdev_get_num_tc(dev) ?
+ netdev_get_prio_tc_map(dev, skb->priority) : 0;
+ struct net_device *vdev = fwd_adapter->netdev;
+
+ txq = vdev->tc_to_txq[tc].offset;
+ txq += reciprocal_scale(skb_get_hash(skb),
+ vdev->tc_to_txq[tc].count);
- return txq + fwd_adapter->tx_base_queue;
+ return txq;
}
#ifdef IXGBE_FCOE
@@ -8777,6 +8792,11 @@ static int ixgbe_reassign_macvlan_pool(struct net_device *vdev, void *data)
/* if we cannot find a free pool then disable the offload */
netdev_err(vdev, "L2FW offload disabled due to lack of queue resources\n");
macvlan_release_l2fw_offload(vdev);
+
+ /* unbind the queues and drop the subordinate channel config */
+ netdev_unbind_sb_channel(adapter->netdev, vdev);
+ netdev_set_sb_channel(vdev, 0);
+
kfree(accel);
return 0;
@@ -9785,6 +9805,13 @@ static void *ixgbe_fwd_add(struct net_device *pdev, struct net_device *vdev)
if (!macvlan_supports_dest_filter(vdev))
return ERR_PTR(-EMEDIUMTYPE);
+ /* We need to lock down the macvlan to be a single queue device so that
+ * we can reuse the tc_to_txq field in the macvlan netdev to represent
+ * the queue mapping to our netdev.
+ */
+ if (netif_is_multiqueue(vdev))
+ return ERR_PTR(-ERANGE);
+
pool = find_first_zero_bit(adapter->fwd_bitmask, adapter->num_rx_pools);
if (pool == adapter->num_rx_pools) {
u16 used_pools = adapter->num_vfs + adapter->num_rx_pools;
@@ -9841,6 +9868,7 @@ static void *ixgbe_fwd_add(struct net_device *pdev, struct net_device *vdev)
return ERR_PTR(-ENOMEM);
set_bit(pool, adapter->fwd_bitmask);
+ netdev_set_sb_channel(vdev, pool);
accel->pool = pool;
accel->netdev = vdev;
@@ -9882,6 +9910,10 @@ static void ixgbe_fwd_del(struct net_device *pdev, void *priv)
ring->netdev = NULL;
}
+ /* unbind the queues and drop the subordinate channel config */
+ netdev_unbind_sb_channel(pdev, accel->netdev);
+ netdev_set_sb_channel(accel->netdev, 0);
+
clear_bit(accel->pool, adapter->fwd_bitmask);
kfree(accel);
}
* [jkirsher/next-queue PATCH 2/7] net: Add support for subordinate device traffic classes
From: Alexander Duyck @ 2018-06-11 17:41 UTC (permalink / raw)
To: netdev, intel-wired-lan, jeffrey.t.kirsher
In-Reply-To: <20180611173003.41352.25621.stgit@ahduyck-green-test.jf.intel.com>
This patch is meant to provide the basic tools needed to allow us to create
subordinate device traffic classes. The general idea here is to allow
subdividing the queues of a device into queue groups accessible through an
upper device such as a macvlan.
The idea here is to enforce that an upper device has to be a
single queue device, ideally with IFF_NO_QUEUE set. With that being the
case we can pretty much guarantee that the tc_to_txq mappings and XPS maps
for the upper device are unused. As such we could reuse those in order to
support subdividing the lower device and distributing those queues between
the subordinate devices.
In order to distinguish between a device carrying a regular set of traffic
classes and a device carrying subordinate traffic classes I changed num_tc
from a u8 to an s16 value and use the negative values to represent the
subordinate pool values. So starting at -1 and running to -32768 we can
encode those as pool values, and the existing values of 0 to 15 can be
maintained.
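A minimal userspace sketch of the sign encoding, for illustration only.
The struct sketch_dev type is an assumption; the two functions mirror
netdev_set_sb_channel() and netdev_get_sb_channel() from the diff below,
minus the multiqueue check. Note that the code in this patch rejects
channels above S16_MAX, so the sketch accepts 0 to 32767.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the num_tc field of struct net_device. */
struct sketch_dev {
	int16_t num_tc;	/* > 0: traffic class count, < 0: negated channel */
};

/* Store the subordinate channel as a negative num_tc; 0 resets it. */
static int sketch_set_sb_channel(struct sketch_dev *dev, uint16_t channel)
{
	assert(dev != NULL);
	if (channel > INT16_MAX)
		return -1;	/* the patch returns -EINVAL here */

	dev->num_tc = (int16_t)-channel;
	return 0;
}

/* Recover the channel; devices with regular traffic classes report 0. */
static int sketch_get_sb_channel(const struct sketch_dev *dev)
{
	int channel = -dev->num_tc;

	return channel > 0 ? channel : 0;
}
```

A device with num_tc of 4 (regular traffic classes) reports channel 0,
while a device set to channel 2 stores num_tc as -2.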
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
include/linux/netdevice.h | 16 ++++++++
net/core/dev.c | 89 +++++++++++++++++++++++++++++++++++++++++++++
net/core/net-sysfs.c | 21 ++++++++++-
3 files changed, 124 insertions(+), 2 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3ec9850..41b4660 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -569,6 +569,9 @@ struct netdev_queue {
* (/sys/class/net/DEV/Q/trans_timeout)
*/
unsigned long trans_timeout;
+
+ /* Subordinate device that the queue has been assigned to */
+ struct net_device *sb_dev;
/*
* write-mostly part
*/
@@ -1978,7 +1981,7 @@ struct net_device {
#ifdef CONFIG_DCB
const struct dcbnl_rtnl_ops *dcbnl_ops;
#endif
- u8 num_tc;
+ s16 num_tc;
struct netdev_tc_txq tc_to_txq[TC_MAX_QUEUE];
u8 prio_tc_map[TC_BITMASK + 1];
@@ -2032,6 +2035,17 @@ int netdev_get_num_tc(struct net_device *dev)
return dev->num_tc;
}
+void netdev_unbind_sb_channel(struct net_device *dev,
+ struct net_device *sb_dev);
+int netdev_bind_sb_channel_queue(struct net_device *dev,
+ struct net_device *sb_dev,
+ u8 tc, u16 count, u16 offset);
+int netdev_set_sb_channel(struct net_device *dev, u16 channel);
+static inline int netdev_get_sb_channel(struct net_device *dev)
+{
+ return max_t(int, -dev->num_tc, 0);
+}
+
static inline
struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
unsigned int index)
diff --git a/net/core/dev.c b/net/core/dev.c
index 6e18242..27fe4f2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2068,11 +2068,13 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned int txq)
struct netdev_tc_txq *tc = &dev->tc_to_txq[0];
int i;
+ /* walk through the TCs and see if it falls into any of them */
for (i = 0; i < TC_MAX_QUEUE; i++, tc++) {
if ((txq - tc->offset) < tc->count)
return i;
}
+ /* didn't find it, just return -1 to indicate no match */
return -1;
}
@@ -2215,7 +2217,14 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
bool active = false;
if (dev->num_tc) {
+ /* Do not allow XPS on subordinate device directly */
num_tc = dev->num_tc;
+ if (num_tc < 0)
+ return -EINVAL;
+
+ /* If queue belongs to subordinate dev use its map */
+ dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+
tc = netdev_txq_to_tc(dev, index);
if (tc < 0)
return -EINVAL;
@@ -2366,11 +2375,25 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
EXPORT_SYMBOL(netif_set_xps_queue);
#endif
+static void netdev_unbind_all_sb_channels(struct net_device *dev)
+{
+ struct netdev_queue *txq = &dev->_tx[dev->num_tx_queues];
+
+ /* Unbind any subordinate channels */
+ while (txq-- != &dev->_tx[0]) {
+ if (txq->sb_dev)
+ netdev_unbind_sb_channel(dev, txq->sb_dev);
+ }
+}
+
void netdev_reset_tc(struct net_device *dev)
{
#ifdef CONFIG_XPS
netif_reset_xps_queues_gt(dev, 0);
#endif
+ netdev_unbind_all_sb_channels(dev);
+
+ /* Reset TC configuration of device */
dev->num_tc = 0;
memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq));
memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map));
@@ -2399,11 +2422,77 @@ int netdev_set_num_tc(struct net_device *dev, u8 num_tc)
#ifdef CONFIG_XPS
netif_reset_xps_queues_gt(dev, 0);
#endif
+ netdev_unbind_all_sb_channels(dev);
+
dev->num_tc = num_tc;
return 0;
}
EXPORT_SYMBOL(netdev_set_num_tc);
+void netdev_unbind_sb_channel(struct net_device *dev,
+ struct net_device *sb_dev)
+{
+ struct netdev_queue *txq = &dev->_tx[dev->num_tx_queues];
+
+#ifdef CONFIG_XPS
+ netif_reset_xps_queues_gt(sb_dev, 0);
+#endif
+ memset(sb_dev->tc_to_txq, 0, sizeof(sb_dev->tc_to_txq));
+ memset(sb_dev->prio_tc_map, 0, sizeof(sb_dev->prio_tc_map));
+
+ while (txq-- != &dev->_tx[0]) {
+ if (txq->sb_dev == sb_dev)
+ txq->sb_dev = NULL;
+ }
+}
+EXPORT_SYMBOL(netdev_unbind_sb_channel);
+
+int netdev_bind_sb_channel_queue(struct net_device *dev,
+ struct net_device *sb_dev,
+ u8 tc, u16 count, u16 offset)
+{
+ /* Make certain the sb_dev and dev are already configured */
+ if (sb_dev->num_tc >= 0 || tc >= dev->num_tc)
+ return -EINVAL;
+
+ /* We cannot hand out queues we don't have */
+ if ((offset + count) > dev->real_num_tx_queues)
+ return -EINVAL;
+
+ /* Record the mapping */
+ sb_dev->tc_to_txq[tc].count = count;
+ sb_dev->tc_to_txq[tc].offset = offset;
+
+ /* Provide a way for Tx queue to find the tc_to_txq map or
+ * XPS map for itself.
+ */
+ while (count--)
+ netdev_get_tx_queue(dev, count + offset)->sb_dev = sb_dev;
+
+ return 0;
+}
+EXPORT_SYMBOL(netdev_bind_sb_channel_queue);
+
+int netdev_set_sb_channel(struct net_device *dev, u16 channel)
+{
+ /* Do not use a multiqueue device to represent a subordinate channel */
+ if (netif_is_multiqueue(dev))
+ return -ENODEV;
+
+ /* We allow channels 1 - 32767 to be used for subordinate channels.
+ * Channel 0 is meant to be "native" mode and used only to represent
+ * the main root device. We allow writing 0 to reset the device back
+ * to normal mode after being used as a subordinate channel.
+ */
+ if (channel > S16_MAX)
+ return -EINVAL;
+
+ dev->num_tc = -channel;
+
+ return 0;
+}
+EXPORT_SYMBOL(netdev_set_sb_channel);
+
/*
* Routine to help set real_num_tx_queues. To avoid skbs mapped to queues
* greater than real_num_tx_queues stale skbs on the qdisc must be flushed.
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 335c6a4..bd067b1 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1054,11 +1054,23 @@ static ssize_t traffic_class_show(struct netdev_queue *queue,
return -ENOENT;
index = get_netdev_queue_index(queue);
+
+ /* If queue belongs to subordinate dev use its tc mapping */
+ dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+
tc = netdev_txq_to_tc(dev, index);
if (tc < 0)
return -EINVAL;
- return sprintf(buf, "%u\n", tc);
+ /* We can report the traffic class one of two ways:
+ * Subordinate device traffic classes are reported with the traffic
+ * class first, and then the subordinate class so for example TC0 on
+ * subordinate device 2 will be reported as "0-2". If the queue
+ * belongs to the root device it will be reported with just the
+ * traffic class, so just "0" for TC 0 for example.
+ */
+ return dev->num_tc < 0 ? sprintf(buf, "%u%d\n", tc, dev->num_tc) :
+ sprintf(buf, "%u\n", tc);
}
#ifdef CONFIG_XPS
@@ -1225,7 +1237,14 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
index = get_netdev_queue_index(queue);
if (dev->num_tc) {
+ /* Do not allow XPS on subordinate device directly */
num_tc = dev->num_tc;
+ if (num_tc < 0)
+ return -EINVAL;
+
+ /* If queue belongs to subordinate dev use its map */
+ dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+
tc = netdev_txq_to_tc(dev, index);
if (tc < 0)
return -EINVAL;
* [jkirsher/next-queue PATCH 1/7] net-sysfs: Drop support for XPS and traffic_class on single queue device
From: Alexander Duyck @ 2018-06-11 17:40 UTC (permalink / raw)
To: netdev, intel-wired-lan, jeffrey.t.kirsher
In-Reply-To: <20180611173003.41352.25621.stgit@ahduyck-green-test.jf.intel.com>
This patch makes it so that we do not report the traffic class or allow XPS
configuration on single queue devices. This is mostly to avoid unnecessary
complexity with changes I have planned that will allow us to reuse
the unused tc_to_txq and XPS configuration on a single queue device to
allow it to make use of a subset of queues on an underlying device.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
net/core/net-sysfs.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index bb7e80f..335c6a4 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1047,9 +1047,14 @@ static ssize_t traffic_class_show(struct netdev_queue *queue,
char *buf)
{
struct net_device *dev = queue->dev;
- int index = get_netdev_queue_index(queue);
- int tc = netdev_txq_to_tc(dev, index);
+ int index;
+ int tc;
+ if (!netif_is_multiqueue(dev))
+ return -ENOENT;
+
+ index = get_netdev_queue_index(queue);
+ tc = netdev_txq_to_tc(dev, index);
if (tc < 0)
return -EINVAL;
@@ -1214,6 +1219,9 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
cpumask_var_t mask;
unsigned long index;
+ if (!netif_is_multiqueue(dev))
+ return -ENOENT;
+
index = get_netdev_queue_index(queue);
if (dev->num_tc) {
@@ -1260,6 +1268,9 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
cpumask_var_t mask;
int err;
+ if (!netif_is_multiqueue(dev))
+ return -ENOENT;
+
if (!capable(CAP_NET_ADMIN))
return -EPERM;
* [jkirsher/next-queue PATCH 0/7] Add support for L2 Fwd Offload w/o ndo_select_queue
From: Alexander Duyck @ 2018-06-11 17:40 UTC (permalink / raw)
To: netdev, intel-wired-lan, jeffrey.t.kirsher
This patch series is meant to allow support for the L2 forward offload, aka
MACVLAN offload without the need for using ndo_select_queue.
The existing solution currently requires that we use ndo_select_queue in
the transmit path if we want to associate specific Tx queues with a given
MACVLAN interface. In order to get away from this we need to repurpose the
tc_to_txq array and XPS pointer for the MACVLAN interface and use those as
a means of accessing the queues on the lower device. As a result we cannot
offload a device that is configured as multiqueue. However, it doesn't
really make sense to configure a macvlan interface as being multiqueue
anyway since it doesn't really have a qdisc of its own in the first place.
I am submitting this as an RFC for the netdev mailing list, and officially
submitting it for testing to Jeff Kirsher's next-queue in order to validate
the ixgbe specific bits.
The big changes in this set are:
Allow lower device to update tc_to_txq and XPS map of offloaded MACVLAN
Disable XPS for single queue devices
Replace accel_priv with sb_dev in ndo_select_queue
Add sb_dev parameter to fallback function for ndo_select_queue
Consolidated ndo_select_queue functions that appeared to be duplicates
---
Alexander Duyck (7):
net-sysfs: Drop support for XPS and traffic_class on single queue device
net: Add support for subordinate device traffic classes
ixgbe: Add code to populate and use macvlan tc to Tx queue map
net: Add support for subordinate traffic classes to netdev_pick_tx
net: Add generic ndo_select_queue functions
net: allow ndo_select_queue to pass netdev
net: allow fallback function to pass netdev
drivers/infiniband/hw/hfi1/vnic_main.c | 2
drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c | 4 -
drivers/net/bonding/bond_main.c | 3
drivers/net/ethernet/amazon/ena/ena_netdev.c | 5 -
drivers/net/ethernet/broadcom/bcmsysport.c | 6 -
drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 6 +
drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h | 3
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 5 -
drivers/net/ethernet/hisilicon/hns/hns_enet.c | 5 -
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 62 ++++++--
drivers/net/ethernet/lantiq_etop.c | 10 -
drivers/net/ethernet/mellanox/mlx4/en_tx.c | 7 +
drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 3
drivers/net/ethernet/mellanox/mlx5/core/en.h | 3
drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 5 -
drivers/net/ethernet/renesas/ravb_main.c | 3
drivers/net/ethernet/sun/ldmvsw.c | 3
drivers/net/ethernet/sun/sunvnet.c | 3
drivers/net/ethernet/ti/netcp_core.c | 9 -
drivers/net/hyperv/netvsc_drv.c | 6 -
drivers/net/macvlan.c | 10 -
drivers/net/net_failover.c | 7 +
drivers/net/team/team.c | 3
drivers/net/tun.c | 3
drivers/net/wireless/marvell/mwifiex/main.c | 3
drivers/net/xen-netback/interface.c | 4 -
drivers/net/xen-netfront.c | 3
drivers/staging/netlogic/xlr_net.c | 9 -
drivers/staging/rtl8188eu/os_dep/os_intfs.c | 3
drivers/staging/rtl8723bs/os_dep/os_intfs.c | 7 -
include/linux/netdevice.h | 32 ++++
net/core/dev.c | 154 ++++++++++++++++++---
net/core/net-sysfs.c | 36 +++++
net/mac80211/iface.c | 4 -
net/packet/af_packet.c | 9 -
35 files changed, 306 insertions(+), 134 deletions(-)
* Backport bonding patches to fix active-passive
From: Nate Clark @ 2018-06-11 17:44 UTC (permalink / raw)
Cc: netdev
Hi,
On the latest 4.9 stable, active-passive bonding does not always
fail over to the passive slave when carrier is lost on the active
slave. It seems that the issue stems from the backport of
c4adfc822bf5d8e97660b6114b5a8892530ce8cb, bonding: make speed, duplex
setting consistent with link state. There were subsequent patches
which resolved issues with the change to bond_update_speed_duplex
which were not backported. The three commits which seem to resolve the
issue are b5bf0f5b16b9c316c34df9f31d4be8729eb86845,
3f3c278c94dd994fe0d9f21679ae19b9c0a55292 and
ad729bc9acfb7c47112964b4877ef5404578ed13. There are other commits in
mainline which also revolve around
c4adfc822bf5d8e97660b6114b5a8892530ce8cb but are not necessary to
resolve the active-passive failover problems.
Would it be possible to queue up the three commits for backporting to
4.9 stable:
b5bf0f5b16b9c316c34df9f31d4be8729eb86845 bonding: correctly update
link status during mii-commit
3f3c278c94dd994fe0d9f21679ae19b9c0a55292 bonding: fix active-backup transition
ad729bc9acfb7c47112964b4877ef5404578ed13 bonding: require speed/duplex
only for 802.3ad, alb and tlb
All of those commits apply cleanly to 4.9.107.
Thanks,
-nate
* Re: Qualcomm rmnet driver and qmi_wwan
From: Bjørn Mork @ 2018-06-11 17:43 UTC (permalink / raw)
To: Subash Abhinov Kasiviswanathan; +Cc: Daniele Palmas, Dan Williams, netdev
In-Reply-To: <4b74bb1d92b9e9351bc504d18f96116b@codeaurora.org>
Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> writes:
>> thanks, I will test it on Monday.
>>
>> Just a question for my knowledge: is the new sysfs attribute really
>> needed? I mean, is there not any other way to understand from qmi_wwan
>> without user intervention that there is the rmnet device attached?
>>
>> Regards,
>> Daniele
>>
>
> Hi Daniele
>
> You can check for the rx_handler attached to qmi_wwan dev and see if it
> belongs to rmnet. You can use the attached patch for it but I think the
> sysfs way might be a bit cleaner.
>
> --
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> a Linux Foundation Collaborative Project
>
> From f7a2b90948da47ade1b345eddb37b721f5ab65f4 Mon Sep 17 00:00:00 2001
> From: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
> Date: Sat, 9 Jun 2018 11:14:22 -0600
> Subject: [PATCH] net: qmi_wwan: Allow packets to pass through to rmnet
>
> Pass through mode is to allow packets in MAP format to be passed
> on to rmnet if the rmnet rx handler is attached to it.
>
> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
> ---
> drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 4 +++-
> drivers/net/usb/qmi_wwan.c | 10 ++++++++++
> include/linux/if_rmnet.h | 20 ++++++++++++++++++++
> 3 files changed, 33 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/if_rmnet.h
>
> diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
> index 5f4e447..164a18f 100644
> --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
> +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
> @@ -17,6 +17,7 @@
> #include <linux/module.h>
> #include <linux/netlink.h>
> #include <linux/netdevice.h>
> +#include <linux/if_rmnet.h>
> #include "rmnet_config.h"
> #include "rmnet_handlers.h"
> #include "rmnet_vnd.h"
> @@ -48,10 +49,11 @@
> [IFLA_RMNET_FLAGS] = { .len = sizeof(struct ifla_rmnet_flags) },
> };
>
> -static int rmnet_is_real_dev_registered(const struct net_device *real_dev)
> +int rmnet_is_real_dev_registered(const struct net_device *real_dev)
> {
> return rcu_access_pointer(real_dev->rx_handler) == rmnet_rx_handler;
> }
> +EXPORT_SYMBOL(rmnet_is_real_dev_registered);
>
> /* Needs rtnl lock */
> static struct rmnet_port*
> diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
> index f52a9be..abdae63 100644
> --- a/drivers/net/usb/qmi_wwan.c
> +++ b/drivers/net/usb/qmi_wwan.c
> @@ -22,6 +22,7 @@
> #include <linux/usb/cdc.h>
> #include <linux/usb/usbnet.h>
> #include <linux/usb/cdc-wdm.h>
> +#include <linux/if_rmnet.h>
>
> /* This driver supports wwan (3G/LTE/?) devices using a vendor
> * specific management protocol called Qualcomm MSM Interface (QMI) -
> @@ -354,6 +355,10 @@ static ssize_t add_mux_store(struct device *d, struct device_attribute *attr, c
> if (kstrtou8(buf, 0, &mux_id))
> return -EINVAL;
>
> + /* rmnet is already attached here */
> + if (rmnet_is_real_dev_registered(to_net_dev(d)))
> + return -EINVAL;
> +
Maybe rmnet_is_real_dev_registered(dev->net) instead, since we use that
elsewhere in this function?
> /* mux_id [1 - 0x7f] range empirically found */
> if (mux_id < 1 || mux_id > 0x7f)
> return -EINVAL;
> @@ -543,6 +548,11 @@ static int qmi_wwan_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
> if (skb->len < dev->net->hard_header_len)
> return 0;
>
> + if (rawip && rmnet_is_real_dev_registered(skb->dev)) {
> + skb->protocol = htons(ETH_P_MAP);
> + return (netif_rx(skb) == NET_RX_SUCCESS);
> + }
Like Daniele said: It would be good to have some way to know when the
rawip condition fails. Or even better: Automatically force rawip mode
when the rmnet driver attaches. But that doesn't seem possible? No
notifications or anything when an rx handler is registered?
Hmm, looking at this I wonder: Is the rawip check really necessary? You
skip all the extra rawip code in the driver anyway, so I don't see how
it matters. But maybe the ethernet header_ops are a problem?
And I wonder about using skb->dev here. Does that really work? I
didn't think we set that until later. Why not use dev->net instead?
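For reference, the check discussed in this thread boils down to comparing the device's rx_handler pointer against rmnet's handler. A minimal user-space model of that comparison follows; every `fake_` name is a hypothetical stand-in rather than a kernel definition, and the RCU access used in the quoted patch is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct fake_netdev;
typedef int (*fake_rx_handler_t)(struct fake_netdev *dev);

struct fake_netdev {
	fake_rx_handler_t rx_handler; /* NULL when nothing is attached */
};

/* Stand-in for rmnet's receive handler. */
static int fake_rmnet_rx_handler(struct fake_netdev *dev)
{
	(void)dev;
	return 0;
}

/* Some other upper device's handler, to show the check is specific. */
static int fake_other_rx_handler(struct fake_netdev *dev)
{
	(void)dev;
	return 1;
}

/* True only when the attached handler is the rmnet one. */
static bool fake_rmnet_is_attached(const struct fake_netdev *dev)
{
	return dev->rx_handler == fake_rmnet_rx_handler;
}
```

The comparison only answers "is rmnet attached right now"; it gives no notification when a handler is registered later, which is the limitation raised above.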
Bjørn
* Re: [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-11 17:26 UTC (permalink / raw)
To: Sridhar Samudrala
Cc: netdev, virtualization, virtio-dev, jesse.brandeburg,
alexander.h.duyck, kubakici, jasowang, loseweigh, jiri,
aaron.f.brown, qemu-devel
In-Reply-To: <1525734594-11134-1-git-send-email-sridhar.samudrala@intel.com>
On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
> This feature bit can be used by hypervisor to indicate virtio_net device to
> act as a standby for another device with the same MAC address.
>
> I tested this with a small change to the patch to mark the STANDBY feature 'true'
> by default as i am using libvirt to start the VMs.
> Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
> XML file?
>
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
So I do not think we can commit to this interface: we
really need to control visibility of the primary device.
However, just for testing purposes, we could add a non-stable
interface "x-standby", with the understanding that, like anything
with an x- prefix, it's unstable and will be changed down the road,
likely in the next release.
> ---
> hw/net/virtio-net.c | 2 ++
> include/standard-headers/linux/virtio_net.h | 3 +++
> 2 files changed, 5 insertions(+)
>
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 90502fca7c..38b3140670 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
> true),
> DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
> DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
> + DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
> + false),
> DEFINE_PROP_END_OF_LIST(),
> };
>
> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> index e9f255ea3f..01ec09684c 100644
> --- a/include/standard-headers/linux/virtio_net.h
> +++ b/include/standard-headers/linux/virtio_net.h
> @@ -57,6 +57,9 @@
> * Steering */
> #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
>
> +#define VIRTIO_NET_F_STANDBY 62 /* Act as standby for another device
> + * with the same MAC.
> + */
> #define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Device set linkspeed and duplex */
>
> #ifndef VIRTIO_NET_NO_LEGACY
> --
> 2.14.3
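For readers following the header change above, here is a small sketch of how a feature bit numbered 62 sits in a 64-bit feature mask. The helpers `feature_is_set` and `feature_set` are hypothetical names for this illustration and are not QEMU or kernel APIs; only the bit number comes from the quoted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_NET_F_STANDBY 62 /* bit number from the quoted patch */

/* Hypothetical helper: test one bit in a 64-bit feature mask. */
static bool feature_is_set(uint64_t features, unsigned int bit)
{
	return (features >> bit) & 1ULL;
}

/* Hypothetical helper: return the mask with one feature bit set. */
static uint64_t feature_set(uint64_t features, unsigned int bit)
{
	return features | (1ULL << bit);
}
```

The `1ULL` constant matters here: shifting a plain `int` by 62 would be undefined behavior on platforms where `int` is 32 bits wide.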
* [PATCH] iwlwifi: pcie: make array prop static, shrinks object size
From: Colin King @ 2018-06-11 17:15 UTC (permalink / raw)
To: Johannes Berg, Emmanuel Grumbach, Luca Coelho,
Intel Linux Wireless, Kalle Valo, David S . Miller,
linux-wireless, netdev
Cc: kernel-janitors, linux-kernel
From: Colin Ian King <colin.king@canonical.com>
Don't populate the read-only array 'prop' on the stack but
instead make it static. Makes the object code smaller by 20 bytes:
Before:
text data bss dec hex filename
71659 14614 576 86849 15341 trans.o
After:
text data bss dec hex filename
71479 14774 576 86829 1532d trans.o
(gcc version 7.3.0 x86_64)
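The difference can be sketched in user space as follows. The function names are hypothetical, and the `const` qualifiers in the static variant are a choice made for this sketch (the patch itself keeps a plain `char *` array).

```c
#include <assert.h>
#include <stddef.h>

/* Automatic array: the two pointers are stored into the stack frame
 * on every call. Hypothetical helper for illustration. */
static size_t count_env_auto(void)
{
	char *prop[] = {"EVENT=INACCESSIBLE", NULL};
	size_t n = 0;

	while (prop[n])
		n++;
	return n;
}

/* Static array: initialized once at build time and placed in a data
 * section, so no per-call setup code is emitted. */
static const char *const *env_static(void)
{
	static const char *const prop[] = {"EVENT=INACCESSIBLE", NULL};

	return prop;
}
```

This matches the size output above, where text shrinks and data grows: the initialization moves out of the function body and into static storage.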
Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
drivers/net/wireless/intel/iwlwifi/pcie/trans.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
index 7229991ae70d..c4626ebe5da1 100644
--- a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
+++ b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
@@ -1946,7 +1946,7 @@ static void iwl_trans_pcie_removal_wk(struct work_struct *wk)
struct iwl_trans_pcie_removal *removal =
container_of(wk, struct iwl_trans_pcie_removal, work);
struct pci_dev *pdev = removal->pdev;
- char *prop[] = {"EVENT=INACCESSIBLE", NULL};
+ static char *prop[] = {"EVENT=INACCESSIBLE", NULL};
dev_err(&pdev->dev, "Device gone - attempting removal\n");
kobject_uevent_env(&pdev->dev.kobj, KOBJ_CHANGE, prop);
--
2.17.0
* [net] fq_codel: fix NULL pointer deref in fq_codel_reset
From: Jeff Kirsher @ 2018-06-11 17:00 UTC (permalink / raw)
To: davem
Cc: Jacob Keller, netdev, nhorman, sassmann, jogreene, Eric Dumazet,
Jeff Kirsher
From: Jacob Keller <jacob.e.keller@intel.com>
The function qdisc_create_dflt attempts to create a default qdisc. If
this fails, it calls qdisc_destroy when cleaning up. The qdisc_destroy
function calls the ->reset op on the qdisc.
In the case of sch_fq_codel.c, this function will panic when the qdisc
wasn't properly initialized:
kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
kernel: IP: fq_codel_reset+0x58/0xd0 [sch_fq_codel]
kernel: PGD 0 P4D 0
kernel: Oops: 0000 [#1] SMP PTI
kernel: Modules linked in: i40iw i40e(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc devlink ebtable_filter ebtables ip6table_filter ip6_tables rpcrdma ib_isert iscsi_target_mod sunrpc ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate iTCO_wdt iTCO_vendor_support intel_uncore ib_core intel_rapl_perf mei_me mei joydev i2c_i801 lpc_ich ioatdma shpchp wmi sch_fq_codel xfs libcrc32c mgag200 ixgbe drm_kms_helper isci ttm firewire_ohci
kernel: mdio drm igb libsas crc32c_intel firewire_core ptp pps_core scsi_transport_sas crc_itu_t dca i2c_algo_bit ipmi_si ipmi_devintf ipmi_msghandler [last unloaded: i40e]
kernel: CPU: 10 PID: 4219 Comm: ip Tainted: G OE 4.16.13custom-fq-codel-test+ #3
kernel: Hardware name: Intel Corporation S2600CO/S2600CO, BIOS SE5C600.86B.02.05.0004.051120151007 05/11/2015
kernel: RIP: 0010:fq_codel_reset+0x58/0xd0 [sch_fq_codel]
kernel: RSP: 0018:ffffbfbf4c1fb620 EFLAGS: 00010246
kernel: RAX: 0000000000000400 RBX: 0000000000000000 RCX: 00000000000005b9
kernel: RDX: 0000000000000000 RSI: ffff9d03264a60c0 RDI: ffff9cfd17b31c00
kernel: RBP: 0000000000000001 R08: 00000000000260c0 R09: ffffffffb679c3e9
kernel: R10: fffff1dab06a0e80 R11: ffff9cfd163af800 R12: ffff9cfd17b31c00
kernel: R13: 0000000000000001 R14: ffff9cfd153de600 R15: 0000000000000001
kernel: FS: 00007fdec2f92800(0000) GS:ffff9d0326480000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000000008 CR3: 0000000c1956a006 CR4: 00000000000606e0
kernel: Call Trace:
kernel: qdisc_destroy+0x56/0x140
kernel: qdisc_create_dflt+0x8b/0xb0
kernel: mq_init+0xc1/0xf0
kernel: qdisc_create_dflt+0x5a/0xb0
kernel: dev_activate+0x205/0x230
kernel: __dev_open+0xf5/0x160
kernel: __dev_change_flags+0x1a3/0x210
kernel: dev_change_flags+0x21/0x60
kernel: do_setlink+0x660/0xdf0
kernel: ? down_trylock+0x25/0x30
kernel: ? xfs_buf_trylock+0x1a/0xd0 [xfs]
kernel: ? rtnl_newlink+0x816/0x990
kernel: ? _xfs_buf_find+0x327/0x580 [xfs]
kernel: ? _cond_resched+0x15/0x30
kernel: ? kmem_cache_alloc+0x20/0x1b0
kernel: ? rtnetlink_rcv_msg+0x200/0x2f0
kernel: ? rtnl_calcit.isra.30+0x100/0x100
kernel: ? netlink_rcv_skb+0x4c/0x120
kernel: ? netlink_unicast+0x19e/0x260
kernel: ? netlink_sendmsg+0x1ff/0x3c0
kernel: ? sock_sendmsg+0x36/0x40
kernel: ? ___sys_sendmsg+0x295/0x2f0
kernel: ? ebitmap_cmp+0x6d/0x90
kernel: ? dev_get_by_name_rcu+0x73/0x90
kernel: ? skb_dequeue+0x52/0x60
kernel: ? __inode_wait_for_writeback+0x7f/0xf0
kernel: ? bit_waitqueue+0x30/0x30
kernel: ? fsnotify_grab_connector+0x3c/0x60
kernel: ? __sys_sendmsg+0x51/0x90
kernel: ? do_syscall_64+0x74/0x180
kernel: ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
kernel: Code: 00 00 48 89 87 00 02 00 00 8b 87 a0 01 00 00 85 c0 0f 84 84 00 00 00 31 ed 48 63 dd 83 c5 01 48 c1 e3 06 49 03 9c 24 90 01 00 00 <48> 8b 73 08 48 8b 3b e8 6c 9a 4f f6 48 8d 43 10 48 c7 03 00 00
kernel: RIP: fq_codel_reset+0x58/0xd0 [sch_fq_codel] RSP: ffffbfbf4c1fb620
kernel: CR2: 0000000000000008
kernel: ---[ end trace e81a62bede66274e ]---
This occurs because if fq_codel_init fails, it has left the private data
in an incomplete state. For example, if tcf_block_get fails, (as in the
above panic), then q->flows and q->backlogs will be NULL. Thus they will
cause NULL pointer access when attempting to reset them in
fq_codel_reset.
We could mitigate some of these issues by changing fq_codel_init to more
explicitly clean up after itself when failing. For example, we could
ensure that q->flows_cnt was set to 0 so that the loop over each flow in
fq_codel_reset would not trigger. However, this would not prevent a NULL
pointer dereference when attempting to memset the q->backlogs.
Instead, just add a NULL check prior to attempting to reset these
fields.
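A minimal user-space sketch of the pattern, assuming a simplified stand-in structure (`fake_sched_data` and `fake_reset` are hypothetical names, and the per-flow state is reduced to a plain array):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the qdisc private data. */
struct fake_sched_data {
	uint32_t *flows;    /* per-flow state, NULL if init failed early */
	uint32_t *backlogs; /* per-flow backlog, NULL if init failed early */
	uint32_t flows_cnt;
};

/* Reset that tolerates a partially initialized structure. */
static void fake_reset(struct fake_sched_data *q)
{
	if (q->flows)
		memset(q->flows, 0, q->flows_cnt * sizeof(*q->flows));
	if (q->backlogs)
		memset(q->backlogs, 0, q->flows_cnt * sizeof(*q->backlogs));
}
```

Checking each pointer separately means the reset stays safe even when flows_cnt is already non-zero but the allocations never happened, which is the state described above.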
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
net/sched/sch_fq_codel.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index 22fa13cf5d8b..1658c314ee40 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -352,14 +352,17 @@ static void fq_codel_reset(struct Qdisc *sch)
INIT_LIST_HEAD(&q->new_flows);
INIT_LIST_HEAD(&q->old_flows);
- for (i = 0; i < q->flows_cnt; i++) {
- struct fq_codel_flow *flow = q->flows + i;
+ if (q->flows) {
+ for (i = 0; i < q->flows_cnt; i++) {
+ struct fq_codel_flow *flow = q->flows + i;
- fq_codel_flow_purge(flow);
- INIT_LIST_HEAD(&flow->flowchain);
- codel_vars_init(&flow->cvars);
+ fq_codel_flow_purge(flow);
+ INIT_LIST_HEAD(&flow->flowchain);
+ codel_vars_init(&flow->cvars);
+ }
}
- memset(q->backlogs, 0, q->flows_cnt * sizeof(u32));
+ if (q->backlogs)
+ memset(q->backlogs, 0, q->flows_cnt * sizeof(u32));
sch->q.qlen = 0;
sch->qstats.backlog = 0;
q->memory_usage = 0;
--
2.17.1
* Re: [PATCH nf-next] netfilter: nft_reject_bridge: remove unnecessary ttl set
From: Taehee Yoo @ 2018-06-11 16:54 UTC (permalink / raw)
To: davem, Steffen Klassert; +Cc: netdev, Taehee Yoo
In-Reply-To: <20180611163505.9827-1-ap420073@gmail.com>
2018-06-12 1:35 GMT+09:00 Taehee Yoo <ap420073@gmail.com>:
> In nft_reject_br_send_v4_tcp_reset(), the ttl is already set by
> nf_reject_iphdr_put(), so the code below is unnecessary.
>
> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> ---
> net/bridge/netfilter/nft_reject_bridge.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/net/bridge/netfilter/nft_reject_bridge.c b/net/bridge/netfilter/nft_reject_bridge.c
> index eaf05de..e0b082c 100644
> --- a/net/bridge/netfilter/nft_reject_bridge.c
> +++ b/net/bridge/netfilter/nft_reject_bridge.c
> @@ -89,8 +89,7 @@ static void nft_reject_br_send_v4_tcp_reset(struct net *net,
> niph = nf_reject_iphdr_put(nskb, oldskb, IPPROTO_TCP,
> net->ipv4.sysctl_ip_default_ttl);
> nf_reject_ip_tcphdr_put(nskb, oldskb, oth);
> - niph->ttl = net->ipv4.sysctl_ip_default_ttl;
> - niph->tot_len = htons(nskb->len);
> + niph->tot_len = htons(nskb->len);
> ip_send_check(niph);
>
> nft_reject_br_push_etherhdr(oldskb, nskb);
> --
> 2.9.3
>
I'm so sorry, I sent this to you by mistake.
Please ignore this.
Thanks