Netdev List
 help / color / mirror / Atom feed
* Klientskie Bazy http://prodawez.tilda.ws/page7270311.html
From: netdev @ 2019-09-12  0:25 UTC (permalink / raw)
  To: netdev

Klientskie Bazy http://prodawez.tilda.ws/page7270311.html

^ permalink raw reply

* Re: [PATCH v3 2/2] tcp: Add rcv_wnd to TCP_INFO
From: Neal Cardwell @ 2019-09-12  0:49 UTC (permalink / raw)
  To: Thomas Higdon
  Cc: netdev@vger.kernel.org, Jonathan Lemon, Dave Jones, Eric Dumazet,
	Yuchung Cheng, Soheil Hassas Yeganeh
In-Reply-To: <20190911223148.89808-2-tph@fb.com>

On Wed, Sep 11, 2019 at 6:32 PM Thomas Higdon <tph@fb.com> wrote:
>
> Neal Cardwell mentioned that rcv_wnd would be useful for helping
> diagnose whether a flow is receive-window-limited at a given instant.
>
> This serves the purpose of adding an additional __u32 to avoid the
> would-be hole caused by the addition of the tcpi_rcvi_ooopack field.
>
> Signed-off-by: Thomas Higdon <tph@fb.com>
> ---

Thanks, Thomas.

I know that when I mentioned this before I mentioned the idea of both
tp->snd_wnd (send-side receive window) and tp->rcv_wnd (receive-side
receive window) in tcp_info, and did not express a preference between
the two. Now that we are faced with a decision between the two,
personally I think it would be a little more useful to start with
tp->snd_wnd. :-)

Two main reasons:

(1) Usually when we're diagnosing TCP performance problems, we do so
from the sender, since the sender makes most of the
performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc).
From the sender-side the thing that would be most useful is to see
tp->snd_wnd, the receive window that the receiver has advertised to
the sender.

(2) From the receiver side, "ss" can already show a fair amount of
info about receive-side buffer/window limits, like:
info->tcpi_rcv_ssthresh, info->tcpi_rcv_space,
skmeminfo[SK_MEMINFO_RMEM_ALLOC], skmeminfo[SK_MEMINFO_RCVBUF]. Often
the rwin can be approximated by combining those.

Hopefully Eric, Yuchung, and Soheil can weigh in on the question of
snd_wnd vs rcv_wnd. Or we can perhaps think of another field, and add
the tcpi_rcvi_ooopack, snd_wnd, rcv_wnd, and that final field, all
together.

thanks,
neal

^ permalink raw reply

* Re: [PATCH V2 net-next 4/7] net: hns3: fix port setting handle for fibre port
From: tanhuazhong @ 2019-09-12  0:56 UTC (permalink / raw)
  To: Sergei Shtylyov, davem
  Cc: netdev, linux-kernel, salil.mehta, yisen.zhuang, linuxarm,
	jakub.kicinski
In-Reply-To: <7f914173-a2fc-08d8-e2b1-48fa3da4e29c@cogentembedded.com>



On 2019/9/11 18:16, Sergei Shtylyov wrote:
> Hello!
> 
> On 11.09.2019 5:40, Huazhong Tan wrote:
> 
>> From: Guangbin Huang <huangguangbin2@huawei.com>
>>
>> For hardware doesn't support use specified speed and duplex
> 
>     Can't pasre that. "For hardware that does not support using", perhaps?

Yes, thanks. Will check the grammar more carefully next time.

> 
>> to negotiate, it's unnecessary to check and modify the port
>> speed and duplex for fibre port when autoneg is on.
>>
>> Fixes: 22f48e24a23d ("net: hns3: add autoneg and change speed support 
>> for fibre port")
>> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
>> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
>> ---
>>   drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 15 +++++++++++++++
>>   1 file changed, 15 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c 
>> b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
>> index f5a681d..680c350 100644
>> --- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
>> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
>> @@ -726,6 +726,12 @@ static int hns3_check_ksettings_param(const 
>> struct net_device *netdev,
>>       u8 duplex;
>>       int ret;
>> +    /* hw doesn't support use specified speed and duplex to negotiate,
> 
>     I can't parse that, did you mean "using"?

yes, thanks.

> 
>> +     * unnecessary to check them when autoneg on.
>> +     */
>> +    if (cmd->base.autoneg)
>> +        return 0;
>> +
>>       if (ops->get_ksettings_an_result) {
>>           ops->get_ksettings_an_result(handle, &autoneg, &speed, 
>> &duplex);
>>           if (cmd->base.autoneg == autoneg && cmd->base.speed == speed &&
>> @@ -787,6 +793,15 @@ static int hns3_set_link_ksettings(struct 
>> net_device *netdev,
>>               return ret;
>>       }
>> +    /* hw doesn't support use specified speed and duplex to negotiate,
> 
>     Here too...
> 


yes, thanks.

>> +     * ignore them when autoneg on.
>> +     */
>> +    if (cmd->base.autoneg) {
>> +        netdev_info(netdev,
>> +                "autoneg is on, ignore the speed and duplex\n");
>> +        return 0;
>> +    }
>> +
>>       if (ops->cfg_mac_speed_dup_h)
>>           ret = ops->cfg_mac_speed_dup_h(handle, cmd->base.speed,
>>                              cmd->base.duplex);
> 
> MBR, Sergei
> 
> .
> 


^ permalink raw reply

* Re: [Patch net] sch_sfb: fix a crash in sfb_destroy()
From: Cong Wang @ 2019-09-12  1:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linux Kernel Network Developers, syzbot, Linus Torvalds,
	Jamal Hadi Salim, Jiri Pirko
In-Reply-To: <7b5b69a9-7ace-2d21-f187-7a81fb1dae5a@gmail.com>

On Wed, Sep 11, 2019 at 2:36 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> It seems a similar fix would be needed in net/sched/sch_dsmark.c ?
>

Yeah, or just add a NULL check in dsmark_destroy().

Anyway, I will send a separate patch for it.

Thanks.

^ permalink raw reply

* Re: [PATCH v1 net-next 12/15] net: dsa: sja1105: Configure the Time-Aware Scheduler via tc-taprio offload
From: Vladimir Oltean @ 2019-09-12  1:30 UTC (permalink / raw)
  To: Vinicius Costa Gomes
  Cc: f.fainelli, vivien.didelot, andrew, davem, vedang.patel,
	richardcochran, weifeng.voon, jiri, m-karicheri2, Jose.Abreu,
	ilias.apalodimas, jhs, xiyou.wangcong, kurt.kanzenbach, netdev
In-Reply-To: <87woeeipm8.fsf@linux.intel.com>

Hi Vinicius,

On 11/09/2019, Vinicius Costa Gomes <vinicius.gomes@intel.com> wrote:
> Hi,
>
> Vladimir Oltean <olteanv@gmail.com> writes:
>
>> This qdisc offload is the closest thing to what the SJA1105 supports in
>> hardware for time-based egress shaping. The switch core really is built
>> around SAE AS6802/TTEthernet (a TTTech standard) but can be made to
>> operate similarly to IEEE 802.1Qbv with some constraints:
>>
>> - The gate control list is a global list for all ports. There are 8
>>   execution threads that iterate through this global list in parallel.
>>   I don't know why 8, there are only 4 front-panel ports.
>>
>> - Care must be taken by the user to make sure that two execution threads
>>   never get to execute a GCL entry simultaneously. I created a O(n^4)
>>   checker for this hardware limitation, prior to accepting a taprio
>>   offload configuration as valid.
>>
>> - The spec says that if a GCL entry's interval is shorter than the frame
>>   length, you shouldn't send it (and end up in head-of-line blocking).
>>   Well, this switch does anyway.
>>
>> - The switch has no concept of ADMIN and OPER configurations. Because
>>   it's so simple, the TAS settings are loaded through the static config
>>   tables interface, so there isn't even place for any discussion about
>>   'graceful switchover between ADMIN and OPER'. You just reset the
>>   switch and upload a new OPER config.
>>
>> - The switch accepts multiple time sources for the gate events. Right
>>   now I am using the standalone clock source as opposed to PTP. So the
>>   base time parameter doesn't really do much. Support for the PTP clock
>>   source will be added in the next patch.
>>
>> Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
>> ---
>> Changes since RFC:
>> - Removed the sja1105_tas_config_work workqueue.
>> - Allocating memory with GFP_KERNEL.
>> - Made the ASCII art drawing fit in < 80 characters.
>> - Made most of the time-holding variables s64 instead of u64 (for fear
>>   of them not holding the result of signed arithmetics properly).
>>
>>  drivers/net/dsa/sja1105/Kconfig        |   8 +
>>  drivers/net/dsa/sja1105/Makefile       |   4 +
>>  drivers/net/dsa/sja1105/sja1105.h      |   5 +
>>  drivers/net/dsa/sja1105/sja1105_main.c |  19 +-
>>  drivers/net/dsa/sja1105/sja1105_tas.c  | 420 +++++++++++++++++++++++++
>>  drivers/net/dsa/sja1105/sja1105_tas.h  |  42 +++
>>  6 files changed, 497 insertions(+), 1 deletion(-)
>>  create mode 100644 drivers/net/dsa/sja1105/sja1105_tas.c
>>  create mode 100644 drivers/net/dsa/sja1105/sja1105_tas.h
>>
>> diff --git a/drivers/net/dsa/sja1105/Kconfig
>> b/drivers/net/dsa/sja1105/Kconfig
>> index 770134a66e48..55424f39cb0d 100644
>> --- a/drivers/net/dsa/sja1105/Kconfig
>> +++ b/drivers/net/dsa/sja1105/Kconfig
>> @@ -23,3 +23,11 @@ config NET_DSA_SJA1105_PTP
>>  	help
>>  	  This enables support for timestamping and PTP clock manipulations in
>>  	  the SJA1105 DSA driver.
>> +
>> +config NET_DSA_SJA1105_TAS
>> +	bool "Support for the Time-Aware Scheduler on NXP SJA1105"
>> +	depends on NET_DSA_SJA1105
>> +	help
>> +	  This enables support for the TTEthernet-based egress scheduling
>> +	  engine in the SJA1105 DSA driver, which is controlled using a
>> +	  hardware offload of the tc-tqprio qdisc.
>> diff --git a/drivers/net/dsa/sja1105/Makefile
>> b/drivers/net/dsa/sja1105/Makefile
>> index 4483113e6259..66161e874344 100644
>> --- a/drivers/net/dsa/sja1105/Makefile
>> +++ b/drivers/net/dsa/sja1105/Makefile
>> @@ -12,3 +12,7 @@ sja1105-objs := \
>>  ifdef CONFIG_NET_DSA_SJA1105_PTP
>>  sja1105-objs += sja1105_ptp.o
>>  endif
>> +
>> +ifdef CONFIG_NET_DSA_SJA1105_TAS
>> +sja1105-objs += sja1105_tas.o
>> +endif
>> diff --git a/drivers/net/dsa/sja1105/sja1105.h
>> b/drivers/net/dsa/sja1105/sja1105.h
>> index 3ca0b87aa3e4..d95f9ce3b4f9 100644
>> --- a/drivers/net/dsa/sja1105/sja1105.h
>> +++ b/drivers/net/dsa/sja1105/sja1105.h
>> @@ -21,6 +21,7 @@
>>  #define SJA1105_AGEING_TIME_MS(ms)	((ms) / 10)
>>
>>  #include "sja1105_ptp.h"
>> +#include "sja1105_tas.h"
>>
>>  /* Keeps the different addresses between E/T and P/Q/R/S */
>>  struct sja1105_regs {
>> @@ -96,6 +97,7 @@ struct sja1105_private {
>>  	struct mutex mgmt_lock;
>>  	struct sja1105_tagger_data tagger_data;
>>  	struct sja1105_ptp_data ptp_data;
>> +	struct sja1105_tas_data tas_data;
>>  };
>>
>>  #include "sja1105_dynamic_config.h"
>> @@ -111,6 +113,9 @@ typedef enum {
>>  	SPI_WRITE = 1,
>>  } sja1105_spi_rw_mode_t;
>>
>> +/* From sja1105_main.c */
>> +int sja1105_static_config_reload(struct sja1105_private *priv);
>> +
>>  /* From sja1105_spi.c */
>>  int sja1105_spi_send_packed_buf(const struct sja1105_private *priv,
>>  				sja1105_spi_rw_mode_t rw, u64 reg_addr,
>> diff --git a/drivers/net/dsa/sja1105/sja1105_main.c
>> b/drivers/net/dsa/sja1105/sja1105_main.c
>> index 8b930cc2dabc..4b393782cc84 100644
>> --- a/drivers/net/dsa/sja1105/sja1105_main.c
>> +++ b/drivers/net/dsa/sja1105/sja1105_main.c
>> @@ -22,6 +22,7 @@
>>  #include <linux/if_ether.h>
>>  #include <linux/dsa/8021q.h>
>>  #include "sja1105.h"
>> +#include "sja1105_tas.h"
>>
>>  static void sja1105_hw_reset(struct gpio_desc *gpio, unsigned int
>> pulse_len,
>>  			     unsigned int startup_delay)
>> @@ -1382,7 +1383,7 @@ static void sja1105_bridge_leave(struct dsa_switch
>> *ds, int port,
>>   * modify at runtime (currently only MAC) and restore them after
>> uploading,
>>   * such that this operation is relatively seamless.
>>   */
>> -static int sja1105_static_config_reload(struct sja1105_private *priv)
>> +int sja1105_static_config_reload(struct sja1105_private *priv)
>>  {
>>  	struct ptp_system_timestamp ptp_sts_before;
>>  	struct ptp_system_timestamp ptp_sts_after;
>> @@ -1761,6 +1762,7 @@ static void sja1105_teardown(struct dsa_switch *ds)
>>  {
>>  	struct sja1105_private *priv = ds->priv;
>>
>> +	sja1105_tas_teardown(priv);
>>  	cancel_work_sync(&priv->tagger_data.rxtstamp_work);
>>  	skb_queue_purge(&priv->tagger_data.skb_rxtstamp_queue);
>>  	sja1105_ptp_clock_unregister(priv);
>> @@ -2088,6 +2090,18 @@ static bool sja1105_port_txtstamp(struct dsa_switch
>> *ds, int port,
>>  	return true;
>>  }
>>
>> +static int sja1105_port_setup_tc(struct dsa_switch *ds, int port,
>> +				 enum tc_setup_type type,
>> +				 void *type_data)
>> +{
>> +	switch (type) {
>> +	case TC_SETUP_QDISC_TAPRIO:
>> +		return sja1105_setup_tc_taprio(ds, port, type_data);
>> +	default:
>> +		return -EOPNOTSUPP;
>> +	}
>> +}
>> +
>>  static const struct dsa_switch_ops sja1105_switch_ops = {
>>  	.get_tag_protocol	= sja1105_get_tag_protocol,
>>  	.setup			= sja1105_setup,
>> @@ -2120,6 +2134,7 @@ static const struct dsa_switch_ops
>> sja1105_switch_ops = {
>>  	.port_hwtstamp_set	= sja1105_hwtstamp_set,
>>  	.port_rxtstamp		= sja1105_port_rxtstamp,
>>  	.port_txtstamp		= sja1105_port_txtstamp,
>> +	.port_setup_tc		= sja1105_port_setup_tc,
>>  };
>>
>>  static int sja1105_check_device_id(struct sja1105_private *priv)
>> @@ -2229,6 +2244,8 @@ static int sja1105_probe(struct spi_device *spi)
>>  	}
>>  	mutex_init(&priv->mgmt_lock);
>>
>> +	sja1105_tas_setup(priv);
>> +
>>  	return dsa_register_switch(priv->ds);
>>  }
>>
>> diff --git a/drivers/net/dsa/sja1105/sja1105_tas.c
>> b/drivers/net/dsa/sja1105/sja1105_tas.c
>> new file mode 100644
>> index 000000000000..769e1d8e5e8f
>> --- /dev/null
>> +++ b/drivers/net/dsa/sja1105/sja1105_tas.c
>> @@ -0,0 +1,420 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2019, Vladimir Oltean <olteanv@gmail.com>
>> + */
>> +#include "sja1105.h"
>> +
>> +#define SJA1105_TAS_CLKSRC_DISABLED	0
>> +#define SJA1105_TAS_CLKSRC_STANDALONE	1
>> +#define SJA1105_TAS_CLKSRC_AS6802	2
>> +#define SJA1105_TAS_CLKSRC_PTP		3
>> +#define SJA1105_GATE_MASK		GENMASK_ULL(SJA1105_NUM_TC - 1, 0)
>> +#define SJA1105_TAS_MAX_DELTA		BIT(19)
>> +
>> +/* This is not a preprocessor macro because the "ns" argument may or may
>> not be
>> + * s64 at caller side. This ensures it is properly type-cast before
>> div_s64.
>> + */
>> +static s64 ns_to_sja1105_delta(s64 ns)
>> +{
>> +	return div_s64(ns, 200);
>> +}
>> +
>> +/* Lo and behold: the egress scheduler from hell.
>> + *
>> + * At the hardware level, the Time-Aware Shaper holds a global linear
>> arrray of
>> + * all schedule entries for all ports. These are the Gate Control List
>> (GCL)
>> + * entries, let's call them "timeslots" for short. This linear array of
>> + * timeslots is held in BLK_IDX_SCHEDULE.
>> + *
>> + * Then there are a maximum of 8 "execution threads" inside the switch,
>> which
>> + * iterate cyclically through the "schedule". Each "cycle" has an entry
>> point
>> + * and an exit point, both being timeslot indices in the schedule table.
>> The
>> + * hardware calls each cycle a "subschedule".
>> + *
>> + * Subschedule (cycle) i starts when
>> + *   ptpclkval >= ptpschtm + BLK_IDX_SCHEDULE_ENTRY_POINTS[i].delta.
>> + *
>> + * The hardware scheduler iterates BLK_IDX_SCHEDULE with a k ranging
>> from
>> + *   k = BLK_IDX_SCHEDULE_ENTRY_POINTS[i].address to
>> + *   k = BLK_IDX_SCHEDULE_PARAMS.subscheind[i]
>> + *
>> + * For each schedule entry (timeslot) k, the engine executes the gate
>> control
>> + * list entry for the duration of BLK_IDX_SCHEDULE[k].delta.
>> + *
>> + *         +---------+
>> + *         |         | BLK_IDX_SCHEDULE_ENTRY_POINTS_PARAMS
>> + *         +---------+
>> + *              |
>> + *              +-----------------+
>> + *                                | .actsubsch
>> + *  BLK_IDX_SCHEDULE_ENTRY_POINTS v
>> + *                 +-------+-------+
>> + *                 |cycle 0|cycle 1|
>> + *                 +-------+-------+
>> + *                   |  |      |  |
>> + *  +----------------+  |      |
>> +-------------------------------------+
>> + *  |   .subschindx     |      |             .subschindx
>> |
>> + *  |                   |      +---------------+
>> |
>> + *  |          .address |        .address      |
>> |
>> + *  |                   |                      |
>> |
>> + *  |                   |                      |
>> |
>> + *  |  BLK_IDX_SCHEDULE v                      v
>> |
>> + *  |              +-------+-------+-------+-------+-------+------+
>> |
>> + *  |              |entry 0|entry 1|entry 2|entry 3|entry 4|entry5|
>> |
>> + *  |              +-------+-------+-------+-------+-------+------+
>> |
>> + *  |                                  ^                    ^  ^  ^
>> |
>> + *  |                                  |                    |  |  |
>> |
>> + *  |        +-------------------------+                    |  |  |
>> |
>> + *  |        |              +-------------------------------+  |  |
>> |
>> + *  |        |              |              +-------------------+  |
>> |
>> + *  |        |              |              |                      |
>> |
>> + *  | +---------------------------------------------------------------+
>> |
>> + *  | |subscheind[0]<=subscheind[1]<=subscheind[2]<=...<=subscheind[7]|
>> |
>> + *  | +---------------------------------------------------------------+
>> |
>> + *  |        ^              ^                BLK_IDX_SCHEDULE_PARAMS
>> |
>> + *  |        |              |
>> |
>> + *  +--------+
>> +-------------------------------------------+
>> + *
>> + *  In the above picture there are two subschedules (cycles):
>> + *
>> + *  - cycle 0: iterates the schedule table from 0 to 2 (and back)
>> + *  - cycle 1: iterates the schedule table from 3 to 5 (and back)
>> + *
>> + *  All other possible execution threads must be marked as unused by
>> making
>> + *  their "subschedule end index" (subscheind) equal to the last valid
>> + *  subschedule's end index (in this case 5).
>> + */
>> +static int sja1105_init_scheduling(struct sja1105_private *priv)
>> +{
>> +	struct sja1105_schedule_entry_points_entry *schedule_entry_points;
>> +	struct sja1105_schedule_entry_points_params_entry
>> +					*schedule_entry_points_params;
>> +	struct sja1105_schedule_params_entry *schedule_params;
>> +	struct sja1105_tas_data *tas_data = &priv->tas_data;
>> +	struct sja1105_schedule_entry *schedule;
>> +	struct sja1105_table *table;
>> +	int subscheind[8] = {0};
>> +	int schedule_start_idx;
>> +	s64 entry_point_delta;
>> +	int schedule_end_idx;
>> +	int num_entries = 0;
>> +	int num_cycles = 0;
>> +	int cycle = 0;
>> +	int i, k = 0;
>> +	int port;
>> +
>> +	/* Discard previous Schedule Table */
>> +	table = &priv->static_config.tables[BLK_IDX_SCHEDULE];
>> +	if (table->entry_count) {
>> +		kfree(table->entries);
>> +		table->entry_count = 0;
>> +	}
>> +
>> +	/* Discard previous Schedule Entry Points Parameters Table */
>> +	table =
>> &priv->static_config.tables[BLK_IDX_SCHEDULE_ENTRY_POINTS_PARAMS];
>> +	if (table->entry_count) {
>> +		kfree(table->entries);
>> +		table->entry_count = 0;
>> +	}
>> +
>> +	/* Discard previous Schedule Parameters Table */
>> +	table = &priv->static_config.tables[BLK_IDX_SCHEDULE_PARAMS];
>> +	if (table->entry_count) {
>> +		kfree(table->entries);
>> +		table->entry_count = 0;
>> +	}
>> +
>> +	/* Discard previous Schedule Entry Points Table */
>> +	table = &priv->static_config.tables[BLK_IDX_SCHEDULE_ENTRY_POINTS];
>> +	if (table->entry_count) {
>> +		kfree(table->entries);
>> +		table->entry_count = 0;
>> +	}
>> +
>> +	/* Figure out the dimensioning of the problem */
>> +	for (port = 0; port < SJA1105_NUM_PORTS; port++) {
>> +		if (tas_data->config[port]) {
>> +			num_entries += tas_data->config[port]->num_entries;
>> +			num_cycles++;
>> +		}
>> +	}
>> +
>> +	/* Nothing to do */
>> +	if (!num_cycles)
>> +		return 0;
>> +
>> +	/* Pre-allocate space in the static config tables */
>> +
>> +	/* Schedule Table */
>> +	table = &priv->static_config.tables[BLK_IDX_SCHEDULE];
>> +	table->entries = kcalloc(num_entries, table->ops->unpacked_entry_size,
>> +				 GFP_KERNEL);
>> +	if (!table->entries)
>> +		return -ENOMEM;
>> +	table->entry_count = num_entries;
>> +	schedule = table->entries;
>> +
>> +	/* Schedule Points Parameters Table */
>> +	table =
>> &priv->static_config.tables[BLK_IDX_SCHEDULE_ENTRY_POINTS_PARAMS];
>> +	table->entries =
>> kcalloc(SJA1105_MAX_SCHEDULE_ENTRY_POINTS_PARAMS_COUNT,
>> +				 table->ops->unpacked_entry_size, GFP_KERNEL);
>> +	if (!table->entries)
>> +		return -ENOMEM;
>
> Should this free the previous allocation, in case this one fails?
> (also applies to the statements below)
>

I had to take a look at the overall driver code again, since it's
already been a while since I added it and I couldn't remember exactly.
All memory is freed automagically in sja1105_static_config_free from
sja1105_static_config.c. That simplifies driver code considerably,
although it's so generic that I forgot that it's there.

Thanks,
-Vladimir

^ permalink raw reply

* Re: [PATCH net 1/2] sctp: remove redundant assignment when call sctp_get_port_local
From: maowenan @ 2019-09-12  2:05 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner, Dan Carpenter
  Cc: vyasevich, nhorman, davem, linux-sctp, netdev, linux-kernel,
	kernel-janitors
In-Reply-To: <20190911143923.GE3499@localhost.localdomain>



On 2019/9/11 22:39, Marcelo Ricardo Leitner wrote:
> On Wed, Sep 11, 2019 at 11:30:08AM -0300, Marcelo Ricardo Leitner wrote:
>> On Wed, Sep 11, 2019 at 11:30:38AM +0300, Dan Carpenter wrote:
>>> On Wed, Sep 11, 2019 at 09:30:47AM +0800, maowenan wrote:
>>>>
>>>>
>>>> On 2019/9/11 3:22, Dan Carpenter wrote:
>>>>> On Tue, Sep 10, 2019 at 09:57:10PM +0300, Dan Carpenter wrote:
>>>>>> On Tue, Sep 10, 2019 at 03:13:42PM +0800, Mao Wenan wrote:
>>>>>>> There are more parentheses in if clause when call sctp_get_port_local
>>>>>>> in sctp_do_bind, and redundant assignment to 'ret'. This patch is to
>>>>>>> do cleanup.
>>>>>>>
>>>>>>> Signed-off-by: Mao Wenan <maowenan@huawei.com>
>>>>>>> ---
>>>>>>>  net/sctp/socket.c | 3 +--
>>>>>>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
>>>>>>> index 9d1f83b10c0a..766b68b55ebe 100644
>>>>>>> --- a/net/sctp/socket.c
>>>>>>> +++ b/net/sctp/socket.c
>>>>>>> @@ -399,9 +399,8 @@ static int sctp_do_bind(struct sock *sk, union sctp_addr *addr, int len)
>>>>>>>  	 * detection.
>>>>>>>  	 */
>>>>>>>  	addr->v4.sin_port = htons(snum);
>>>>>>> -	if ((ret = sctp_get_port_local(sk, addr))) {
>>>>>>> +	if (sctp_get_port_local(sk, addr))
>>>>>>>  		return -EADDRINUSE;
>>>>>>
>>>>>> sctp_get_port_local() returns a long which is either 0,1 or a pointer
>>>>>> casted to long.  It's not documented what it means and neither of the
>>>>>> callers use the return since commit 62208f12451f ("net: sctp: simplify
>>>>>> sctp_get_port").
>>>>>
>>>>> Actually it was commit 4e54064e0a13 ("sctp: Allow only 1 listening
>>>>> socket with SO_REUSEADDR") from 11 years ago.  That patch fixed a bug,
>>>>> because before the code assumed that a pointer casted to an int was the
>>>>> same as a pointer casted to a long.
>>>>
>>>> commit 4e54064e0a13 treated non-zero return value as unexpected, so the current
>>>> cleanup is ok?
>>>
>>> Yeah.  It's fine, I was just confused why we weren't preserving the
>>> error code and then I saw that we didn't return errors at all and got
>>> confused.
>>
>> But please lets seize the moment and do the change Dean suggested.
> 
> *Dan*, sorry.
> 
>> This was the last place saving this return value somewhere. It makes
>> sense to cleanup sctp_get_port_local() now and remove that masked
>> pointer return.
>>
>> Then you may also cleanup:
>> socket.c:       return !!sctp_get_port_local(sk, &addr);
>> as it will be a direct map.

Thanks Marcelo, shall I post a new individual patch for cleanup as your suggest?
>>
>>   Marcelo
>>
> 
> .
> 


^ permalink raw reply

* Klientskie Bazy http://prodawez.tilda.ws/page7270311.html
From: netdev @ 2019-09-12  1:18 UTC (permalink / raw)
  To: netdev

Klientskie Bazy http://prodawez.tilda.ws/page7270311.html

^ permalink raw reply

* [PATCH v2 net 0/3] fix memory leak for sctp_do_bind
From: Mao Wenan @ 2019-09-12  4:02 UTC (permalink / raw)
  To: vyasevich, nhorman, marcelo.leitner, davem
  Cc: linux-sctp, netdev, linux-kernel, kernel-janitors, Mao Wenan
In-Reply-To: <7a450679-40ca-8a84-4cba-7a16f22ea3c0@huawei.com>

First two patches are to do cleanup, remove redundant assignment,
and change return type of sctp_get_port_local.
Third patch is to fix memory leak for sctp_do_bind if failed
to bind address.

---
 v2: add one patch to change return type of sctp_get_port_local.
---
Mao Wenan (3):
  sctp: change return type of sctp_get_port_local
  sctp: remove redundant assignment when call sctp_get_port_local
  sctp: destroy bucket if failed to bind addr

 net/sctp/socket.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

-- 
2.20.1


^ permalink raw reply

* [PATCH v2 net 3/3] sctp: destroy bucket if failed to bind addr
From: Mao Wenan @ 2019-09-12  4:02 UTC (permalink / raw)
  To: vyasevich, nhorman, marcelo.leitner, davem
  Cc: linux-sctp, netdev, linux-kernel, kernel-janitors, Mao Wenan,
	Hulk Robot
In-Reply-To: <20190912040219.67517-1-maowenan@huawei.com>

There is one memory leak bug report:
BUG: memory leak
unreferenced object 0xffff8881dc4c5ec0 (size 40):
  comm "syz-executor.0", pid 5673, jiffies 4298198457 (age 27.578s)
  hex dump (first 32 bytes):
    02 00 00 00 81 88 ff ff 00 00 00 00 00 00 00 00  ................
    f8 63 3d c1 81 88 ff ff 00 00 00 00 00 00 00 00  .c=.............
  backtrace:
    [<0000000072006339>] sctp_get_port_local+0x2a1/0xa00 [sctp]
    [<00000000c7b379ec>] sctp_do_bind+0x176/0x2c0 [sctp]
    [<000000005be274a2>] sctp_bind+0x5a/0x80 [sctp]
    [<00000000b66b4044>] inet6_bind+0x59/0xd0 [ipv6]
    [<00000000c68c7f42>] __sys_bind+0x120/0x1f0 net/socket.c:1647
    [<000000004513635b>] __do_sys_bind net/socket.c:1658 [inline]
    [<000000004513635b>] __se_sys_bind net/socket.c:1656 [inline]
    [<000000004513635b>] __x64_sys_bind+0x3e/0x50 net/socket.c:1656
    [<0000000061f2501e>] do_syscall_64+0x72/0x2e0 arch/x86/entry/common.c:296
    [<0000000003d1e05e>] entry_SYSCALL_64_after_hwframe+0x49/0xbe

This is because in sctp_do_bind, if sctp_get_port_local is to
create hash bucket successfully, and sctp_add_bind_addr failed
to bind address, e.g return -ENOMEM, so memory leak found, it
needs to destroy allocated bucket.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
---
 net/sctp/socket.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 2f810078c91d..69ec3b796197 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -412,11 +412,13 @@ static int sctp_do_bind(struct sock *sk, union sctp_addr *addr, int len)
 	ret = sctp_add_bind_addr(bp, addr, af->sockaddr_len,
 				 SCTP_ADDR_SRC, GFP_ATOMIC);
 
-	/* Copy back into socket for getsockname() use. */
-	if (!ret) {
-		inet_sk(sk)->inet_sport = htons(inet_sk(sk)->inet_num);
-		sp->pf->to_sk_saddr(addr, sk);
+	if (ret) {
+		sctp_put_port(sk);
+		return ret;
 	}
+	/* Copy back into socket for getsockname() use. */
+	inet_sk(sk)->inet_sport = htons(inet_sk(sk)->inet_num);
+	sp->pf->to_sk_saddr(addr, sk);
 
 	return ret;
 }
-- 
2.20.1


^ permalink raw reply related

* [PATCH v2 net 2/3] sctp: remove redundant assignment when call sctp_get_port_local
From: Mao Wenan @ 2019-09-12  4:02 UTC (permalink / raw)
  To: vyasevich, nhorman, marcelo.leitner, davem
  Cc: linux-sctp, netdev, linux-kernel, kernel-janitors, Mao Wenan
In-Reply-To: <20190912040219.67517-1-maowenan@huawei.com>

There are more parentheses in if clause when call sctp_get_port_local
in sctp_do_bind, and redundant assignment to 'ret'. This patch is to
do cleanup.

Signed-off-by: Mao Wenan <maowenan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
---
 net/sctp/socket.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 5e1934c48709..2f810078c91d 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -399,9 +399,8 @@ static int sctp_do_bind(struct sock *sk, union sctp_addr *addr, int len)
 	 * detection.
 	 */
 	addr->v4.sin_port = htons(snum);
-	if ((ret = sctp_get_port_local(sk, addr))) {
+	if (sctp_get_port_local(sk, addr))
 		return -EADDRINUSE;
-	}
 
 	/* Refresh ephemeral port.  */
 	if (!bp->port)
-- 
2.20.1


^ permalink raw reply related

* [PATCH v2 net 1/3] sctp: change return type of sctp_get_port_local
From: Mao Wenan @ 2019-09-12  4:02 UTC (permalink / raw)
  To: vyasevich, nhorman, marcelo.leitner, davem
  Cc: linux-sctp, netdev, linux-kernel, kernel-janitors, Mao Wenan
In-Reply-To: <20190912040219.67517-1-maowenan@huawei.com>

Currently sctp_get_port_local() returns a long
which is either 0,1 or a pointer casted to long.
It's neither of the callers use the return value since
commit 62208f12451f ("net: sctp: simplify sctp_get_port").
Now two callers are sctp_get_port and sctp_do_bind,
they actually assumend a casted to an int was the same as
a pointer casted to a long, and they don't save the return
value just check whether it is zero or non-zero, so
it would better change return type from long to int for
sctp_get_port_local.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/sctp/socket.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 9d1f83b10c0a..5e1934c48709 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -309,7 +309,7 @@ static int sctp_bind(struct sock *sk, struct sockaddr *addr, int addr_len)
 	return retval;
 }
 
-static long sctp_get_port_local(struct sock *, union sctp_addr *);
+static int sctp_get_port_local(struct sock *, union sctp_addr *);
 
 /* Verify this is a valid sockaddr. */
 static struct sctp_af *sctp_sockaddr_af(struct sctp_sock *opt,
@@ -7998,7 +7998,7 @@ static void sctp_unhash(struct sock *sk)
 static struct sctp_bind_bucket *sctp_bucket_create(
 	struct sctp_bind_hashbucket *head, struct net *, unsigned short snum);
 
-static long sctp_get_port_local(struct sock *sk, union sctp_addr *addr)
+static int sctp_get_port_local(struct sock *sk, union sctp_addr *addr)
 {
 	struct sctp_sock *sp = sctp_sk(sk);
 	bool reuse = (sk->sk_reuse || sp->reuse);
@@ -8108,7 +8108,7 @@ static long sctp_get_port_local(struct sock *sk, union sctp_addr *addr)
 
 			if (sctp_bind_addr_conflict(&ep2->base.bind_addr,
 						    addr, sp2, sp)) {
-				ret = (long)sk2;
+				ret = 1;
 				goto fail_unlock;
 			}
 		}
@@ -8180,7 +8180,7 @@ static int sctp_get_port(struct sock *sk, unsigned short snum)
 	addr.v4.sin_port = htons(snum);
 
 	/* Note: sk->sk_num gets filled in if ephemeral port request. */
-	return !!sctp_get_port_local(sk, &addr);
+	return sctp_get_port_local(sk, &addr);
 }
 
 /*
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH net v2 01/11] net: core: limit nested device depth
From: Taehee Yoo @ 2019-09-12  3:56 UTC (permalink / raw)
  To: David Miller
  Cc: Netdev, j.vosburgh, vfalico, Andy Gospodarek,
	Jiří Pírko, sd, Roopa Prabhu, saeedm, manishc,
	rahulv, kys, haiyangz, sthemmin, sashal, hare, varun, ubraun,
	kgraul, Jay Vosburgh
In-Reply-To: <20190912.003209.917226424625610557.davem@davemloft.net>

On Thu, 12 Sep 2019 at 07:32, David Miller <davem@davemloft.net> wrote:
>

Hi David
Thank you for the review!

> From: Taehee Yoo <ap420073@gmail.com>
> Date: Sat,  7 Sep 2019 22:45:32 +0900
>
> > Current code doesn't limit the number of nested devices.
> > Nested devices would be handled recursively and this needs huge stack
> > memory. So, unlimited nested devices could make stack overflow.
>  ...
> > Splat looks like:
> > [  140.483124] BUG: looking up invalid subclass: 8
> > [  140.483505] turning off the locking correctness validator.
>
> The limit here is not stack memory, but a limit in the lockdep
> validator, which can probably be fixed by other means.
>
> This was the feedback I saw given for the previous version of
> this series as well.

I just realized this commit message is not enough for describing the problems.
It looks like that "invalid subclass" makes panic.
But this is not.
The panic is not related to "invalid subclass" lockdep problem.

There are two splats.
1. [  140.483124] BUG: looking up invalid subclass: 8
2. [  168.446539] BUG: KASAN: slab-out-of-bounds in __unwind_start+0x71/0x850
    [  168.794493] Rebooting in 5 seconds..


The first message is just a warning message of lockdep because of too deep
lockdep subclasses and it doesn't make any problem and panic.
This message can be ignored right now because other patches of
this patchset avoids this problem using dynamic lockdep key.

The second message is a panic message.
This is stack overflow and it occurs because of too deep nested devices.
The goal of this patch is to fix this stack overflow problem.

I tested with this reproducer commands without lockdep.

    ip link add dummy0 type dummy
    ip link add link dummy0 name vlan1 type vlan id 1
    ip link set vlan1 up

    for i in {2..200}
    do
            let A=$i-1

            ip link add name vlan$i link vlan$A type vlan id $i
    done
    ip link del vlan1 <-- this command is added.

Splat looks like:
[  181.594092] Thread overran stack, or stack corrupted
[  181.594726] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  181.595417] CPU: 1 PID: 1402 Comm: ip Not tainted 5.3.0-rc7+ #173
[  181.596193] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
VirtualBox 12/01/2006
[  181.605986] RIP: 0010:stack_depot_fetch+0x10/0x30
[  181.606588] Code: 00 75 10 48 8b 73 18 48 89 ef 5b 5d e9 59 bf 89
ff 0f 0b e8 02 3f 9d ff eb e9 89 f8 c1 ef 110
[  181.609820] RSP: 0018:ffff8880cbebedd8 EFLAGS: 00010006
[  181.610485] RAX: 00000000001fffff RBX: ffff8880cbebfc00 RCX: 0000000000000000
[  181.611394] RDX: 000000000000001d RSI: ffff8880cbebede0 RDI: 0000000000003ff0
[  181.612297] RBP: ffffea00032fae00 R08: ffffed101b5a3eab R09: ffffed101b5a3eab
[  181.613222] R10: 0000000000000001 R11: ffffed101b5a3eaa R12: ffff8880d6115e80
[  181.614148] R13: ffff8880cbebeac0 R14: ffff8880cbebfc00 R15: ffff8880cbebef80
[  181.615053] FS:  00007f46140510c0(0000) GS:ffff8880dad00000(0000)
knlGS:0000000000000000
[  181.616085] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  181.616841] CR2: ffffffffb9a7a0d8 CR3: 00000000bc356003 CR4: 00000000000606e0
[  181.635748] Call Trace:
[  181.635996] Modules linked in: 8021q garp stp mrp llc dummy veth
openvswitch nsh nf_conncount nf_nat nf_conntrs
[  181.637360] CR2: ffffffffb9a7a0d8
[  181.637670] ---[ end trace f890ce3e5c51ceb4 ]---
[  181.638096] RIP: 0010:stack_depot_fetch+0x10/0x30
[  181.638524] Code: 00 75 10 48 8b 73 18 48 89 ef 5b 5d e9 59 bf 89
ff 0f 0b e8 02 3f 9d ff eb e9 89 f8 c1 ef 110
[  181.805441] RSP: 0018:ffff8880cbebedd8 EFLAGS: 00010006
[  181.900192] RAX: 00000000001fffff RBX: ffff8880cbebfc00 RCX: 0000000000000000
[  181.901119] RDX: 000000000000001d RSI: ffff8880cbebede0 RDI: 0000000000003ff0
[  181.902038] RBP: ffffea00032fae00 R08: ffffed101b5a3eab R09: ffffed101b5a3eab
[  181.902960] R10: 0000000000000001 R11: ffffed101b5a3eaa R12: ffff8880d6115e80
[  181.903885] R13: ffff8880cbebeac0 R14: ffff8880cbebfc00 R15: ffff8880cbebef80
[  181.904825] FS:  00007f46140510c0(0000) GS:ffff8880dad00000(0000)
knlGS:0000000000000000
[  181.905862] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  181.906604] CR2: ffffffffb9a7a0d8 CR3: 00000000bc356003 CR4: 00000000000606e0
[  181.907525] Kernel panic - not syncing: Fatal exception
[  181.908179] Kernel Offset: 0x34000000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffff)
[  181.909176] Rebooting in 5 seconds..

^ permalink raw reply

* Re: [PATCH bpf-next v10 2/4] bpf: new helper to obtain namespace data from current task New bpf helper bpf_get_current_pidns_info.
From: Yonghong Song @ 2019-09-12  5:49 UTC (permalink / raw)
  To: Eric W. Biederman, Al Viro
  Cc: Carlos Antonio Neira Bustos, netdev@vger.kernel.org,
	brouer@redhat.com, bpf@vger.kernel.org
In-Reply-To: <87o8zr8cz3.fsf@x220.int.ebiederm.org>



On 9/11/19 9:16 AM, Eric W. Biederman wrote:
> Al Viro <viro@zeniv.linux.org.uk> writes:
> 
>> On Tue, Sep 10, 2019 at 10:35:09PM +0000, Yonghong Song wrote:
>>>
>>> Carlos,
>>>
>>> Discussed with Eric today for what is the best way to get
>>> the device number for a namespace. The following patch seems
>>> a reasonable start although Eric would like to see
>>> how the helper is used in order to decide whether the
>>> interface looks right.
>>>
>>> commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2)
>>> Author: Yonghong Song <yhs@fb.com>
>>> Date:   Mon Sep 9 21:50:51 2019 -0700
>>>
>>>       nsfs: add an interface function ns_get_inum_dev()
>>>
>>>       This patch added an interface function
>>>       ns_get_inum_dev(). Given a ns_common structure,
>>>       the function returns the inode and device
>>>       numbers. The function will be used later
>>>       by a newly added bpf helper.
>>>
>>>       Signed-off-by: Yonghong Song <yhs@fb.com>
>>>
>>> diff --git a/fs/nsfs.c b/fs/nsfs.c
>>> index a0431642c6b5..a603c6fc3f54 100644
>>> --- a/fs/nsfs.c
>>> +++ b/fs/nsfs.c
>>> @@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd)
>>>           return ERR_PTR(-EINVAL);
>>>    }
>>>
>>> +/* Get the device number for the current task pidns.
>>> + */
>>> +void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev)
>>> +{
>>> +       *inum = ns->inum;
>>> +       *dev = nsfs_mnt->mnt_sb->s_dev;
>>> +}
>>
>> Umm...  Where would it get the device number once we get (hell knows
>> what for) multiple nsfs instances?  I still don't understand what
>> would that be about, TBH...  Is it really per-userns?  Or something
>> else entirely?  Eric, could you give some context?
> 
> My goal is not to paint things into a corner, with future changes.
> Right now it is possible to stat a namespace file descriptor and
> get a device and inode number.  Then compare that.
> 
> I don't want people using the inode number in nsfd as some magic
> namespace id.
> 
> We have had times in the past where there was more than one superblock
> and thus more than one device number.  Further if userspace ever uses
> this heavily there may be times in the future where for
> checkpoint/restart purposes we will want multiple nsfd's so we can
> preserve the inode number accross a migration.
> 
> Realistically there will probably just some kind of hotplug notification
> to userspace to say we have hotplugged your operatining system as
> a migration notification.
> 
> Now the halway discussion did not quite capture everything I was trying
> to say but it at least got to the right ballpark.
> 
> The helper in fs/nsfs.c should be:
> 
> bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino)
> {
>          return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev));
> }
> 
> That way if/when there are multiple inodes identifying the same
> namespace the bpf programs don't need to change.

Thanks, Eric. This is indeed better. The bpf helper should focus
on comparing dev/ino, instead of return the dev/ino to bpf program.

So overall, nsfs related change will look like:

diff --git a/fs/nsfs.c b/fs/nsfs.c
index a0431642c6b5..7e78d89c2172 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -245,6 +245,11 @@ struct file *proc_ns_fget(int fd)
         return ERR_PTR(-EINVAL);
  }

+bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino)
+{
+       return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev));
+}
+
  static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry)
  {
         struct inode *inode = d_inode(dentry);
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index d31cb6215905..79639807e960 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -81,6 +81,7 @@ extern void *ns_get_path(struct path *path, struct 
task_struct *task,
  typedef struct ns_common *ns_get_path_helper_t(void *);
  extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t 
ns_get_cb,
                             void *private_data);
+extern bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino);

  extern int ns_get_name(char *buf, size_t size, struct task_struct *task,
                         const struct proc_ns_operations *ns_ops);

> 
> Up farther in the stack it should be something like:
> 
>> BPF_CALL_2(bpf_current_pidns_match, dev_t *dev, ino_t *ino)
>> {
>>          return ns_match(&task_active_pid_ns(current)->ns, *dev, *ino);
>> }
>>
>> const struct bpf_func_proto bpf_current_pidns_match_proto = {
>> 	.func		= bpf_current_pins_match,
>> 	.gpl_only	= true,
>> 	.ret_type	= RET_INTEGER
>> 	.arg1_type	= ARG_PTR_TO_DEVICE_NUMBER,
>> 	.arg2_type	= ARG_PTR_TO_INODE_NUMBER,
>> };
> 
> That allows comparing what the bpf came up with with whatever value
> userspace generated by stating the file descriptor.
> 
> 
> That is the least bad suggestion I currently have for that
> functionality.  It really would be better to not have that filter in the
> bpf program itself but in the infrastructure that binds a program to a
> set of tasks.
> 
> The problem with this approach is whatever device/inode you have when
> the namespace they refer to exits there is the possibility that the
> inode will be reused.  So your filter will eventually start matching on
> the wrong thing.

I come up with a differeent helper definition, which is much more
similar to existing bpf_get_current_pid_tgid() and helper definition
much more conforms to bpf convention.

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 5e28718928ca..bc26903c80c7 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -11,6 +11,8 @@
  #include <linux/uidgid.h>
  #include <linux/filter.h>
  #include <linux/ctype.h>
+#include <linux/pid_namespace.h>
+#include <linux/proc_ns.h>

  #include "../../lib/kstrtox.h"

@@ -487,3 +489,33 @@ const struct bpf_func_proto bpf_strtoul_proto = {
         .arg4_type      = ARG_PTR_TO_LONG,
  };
  #endif
+
+BPF_CALL_2(bpf_get_ns_current_pid_tgid, u32, dev, u32, inum)
+{
+       struct task_struct *task = current;
+       struct pid_namespace *pidns;
+       pid_t pid, tgid;
+
+       if (unlikely(!task))
+               return -EINVAL;
+
+
+       pidns = task_active_pid_ns(task);
+       if (unlikely(!pidns))
+               return -ENOENT;
+
+       if (!ns_match(&pidns->ns, (dev_t)dev, inum))
+               return -EINVAL;
+
+       pid = task_pid_nr_ns(task, pidns);
+       tgid = task_tgid_nr_ns(task, pidns);
+
+       return (u64) tgid << 32 | pid;
+}
+
+const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto = {
+       .func           = bpf_get_ns_current_pid_tgid,
+       .gpl_only       = false,
+       .ret_type       = RET_INTEGER,
+       .arg1_type      = ARG_ANYTHING,
+       .arg2_type      = ARG_ANYTHING,
+};

Existing usage of bpf_get_current_pid_tgid() can be converted
to bpf_get_ns_current_pid_tgid() if ns dev/inode number
is supplied. For bpf_get_ns_current_pid_tgid(), checking
return value ( < 0 or not) is needed.

> 
> Eric
> 

^ permalink raw reply related

* Re: [PATCH V2 net-next 1/7] net: hns3: add ethtool_ops.set_channels support for HNS3 VF driver
From: Michal Kubecek @ 2019-09-12  6:23 UTC (permalink / raw)
  To: netdev
  Cc: Huazhong Tan, davem, linux-kernel, salil.mehta, yisen.zhuang,
	linuxarm, jakub.kicinski
In-Reply-To: <1568169639-43658-2-git-send-email-tanhuazhong@huawei.com>

On Wed, Sep 11, 2019 at 10:40:33AM +0800, Huazhong Tan wrote:
> From: Guangbin Huang <huangguangbin2@huawei.com>
> 
> This patch adds ethtool_ops.set_channels support for HNS3 VF driver,
> and updates related TQP information and RSS information, to support
> modification of VF TQP number, and uses current rss_size instead of
> max_rss_size to initialize RSS.
> 
> Also, fixes a format error in hclgevf_get_rss().
> 
> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
> ---
>  drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c |  1 +
>  .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 83 ++++++++++++++++++++--
>  2 files changed, 79 insertions(+), 5 deletions(-)
> 
...
> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
> index 594cae8..e3090b3 100644
> --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
...
> +static void hclgevf_update_rss_size(struct hnae3_handle *handle,
> +				    u32 new_tqps_num)
> +{
> +	struct hnae3_knic_private_info *kinfo = &handle->kinfo;
> +	struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle);
> +	u16 max_rss_size;
> +
> +	kinfo->req_rss_size = new_tqps_num;
> +
> +	max_rss_size = min_t(u16, hdev->rss_size_max,
> +			     hdev->num_tqps / kinfo->num_tc);
> +
> +	/* Use the user's configuration when it is not larger than
> +	 * max_rss_size, otherwise, use the maximum specification value.
> +	 */
> +	if (kinfo->req_rss_size != kinfo->rss_size && kinfo->req_rss_size &&
> +	    kinfo->req_rss_size <= max_rss_size)
> +		kinfo->rss_size = kinfo->req_rss_size;
> +	else if (kinfo->rss_size > max_rss_size ||
> +		 (!kinfo->req_rss_size && kinfo->rss_size < max_rss_size))
> +		kinfo->rss_size = max_rss_size;

I don't think requested channel count can be larger than max_rss_size
here. In ethtool_set_channels(), we check that requested channel counts
do not exceed maximum channel counts as reported by ->get_channels().
And hclgevf_get_max_channels() cannot return more than max_rss_size.

> +
> +	kinfo->num_tqps = kinfo->num_tc * kinfo->rss_size;
> +}
> +
> +static int hclgevf_set_channels(struct hnae3_handle *handle, u32 new_tqps_num,
> +				bool rxfh_configured)
> +{
> +	struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle);
> +	struct hnae3_knic_private_info *kinfo = &handle->kinfo;
> +	u16 cur_rss_size = kinfo->rss_size;
> +	u16 cur_tqps = kinfo->num_tqps;
> +	u32 *rss_indir;
> +	unsigned int i;
> +	int ret;
> +
> +	hclgevf_update_rss_size(handle, new_tqps_num);
> +
> +	ret = hclgevf_set_rss_tc_mode(hdev, kinfo->rss_size);
> +	if (ret)
> +		return ret;
> +
> +	/* RSS indirection table has been configuared by user */
> +	if (rxfh_configured)
> +		goto out;
> +
> +	/* Reinitializes the rss indirect table according to the new RSS size */
> +	rss_indir = kcalloc(HCLGEVF_RSS_IND_TBL_SIZE, sizeof(u32), GFP_KERNEL);
> +	if (!rss_indir)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < HCLGEVF_RSS_IND_TBL_SIZE; i++)
> +		rss_indir[i] = i % kinfo->rss_size;
> +
> +	ret = hclgevf_set_rss(handle, rss_indir, NULL, 0);
> +	if (ret)
> +		dev_err(&hdev->pdev->dev, "set rss indir table fail, ret=%d\n",
> +			ret);
> +
> +	kfree(rss_indir);
> +
> +out:
> +	if (!ret)
> +		dev_info(&hdev->pdev->dev,
> +			 "Channels changed, rss_size from %u to %u, tqps from %u to %u",
> +			 cur_rss_size, kinfo->rss_size,
> +			 cur_tqps, kinfo->rss_size * kinfo->num_tc);
> +
> +	return ret;
> +}

IIRC David asked you not to issue this log message in v1 review.

Michal Kubecek

^ permalink raw reply

* Re: VRF Issue Since kernel 5
From: Gowen @ 2019-09-12  6:50 UTC (permalink / raw)
  To: David Ahern, netdev@vger.kernel.org
In-Reply-To: <b300235a-00f8-0689-8544-9db07cbd1e21@gmail.com>


Hi David -thanks for getting back to me



The DNS servers are 10.24.65.203 or 10.24.64.203 which you want to go

out mgmt-vrf. correct? No - 10.24.65.203 10.25.65.203, so should hit the route leak rule as below (if I've put the 10.24.64.0/24 subnet anywhere it is a typo)

vmAdmin@NETM06:~$ ip ro get 10.24.65.203 fibmatch
10.24.65.0/24 via 10.24.12.1 dev eth0


I've added the 127/8 route - no difference.

The reason for what you might think is an odd design is that I wanted any non-VRF aware users to be able to come in and run all commands in default context without issue, while production and mgmt traffic was separated still

DNS is now working as long as /etc/resolv.conf is populated with my DNS servers - a lot of people would be using this on Azure which uses netplan, so they'll have the same issue, is there documentation I could update or raise a bug to check the systemd-resolve servers as well?

Gareth


From: David Ahern <dsahern@gmail.com>

Sent: 11 September 2019 18:02

To: Gowen <gowen@potatocomputing.co.uk>; netdev@vger.kernel.org <netdev@vger.kernel.org>

Subject: Re: VRF Issue Since kernel 5

 


At LPC this week and just now getting a chance to process the data you sent.



On 9/9/19 8:46 AM, Gowen wrote:

> the production traffic is all in the 10.0.0.0/8 network (eth1 global VRF) except for a few subnets (DNS) which are routed out eth0 (mgmt-vrf)

> 

> 

> Admin@NETM06:~$ ip route show

> default via 10.24.12.1 dev eth0

> 10.0.0.0/8 via 10.24.12.1 dev eth1

> 10.24.12.0/24 dev eth1 proto kernel scope link src 10.24.12.9

> 10.24.65.0/24 via 10.24.12.1 dev eth0

> 10.25.65.0/24 via 10.24.12.1 dev eth0

> 10.26.0.0/21 via 10.24.12.1 dev eth0

> 10.26.64.0/21 via 10.24.12.1 dev eth0



interesting route table. This is default VRF but you have route leaking

through eth0 which is in mgmt-vrf.



> 

> 

> Admin@NETM06:~$ ip route show vrf mgmt-vrf

> default via 10.24.12.1 dev eth0

> unreachable default metric 4278198272

> 10.24.12.0/24 dev eth0 proto kernel scope link src 10.24.12.10

> 10.24.65.0/24 via 10.24.12.1 dev eth0

> 10.25.65.0/24 via 10.24.12.1 dev eth0

> 10.26.0.0/21 via 10.24.12.1 dev eth0

> 10.26.64.0/21 via 10.24.12.1 dev eth0



The DNS servers are 10.24.65.203 or 10.24.64.203 which you want to go

out mgmt-vrf. correct?



10.24.65.203 should hit the route "10.24.65.0/24 via 10.24.12.1 dev

eth0" for both default VRF and mgmt-vrf.



10.24.64.203 will NOT hit a route leak entry so traverse the VRF

associated with the context of the command (mgmt-vrf or default). Is

that intentional? (verify with: `ip ro get 10.24.64.203 fibmatch` and

`ip ro get 10.24.64.203 vrf mgmt-vrf fibmatch`)





> 

> 

> 

> The strange activity occurs when I enter the command “sudo apt update” as I can resolve the DNS request (10.24.65.203 or 10.24.64.203, verified with tcpdump) out eth0 but for the actual update traffic there is no activity:

> 

> 

> sudo tcpdump -i eth0 '(host 10.24.65.203 or host 10.25.65.203) and port 53' -n

> <OUTPUT OMITTED FOR BREVITY>

> 10:06:05.268735 IP 10.24.12.10.39963 > 10.24.65.203.53: 48798+ [1au] A? security.ubuntu.com. (48)

> <OUTPUT OMITTED FOR BREVITY>

> 10:06:05.284403 IP 10.24.65.203.53 > 10.24.12.10.39963: 48798 13/0/1 A 91.189.91.23, A 91.189.88.24, A 91.189.91.26, A 91.189.88.162, A 91.189.88.149, A 91.189.91.24, A 91.189.88.173, A 91.189.88.177, A 91.189.88.31, A 91.189.91.14, A 91.189.88.176, A 91.189.88.175,
 A 91.189.88.174 (256)

> 

> 

> 

> You can see that the update traffic is returned but is not accepted by the stack and a RST is sent

> 

> 

> Admin@NETM06:~$ sudo tcpdump -i eth0 '(not host 168.63.129.16 and port 80)' -n

> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

> listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes

> 10:17:12.690658 IP 10.24.12.10.40216 > 91.189.88.175.80: Flags [S], seq 2279624826, win 64240, options [mss 1460,sackOK,TS val 2029365856 ecr 0,nop,wscale 7], length 0

> 10:17:12.691929 IP 10.24.12.10.52362 > 91.189.95.83.80: Flags [S], seq 1465797256, win 64240, options [mss 1460,sackOK,TS val 3833463674 ecr 0,nop,wscale 7], length 0

> 10:17:12.696270 IP 91.189.88.175.80 > 10.24.12.10.40216: Flags [S.], seq 968450722, ack 2279624827, win 28960, options [mss 1418,sackOK,TS val 81957103 ecr 2029365856,nop,wscale 7], length 0                                                                                      
                                      

> 10:17:12.696301 IP 10.24.12.10.40216 > 91.189.88.175.80: Flags [R], seq 2279624827, win 0, length 0

> 10:17:12.697884 IP 91.189.95.83.80 > 10.24.12.10.52362: Flags [S.], seq 4148330738, ack 1465797257, win 28960, options [mss 1418,sackOK,TS val 2257624414 ecr 3833463674,nop,wscale 8], length 0                                                                                                                         

> 10:17:12.697909 IP 10.24.12.10.52362 > 91.189.95.83.80: Flags [R], seq 1465797257, win 0, length 0

> 

> 

> 

> 

> I can emulate the DNS lookup using netcat in the vrf:

> 

> 

> sudo ip vrf exec mgmt-vrf nc -u 10.24.65.203 53

> 



`ip vrf exec mgmt-vrf <COMMAND>` means that every IPv4 and IPv6 socket

opened by <COMMAND> is automatically bound to mgmt-vrf which causes

route lookups to hit the mgmt-vrf table.



Just running <COMMAND> (without binding to any vrf) means no socket is

bound to anything unless the command does a bind. In that case the

routing lookups determine which egress device is used.



Now the response comes back, if the ingress interface is a VRF then the

socket lookup wants to match on a device.



Now, a later response shows this for DNS lookups:



  isc-worker0000 20261 [000]  2215.013849: fib:fib_table_lookup: table

10 oif 0 iif 0 proto 0 0.0.0.0/0 -> 127.0.0.1/0 tos 0 scope 0 flags 0

==> dev eth0 gw 10.24.12.1 src 10.24.12.10 err 0

  isc-worker0000 20261 [000]  2215.013915: fib:fib_table_lookup: table

10 oif 4 iif 1 proto 17 0.0.0.0/52138 -> 127.0.0.53/53 tos 0 scope 0

flags 4 ==> dev eth0 gw 10.24.12.1 src 10.24.12.10 err 0

  isc-worker0000 20261 [000]  2220.014006: fib:fib_table_lookup: table

10 oif 4 iif 1 proto 17 0.0.0.0/52138 -> 127.0.0.53/53 tos 0 scope 0

flags 4 ==> dev eth0 gw 10.24.12.1 src 10.24.12.10 err 0



which suggests your process is passing off the DNS lookup to a local

process (isc-worker) and it hits the default route for mgmt-vrf when it

is trying to connect to a localhost address.



For mgmt-vrf I suggest always adding 127.0.0.1/8 to the mgmt vrf device

(and ::1/128 for IPv6 starting with 5.x kernels - I forget the exact

kernel version).



That might solve your problem; it might not.



(BTW: Cumulus uses fib rules for DNS servers to force DNS packets out

the mgmt-vrf interface.)


^ permalink raw reply

* Re: VRF Issue Since kernel 5
From: Gowen @ 2019-09-12  6:54 UTC (permalink / raw)
  To: David Ahern, Alexis Bauvin, mmanning@vyatta.att-mail.com
  Cc: netdev@vger.kernel.org
In-Reply-To: <0fd88da3-a7b1-d2e5-f5b8-0095220a7cc0@gmail.com>


currently:

vmAdmin@NETM06:~$ uname -r
5.0.0-1018-azure

vmAdmin@NETM06:~$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.3 LTS"


I don't keep a history of kernel versions on test but I noticed it had gone to kernel 5 and stopped working - I'm about 80% sure that happened at the same time - I'll try and dig out some logs today to see what I can find for you as Linux is fairly new to me

Gareth





From: David Ahern <dsahern@gmail.com>

Sent: 11 September 2019 17:09

To: Gowen <gowen@potatocomputing.co.uk>; Alexis Bauvin <abauvin@online.net>; mmanning@vyatta.att-mail.com <mmanning@vyatta.att-mail.com>

Cc: netdev@vger.kernel.org <netdev@vger.kernel.org>

Subject: Re: VRF Issue Since kernel 5

 


On 9/11/19 3:01 PM, Gowen wrote:

> Hi all,

> 

> It looks like ip vrf exec checks /etc/resolv.conf (found with strace -e

> trace=file sudo ip vrf exec mgmt-vrf host www.google.co.uk &>

> ~/straceFileOfVrfHost.txt) , but as I'm on an Azure machine using

> netplan, this file isn't updated with DNS servers. I have added my DNS

> server to resolv.conf and now can update the cache with "sudo ip vrf

> exec sudo apt update", if I am correct (which I'm not sure about as not

> really my area) then this might be affecting more than just me.

> 

> Also still not able to fix the updating cache from global VRF - which

> would cause bother in prod environment to others as well so think it

> would be good to get an RCA for it?

> 

> thanks for your help so far, has been really interesting.

> 

> Gareth

> 

> 

> ------------------------------------------------------------------------

> *From:* Gowen <gowen@potatocomputing.co.uk>

> *Sent:* 11 September 2019 13:48

> *To:* David Ahern <dsahern@gmail.com>; Alexis Bauvin

> <abauvin@online.net>; mmanning@vyatta.att-mail.com

> <mmanning@vyatta.att-mail.com>

> *Cc:* netdev@vger.kernel.org <netdev@vger.kernel.org>

> *Subject:* Re: VRF Issue Since kernel 5

>  

> yep no problem:

> 

> Admin@NETM06:~$ sudo sysctl -a | grep l3mdev

> net.ipv4.raw_l3mdev_accept = 1

> net.ipv4.tcp_l3mdev_accept = 1

> net.ipv4.udp_l3mdev_accept = 1

> 

> 

> The source of the DNS issue in the vrf exec command is something to do

> with networkd managing the DNS servers, I can fix it by explicitly

> mentioning the DNS server:

> 

> systemd-resolve --status --no-page

> 

> <OUTPUT OMITTED>

> 

> Link 4 (mgmt-vrf)

>       Current Scopes: none

>        LLMNR setting: yes

> MulticastDNS setting: no

>       DNSSEC setting: no

>     DNSSEC supported: no

> 

> Link 3 (eth1)

>       Current Scopes: DNS

>        LLMNR setting: yes

> MulticastDNS setting: no

>       DNSSEC setting: no

>     DNSSEC supported: no

>          DNS Servers: 10.24.65.203

>                       10.24.65.204

>                       10.25.65.203

>                       10.25.65.204

>           DNS Domain: reddog.microsoft.com

> 

> Link 2 (eth0)

>       Current Scopes: DNS

>        LLMNR setting: yes

> MulticastDNS setting: no

>       DNSSEC setting: no

>     DNSSEC supported: no

>          DNS Servers: 10.24.65.203

>                       10.24.65.204

>                       10.25.65.203

>                       10.25.65.204

>           DNS Domain: reddog.microsoft.com

> 

> there is no DNS server when I use ip vrf exec command (tcpdump shows

> only loopback traffic when invoked without my DNS sever explicitly

> entered) - odd as mgmt-vrf isnt L3 device so thought it would pick up

> eth0 DNS servers?

> 

> I dont think this helps with my update cache traffic from global vrf

> though on port 80

> 



Let's back up a bit: your subject line says vrf issue since kernel 5.

Did you update / change the OS as well?



ie., the previous version that worked what is the OS and kernel version?

What is the OS and kernel version with the problem?


^ permalink raw reply

* KASAN: use-after-free Read in __xfrm_decode_session
From: syzbot @ 2019-09-12  8:02 UTC (permalink / raw)
  To: davem, herbert, linux-kernel, netdev, steffen.klassert,
	syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    6d028043 Add linux-next specific files for 20190830
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=150de9b9600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=82a6bec43ab0cb69
dashboard link: https://syzkaller.appspot.com/bug?extid=55d9cf7c57894c1e4860
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+55d9cf7c57894c1e4860@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: use-after-free in decode_session6 net/xfrm/xfrm_policy.c:3390  
[inline]
BUG: KASAN: use-after-free in __xfrm_decode_session+0x1cfb/0x2e90  
net/xfrm/xfrm_policy.c:3482
Read of size 1 at addr ffff88805cdb288e by task syz-executor.0/24778

CPU: 0 PID: 24778 Comm: syz-executor.0 Not tainted 5.3.0-rc6-next-20190830  
#75
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
  __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
  kasan_report+0x12/0x20 mm/kasan/common.c:634
  __asan_report_load1_noabort+0x14/0x20 mm/kasan/generic_report.c:129
  decode_session6 net/xfrm/xfrm_policy.c:3390 [inline]
  __xfrm_decode_session+0x1cfb/0x2e90 net/xfrm/xfrm_policy.c:3482
  xfrm_decode_session_reverse include/net/xfrm.h:1144 [inline]
  icmpv6_route_lookup+0x31b/0x4d0 net/ipv6/icmp.c:369
  icmp6_send+0x129e/0x1e20 net/ipv6/icmp.c:557
  icmpv6_send+0xec/0x230 net/ipv6/ip6_icmp.c:43
  ip6_link_failure+0x2b/0x530 net/ipv6/route.c:2640
  dst_link_failure include/net/dst.h:419 [inline]
  ip6_tnl_xmit+0x45b/0x33f0 net/ipv6/ip6_tunnel.c:1222
  ip6ip6_tnl_xmit net/ipv6/ip6_tunnel.c:1376 [inline]
  ip6_tnl_start_xmit+0x847/0x1c90 net/ipv6/ip6_tunnel.c:1402
  __netdev_start_xmit include/linux/netdevice.h:4419 [inline]
  netdev_start_xmit include/linux/netdevice.h:4433 [inline]
  xmit_one net/core/dev.c:3280 [inline]
  dev_hard_start_xmit+0x1a3/0x9c0 net/core/dev.c:3296
  sch_direct_xmit+0x372/0xc30 net/sched/sch_generic.c:308
  qdisc_restart net/sched/sch_generic.c:371 [inline]
  __qdisc_run+0x586/0x19d0 net/sched/sch_generic.c:379
  __dev_xmit_skb net/core/dev.c:3533 [inline]
  __dev_queue_xmit+0x16f1/0x37c0 net/core/dev.c:3838
  dev_queue_xmit+0x18/0x20 net/core/dev.c:3902
  neigh_direct_output+0x16/0x20 net/core/neighbour.c:1530
  neigh_output include/net/neighbour.h:511 [inline]
  ip6_finish_output2+0x1034/0x2550 net/ipv6/ip6_output.c:116
  __ip6_finish_output+0x444/0xaa0 net/ipv6/ip6_output.c:142
  ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:152
  NF_HOOK_COND include/linux/netfilter.h:294 [inline]
  ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:175
  dst_output include/net/dst.h:436 [inline]
  NF_HOOK include/linux/netfilter.h:305 [inline]
  NF_HOOK include/linux/netfilter.h:299 [inline]
  ip6_xmit+0xe35/0x20b0 net/ipv6/ip6_output.c:279
  inet6_csk_xmit+0x2fb/0x5d0 net/ipv6/inet6_connection_sock.c:135
  __tcp_transmit_skb+0x1a2f/0x3820 net/ipv4/tcp_output.c:1158
  tcp_transmit_skb net/ipv4/tcp_output.c:1174 [inline]
  tcp_send_syn_data net/ipv4/tcp_output.c:3505 [inline]
  tcp_connect+0x1be7/0x4320 net/ipv4/tcp_output.c:3570
  tcp_sendmsg_fastopen net/ipv4/tcp.c:1155 [inline]
  tcp_sendmsg_locked+0x2898/0x3220 net/ipv4/tcp.c:1206
  tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1434
  inet6_sendmsg+0x9e/0xe0 net/ipv6/af_inet6.c:576
  sock_sendmsg_nosec net/socket.c:637 [inline]
  sock_sendmsg+0xd7/0x130 net/socket.c:657
  __sys_sendto+0x262/0x380 net/socket.c:1952
  __do_sys_sendto net/socket.c:1964 [inline]
  __se_sys_sendto net/socket.c:1960 [inline]
  __x64_sys_sendto+0xe1/0x1a0 net/socket.c:1960
  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4598e9
Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7  
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff  
ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f83d1625c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00000000004598e9
RDX: 0000000000000001 RSI: 0000000020000340 RDI: 0000000000000003
RBP: 000000000075bfc8 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000200400cf R11: 0000000000000246 R12: 00007f83d16266d4
R13: 00000000004c7880 R14: 00000000004dd188 R15: 00000000ffffffff

Allocated by task 3891:
  save_stack+0x23/0x90 mm/kasan/common.c:69
  set_track mm/kasan/common.c:77 [inline]
  __kasan_kmalloc mm/kasan/common.c:510 [inline]
  __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:483
  kasan_kmalloc+0x9/0x10 mm/kasan/common.c:524
  __do_kmalloc mm/slab.c:3655 [inline]
  __kmalloc+0x163/0x770 mm/slab.c:3664
  kmalloc include/linux/slab.h:561 [inline]
  tomoyo_realpath_from_path+0xcd/0x7b0 security/tomoyo/realpath.c:277
  tomoyo_get_realpath security/tomoyo/file.c:151 [inline]
  tomoyo_path_number_perm+0x1dd/0x520 security/tomoyo/file.c:723
  tomoyo_file_ioctl+0x23/0x30 security/tomoyo/tomoyo.c:335
  security_file_ioctl+0x77/0xc0 security/security.c:1409
  ksys_ioctl+0x57/0xd0 fs/ioctl.c:711
  __do_sys_ioctl fs/ioctl.c:720 [inline]
  __se_sys_ioctl fs/ioctl.c:718 [inline]
  __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 3891:
  save_stack+0x23/0x90 mm/kasan/common.c:69
  set_track mm/kasan/common.c:77 [inline]
  kasan_set_free_info mm/kasan/common.c:332 [inline]
  __kasan_slab_free+0x102/0x150 mm/kasan/common.c:471
  kasan_slab_free+0xe/0x10 mm/kasan/common.c:480
  __cache_free mm/slab.c:3425 [inline]
  kfree+0x10a/0x2c0 mm/slab.c:3756
  tomoyo_realpath_from_path+0x1de/0x7b0 security/tomoyo/realpath.c:319
  tomoyo_get_realpath security/tomoyo/file.c:151 [inline]
  tomoyo_path_number_perm+0x1dd/0x520 security/tomoyo/file.c:723
  tomoyo_file_ioctl+0x23/0x30 security/tomoyo/tomoyo.c:335
  security_file_ioctl+0x77/0xc0 security/security.c:1409
  ksys_ioctl+0x57/0xd0 fs/ioctl.c:711
  __do_sys_ioctl fs/ioctl.c:720 [inline]
  __se_sys_ioctl fs/ioctl.c:718 [inline]
  __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff88805cdb2000
  which belongs to the cache kmalloc-4k of size 4096
The buggy address is located 2190 bytes inside of
  4096-byte region [ffff88805cdb2000, ffff88805cdb3000)
The buggy address belongs to the page:
page:ffffea0001736c80 refcount:1 mapcount:0 mapping:ffff8880aa402000  
index:0x0 compound_mapcount: 0
flags: 0x1fffc0000010200(slab|head)
raw: 01fffc0000010200 ffffea0001941688 ffffea0002364008 ffff8880aa402000
raw: 0000000000000000 ffff88805cdb2000 0000000100000001 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff88805cdb2780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff88805cdb2800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff88805cdb2880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                       ^
  ffff88805cdb2900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff88805cdb2980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

^ permalink raw reply

* Re: [PATCH V2 net-next 1/7] net: hns3: add ethtool_ops.set_channels support for HNS3 VF driver
From: tanhuazhong @ 2019-09-12  8:20 UTC (permalink / raw)
  To: Michal Kubecek, netdev
  Cc: davem, linux-kernel, salil.mehta, yisen.zhuang, linuxarm,
	jakub.kicinski
In-Reply-To: <20190912062301.GE24779@unicorn.suse.cz>

Hi, Michal

On 2019/9/12 14:23, Michal Kubecek wrote:
> On Wed, Sep 11, 2019 at 10:40:33AM +0800, Huazhong Tan wrote:
>> From: Guangbin Huang <huangguangbin2@huawei.com>
>>
>> This patch adds ethtool_ops.set_channels support for HNS3 VF driver,
>> and updates related TQP information and RSS information, to support
>> modification of VF TQP number, and uses current rss_size instead of
>> max_rss_size to initialize RSS.
>>
>> Also, fixes a format error in hclgevf_get_rss().
>>
>> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
>> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
>> ---
>>   drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c |  1 +
>>   .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 83 ++++++++++++++++++++--
>>   2 files changed, 79 insertions(+), 5 deletions(-)
>>
> ...
>> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
>> index 594cae8..e3090b3 100644
>> --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
>> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
> ...
>> +static void hclgevf_update_rss_size(struct hnae3_handle *handle,
>> +				    u32 new_tqps_num)
>> +{
>> +	struct hnae3_knic_private_info *kinfo = &handle->kinfo;
>> +	struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle);
>> +	u16 max_rss_size;
>> +
>> +	kinfo->req_rss_size = new_tqps_num;
>> +
>> +	max_rss_size = min_t(u16, hdev->rss_size_max,
>> +			     hdev->num_tqps / kinfo->num_tc);
>> +
>> +	/* Use the user's configuration when it is not larger than
>> +	 * max_rss_size, otherwise, use the maximum specification value.
>> +	 */
>> +	if (kinfo->req_rss_size != kinfo->rss_size && kinfo->req_rss_size &&
>> +	    kinfo->req_rss_size <= max_rss_size)
>> +		kinfo->rss_size = kinfo->req_rss_size;
>> +	else if (kinfo->rss_size > max_rss_size ||
>> +		 (!kinfo->req_rss_size && kinfo->rss_size < max_rss_size))
>> +		kinfo->rss_size = max_rss_size;
> 
> I don't think requested channel count can be larger than max_rss_size
> here. In ethtool_set_channels(), we check that requested channel counts
> do not exceed maximum channel counts as reported by ->get_channels().
> And hclgevf_get_max_channels() cannot return more than max_rss_size.
> 

When we can modify the TC number (which PF has already supported, VF may 
implement in the future) using lldptool or tc cmd, 
hclgevf_update_rss_size will be called to update the rss information, 
which may also change max_rss_size,  so we will use max_rss_size instead 
if the kinfo->rss_size configured using ethtool is bigger than max_rss_size.

>> +
>> +	kinfo->num_tqps = kinfo->num_tc * kinfo->rss_size;
>> +}
>> +
>> +static int hclgevf_set_channels(struct hnae3_handle *handle, u32 new_tqps_num,
>> +				bool rxfh_configured)
>> +{
>> +	struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle);
>> +	struct hnae3_knic_private_info *kinfo = &handle->kinfo;
>> +	u16 cur_rss_size = kinfo->rss_size;
>> +	u16 cur_tqps = kinfo->num_tqps;
>> +	u32 *rss_indir;
>> +	unsigned int i;
>> +	int ret;
>> +
>> +	hclgevf_update_rss_size(handle, new_tqps_num);
>> +
>> +	ret = hclgevf_set_rss_tc_mode(hdev, kinfo->rss_size);
>> +	if (ret)
>> +		return ret;
>> +
>> +	/* RSS indirection table has been configuared by user */
>> +	if (rxfh_configured)
>> +		goto out;
>> +
>> +	/* Reinitializes the rss indirect table according to the new RSS size */
>> +	rss_indir = kcalloc(HCLGEVF_RSS_IND_TBL_SIZE, sizeof(u32), GFP_KERNEL);
>> +	if (!rss_indir)
>> +		return -ENOMEM;
>> +
>> +	for (i = 0; i < HCLGEVF_RSS_IND_TBL_SIZE; i++)
>> +		rss_indir[i] = i % kinfo->rss_size;
>> +
>> +	ret = hclgevf_set_rss(handle, rss_indir, NULL, 0);
>> +	if (ret)
>> +		dev_err(&hdev->pdev->dev, "set rss indir table fail, ret=%d\n",
>> +			ret);
>> +
>> +	kfree(rss_indir);
>> +
>> +out:
>> +	if (!ret)
>> +		dev_info(&hdev->pdev->dev,
>> +			 "Channels changed, rss_size from %u to %u, tqps from %u to %u",
>> +			 cur_rss_size, kinfo->rss_size,
>> +			 cur_tqps, kinfo->rss_size * kinfo->num_tc);
>> +
>> +	return ret;
>> +}
> 
> IIRC David asked you not to issue this log message in v1 review.
> 
> Michal Kubecek
> 

Sorry for missing this log.

Thanks.

> .
> 


^ permalink raw reply

* [PATCH net,stable] cdc_ether: fix rndis support for Mediatek based smartphones
From: Bjørn Mork @ 2019-09-12  8:42 UTC (permalink / raw)
  To: netdev; +Cc: Oliver Neukum, linux-usb, Lars Melin, Bjørn Mork

A Mediatek based smartphone owner reports problems with USB
tethering in Linux.  The verbose USB listing shows a rndis_host
interface pair (e0/01/03 + 10/00/00), but the driver fails to
bind with

[  355.960428] usb 1-4: bad CDC descriptors

The problem is a failsafe test intended to filter out ACM serial
functions using the same 02/02/ff class/subclass/protocol as RNDIS.
The serial functions are recognized by their non-zero bmCapabilities.

No RNDIS function with non-zero bmCapabilities were known at the time
this failsafe was added. But it turns out that some Wireless class
RNDIS functions are using the bmCapabilities field. These functions
are uniquely identified as RNDIS by their class/subclass/protocol, so
the failing test can safely be disabled.  The same applies to the two
types of Misc class RNDIS functions.

Applying the failsafe to Communication class functions only retains
the original functionality, and fixes the problem for the Mediatek based
smartphone.

Tow examples of CDC functional descriptors with non-zero bmCapabilities
from Wireless class RNDIS functions are:

0e8d:000a  Mediatek Crosscall Spider X5 3G Phone

      CDC Header:
        bcdCDC               1.10
      CDC ACM:
        bmCapabilities       0x0f
          connection notifications
          sends break
          line coding and serial state
          get/set/clear comm features
      CDC Union:
        bMasterInterface        0
        bSlaveInterface         1
      CDC Call Management:
        bmCapabilities       0x03
          call management
          use DataInterface
        bDataInterface          1

and

19d2:1023  ZTE K4201-z

      CDC Header:
        bcdCDC               1.10
      CDC ACM:
        bmCapabilities       0x02
          line coding and serial state
      CDC Call Management:
        bmCapabilities       0x03
          call management
          use DataInterface
        bDataInterface          1
      CDC Union:
        bMasterInterface        0
        bSlaveInterface         1

The Mediatek example is believed to apply to most smartphones with
Mediatek firmware.  The ZTE example is most likely also part of a larger
family of devices/firmwares.

Suggested-by: Lars Melin <larsm17@gmail.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
---
 drivers/net/usb/cdc_ether.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/net/usb/cdc_ether.c b/drivers/net/usb/cdc_ether.c
index 8458e88c18e9..32f53de5b1fe 100644
--- a/drivers/net/usb/cdc_ether.c
+++ b/drivers/net/usb/cdc_ether.c
@@ -206,7 +206,15 @@ int usbnet_generic_cdc_bind(struct usbnet *dev, struct usb_interface *intf)
 		goto bad_desc;
 	}
 skip:
-	if (rndis && header.usb_cdc_acm_descriptor &&
+	/* Communcation class functions with bmCapabilities are not
+	 * RNDIS.  But some Wireless class RNDIS functions use
+	 * bmCapabilities for their own purpose. The failsafe is
+	 * therefore applied only to Communication class RNDIS
+	 * functions.  The rndis test is redundant, but a cheap
+	 * optimization.
+	 */
+	if (rndis && is_rndis(&intf->cur_altsetting->desc) &&
+	    header.usb_cdc_acm_descriptor &&
 	    header.usb_cdc_acm_descriptor->bmCapabilities) {
 		dev_dbg(&intf->dev,
 			"ACM capabilities %02x, not really RNDIS?\n",
-- 
2.20.1


^ permalink raw reply related

* [patch net-next v3 0/3] net: devlink: move reload fail indication to devlink core and expose to user
From: Jiri Pirko @ 2019-09-12  8:49 UTC (permalink / raw)
  To: netdev; +Cc: davem, idosch, dsahern, jakub.kicinski, tariqt, mlxsw

From: Jiri Pirko <jiri@mellanox.com>

First two patches are dependencies of the last one. That moves devlink
reload failure indication to the devlink code, so the drivers do not
have to track it themselves. Currently it is only mlxsw, but I will send
a follow-up patchset that introduces this in netdevsim too.

Jiri Pirko (3):
  mlx4: Split restart_one into two functions
  net: devlink: split reload op into two
  net: devlink: move reload fail indication to devlink core and expose
    to user

 drivers/net/ethernet/mellanox/mlx4/catas.c |  2 +-
 drivers/net/ethernet/mellanox/mlx4/main.c  | 44 ++++++++++++++++++----
 drivers/net/ethernet/mellanox/mlx4/mlx4.h  |  3 +-
 drivers/net/ethernet/mellanox/mlxsw/core.c | 30 +++++++++------
 drivers/net/netdevsim/dev.c                | 13 +++++--
 include/net/devlink.h                      |  8 +++-
 include/uapi/linux/devlink.h               |  2 +
 net/core/devlink.c                         | 35 +++++++++++++++--
 8 files changed, 106 insertions(+), 31 deletions(-)

-- 
2.21.0


^ permalink raw reply

* [patch net-next v3 1/3] mlx4: Split restart_one into two functions
From: Jiri Pirko @ 2019-09-12  8:49 UTC (permalink / raw)
  To: netdev; +Cc: davem, idosch, dsahern, jakub.kicinski, tariqt, mlxsw
In-Reply-To: <20190912084946.7468-1-jiri@resnulli.us>

From: Jiri Pirko <jiri@mellanox.com>

Split the function restart_one into two functions and separate teardown
and buildup.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/catas.c |  2 +-
 drivers/net/ethernet/mellanox/mlx4/main.c  | 25 ++++++++++++++++++----
 drivers/net/ethernet/mellanox/mlx4/mlx4.h  |  3 +--
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/catas.c b/drivers/net/ethernet/mellanox/mlx4/catas.c
index 87e90b5d4d7d..5b11557f1ae4 100644
--- a/drivers/net/ethernet/mellanox/mlx4/catas.c
+++ b/drivers/net/ethernet/mellanox/mlx4/catas.c
@@ -210,7 +210,7 @@ static void mlx4_handle_error_state(struct mlx4_dev_persistent *persist)
 	mutex_lock(&persist->interface_state_mutex);
 	if (persist->interface_state & MLX4_INTERFACE_STATE_UP &&
 	    !(persist->interface_state & MLX4_INTERFACE_STATE_DELETION)) {
-		err = mlx4_restart_one(persist->pdev, false, NULL);
+		err = mlx4_restart_one(persist->pdev);
 		mlx4_info(persist->dev, "mlx4_restart_one was ended, ret=%d\n",
 			  err);
 	}
diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index 07c204bd3fc4..a39c647c12dc 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -3931,6 +3931,10 @@ static void mlx4_devlink_param_load_driverinit_values(struct devlink *devlink)
 	}
 }
 
+static void mlx4_restart_one_down(struct pci_dev *pdev);
+static int mlx4_restart_one_up(struct pci_dev *pdev, bool reload,
+			       struct devlink *devlink);
+
 static int mlx4_devlink_reload(struct devlink *devlink,
 			       struct netlink_ext_ack *extack)
 {
@@ -3941,9 +3945,11 @@ static int mlx4_devlink_reload(struct devlink *devlink,
 
 	if (persist->num_vfs)
 		mlx4_warn(persist->dev, "Reload performed on PF, will cause reset on operating Virtual Functions\n");
-	err = mlx4_restart_one(persist->pdev, true, devlink);
+	mlx4_restart_one_down(persist->pdev);
+	err = mlx4_restart_one_up(persist->pdev, true, devlink);
 	if (err)
-		mlx4_err(persist->dev, "mlx4_restart_one failed, ret=%d\n", err);
+		mlx4_err(persist->dev, "mlx4_restart_one_up failed, ret=%d\n",
+			 err);
 
 	return err;
 }
@@ -4163,7 +4169,13 @@ static int restore_current_port_types(struct mlx4_dev *dev,
 	return err;
 }
 
-int mlx4_restart_one(struct pci_dev *pdev, bool reload, struct devlink *devlink)
+static void mlx4_restart_one_down(struct pci_dev *pdev)
+{
+	mlx4_unload_one(pdev);
+}
+
+static int mlx4_restart_one_up(struct pci_dev *pdev, bool reload,
+			       struct devlink *devlink)
 {
 	struct mlx4_dev_persistent *persist = pci_get_drvdata(pdev);
 	struct mlx4_dev	 *dev  = persist->dev;
@@ -4175,7 +4187,6 @@ int mlx4_restart_one(struct pci_dev *pdev, bool reload, struct devlink *devlink)
 	total_vfs = dev->persist->num_vfs;
 	memcpy(nvfs, dev->persist->nvfs, sizeof(dev->persist->nvfs));
 
-	mlx4_unload_one(pdev);
 	if (reload)
 		mlx4_devlink_param_load_driverinit_values(devlink);
 	err = mlx4_load_one(pdev, pci_dev_data, total_vfs, nvfs, priv, 1);
@@ -4194,6 +4205,12 @@ int mlx4_restart_one(struct pci_dev *pdev, bool reload, struct devlink *devlink)
 	return err;
 }
 
+int mlx4_restart_one(struct pci_dev *pdev)
+{
+	mlx4_restart_one_down(pdev);
+	return mlx4_restart_one_up(pdev, false, NULL);
+}
+
 #define MLX_SP(id) { PCI_VDEVICE(MELLANOX, id), MLX4_PCI_DEV_FORCE_SENSE_PORT }
 #define MLX_VF(id) { PCI_VDEVICE(MELLANOX, id), MLX4_PCI_DEV_IS_VF }
 #define MLX_GN(id) { PCI_VDEVICE(MELLANOX, id), 0 }
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4.h b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
index 23f1b5b512c2..527b52e48276 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
@@ -1043,8 +1043,7 @@ int mlx4_catas_init(struct mlx4_dev *dev);
 void mlx4_catas_end(struct mlx4_dev *dev);
 int mlx4_crdump_init(struct mlx4_dev *dev);
 void mlx4_crdump_end(struct mlx4_dev *dev);
-int mlx4_restart_one(struct pci_dev *pdev, bool reload,
-		     struct devlink *devlink);
+int mlx4_restart_one(struct pci_dev *pdev);
 int mlx4_register_device(struct mlx4_dev *dev);
 void mlx4_unregister_device(struct mlx4_dev *dev);
 void mlx4_dispatch_event(struct mlx4_dev *dev, enum mlx4_dev_event type,
-- 
2.21.0


^ permalink raw reply related

* [patch net-next v3 2/3] net: devlink: split reload op into two
From: Jiri Pirko @ 2019-09-12  8:49 UTC (permalink / raw)
  To: netdev; +Cc: davem, idosch, dsahern, jakub.kicinski, tariqt, mlxsw
In-Reply-To: <20190912084946.7468-1-jiri@resnulli.us>

From: Jiri Pirko <jiri@mellanox.com>

In order to properly implement failure indication during reload,
split the reload op into two ops, one for down phase and one for
up phase.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/main.c  | 19 +++++++++++++++----
 drivers/net/ethernet/mellanox/mlxsw/core.c | 19 +++++++++++++++----
 drivers/net/netdevsim/dev.c                | 13 ++++++++++---
 include/net/devlink.h                      |  5 ++++-
 net/core/devlink.c                         | 16 ++++++++++++----
 5 files changed, 56 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index a39c647c12dc..ef3f3d06ff1e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -3935,17 +3935,27 @@ static void mlx4_restart_one_down(struct pci_dev *pdev);
 static int mlx4_restart_one_up(struct pci_dev *pdev, bool reload,
 			       struct devlink *devlink);
 
-static int mlx4_devlink_reload(struct devlink *devlink,
-			       struct netlink_ext_ack *extack)
+static int mlx4_devlink_reload_down(struct devlink *devlink,
+				    struct netlink_ext_ack *extack)
 {
 	struct mlx4_priv *priv = devlink_priv(devlink);
 	struct mlx4_dev *dev = &priv->dev;
 	struct mlx4_dev_persistent *persist = dev->persist;
-	int err;
 
 	if (persist->num_vfs)
 		mlx4_warn(persist->dev, "Reload performed on PF, will cause reset on operating Virtual Functions\n");
 	mlx4_restart_one_down(persist->pdev);
+	return 0;
+}
+
+static int mlx4_devlink_reload_up(struct devlink *devlink,
+				  struct netlink_ext_ack *extack)
+{
+	struct mlx4_priv *priv = devlink_priv(devlink);
+	struct mlx4_dev *dev = &priv->dev;
+	struct mlx4_dev_persistent *persist = dev->persist;
+	int err;
+
 	err = mlx4_restart_one_up(persist->pdev, true, devlink);
 	if (err)
 		mlx4_err(persist->dev, "mlx4_restart_one_up failed, ret=%d\n",
@@ -3956,7 +3966,8 @@ static int mlx4_devlink_reload(struct devlink *devlink,
 
 static const struct devlink_ops mlx4_devlink_ops = {
 	.port_type_set	= mlx4_devlink_port_type_set,
-	.reload		= mlx4_devlink_reload,
+	.reload_down	= mlx4_devlink_reload_down,
+	.reload_up	= mlx4_devlink_reload_up,
 };
 
 static int mlx4_init_one(struct pci_dev *pdev, const struct pci_device_id *id)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 963a2b4b61b1..c71a1d9ea17b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -984,16 +984,26 @@ mlxsw_devlink_info_get(struct devlink *devlink, struct devlink_info_req *req,
 	return 0;
 }
 
-static int mlxsw_devlink_core_bus_device_reload(struct devlink *devlink,
-						struct netlink_ext_ack *extack)
+static int
+mlxsw_devlink_core_bus_device_reload_down(struct devlink *devlink,
+					  struct netlink_ext_ack *extack)
 {
 	struct mlxsw_core *mlxsw_core = devlink_priv(devlink);
-	int err;
 
 	if (!(mlxsw_core->bus->features & MLXSW_BUS_F_RESET))
 		return -EOPNOTSUPP;
 
 	mlxsw_core_bus_device_unregister(mlxsw_core, true);
+	return 0;
+}
+
+static int
+mlxsw_devlink_core_bus_device_reload_up(struct devlink *devlink,
+					struct netlink_ext_ack *extack)
+{
+	struct mlxsw_core *mlxsw_core = devlink_priv(devlink);
+	int err;
+
 	err = mlxsw_core_bus_device_register(mlxsw_core->bus_info,
 					     mlxsw_core->bus,
 					     mlxsw_core->bus_priv, true,
@@ -1066,7 +1076,8 @@ mlxsw_devlink_trap_group_init(struct devlink *devlink,
 }
 
 static const struct devlink_ops mlxsw_devlink_ops = {
-	.reload				= mlxsw_devlink_core_bus_device_reload,
+	.reload_down		= mlxsw_devlink_core_bus_device_reload_down,
+	.reload_up		= mlxsw_devlink_core_bus_device_reload_up,
 	.port_type_set			= mlxsw_devlink_port_type_set,
 	.port_split			= mlxsw_devlink_port_split,
 	.port_unsplit			= mlxsw_devlink_port_unsplit,
diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
index 39cdb6c18ec0..7fba7b271a57 100644
--- a/drivers/net/netdevsim/dev.c
+++ b/drivers/net/netdevsim/dev.c
@@ -521,8 +521,14 @@ static void nsim_dev_traps_exit(struct devlink *devlink)
 	kfree(nsim_dev->trap_data);
 }
 
-static int nsim_dev_reload(struct devlink *devlink,
-			   struct netlink_ext_ack *extack)
+static int nsim_dev_reload_down(struct devlink *devlink,
+				struct netlink_ext_ack *extack)
+{
+	return 0;
+}
+
+static int nsim_dev_reload_up(struct devlink *devlink,
+			      struct netlink_ext_ack *extack)
 {
 	enum nsim_resource_id res_ids[] = {
 		NSIM_RESOURCE_IPV4_FIB, NSIM_RESOURCE_IPV4_FIB_RULES,
@@ -638,7 +644,8 @@ nsim_dev_devlink_trap_action_set(struct devlink *devlink,
 }
 
 static const struct devlink_ops nsim_dev_devlink_ops = {
-	.reload = nsim_dev_reload,
+	.reload_down = nsim_dev_reload_down,
+	.reload_up = nsim_dev_reload_up,
 	.flash_update = nsim_dev_flash_update,
 	.trap_init = nsim_dev_devlink_trap_init,
 	.trap_action_set = nsim_dev_devlink_trap_action_set,
diff --git a/include/net/devlink.h b/include/net/devlink.h
index 03e4d9244ff3..ab8d56d12ffd 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -642,7 +642,10 @@ enum devlink_trap_group_generic_id {
 	}
 
 struct devlink_ops {
-	int (*reload)(struct devlink *devlink, struct netlink_ext_ack *extack);
+	int (*reload_down)(struct devlink *devlink,
+			   struct netlink_ext_ack *extack);
+	int (*reload_up)(struct devlink *devlink,
+			 struct netlink_ext_ack *extack);
 	int (*port_type_set)(struct devlink_port *devlink_port,
 			     enum devlink_port_type port_type);
 	int (*port_split)(struct devlink *devlink, unsigned int port_index,
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 4a2fb94c44cf..9e522639693d 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -2672,12 +2672,17 @@ devlink_resources_validate(struct devlink *devlink,
 	return err;
 }
 
+static bool devlink_reload_supported(struct devlink *devlink)
+{
+	return devlink->ops->reload_down && devlink->ops->reload_up;
+}
+
 static int devlink_nl_cmd_reload(struct sk_buff *skb, struct genl_info *info)
 {
 	struct devlink *devlink = info->user_ptr[0];
 	int err;
 
-	if (!devlink->ops->reload)
+	if (!devlink_reload_supported(devlink))
 		return -EOPNOTSUPP;
 
 	err = devlink_resources_validate(devlink, NULL, info);
@@ -2685,7 +2690,10 @@ static int devlink_nl_cmd_reload(struct sk_buff *skb, struct genl_info *info)
 		NL_SET_ERR_MSG_MOD(info->extack, "resources size validation failed");
 		return err;
 	}
-	return devlink->ops->reload(devlink, info->extack);
+	err = devlink->ops->reload_down(devlink, info->extack);
+	if (err)
+		return err;
+	return devlink->ops->reload_up(devlink, info->extack);
 }
 
 static int devlink_nl_flash_update_fill(struct sk_buff *msg,
@@ -7150,7 +7158,7 @@ __devlink_param_driverinit_value_set(struct devlink *devlink,
 int devlink_param_driverinit_value_get(struct devlink *devlink, u32 param_id,
 				       union devlink_param_value *init_val)
 {
-	if (!devlink->ops->reload)
+	if (!devlink_reload_supported(devlink))
 		return -EOPNOTSUPP;
 
 	return __devlink_param_driverinit_value_get(&devlink->param_list,
@@ -7197,7 +7205,7 @@ int devlink_port_param_driverinit_value_get(struct devlink_port *devlink_port,
 {
 	struct devlink *devlink = devlink_port->devlink;
 
-	if (!devlink->ops->reload)
+	if (!devlink_reload_supported(devlink))
 		return -EOPNOTSUPP;
 
 	return __devlink_param_driverinit_value_get(&devlink_port->param_list,
-- 
2.21.0


^ permalink raw reply related

* [patch net-next v3 3/3] net: devlink: move reload fail indication to devlink core and expose to user
From: Jiri Pirko @ 2019-09-12  8:49 UTC (permalink / raw)
  To: netdev; +Cc: davem, idosch, dsahern, jakub.kicinski, tariqt, mlxsw
In-Reply-To: <20190912084946.7468-1-jiri@resnulli.us>

From: Jiri Pirko <jiri@mellanox.com>

Currently the fact that devlink reload failed is stored in drivers.
Move this flag into devlink core. Also, expose it to the user.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
v2->v3:
- make arg of devlink_is_reload_failed const
v1->v2:
- s/devlink failed/devlink reload failed/ in description
---
 drivers/net/ethernet/mellanox/mlxsw/core.c | 15 +++++----------
 include/net/devlink.h                      |  3 +++
 include/uapi/linux/devlink.h               |  2 ++
 net/core/devlink.c                         | 21 ++++++++++++++++++++-
 4 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c
index c71a1d9ea17b..3fa96076e8a5 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -80,7 +80,6 @@ struct mlxsw_core {
 	struct mlxsw_thermal *thermal;
 	struct mlxsw_core_port *ports;
 	unsigned int max_ports;
-	bool reload_fail;
 	bool fw_flash_in_progress;
 	unsigned long driver_priv[0];
 	/* driver_priv has to be always the last item */
@@ -1002,15 +1001,11 @@ mlxsw_devlink_core_bus_device_reload_up(struct devlink *devlink,
 					struct netlink_ext_ack *extack)
 {
 	struct mlxsw_core *mlxsw_core = devlink_priv(devlink);
-	int err;
-
-	err = mlxsw_core_bus_device_register(mlxsw_core->bus_info,
-					     mlxsw_core->bus,
-					     mlxsw_core->bus_priv, true,
-					     devlink);
-	mlxsw_core->reload_fail = !!err;
 
-	return err;
+	return mlxsw_core_bus_device_register(mlxsw_core->bus_info,
+					      mlxsw_core->bus,
+					      mlxsw_core->bus_priv, true,
+					      devlink);
 }
 
 static int mlxsw_devlink_flash_update(struct devlink *devlink,
@@ -1254,7 +1249,7 @@ void mlxsw_core_bus_device_unregister(struct mlxsw_core *mlxsw_core,
 {
 	struct devlink *devlink = priv_to_devlink(mlxsw_core);
 
-	if (mlxsw_core->reload_fail) {
+	if (devlink_is_reload_failed(devlink)) {
 		if (!reload)
 			/* Only the parts that were not de-initialized in the
 			 * failed reload attempt need to be de-initialized.
diff --git a/include/net/devlink.h b/include/net/devlink.h
index ab8d56d12ffd..23e4b65ec9df 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -38,6 +38,7 @@ struct devlink {
 	struct device *dev;
 	possible_net_t _net;
 	struct mutex lock;
+	bool reload_failed;
 	char priv[0] __aligned(NETDEV_ALIGN);
 };
 
@@ -945,6 +946,8 @@ void
 devlink_health_reporter_state_update(struct devlink_health_reporter *reporter,
 				     enum devlink_health_reporter_state state);
 
+bool devlink_is_reload_failed(const struct devlink *devlink);
+
 void devlink_flash_update_begin_notify(struct devlink *devlink);
 void devlink_flash_update_end_notify(struct devlink *devlink);
 void devlink_flash_update_status_notify(struct devlink *devlink,
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 8da5365850cd..580b7a2e40e1 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -419,6 +419,8 @@ enum devlink_attr {
 	DEVLINK_ATTR_TRAP_METADATA,			/* nested */
 	DEVLINK_ATTR_TRAP_GROUP_NAME,			/* string */
 
+	DEVLINK_ATTR_RELOAD_FAILED,			/* u8 0 or 1 */
+
 	/* add new attributes above here, update the policy in devlink.c */
 
 	__DEVLINK_ATTR_MAX,
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 9e522639693d..e48680efe54a 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -471,6 +471,8 @@ static int devlink_nl_fill(struct sk_buff *msg, struct devlink *devlink,
 
 	if (devlink_nl_put_handle(msg, devlink))
 		goto nla_put_failure;
+	if (nla_put_u8(msg, DEVLINK_ATTR_RELOAD_FAILED, devlink->reload_failed))
+		goto nla_put_failure;
 
 	genlmsg_end(msg, hdr);
 	return 0;
@@ -2677,6 +2679,21 @@ static bool devlink_reload_supported(struct devlink *devlink)
 	return devlink->ops->reload_down && devlink->ops->reload_up;
 }
 
+static void devlink_reload_failed_set(struct devlink *devlink,
+				      bool reload_failed)
+{
+	if (devlink->reload_failed == reload_failed)
+		return;
+	devlink->reload_failed = reload_failed;
+	devlink_notify(devlink, DEVLINK_CMD_NEW);
+}
+
+bool devlink_is_reload_failed(const struct devlink *devlink)
+{
+	return devlink->reload_failed;
+}
+EXPORT_SYMBOL_GPL(devlink_is_reload_failed);
+
 static int devlink_nl_cmd_reload(struct sk_buff *skb, struct genl_info *info)
 {
 	struct devlink *devlink = info->user_ptr[0];
@@ -2693,7 +2710,9 @@ static int devlink_nl_cmd_reload(struct sk_buff *skb, struct genl_info *info)
 	err = devlink->ops->reload_down(devlink, info->extack);
 	if (err)
 		return err;
-	return devlink->ops->reload_up(devlink, info->extack);
+	err = devlink->ops->reload_up(devlink, info->extack);
+	devlink_reload_failed_set(devlink, !!err);
+	return err;
 }
 
 static int devlink_nl_flash_update_fill(struct sk_buff *msg,
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH 0/7] net: dsa: mv88e6xxx: features to handle network storms
From: Ido Schimmel @ 2019-09-12  9:03 UTC (permalink / raw)
  To: Robert Beckett
  Cc: Florian Fainelli, netdev@vger.kernel.org, Andrew Lunn,
	Vivien Didelot, David S. Miller, Jiri Pirko
In-Reply-To: <3f50ee51ec04a2d683a5338a68607824a3f45711.camel@collabora.com>

On Wed, Sep 11, 2019 at 12:49:03PM +0100, Robert Beckett wrote:
> On Wed, 2019-09-11 at 11:21 +0000, Ido Schimmel wrote:
> > On Tue, Sep 10, 2019 at 09:49:46AM -0700, Florian Fainelli wrote:
> > > +Ido, Jiri,
> > > 
> > > On 9/10/19 8:41 AM, Robert Beckett wrote:
> > > > This patch-set adds support for some features of the Marvell
> > > > switch
> > > > chips that can be used to handle packet storms.
> > > > 
> > > > The rationale for this was a setup that requires the ability to
> > > > receive
> > > > traffic from one port, while a packet storm is occuring on
> > > > another port
> > > > (via an external switch with a deliberate loop). This is needed
> > > > to
> > > > ensure vital data delivery from a specific port, while mitigating
> > > > any
> > > > loops or DoS that a user may introduce on another port (can't
> > > > guarantee
> > > > sensible users).
> > > 
> > > The use case is reasonable, but the implementation is not really.
> > > You
> > > are using Device Tree which is meant to describe hardware as a
> > > policy
> > > holder for setting up queue priorities and likewise for queue
> > > scheduling.
> > > 
> > > The tool that should be used for that purpose is tc and possibly an
> > > appropriately offloaded queue scheduler in order to map the desired
> > > scheduling class to what the hardware supports.
> > > 
> > > Jiri, Ido, how do you guys support this with mlxsw?
> > 
> > Hi Florian,
> > 
> > Are you referring to policing traffic towards the CPU using a policer
> > on
> > the egress of the CPU port? At least that's what I understand from
> > the
> > description of patch 6 below.
> > 
> > If so, mlxsw sets policers for different traffic types during its
> > initialization sequence. These policers are not exposed to the user
> > nor
> > configurable. While the default settings are good for most users, we
> > do
> > want to allow users to change these and expose current settings.
> > 
> > I agree that tc seems like the right choice, but the question is
> > where
> > are we going to install the filters?
> > 
> 
> Before I go too far down the rabbit hole of tc traffic shaping, maybe
> it would be good to explain in more detail the problem I am trying to
> solve.
> 
> We have a setup as follows:
> 
> Marvell 88E6240 switch chip, accepting traffic from 4 ports. Port 1
> (P1) is critical priority, no dropped packets allowed, all others can
> be best effort.
> 
> CPU port of swtich chip is connected via phy to phy of intel i210 (igb
> driver).
> 
> i210 is connected via pcie switch to imx6.
> 
> When too many small packets attempt to be delivered to CPU port (e.g.
> during broadcast flood) we saw dropped packets.
> 
> The packets were being received by i210 in to rx descriptor buffer
> fine, but the CPU could not keep up with the load. We saw
> rx_fifo_errors increasing rapidly and ksoftirqd at ~100% CPU.
> 
> 
> With this in mind, I am wondering whether any amount of tc traffic
> shaping would help? Would tc shaping require that the packet reception
> manages to keep up before it can enact its policies? Does the
> infrastructure have accelerator offload hooks to be able to apply it
> via HW? I dont see how it would be able to inspect the packets to apply
> filtering if they were dropped due to rx descriptor exhaustion. (please
> bear with me with the basic questions, I am not familiar with this part
> of the stack).
> 
> Assuming that tc is still the way to go, after a brief look in to the
> man pages and the documentation at largc.org, it seems like it would
> need to use the ingress qdisc, with some sort of system to segregate
> and priortise based on ingress port. Is this possible?

Hi Robert,

As I see it, you have two problems here:

1. Classification: Based on ingress port in your case

2. Scheduling: How to schedule between the different transmission queues

Where the port from which the packets should egress is the CPU port,
before they cross the PCI towards the imx6.

Both of these issues can be solved by tc. The main problem is that today
we do not have a netdev to represent the CPU port and therefore can't
use existing infra like tc. I believe we need to create one. Besides
scheduling, we can also use it to permit/deny certain traffic from
reaching the CPU and perform policing.

Drivers can run the received packets via taps using
dev_queue_xmit_nit(), so that users will see all the traffic directed at
the host when running tcpdump on this netdev.

> 
> 
> 
> > > 
> > > > 
> > > > [patch 1/7] configures auto negotiation for CPU ports connected
> > > > with
> > > > phys to enable pause frame propogation.
> > > > 
> > > > [patch 2/7] allows setting of port's default output queue
> > > > priority for
> > > > any ingressing packets on that port.
> > > > 
> > > > [patch 3/7] dt-bindings for patch 2.
> > > > 
> > > > [patch 4/7] allows setting of a port's queue scheduling so that
> > > > it can
> > > > prioritise egress of traffic routed from high priority ports.
> > > > 
> > > > [patch 5/7] dt-bindings for patch 4.
> > > > 
> > > > [patch 6/7] allows ports to rate limit their egress. This can be
> > > > used to
> > > > stop the host CPU from becoming swamped by packet delivery and
> > > > exhasting
> > > > descriptors.
> > > > 
> > > > [patch 7/7] dt-bindings for patch 6.
> > > > 
> > > > 
> > > > Robert Beckett (7):
> > > >   net/dsa: configure autoneg for CPU port
> > > >   net: dsa: mv88e6xxx: add ability to set default queue
> > > > priorities per
> > > >     port
> > > >   dt-bindings: mv88e6xxx: add ability to set default queue
> > > > priorities
> > > >     per port
> > > >   net: dsa: mv88e6xxx: add ability to set queue scheduling
> > > >   dt-bindings: mv88e6xxx: add ability to set queue scheduling
> > > >   net: dsa: mv88e6xxx: add egress rate limiting
> > > >   dt-bindings: mv88e6xxx: add egress rate limiting
> > > > 
> > > >  .../devicetree/bindings/net/dsa/marvell.txt   |  38 +++++
> > > >  drivers/net/dsa/mv88e6xxx/chip.c              | 122
> > > > ++++++++++++---
> > > >  drivers/net/dsa/mv88e6xxx/chip.h              |   5 +-
> > > >  drivers/net/dsa/mv88e6xxx/port.c              | 140
> > > > +++++++++++++++++-
> > > >  drivers/net/dsa/mv88e6xxx/port.h              |  24 ++-
> > > >  include/dt-bindings/net/dsa-mv88e6xxx.h       |  22 +++
> > > >  net/dsa/port.c                                |  10 ++
> > > >  7 files changed, 327 insertions(+), 34 deletions(-)
> > > >  create mode 100644 include/dt-bindings/net/dsa-mv88e6xxx.h
> > > > 
> > > 
> > > 
> > > -- 
> > > Florian
> 

^ permalink raw reply

* Re: [PATCH 0/7] net: dsa: mv88e6xxx: features to handle network storms
From: Ido Schimmel @ 2019-09-12  9:05 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Robert Beckett, Florian Fainelli, netdev@vger.kernel.org,
	Vivien Didelot, David S. Miller, Jiri Pirko
In-Reply-To: <20190911225841.GB5710@lunn.ch>

On Thu, Sep 12, 2019 at 12:58:41AM +0200, Andrew Lunn wrote:
> So think about how your can model the Marvell switch capabilities
> using TC, and implement offload support for it.

+1 :)

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox