Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH iproute2 net-next v2 3/4] ss: Buffer raw fields first, then render them as a table
From: Phil Sutter @ 2019-02-13 23:39 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Stefano Brivio, Eric Dumazet, netdev, Sabrina Dubroca,
	David Ahern
In-Reply-To: <20190213135534.01dacee5@shemminger-XPS-13-9360>

Hi Stephen,

On Wed, Feb 13, 2019 at 01:55:34PM -0800, Stephen Hemminger wrote:
> On Wed, 13 Feb 2019 22:17:16 +0100
> Stefano Brivio <sbrivio@redhat.com> wrote:
> 
> > On Wed, 13 Feb 2019 09:31:03 -0800
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > 
> > > On 02/13/2019 12:37 AM, Stefano Brivio wrote:  
> > > > On Tue, 12 Feb 2019 16:42:04 -0800
> > > > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > >     
> > > >> I do not get it.
> > > >>
> > > >> "ss -emoi " uses almost 1KB per socket.
> > > >>
> > > >> 10,000,000 sockets -> we need about 10GB of memory  ???
> > > >>
> > > >> This is a serious regression.    
> > > > 
> > > > I guess this is rather subjective: the worst case I considered back then
> > > > was the output of 'ss -tei0' (less than 500 bytes) for one million
> > > > sockets, which gives 500M of memory, which should in turn be fine on a
> > > > machine handling one million sockets.
> > > > 
> > > > Now, if 'ss -emoi' on 10 million sockets is an actual use case (out of
> > > > curiosity: how are you going to process that output? Would JSON help?),
> > > > I see two easy options to solve this:    
> > > 
> > > 
> > > ss -temoi | parser (written in shell or awk or whatever...)
> > > 
> > > This is a use case, I just got bitten because using ss command
> > > actually OOM my container, while trying to debug a busy GFE.
> > > 
> > > The host itself can have 10,000,000 TCP sockets, but usually sysadmin shells
> > > run in a container with no more than 500 MB available. 
> > > 
> > > Otherwise, it would be too easy for a buggy program to OOM the whole machine
> > > and have angry customers.
> > >   
> > > > 
> > > > 1. flush the output every time we reach a given buffer size (1M
> > > >    perhaps). This might make the resulting blocks slightly unaligned,
> > > >    with occasional loss of readability on lines occurring every 1k to
> > > >    10k sockets approximately, even though after 1k sockets column sizes
> > > >    won't change much (it looks anyway better than the original), and I
> > > >    don't expect anybody to actually scroll that output
> > > > 
> > > > 2. add a switch for unbuffered output, but then you need to remember to
> > > >    pass it manually, and the whole output would be as bad as the
> > > >    original in case you need the switch.
> > > > 
> > > > I'd rather go with 1., it's easy to implement (we already have partial
> > > > flushing with '--events') and it looks like a good compromise on
> > > > usability. Thoughts?
> > > >     
> > > 
> > > 1 seems fine, but a switch for 'please do not try to format' would be fine.
> > > 
> > > I wonder why we try to 'format' when stdout is a pipe or a regular file .  
> > 
> > On a second thought: what about | less, or | grep [ports],
> > or > readable.log? I guess those might also be rather common use cases,
> > what do you think?
> > 
> > I'm tempted to skip this for the moment and just go with option 1.
> > 
> 
> What I would favor:
> 	* use big enough columns that for the common case everything lines up fine
> 	* if column is to wide just print that element wider (which is what print %Ns does)

This is pretty much the situation Stefano attempted to improve, minus
scaling the columns to max terminal width. ss output formatting being
quirky and unreadable with either small or large terminals was the
number one reason I heard so far why people prefer netstat.

> and
> 	* add json output for programs that want to parse
> 	* use print_uint etc for that

For Eric's use-case, skipping any buffering and tabular output if stdout
is not a TTY suffices. In fact, iproute2 does this already for colored
output (see check_enable_color() for reference).

Adding JSON output support everywhere is a nice feature when it comes to
scripting, but it won't help console users. Unless you expect CLI
frontends to come turning that JSON into human-readable output.

IMHO, JSON output wouldn't even help in this case - unless Eric indeed
prefers to write/use a JSON parser for his analysis instead of something
along 'ss | grep'.

> The buffering patch (in iproute2-next) can/will be reverted.

It's not fair to claim that despite Stefano's commitment to fix the
reported issues. His ss output rewrite is there since v4.15.0 and
according to git history it needed only two fixes so far. I've had
one-liners which required more follow-ups than that! Also, we're still
discovering issues introduced by all the jsonify patches. Allowing for
people to get things right not the first time but after a few tries is
important. If you want to revert something, start with features which
have a fundamental design issue in the exact situation they tried to
improve, like the MSG_PEEK | MSG_TRUNC thing Hangbin and me wrote.

Thanks, Phil

^ permalink raw reply

* [pull request][net 0/4] Mellanox, mlx5 fixes 2019-02-13
From: Saeed Mahameed @ 2019-02-13 23:44 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Saeed Mahameed

Hi Dave,

This series introduces some fixes to mlx5 driver.
For more information please see tag log below.

Please pull and let me know if there is any problem.

For -stable v4.19:
('net/mlx5e: XDP, fix redirect resources availability check')

Thanks,
Saeed.

---
The following changes since commit 2fdeee2549231b1f989f011bb18191f5660d3745:

  team: avoid complex list operations in team_nl_cmd_options_set() (2019-02-12 14:19:27 -0500)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-fixes-2019-02-13

for you to fetch changes up to 407e17b1a69a51ba9a512a04342da56c1f931df4:

  net/mlx5e: XDP, fix redirect resources availability check (2019-02-13 15:40:51 -0800)

----------------------------------------------------------------
mlx5-fixes-2019-02-13

----------------------------------------------------------------
Huy Nguyen (1):
      net/mlx5: No command allowed when command interface is not ready

Maria Pasechnik (1):
      net/mlx5e: Fix NULL pointer derefernce in set channels error flow

Saeed Mahameed (1):
      net/mlx5e: XDP, fix redirect resources availability check

Tariq Toukan (1):
      net/mlx5: Fix a compilation warning in events.c

 drivers/net/ethernet/mellanox/mlx5/core/cmd.c        | 18 ++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/en.h         |  1 +
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c     |  6 ++----
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h     | 17 +++++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c |  7 ++++---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c    |  2 ++
 drivers/net/ethernet/mellanox/mlx5/core/events.c     | 17 +++++++++--------
 drivers/net/ethernet/mellanox/mlx5/core/health.c     |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h  |  1 +
 9 files changed, 55 insertions(+), 16 deletions(-)

^ permalink raw reply

* [net 1/4] net/mlx5e: Fix NULL pointer derefernce in set channels error flow
From: Saeed Mahameed @ 2019-02-13 23:44 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Maria Pasechnik, Tariq Toukan, Saeed Mahameed
In-Reply-To: <20190213234451.27029-1-saeedm@mellanox.com>

From: Maria Pasechnik <mariap@mellanox.com>

New channels are applied to the priv channels only after they
are successfully opened. Then, the indirection table should be built
according to the new number of channels.
Currently, such build is preformed independently of whether the
channels opening is successful, and is not reverted on failure.

The bug is caused due to removal of rss params from channels struct
and moving it to priv struct. That change cause to independency between
channels and rss params.
This causes a crash on a later point, when accessing rqn of a non
existing channel.

This patch fixes it by moving the indirection table build right before
switching the priv channels to new channels struct, after the new set of
channels was successfully opened.

Fixes: bbeb53b8b2c9 ("net/mlx5e: Move RSS params to a dedicated struct")
Signed-off-by: Maria Pasechnik <mariap@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 3bbccead2f63..47233b9a4f81 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -354,9 +354,6 @@ int mlx5e_ethtool_set_channels(struct mlx5e_priv *priv,

 	new_channels.params = priv->channels.params;
 	new_channels.params.num_channels = count;
-	if (!netif_is_rxfh_configured(priv->netdev))
-		mlx5e_build_default_indir_rqt(priv->rss_params.indirection_rqt,
-					      MLX5E_INDIR_RQT_SIZE, count);

 	if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) {
 		priv->channels.params = new_channels.params;
@@ -372,6 +369,10 @@ int mlx5e_ethtool_set_channels(struct mlx5e_priv *priv,
 	if (arfs_enabled)
 		mlx5e_arfs_disable(priv);

+	if (!netif_is_rxfh_configured(priv->netdev))
+		mlx5e_build_default_indir_rqt(priv->rss_params.indirection_rqt,
+					      MLX5E_INDIR_RQT_SIZE, count);
+
 	/* Switch to new channels, set new parameters and close old ones */
 	mlx5e_switch_priv_channels(priv, &new_channels, NULL);

-- 
2.20.1

^ permalink raw reply related

* [net 2/4] net/mlx5: No command allowed when command interface is not ready
From: Saeed Mahameed @ 2019-02-13 23:44 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Huy Nguyen, Daniel Jurgens, Saeed Mahameed
In-Reply-To: <20190213234451.27029-1-saeedm@mellanox.com>

From: Huy Nguyen <huyn@mellanox.com>

When EEH is injected and PCI bus stalls, mlx5's pci error detect
function is called to deactivate the command interface and tear down
the device. The issue is that there can be a thread that already
passed MLX5_DEVICE_STATE_INTERNAL_ERROR check, it will send the command
and stuck in the wait_func.

Solution:
Add function mlx5_cmd_flush to disable command interface and clear all
the pending commands. When device state is set to
MLX5_DEVICE_STATE_INTERNAL_ERROR, call mlx5_cmd_flush to ensure all
pending threads waiting for firmware commands completion are terminated.

Fixes: c1d4d2e92ad6 ("net/mlx5: Avoid calling sleeping function by the health poll thread")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c  | 18 ++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/health.c   |  2 +-
 .../ethernet/mellanox/mlx5/core/mlx5_core.h    |  1 +
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
index 3e0fa8a8077b..e267ff93e8a8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
@@ -1583,6 +1583,24 @@ void mlx5_cmd_trigger_completions(struct mlx5_core_dev *dev)
 	spin_unlock_irqrestore(&dev->cmd.alloc_lock, flags);
 }
 
+void mlx5_cmd_flush(struct mlx5_core_dev *dev)
+{
+	struct mlx5_cmd *cmd = &dev->cmd;
+	int i;
+
+	for (i = 0; i < cmd->max_reg_cmds; i++)
+		while (down_trylock(&cmd->sem))
+			mlx5_cmd_trigger_completions(dev);
+
+	while (down_trylock(&cmd->pages_sem))
+		mlx5_cmd_trigger_completions(dev);
+
+	/* Unlock cmdif */
+	up(&cmd->pages_sem);
+	for (i = 0; i < cmd->max_reg_cmds; i++)
+		up(&cmd->sem);
+}
+
 static int status_to_err(u8 status)
 {
 	return status ? -1 : 0; /* TBD more meaningful codes */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/health.c b/drivers/net/ethernet/mellanox/mlx5/core/health.c
index 196c07383082..cb9fa3430c53 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/health.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/health.c
@@ -103,7 +103,7 @@ void mlx5_enter_error_state(struct mlx5_core_dev *dev, bool force)
 	mlx5_core_err(dev, "start\n");
 	if (pci_channel_offline(dev->pdev) || in_fatal(dev) || force) {
 		dev->state = MLX5_DEVICE_STATE_INTERNAL_ERROR;
-		mlx5_cmd_trigger_completions(dev);
+		mlx5_cmd_flush(dev);
 	}
 
 	mlx5_notifier_call_chain(dev->priv.events, MLX5_DEV_EVENT_SYS_ERROR, (void *)1);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 5300b0b6d836..4fdac020b795 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -126,6 +126,7 @@ u64 mlx5_read_internal_timer(struct mlx5_core_dev *dev,
 			     struct ptp_system_timestamp *sts);
 
 void mlx5_cmd_trigger_completions(struct mlx5_core_dev *dev);
+void mlx5_cmd_flush(struct mlx5_core_dev *dev);
 int mlx5_cq_debugfs_init(struct mlx5_core_dev *dev);
 void mlx5_cq_debugfs_cleanup(struct mlx5_core_dev *dev);
 
-- 
2.20.1


^ permalink raw reply related

* [net 3/4] net/mlx5: Fix a compilation warning in events.c
From: Saeed Mahameed @ 2019-02-13 23:44 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Tariq Toukan, Mikhael Goikhman, Saeed Mahameed
In-Reply-To: <20190213234451.27029-1-saeedm@mellanox.com>

From: Tariq Toukan <tariqt@mellanox.com>

Eliminate the following compilation warning:

drivers/net/ethernet/mellanox/mlx5/core/events.c: warning: 'error_str'
may be used uninitialized in this function [-Wuninitialized]:  => 238:3

Fixes: c2fb3db22d35 ("net/mlx5: Rework handling of port module events")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Mikhael Goikhman <migo@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/events.c    | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/events.c b/drivers/net/ethernet/mellanox/mlx5/core/events.c
index fbc42b7252a9..503035469d2d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/events.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/events.c
@@ -211,11 +211,10 @@ static int port_module(struct notifier_block *nb, unsigned long type, void *data
 	enum port_module_event_status_type module_status;
 	enum port_module_event_error_type error_type;
 	struct mlx5_eqe_port_module *module_event_eqe;
-	const char *status_str, *error_str;
+	const char *status_str;
 	u8 module_num;
 
 	module_event_eqe = &eqe->data.port_module;
-	module_num = module_event_eqe->module;
 	module_status = module_event_eqe->module_status &
 			PORT_MODULE_EVENT_MODULE_STATUS_MASK;
 	error_type = module_event_eqe->error_type &
@@ -223,25 +222,27 @@ static int port_module(struct notifier_block *nb, unsigned long type, void *data
 
 	if (module_status < MLX5_MODULE_STATUS_NUM)
 		events->pme_stats.status_counters[module_status]++;
-	status_str = mlx5_pme_status_to_string(module_status);
 
-	if (module_status == MLX5_MODULE_STATUS_ERROR) {
+	if (module_status == MLX5_MODULE_STATUS_ERROR)
 		if (error_type < MLX5_MODULE_EVENT_ERROR_NUM)
 			events->pme_stats.error_counters[error_type]++;
-		error_str = mlx5_pme_error_to_string(error_type);
-	}
 
 	if (!printk_ratelimit())
 		return NOTIFY_OK;
 
-	if (module_status == MLX5_MODULE_STATUS_ERROR)
+	module_num = module_event_eqe->module;
+	status_str = mlx5_pme_status_to_string(module_status);
+	if (module_status == MLX5_MODULE_STATUS_ERROR) {
+		const char *error_str = mlx5_pme_error_to_string(error_type);
+
 		mlx5_core_err(events->dev,
 			      "Port module event[error]: module %u, %s, %s\n",
 			      module_num, status_str, error_str);
-	else
+	} else {
 		mlx5_core_info(events->dev,
 			       "Port module event: module %u, %s\n",
 			       module_num, status_str);
+	}
 
 	return NOTIFY_OK;
 }
-- 
2.20.1


^ permalink raw reply related

* [net 4/4] net/mlx5e: XDP, fix redirect resources availability check
From: Saeed Mahameed @ 2019-02-13 23:44 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Saeed Mahameed, Toke Høiland-Jørgensen,
	Tariq Toukan
In-Reply-To: <20190213234451.27029-1-saeedm@mellanox.com>

Currently mlx5 driver creates xdp redirect hw queues unconditionally on
netdevice open, This is great until someone starts redirecting XDP traffic
via ndo_xdp_xmit on mlx5 device and changes the device configuration at
the same time, this might cause crashes, since the other device's napi
is not aware of the mlx5 state change (resources un-availability).

To fix this we must synchronize with other devices napi's on the system.
Added a new flag under mlx5e_priv to determine XDP TX resources are
available, set/clear it up when necessary and use synchronize_rcu()
when the flag is turned off, so other napi's are in-sync with it, before
we actually cleanup the hw resources.

The flag is tested prior to committing to transmit on mlx5e_xdp_xmit, and
it is sufficient to determine if it safe to transmit or not. The other
two internal flags (MLX5E_STATE_OPENED and MLX5E_SQ_STATE_ENABLED) become
unnecessary. Thus, they are removed from data path.

Fixes: 58b99ee3e3eb ("net/mlx5e: Add support for XDP_REDIRECT in device-out side")
Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h    |  1 +
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c    |  6 ++----
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h    | 17 +++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/en_main.c   |  2 ++
 4 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 8fa8fdd30b85..448a92561567 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -657,6 +657,7 @@ struct mlx5e_channel_stats {
 enum {
 	MLX5E_STATE_OPENED,
 	MLX5E_STATE_DESTROYING,
+	MLX5E_STATE_XDP_TX_ENABLED,
 };
 
 struct mlx5e_rqt {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 3740177eed09..03b2a9f9c589 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -365,7 +365,8 @@ int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 	int sq_num;
 	int i;
 
-	if (unlikely(!test_bit(MLX5E_STATE_OPENED, &priv->state)))
+	/* this flag is sufficient, no need to test internal sq state */
+	if (unlikely(!mlx5e_xdp_tx_is_enabled(priv)))
 		return -ENETDOWN;
 
 	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
@@ -378,9 +379,6 @@ int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 
 	sq = &priv->channels.c[sq_num]->xdpsq;
 
-	if (unlikely(!test_bit(MLX5E_SQ_STATE_ENABLED, &sq->state)))
-		return -ENETDOWN;
-
 	for (i = 0; i < n; i++) {
 		struct xdp_frame *xdpf = frames[i];
 		struct mlx5e_xdp_info xdpi;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index 3a67cb3cd179..ee27a7c8cd87 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -50,6 +50,23 @@ void mlx5e_xdp_rx_poll_complete(struct mlx5e_rq *rq);
 int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 		   u32 flags);
 
+static inline void mlx5e_xdp_tx_enable(struct mlx5e_priv *priv)
+{
+	set_bit(MLX5E_STATE_XDP_TX_ENABLED, &priv->state);
+}
+
+static inline void mlx5e_xdp_tx_disable(struct mlx5e_priv *priv)
+{
+	clear_bit(MLX5E_STATE_XDP_TX_ENABLED, &priv->state);
+	/* let other device's napi(s) see our new state */
+	synchronize_rcu();
+}
+
+static inline bool mlx5e_xdp_tx_is_enabled(struct mlx5e_priv *priv)
+{
+	return test_bit(MLX5E_STATE_XDP_TX_ENABLED, &priv->state);
+}
+
 static inline void mlx5e_xmit_xdp_doorbell(struct mlx5e_xdpsq *sq)
 {
 	if (sq->doorbell_cseg) {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 01819e5c9975..93e50ccd44c3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2938,6 +2938,7 @@ void mlx5e_activate_priv_channels(struct mlx5e_priv *priv)
 
 	mlx5e_build_tx2sq_maps(priv);
 	mlx5e_activate_channels(&priv->channels);
+	mlx5e_xdp_tx_enable(priv);
 	netif_tx_start_all_queues(priv->netdev);
 
 	if (mlx5e_is_vport_rep(priv))
@@ -2959,6 +2960,7 @@ void mlx5e_deactivate_priv_channels(struct mlx5e_priv *priv)
 	 */
 	netif_tx_stop_all_queues(priv->netdev);
 	netif_tx_disable(priv->netdev);
+	mlx5e_xdp_tx_disable(priv);
 	mlx5e_deactivate_channels(&priv->channels);
 }
 
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH iproute2 net-next v2 3/4] ss: Buffer raw fields first, then render them as a table
From: David Ahern @ 2019-02-13 23:47 UTC (permalink / raw)
  To: Phil Sutter, Stephen Hemminger, Stefano Brivio, Eric Dumazet,
	netdev, Sabrina Dubroca
In-Reply-To: <20190213233950.GQ26388@orbyte.nwl.cc>

On 2/13/19 4:39 PM, Phil Sutter wrote:
>> What I would favor:
>> 	* use big enough columns that for the common case everything lines up fine
>> 	* if column is to wide just print that element wider (which is what print %Ns does)
> 
> This is pretty much the situation Stefano attempted to improve, minus
> scaling the columns to max terminal width. ss output formatting being
> quirky and unreadable with either small or large terminals was the
> number one reason I heard so far why people prefer netstat.

+1.

prior to Stefano's change ss was a PITA trying to read in an xterm. I
for one would run the command and then have to adjust the terminal to
get it to display an actual readable format.

> 
>> and
>> 	* add json output for programs that want to parse
>> 	* use print_uint etc for that
> 
> For Eric's use-case, skipping any buffering and tabular output if stdout
> is not a TTY suffices. In fact, iproute2 does this already for colored
> output (see check_enable_color() for reference).
> 
> Adding JSON output support everywhere is a nice feature when it comes to
> scripting, but it won't help console users. Unless you expect CLI
> frontends to come turning that JSON into human-readable output.
> 
> IMHO, JSON output wouldn't even help in this case - unless Eric indeed
> prefers to write/use a JSON parser for his analysis instead of something
> along 'ss | grep'.

I agree. json has its uses, console/xterm for humans is not one and
piping into something like jq to selectively pick columns is not a user
friendly solution.

> 
>> The buffering patch (in iproute2-next) can/will be reverted.
> 
> It's not fair to claim that despite Stefano's commitment to fix the
> reported issues. His ss output rewrite is there since v4.15.0 and
> according to git history it needed only two fixes so far. I've had
> one-liners which required more follow-ups than that! Also, we're still
> discovering issues introduced by all the jsonify patches. Allowing for
> people to get things right not the first time but after a few tries is
> important. If you want to revert something, start with features which
> have a fundamental design issue in the exact situation they tried to
> improve, like the MSG_PEEK | MSG_TRUNC thing Hangbin and me wrote.

I was just looking at the overhead of that. While it is deceiving to
twice as many recvmsg calls as you expect, the overhead of the peek in
reading 700k+ routes is on the order of 3% with the 32k min buffer size.
The true overhead of the dump functions for ip is the device index to
name mapping (just like the overhead of a batch is the name to index
mapping). I will send a v2 of my patches soon.


^ permalink raw reply

* Re: [pull request][net 0/4] Mellanox, mlx5 fixes 2019-02-13
From: David Miller @ 2019-02-14  0:10 UTC (permalink / raw)
  To: saeedm; +Cc: netdev
In-Reply-To: <20190213234451.27029-1-saeedm@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Wed, 13 Feb 2019 15:44:47 -0800

> This series introduces some fixes to mlx5 driver.
> For more information please see tag log below.
> 
> Please pull and let me know if there is any problem.

Pulled.

> For -stable v4.19:
> ('net/mlx5e: XDP, fix redirect resources availability check')

Queued up for -stable, thanks Saeed.

^ permalink raw reply

* Re: [PATCH 0/3] Netfilter/IPVS fixes for net
From: David Miller @ 2019-02-14  0:15 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <20190213174758.17275-1-pablo@netfilter.org>

From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Wed, 13 Feb 2019 18:47:55 +0100

> The following patchset contains Netfilter/IPVS fixes for net:
> 
> 1) Missing structure initialization in ebtables causes splat with
>    32-bit user level on a 64-bit kernel, from Francesco Ruggeri.
> 
> 2) Missing dependency on nf_defrag in IPVS IPv6 codebase, from
>    Andrea Claudi.
> 
> 3) Fix possible use-after-free from release path of target extensions.
> 
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Pulled, thanks Pablo.

^ permalink raw reply

* Re: [PATCH net-next 0/4] net: phy: Add 2.5G/5GBASET PHYs support
From: David Miller @ 2019-02-14  0:19 UTC (permalink / raw)
  To: maxime.chevallier
  Cc: netdev, linux-kernel, andrew, f.fainelli, hkallweit1, linux,
	linux-arm-kernel, antoine.tenart, thomas.petazzoni,
	gregory.clement, miquel.raynal, nadavh, stefanc, mw
In-Reply-To: <20190211142529.22885-1-maxime.chevallier@bootlin.com>

From: Maxime Chevallier <maxime.chevallier@bootlin.com>
Date: Mon, 11 Feb 2019 15:25:25 +0100

> The 802.3bz standard defines 2 modes based on the NBASET alliance work
> that allow to use 2.5Gbps and 5Gbps speeds on Cat 5e, 6 and 7 cables.
> 
> This series adds the necessary infrastructure to handle these modes with
> C45 PHYs. This series was originally part of a bigger one, that has
> seen 2 iterations [1] [2] that added support for these modes on Marvell
> Alaska PHYs.
> 
> Following some discussions with Heiner and Andrew [3], we decided to
> split-out the generic parts so that we can work together on the
> following steps to get these mode fully working with Aquantia and
> Marvell PHYS.
> 
> The first 3 patches are reworking some of the internal network phy
> infrastructure to handle the new modes in a more generic way.
> 
> The 4th patch adds all the C45 register definition and accesses that
> follows the 802.3bz standard to support 2.5GBASET and 5GBASET.
> 
> [1] : https://lore.kernel.org/netdev/20190118152352.26417-1-maxime.chevallier@bootlin.com/
> [2] : https://lore.kernel.org/netdev/20190207094939.27369-1-maxime.chevallier@bootlin.com/
> [3] : https://lore.kernel.org/netdev/81c340ea-54b0-1abf-94af-b8dc4ee83e3a@gmail.com/

Series applied, thanks Maxime.

^ permalink raw reply

* [PATCH iproute2-next v2 1/3] ll_map: Add function to remove link cache entry by index
From: David Ahern @ 2019-02-14  0:22 UTC (permalink / raw)
  To: stephen; +Cc: netdev, David Ahern
In-Reply-To: <20190214002249.31866-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

Add ll_drop_by_index to remove an entry from the link cache.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/ll_map.h |  1 +
 lib/ll_map.c     | 14 ++++++++++++++
 2 files changed, 15 insertions(+)

diff --git a/include/ll_map.h b/include/ll_map.h
index 511fe00b8567..4de1041e2746 100644
--- a/include/ll_map.h
+++ b/include/ll_map.h
@@ -9,6 +9,7 @@ unsigned ll_name_to_index(const char *name);
 const char *ll_index_to_name(unsigned idx);
 int ll_index_to_type(unsigned idx);
 int ll_index_to_flags(unsigned idx);
+void ll_drop_by_index(unsigned index);
 unsigned namehash(const char *str);
 
 const char *ll_idx_n2a(unsigned int idx);
diff --git a/lib/ll_map.c b/lib/ll_map.c
index 1ab8ef0758ac..8e8a0b1e9c9d 100644
--- a/lib/ll_map.c
+++ b/lib/ll_map.c
@@ -210,6 +210,20 @@ unsigned ll_name_to_index(const char *name)
 	return idx;
 }
 
+void ll_drop_by_index(unsigned index)
+{
+	struct ll_cache *im;
+
+	im = ll_get_by_index(index);
+	if (!im)
+		return;
+
+	hlist_del(&im->idx_hash);
+	hlist_del(&im->name_hash);
+
+	free(im);
+}
+
 void ll_init_map(struct rtnl_handle *rth)
 {
 	static int initialized;
-- 
2.11.0


^ permalink raw reply related

* [PATCH iproute2-next v2 3/3] Improve batch and dump times by caching link lookups
From: David Ahern @ 2019-02-14  0:22 UTC (permalink / raw)
  To: stephen; +Cc: netdev, David Ahern
In-Reply-To: <20190214002249.31866-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

ip route uses ll_name_to_index and ll_index_to_name to convert between
device names and indices. At the moment both use for the ioctl based glibc
functions if_nametoindex and if_indextoname and does not cache the result.
When using a batch file or dumping large number of routes this means the
same device lookups can be done repeatedly adding unnecessary overhead
(socket + ioctl + close for each device lookup).

Add a new function, ll_link_get, to send a netlink based RTM_GETLINK. If
successful, cache the result in idx_head and name_head so future lookups
can re-use the entry. Update ll_name_to_index and ll_index_to_name to use
ll_link_get and only fallback to the glibc functions if it fails.

With this change the time to install 720,022 routes with 2 ecmp nexthops
where the nexthop device is given is reduced from 31.4 seconds to 19.2
seconds. A dump of those routes drops from 13.3 to 2.8 seconds.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 lib/ll_map.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 1 deletion(-)

diff --git a/lib/ll_map.c b/lib/ll_map.c
index 8e8a0b1e9c9d..2d7b65dcb8f7 100644
--- a/lib/ll_map.c
+++ b/lib/ll_map.c
@@ -152,6 +152,48 @@ static unsigned int ll_idx_a2n(const char *name)
 	return idx;
 }
 
+static int ll_link_get(const char *name, int index)
+{
+	struct {
+		struct nlmsghdr		n;
+		struct ifinfomsg	ifm;
+		char			buf[1024];
+	} req = {
+		.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
+		.n.nlmsg_flags = NLM_F_REQUEST,
+		.n.nlmsg_type = RTM_GETLINK,
+		.ifm.ifi_index = index,
+	};
+	__u32 filt_mask = RTEXT_FILTER_VF | RTEXT_FILTER_SKIP_STATS;
+	struct rtnl_handle rth = {};
+	struct nlmsghdr *answer;
+	int rc = 0;
+
+	if (rtnl_open(&rth, 0) < 0)
+		return 0;
+
+	addattr32(&req.n, sizeof(req), IFLA_EXT_MASK, filt_mask);
+	if (name)
+		addattr_l(&req.n, sizeof(req), IFLA_IFNAME, name,
+			  strlen(name) + 1);
+
+	if (rtnl_talk(&rth, &req.n, &answer) < 0)
+		goto out;
+
+	/* add entry to cache */
+	rc  = ll_remember_index(answer, NULL);
+	if (!rc) {
+		struct ifinfomsg *ifm = NLMSG_DATA(answer);
+
+		rc = ifm->ifi_index;
+	}
+
+	free(answer);
+out:
+	rtnl_close(&rth);
+	return rc;
+}
+
 const char *ll_index_to_name(unsigned int idx)
 {
 	static char buf[IFNAMSIZ];
@@ -164,6 +206,12 @@ const char *ll_index_to_name(unsigned int idx)
 	if (im)
 		return im->name;
 
+	if (ll_link_get(NULL, idx) == idx) {
+		im = ll_get_by_index(idx);
+		if (im)
+			return im->name;
+	}
+
 	if (if_indextoname(idx, buf) == NULL)
 		snprintf(buf, IFNAMSIZ, "if%u", idx);
 
@@ -204,7 +252,9 @@ unsigned ll_name_to_index(const char *name)
 	if (im)
 		return im->index;
 
-	idx = if_nametoindex(name);
+	idx = ll_link_get(name, 0);
+	if (idx == 0)
+		idx = if_nametoindex(name);
 	if (idx == 0)
 		idx = ll_idx_a2n(name);
 	return idx;
-- 
2.11.0


^ permalink raw reply related

* [PATCH iproute2-next v2 0/3] Improve batch and dump times by caching link lookups
From: David Ahern @ 2019-02-14  0:22 UTC (permalink / raw)
  To: stephen; +Cc: netdev, David Ahern

From: David Ahern <dsahern@gmail.com>

Many commands convert device names to an index using ll_name_to_index and
the reverse from an index to a name using ll_index_to_name.

At the moment both of the ll_ functions use the ioctl based helpers from
glibc which involves opening socket, calling ioctl and then closing the
socket on each device lookup. When using a batch file or dumping large
number of routes this means the same device lookups can be done repeatedly
adding unnecessary overhead to both operations.

This series adds a new function, ll_link_get, to send a netlink based
RTM_GETLINK. If successful, the result is cached in idx_head and name_head
so future lookups can re-use the entry. iproute2's ll_map functions are
updated to use ll_link_get over the glibc functions. The result is a
significant speed up in both batch and dumps with negligible overhead if
ip is invoked for single operations.

The first 2 patches add a means to drop an entry from the cache and updates
iplink_modify to use that new function to drop entries on device changes.
This forces the cache to re-learn device information if a batch file has a
mix of link set operations with other commands - such as adding a route.

v2
- changed the second patch to drop cache entry on any link changes
- added ll_link_get to index to name conversion improving dumps

David Ahern (3):
  ll_map: Add function to remove link cache entry by index
  ip link: Drop cache entry on any changes
  Improve batch and dump times by caching link lookups

 include/ll_map.h |  1 +
 ip/iplink.c      |  3 +++
 lib/ll_map.c     | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 69 insertions(+), 1 deletion(-)

-- 
2.11.0

^ permalink raw reply

* [PATCH iproute2-next v2 2/3] ip link: Drop cache entry on any changes
From: David Ahern @ 2019-02-14  0:22 UTC (permalink / raw)
  To: stephen; +Cc: netdev, David Ahern
In-Reply-To: <20190214002249.31866-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

Remove any entry from the link cache when the link is modified.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 ip/iplink.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/ip/iplink.c b/ip/iplink.c
index b5519201fef7..393cefdc89df 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -1083,6 +1083,9 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
 	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		return -2;
 
+	/* remove device from cache; next use can refresh with new data */
+	ll_drop_by_index(req.i.ifi_index);
+
 	return 0;
 }
 
-- 
2.11.0


^ permalink raw reply related

* Re: [PATCH net] sctp: call gso_reset_checksum when computing checksum in sctp_gso_segment
From: David Miller @ 2019-02-14  0:33 UTC (permalink / raw)
  To: lucien.xin; +Cc: linux-kernel, netdev, linux-sctp, marcelo.leitner, nhorman
In-Reply-To: <5b8187d1eabd52e4db7d3e4506d98c33571c1c83.1549968450.git.lucien.xin@gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Tue, 12 Feb 2019 18:47:30 +0800

> Jianlin reported a panic when running sctp gso over gre over vlan device:
 ...
> It was caused by SKB_GSO_CB(skb)->csum_start not set in sctp_gso_segment.
> sctp_gso_segment() calls skb_segment() with 'feature | NETIF_F_HW_CSUM',
> which causes SKB_GSO_CB(skb)->csum_start not to be set in skb_segment().
> 
> For TCP/UDP, when feature supports HW_CSUM, CHECKSUM_PARTIAL will be set
> and gso_reset_checksum will be called to set SKB_GSO_CB(skb)->csum_start.
> 
> So SCTP should do the same as TCP/UDP, to call gso_reset_checksum() when
> computing checksum in sctp_gso_segment.
> 
> Reported-by: Jianlin Shi <jishi@redhat.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net] sctp: set stream ext to NULL after freeing it in sctp_stream_outq_migrate
From: David Miller @ 2019-02-14  0:34 UTC (permalink / raw)
  To: lucien.xin; +Cc: linux-kernel, netdev, linux-sctp, marcelo.leitner, nhorman
In-Reply-To: <0cb9e543c21495df48c3723044d6c9f64f238eca.1549968661.git.lucien.xin@gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Tue, 12 Feb 2019 18:51:01 +0800

> In sctp_stream_init(), after sctp_stream_outq_migrate() freed the
> surplus streams' ext, but sctp_stream_alloc_out() returns -ENOMEM,
> stream->outcnt will not be set to 'outcnt'.
> 
> With the bigger value on stream->outcnt, when closing the assoc and
> freeing its streams, the ext of those surplus streams will be freed
> again since those stream exts were not set to NULL after freeing in
> sctp_stream_outq_migrate(). Then the invalid-free issue reported by
> syzbot would be triggered.
> 
> We fix it by simply setting them to NULL after freeing.
> 
> Fixes: 5bbbbe32a431 ("sctp: introduce stream scheduler foundations")
> Reported-by: syzbot+58e480e7b28f2d890bfd@syzkaller.appspotmail.com
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Also applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH bpf-next 4/4] selftests/bpf: Test static data relocation
From: Alexei Starovoitov @ 2019-02-14  0:35 UTC (permalink / raw)
  To: Joe Stringer; +Cc: bpf, netdev, Daniel Borkmann, ast
In-Reply-To: <CAOftzPjffHZUK=uxiTsuOGLX4A5ag5oqRz6gorhREj6hxMFA4A@mail.gmail.com>

On Tue, Feb 12, 2019 at 12:43:21PM -0800, Joe Stringer wrote:
> 
> Do you see any value in having incremental support in libbpf that
> could be used as a fallback for older kernels like in patch #2 of this
> series? I could imagine libbpf probing kernel support for
> global/static variables and attempting to handle references to .data
> via some more comprehensive mechanism in-kernel, or falling back to
> this approach if it is not available.

I don't think we have to view it as older vs new kernel and fallback discussion.
I think access to static vars can be implemented in libbpf today without
changing llvm and kernel.

For the following code:
static volatile __u32 static_data = 42;

SEC("anything")
int load_static_data(struct __sk_buff *skb)
{
   __u32 value = static_data;

llvm will generate asm:

  r1 = static_data ll
  r1 = *(u32 *)(r1 + 0)

libbpf can replace first insn with r1 = 0 (or remove it altogether)
and second insn with r1 = 42 _when it is safe_.

If there was no volatile keyword llvm would have optimized
these two instructions into operation with immediate constant.
libbpf can do this opimization instead of llvm.
libbpf can check that 'static_data' is indeed not global in elf file
and there are no store operations in all programs in that elf file.
Then every load from that address can be replaced with rX=imm
of the value from data section.
libbpf would need to do data flow analysis which is substantial
feature addition. I think it's inevitable next step anyway.

The key point that this approach will be compatible with future
global variables and modifiable static variables.
In such case load/store instructions will stay as-is
and kernel support will be needed to replace 'r1 = static_data ll'
with properly marked ld_imm64 insn.

^ permalink raw reply

* Re: [PATCH bpf-next v11 0/7] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap
From: David Ahern @ 2019-02-14  0:46 UTC (permalink / raw)
  To: Peter Oskolkov, Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, Willem de Bruijn
In-Reply-To: <20190213195341.184969-1-posk@google.com>

On 2/13/19 12:53 PM, Peter Oskolkov wrote:
> This patchset implements BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
> BPF helper. It enables BPF programs (specifically, BPF_PROG_TYPE_LWT_IN
> and BPF_PROG_TYPE_LWT_XMIT prog types) to add IP encapsulation headers
> to packets (e.g. IP/GRE, GUE, IPIP).
> 
> This is useful when thousands of different short-lived flows should be
> encapped, each with different and dynamically determined destination.
> Although lwtunnels can be used in some of these scenarios, the ability
> to dynamically generate encap headers adds more flexibility, e.g.
> when routing depends on the state of the host (reflected in global bpf
> maps).
> 


For the set:
Reviewed-by: David Ahern <dsahern@gmail.com>



^ permalink raw reply

* [PATCH iproute2] ss: Render buffer to output every time a number of chunks are allocated
From: Stefano Brivio @ 2019-02-14  0:58 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Eric Dumazet, Phil Sutter, David Ahern, Sabrina Dubroca, netdev

Eric reported that, with 10 million sockets, ss -emoi (about 1000 bytes
output per socket) can easily lead to OOM (buffer would grow to 10GB of
memory).

Limit the maximum size of the buffer to five chunks, 1M each. Render and
flush buffers whenever we reach that.

This might make the resulting blocks slightly unaligned between them, with
occasional loss of readability on lines occurring every 5k to 50k sockets
approximately. Something like (from ss -tu):

[...]
CLOSE-WAIT   32       0           192.168.1.50:35232           10.0.0.1:https
ESTAB        0        0           192.168.1.50:53820           10.0.0.1:https
ESTAB       0        0           192.168.1.50:46924            10.0.0.1:https
CLOSE-WAIT  32       0           192.168.1.50:35228            10.0.0.1:https
[...]

However, I don't actually expect any human user to scroll through that
amount of sockets, so readability should be preserved when it matters.

The bulk of the diffstat comes from moving field_next() around, as we now
call render() from it. Functionally, this is implemented by six lines of
code, most of them in field_next().

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: 691bd854bf4a ("ss: Buffer raw fields first, then render them as a table")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
Eric, it would be nice if you could test this with your bazillion sockets,
I checked this with -emoi and "only" 500,000 sockets.

 misc/ss.c | 68 ++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 40 insertions(+), 28 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 9e821faf0d31..28bdcba72d73 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -52,7 +52,8 @@
 #include <linux/tipc_sockets_diag.h>
 
 #define MAGIC_SEQ 123456
-#define BUF_CHUNK (1024 * 1024)
+#define BUF_CHUNK (1024 * 1024)	/* Buffer chunk allocation size */
+#define BUF_CHUNKS_MAX 5	/* Maximum number of allocated buffer chunks */
 #define LEN_ALIGN(x) (((x) + 1) & ~1)
 
 #define DIAG_REQUEST(_req, _r)						    \
@@ -176,6 +177,7 @@ static struct {
 	struct buf_token *cur;	/* Position of current token in chunk */
 	struct buf_chunk *head;	/* First chunk */
 	struct buf_chunk *tail;	/* Current chunk */
+	int chunks;		/* Number of allocated chunks */
 } buffer;
 
 static const char *TCP_PROTO = "tcp";
@@ -936,6 +938,8 @@ static struct buf_chunk *buf_chunk_new(void)
 
 	new->end = buffer.cur->data;
 
+	buffer.chunks++;
+
 	return new;
 }
 
@@ -1080,33 +1084,6 @@ static int field_is_last(struct column *f)
 	return f - columns == COL_MAX - 1;
 }
 
-static void field_next(void)
-{
-	field_flush(current_field);
-
-	if (field_is_last(current_field))
-		current_field = columns;
-	else
-		current_field++;
-}
-
-/* Walk through fields and flush them until we reach the desired one */
-static void field_set(enum col_id id)
-{
-	while (id != current_field - columns)
-		field_next();
-}
-
-/* Print header for all non-empty columns */
-static void print_header(void)
-{
-	while (!field_is_last(current_field)) {
-		if (!current_field->disabled)
-			out("%s", current_field->header);
-		field_next();
-	}
-}
-
 /* Get the next available token in the buffer starting from the current token */
 static struct buf_token *buf_token_next(struct buf_token *cur)
 {
@@ -1132,6 +1109,7 @@ static void buf_free_all(void)
 		free(tmp);
 	}
 	buffer.head = NULL;
+	buffer.chunks = 0;
 }
 
 /* Get current screen width, default to 80 columns if TIOCGWINSZ fails */
@@ -1294,6 +1272,40 @@ static void render(void)
 	current_field = columns;
 }
 
+/* Move to next field, and render buffer if we reached the maximum number of
+ * chunks, at the last field in a line.
+ */
+static void field_next(void)
+{
+	if (field_is_last(current_field) && buffer.chunks >= BUF_CHUNKS_MAX) {
+		render();
+		return;
+	}
+
+	field_flush(current_field);
+	if (field_is_last(current_field))
+		current_field = columns;
+	else
+		current_field++;
+}
+
+/* Walk through fields and flush them until we reach the desired one */
+static void field_set(enum col_id id)
+{
+	while (id != current_field - columns)
+		field_next();
+}
+
+/* Print header for all non-empty columns */
+static void print_header(void)
+{
+	while (!field_is_last(current_field)) {
+		if (!current_field->disabled)
+			out("%s", current_field->header);
+		field_next();
+	}
+}
+
 static void sock_state_print(struct sockstat *s)
 {
 	const char *sock_name;
-- 
2.20.1


^ permalink raw reply related

* [PATCH] tcp: Namespace-ify sysctl_tcp_rmem and sysctl_tcp_wmem
From: Alakesh Haloi @ 2019-02-14  1:12 UTC (permalink / raw)
  To: stable
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Eric Dumazet, netdev, linux-kernel

[ Upstream commit 356d1833b638bd465672aefeb71def3ab93fc17d ]

Note that when a new netns is created, it inherits its
sysctl_tcp_rmem and sysctl_tcp_wmem from initial netns.

This change is needed so that we can refine TCP rcvbuf autotuning,
to take RTT into consideration.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[alakeshh: backport to v4.14: The patch does not apply to v4.14
directly and hence needed manual backport. Function signature for
the function tcp_select_initial_window had to be changed to be able
to pass pointer to struct sock.]
Signed-off-by: Alakesh Haloi <alakeshh@amazon.com>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: <stable@vger.kernel.org> # 4.14.x
---
 include/net/netns/ipv4.h   |  2 ++
 include/net/sock.h         |  3 +++
 include/net/tcp.h          |  5 ++---
 net/ipv4/syncookies.c      |  2 +-
 net/ipv4/sysctl_net_ipv4.c | 34 +++++++++++++++++-----------------
 net/ipv4/tcp.c             | 21 ++++++++-------------
 net/ipv4/tcp_input.c       | 17 +++++++++++------
 net/ipv4/tcp_ipv4.c        | 12 ++++++++++--
 net/ipv4/tcp_minisocks.c   |  2 +-
 net/ipv4/tcp_output.c      |  7 ++++---
 net/ipv6/syncookies.c      |  2 +-
 net/ipv6/tcp_ipv6.c        |  4 ++--
 12 files changed, 62 insertions(+), 49 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 8fcff2837484..ea48e5b8dbda 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -126,6 +126,8 @@ struct netns_ipv4 {
 	int sysctl_tcp_sack;
 	int sysctl_tcp_window_scaling;
 	int sysctl_tcp_timestamps;
+	int sysctl_tcp_wmem[3];
+	int sysctl_tcp_rmem[3];
 	struct inet_timewait_death_row tcp_death_row;
 	int sysctl_max_syn_backlog;
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 4280e96d4b46..cec9b63a482a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1095,8 +1095,11 @@ struct proto {
 	 */
 	unsigned long		*memory_pressure;
 	long			*sysctl_mem;
+
 	int			*sysctl_wmem;
 	int			*sysctl_rmem;
+	u32                     sysctl_wmem_offset;
+	u32                     sysctl_rmem_offset;
 	int			max_header;
 	bool			no_autobind;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0c828aac7e04..a234f0d83184 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -251,8 +251,6 @@ extern int sysctl_tcp_reordering;
 extern int sysctl_tcp_max_reordering;
 extern int sysctl_tcp_dsack;
 extern long sysctl_tcp_mem[3];
-extern int sysctl_tcp_wmem[3];
-extern int sysctl_tcp_rmem[3];
 extern int sysctl_tcp_app_win;
 extern int sysctl_tcp_adv_win_scale;
 extern int sysctl_tcp_frto;
@@ -1322,7 +1320,8 @@ static inline void tcp_slow_start_after_idle_check(struct sock *sk)
 }
 
 /* Determine a window scaling and initial window to offer. */
-void tcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
+void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
+			       __u32 *rcv_wnd,
 			       __u32 *window_clamp, int wscale_ok,
 			       __u8 *rcv_wscale, __u32 init_rcv_wnd);
 
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 77cf32a80952..fda37f2862c9 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -385,7 +385,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	/* Try to redo what tcp_v4_send_synack did. */
 	req->rsk_window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
 
-	tcp_select_initial_window(tcp_full_space(sk), req->mss,
+	tcp_select_initial_window(sk, tcp_full_space(sk), req->mss,
 				  &req->rsk_rcv_wnd, &req->rsk_window_clamp,
 				  ireq->wscale_ok, &rcv_wscale,
 				  dst_metric(&rt->dst, RTAX_INITRWND));
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d82e8344fc54..0a518d3fdd5a 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -508,22 +508,6 @@ static struct ctl_table ipv4_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_doulongvec_minmax,
 	},
-	{
-		.procname	= "tcp_wmem",
-		.data		= &sysctl_tcp_wmem,
-		.maxlen		= sizeof(sysctl_tcp_wmem),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &one,
-	},
-	{
-		.procname	= "tcp_rmem",
-		.data		= &sysctl_tcp_rmem,
-		.maxlen		= sizeof(sysctl_tcp_rmem),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &one,
-	},
 	{
 		.procname	= "tcp_app_win",
 		.data		= &sysctl_tcp_app_win,
@@ -1152,7 +1136,23 @@ static struct ctl_table ipv4_net_table[] = {
 		.data		= &init_net.ipv4.sysctl_tcp_timestamps,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "tcp_wmem",
+		.data		= &init_net.ipv4.sysctl_tcp_wmem,
+		.maxlen		= sizeof(init_net.ipv4.sysctl_tcp_wmem),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
+	},
+	{
+		.procname	= "tcp_rmem",
+		.data		= &init_net.ipv4.sysctl_tcp_rmem,
+		.maxlen		= sizeof(init_net.ipv4.sysctl_tcp_rmem),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
 	},
 	{ }
 };
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index fd14501ac3af..57db728ec5f7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -290,12 +290,8 @@ struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
 long sysctl_tcp_mem[3] __read_mostly;
-int sysctl_tcp_wmem[3] __read_mostly;
-int sysctl_tcp_rmem[3] __read_mostly;
 
 EXPORT_SYMBOL(sysctl_tcp_mem);
-EXPORT_SYMBOL(sysctl_tcp_rmem);
-EXPORT_SYMBOL(sysctl_tcp_wmem);
 
 atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
 EXPORT_SYMBOL(tcp_memory_allocated);
@@ -449,9 +445,8 @@ void tcp_init_sock(struct sock *sk)
 
 	icsk->icsk_sync_mss = tcp_sync_mss;
 
-	sk->sk_sndbuf = sysctl_tcp_wmem[1];
-	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
-
+	sk->sk_sndbuf = sock_net(sk)->ipv4.sysctl_tcp_wmem[1];
+	sk->sk_rcvbuf = sock_net(sk)->ipv4.sysctl_tcp_rmem[1];
 	sk_sockets_allocated_inc(sk);
 }
 EXPORT_SYMBOL(tcp_init_sock);
@@ -3538,13 +3533,13 @@ void __init tcp_init(void)
 	max_wshare = min(4UL*1024*1024, limit);
 	max_rshare = min(6UL*1024*1024, limit);
 
-	sysctl_tcp_wmem[0] = SK_MEM_QUANTUM;
-	sysctl_tcp_wmem[1] = 16*1024;
-	sysctl_tcp_wmem[2] = max(64*1024, max_wshare);
+	init_net.ipv4.sysctl_tcp_wmem[0] = SK_MEM_QUANTUM;
+	init_net.ipv4.sysctl_tcp_wmem[1] = 16 * 1024;
+	init_net.ipv4.sysctl_tcp_wmem[2] = max(64 * 1024, max_wshare);
 
-	sysctl_tcp_rmem[0] = SK_MEM_QUANTUM;
-	sysctl_tcp_rmem[1] = 87380;
-	sysctl_tcp_rmem[2] = max(87380, max_rshare);
+	init_net.ipv4.sysctl_tcp_rmem[0] = SK_MEM_QUANTUM;
+	init_net.ipv4.sysctl_tcp_rmem[1] = 87380;
+	init_net.ipv4.sysctl_tcp_rmem[2] = max(87380, max_rshare);
 
 	pr_info("Hash tables configured (established %u bind %u)\n",
 		tcp_hashinfo.ehash_mask + 1, tcp_hashinfo.bhash_size);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e24c0d7adf65..19b59488d4d5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -340,7 +340,8 @@ static void tcp_sndbuf_expand(struct sock *sk)
 	sndmem *= nr_segs * per_mss;
 
 	if (sk->sk_sndbuf < sndmem)
-		sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
+		sk->sk_sndbuf = min(sndmem,
+				    sock_net(sk)->ipv4.sysctl_tcp_wmem[2]);
 }
 
 /* 2. Tuning advertised window (window_clamp, rcv_ssthresh)
@@ -372,9 +373,10 @@ static void tcp_sndbuf_expand(struct sock *sk)
 static int __tcp_grow_window(const struct sock *sk, const struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct net *net = sock_net(sk);
 	/* Optimize this! */
 	int truesize = tcp_win_from_space(skb->truesize) >> 1;
-	int window = tcp_win_from_space(sysctl_tcp_rmem[2]) >> 1;
+	int window = tcp_win_from_space(net->ipv4.sysctl_tcp_rmem[2]) >> 1;
 
 	while (tp->rcv_ssthresh <= window) {
 		if (truesize <= skb->len)
@@ -417,6 +419,7 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
 static void tcp_fixup_rcvbuf(struct sock *sk)
 {
 	u32 mss = tcp_sk(sk)->advmss;
+	struct net *net = sock_net(sk);
 	int rcvmem;
 
 	rcvmem = 2 * SKB_TRUESIZE(mss + MAX_TCP_HEADER) *
@@ -429,7 +432,7 @@ static void tcp_fixup_rcvbuf(struct sock *sk)
 		rcvmem <<= 2;
 
 	if (sk->sk_rcvbuf < rcvmem)
-		sk->sk_rcvbuf = min(rcvmem, sysctl_tcp_rmem[2]);
+		sk->sk_rcvbuf = min(rcvmem, net->ipv4.sysctl_tcp_rmem[2]);
 }
 
 /* 4. Try to fixup all. It is made immediately after connection enters
@@ -476,15 +479,16 @@ static void tcp_clamp_window(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct net *net = sock_net(sk);
 
 	icsk->icsk_ack.quick = 0;
 
-	if (sk->sk_rcvbuf < sysctl_tcp_rmem[2] &&
+	if (sk->sk_rcvbuf < net->ipv4.sysctl_tcp_rmem[2] &&
 	    !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
 	    !tcp_under_memory_pressure(sk) &&
 	    sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)) {
 		sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
-				    sysctl_tcp_rmem[2]);
+				    net->ipv4.sysctl_tcp_rmem[2]);
 	}
 	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
 		tp->rcv_ssthresh = min(tp->window_clamp, 2U * tp->advmss);
@@ -647,7 +651,8 @@ void tcp_rcv_space_adjust(struct sock *sk)
 			rcvmem += 128;
 
 		do_div(rcvwin, tp->advmss);
-		rcvbuf = min_t(u64, rcvwin * rcvmem, sysctl_tcp_rmem[2]);
+		rcvbuf = min_t(u64, rcvwin * rcvmem,
+			       sock_net(sk)->ipv4.sysctl_tcp_rmem[2]);
 		if (rcvbuf > sk->sk_rcvbuf) {
 			sk->sk_rcvbuf = rcvbuf;
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 31b34c0c2d5f..ae7409861b7d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2428,8 +2428,8 @@ struct proto tcp_prot = {
 	.memory_allocated	= &tcp_memory_allocated,
 	.memory_pressure	= &tcp_memory_pressure,
 	.sysctl_mem		= sysctl_tcp_mem,
-	.sysctl_wmem		= sysctl_tcp_wmem,
-	.sysctl_rmem		= sysctl_tcp_rmem,
+	.sysctl_wmem_offset	= offsetof(struct net, ipv4.sysctl_tcp_wmem),
+	.sysctl_rmem_offset	= offsetof(struct net, ipv4.sysctl_tcp_rmem),
 	.max_header		= MAX_TCP_HEADER,
 	.obj_size		= sizeof(struct tcp_sock),
 	.slab_flags		= SLAB_TYPESAFE_BY_RCU,
@@ -2509,6 +2509,14 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.sysctl_tcp_sack = 1;
 	net->ipv4.sysctl_tcp_window_scaling = 1;
 	net->ipv4.sysctl_tcp_timestamps = 1;
+	if (net != &init_net) {
+		memcpy(net->ipv4.sysctl_tcp_rmem,
+		       init_net.ipv4.sysctl_tcp_rmem,
+		       sizeof(init_net.ipv4.sysctl_tcp_rmem));
+		memcpy(net->ipv4.sysctl_tcp_wmem,
+		       init_net.ipv4.sysctl_tcp_wmem,
+		       sizeof(init_net.ipv4.sysctl_tcp_wmem));
+	}
 
 	return 0;
 fail:
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 61584638dba7..e50139d51ed2 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -378,7 +378,7 @@ void tcp_openreq_init_rwin(struct request_sock *req,
 		full_space = rcv_wnd * mss;
 
 	/* tcp_full_space because it is guaranteed to be the first packet */
-	tcp_select_initial_window(full_space,
+	tcp_select_initial_window(sk_listener, full_space,
 		mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
 		&req->rsk_rcv_wnd,
 		&req->rsk_window_clamp,
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 24bad638c2ec..a87d44a80c7d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -208,12 +208,13 @@ u32 tcp_default_init_rwnd(u32 mss)
  * be a multiple of mss if possible. We assume here that mss >= 1.
  * This MUST be enforced by all callers.
  */
-void tcp_select_initial_window(int __space, __u32 mss,
+void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
 			       __u32 *rcv_wnd, __u32 *window_clamp,
 			       int wscale_ok, __u8 *rcv_wscale,
 			       __u32 init_rcv_wnd)
 {
 	unsigned int space = (__space < 0 ? 0 : __space);
+	struct net *net = sock_net(sk);
 
 	/* If no clamp set the clamp to the max possible scaled window */
 	if (*window_clamp == 0)
@@ -240,7 +241,7 @@ void tcp_select_initial_window(int __space, __u32 mss,
 	(*rcv_wscale) = 0;
 	if (wscale_ok) {
 		/* Set window scaling on max possible window */
-		space = max_t(u32, space, sysctl_tcp_rmem[2]);
+		space = max_t(u32, space, net->ipv4.sysctl_tcp_rmem[2]);
 		space = max_t(u32, space, sysctl_rmem_max);
 		space = min_t(u32, space, *window_clamp);
 		while (space > U16_MAX && (*rcv_wscale) < TCP_MAX_WSCALE) {
@@ -3331,7 +3332,7 @@ static void tcp_connect_init(struct sock *sk)
 	if (rcv_wnd == 0)
 		rcv_wnd = dst_metric(dst, RTAX_INITRWND);
 
-	tcp_select_initial_window(tcp_full_space(sk),
+	tcp_select_initial_window(sk, tcp_full_space(sk),
 				  tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
 				  &tp->rcv_wnd,
 				  &tp->window_clamp,
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 4e7817abc0b9..e7a3a6b6cf56 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -244,7 +244,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	}
 
 	req->rsk_window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW);
-	tcp_select_initial_window(tcp_full_space(sk), req->mss,
+	tcp_select_initial_window(sk, tcp_full_space(sk), req->mss,
 				  &req->rsk_rcv_wnd, &req->rsk_window_clamp,
 				  ireq->wscale_ok, &rcv_wscale,
 				  dst_metric(dst, RTAX_INITRWND));
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index ba8586aadffa..de89bcee62d7 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1940,8 +1940,8 @@ struct proto tcpv6_prot = {
 	.memory_pressure	= &tcp_memory_pressure,
 	.orphan_count		= &tcp_orphan_count,
 	.sysctl_mem		= sysctl_tcp_mem,
-	.sysctl_wmem		= sysctl_tcp_wmem,
-	.sysctl_rmem		= sysctl_tcp_rmem,
+	.sysctl_wmem_offset	= offsetof(struct net, ipv4.sysctl_tcp_wmem),
+	.sysctl_rmem_offset	= offsetof(struct net, ipv4.sysctl_tcp_rmem),
 	.max_header		= MAX_TCP_HEADER,
 	.obj_size		= sizeof(struct tcp6_sock),
 	.slab_flags		= SLAB_TYPESAFE_BY_RCU,
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH net] net: phy: fix interrupt handling in non-started states
From: Andrew Lunn @ 2019-02-14  1:17 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: Florian Fainelli, David Miller, netdev@vger.kernel.org,
	Russell King - ARM Linux
In-Reply-To: <25e86edc-0b88-8c03-b692-776e971331f2@gmail.com>

On Tue, Feb 12, 2019 at 07:56:15PM +0100, Heiner Kallweit wrote:
> phylib enables interrupts before phy_start() has been called, and if
> we receive an interrupt in a non-started state, the interrupt handler
> returns IRQ_NONE. This causes problems with at least one Marvell chip
> as reported by Andrew.
> Fix this by handling interrupts the same as in phy_mac_interrupt(),
> basically always running the phylib state machine. It knows when it
> has to do something and when not.
> This change allows to handle interrupts gracefully even if they
> occur in a non-started state.
> 
> Fixes: 2b3e88ea6528 ("net: phy: improve phy state checking")
> Reported-by: Andrew Lunn <andrew@lunn.ch>
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH net] dsa: mv88e6xxx: Ensure all pending interrupts are handled prior to exit
From: Andrew Lunn @ 2019-02-14  2:07 UTC (permalink / raw)
  To: John David Anglin; +Cc: Russell King, Vivien Didelot, Florian Fainelli, netdev
In-Reply-To: <6a1ebc61-3505-beb8-21cb-ea42ad9fe67e@bell.net>

On Mon, Feb 11, 2019 at 01:40:21PM -0500, John David Anglin wrote:
> The GPIO interrupt controller on the espressobin board only supports edge interrupts.
> If one enables the use of hardware interrupts in the device tree for the 88E6341, it is
> possible to miss an edge.  When this happens, the INTn pin on the Marvell switch is
> stuck low and no further interrupts occur.
> 
> I found after adding debug statements to mv88e6xxx_g1_irq_thread_work() that there is
> a race in handling device interrupts (e.g. PHY link interrupts).  Some interrupts are
> directly cleared by reading the Global 1 status register.  However, the device interrupt
> flag, for example, is not cleared until all the unmasked SERDES and PHY ports are serviced.
> This is done by reading the relevant SERDES and PHY status register.
> 
> The code only services interrupts whose status bit is set at the time of reading its status
> register.  If an interrupt event occurs after its status is read and before all interrupts
> are serviced, then this event will not be serviced and the INTn output pin will remain low.
> 
> This is not a problem with polling or level interrupts since the handler will be called
> again to process the event.  However, it's a big problem when using level interrupts.
> 
> The fix presented here is to add a loop around the code servicing switch interrupts.  If
> any pending interrupts remain after the current set has been handled, we loop and process
> the new set.  If there are no pending interrupts after servicing, we are sure that INTn has
> gone high and we will get an edge when a new event occurs.
> 
> Tested on espressobin board.
> 
> Signed-off-by:  John David Anglin <dave.anglin@bell.net>

Fixes: dc30c35be720 ("net: dsa: mv88e6xxx: Implement interrupt support.")

Tested-by: Andrew Lunn <andrew@lunn.ch>

David, please ensure that Heiner's patch:

net: phy: fix interrupt handling in non-started states

is applied first. Otherwise we can get into an interrupt storm.

    Andrew

^ permalink raw reply

* [PATCH net-next v2] bonding: check slave set command firstly
From: xiangxia.m.yue @ 2019-02-11 18:49 UTC (permalink / raw)
  To: davem; +Cc: netdev, Tonghao Zhang

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>

This patch is a little improvement. If user use the
command shown as below, we should print the info [1]
instead of [2]. The eth0 exists actually, and it may
confuse user.

$ echo "eth0" > /sys/class/net/bond4/bonding/slaves

[1] "bond4: no command found in slaves file - use +ifname or -ifname"
[2] "write error: No such device"

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
---
v2: fix compiling warning.
---
 drivers/net/bonding/bond_options.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index 4d5d01c..da1fc17 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -1375,6 +1375,7 @@ static int bond_option_slaves_set(struct bonding *bond,
 	sscanf(newval->string, "%16s", command); /* IFNAMSIZ*/
 	ifname = command + 1;
 	if ((strlen(command) <= 1) ||
+	    (command[0] != '+' && command[0] != '-') ||
 	    !dev_valid_name(ifname))
 		goto err_no_cmd;
 
@@ -1398,6 +1399,7 @@ static int bond_option_slaves_set(struct bonding *bond,
 		break;
 
 	default:
+		/* should not run here. */
 		goto err_no_cmd;
 	}
 
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH bpf-next v11 0/7] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap
From: Alexei Starovoitov @ 2019-02-14  2:39 UTC (permalink / raw)
  To: David Ahern
  Cc: Peter Oskolkov, Alexei Starovoitov, Daniel Borkmann, netdev,
	Peter Oskolkov, Willem de Bruijn
In-Reply-To: <783b5578-cba4-904d-4ade-c8c08b47a3ba@gmail.com>

On Wed, Feb 13, 2019 at 05:46:26PM -0700, David Ahern wrote:
> On 2/13/19 12:53 PM, Peter Oskolkov wrote:
> > This patchset implements BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
> > BPF helper. It enables BPF programs (specifically, BPF_PROG_TYPE_LWT_IN
> > and BPF_PROG_TYPE_LWT_XMIT prog types) to add IP encapsulation headers
> > to packets (e.g. IP/GRE, GUE, IPIP).
> > 
> > This is useful when thousands of different short-lived flows should be
> > encapped, each with different and dynamically determined destination.
> > Although lwtunnels can be used in some of these scenarios, the ability
> > to dynamically generate encap headers adds more flexibility, e.g.
> > when routing depends on the state of the host (reflected in global bpf
> > maps).
> > 
> 
> 
> For the set:
> Reviewed-by: David Ahern <dsahern@gmail.com>

Applied. Thanks everyone!


^ permalink raw reply

* [PATCH net-next v5] ipmr: ip6mr: Create new sockopt to clear mfc cache or vifs
From: Callum Sinclair @ 2019-02-14  2:44 UTC (permalink / raw)
  To: davem, kuznet, yoshfuji, nikolay, netdev, linux-kernel
  Cc: nicolas.dichtel, Callum Sinclair
In-Reply-To: <20190214024418.21490-1-callum.sinclair@alliedtelesis.co.nz>

Currently the only way to clear the forwarding cache was to delete the
entries one by one using the MRT_DEL_MFC socket option or to destroy and
recreate the socket.

Create a new socket option which with the use of optional flags can
clear any combination of multicast entries (static or not static) and
multicast vifs (static or not static).

Calling the new socket option MRT_FLUSH with the flags MRT_FLUSH_MFC and
MRT_FLUSH_VIFS will clear all entries and vifs on the socket except for
static entries.

Signed-off-by: Callum Sinclair <callum.sinclair@alliedtelesis.co.nz>
---
v1 -> v2:
  Implemented additional flags for static entries
v2 -> v3:
  Cleaned up flag logic so any combination of routes can be cleared.
  Fixed style errors
  Fixed incorrect flag values
v3 -> v4:
  Fixed style errors
  Fixed incorrect flag (MRT_FLUSH was used instead of MRT_FLUSH_VIFS)
v4 -> v5:
  Only clear the unresolved queue when MRT_FLUSH_MFC flag is set.

 include/uapi/linux/mroute.h  |  9 ++++-
 include/uapi/linux/mroute6.h |  9 ++++-
 net/ipv4/ipmr.c              | 75 +++++++++++++++++++++-------------
 net/ipv6/ip6mr.c             | 78 +++++++++++++++++++++++-------------
 4 files changed, 115 insertions(+), 56 deletions(-)

diff --git a/include/uapi/linux/mroute.h b/include/uapi/linux/mroute.h
index 5d37a9ccce63..11c8c1fc1124 100644
--- a/include/uapi/linux/mroute.h
+++ b/include/uapi/linux/mroute.h
@@ -28,12 +28,19 @@
 #define MRT_TABLE	(MRT_BASE+9)	/* Specify mroute table ID		*/
 #define MRT_ADD_MFC_PROXY	(MRT_BASE+10)	/* Add a (*,*|G) mfc entry	*/
 #define MRT_DEL_MFC_PROXY	(MRT_BASE+11)	/* Del a (*,*|G) mfc entry	*/
-#define MRT_MAX		(MRT_BASE+11)
+#define MRT_FLUSH	(MRT_BASE+12)	/* Flush all mfc entries and/or vifs	*/
+#define MRT_MAX		(MRT_BASE+12)
 
 #define SIOCGETVIFCNT	SIOCPROTOPRIVATE	/* IP protocol privates */
 #define SIOCGETSGCNT	(SIOCPROTOPRIVATE+1)
 #define SIOCGETRPF	(SIOCPROTOPRIVATE+2)
 
+/* MRT_FLUSH optional flags */
+#define MRT_FLUSH_MFC	1	/* Flush multicast entries */
+#define MRT_FLUSH_MFC_STATIC	2	/* Flush static multicast entries */
+#define MRT_FLUSH_VIFS	4	/* Flush multicast vifs */
+#define MRT_FLUSH_VIFS_STATIC	8	/* Flush static multicast vifs */
+
 #define MAXVIFS		32
 typedef unsigned long vifbitmap_t;	/* User mode code depends on this lot */
 typedef unsigned short vifi_t;
diff --git a/include/uapi/linux/mroute6.h b/include/uapi/linux/mroute6.h
index 9999cc006390..ac84ef11b29c 100644
--- a/include/uapi/linux/mroute6.h
+++ b/include/uapi/linux/mroute6.h
@@ -31,12 +31,19 @@
 #define MRT6_TABLE	(MRT6_BASE+9)	/* Specify mroute table ID		*/
 #define MRT6_ADD_MFC_PROXY	(MRT6_BASE+10)	/* Add a (*,*|G) mfc entry	*/
 #define MRT6_DEL_MFC_PROXY	(MRT6_BASE+11)	/* Del a (*,*|G) mfc entry	*/
-#define MRT6_MAX	(MRT6_BASE+11)
+#define MRT6_FLUSH	(MRT6_BASE+12)	/* Flush all mfc entries and/or vifs	*/
+#define MRT6_MAX	(MRT6_BASE+12)
 
 #define SIOCGETMIFCNT_IN6	SIOCPROTOPRIVATE	/* IP protocol privates */
 #define SIOCGETSGCNT_IN6	(SIOCPROTOPRIVATE+1)
 #define SIOCGETRPF	(SIOCPROTOPRIVATE+2)
 
+/* MRT6_FLUSH optional flags */
+#define MRT6_FLUSH_MFC	1	/* Flush multicast entries */
+#define MRT6_FLUSH_MFC_STATIC	2	/* Flush static multicast entries */
+#define MRT6_FLUSH_VIFS	4	/* Flushing multicast vifs */
+#define MRT6_FLUSH_VIFS_STATIC	8	/* Flush static multicast vifs */
+
 #define MAXMIFS		32
 typedef unsigned long mifbitmap_t;	/* User mode code depends on this lot */
 typedef unsigned short mifi_t;
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index e536970557dd..53869779af74 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -110,7 +110,7 @@ static int ipmr_cache_report(struct mr_table *mrt,
 static void mroute_netlink_event(struct mr_table *mrt, struct mfc_cache *mfc,
 				 int cmd);
 static void igmpmsg_netlink_event(struct mr_table *mrt, struct sk_buff *pkt);
-static void mroute_clean_tables(struct mr_table *mrt, bool all);
+static void mroute_clean_tables(struct mr_table *mrt, int flags);
 static void ipmr_expire_process(struct timer_list *t);
 
 #ifdef CONFIG_IP_MROUTE_MULTIPLE_TABLES
@@ -415,7 +415,8 @@ static struct mr_table *ipmr_new_table(struct net *net, u32 id)
 static void ipmr_free_table(struct mr_table *mrt)
 {
 	del_timer_sync(&mrt->ipmr_expire_timer);
-	mroute_clean_tables(mrt, true);
+	mroute_clean_tables(mrt, MRT_FLUSH_VIFS | MRT_FLUSH_VIFS_STATIC |
+					  MRT_FLUSH_MFC | MRT_FLUSH_MFC_STATIC);
 	rhltable_destroy(&mrt->mfc_hash);
 	kfree(mrt);
 }
@@ -1296,7 +1297,7 @@ static int ipmr_mfc_add(struct net *net, struct mr_table *mrt,
 }
 
 /* Close the multicast socket, and clear the vif tables etc */
-static void mroute_clean_tables(struct mr_table *mrt, bool all)
+static void mroute_clean_tables(struct mr_table *mrt, int flags)
 {
 	struct net *net = read_pnet(&mrt->net);
 	struct mr_mfc *c, *tmp;
@@ -1305,35 +1306,44 @@ static void mroute_clean_tables(struct mr_table *mrt, bool all)
 	int i;
 
 	/* Shut down all active vif entries */
-	for (i = 0; i < mrt->maxvif; i++) {
-		if (!all && (mrt->vif_table[i].flags & VIFF_STATIC))
-			continue;
-		vif_delete(mrt, i, 0, &list);
+	if (flags & (MRT_FLUSH_VIFS | MRT_FLUSH_VIFS_STATIC)) {
+		for (i = 0; i < mrt->maxvif; i++) {
+			if (((mrt->vif_table[i].flags & VIFF_STATIC) &&
+			     !(flags & MRT_FLUSH_VIFS_STATIC)) ||
+			    (!(mrt->vif_table[i].flags & VIFF_STATIC) && !(flags & MRT_FLUSH_VIFS)))
+				continue;
+			vif_delete(mrt, i, 0, &list);
+		}
+		unregister_netdevice_many(&list);
 	}
-	unregister_netdevice_many(&list);
 
 	/* Wipe the cache */
-	list_for_each_entry_safe(c, tmp, &mrt->mfc_cache_list, list) {
-		if (!all && (c->mfc_flags & MFC_STATIC))
-			continue;
-		rhltable_remove(&mrt->mfc_hash, &c->mnode, ipmr_rht_params);
-		list_del_rcu(&c->list);
-		cache = (struct mfc_cache *)c;
-		call_ipmr_mfc_entry_notifiers(net, FIB_EVENT_ENTRY_DEL, cache,
-					      mrt->id);
-		mroute_netlink_event(mrt, cache, RTM_DELROUTE);
-		mr_cache_put(c);
-	}
-
-	if (atomic_read(&mrt->cache_resolve_queue_len) != 0) {
-		spin_lock_bh(&mfc_unres_lock);
-		list_for_each_entry_safe(c, tmp, &mrt->mfc_unres_queue, list) {
-			list_del(&c->list);
+	if (flags & (MRT_FLUSH_MFC | MRT_FLUSH_MFC_STATIC)) {
+		list_for_each_entry_safe(c, tmp, &mrt->mfc_cache_list, list) {
+			if (((c->mfc_flags & MFC_STATIC) && !(flags & MRT_FLUSH_MFC_STATIC)) ||
+			    (!(c->mfc_flags & MFC_STATIC) && !(flags & MRT_FLUSH_MFC)))
+				continue;
+			rhltable_remove(&mrt->mfc_hash, &c->mnode, ipmr_rht_params);
+			list_del_rcu(&c->list);
 			cache = (struct mfc_cache *)c;
+			call_ipmr_mfc_entry_notifiers(net, FIB_EVENT_ENTRY_DEL, cache,
+						      mrt->id);
 			mroute_netlink_event(mrt, cache, RTM_DELROUTE);
-			ipmr_destroy_unres(mrt, cache);
+			mr_cache_put(c);
+		}
+	}
+
+	if (flags & MRT_FLUSH_MFC) {
+		if (atomic_read(&mrt->cache_resolve_queue_len) != 0) {
+			spin_lock_bh(&mfc_unres_lock);
+			list_for_each_entry_safe(c, tmp, &mrt->mfc_unres_queue, list) {
+				list_del(&c->list);
+				cache = (struct mfc_cache *)c;
+				mroute_netlink_event(mrt, cache, RTM_DELROUTE);
+				ipmr_destroy_unres(mrt, cache);
+			}
+			spin_unlock_bh(&mfc_unres_lock);
 		}
-		spin_unlock_bh(&mfc_unres_lock);
 	}
 }
 
@@ -1354,7 +1364,7 @@ static void mrtsock_destruct(struct sock *sk)
 						    NETCONFA_IFINDEX_ALL,
 						    net->ipv4.devconf_all);
 			RCU_INIT_POINTER(mrt->mroute_sk, NULL);
-			mroute_clean_tables(mrt, false);
+			mroute_clean_tables(mrt, MRT_FLUSH_VIFS | MRT_FLUSH_MFC);
 		}
 	}
 	rtnl_unlock();
@@ -1479,6 +1489,17 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval,
 					   sk == rtnl_dereference(mrt->mroute_sk),
 					   parent);
 		break;
+	case MRT_FLUSH:
+		if (optlen != sizeof(val)) {
+			ret = -EINVAL;
+			break;
+		}
+		if (get_user(val, (int __user *)optval)) {
+			ret = -EFAULT;
+			break;
+		}
+		mroute_clean_tables(mrt, val);
+		break;
 	/* Control PIM assert. */
 	case MRT_ASSERT:
 		if (optlen != sizeof(val)) {
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index cc01aa3f2b5e..b67a7c1e3615 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -97,7 +97,7 @@ static void mr6_netlink_event(struct mr_table *mrt, struct mfc6_cache *mfc,
 static void mrt6msg_netlink_event(struct mr_table *mrt, struct sk_buff *pkt);
 static int ip6mr_rtm_dumproute(struct sk_buff *skb,
 			       struct netlink_callback *cb);
-static void mroute_clean_tables(struct mr_table *mrt, bool all);
+static void mroute_clean_tables(struct mr_table *mrt, int flags);
 static void ipmr_expire_process(struct timer_list *t);
 
 #ifdef CONFIG_IPV6_MROUTE_MULTIPLE_TABLES
@@ -393,7 +393,8 @@ static struct mr_table *ip6mr_new_table(struct net *net, u32 id)
 static void ip6mr_free_table(struct mr_table *mrt)
 {
 	del_timer_sync(&mrt->ipmr_expire_timer);
-	mroute_clean_tables(mrt, true);
+	mroute_clean_tables(mrt, MRT6_FLUSH_VIFS | MRT6_FLUSH_VIFS_STATIC |
+					  MRT6_FLUSH_MFC | MRT6_FLUSH_MFC_STATIC);
 	rhltable_destroy(&mrt->mfc_hash);
 	kfree(mrt);
 }
@@ -1496,42 +1497,51 @@ static int ip6mr_mfc_add(struct net *net, struct mr_table *mrt,
  *	Close the multicast socket, and clear the vif tables etc
  */
 
-static void mroute_clean_tables(struct mr_table *mrt, bool all)
+static void mroute_clean_tables(struct mr_table *mrt, int flags)
 {
 	struct mr_mfc *c, *tmp;
 	LIST_HEAD(list);
 	int i;
 
 	/* Shut down all active vif entries */
-	for (i = 0; i < mrt->maxvif; i++) {
-		if (!all && (mrt->vif_table[i].flags & VIFF_STATIC))
-			continue;
-		mif6_delete(mrt, i, 0, &list);
+	if (flags & (MRT6_FLUSH_VIFS | MRT6_FLUSH_VIFS_STATIC)) {
+		for (i = 0; i < mrt->maxvif; i++) {
+			if (((mrt->vif_table[i].flags & VIFF_STATIC) &&
+			     !(flags & MRT6_FLUSH_VIFS_STATIC)) ||
+			    (!(mrt->vif_table[i].flags & VIFF_STATIC) && !(flags & MRT6_FLUSH_VIFS)))
+				continue;
+			mif6_delete(mrt, i, 0, &list);
+		}
+		unregister_netdevice_many(&list);
 	}
-	unregister_netdevice_many(&list);
 
 	/* Wipe the cache */
-	list_for_each_entry_safe(c, tmp, &mrt->mfc_cache_list, list) {
-		if (!all && (c->mfc_flags & MFC_STATIC))
-			continue;
-		rhltable_remove(&mrt->mfc_hash, &c->mnode, ip6mr_rht_params);
-		list_del_rcu(&c->list);
-		call_ip6mr_mfc_entry_notifiers(read_pnet(&mrt->net),
-					       FIB_EVENT_ENTRY_DEL,
-					       (struct mfc6_cache *)c, mrt->id);
-		mr6_netlink_event(mrt, (struct mfc6_cache *)c, RTM_DELROUTE);
-		mr_cache_put(c);
+	if (flags & (MRT6_FLUSH_MFC | MRT6_FLUSH_MFC_STATIC)) {
+		list_for_each_entry_safe(c, tmp, &mrt->mfc_cache_list, list) {
+			if (((c->mfc_flags & MFC_STATIC) && !(flags & MRT6_FLUSH_MFC_STATIC)) ||
+			    (!(c->mfc_flags & MFC_STATIC) && !(flags & MRT6_FLUSH_MFC)))
+				continue;
+			rhltable_remove(&mrt->mfc_hash, &c->mnode, ip6mr_rht_params);
+			list_del_rcu(&c->list);
+			call_ip6mr_mfc_entry_notifiers(read_pnet(&mrt->net),
+						       FIB_EVENT_ENTRY_DEL,
+										   (struct mfc6_cache *)c, mrt->id);
+			mr6_netlink_event(mrt, (struct mfc6_cache *)c, RTM_DELROUTE);
+			mr_cache_put(c);
+		}
 	}
 
-	if (atomic_read(&mrt->cache_resolve_queue_len) != 0) {
-		spin_lock_bh(&mfc_unres_lock);
-		list_for_each_entry_safe(c, tmp, &mrt->mfc_unres_queue, list) {
-			list_del(&c->list);
-			mr6_netlink_event(mrt, (struct mfc6_cache *)c,
-					  RTM_DELROUTE);
-			ip6mr_destroy_unres(mrt, (struct mfc6_cache *)c);
+	if (flags & MRT6_FLUSH_MFC) {
+		if (atomic_read(&mrt->cache_resolve_queue_len) != 0) {
+			spin_lock_bh(&mfc_unres_lock);
+			list_for_each_entry_safe(c, tmp, &mrt->mfc_unres_queue, list) {
+				list_del(&c->list);
+				mr6_netlink_event(mrt, (struct mfc6_cache *)c,
+						  RTM_DELROUTE);
+				ip6mr_destroy_unres(mrt, (struct mfc6_cache *)c);
+			}
+			spin_unlock_bh(&mfc_unres_lock);
 		}
-		spin_unlock_bh(&mfc_unres_lock);
 	}
 }
 
@@ -1587,7 +1597,7 @@ int ip6mr_sk_done(struct sock *sk)
 						     NETCONFA_IFINDEX_ALL,
 						     net->ipv6.devconf_all);
 
-			mroute_clean_tables(mrt, false);
+			mroute_clean_tables(mrt, MRT6_FLUSH_VIFS | MRT6_FLUSH_MFC);
 			err = 0;
 			break;
 		}
@@ -1703,6 +1713,20 @@ int ip6_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, uns
 		rtnl_unlock();
 		return ret;
 
+	case MRT6_FLUSH:
+	{
+		int flags;
+
+		if (optlen != sizeof(flags))
+			return -EINVAL;
+		if (get_user(flags, (int __user *)optval))
+			return -EFAULT;
+		rtnl_lock();
+		mroute_clean_tables(mrt, flags);
+		rtnl_unlock();
+		return 0;
+	}
+
 	/*
 	 *	Control PIM assert (to activate pim will activate assert)
 	 */
-- 
2.20.1


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox