Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH V3 net-next 1/2] tcp: send in-queue bytes in cmsg upon read
From: Soheil Hassas Yeganeh @ 2018-05-01 19:28 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Yuchung Cheng, Neal Cardwell, Eric Dumazet,
	Willem de Bruijn
In-Reply-To: <20180501.143447.737462619555001271.davem@davemloft.net>

On Tue, May 1, 2018 at 2:34 PM, David Miller <davem@davemloft.net> wrote:
> From: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
> Date: Tue,  1 May 2018 10:11:27 -0400
>
>> +static inline int tcp_inq_hint(struct sock *sk)
>
> Please do not use 'inline' in foo.c files, let the compiler decide.
>
> Otherwise looks great, thanks.

Oops, sorry about this. Will send a v4.

^ permalink raw reply

* Re: [PATCH net-next v6] Add Common Applications Kept Enhanced (cake) qdisc
From: Dave Taht @ 2018-05-01 19:22 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Eric Dumazet, Cong Wang, Linux Kernel Network Developers,
	Cake List
In-Reply-To: <878t932ont.fsf@toke.dk>

On Tue, May 1, 2018 at 11:53 AM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
> Eric Dumazet <eric.dumazet@gmail.com> writes:
>
>> On 04/30/2018 02:27 PM, Dave Taht wrote:
>>
>>> I actually have a tc - bpf based ack filter, during the development of
>>> cake's ack-thinner, that I should submit one of these days. It
>>> proved to be of limited use.
>>>
>>> Probably the biggest mistake we made is by calling this cake feature a
>>> filter. It isn't.
>>>
>>> Maybe we should have called it a "thinner" or something like that? In
>>> order to properly "thin" or "reduce" an ack stream
>>> you have to have a queue to look at and some related state. TC filters
>>> do not operate on queues, qdiscs do. Thus the "ack-filter" here is
>>> deeply embedded into cake's flow isolation and queue structures.
>>
>> A feature eating packets _is_ a filter.
>>
>> Given that a qdisc only sees one direction, I really do not get why
>> ack-filter is so desirable in a packet scheduler.
>
> The ACK filter in CAKE is there to solve a very particular use case:
> Residential internet connections with bandwidths so asymmetric that it
> hurts TCP performance. It is not on by default; but rather meant to be
> configured by users which suffer from this particular ISP brokenness
> (which, sadly, does happen; there are ISPs who believe a 50/1 bandwidth
> ratio is reasonable). For those users, the ACK filter can help.
>
> We certainly do not advise people to turn it on in the general case! As
> you point, in general TCP performance is best improved in the TCP stack...
>
>> You have not provided any numbers to show how useful it is to maintain
>> this code
>
> You mean apart from that in the linked blog post and the paper?
> Here's an executive summary:
>
> On a 30/1 Mbps connection with a bidirectional traffic test (RRUL),
> turning on the ACK filter improves downstream throughput by ~20% and
> upstream throughput by ~12% in conservative mode and ~40% in aggressive
> mode, at the cost of ~5ms of inter-flow latency due to the increased
> congestion.

On a simulated 50/1 comcast connection, I got double the throughput
on a similar test, with no obvious glitches in the tcp flow, in the early stages
of development.

http://blog.cerowrt.org/post/ack_filtering/

I was very, very dubious about the value of ack thinning until that point,
but it was hard to argue with the data.

> In *really* pathological cases, the effect can be a lot more; for
> instance, the ACK filter increases the achievable downstream throughput
> on a link with 100 Kbps in the upstream direction by an order of
> magnitude (from ~2.5 Mbps to ~25 Mbps).
>
>> (probably still broken btw, considering it is changing some skb
>> attributes).
>
> As you may have noticed over the last few iterations, I've actually been
> trying to fix any brokenness. If you could be specific as to what is
> still broken, that would be helpful.
>
> (I'm assuming are referring to the calls to skb_set_inner* - but do you
> mean all of them, or just the ones that set inner == outer?)
>
>> On wifi (or any half duplex medium), you might gain something by
>> carefully sending ACK not too often, but ultimately this should be
>> done by TCP stack, in well controlled environment [1], instead of
>> middle-boxes happily playing/breaking some Internet standards.
>>
>> [1] TCP stack has the estimations of RTT, RWIN, throughput, and might
>> be able to avoid flooding the network with acks under some conditions.
>
> You are quite right that in general, TCP performance is best improved in
> the TCP stack. But home users are not generally in control of that; the
> ACK filter helps in those specific deployments (again, it's off by
> default, and you shouldn't turn it on in the general case!).
>
>> Say RTT is 100ms, and we receive 1 packet every 100 usec (no GRO
>> aggregation) Maybe we do not really _need_ to send 5000 ack per second
>> (or even 10,000 ack per second if a hole needs a repair)
>
> Yes, please do fix that.

:) I really would like to see cake tested at 10GigE and above, and its
performance improved overall. I tend to think we need more queues than 1024
at 40GigE+, and we presently run out of cpu (even unshaped) long
before we hit that point.

>
>> Also on wifi, the queue builds in the driver queues anyway, not in the
>> qdisc.
>
> Actually, we've fixed that (for some drivers); there is now no qdisc on
> WiFi interfaces, and we've integrated FQ-CoDel'ed queueing into the
> stack where it can be effective. But no, running CAKE with an ACK filter
> on a WiFi link is not going to be effective. Don't do that.

I share the belief with eric that thinning acks (either at the wifi qdisc or
in tcp) on wifi interfaces will help - given that the underlying wifi layer
is reliable and does retransmits, and the number of packets that can
fit into a wifi aggregate limited,
you really only need one ack per wifi aggregate per flow to keep the
tcp connection running.

That said, I'd much rather see fq_codel work with more wifi drivers
than pursue this.

>
>> So ACK filtering, if _really_ successful, would need to be
>> modularized.

Heh. I keep hoping ISPs will ship symmetric links and wifi antennas

> I really hope ACK filtering won't be "_really_ successful". Again, it is
> not meant to be a feature that's on everywhere, it's targeting a very
> specific use case that CAKE is optimised for: The home router use case.

Please note that I find cake far more general purpose than just this, the
ease of just slamming:

tc qdisc add dev eth0 root cake bandwidth 50mbit

on a link that needs it is far easier than the equivalent htb +
fq_codel + other filters, and more effective.

That mode is with nat on, some diffserv awareness (more correct than
pfifo_fast), no link layer compensation, no ack-filter, and "just
works".

Certainly the major use case to date has been on home routers. Every
feature in cake was based
on extensive feedback from that part of the field.

>> Please split Cake into a patch series.
>> Presumably putting the ack-filter on a patch of its own.
>
> *sigh* - can do, I guess. Though I'm not sure what that is going to
> accomplish, exactly?
>
> -Toke



-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619

^ permalink raw reply

* Re: [PATCH net-next v6] Add Common Applications Kept Enhanced (cake) qdisc
From: David Miller @ 2018-05-01 19:14 UTC (permalink / raw)
  To: eric.dumazet; +Cc: toke, dave.taht, xiyou.wangcong, netdev, cake
In-Reply-To: <4ec8da81-8671-f434-bada-27088b09ce7b@gmail.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 1 May 2018 12:12:45 -0700

> It seems you guys spent years/months on work on this stuff, so what
> is the big deal about presenting your work in the best possible way
> ?

+1

^ permalink raw reply

* Re: [PATCH net-next v6] Add Common Applications Kept Enhanced (cake) qdisc
From: Eric Dumazet @ 2018-05-01 19:12 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Eric Dumazet, Dave Taht,
	Cong Wang
  Cc: Linux Kernel Network Developers, Cake List
In-Reply-To: <878t932ont.fsf@toke.dk>

On 05/01/2018 11:53 AM, Toke Høiland-Jørgensen wrote:

> *sigh* - can do, I guess. Though I'm not sure what that is going to
> accomplish, exactly?

I guess that nobody really wants to really review Cake if
it is a file with 2700 lines of code and hundreds of variables/tunables.

Sure, we have big files in networking land, as a result of thousands of changes.

If you split it, then maybe the work can be split among reviewers as a result.

Or maybe David Miller can simply merge your patch as is, and hope for the best,
I really do not know.

It seems you guys spent years/months on work on this stuff, so what is the big deal
about presenting your work in the best possible way ?

Thanks.

^ permalink raw reply

* Re: [PATCH net-next 0/9] Misc bug fixes for HNS3 Ethernet driver
From: David Miller @ 2018-05-01 19:08 UTC (permalink / raw)
  To: salil.mehta
  Cc: yisen.zhuang, lipeng321, mehta.salil.lnk, netdev, linux-kernel,
	linuxarm
In-Reply-To: <20180501185605.9584-1-salil.mehta@huawei.com>

From: Salil Mehta <salil.mehta@huawei.com>
Date: Tue, 1 May 2018 19:55:56 +0100

> This patch-set presents some miscellaneous bug fixs and cleanups for
> HNS3 Ethernet Driver.

Series applied, thank you.

^ permalink raw reply

* Re: Silently dropped UDP packets on kernel 4.14
From: Kristian Evensen @ 2018-05-01 18:58 UTC (permalink / raw)
  To: Netfilter Development Mailing list, Network Development
In-Reply-To: <CAKfDRXjop-G3qOZg+P6Mb4VKjvgJejScv9VHaJHB5OJBnqG30w@mail.gmail.com>

On Tue, May 1, 2018 at 8:50 PM, Kristian Evensen
<kristian.evensen@gmail.com> wrote:
> Does anyone have any idea of what could be wrong, where I should look
> or other things I can try? I tried to space the requests out a bit in
> time (I inserted a sleep 1 between them), and then the problem went
> away.

I should learn to always go through everything one last time before
sending an email. First of all, I see that both requests are treated
as new. Second, on my router, new requests are sent to user space for
marking, which explains the large delay in processing. When removing
the NFQUEUE-rule + handling and marking statically, my problem goes
away and I get an answer for both packets.

However, I do have one question. In my application, both packets are
assigned the same mark. Shouldn't they then match the same conntrack
entry, or am I missing something since that seems not to be the case?

BR,
Kristian

^ permalink raw reply

* [PATCH net-next 5/9] net: hns3: fix for phy_addr error in hclge_mac_mdio_config
From: Salil Mehta @ 2018-05-01 18:56 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil.lnk, netdev,
	linux-kernel, linuxarm, Huazhong Tan
In-Reply-To: <20180501185605.9584-1-salil.mehta@huawei.com>

From: Huazhong Tan <tanhuazhong@huawei.com>

When phy exists, phy_addr must less than PHY_MAX_ADDR.
If not, hclge_mac_mdio_config should return error.
And for fiber(phy_addr=0xff), it does not need hclge_mac_mdio_config.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 12 +++++++-----
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c |  7 +++++--
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index b5e0c58..cc09713 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -5503,11 +5503,13 @@ static int hclge_init_ae_dev(struct hnae3_ae_dev *ae_dev)
 		goto err_sriov_disable;
 	}
 
-	ret = hclge_mac_mdio_config(hdev);
-	if (ret) {
-		dev_warn(&hdev->pdev->dev,
-			 "mdio config fail ret=%d\n", ret);
-		goto err_sriov_disable;
+	if (hdev->hw.mac.media_type == HNAE3_MEDIA_TYPE_COPPER) {
+		ret = hclge_mac_mdio_config(hdev);
+		if (ret) {
+			dev_err(&hdev->pdev->dev,
+				"mdio config fail ret=%d\n", ret);
+			goto err_sriov_disable;
+		}
 	}
 
 	ret = hclge_mac_init(hdev);
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c
index 682c2d6..9f7932e42 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c
@@ -140,8 +140,11 @@ int hclge_mac_mdio_config(struct hclge_dev *hdev)
 	struct mii_bus *mdio_bus;
 	int ret;
 
-	if (hdev->hw.mac.phy_addr >= PHY_MAX_ADDR)
-		return 0;
+	if (hdev->hw.mac.phy_addr >= PHY_MAX_ADDR) {
+		dev_err(&hdev->pdev->dev, "phy_addr(%d) is too large.\n",
+			hdev->hw.mac.phy_addr);
+		return -EINVAL;
+	}
 
 	mdio_bus = devm_mdiobus_alloc(&hdev->pdev->dev);
 	if (!mdio_bus)
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 9/9] net: hns3: Remove packet statistics in the range of 8192~12287
From: Salil Mehta @ 2018-05-01 18:56 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil.lnk, netdev,
	linux-kernel, linuxarm, Xi Wang
In-Reply-To: <20180501185605.9584-1-salil.mehta@huawei.com>

From: Xi Wang <wangxi11@huawei.com>

Because the current statistics for size 8192~12287 are only valid for GE,
the ranges of 8192~9216 and 9217~12287 are valid only for LGE/CGE, and are
always 0 for GE interfaces. it is easy to cause confusion when viewing the
packet statistics using the command ethtool -S.

This patch removes the 8192~12287 range of packet statistics and uses the
8192~9216 and 9217~12287 ranges for statistics. This change depends on the
firmware upgrade.

Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c |  4 ----
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h | 12 ++++++------
 2 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 77d9e4c0..dd5d65c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -304,8 +304,6 @@ static const struct hclge_comm_stats_str g_mac_stats_string[] = {
 		HCLGE_MAC_STATS_FIELD_OFF(mac_tx_2048_4095_oct_pkt_num)},
 	{"mac_tx_4096_8191_oct_pkt_num",
 		HCLGE_MAC_STATS_FIELD_OFF(mac_tx_4096_8191_oct_pkt_num)},
-	{"mac_tx_8192_12287_oct_pkt_num",
-		HCLGE_MAC_STATS_FIELD_OFF(mac_tx_8192_12287_oct_pkt_num)},
 	{"mac_tx_8192_9216_oct_pkt_num",
 		HCLGE_MAC_STATS_FIELD_OFF(mac_tx_8192_9216_oct_pkt_num)},
 	{"mac_tx_9217_12287_oct_pkt_num",
@@ -356,8 +354,6 @@ static const struct hclge_comm_stats_str g_mac_stats_string[] = {
 		HCLGE_MAC_STATS_FIELD_OFF(mac_rx_2048_4095_oct_pkt_num)},
 	{"mac_rx_4096_8191_oct_pkt_num",
 		HCLGE_MAC_STATS_FIELD_OFF(mac_rx_4096_8191_oct_pkt_num)},
-	{"mac_rx_8192_12287_oct_pkt_num",
-		HCLGE_MAC_STATS_FIELD_OFF(mac_rx_8192_12287_oct_pkt_num)},
 	{"mac_rx_8192_9216_oct_pkt_num",
 		HCLGE_MAC_STATS_FIELD_OFF(mac_rx_8192_9216_oct_pkt_num)},
 	{"mac_rx_9217_12287_oct_pkt_num",
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
index 6432f754..b7ee91d 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
@@ -408,9 +408,9 @@ struct hclge_mac_stats {
 	u64 mac_tx_1519_2047_oct_pkt_num;
 	u64 mac_tx_2048_4095_oct_pkt_num;
 	u64 mac_tx_4096_8191_oct_pkt_num;
-	u64 mac_tx_8192_12287_oct_pkt_num; /* valid for GE MAC only */
-	u64 mac_tx_8192_9216_oct_pkt_num; /* valid for LGE & CGE MAC only */
-	u64 mac_tx_9217_12287_oct_pkt_num; /* valid for LGE & CGE MAC */
+	u64 rsv0;
+	u64 mac_tx_8192_9216_oct_pkt_num;
+	u64 mac_tx_9217_12287_oct_pkt_num;
 	u64 mac_tx_12288_16383_oct_pkt_num;
 	u64 mac_tx_1519_max_good_oct_pkt_num;
 	u64 mac_tx_1519_max_bad_oct_pkt_num;
@@ -435,9 +435,9 @@ struct hclge_mac_stats {
 	u64 mac_rx_1519_2047_oct_pkt_num;
 	u64 mac_rx_2048_4095_oct_pkt_num;
 	u64 mac_rx_4096_8191_oct_pkt_num;
-	u64 mac_rx_8192_12287_oct_pkt_num;/* valid for GE MAC only */
-	u64 mac_rx_8192_9216_oct_pkt_num; /* valid for LGE & CGE MAC only */
-	u64 mac_rx_9217_12287_oct_pkt_num; /* valid for LGE & CGE MAC only */
+	u64 rsv1;
+	u64 mac_rx_8192_9216_oct_pkt_num;
+	u64 mac_rx_9217_12287_oct_pkt_num;
 	u64 mac_rx_12288_16383_oct_pkt_num;
 	u64 mac_rx_1519_max_good_oct_pkt_num;
 	u64 mac_rx_1519_max_bad_oct_pkt_num;
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 8/9] net: hns3: Fix for packet loss due wrong filter config in VLAN tbls
From: Salil Mehta @ 2018-05-01 18:56 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil.lnk, netdev,
	linux-kernel, linuxarm, Yunsheng Lin
In-Reply-To: <20180501185605.9584-1-salil.mehta@huawei.com>

From: Yunsheng Lin <linyunsheng@huawei.com>

There are two level of vlan tables in hardware, one is port vlan
which is shared by all functions, the other one is function
vlan table, each function has it's own function vlan table.
Currently, PF sets the port vlan table, and vf sets the function
vlan table, which will cause packet lost problem.

This patch fixes this problem by setting both vlan table, and
use hdev->vlan_table to manage thet port vlan table.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c    | 79 +++++++++++++++++-----
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h    |  8 ++-
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c |  7 +-
 3 files changed, 70 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index a4e9991..77d9e4c0 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -4543,8 +4543,9 @@ static void hclge_enable_vlan_filter(struct hnae3_handle *handle, bool enable)
 	hclge_set_vlan_filter_ctrl(hdev, HCLGE_FILTER_TYPE_VF, enable);
 }
 
-int hclge_set_vf_vlan_common(struct hclge_dev *hdev, int vfid,
-			     bool is_kill, u16 vlan, u8 qos, __be16 proto)
+static int hclge_set_vf_vlan_common(struct hclge_dev *hdev, int vfid,
+				    bool is_kill, u16 vlan, u8 qos,
+				    __be16 proto)
 {
 #define HCLGE_MAX_VF_BYTES  16
 	struct hclge_vlan_filter_vf_cfg_cmd *req0;
@@ -4602,12 +4603,9 @@ int hclge_set_vf_vlan_common(struct hclge_dev *hdev, int vfid,
 	return -EIO;
 }
 
-static int hclge_set_port_vlan_filter(struct hnae3_handle *handle,
-				      __be16 proto, u16 vlan_id,
-				      bool is_kill)
+static int hclge_set_port_vlan_filter(struct hclge_dev *hdev, __be16 proto,
+				      u16 vlan_id, bool is_kill)
 {
-	struct hclge_vport *vport = hclge_get_vport(handle);
-	struct hclge_dev *hdev = vport->back;
 	struct hclge_vlan_filter_pf_cfg_cmd *req;
 	struct hclge_desc desc;
 	u8 vlan_offset_byte_val;
@@ -4627,22 +4625,66 @@ static int hclge_set_port_vlan_filter(struct hnae3_handle *handle,
 	req->vlan_offset_bitmap[vlan_offset_byte] = vlan_offset_byte_val;
 
 	ret = hclge_cmd_send(&hdev->hw, &desc, 1);
+	if (ret)
+		dev_err(&hdev->pdev->dev,
+			"port vlan command, send fail, ret =%d.\n", ret);
+	return ret;
+}
+
+static int hclge_set_vlan_filter_hw(struct hclge_dev *hdev, __be16 proto,
+				    u16 vport_id, u16 vlan_id, u8 qos,
+				    bool is_kill)
+{
+	u16 vport_idx, vport_num = 0;
+	int ret;
+
+	ret = hclge_set_vf_vlan_common(hdev, vport_id, is_kill, vlan_id,
+				       0, proto);
 	if (ret) {
 		dev_err(&hdev->pdev->dev,
-			"port vlan command, send fail, ret =%d.\n",
-			ret);
+			"Set %d vport vlan filter config fail, ret =%d.\n",
+			vport_id, ret);
 		return ret;
 	}
 
-	ret = hclge_set_vf_vlan_common(hdev, 0, is_kill, vlan_id, 0, proto);
-	if (ret) {
+	/* vlan 0 may be added twice when 8021q module is enabled */
+	if (!is_kill && !vlan_id &&
+	    test_bit(vport_id, hdev->vlan_table[vlan_id]))
+		return 0;
+
+	if (!is_kill && test_and_set_bit(vport_id, hdev->vlan_table[vlan_id])) {
 		dev_err(&hdev->pdev->dev,
-			"Set pf vlan filter config fail, ret =%d.\n",
-			ret);
-		return -EIO;
+			"Add port vlan failed, vport %d is already in vlan %d\n",
+			vport_id, vlan_id);
+		return -EINVAL;
 	}
 
-	return 0;
+	if (is_kill &&
+	    !test_and_clear_bit(vport_id, hdev->vlan_table[vlan_id])) {
+		dev_err(&hdev->pdev->dev,
+			"Delete port vlan failed, vport %d is not in vlan %d\n",
+			vport_id, vlan_id);
+		return -EINVAL;
+	}
+
+	for_each_set_bit(vport_idx, hdev->vlan_table[vlan_id], VLAN_N_VID)
+		vport_num++;
+
+	if ((is_kill && vport_num == 0) || (!is_kill && vport_num == 1))
+		ret = hclge_set_port_vlan_filter(hdev, proto, vlan_id,
+						 is_kill);
+
+	return ret;
+}
+
+int hclge_set_vlan_filter(struct hnae3_handle *handle, __be16 proto,
+			  u16 vlan_id, bool is_kill)
+{
+	struct hclge_vport *vport = hclge_get_vport(handle);
+	struct hclge_dev *hdev = vport->back;
+
+	return hclge_set_vlan_filter_hw(hdev, proto, vport->vport_id, vlan_id,
+					0, is_kill);
 }
 
 static int hclge_set_vf_vlan_filter(struct hnae3_handle *handle, int vfid,
@@ -4656,7 +4698,7 @@ static int hclge_set_vf_vlan_filter(struct hnae3_handle *handle, int vfid,
 	if (proto != htons(ETH_P_8021Q))
 		return -EPROTONOSUPPORT;
 
-	return hclge_set_vf_vlan_common(hdev, vfid, false, vlan, qos, proto);
+	return hclge_set_vlan_filter_hw(hdev, proto, vfid, vlan, qos, false);
 }
 
 static int hclge_set_vlan_tx_offload_cfg(struct hclge_vport *vport)
@@ -4821,7 +4863,7 @@ static int hclge_init_vlan_config(struct hclge_dev *hdev)
 	}
 
 	handle = &hdev->vport[0].nic;
-	return hclge_set_port_vlan_filter(handle, htons(ETH_P_8021Q), 0, false);
+	return hclge_set_vlan_filter(handle, htons(ETH_P_8021Q), 0, false);
 }
 
 static int hclge_en_hw_strip_rxvtag(struct hnae3_handle *handle, bool enable)
@@ -5604,6 +5646,7 @@ static int hclge_reset_ae_dev(struct hnae3_ae_dev *ae_dev)
 	set_bit(HCLGE_STATE_DOWN, &hdev->state);
 
 	hclge_stats_clear(hdev);
+	memset(hdev->vlan_table, 0, sizeof(hdev->vlan_table));
 
 	ret = hclge_cmd_init(hdev);
 	if (ret) {
@@ -6221,7 +6264,7 @@ static const struct hnae3_ae_ops hclge_ops = {
 	.get_fw_version = hclge_get_fw_version,
 	.get_mdix_mode = hclge_get_mdix_mode,
 	.enable_vlan_filter = hclge_enable_vlan_filter,
-	.set_vlan_filter = hclge_set_port_vlan_filter,
+	.set_vlan_filter = hclge_set_vlan_filter,
 	.set_vf_vlan_filter = hclge_set_vf_vlan_filter,
 	.enable_hw_strip_rxvtag = hclge_en_hw_strip_rxvtag,
 	.reset_event = hclge_reset_event,
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
index 0f4157e..6432f754 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
@@ -12,6 +12,8 @@
 #include <linux/fs.h>
 #include <linux/types.h>
 #include <linux/phy.h>
+#include <linux/if_vlan.h>
+
 #include "hclge_cmd.h"
 #include "hnae3.h"
 
@@ -471,6 +473,7 @@ struct hclge_vlan_type_cfg {
 	u16 tx_in_vlan_type;
 };
 
+#define HCLGE_VPORT_NUM 256
 struct hclge_dev {
 	struct pci_dev *pdev;
 	struct hnae3_ae_dev *ae_dev;
@@ -562,6 +565,7 @@ struct hclge_dev {
 
 	u64 rx_pkts_for_led;
 	u64 tx_pkts_for_led;
+	unsigned long vlan_table[VLAN_N_VID][BITS_TO_LONGS(HCLGE_VPORT_NUM)];
 };
 
 /* VPort level vlan tag configuration for TX direction */
@@ -646,8 +650,8 @@ static inline int hclge_get_queue_id(struct hnae3_queue *queue)
 }
 
 int hclge_cfg_mac_speed_dup(struct hclge_dev *hdev, int speed, u8 duplex);
-int hclge_set_vf_vlan_common(struct hclge_dev *vport, int vfid,
-			     bool is_kill, u16 vlan, u8 qos, __be16 proto);
+int hclge_set_vlan_filter(struct hnae3_handle *handle, __be16 proto,
+			  u16 vlan_id, bool is_kill);
 
 int hclge_buffer_alloc(struct hclge_dev *hdev);
 int hclge_rss_init_hw(struct hclge_dev *hdev);
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c
index a6f7ffa..7563335 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c
@@ -264,19 +264,18 @@ static int hclge_set_vf_vlan_cfg(struct hclge_vport *vport,
 				 struct hclge_mbx_vf_to_pf_cmd *mbx_req,
 				 bool gen_resp)
 {
-	struct hclge_dev *hdev = vport->back;
 	int status = 0;
 
 	if (mbx_req->msg[1] == HCLGE_MBX_VLAN_FILTER) {
+		struct hnae3_handle *handle = &vport->nic;
 		u16 vlan, proto;
 		bool is_kill;
 
 		is_kill = !!mbx_req->msg[2];
 		memcpy(&vlan, &mbx_req->msg[3], sizeof(vlan));
 		memcpy(&proto, &mbx_req->msg[5], sizeof(proto));
-		status = hclge_set_vf_vlan_common(hdev, vport->vport_id,
-						  is_kill, vlan, 0,
-						  cpu_to_be16(proto));
+		status = hclge_set_vlan_filter(handle, cpu_to_be16(proto),
+					       vlan, is_kill);
 	}
 
 	if (gen_resp)
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 7/9] net: hns3: fix a dead loop in hclge_cmd_csq_clean
From: Salil Mehta @ 2018-05-01 18:56 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil.lnk, netdev,
	linux-kernel, linuxarm, Huazhong Tan
In-Reply-To: <20180501185605.9584-1-salil.mehta@huawei.com>

From: Huazhong Tan <tanhuazhong@huawei.com>

If head has invlid value then a dead loop can be triggered in
hclge_cmd_csq_clean. This patch adds sanity check for this case.

Fixes: 68c0a5c70614 ("net: hns3: Add HNS3 IMP(Integrated Mgmt Proc) Cmd
Interface Support")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c    | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c
index ff13d18..fab7068 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c
@@ -31,6 +31,17 @@ static int hclge_ring_space(struct hclge_cmq_ring *ring)
 	return ring->desc_num - used - 1;
 }
 
+static int is_valid_csq_clean_head(struct hclge_cmq_ring *ring, int h)
+{
+	int u = ring->next_to_use;
+	int c = ring->next_to_clean;
+
+	if (unlikely(h >= ring->desc_num))
+		return 0;
+
+	return u > c ? (h > c && h <= u) : (h > c || h <= u);
+}
+
 static int hclge_alloc_cmd_desc(struct hclge_cmq_ring *ring)
 {
 	int size  = ring->desc_num * sizeof(struct hclge_desc);
@@ -141,6 +152,7 @@ static void hclge_cmd_init_regs(struct hclge_hw *hw)
 
 static int hclge_cmd_csq_clean(struct hclge_hw *hw)
 {
+	struct hclge_dev *hdev = (struct hclge_dev *)hw->back;
 	struct hclge_cmq_ring *csq = &hw->cmq.csq;
 	u16 ntc = csq->next_to_clean;
 	struct hclge_desc *desc;
@@ -149,6 +161,13 @@ static int hclge_cmd_csq_clean(struct hclge_hw *hw)
 
 	desc = &csq->desc[ntc];
 	head = hclge_read_dev(hw, HCLGE_NIC_CSQ_HEAD_REG);
+	rmb(); /* Make sure head is ready before touch any data */
+
+	if (!is_valid_csq_clean_head(csq, head)) {
+		dev_warn(&hdev->pdev->dev, "wrong head (%d, %d-%d)\n", head,
+			   csq->next_to_use, csq->next_to_clean);
+		return 0;
+	}
 
 	while (head != ntc) {
 		memset(desc, 0, sizeof(*desc));
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 6/9] net: hns3: Fix to support autoneg only for port attached with phy
From: Salil Mehta @ 2018-05-01 18:56 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil.lnk, netdev,
	linux-kernel, linuxarm, Fuyun Liang
In-Reply-To: <20180501185605.9584-1-salil.mehta@huawei.com>

From: Fuyun Liang <liangfuyun1@huawei.com>

This patch adds a check to support autoneg(ethtool -A) only when PHY
is attached with the port.

Fixes: e2cb1dec9779 ("net: hns3: Add HNS3 VF HCL(Hardware Compatibility
Layer) Support")
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index cc09713..a4e9991 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -5169,12 +5169,6 @@ static int hclge_set_pauseparam(struct hnae3_handle *handle, u32 auto_neg,
 	struct phy_device *phydev = hdev->hw.mac.phydev;
 	u32 fc_autoneg;
 
-	/* Only support flow control negotiation for netdev with
-	 * phy attached for now.
-	 */
-	if (!phydev)
-		return -EOPNOTSUPP;
-
 	fc_autoneg = hclge_get_autoneg(handle);
 	if (auto_neg != fc_autoneg) {
 		dev_info(&hdev->pdev->dev,
@@ -5193,6 +5187,12 @@ static int hclge_set_pauseparam(struct hnae3_handle *handle, u32 auto_neg,
 	if (!fc_autoneg)
 		return hclge_cfg_pauseparam(hdev, rx_en, tx_en);
 
+	/* Only support flow control negotiation for netdev with
+	 * phy attached for now.
+	 */
+	if (!phydev)
+		return -EOPNOTSUPP;
+
 	return phy_start_aneg(phydev);
 }
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 4/9] net: hns3: Fixes the error legs in hclge_init_ae_dev function
From: Salil Mehta @ 2018-05-01 18:56 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil.lnk, netdev,
	linux-kernel, linuxarm, Huazhong Tan
In-Reply-To: <20180501185605.9584-1-salil.mehta@huawei.com>

From: Huazhong Tan <tanhuazhong@huawei.com>

This patch fixes some of the missed error legs in the initialization
function of the ae device. This might cause leaks in case of failure.

Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer
Support")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c    | 55 +++++++++++++---------
 1 file changed, 34 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index c9e80ca..b5e0c58 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -5430,7 +5430,7 @@ static int hclge_init_ae_dev(struct hnae3_ae_dev *ae_dev)
 	hdev = devm_kzalloc(&pdev->dev, sizeof(*hdev), GFP_KERNEL);
 	if (!hdev) {
 		ret = -ENOMEM;
-		goto err_hclge_dev;
+		goto out;
 	}
 
 	hdev->pdev = pdev;
@@ -5443,38 +5443,38 @@ static int hclge_init_ae_dev(struct hnae3_ae_dev *ae_dev)
 	ret = hclge_pci_init(hdev);
 	if (ret) {
 		dev_err(&pdev->dev, "PCI init failed\n");
-		goto err_pci_init;
+		goto out;
 	}
 
 	/* Firmware command queue initialize */
 	ret = hclge_cmd_queue_init(hdev);
 	if (ret) {
 		dev_err(&pdev->dev, "Cmd queue init failed, ret = %d.\n", ret);
-		return ret;
+		goto err_pci_uninit;
 	}
 
 	/* Firmware command initialize */
 	ret = hclge_cmd_init(hdev);
 	if (ret)
-		goto err_cmd_init;
+		goto err_cmd_uninit;
 
 	ret = hclge_get_cap(hdev);
 	if (ret) {
 		dev_err(&pdev->dev, "get hw capability error, ret = %d.\n",
 			ret);
-		return ret;
+		goto err_cmd_uninit;
 	}
 
 	ret = hclge_configure(hdev);
 	if (ret) {
 		dev_err(&pdev->dev, "Configure dev error, ret = %d.\n", ret);
-		return ret;
+		goto err_cmd_uninit;
 	}
 
 	ret = hclge_init_msi(hdev);
 	if (ret) {
 		dev_err(&pdev->dev, "Init MSI/MSI-X error, ret = %d.\n", ret);
-		return ret;
+		goto err_cmd_uninit;
 	}
 
 	ret = hclge_misc_irq_init(hdev);
@@ -5482,69 +5482,69 @@ static int hclge_init_ae_dev(struct hnae3_ae_dev *ae_dev)
 		dev_err(&pdev->dev,
 			"Misc IRQ(vector0) init error, ret = %d.\n",
 			ret);
-		return ret;
+		goto err_msi_uninit;
 	}
 
 	ret = hclge_alloc_tqps(hdev);
 	if (ret) {
 		dev_err(&pdev->dev, "Allocate TQPs error, ret = %d.\n", ret);
-		return ret;
+		goto err_msi_irq_uninit;
 	}
 
 	ret = hclge_alloc_vport(hdev);
 	if (ret) {
 		dev_err(&pdev->dev, "Allocate vport error, ret = %d.\n", ret);
-		return ret;
+		goto err_msi_irq_uninit;
 	}
 
 	ret = hclge_map_tqp(hdev);
 	if (ret) {
 		dev_err(&pdev->dev, "Map tqp error, ret = %d.\n", ret);
-		return ret;
+		goto err_sriov_disable;
 	}
 
 	ret = hclge_mac_mdio_config(hdev);
 	if (ret) {
 		dev_warn(&hdev->pdev->dev,
 			 "mdio config fail ret=%d\n", ret);
-		return ret;
+		goto err_sriov_disable;
 	}
 
 	ret = hclge_mac_init(hdev);
 	if (ret) {
 		dev_err(&pdev->dev, "Mac init error, ret = %d\n", ret);
-		return ret;
+		goto err_mdiobus_unreg;
 	}
 
 	ret = hclge_config_tso(hdev, HCLGE_TSO_MSS_MIN, HCLGE_TSO_MSS_MAX);
 	if (ret) {
 		dev_err(&pdev->dev, "Enable tso fail, ret =%d\n", ret);
-		return ret;
+		goto err_mdiobus_unreg;
 	}
 
 	ret = hclge_init_vlan_config(hdev);
 	if (ret) {
 		dev_err(&pdev->dev, "VLAN init fail, ret =%d\n", ret);
-		return  ret;
+		goto err_mdiobus_unreg;
 	}
 
 	ret = hclge_tm_schd_init(hdev);
 	if (ret) {
 		dev_err(&pdev->dev, "tm schd init fail, ret =%d\n", ret);
-		return ret;
+		goto err_mdiobus_unreg;
 	}
 
 	hclge_rss_init_cfg(hdev);
 	ret = hclge_rss_init_hw(hdev);
 	if (ret) {
 		dev_err(&pdev->dev, "Rss init fail, ret =%d\n", ret);
-		return ret;
+		goto err_mdiobus_unreg;
 	}
 
 	ret = init_mgr_tbl(hdev);
 	if (ret) {
 		dev_err(&pdev->dev, "manager table init fail, ret =%d\n", ret);
-		return ret;
+		goto err_mdiobus_unreg;
 	}
 
 	hclge_dcb_ops_set(hdev);
@@ -5567,11 +5567,24 @@ static int hclge_init_ae_dev(struct hnae3_ae_dev *ae_dev)
 	pr_info("%s driver initialization finished.\n", HCLGE_DRIVER_NAME);
 	return 0;
 
-err_cmd_init:
+err_mdiobus_unreg:
+	if (hdev->hw.mac.phydev)
+		mdiobus_unregister(hdev->hw.mac.mdio_bus);
+err_sriov_disable:
+	if (IS_ENABLED(CONFIG_PCI_IOV))
+		hclge_disable_sriov(hdev);
+err_msi_irq_uninit:
+	hclge_misc_irq_uninit(hdev);
+err_msi_uninit:
+	pci_free_irq_vectors(pdev);
+err_cmd_uninit:
+	hclge_destroy_cmd_queue(&hdev->hw);
+err_pci_uninit:
+	pci_clear_master(pdev);
 	pci_release_regions(pdev);
-err_pci_init:
+	pci_disable_device(pdev);
 	pci_set_drvdata(pdev, NULL);
-err_hclge_dev:
+out:
 	return ret;
 }
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 3/9] net: hns3: Fixes the out of bounds access in hclge_map_tqp
From: Salil Mehta @ 2018-05-01 18:55 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil.lnk, netdev,
	linux-kernel, linuxarm, Huazhong Tan
In-Reply-To: <20180501185605.9584-1-salil.mehta@huawei.com>

From: Huazhong Tan <tanhuazhong@huawei.com>

This patch fixes the handling of the check when number of vports
are detected to be more than available TPQs. Current handling causes
an out of bounds access in hclge_map_tqp().

Fixes: 7df7dad633e2 ("net: hns3: Refactor the mapping of tqp to vport")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 2066dd7..c9e80ca 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -1459,8 +1459,11 @@ static int hclge_alloc_vport(struct hclge_dev *hdev)
 	/* We need to alloc a vport for main NIC of PF */
 	num_vport = hdev->num_vmdq_vport + hdev->num_req_vfs + 1;
 
-	if (hdev->num_tqps < num_vport)
-		num_vport = hdev->num_tqps;
+	if (hdev->num_tqps < num_vport) {
+		dev_err(&hdev->pdev->dev, "tqps(%d) is less than vports(%d)",
+			hdev->num_tqps, num_vport);
+		return -EINVAL;
+	}
 
 	/* Alloc the same number of TQPs for every vport */
 	tqp_per_vport = hdev->num_tqps / num_vport;
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 2/9] net: hns3: fix to correctly fetch l4 protocol outer header
From: Salil Mehta @ 2018-05-01 18:55 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil.lnk, netdev,
	linux-kernel, linuxarm, Huazhong Tan
In-Reply-To: <20180501185605.9584-1-salil.mehta@huawei.com>

From: Huazhong Tan <tanhuazhong@huawei.com>

This patch fixes the function being used to fetch L4
protocol outer header. Mistakenly skb_inner_transport_header
API was being used earlier.

Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 8c55965..c4e2950 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -502,7 +502,7 @@ static int hns3_get_l4_protocol(struct sk_buff *skb, u8 *ol4_proto,
 
 	/* find outer header point */
 	l3.hdr = skb_network_header(skb);
-	l4_hdr = skb_inner_transport_header(skb);
+	l4_hdr = skb_transport_header(skb);
 
 	if (skb->protocol == htons(ETH_P_IPV6)) {
 		exthdr = l3.hdr + sizeof(*l3.v6);
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 1/9] net: hns3: Remove error log when getting pfc stats fails
From: Salil Mehta @ 2018-05-01 18:55 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil.lnk, netdev,
	linux-kernel, linuxarm, Yunsheng Lin
In-Reply-To: <20180501185605.9584-1-salil.mehta@huawei.com>

From: Yunsheng Lin <linyunsheng@huawei.com>

When mac supports DCB, but is in GE mode, it does not support
querying pfc stats, firmware returns error when trying to
query the pfc stats. this creates a lot of noise in the kernel
log when it prints the error log.

This patch fixes it by removing the error log, because it already
return the error to the user space, so the user should be aware of
the error.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
index 885f25c..c69ecab 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
@@ -134,11 +134,8 @@ static int hclge_pfc_stats_get(struct hclge_dev *hdev,
 	}
 
 	ret = hclge_cmd_send(&hdev->hw, desc, HCLGE_TM_PFC_PKT_GET_CMD_NUM);
-	if (ret) {
-		dev_err(&hdev->pdev->dev,
-			"Get pfc pause stats fail, ret = %d.\n", ret);
+	if (ret)
 		return ret;
-	}
 
 	for (i = 0; i < HCLGE_TM_PFC_PKT_GET_CMD_NUM; i++) {
 		struct hclge_pfc_stats_cmd *pfc_stats =
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 0/9] Misc bug fixes for HNS3 Ethernet driver
From: Salil Mehta @ 2018-05-01 18:55 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil.lnk, netdev,
	linux-kernel, linuxarm

This patch-set presents some miscellaneous bug fixs and cleanups for
HNS3 Ethernet Driver.

Fuyun Liang (1):
  net: hns3: Fix to support autoneg only for port attached with phy

Huazhong Tan (5):
  net: hns3: fix to correctly fetch l4 protocol outer header
  net: hns3: Fixes the out of bounds access in hclge_map_tqp
  net: hns3: Fixes the error legs in hclge_init_ae_dev function
  net: hns3: fix for phy_addr error in hclge_mac_mdio_config
  net: hns3: fix a dead loop in hclge_cmd_csq_clean

Xi Wang (1):
  net: hns3: Remove packet statistics in the range of 8192~12287

Yunsheng Lin (2):
  net: hns3: Remove error log when getting pfc stats fails
  net: hns3: Fix for packet loss due wrong filter config in VLAN tbls

 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c    |   2 +-
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c |  19 +++
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c    | 167 ++++++++++++++-------
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h    |  20 ++-
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c |   7 +-
 .../ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c    |   7 +-
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c  |   5 +-
 7 files changed, 153 insertions(+), 74 deletions(-)

-- 
2.7.4

^ permalink raw reply

* Re: [PATCH net-next v6] Add Common Applications Kept Enhanced (cake) qdisc
From: Toke Høiland-Jørgensen @ 2018-05-01 18:53 UTC (permalink / raw)
  To: Eric Dumazet, Dave Taht, Cong Wang
  Cc: Linux Kernel Network Developers, Cake List
In-Reply-To: <dff5ba09-1b81-7cd8-4be3-e7a419797d8e@gmail.com>

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On 04/30/2018 02:27 PM, Dave Taht wrote:
>
>> I actually have a tc - bpf based ack filter, during the development of
>> cake's ack-thinner, that I should submit one of these days. It
>> proved to be of limited use.
>> 
>> Probably the biggest mistake we made is by calling this cake feature a
>> filter. It isn't.
>> 
>> Maybe we should have called it a "thinner" or something like that? In
>> order to properly "thin" or "reduce" an ack stream
>> you have to have a queue to look at and some related state. TC filters
>> do not operate on queues, qdiscs do. Thus the "ack-filter" here is
>> deeply embedded into cake's flow isolation and queue structures.
>
> A feature eating packets _is_ a filter.
>
> Given that a qdisc only sees one direction, I really do not get why
> ack-filter is so desirable in a packet scheduler.

The ACK filter in CAKE is there to solve a very particular use case:
Residential internet connections with bandwidths so asymmetric that it
hurts TCP performance. It is not on by default; but rather meant to be
configured by users which suffer from this particular ISP brokenness
(which, sadly, does happen; there are ISPs who believe a 50/1 bandwidth
ratio is reasonable). For those users, the ACK filter can help. 

We certainly do not advise people to turn it on in the general case! As
you point, in general TCP performance is best improved in the TCP stack...

> You have not provided any numbers to show how useful it is to maintain
> this code

You mean apart from that in the linked blog post and the paper?
Here's an executive summary:

On a 30/1 Mbps connection with a bidirectional traffic test (RRUL),
turning on the ACK filter improves downstream throughput by ~20% and
upstream throughput by ~12% in conservative mode and ~40% in aggressive
mode, at the cost of ~5ms of inter-flow latency due to the increased
congestion.

In *really* pathological cases, the effect can be a lot more; for
instance, the ACK filter increases the achievable downstream throughput
on a link with 100 Kbps in the upstream direction by an order of
magnitude (from ~2.5 Mbps to ~25 Mbps).

> (probably still broken btw, considering it is changing some skb
> attributes).

As you may have noticed over the last few iterations, I've actually been
trying to fix any brokenness. If you could be specific as to what is
still broken, that would be helpful.

(I'm assuming are referring to the calls to skb_set_inner* - but do you
mean all of them, or just the ones that set inner == outer?)

> On wifi (or any half duplex medium), you might gain something by
> carefully sending ACK not too often, but ultimately this should be
> done by TCP stack, in well controlled environment [1], instead of
> middle-boxes happily playing/breaking some Internet standards.
>
> [1] TCP stack has the estimations of RTT, RWIN, throughput, and might
> be able to avoid flooding the network with acks under some conditions.

You are quite right that in general, TCP performance is best improved in
the TCP stack. But home users are not generally in control of that; the
ACK filter helps in those specific deployments (again, it's off by
default, and you shouldn't turn it on in the general case!).

> Say RTT is 100ms, and we receive 1 packet every 100 usec (no GRO
> aggregation) Maybe we do not really _need_ to send 5000 ack per second
> (or even 10,000 ack per second if a hole needs a repair)

Yes, please do fix that.

> Also on wifi, the queue builds in the driver queues anyway, not in the
> qdisc.

Actually, we've fixed that (for some drivers); there is now no qdisc on
WiFi interfaces, and we've integrated FQ-CoDel'ed queueing into the
stack where it can be effective. But no, running CAKE with an ACK filter
on a WiFi link is not going to be effective. Don't do that.

> So ACK filtering, if _really_ successful, would need to be
> modularized.

I really hope ACK filtering won't be "_really_ successful". Again, it is
not meant to be a feature that's on everywhere, it's targeting a very
specific use case that CAKE is optimised for: The home router use case.

> Please split Cake into a patch series.
> Presumably putting the ack-filter on a patch of its own.

*sigh* - can do, I guess. Though I'm not sure what that is going to
accomplish, exactly?

-Toke

^ permalink raw reply

* Silently dropped UDP packets on kernel 4.14
From: Kristian Evensen @ 2018-05-01 18:50 UTC (permalink / raw)
  To: Netfilter Development Mailing list, Network Development

Hello,

I am currently debugging an issue where it looks like UDP packets are
silently dropped. My setup is a client with a tool that sends two UDP
packets (DNS requests) "simultaneously" using the same socket, and
then a router running latest OpenWRT (with kernel 4.14.37). What I am
seeing is that the first request goes through, while the second seems
to be lost somewhere. I can see the second packet in my firewall logs
on the router (all the way past the POSTROUTING chain in the
nat-table), but the packet is not sent over the wire (checked with
tcpdump).

In order to try to figure out what was going on, I added some
LOG-rules to the firewall. I Inserted the rules in the PREROUTING
chains in the raw, mangle and nat tables, in filter FORWARD and nat
and mangle POSTROUTING. Here are the logs for one set of requests
(with some notes):

First DNS request (verified by packet length):
[723269.968256] raw-pre: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=56 TOS=0x00 PREC=0x00 TTL=64 ID=8147 DF PROTO=UDP SPT=42397 DPT=53
LEN=36
[723269.984892] mangle-pre: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=56 TOS=0x00 PREC=0x00 TTL=64 ID=8147 DF PROTO=UDP SPT=42397 DPT=53
LEN=36
[723270.001769] mangle-pre #2: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=56 TOS=0x00 PREC=0x00 TTL=64 ID=8147 DF PROTO=UDP SPT=42397 DPT=53
LEN=36 -> this LOG-rule only matches when ctstate is NEW.

Second DNS request (verified by packet length):
[723270.019033] raw-pre: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=58 TOS=0x00 PREC=0x00 TTL=64 ID=8148 DF PROTO=UDP SPT=42397 DPT=53
LEN=38
[723270.035734] mangle-pre: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=58 TOS=0x00 PREC=0x00 TTL=64 ID=8148 DF PROTO=UDP SPT=42397 DPT=53
LEN=38
-> No mangle-pre #2, so the conntrack entry is found

Processing first request:
[723270.036213] nat-pre: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=56 TOS=0x00 PREC=0x00 TTL=64 ID=8147 DF PROTO=UDP SPT=42397 DPT=53
LEN=36 MARK=0x4000200 -> I use policy routing, therefore mark. Routing
works correctly.
[723270.036366] filter-fwd: IN=br-lan OUT=eth0.2
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=56 TOS=0x00 PREC=0x00 TTL=63 ID=8147 DF PROTO=UDP SPT=42397 DPT=53
LEN=36 MARK=0x4000200
[723270.036419] mangle-post: IN= OUT=eth0.2 SRC=10.0.0.3 DST=1.1.1.1
LEN=56 TOS=0x00 PREC=0x00 TTL=63 ID=8147 DF PROTO=UDP SPT=42397 DPT=53
LEN=36 MARK=0x4000200
[723270.036457] nat-post: IN= OUT=eth0.2 SRC=10.0.0.3 DST=1.1.1.1
LEN=56 TOS=0x00 PREC=0x00 TTL=63 ID=8147 DF PROTO=UDP SPT=42397 DPT=53
LEN=36 MARK=0x4000200

Processing second request:
[723270.117394] mangle-pre #2: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=58 TOS=0x00 PREC=0x00 TTL=64 ID=8148 DF PROTO=UDP SPT=42397 DPT=53
LEN=38
[723270.136644] nat-pre: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=58 TOS=0x00 PREC=0x00 TTL=64 ID=8148 DF PROTO=UDP SPT=42397 DPT=53
LEN=38 MARK=0x4000200
[723270.154716] filter-fwd: IN=br-lan OUT=eth0.2
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=58 TOS=0x00 PREC=0x00 TTL=63 ID=8148 DF PROTO=UDP SPT=42397 DPT=53
LEN=38 MARK=0x4000200
[723270.173501] mangle-post: IN= OUT=eth0.2 SRC=10.0.0.3 DST=1.1.1.1
LEN=58 TOS=0x00 PREC=0x00 TTL=63 ID=8148 DF PROTO=UDP SPT=42397 DPT=53
LEN=38 MARK=0x4000200
[723270.187787] nat-post: IN= OUT=eth0.2 SRC=10.0.0.3 DST=1.1.1.1
LEN=58 TOS=0x00 PREC=0x00 TTL=63 ID=8148 DF PROTO=UDP SPT=42397 DPT=53
LEN=38 MARK=0x4000200

When dumping conntrack after the second request has been processed, I
get the following:
udp      17 55 src=10.0.0.3 dst=1.1.1.1 sport=42397 dport=53 packets=1
bytes=56 src=1.1.1.1 dst=193.213.155.210 sport=53 dport=42397
packets=1 bytes=72 mark=67109120 delta-time=4 use=1

This looks a bit strange to me, it seems that only the first request
has been counted.

Does anyone have any idea of what could be wrong, where I should look
or other things I can try? I tried to space the requests out a bit in
time (I inserted a sleep 1 between them), and then the problem went
away.

Thanks in advance for any help,
Kristian

^ permalink raw reply

* Re: [RFC net-next 4/5] net: phy: Add support for IEEE standard test modes
From: Florian Fainelli @ 2018-05-01 18:43 UTC (permalink / raw)
  To: Woojung.Huh, netdev
  Cc: andrew, rmk, linux-kernel, davem, cphealy, nikita.yoush,
	vivien.didelot, Nisar.Sayed, UNGLinuxDriver
In-Reply-To: <9235D6609DB808459E95D78E17F2E43D40D553D5@CHN-SV-EXMX02.mchp-main.com>

On 05/01/2018 10:29 AM, Woojung.Huh@microchip.com wrote:
> Hi Florian,
> 
>> diff --git a/drivers/net/phy/phy-tests.c b/drivers/net/phy/phy-tests.c
> ...
>> +/* genphy_set_test - Make a PHY enter one of the standard IEEE defined
>> + * test modes
>> + * @phydev: the PHY device instance
>> + * @test: the desired test mode
>> + * @data: test specific data (none)
>> + *
>> + * This function makes the designated @phydev enter the desired standard
>> + * 100BaseT2 or 1000BaseT test mode as defined in IEEE 802.3-2012 section TWO
>> + * and THREE under 32.6.1.2.1 and 40.6.1.1.2 respectively
>> + */
>> +int genphy_set_test(struct phy_device *phydev,
>> +		    struct ethtool_phy_test *test, const u8 *data)
>> +{
>> +	u16 shift, base, bmcr = 0;
>> +	int ret;
>> +
>> +	/* Exit test mode */
>> +	if (test->mode == PHY_STD_TEST_MODE_NORMAL) {
>> +		ret = phy_read(phydev, MII_CTRL1000);
>> +		if (ret < 0)
>> +			return ret;
>> +
>> +		ret &= ~GENMASK(15, 13);
>> +
>> +		return phy_write(phydev, MII_CTRL1000, ret);
>> +	}
>> +
>> +	switch (test->mode) {
>> +	case PHY_STD_TEST_MODE_100BASET2_1:
>> +	case PHY_STD_TEST_MODE_100BASET2_2:
>> +	case PHY_STD_TEST_MODE_100BASET2_3:
>> +		if (!(phydev->supported & PHY_100BT_FEATURES))
>> +			return -EOPNOTSUPP;
>> +
>> +		shift = 14;
>> +		base = test->mode - PHY_STD_TEST_MODE_NORMAL;
>> +		bmcr = BMCR_SPEED100;
>> +		break;
>> +
>> +	case PHY_STD_TEST_MODE_1000BASET_1:
>> +	case PHY_STD_TEST_MODE_1000BASET_2:
>> +	case PHY_STD_TEST_MODE_1000BASET_3:
>> +	case PHY_STD_TEST_MODE_1000BASET_4:
>> +		if (!(phydev->supported & PHY_1000BT_FEATURES))
>> +			return -EOPNOTSUPP;
>> +
>> +		shift = 13;
>> +		base = test->mode - PHY_STD_TEST_MODE_100BASET2_MAX;
>> +		bmcr = BMCR_SPEED1000;
>> +		break;
>> +
>> +	default:
>> +		/* Let an upper driver deal with additional modes it may
>> +		 * support
>> +		 */
>> +		return -EOPNOTSUPP;
>> +	}
>> +
>> +	/* Force speed and duplex */
>> +	ret = phy_write(phydev, MII_BMCR, bmcr | BMCR_FULLDPLX);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	/* Set the desired test mode bit */
>> +	return phy_write(phydev, MII_CTRL1000, (test->mode + base) << shift);
>> +}
> For now, these are for 100B-T2 & 1000B-TX.
> But, other speeds such as 802.3bw/bq/cq have very similar format,
> how about make  phy_write() to BMCR & CTRL1000 as another function call per speed?

Not sure I completely understand your suggestion, do you mean that I
should break down the body of that function above such that there are
per-speed lower level functions? Something like the pseudo-code below:

genphy_set_test() {
	switch (mode) {
	case PHY_STD_TEST_MODE_100BASET2_1:
	..
	case PHY_STD_TEST_MODE_100BASET2_3:
		return genphy_set_100baset2();

	case PHY_STD_TEST_MODE_1000BASET_1:
	..
	case PHY_STD_TEST_MODE_1000BASET_4:
		return genphy_set_1000baset();

	case PHY_STD_TEST_MODE_8021BWQCQ_1:
		return genphy_set_100baset1();

}

Or did you want to see a different way of mapping a given speed/feature
set to a specific test function?
-- 
Florian

^ permalink raw reply

* Re: [PATCH V3 net-next 1/2] tcp: send in-queue bytes in cmsg upon read
From: David Miller @ 2018-05-01 18:34 UTC (permalink / raw)
  To: soheil.kdev; +Cc: netdev, ycheng, ncardwell, edumazet, willemb, soheil
In-Reply-To: <20180501141128.208705-1-soheil.kdev@gmail.com>

From: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Date: Tue,  1 May 2018 10:11:27 -0400

> +static inline int tcp_inq_hint(struct sock *sk)

Please do not use 'inline' in foo.c files, let the compiler decide.

Otherwise looks great, thanks.

^ permalink raw reply

* [PATCH iproute2-next] Add tc-BPF example for a TCP pure ack recognizer
From: Dave Taht @ 2018-05-01 18:32 UTC (permalink / raw)
  To: netdev; +Cc: Dave Taht

ack_recognize can shift pure tcp acks into another flowid.
---
 examples/bpf/ack_recognize.c | 98 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 examples/bpf/ack_recognize.c

diff --git a/examples/bpf/ack_recognize.c b/examples/bpf/ack_recognize.c
new file mode 100644
index 0000000..5da620c
--- /dev/null
+++ b/examples/bpf/ack_recognize.c
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2017 Google Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA.
+ */
+
+/*
+ * Author: dave.taht@gmail.com (Dave Taht)
+ *
+ * ack_recognizer: An eBPF program that correctly recognizes modern TCP ACKs,
+ * with tcp option fields like SACK and timestamps, and no additional data.
+ *
+ * ack_match call: Recognize "pure acks" with no data payload
+ *
+ */
+
+#include "bpf_api.h"
+#include "linux/if_ether.h"
+#include "linux/ip.h"
+#include "linux/in.h"
+#include "linux/ipv6.h"
+#include "linux/tcp.h"
+
+/*
+ * A pure ack contains the ip header, the tcp header + options, flags with the
+ * ack field set, and no additional payload. That last bit is what every prior
+ * ack filter gets wrong, they typically assume an obsolete 64 bytes, and don't
+ * calculate the options (like sack or timestamps) to subtract from the payload.
+ */
+
+__section_cls_entry
+int ack_match(struct __sk_buff *skb)
+{
+	void *data = (void *)(long)skb->data;
+	void *data_end = (void *)(long)skb->data_end;
+	struct ethhdr *eth = data;
+	struct iphdr *iph = data + sizeof(*eth);
+	struct tcphdr *tcp;
+
+	if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*tcp) > data_end)
+		return 0;
+
+	if (eth->h_proto == htons(ETH_P_IP) &&
+	    iph->version == 4) {
+		if(iph->protocol == IPPROTO_TCP &&
+		   iph->ihl == 5 &&
+		   data + sizeof(*eth) + 20 + sizeof(*tcp) <= data_end) {
+			tcp = data + sizeof(*eth) + 20;
+			if (tcp->ack &&
+			    htons(iph->tot_len) == 20 + tcp->doff*4)
+				return -1;
+		}
+	} else if (eth->h_proto == htons(ETH_P_IPV6) &&
+		   iph->version == 6) {
+		struct ipv6hdr *iph6 = (struct ipv6hdr *) iph;
+		if(iph6->nexthdr == IPPROTO_TCP &&
+		   data + sizeof(*eth) + 40 + sizeof(*tcp) <= data_end ) {
+			tcp = data + sizeof(*eth) + 40;
+			if (tcp->ack &&
+			    tcp->doff*4 == htons(iph6->payload_len))
+				return -1;
+		}
+	}
+
+	return 0;
+}
+
+/* Example: Move acks into a priority queue:
+
+IFACE=eth0
+tc qdisc del dev $IFACE root 2> /dev/null
+tc qdisc add dev $IFACE root handle 1: prio bands 3 \
+	priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
+tc qdisc add dev $IFACE parent 1:1 handle 10:1 sfq headdrop # acks only
+tc qdisc add dev $IFACE parent 1:2 handle 20:1 fq_codel # all other traffic
+tc qdisc add dev $IFACE parent 1:3 handle 30:1 fq_codel # unused
+tc filter add dev $IFACE parent 1: prio 1 bpf \
+	object-file ack_recognize.o flowid 1:1
+
+Please note that a strict priority queue is not a good idea (drr would be
+better), nor is doing any level of prioritization on acks at all....
+*/
+
+BPF_LICENSE("GPL");
-- 
2.7.4

^ permalink raw reply related

* Re: [Cake] [PATCH net-next v6] Add Common Applications Kept Enhanced (cake) qdisc
From: Jonathan Morton @ 2018-05-01 18:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Dave Taht, Cong Wang, Cake List, Linux Kernel Network Developers
In-Reply-To: <dff5ba09-1b81-7cd8-4be3-e7a419797d8e@gmail.com>

> On 1 May, 2018, at 7:06 pm, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> You have not provided any numbers to show how useful it is to maintain this
> code (probably still broken btw, considering it is changing some skb attributes).

A simple calculation shows what the maximum tolerable asymmetry for a conventional TCP connection is.  If the MTU is 1500, a pure ack is 84 bytes, and an ack is sent for every 3 packets received, bandwidth asymmetry exceeding about 50:1 will inevitably result in the inability to fully utilise downstream bandwidth.  This ratio drops to 34:1 if acks are sent every two data packets, and 17:1 for the RFC-mandated "ack every packet" behaviour when loss is detected.

Physical asymmetry ratios exceeding 20:1 are not rare.  They are increased still further if reverse traffic is present; given a nominal 10:1 asymmetry, interleaving one ack with one MTU packet (as SFQ would) gives an *effective* asymmetry for the downstream flow of 188:1, and seriously affects downstream goodput.

High asymmetry ratios can be tolerated by TCPs implementing sparser acks than 1:3 ratios, as proposed in AckCC.  Without AckCC however, I understand a strict reading of the RFCs prohibits TCPs from going beyond 1:3.  Even if Linux TCPs do so anyway, billions of Windows, Mac and mobile hosts most likely will not.  This makes a solely end-to-end solution impractical for the time being, with no obvious hope of improvement.

To maintain downstream goodput under these conditions, it is either necessary to send bursts of acks between the upstream traffic (which is slightly wasteful of bandwidth and also upsets RTT-sensitive TCPs like BBR) - which Cake will do quite happily if the ack-filter is disabled - or to delete some of the acks.  When there are many upstream flows competing with a single ack flow, the latter is the only reasonable solution unless acks are hard-prioritised (which has negative effects of its own, so we didn't consider it).

Middleboxes *are* permitted to drop packets, and AQM routinely does so.  We found that AQM could delete acks beneficially, but that it ramped up too slowly to really solve the problem (acks being individually small, much larger numbers of them must be dropped once they become a saturating flow, compared to a data flow).  Also, ack flows are often *antiresponsive*, in that dropping some of them causes an increase in their arrival rate due to increasing the ack-clocked downstream traffic, so it should be moderately obvious that conventional AQM strategies don't apply.  We also looked at existing ack filters' behaviour and found it wanting, so we were initially reluctant to implement our own.

However, having seen many counterexamples of how *not* to write an ack filter, we were eventually able to write one that *didn't* break the rules, and ensured that the information TCP relies on remained in acks that were not deleted even when many acks were.  This includes preserving the triple-repetition rule for detecting packet loss, and preserving the tail ack signifying reception of the end of a flow - which naturally results in only dropping acks if it's worthwhile to do so.  In short, we're fully aware of the potential for breaking TCP this way, and we've done our level best to avoid it.

Testing revealed an appreciable, simultaneous improvement in both downstream and upstream goodput, and in the downstream flow's internal RTT, under appropriate traffic conditions.  A more aggressive variant, going to the edges of what might be allowed by RFCs, actually showed *less* improvement than the standard one - it interfered with TCP's behaviour too much.  We can dig up the data if required.

> Also on wifi, the queue builds in the driver queues anyway, not in the qdisc.
> So ACK filtering, if _really_ successful, would need to be modularized.

Agree that the wifi queues would also benefit from ack filtering.  In fact, with make-wifi-fast improvements active, installing Cake or any other qdisc on that interface should be unnecessary.  Cake's shaper can't easily adapt to variable-rate links on the timescales required, anyway.  (It could if something told it directly about the variations.)

However, I don't see a way to install the ack-filter as a separate entity in its own right, without the ability to manipulate the queue downstream of itself.  The arrival of a new ack may trigger the deletion of a previous one, never the one just arrived, yet keeping an internal queue of acks within the filter would be counterproductive.  That's why we implemented it within Cake instead of as a separate thing.

It may be possible to modularise it sufficiently that qdiscs could add support for ack-filtering without duplicating the code, rather than the other way around.  This would also allow wifi queues to be modified the same way.  Would that be acceptable?

 - Jonathan Morton

^ permalink raw reply

* [PATCH net-next v7 3/3] cxgb4: collect hardware dump in second kernel
From: Rahul Lakkireddy @ 2018-05-01 18:27 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA,
	kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: indranil-ut6Up61K2wZBDgjK7y7TUQ, nirranjan-ut6Up61K2wZBDgjK7y7TUQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	ganeshgr-ut6Up61K2wZBDgjK7y7TUQ, Rahul Lakkireddy,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn
In-Reply-To: <cover.1525197407.git.rahul.lakkireddy-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>

Register callback to collect hardware/firmware dumps in second kernel
before hardware/firmware is initialized. The dumps for each device
will be available as elf notes in /proc/vmcore in second kernel.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
Signed-off-by: Ganesh Goudar <ganeshgr-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
---
v7:
- Removed "CHELSIO" vendor identifier in Elf Note name.

v6:
- Added "CHELSIO" string as vendor identifier in the Elf Note name.

v5:
- No changes.

v4:
- No changes.

v3:
- Replaced all crashdd* with vmcoredd*.
- Replaced crashdd_add_dump() with vmcore_add_device_dump().
- Updated comments and commit message.

v2:
- No Changes.

Changes since rfc v2:
- Update comments and commit message for sysfs change.

rfc v2:
- Updated dump registration to the new API in patch 1.

 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h       |  4 ++++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c | 25 ++++++++++++++++++++++++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h |  3 +++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c  | 10 ++++++++++
 4 files changed, 42 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 688f95440af2..01e7aad4ce5b 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -50,6 +50,7 @@
 #include <linux/net_tstamp.h>
 #include <linux/ptp_clock_kernel.h>
 #include <linux/ptp_classify.h>
+#include <linux/crash_dump.h>
 #include <asm/io.h>
 #include "t4_chip_type.h"
 #include "cxgb4_uld.h"
@@ -964,6 +965,9 @@ struct adapter {
 	struct hma_data hma;
 
 	struct srq_data *srq;
+
+	/* Dump buffer for collecting logs in kdump kernel */
+	struct vmcoredd_data vmcoredd;
 };
 
 /* Support for "sched-class" command to allow a TX Scheduling Class to be
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
index 143686c60234..085691eb2b95 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
@@ -488,3 +488,28 @@ void cxgb4_init_ethtool_dump(struct adapter *adapter)
 	adapter->eth_dump.version = adapter->params.fw_vers;
 	adapter->eth_dump.len = 0;
 }
+
+static int cxgb4_cudbg_vmcoredd_collect(struct vmcoredd_data *data, void *buf)
+{
+	struct adapter *adap = container_of(data, struct adapter, vmcoredd);
+	u32 len = data->size;
+
+	return cxgb4_cudbg_collect(adap, buf, &len, CXGB4_ETH_DUMP_ALL);
+}
+
+int cxgb4_cudbg_vmcore_add_dump(struct adapter *adap)
+{
+	struct vmcoredd_data *data = &adap->vmcoredd;
+	u32 len;
+
+	len = sizeof(struct cudbg_hdr) +
+	      sizeof(struct cudbg_entity_hdr) * CUDBG_MAX_ENTITY;
+	len += CUDBG_DUMP_BUFF_SIZE;
+
+	data->size = len;
+	snprintf(data->dump_name, sizeof(data->dump_name), "%s_%s",
+		 cxgb4_driver_name, adap->name);
+	data->vmcoredd_callback = cxgb4_cudbg_vmcoredd_collect;
+
+	return vmcore_add_device_dump(data);
+}
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h
index ce1ac9a1c878..ef59ba1ed968 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h
@@ -41,8 +41,11 @@ enum CXGB4_ETHTOOL_DUMP_FLAGS {
 	CXGB4_ETH_DUMP_HW = (1 << 1), /* various FW and HW dumps */
 };
 
+#define CXGB4_ETH_DUMP_ALL (CXGB4_ETH_DUMP_MEM | CXGB4_ETH_DUMP_HW)
+
 u32 cxgb4_get_dump_length(struct adapter *adap, u32 flag);
 int cxgb4_cudbg_collect(struct adapter *adap, void *buf, u32 *buf_size,
 			u32 flag);
 void cxgb4_init_ethtool_dump(struct adapter *adapter);
+int cxgb4_cudbg_vmcore_add_dump(struct adapter *adap);
 #endif /* __CXGB4_CUDBG_H__ */
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 24d2865b8806..32cad0acf76c 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -5544,6 +5544,16 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (err)
 		goto out_free_adapter;
 
+	if (is_kdump_kernel()) {
+		/* Collect hardware state and append to /proc/vmcore */
+		err = cxgb4_cudbg_vmcore_add_dump(adapter);
+		if (err) {
+			dev_warn(adapter->pdev_dev,
+				 "Fail collecting vmcore device dump, err: %d. Continuing\n",
+				 err);
+			err = 0;
+		}
+	}
 
 	if (!is_t4(adapter->params.chip)) {
 		s_qpp = (QUEUESPERPAGEPF0_S +
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next v7 2/3] vmcore: append device dumps to vmcore as elf notes
From: Rahul Lakkireddy @ 2018-05-01 18:27 UTC (permalink / raw)
  To: netdev, kexec, linux-fsdevel, linux-kernel
  Cc: davem, viro, ebiederm, stephen, akpm, torvalds, ganeshgr,
	nirranjan, indranil, Rahul Lakkireddy
In-Reply-To: <cover.1525197407.git.rahul.lakkireddy@chelsio.com>

Update read and mmap logic to append device dumps as additional notes
before the other elf notes. We add device dumps before other elf notes
because the other elf notes may not fill the elf notes buffer
completely and we will end up with zero-filled data between the elf
notes and the device dumps. Tools will then try to decode this
zero-filled data as valid notes and we don't want that. Hence, adding
device dumps before the other elf notes ensure that zero-filled data
can be avoided. This also ensures that the device dumps and the
other elf notes can be properly mmaped at page aligned address.

Incorporate device dump size into the total vmcore size. Also update
offsets for other program headers after the device dumps are added.

Suggested-by: Eric Biederman <ebiederm@xmission.com>.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
---
v7:
- No changes.

v6:
- No changes.

v5:
- No changes.

v4:
- No changes.

v3:
- Patch added in this version.
- Exported dumps as elf notes. Suggested by Eric Biederman
  <ebiederm@xmission.com>.

 fs/proc/vmcore.c | 247 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 243 insertions(+), 4 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 60fce8dfe4e3..e06c9b942c22 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -39,6 +39,8 @@ static size_t elfcorebuf_sz_orig;
 
 static char *elfnotes_buf;
 static size_t elfnotes_sz;
+/* Size of all notes minus the device dump notes */
+static size_t elfnotes_orig_sz;
 
 /* Total size of vmcore file. */
 static u64 vmcore_size;
@@ -51,6 +53,9 @@ static LIST_HEAD(vmcoredd_list);
 static DEFINE_MUTEX(vmcoredd_mutex);
 #endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
 
+/* Device Dump Size */
+static size_t vmcoredd_orig_sz;
+
 /*
  * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error
  * The called function has to take care of module refcounting.
@@ -185,6 +190,77 @@ static int copy_to(void *target, void *src, size_t size, int userbuf)
 	return 0;
 }
 
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+static int vmcoredd_copy_dumps(void *dst, u64 start, size_t size, int userbuf)
+{
+	struct vmcoredd_node *dump;
+	u64 offset = 0;
+	int ret = 0;
+	size_t tsz;
+	char *buf;
+
+	mutex_lock(&vmcoredd_mutex);
+	list_for_each_entry(dump, &vmcoredd_list, list) {
+		if (start < offset + dump->size) {
+			tsz = min(offset + (u64)dump->size - start, (u64)size);
+			buf = dump->buf + start - offset;
+			if (copy_to(dst, buf, tsz, userbuf)) {
+				ret = -EFAULT;
+				goto out_unlock;
+			}
+
+			size -= tsz;
+			start += tsz;
+			dst += tsz;
+
+			/* Leave now if buffer filled already */
+			if (!size)
+				goto out_unlock;
+		}
+		offset += dump->size;
+	}
+
+out_unlock:
+	mutex_unlock(&vmcoredd_mutex);
+	return ret;
+}
+
+static int vmcoredd_mmap_dumps(struct vm_area_struct *vma, unsigned long dst,
+			       u64 start, size_t size)
+{
+	struct vmcoredd_node *dump;
+	u64 offset = 0;
+	int ret = 0;
+	size_t tsz;
+	char *buf;
+
+	mutex_lock(&vmcoredd_mutex);
+	list_for_each_entry(dump, &vmcoredd_list, list) {
+		if (start < offset + dump->size) {
+			tsz = min(offset + (u64)dump->size - start, (u64)size);
+			buf = dump->buf + start - offset;
+			if (remap_vmalloc_range_partial(vma, dst, buf, tsz)) {
+				ret = -EFAULT;
+				goto out_unlock;
+			}
+
+			size -= tsz;
+			start += tsz;
+			dst += tsz;
+
+			/* Leave now if buffer filled already */
+			if (!size)
+				goto out_unlock;
+		}
+		offset += dump->size;
+	}
+
+out_unlock:
+	mutex_unlock(&vmcoredd_mutex);
+	return ret;
+}
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
 /* Read from the ELF header and then the crash dump. On error, negative value is
  * returned otherwise number of bytes read are returned.
  */
@@ -222,10 +298,41 @@ static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
 	if (*fpos < elfcorebuf_sz + elfnotes_sz) {
 		void *kaddr;
 
+		/* We add device dumps before other elf notes because the
+		 * other elf notes may not fill the elf notes buffer
+		 * completely and we will end up with zero-filled data
+		 * between the elf notes and the device dumps. Tools will
+		 * then try to decode this zero-filled data as valid notes
+		 * and we don't want that. Hence, adding device dumps before
+		 * the other elf notes ensure that zero-filled data can be
+		 * avoided.
+		 */
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+		/* Read device dumps */
+		if (*fpos < elfcorebuf_sz + vmcoredd_orig_sz) {
+			tsz = min(elfcorebuf_sz + vmcoredd_orig_sz -
+				  (size_t)*fpos, buflen);
+			start = *fpos - elfcorebuf_sz;
+			if (vmcoredd_copy_dumps(buffer, start, tsz, userbuf))
+				return -EFAULT;
+
+			buflen -= tsz;
+			*fpos += tsz;
+			buffer += tsz;
+			acc += tsz;
+
+			/* leave now if filled buffer already */
+			if (!buflen)
+				return acc;
+		}
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
+		/* Read remaining elf notes */
 		tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)*fpos, buflen);
-		kaddr = elfnotes_buf + *fpos - elfcorebuf_sz;
+		kaddr = elfnotes_buf + *fpos - elfcorebuf_sz - vmcoredd_orig_sz;
 		if (copy_to(buffer, kaddr, tsz, userbuf))
 			return -EFAULT;
+
 		buflen -= tsz;
 		*fpos += tsz;
 		buffer += tsz;
@@ -451,11 +558,46 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
 	if (start < elfcorebuf_sz + elfnotes_sz) {
 		void *kaddr;
 
+		/* We add device dumps before other elf notes because the
+		 * other elf notes may not fill the elf notes buffer
+		 * completely and we will end up with zero-filled data
+		 * between the elf notes and the device dumps. Tools will
+		 * then try to decode this zero-filled data as valid notes
+		 * and we don't want that. Hence, adding device dumps before
+		 * the other elf notes ensure that zero-filled data can be
+		 * avoided. This also ensures that the device dumps and
+		 * other elf notes can be properly mmaped at page aligned
+		 * address.
+		 */
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+		/* Read device dumps */
+		if (start < elfcorebuf_sz + vmcoredd_orig_sz) {
+			u64 start_off;
+
+			tsz = min(elfcorebuf_sz + vmcoredd_orig_sz -
+				  (size_t)start, size);
+			start_off = start - elfcorebuf_sz;
+			if (vmcoredd_mmap_dumps(vma, vma->vm_start + len,
+						start_off, tsz))
+				goto fail;
+
+			size -= tsz;
+			start += tsz;
+			len += tsz;
+
+			/* leave now if filled buffer already */
+			if (!size)
+				return 0;
+		}
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
+		/* Read remaining elf notes */
 		tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)start, size);
-		kaddr = elfnotes_buf + start - elfcorebuf_sz;
+		kaddr = elfnotes_buf + start - elfcorebuf_sz - vmcoredd_orig_sz;
 		if (remap_vmalloc_range_partial(vma, vma->vm_start + len,
 						kaddr, tsz))
 			goto fail;
+
 		size -= tsz;
 		start += tsz;
 		len += tsz;
@@ -703,6 +845,11 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
 	/* Modify e_phnum to reflect merged headers. */
 	ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
 
+	/* Store the size of all notes.  We need this to update the note
+	 * header when the device dumps will be added.
+	 */
+	elfnotes_orig_sz = phdr.p_memsz;
+
 	return 0;
 }
 
@@ -889,6 +1036,11 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
 	/* Modify e_phnum to reflect merged headers. */
 	ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
 
+	/* Store the size of all notes.  We need this to update the note
+	 * header when the device dumps will be added.
+	 */
+	elfnotes_orig_sz = phdr.p_memsz;
+
 	return 0;
 }
 
@@ -981,8 +1133,8 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
 }
 
 /* Sets offset fields of vmcore elements. */
-static void __init set_vmcore_list_offsets(size_t elfsz, size_t elfnotes_sz,
-					   struct list_head *vc_list)
+static void set_vmcore_list_offsets(size_t elfsz, size_t elfnotes_sz,
+				    struct list_head *vc_list)
 {
 	loff_t vmcore_off;
 	struct vmcore *m;
@@ -1174,6 +1326,92 @@ static void vmcoredd_write_header(void *buf, struct vmcoredd_data *data,
 	memcpy(vdd_hdr->dump_name, data->dump_name, sizeof(vdd_hdr->dump_name));
 }
 
+/**
+ * vmcoredd_update_program_headers - Update all Elf program headers
+ * @elfptr: Pointer to elf header
+ * @elfnotesz: Size of elf notes aligned to page size
+ * @vmcoreddsz: Size of device dumps to be added to elf note header
+ *
+ * Determine type of Elf header (Elf64 or Elf32) and update the elf note size.
+ * Also update the offsets of all the program headers after the elf note header.
+ */
+static void vmcoredd_update_program_headers(char *elfptr, size_t elfnotesz,
+					    size_t vmcoreddsz)
+{
+	unsigned char *e_ident = (unsigned char *)elfptr;
+	u64 start, end, size;
+	loff_t vmcore_off;
+	u32 i;
+
+	vmcore_off = elfcorebuf_sz + elfnotesz;
+
+	if (e_ident[EI_CLASS] == ELFCLASS64) {
+		Elf64_Ehdr *ehdr = (Elf64_Ehdr *)elfptr;
+		Elf64_Phdr *phdr = (Elf64_Phdr *)(elfptr + sizeof(Elf64_Ehdr));
+
+		/* Update all program headers */
+		for (i = 0; i < ehdr->e_phnum; i++, phdr++) {
+			if (phdr->p_type == PT_NOTE) {
+				/* Update note size */
+				phdr->p_memsz = elfnotes_orig_sz + vmcoreddsz;
+				phdr->p_filesz = phdr->p_memsz;
+				continue;
+			}
+
+			start = rounddown(phdr->p_offset, PAGE_SIZE);
+			end = roundup(phdr->p_offset + phdr->p_memsz,
+				      PAGE_SIZE);
+			size = end - start;
+			phdr->p_offset = vmcore_off + (phdr->p_offset - start);
+			vmcore_off += size;
+		}
+	} else {
+		Elf32_Ehdr *ehdr = (Elf32_Ehdr *)elfptr;
+		Elf32_Phdr *phdr = (Elf32_Phdr *)(elfptr + sizeof(Elf32_Ehdr));
+
+		/* Update all program headers */
+		for (i = 0; i < ehdr->e_phnum; i++, phdr++) {
+			if (phdr->p_type == PT_NOTE) {
+				/* Update note size */
+				phdr->p_memsz = elfnotes_orig_sz + vmcoreddsz;
+				phdr->p_filesz = phdr->p_memsz;
+				continue;
+			}
+
+			start = rounddown(phdr->p_offset, PAGE_SIZE);
+			end = roundup(phdr->p_offset + phdr->p_memsz,
+				      PAGE_SIZE);
+			size = end - start;
+			phdr->p_offset = vmcore_off + (phdr->p_offset - start);
+			vmcore_off += size;
+		}
+	}
+}
+
+/**
+ * vmcoredd_update_size - Update the total size of the device dumps and update
+ * Elf header
+ * @dump_size: Size of the current device dump to be added to total size
+ *
+ * Update the total size of all the device dumps and update the Elf program
+ * headers. Calculate the new offsets for the vmcore list and update the
+ * total vmcore size.
+ */
+static void vmcoredd_update_size(size_t dump_size)
+{
+	vmcoredd_orig_sz += dump_size;
+	elfnotes_sz = roundup(elfnotes_orig_sz, PAGE_SIZE) + vmcoredd_orig_sz;
+	vmcoredd_update_program_headers(elfcorebuf, elfnotes_sz,
+					vmcoredd_orig_sz);
+
+	/* Update vmcore list offsets */
+	set_vmcore_list_offsets(elfcorebuf_sz, elfnotes_sz, &vmcore_list);
+
+	vmcore_size = get_vmcore_size(elfcorebuf_sz, elfnotes_sz,
+				      &vmcore_list);
+	proc_vmcore->size = vmcore_size;
+}
+
 /**
  * vmcore_add_device_dump - Add a buffer containing device dump to vmcore
  * @data: dump info.
@@ -1227,6 +1465,7 @@ static int __vmcore_add_device_dump(struct vmcoredd_data *data)
 	list_add_tail(&dump->list, &vmcoredd_list);
 	mutex_unlock(&vmcoredd_mutex);
 
+	vmcoredd_update_size(data_size);
 	return 0;
 
 out_err:
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next v7 1/3] vmcore: add API to collect hardware dump in second kernel
From: Rahul Lakkireddy @ 2018-05-01 18:27 UTC (permalink / raw)
  To: netdev, kexec, linux-fsdevel, linux-kernel
  Cc: davem, viro, ebiederm, stephen, akpm, torvalds, ganeshgr,
	nirranjan, indranil, Rahul Lakkireddy
In-Reply-To: <cover.1525197407.git.rahul.lakkireddy@chelsio.com>

The sequence of actions done by device drivers to append their device
specific hardware/firmware logs to /proc/vmcore are as follows:

1. During probe (before hardware is initialized), device drivers
register to the vmcore module (via vmcore_add_device_dump()), with
callback function, along with buffer size and log name needed for
firmware/hardware log collection.

2. vmcore module allocates the buffer with requested size. It adds
an Elf note and invokes the device driver's registered callback
function.

3. Device driver collects all hardware/firmware logs into the buffer
and returns control back to vmcore module.

Ensure that the device dump buffer size is always aligned to page size
so that it can be mmaped.

Also, rename alloc_elfnotes_buf() to vmcore_alloc_buf() to make it more
generic and reserve NT_VMCOREDD note type to indicate vmcore device
dump.

Suggested-by: Eric Biederman <ebiederm@xmission.com>.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
---
v7:
- Removed "CHELSIO" vendor identifier in Elf Note name. Instead,
  writing "LINUX".
- Moved vmcoredd_header to new file include/uapi/linux/vmcore.h
- Reworked vmcoredd_header to include Elf Note as part of the header
  itself.
- Removed vmcoredd_get_note_size().
- Renamed vmcoredd_write_note() to vmcoredd_write_header().
- Replaced all "unsigned long" with "unsigned int" for device dump
  size since max size of Elf Word is u32.

v6:
- Reworked device dump elf note name to contain vendor identifier.
- Added vmcoredd_header that precedes actual dump in the Elf Note.
- Device dump's name is moved inside vmcoredd_header.

v5:
- Removed enabling CONFIG_PROC_VMCORE_DEVICE_DUMP by default and
  updated help message to indicate that the driver must be present
  in second kernel's initramfs to collect the underlying device
  snapshot.

v4:
- Made __vmcore_add_device_dump() static.
- Moved compile check to define vmcore_add_device_dump() to
  crash_dump.h to fix compilation when vmcore.c is not compiled in.
- Convert ---help--- to help in Kconfig as indicated by checkpatch.
- Rebased to tip.

v3:
- Dropped sysfs crashdd module.
- Added CONFIG_PROC_VMCORE_DEVICE_DUMP to allow configuring device
  dump support.
- Moved logic related to adding dumps from crashdd to vmcore module.
- Rename all crashdd* to vmcoredd*.

v2:
- Added ABI Documentation for crashdd.
- Directly use octal permission instead of macro.

Changes since rfc v2:
- Moved exporting crashdd from procfs to sysfs.  Suggested by
  Stephen Hemminger <stephen@networkplumber.org>
- Moved code from fs/proc/crashdd.c to fs/crashdd/ directory.
- Replaced all proc API with sysfs API and updated comments.
- Calling driver callback before creating the binary file under
  crashdd sysfs.
- Changed binary dump file permission from S_IRUSR to S_IRUGO.
- Changed module name from CRASH_DRIVER_DUMP to CRASH_DEVICE_DUMP.

rfc v2:
- Collecting logs in 2nd kernel instead of during kernel panic.
  Suggested by Eric Biederman <ebiederm@xmission.com>.
- Patch added in this series.

 fs/proc/Kconfig             |  15 +++++
 fs/proc/vmcore.c            | 140 +++++++++++++++++++++++++++++++++++++++++---
 include/linux/crash_dump.h  |  18 ++++++
 include/linux/kcore.h       |   6 ++
 include/uapi/linux/elf.h    |   1 +
 include/uapi/linux/vmcore.h |  16 +++++
 6 files changed, 187 insertions(+), 9 deletions(-)
 create mode 100644 include/uapi/linux/vmcore.h

diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 1ade1206bb89..0eaeb41453f5 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -43,6 +43,21 @@ config PROC_VMCORE
         help
         Exports the dump image of crashed kernel in ELF format.
 
+config PROC_VMCORE_DEVICE_DUMP
+	bool "Device Hardware/Firmware Log Collection"
+	depends on PROC_VMCORE
+	default n
+	help
+	  After kernel panic, device drivers can collect the device
+	  specific snapshot of their hardware or firmware before the
+	  underlying devices are initialized in crash recovery kernel.
+	  Note that the device driver must be present in the crash
+	  recovery kernel's initramfs to collect its underlying device
+	  snapshot.
+
+	  If you say Y here, the collected device dumps will be added
+	  as ELF notes to /proc/vmcore.
+
 config PROC_SYSCTL
 	bool "Sysctl support (/proc/sys)" if EXPERT
 	depends on PROC_FS
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index a45f0af22a60..60fce8dfe4e3 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -20,6 +20,7 @@
 #include <linux/init.h>
 #include <linux/crash_dump.h>
 #include <linux/list.h>
+#include <linux/mutex.h>
 #include <linux/vmalloc.h>
 #include <linux/pagemap.h>
 #include <linux/uaccess.h>
@@ -44,6 +45,12 @@ static u64 vmcore_size;
 
 static struct proc_dir_entry *proc_vmcore;
 
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+/* Device Dump list and mutex to synchronize access to list */
+static LIST_HEAD(vmcoredd_list);
+static DEFINE_MUTEX(vmcoredd_mutex);
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
 /*
  * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error
  * The called function has to take care of module refcounting.
@@ -302,10 +309,8 @@ static const struct vm_operations_struct vmcore_mmap_ops = {
 };
 
 /**
- * alloc_elfnotes_buf - allocate buffer for ELF note segment in
- *                      vmalloc memory
- *
- * @notes_sz: size of buffer
+ * vmcore_alloc_buf - allocate buffer in vmalloc memory
+ * @sizez: size of buffer
  *
  * If CONFIG_MMU is defined, use vmalloc_user() to allow users to mmap
  * the buffer to user-space by means of remap_vmalloc_range().
@@ -313,12 +318,12 @@ static const struct vm_operations_struct vmcore_mmap_ops = {
  * If CONFIG_MMU is not defined, use vzalloc() since mmap_vmcore() is
  * disabled and there's no need to allow users to mmap the buffer.
  */
-static inline char *alloc_elfnotes_buf(size_t notes_sz)
+static inline char *vmcore_alloc_buf(size_t size)
 {
 #ifdef CONFIG_MMU
-	return vmalloc_user(notes_sz);
+	return vmalloc_user(size);
 #else
-	return vzalloc(notes_sz);
+	return vzalloc(size);
 #endif
 }
 
@@ -665,7 +670,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
 		return rc;
 
 	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
-	*notes_buf = alloc_elfnotes_buf(*notes_sz);
+	*notes_buf = vmcore_alloc_buf(*notes_sz);
 	if (!*notes_buf)
 		return -ENOMEM;
 
@@ -851,7 +856,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
 		return rc;
 
 	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
-	*notes_buf = alloc_elfnotes_buf(*notes_sz);
+	*notes_buf = vmcore_alloc_buf(*notes_sz);
 	if (!*notes_buf)
 		return -ENOMEM;
 
@@ -1145,6 +1150,120 @@ static int __init parse_crash_elf_headers(void)
 	return 0;
 }
 
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+/**
+ * vmcoredd_write_header - Write vmcore device dump header at the
+ * beginning of the dump's buffer.
+ * @buf: Output buffer where the note is written
+ * @data: Dump info
+ * @size: Size of the dump
+ *
+ * Fills beginning of the dump's buffer with vmcore device dump header.
+ */
+static void vmcoredd_write_header(void *buf, struct vmcoredd_data *data,
+				  u32 size)
+{
+	struct vmcoredd_header *vdd_hdr = (struct vmcoredd_header *)buf;
+
+	vdd_hdr->n_namesz = sizeof(vdd_hdr->name);
+	vdd_hdr->n_descsz = size + sizeof(vdd_hdr->dump_name);
+	vdd_hdr->n_type = NT_VMCOREDD;
+
+	strncpy((char *)vdd_hdr->name, VMCOREDD_NOTE_NAME,
+		sizeof(vdd_hdr->name));
+	memcpy(vdd_hdr->dump_name, data->dump_name, sizeof(vdd_hdr->dump_name));
+}
+
+/**
+ * vmcore_add_device_dump - Add a buffer containing device dump to vmcore
+ * @data: dump info.
+ *
+ * Allocate a buffer and invoke the calling driver's dump collect routine.
+ * Write Elf note at the beginning of the buffer to indicate vmcore device
+ * dump and add the dump to global list.
+ */
+static int __vmcore_add_device_dump(struct vmcoredd_data *data)
+{
+	struct vmcoredd_node *dump;
+	void *buf = NULL;
+	size_t data_size;
+	int ret;
+
+	if (!data || !strlen(data->dump_name) ||
+	    !data->vmcoredd_callback || !data->size)
+		return -EINVAL;
+
+	dump = vzalloc(sizeof(*dump));
+	if (!dump) {
+		ret = -ENOMEM;
+		goto out_err;
+	}
+
+	/* Keep size of the buffer page aligned so that it can be mmaped */
+	data_size = roundup(sizeof(struct vmcoredd_header) + data->size,
+			    PAGE_SIZE);
+
+	/* Allocate buffer for driver's to write their dumps */
+	buf = vmcore_alloc_buf(data_size);
+	if (!buf) {
+		ret = -ENOMEM;
+		goto out_err;
+	}
+
+	vmcoredd_write_header(buf, data, data_size -
+			      sizeof(struct vmcoredd_header));
+
+	/* Invoke the driver's dump collection routing */
+	ret = data->vmcoredd_callback(data, buf +
+				      sizeof(struct vmcoredd_header));
+	if (ret)
+		goto out_err;
+
+	dump->buf = buf;
+	dump->size = data_size;
+
+	/* Add the dump to driver sysfs list */
+	mutex_lock(&vmcoredd_mutex);
+	list_add_tail(&dump->list, &vmcoredd_list);
+	mutex_unlock(&vmcoredd_mutex);
+
+	return 0;
+
+out_err:
+	if (buf)
+		vfree(buf);
+
+	if (dump)
+		vfree(dump);
+
+	return ret;
+}
+
+int vmcore_add_device_dump(struct vmcoredd_data *data)
+{
+	return __vmcore_add_device_dump(data);
+}
+EXPORT_SYMBOL(vmcore_add_device_dump);
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
+/* Free all dumps in vmcore device dump list */
+static void vmcore_free_device_dumps(void)
+{
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+	mutex_lock(&vmcoredd_mutex);
+	while (!list_empty(&vmcoredd_list)) {
+		struct vmcoredd_node *dump;
+
+		dump = list_first_entry(&vmcoredd_list, struct vmcoredd_node,
+					list);
+		list_del(&dump->list);
+		vfree(dump->buf);
+		vfree(dump);
+	}
+	mutex_unlock(&vmcoredd_mutex);
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+}
+
 /* Init function for vmcore module. */
 static int __init vmcore_init(void)
 {
@@ -1192,4 +1311,7 @@ void vmcore_cleanup(void)
 		kfree(m);
 	}
 	free_elfcorebuf();
+
+	/* clear vmcore device dump list */
+	vmcore_free_device_dumps();
 }
diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
index f7ac2aa93269..3e4ba9d753c8 100644
--- a/include/linux/crash_dump.h
+++ b/include/linux/crash_dump.h
@@ -5,6 +5,7 @@
 #include <linux/kexec.h>
 #include <linux/proc_fs.h>
 #include <linux/elf.h>
+#include <uapi/linux/vmcore.h>
 
 #include <asm/pgtable.h> /* for pgprot_t */
 
@@ -93,4 +94,21 @@ static inline bool is_kdump_kernel(void) { return 0; }
 #endif /* CONFIG_CRASH_DUMP */
 
 extern unsigned long saved_max_pfn;
+
+/* Device Dump information to be filled by drivers */
+struct vmcoredd_data {
+	char dump_name[VMCOREDD_MAX_NAME_BYTES]; /* Unique name of the dump */
+	unsigned int size;                       /* Size of the dump */
+	/* Driver's registered callback to be invoked to collect dump */
+	int (*vmcoredd_callback)(struct vmcoredd_data *data, void *buf);
+};
+
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+int vmcore_add_device_dump(struct vmcoredd_data *data);
+#else
+static inline int vmcore_add_device_dump(struct vmcoredd_data *data)
+{
+	return -EOPNOTSUPP;
+}
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
 #endif /* LINUX_CRASHDUMP_H */
diff --git a/include/linux/kcore.h b/include/linux/kcore.h
index 80db19d3a505..8de55e4b5ee9 100644
--- a/include/linux/kcore.h
+++ b/include/linux/kcore.h
@@ -28,6 +28,12 @@ struct vmcore {
 	loff_t offset;
 };
 
+struct vmcoredd_node {
+	struct list_head list;	/* List of dumps */
+	void *buf;		/* Buffer containing device's dump */
+	unsigned int size;	/* Size of the buffer */
+};
+
 #ifdef CONFIG_PROC_KCORE
 extern void kclist_add(struct kcore_list *, void *, size_t, int type);
 #else
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index e2535d6dcec7..4e12c423b9fe 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -421,6 +421,7 @@ typedef struct elf64_shdr {
 #define NT_ARM_SYSTEM_CALL	0x404	/* ARM system call number */
 #define NT_ARM_SVE	0x405		/* ARM Scalable Vector Extension registers */
 #define NT_ARC_V2	0x600		/* ARCv2 accumulator/extra registers */
+#define NT_VMCOREDD	0x700		/* Vmcore Device Dump Note */
 
 /* Note header in a PT_NOTE section */
 typedef struct elf32_note {
diff --git a/include/uapi/linux/vmcore.h b/include/uapi/linux/vmcore.h
new file mode 100644
index 000000000000..f9eab9a37370
--- /dev/null
+++ b/include/uapi/linux/vmcore.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _UAPI_VMCORE_H
+#define _UAPI_VMCORE_H
+
+#define VMCOREDD_NOTE_NAME "LINUX"
+#define VMCOREDD_MAX_NAME_BYTES 44
+
+struct vmcoredd_header {
+	__u32 n_namesz; /* Name size */
+	__u32 n_descsz; /* Content size */
+	__u32 n_type;   /* NT_VMCOREDD */
+	__u8 name[8];   /* LINUX\0\0\0 */
+	__u8 dump_name[VMCOREDD_MAX_NAME_BYTES]; /* Device dump's name */
+};
+
+#endif /* _UAPI_VMCORE_H */
-- 
2.14.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox