Netdev List
 help / color / mirror / Atom feed
* [PATCH net-next 1/9] net: hns3: Remove error log when getting pfc stats fails
From: Salil Mehta @ 2018-05-01 18:55 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil.lnk, netdev,
	linux-kernel, linuxarm, Yunsheng Lin
In-Reply-To: <20180501185605.9584-1-salil.mehta@huawei.com>

From: Yunsheng Lin <linyunsheng@huawei.com>

When mac supports DCB, but is in GE mode, it does not support
querying pfc stats, firmware returns error when trying to
query the pfc stats. this creates a lot of noise in the kernel
log when it prints the error log.

This patch fixes it by removing the error log, because it already
return the error to the user space, so the user should be aware of
the error.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
index 885f25c..c69ecab 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
@@ -134,11 +134,8 @@ static int hclge_pfc_stats_get(struct hclge_dev *hdev,
 	}
 
 	ret = hclge_cmd_send(&hdev->hw, desc, HCLGE_TM_PFC_PKT_GET_CMD_NUM);
-	if (ret) {
-		dev_err(&hdev->pdev->dev,
-			"Get pfc pause stats fail, ret = %d.\n", ret);
+	if (ret)
 		return ret;
-	}
 
 	for (i = 0; i < HCLGE_TM_PFC_PKT_GET_CMD_NUM; i++) {
 		struct hclge_pfc_stats_cmd *pfc_stats =
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 0/9] Misc bug fixes for HNS3 Ethernet driver
From: Salil Mehta @ 2018-05-01 18:55 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil.lnk, netdev,
	linux-kernel, linuxarm

This patch-set presents some miscellaneous bug fixs and cleanups for
HNS3 Ethernet Driver.

Fuyun Liang (1):
  net: hns3: Fix to support autoneg only for port attached with phy

Huazhong Tan (5):
  net: hns3: fix to correctly fetch l4 protocol outer header
  net: hns3: Fixes the out of bounds access in hclge_map_tqp
  net: hns3: Fixes the error legs in hclge_init_ae_dev function
  net: hns3: fix for phy_addr error in hclge_mac_mdio_config
  net: hns3: fix a dead loop in hclge_cmd_csq_clean

Xi Wang (1):
  net: hns3: Remove packet statistics in the range of 8192~12287

Yunsheng Lin (2):
  net: hns3: Remove error log when getting pfc stats fails
  net: hns3: Fix for packet loss due wrong filter config in VLAN tbls

 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c    |   2 +-
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c |  19 +++
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c    | 167 ++++++++++++++-------
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h    |  20 ++-
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c |   7 +-
 .../ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c    |   7 +-
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c  |   5 +-
 7 files changed, 153 insertions(+), 74 deletions(-)

-- 
2.7.4

^ permalink raw reply

* Re: [PATCH net-next v6] Add Common Applications Kept Enhanced (cake) qdisc
From: Toke Høiland-Jørgensen @ 2018-05-01 18:53 UTC (permalink / raw)
  To: Eric Dumazet, Dave Taht, Cong Wang
  Cc: Linux Kernel Network Developers, Cake List
In-Reply-To: <dff5ba09-1b81-7cd8-4be3-e7a419797d8e@gmail.com>

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On 04/30/2018 02:27 PM, Dave Taht wrote:
>
>> I actually have a tc - bpf based ack filter, during the development of
>> cake's ack-thinner, that I should submit one of these days. It
>> proved to be of limited use.
>> 
>> Probably the biggest mistake we made is by calling this cake feature a
>> filter. It isn't.
>> 
>> Maybe we should have called it a "thinner" or something like that? In
>> order to properly "thin" or "reduce" an ack stream
>> you have to have a queue to look at and some related state. TC filters
>> do not operate on queues, qdiscs do. Thus the "ack-filter" here is
>> deeply embedded into cake's flow isolation and queue structures.
>
> A feature eating packets _is_ a filter.
>
> Given that a qdisc only sees one direction, I really do not get why
> ack-filter is so desirable in a packet scheduler.

The ACK filter in CAKE is there to solve a very particular use case:
Residential internet connections with bandwidths so asymmetric that it
hurts TCP performance. It is not on by default; but rather meant to be
configured by users which suffer from this particular ISP brokenness
(which, sadly, does happen; there are ISPs who believe a 50/1 bandwidth
ratio is reasonable). For those users, the ACK filter can help. 

We certainly do not advise people to turn it on in the general case! As
you point, in general TCP performance is best improved in the TCP stack...

> You have not provided any numbers to show how useful it is to maintain
> this code

You mean apart from that in the linked blog post and the paper?
Here's an executive summary:

On a 30/1 Mbps connection with a bidirectional traffic test (RRUL),
turning on the ACK filter improves downstream throughput by ~20% and
upstream throughput by ~12% in conservative mode and ~40% in aggressive
mode, at the cost of ~5ms of inter-flow latency due to the increased
congestion.

In *really* pathological cases, the effect can be a lot more; for
instance, the ACK filter increases the achievable downstream throughput
on a link with 100 Kbps in the upstream direction by an order of
magnitude (from ~2.5 Mbps to ~25 Mbps).

> (probably still broken btw, considering it is changing some skb
> attributes).

As you may have noticed over the last few iterations, I've actually been
trying to fix any brokenness. If you could be specific as to what is
still broken, that would be helpful.

(I'm assuming are referring to the calls to skb_set_inner* - but do you
mean all of them, or just the ones that set inner == outer?)

> On wifi (or any half duplex medium), you might gain something by
> carefully sending ACK not too often, but ultimately this should be
> done by TCP stack, in well controlled environment [1], instead of
> middle-boxes happily playing/breaking some Internet standards.
>
> [1] TCP stack has the estimations of RTT, RWIN, throughput, and might
> be able to avoid flooding the network with acks under some conditions.

You are quite right that in general, TCP performance is best improved in
the TCP stack. But home users are not generally in control of that; the
ACK filter helps in those specific deployments (again, it's off by
default, and you shouldn't turn it on in the general case!).

> Say RTT is 100ms, and we receive 1 packet every 100 usec (no GRO
> aggregation) Maybe we do not really _need_ to send 5000 ack per second
> (or even 10,000 ack per second if a hole needs a repair)

Yes, please do fix that.

> Also on wifi, the queue builds in the driver queues anyway, not in the
> qdisc.

Actually, we've fixed that (for some drivers); there is now no qdisc on
WiFi interfaces, and we've integrated FQ-CoDel'ed queueing into the
stack where it can be effective. But no, running CAKE with an ACK filter
on a WiFi link is not going to be effective. Don't do that.

> So ACK filtering, if _really_ successful, would need to be
> modularized.

I really hope ACK filtering won't be "_really_ successful". Again, it is
not meant to be a feature that's on everywhere, it's targeting a very
specific use case that CAKE is optimised for: The home router use case.

> Please split Cake into a patch series.
> Presumably putting the ack-filter on a patch of its own.

*sigh* - can do, I guess. Though I'm not sure what that is going to
accomplish, exactly?

-Toke

^ permalink raw reply

* Silently dropped UDP packets on kernel 4.14
From: Kristian Evensen @ 2018-05-01 18:50 UTC (permalink / raw)
  To: Netfilter Development Mailing list, Network Development

Hello,

I am currently debugging an issue where it looks like UDP packets are
silently dropped. My setup is a client with a tool that sends two UDP
packets (DNS requests) "simultaneously" using the same socket, and
then a router running latest OpenWRT (with kernel 4.14.37). What I am
seeing is that the first request goes through, while the second seems
to be lost somewhere. I can see the second packet in my firewall logs
on the router (all the way past the POSTROUTING chain in the
nat-table), but the packet is not sent over the wire (checked with
tcpdump).

In order to try to figure out what was going on, I added some
LOG-rules to the firewall. I Inserted the rules in the PREROUTING
chains in the raw, mangle and nat tables, in filter FORWARD and nat
and mangle POSTROUTING. Here are the logs for one set of requests
(with some notes):

First DNS request (verified by packet length):
[723269.968256] raw-pre: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=56 TOS=0x00 PREC=0x00 TTL=64 ID=8147 DF PROTO=UDP SPT=42397 DPT=53
LEN=36
[723269.984892] mangle-pre: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=56 TOS=0x00 PREC=0x00 TTL=64 ID=8147 DF PROTO=UDP SPT=42397 DPT=53
LEN=36
[723270.001769] mangle-pre #2: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=56 TOS=0x00 PREC=0x00 TTL=64 ID=8147 DF PROTO=UDP SPT=42397 DPT=53
LEN=36 -> this LOG-rule only matches when ctstate is NEW.

Second DNS request (verified by packet length):
[723270.019033] raw-pre: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=58 TOS=0x00 PREC=0x00 TTL=64 ID=8148 DF PROTO=UDP SPT=42397 DPT=53
LEN=38
[723270.035734] mangle-pre: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=58 TOS=0x00 PREC=0x00 TTL=64 ID=8148 DF PROTO=UDP SPT=42397 DPT=53
LEN=38
-> No mangle-pre #2, so the conntrack entry is found

Processing first request:
[723270.036213] nat-pre: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=56 TOS=0x00 PREC=0x00 TTL=64 ID=8147 DF PROTO=UDP SPT=42397 DPT=53
LEN=36 MARK=0x4000200 -> I use policy routing, therefore mark. Routing
works correctly.
[723270.036366] filter-fwd: IN=br-lan OUT=eth0.2
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=56 TOS=0x00 PREC=0x00 TTL=63 ID=8147 DF PROTO=UDP SPT=42397 DPT=53
LEN=36 MARK=0x4000200
[723270.036419] mangle-post: IN= OUT=eth0.2 SRC=10.0.0.3 DST=1.1.1.1
LEN=56 TOS=0x00 PREC=0x00 TTL=63 ID=8147 DF PROTO=UDP SPT=42397 DPT=53
LEN=36 MARK=0x4000200
[723270.036457] nat-post: IN= OUT=eth0.2 SRC=10.0.0.3 DST=1.1.1.1
LEN=56 TOS=0x00 PREC=0x00 TTL=63 ID=8147 DF PROTO=UDP SPT=42397 DPT=53
LEN=36 MARK=0x4000200

Processing second request:
[723270.117394] mangle-pre #2: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=58 TOS=0x00 PREC=0x00 TTL=64 ID=8148 DF PROTO=UDP SPT=42397 DPT=53
LEN=38
[723270.136644] nat-pre: IN=br-lan OUT=
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=58 TOS=0x00 PREC=0x00 TTL=64 ID=8148 DF PROTO=UDP SPT=42397 DPT=53
LEN=38 MARK=0x4000200
[723270.154716] filter-fwd: IN=br-lan OUT=eth0.2
MAC=78:a3:51:34:e4:86:2e:8e:de:be:22:5a:08:00 SRC=10.0.0.3 DST=1.1.1.1
LEN=58 TOS=0x00 PREC=0x00 TTL=63 ID=8148 DF PROTO=UDP SPT=42397 DPT=53
LEN=38 MARK=0x4000200
[723270.173501] mangle-post: IN= OUT=eth0.2 SRC=10.0.0.3 DST=1.1.1.1
LEN=58 TOS=0x00 PREC=0x00 TTL=63 ID=8148 DF PROTO=UDP SPT=42397 DPT=53
LEN=38 MARK=0x4000200
[723270.187787] nat-post: IN= OUT=eth0.2 SRC=10.0.0.3 DST=1.1.1.1
LEN=58 TOS=0x00 PREC=0x00 TTL=63 ID=8148 DF PROTO=UDP SPT=42397 DPT=53
LEN=38 MARK=0x4000200

When dumping conntrack after the second request has been processed, I
get the following:
udp      17 55 src=10.0.0.3 dst=1.1.1.1 sport=42397 dport=53 packets=1
bytes=56 src=1.1.1.1 dst=193.213.155.210 sport=53 dport=42397
packets=1 bytes=72 mark=67109120 delta-time=4 use=1

This looks a bit strange to me, it seems that only the first request
has been counted.

Does anyone have any idea of what could be wrong, where I should look
or other things I can try? I tried to space the requests out a bit in
time (I inserted a sleep 1 between them), and then the problem went
away.

Thanks in advance for any help,
Kristian

^ permalink raw reply

* Re: [RFC net-next 4/5] net: phy: Add support for IEEE standard test modes
From: Florian Fainelli @ 2018-05-01 18:43 UTC (permalink / raw)
  To: Woojung.Huh, netdev
  Cc: andrew, rmk, linux-kernel, davem, cphealy, nikita.yoush,
	vivien.didelot, Nisar.Sayed, UNGLinuxDriver
In-Reply-To: <9235D6609DB808459E95D78E17F2E43D40D553D5@CHN-SV-EXMX02.mchp-main.com>

On 05/01/2018 10:29 AM, Woojung.Huh@microchip.com wrote:
> Hi Florian,
> 
>> diff --git a/drivers/net/phy/phy-tests.c b/drivers/net/phy/phy-tests.c
> ...
>> +/* genphy_set_test - Make a PHY enter one of the standard IEEE defined
>> + * test modes
>> + * @phydev: the PHY device instance
>> + * @test: the desired test mode
>> + * @data: test specific data (none)
>> + *
>> + * This function makes the designated @phydev enter the desired standard
>> + * 100BaseT2 or 1000BaseT test mode as defined in IEEE 802.3-2012 section TWO
>> + * and THREE under 32.6.1.2.1 and 40.6.1.1.2 respectively
>> + */
>> +int genphy_set_test(struct phy_device *phydev,
>> +		    struct ethtool_phy_test *test, const u8 *data)
>> +{
>> +	u16 shift, base, bmcr = 0;
>> +	int ret;
>> +
>> +	/* Exit test mode */
>> +	if (test->mode == PHY_STD_TEST_MODE_NORMAL) {
>> +		ret = phy_read(phydev, MII_CTRL1000);
>> +		if (ret < 0)
>> +			return ret;
>> +
>> +		ret &= ~GENMASK(15, 13);
>> +
>> +		return phy_write(phydev, MII_CTRL1000, ret);
>> +	}
>> +
>> +	switch (test->mode) {
>> +	case PHY_STD_TEST_MODE_100BASET2_1:
>> +	case PHY_STD_TEST_MODE_100BASET2_2:
>> +	case PHY_STD_TEST_MODE_100BASET2_3:
>> +		if (!(phydev->supported & PHY_100BT_FEATURES))
>> +			return -EOPNOTSUPP;
>> +
>> +		shift = 14;
>> +		base = test->mode - PHY_STD_TEST_MODE_NORMAL;
>> +		bmcr = BMCR_SPEED100;
>> +		break;
>> +
>> +	case PHY_STD_TEST_MODE_1000BASET_1:
>> +	case PHY_STD_TEST_MODE_1000BASET_2:
>> +	case PHY_STD_TEST_MODE_1000BASET_3:
>> +	case PHY_STD_TEST_MODE_1000BASET_4:
>> +		if (!(phydev->supported & PHY_1000BT_FEATURES))
>> +			return -EOPNOTSUPP;
>> +
>> +		shift = 13;
>> +		base = test->mode - PHY_STD_TEST_MODE_100BASET2_MAX;
>> +		bmcr = BMCR_SPEED1000;
>> +		break;
>> +
>> +	default:
>> +		/* Let an upper driver deal with additional modes it may
>> +		 * support
>> +		 */
>> +		return -EOPNOTSUPP;
>> +	}
>> +
>> +	/* Force speed and duplex */
>> +	ret = phy_write(phydev, MII_BMCR, bmcr | BMCR_FULLDPLX);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	/* Set the desired test mode bit */
>> +	return phy_write(phydev, MII_CTRL1000, (test->mode + base) << shift);
>> +}
> For now, these are for 100B-T2 & 1000B-TX.
> But, other speeds such as 802.3bw/bq/cq have very similar format,
> how about make  phy_write() to BMCR & CTRL1000 as another function call per speed?

Not sure I completely understand your suggestion, do you mean that I
should break down the body of that function above such that there are
per-speed lower level functions? Something like the pseudo-code below:

genphy_set_test() {
	switch (mode) {
	case PHY_STD_TEST_MODE_100BASET2_1:
	..
	case PHY_STD_TEST_MODE_100BASET2_3:
		return genphy_set_100baset2();

	case PHY_STD_TEST_MODE_1000BASET_1:
	..
	case PHY_STD_TEST_MODE_1000BASET_4:
		return genphy_set_1000baset();

	case PHY_STD_TEST_MODE_8021BWQCQ_1:
		return genphy_set_100baset1();

}

Or did you want to see a different way of mapping a given speed/feature
set to a specific test function?
-- 
Florian

^ permalink raw reply

* Re: [PATCH V3 net-next 1/2] tcp: send in-queue bytes in cmsg upon read
From: David Miller @ 2018-05-01 18:34 UTC (permalink / raw)
  To: soheil.kdev; +Cc: netdev, ycheng, ncardwell, edumazet, willemb, soheil
In-Reply-To: <20180501141128.208705-1-soheil.kdev@gmail.com>

From: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Date: Tue,  1 May 2018 10:11:27 -0400

> +static inline int tcp_inq_hint(struct sock *sk)

Please do not use 'inline' in foo.c files, let the compiler decide.

Otherwise looks great, thanks.

^ permalink raw reply

* [PATCH iproute2-next] Add tc-BPF example for a TCP pure ack recognizer
From: Dave Taht @ 2018-05-01 18:32 UTC (permalink / raw)
  To: netdev; +Cc: Dave Taht

ack_recognize can shift pure tcp acks into another flowid.
---
 examples/bpf/ack_recognize.c | 98 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 98 insertions(+)
 create mode 100644 examples/bpf/ack_recognize.c

diff --git a/examples/bpf/ack_recognize.c b/examples/bpf/ack_recognize.c
new file mode 100644
index 0000000..5da620c
--- /dev/null
+++ b/examples/bpf/ack_recognize.c
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2017 Google Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA.
+ */
+
+/*
+ * Author: dave.taht@gmail.com (Dave Taht)
+ *
+ * ack_recognizer: An eBPF program that correctly recognizes modern TCP ACKs,
+ * with tcp option fields like SACK and timestamps, and no additional data.
+ *
+ * ack_match call: Recognize "pure acks" with no data payload
+ *
+ */
+
+#include "bpf_api.h"
+#include "linux/if_ether.h"
+#include "linux/ip.h"
+#include "linux/in.h"
+#include "linux/ipv6.h"
+#include "linux/tcp.h"
+
+/*
+ * A pure ack contains the ip header, the tcp header + options, flags with the
+ * ack field set, and no additional payload. That last bit is what every prior
+ * ack filter gets wrong, they typically assume an obsolete 64 bytes, and don't
+ * calculate the options (like sack or timestamps) to subtract from the payload.
+ */
+
+__section_cls_entry
+int ack_match(struct __sk_buff *skb)
+{
+	void *data = (void *)(long)skb->data;
+	void *data_end = (void *)(long)skb->data_end;
+	struct ethhdr *eth = data;
+	struct iphdr *iph = data + sizeof(*eth);
+	struct tcphdr *tcp;
+
+	if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*tcp) > data_end)
+		return 0;
+
+	if (eth->h_proto == htons(ETH_P_IP) &&
+	    iph->version == 4) {
+		if(iph->protocol == IPPROTO_TCP &&
+		   iph->ihl == 5 &&
+		   data + sizeof(*eth) + 20 + sizeof(*tcp) <= data_end) {
+			tcp = data + sizeof(*eth) + 20;
+			if (tcp->ack &&
+			    htons(iph->tot_len) == 20 + tcp->doff*4)
+				return -1;
+		}
+	} else if (eth->h_proto == htons(ETH_P_IPV6) &&
+		   iph->version == 6) {
+		struct ipv6hdr *iph6 = (struct ipv6hdr *) iph;
+		if(iph6->nexthdr == IPPROTO_TCP &&
+		   data + sizeof(*eth) + 40 + sizeof(*tcp) <= data_end ) {
+			tcp = data + sizeof(*eth) + 40;
+			if (tcp->ack &&
+			    tcp->doff*4 == htons(iph6->payload_len))
+				return -1;
+		}
+	}
+
+	return 0;
+}
+
+/* Example: Move acks into a priority queue:
+
+IFACE=eth0
+tc qdisc del dev $IFACE root 2> /dev/null
+tc qdisc add dev $IFACE root handle 1: prio bands 3 \
+	priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
+tc qdisc add dev $IFACE parent 1:1 handle 10:1 sfq headdrop # acks only
+tc qdisc add dev $IFACE parent 1:2 handle 20:1 fq_codel # all other traffic
+tc qdisc add dev $IFACE parent 1:3 handle 30:1 fq_codel # unused
+tc filter add dev $IFACE parent 1: prio 1 bpf \
+	object-file ack_recognize.o flowid 1:1
+
+Please note that a strict priority queue is not a good idea (drr would be
+better), nor is doing any level of prioritization on acks at all....
+*/
+
+BPF_LICENSE("GPL");
-- 
2.7.4

^ permalink raw reply related

* Re: [Cake] [PATCH net-next v6] Add Common Applications Kept Enhanced (cake) qdisc
From: Jonathan Morton @ 2018-05-01 18:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Dave Taht, Cong Wang, Cake List, Linux Kernel Network Developers
In-Reply-To: <dff5ba09-1b81-7cd8-4be3-e7a419797d8e@gmail.com>

> On 1 May, 2018, at 7:06 pm, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> You have not provided any numbers to show how useful it is to maintain this
> code (probably still broken btw, considering it is changing some skb attributes).

A simple calculation shows what the maximum tolerable asymmetry for a conventional TCP connection is.  If the MTU is 1500, a pure ack is 84 bytes, and an ack is sent for every 3 packets received, bandwidth asymmetry exceeding about 50:1 will inevitably result in the inability to fully utilise downstream bandwidth.  This ratio drops to 34:1 if acks are sent every two data packets, and 17:1 for the RFC-mandated "ack every packet" behaviour when loss is detected.

Physical asymmetry ratios exceeding 20:1 are not rare.  They are increased still further if reverse traffic is present; given a nominal 10:1 asymmetry, interleaving one ack with one MTU packet (as SFQ would) gives an *effective* asymmetry for the downstream flow of 188:1, and seriously affects downstream goodput.

High asymmetry ratios can be tolerated by TCPs implementing sparser acks than 1:3 ratios, as proposed in AckCC.  Without AckCC however, I understand a strict reading of the RFCs prohibits TCPs from going beyond 1:3.  Even if Linux TCPs do so anyway, billions of Windows, Mac and mobile hosts most likely will not.  This makes a solely end-to-end solution impractical for the time being, with no obvious hope of improvement.

To maintain downstream goodput under these conditions, it is either necessary to send bursts of acks between the upstream traffic (which is slightly wasteful of bandwidth and also upsets RTT-sensitive TCPs like BBR) - which Cake will do quite happily if the ack-filter is disabled - or to delete some of the acks.  When there are many upstream flows competing with a single ack flow, the latter is the only reasonable solution unless acks are hard-prioritised (which has negative effects of its own, so we didn't consider it).

Middleboxes *are* permitted to drop packets, and AQM routinely does so.  We found that AQM could delete acks beneficially, but that it ramped up too slowly to really solve the problem (acks being individually small, much larger numbers of them must be dropped once they become a saturating flow, compared to a data flow).  Also, ack flows are often *antiresponsive*, in that dropping some of them causes an increase in their arrival rate due to increasing the ack-clocked downstream traffic, so it should be moderately obvious that conventional AQM strategies don't apply.  We also looked at existing ack filters' behaviour and found it wanting, so we were initially reluctant to implement our own.

However, having seen many counterexamples of how *not* to write an ack filter, we were eventually able to write one that *didn't* break the rules, and ensured that the information TCP relies on remained in acks that were not deleted even when many acks were.  This includes preserving the triple-repetition rule for detecting packet loss, and preserving the tail ack signifying reception of the end of a flow - which naturally results in only dropping acks if it's worthwhile to do so.  In short, we're fully aware of the potential for breaking TCP this way, and we've done our level best to avoid it.

Testing revealed an appreciable, simultaneous improvement in both downstream and upstream goodput, and in the downstream flow's internal RTT, under appropriate traffic conditions.  A more aggressive variant, going to the edges of what might be allowed by RFCs, actually showed *less* improvement than the standard one - it interfered with TCP's behaviour too much.  We can dig up the data if required.

> Also on wifi, the queue builds in the driver queues anyway, not in the qdisc.
> So ACK filtering, if _really_ successful, would need to be modularized.

Agree that the wifi queues would also benefit from ack filtering.  In fact, with make-wifi-fast improvements active, installing Cake or any other qdisc on that interface should be unnecessary.  Cake's shaper can't easily adapt to variable-rate links on the timescales required, anyway.  (It could if something told it directly about the variations.)

However, I don't see a way to install the ack-filter as a separate entity in its own right, without the ability to manipulate the queue downstream of itself.  The arrival of a new ack may trigger the deletion of a previous one, never the one just arrived, yet keeping an internal queue of acks within the filter would be counterproductive.  That's why we implemented it within Cake instead of as a separate thing.

It may be possible to modularise it sufficiently that qdiscs could add support for ack-filtering without duplicating the code, rather than the other way around.  This would also allow wifi queues to be modified the same way.  Would that be acceptable?

 - Jonathan Morton

^ permalink raw reply

* [PATCH net-next v7 3/3] cxgb4: collect hardware dump in second kernel
From: Rahul Lakkireddy @ 2018-05-01 18:27 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA,
	kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: indranil-ut6Up61K2wZBDgjK7y7TUQ, nirranjan-ut6Up61K2wZBDgjK7y7TUQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	ganeshgr-ut6Up61K2wZBDgjK7y7TUQ, Rahul Lakkireddy,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn
In-Reply-To: <cover.1525197407.git.rahul.lakkireddy-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>

Register callback to collect hardware/firmware dumps in second kernel
before hardware/firmware is initialized. The dumps for each device
will be available as elf notes in /proc/vmcore in second kernel.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
Signed-off-by: Ganesh Goudar <ganeshgr-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
---
v7:
- Removed "CHELSIO" vendor identifier in Elf Note name.

v6:
- Added "CHELSIO" string as vendor identifier in the Elf Note name.

v5:
- No changes.

v4:
- No changes.

v3:
- Replaced all crashdd* with vmcoredd*.
- Replaced crashdd_add_dump() with vmcore_add_device_dump().
- Updated comments and commit message.

v2:
- No Changes.

Changes since rfc v2:
- Update comments and commit message for sysfs change.

rfc v2:
- Updated dump registration to the new API in patch 1.

 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h       |  4 ++++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c | 25 ++++++++++++++++++++++++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h |  3 +++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c  | 10 ++++++++++
 4 files changed, 42 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 688f95440af2..01e7aad4ce5b 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -50,6 +50,7 @@
 #include <linux/net_tstamp.h>
 #include <linux/ptp_clock_kernel.h>
 #include <linux/ptp_classify.h>
+#include <linux/crash_dump.h>
 #include <asm/io.h>
 #include "t4_chip_type.h"
 #include "cxgb4_uld.h"
@@ -964,6 +965,9 @@ struct adapter {
 	struct hma_data hma;
 
 	struct srq_data *srq;
+
+	/* Dump buffer for collecting logs in kdump kernel */
+	struct vmcoredd_data vmcoredd;
 };
 
 /* Support for "sched-class" command to allow a TX Scheduling Class to be
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
index 143686c60234..085691eb2b95 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
@@ -488,3 +488,28 @@ void cxgb4_init_ethtool_dump(struct adapter *adapter)
 	adapter->eth_dump.version = adapter->params.fw_vers;
 	adapter->eth_dump.len = 0;
 }
+
+static int cxgb4_cudbg_vmcoredd_collect(struct vmcoredd_data *data, void *buf)
+{
+	struct adapter *adap = container_of(data, struct adapter, vmcoredd);
+	u32 len = data->size;
+
+	return cxgb4_cudbg_collect(adap, buf, &len, CXGB4_ETH_DUMP_ALL);
+}
+
+int cxgb4_cudbg_vmcore_add_dump(struct adapter *adap)
+{
+	struct vmcoredd_data *data = &adap->vmcoredd;
+	u32 len;
+
+	len = sizeof(struct cudbg_hdr) +
+	      sizeof(struct cudbg_entity_hdr) * CUDBG_MAX_ENTITY;
+	len += CUDBG_DUMP_BUFF_SIZE;
+
+	data->size = len;
+	snprintf(data->dump_name, sizeof(data->dump_name), "%s_%s",
+		 cxgb4_driver_name, adap->name);
+	data->vmcoredd_callback = cxgb4_cudbg_vmcoredd_collect;
+
+	return vmcore_add_device_dump(data);
+}
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h
index ce1ac9a1c878..ef59ba1ed968 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h
@@ -41,8 +41,11 @@ enum CXGB4_ETHTOOL_DUMP_FLAGS {
 	CXGB4_ETH_DUMP_HW = (1 << 1), /* various FW and HW dumps */
 };
 
+#define CXGB4_ETH_DUMP_ALL (CXGB4_ETH_DUMP_MEM | CXGB4_ETH_DUMP_HW)
+
 u32 cxgb4_get_dump_length(struct adapter *adap, u32 flag);
 int cxgb4_cudbg_collect(struct adapter *adap, void *buf, u32 *buf_size,
 			u32 flag);
 void cxgb4_init_ethtool_dump(struct adapter *adapter);
+int cxgb4_cudbg_vmcore_add_dump(struct adapter *adap);
 #endif /* __CXGB4_CUDBG_H__ */
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 24d2865b8806..32cad0acf76c 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -5544,6 +5544,16 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (err)
 		goto out_free_adapter;
 
+	if (is_kdump_kernel()) {
+		/* Collect hardware state and append to /proc/vmcore */
+		err = cxgb4_cudbg_vmcore_add_dump(adapter);
+		if (err) {
+			dev_warn(adapter->pdev_dev,
+				 "Fail collecting vmcore device dump, err: %d. Continuing\n",
+				 err);
+			err = 0;
+		}
+	}
 
 	if (!is_t4(adapter->params.chip)) {
 		s_qpp = (QUEUESPERPAGEPF0_S +
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next v7 2/3] vmcore: append device dumps to vmcore as elf notes
From: Rahul Lakkireddy @ 2018-05-01 18:27 UTC (permalink / raw)
  To: netdev, kexec, linux-fsdevel, linux-kernel
  Cc: davem, viro, ebiederm, stephen, akpm, torvalds, ganeshgr,
	nirranjan, indranil, Rahul Lakkireddy
In-Reply-To: <cover.1525197407.git.rahul.lakkireddy@chelsio.com>

Update read and mmap logic to append device dumps as additional notes
before the other elf notes. We add device dumps before other elf notes
because the other elf notes may not fill the elf notes buffer
completely and we will end up with zero-filled data between the elf
notes and the device dumps. Tools will then try to decode this
zero-filled data as valid notes and we don't want that. Hence, adding
device dumps before the other elf notes ensure that zero-filled data
can be avoided. This also ensures that the device dumps and the
other elf notes can be properly mmaped at page aligned address.

Incorporate device dump size into the total vmcore size. Also update
offsets for other program headers after the device dumps are added.

Suggested-by: Eric Biederman <ebiederm@xmission.com>.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
---
v7:
- No changes.

v6:
- No changes.

v5:
- No changes.

v4:
- No changes.

v3:
- Patch added in this version.
- Exported dumps as elf notes. Suggested by Eric Biederman
  <ebiederm@xmission.com>.

 fs/proc/vmcore.c | 247 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 243 insertions(+), 4 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 60fce8dfe4e3..e06c9b942c22 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -39,6 +39,8 @@ static size_t elfcorebuf_sz_orig;
 
 static char *elfnotes_buf;
 static size_t elfnotes_sz;
+/* Size of all notes minus the device dump notes */
+static size_t elfnotes_orig_sz;
 
 /* Total size of vmcore file. */
 static u64 vmcore_size;
@@ -51,6 +53,9 @@ static LIST_HEAD(vmcoredd_list);
 static DEFINE_MUTEX(vmcoredd_mutex);
 #endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
 
+/* Device Dump Size */
+static size_t vmcoredd_orig_sz;
+
 /*
  * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error
  * The called function has to take care of module refcounting.
@@ -185,6 +190,77 @@ static int copy_to(void *target, void *src, size_t size, int userbuf)
 	return 0;
 }
 
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+static int vmcoredd_copy_dumps(void *dst, u64 start, size_t size, int userbuf)
+{
+	struct vmcoredd_node *dump;
+	u64 offset = 0;
+	int ret = 0;
+	size_t tsz;
+	char *buf;
+
+	mutex_lock(&vmcoredd_mutex);
+	list_for_each_entry(dump, &vmcoredd_list, list) {
+		if (start < offset + dump->size) {
+			tsz = min(offset + (u64)dump->size - start, (u64)size);
+			buf = dump->buf + start - offset;
+			if (copy_to(dst, buf, tsz, userbuf)) {
+				ret = -EFAULT;
+				goto out_unlock;
+			}
+
+			size -= tsz;
+			start += tsz;
+			dst += tsz;
+
+			/* Leave now if buffer filled already */
+			if (!size)
+				goto out_unlock;
+		}
+		offset += dump->size;
+	}
+
+out_unlock:
+	mutex_unlock(&vmcoredd_mutex);
+	return ret;
+}
+
+static int vmcoredd_mmap_dumps(struct vm_area_struct *vma, unsigned long dst,
+			       u64 start, size_t size)
+{
+	struct vmcoredd_node *dump;
+	u64 offset = 0;
+	int ret = 0;
+	size_t tsz;
+	char *buf;
+
+	mutex_lock(&vmcoredd_mutex);
+	list_for_each_entry(dump, &vmcoredd_list, list) {
+		if (start < offset + dump->size) {
+			tsz = min(offset + (u64)dump->size - start, (u64)size);
+			buf = dump->buf + start - offset;
+			if (remap_vmalloc_range_partial(vma, dst, buf, tsz)) {
+				ret = -EFAULT;
+				goto out_unlock;
+			}
+
+			size -= tsz;
+			start += tsz;
+			dst += tsz;
+
+			/* Leave now if buffer filled already */
+			if (!size)
+				goto out_unlock;
+		}
+		offset += dump->size;
+	}
+
+out_unlock:
+	mutex_unlock(&vmcoredd_mutex);
+	return ret;
+}
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
 /* Read from the ELF header and then the crash dump. On error, negative value is
  * returned otherwise number of bytes read are returned.
  */
@@ -222,10 +298,41 @@ static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
 	if (*fpos < elfcorebuf_sz + elfnotes_sz) {
 		void *kaddr;
 
+		/* We add device dumps before other elf notes because the
+		 * other elf notes may not fill the elf notes buffer
+		 * completely and we will end up with zero-filled data
+		 * between the elf notes and the device dumps. Tools will
+		 * then try to decode this zero-filled data as valid notes
+		 * and we don't want that. Hence, adding device dumps before
+		 * the other elf notes ensure that zero-filled data can be
+		 * avoided.
+		 */
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+		/* Read device dumps */
+		if (*fpos < elfcorebuf_sz + vmcoredd_orig_sz) {
+			tsz = min(elfcorebuf_sz + vmcoredd_orig_sz -
+				  (size_t)*fpos, buflen);
+			start = *fpos - elfcorebuf_sz;
+			if (vmcoredd_copy_dumps(buffer, start, tsz, userbuf))
+				return -EFAULT;
+
+			buflen -= tsz;
+			*fpos += tsz;
+			buffer += tsz;
+			acc += tsz;
+
+			/* leave now if filled buffer already */
+			if (!buflen)
+				return acc;
+		}
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
+		/* Read remaining elf notes */
 		tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)*fpos, buflen);
-		kaddr = elfnotes_buf + *fpos - elfcorebuf_sz;
+		kaddr = elfnotes_buf + *fpos - elfcorebuf_sz - vmcoredd_orig_sz;
 		if (copy_to(buffer, kaddr, tsz, userbuf))
 			return -EFAULT;
+
 		buflen -= tsz;
 		*fpos += tsz;
 		buffer += tsz;
@@ -451,11 +558,46 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
 	if (start < elfcorebuf_sz + elfnotes_sz) {
 		void *kaddr;
 
+		/* We add device dumps before other elf notes because the
+		 * other elf notes may not fill the elf notes buffer
+		 * completely and we will end up with zero-filled data
+		 * between the elf notes and the device dumps. Tools will
+		 * then try to decode this zero-filled data as valid notes
+		 * and we don't want that. Hence, adding device dumps before
+		 * the other elf notes ensure that zero-filled data can be
+		 * avoided. This also ensures that the device dumps and
+		 * other elf notes can be properly mmaped at page aligned
+		 * address.
+		 */
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+		/* Read device dumps */
+		if (start < elfcorebuf_sz + vmcoredd_orig_sz) {
+			u64 start_off;
+
+			tsz = min(elfcorebuf_sz + vmcoredd_orig_sz -
+				  (size_t)start, size);
+			start_off = start - elfcorebuf_sz;
+			if (vmcoredd_mmap_dumps(vma, vma->vm_start + len,
+						start_off, tsz))
+				goto fail;
+
+			size -= tsz;
+			start += tsz;
+			len += tsz;
+
+			/* leave now if filled buffer already */
+			if (!size)
+				return 0;
+		}
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
+		/* Read remaining elf notes */
 		tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)start, size);
-		kaddr = elfnotes_buf + start - elfcorebuf_sz;
+		kaddr = elfnotes_buf + start - elfcorebuf_sz - vmcoredd_orig_sz;
 		if (remap_vmalloc_range_partial(vma, vma->vm_start + len,
 						kaddr, tsz))
 			goto fail;
+
 		size -= tsz;
 		start += tsz;
 		len += tsz;
@@ -703,6 +845,11 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
 	/* Modify e_phnum to reflect merged headers. */
 	ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
 
+	/* Store the size of all notes.  We need this to update the note
+	 * header when the device dumps will be added.
+	 */
+	elfnotes_orig_sz = phdr.p_memsz;
+
 	return 0;
 }
 
@@ -889,6 +1036,11 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
 	/* Modify e_phnum to reflect merged headers. */
 	ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
 
+	/* Store the size of all notes.  We need this to update the note
+	 * header when the device dumps will be added.
+	 */
+	elfnotes_orig_sz = phdr.p_memsz;
+
 	return 0;
 }
 
@@ -981,8 +1133,8 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
 }
 
 /* Sets offset fields of vmcore elements. */
-static void __init set_vmcore_list_offsets(size_t elfsz, size_t elfnotes_sz,
-					   struct list_head *vc_list)
+static void set_vmcore_list_offsets(size_t elfsz, size_t elfnotes_sz,
+				    struct list_head *vc_list)
 {
 	loff_t vmcore_off;
 	struct vmcore *m;
@@ -1174,6 +1326,92 @@ static void vmcoredd_write_header(void *buf, struct vmcoredd_data *data,
 	memcpy(vdd_hdr->dump_name, data->dump_name, sizeof(vdd_hdr->dump_name));
 }
 
+/**
+ * vmcoredd_update_program_headers - Update all Elf program headers
+ * @elfptr: Pointer to elf header
+ * @elfnotesz: Size of elf notes aligned to page size
+ * @vmcoreddsz: Size of device dumps to be added to elf note header
+ *
+ * Determine type of Elf header (Elf64 or Elf32) and update the elf note size.
+ * Also update the offsets of all the program headers after the elf note header.
+ */
+static void vmcoredd_update_program_headers(char *elfptr, size_t elfnotesz,
+					    size_t vmcoreddsz)
+{
+	unsigned char *e_ident = (unsigned char *)elfptr;
+	u64 start, end, size;
+	loff_t vmcore_off;
+	u32 i;
+
+	vmcore_off = elfcorebuf_sz + elfnotesz;
+
+	if (e_ident[EI_CLASS] == ELFCLASS64) {
+		Elf64_Ehdr *ehdr = (Elf64_Ehdr *)elfptr;
+		Elf64_Phdr *phdr = (Elf64_Phdr *)(elfptr + sizeof(Elf64_Ehdr));
+
+		/* Update all program headers */
+		for (i = 0; i < ehdr->e_phnum; i++, phdr++) {
+			if (phdr->p_type == PT_NOTE) {
+				/* Update note size */
+				phdr->p_memsz = elfnotes_orig_sz + vmcoreddsz;
+				phdr->p_filesz = phdr->p_memsz;
+				continue;
+			}
+
+			start = rounddown(phdr->p_offset, PAGE_SIZE);
+			end = roundup(phdr->p_offset + phdr->p_memsz,
+				      PAGE_SIZE);
+			size = end - start;
+			phdr->p_offset = vmcore_off + (phdr->p_offset - start);
+			vmcore_off += size;
+		}
+	} else {
+		Elf32_Ehdr *ehdr = (Elf32_Ehdr *)elfptr;
+		Elf32_Phdr *phdr = (Elf32_Phdr *)(elfptr + sizeof(Elf32_Ehdr));
+
+		/* Update all program headers */
+		for (i = 0; i < ehdr->e_phnum; i++, phdr++) {
+			if (phdr->p_type == PT_NOTE) {
+				/* Update note size */
+				phdr->p_memsz = elfnotes_orig_sz + vmcoreddsz;
+				phdr->p_filesz = phdr->p_memsz;
+				continue;
+			}
+
+			start = rounddown(phdr->p_offset, PAGE_SIZE);
+			end = roundup(phdr->p_offset + phdr->p_memsz,
+				      PAGE_SIZE);
+			size = end - start;
+			phdr->p_offset = vmcore_off + (phdr->p_offset - start);
+			vmcore_off += size;
+		}
+	}
+}
+
+/**
+ * vmcoredd_update_size - Update the total size of the device dumps and update
+ * Elf header
+ * @dump_size: Size of the current device dump to be added to total size
+ *
+ * Update the total size of all the device dumps and update the Elf program
+ * headers. Calculate the new offsets for the vmcore list and update the
+ * total vmcore size.
+ */
+static void vmcoredd_update_size(size_t dump_size)
+{
+	vmcoredd_orig_sz += dump_size;
+	elfnotes_sz = roundup(elfnotes_orig_sz, PAGE_SIZE) + vmcoredd_orig_sz;
+	vmcoredd_update_program_headers(elfcorebuf, elfnotes_sz,
+					vmcoredd_orig_sz);
+
+	/* Update vmcore list offsets */
+	set_vmcore_list_offsets(elfcorebuf_sz, elfnotes_sz, &vmcore_list);
+
+	vmcore_size = get_vmcore_size(elfcorebuf_sz, elfnotes_sz,
+				      &vmcore_list);
+	proc_vmcore->size = vmcore_size;
+}
+
 /**
  * vmcore_add_device_dump - Add a buffer containing device dump to vmcore
  * @data: dump info.
@@ -1227,6 +1465,7 @@ static int __vmcore_add_device_dump(struct vmcoredd_data *data)
 	list_add_tail(&dump->list, &vmcoredd_list);
 	mutex_unlock(&vmcoredd_mutex);
 
+	vmcoredd_update_size(data_size);
 	return 0;
 
 out_err:
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next v7 1/3] vmcore: add API to collect hardware dump in second kernel
From: Rahul Lakkireddy @ 2018-05-01 18:27 UTC (permalink / raw)
  To: netdev, kexec, linux-fsdevel, linux-kernel
  Cc: davem, viro, ebiederm, stephen, akpm, torvalds, ganeshgr,
	nirranjan, indranil, Rahul Lakkireddy
In-Reply-To: <cover.1525197407.git.rahul.lakkireddy@chelsio.com>

The sequence of actions done by device drivers to append their device
specific hardware/firmware logs to /proc/vmcore are as follows:

1. During probe (before hardware is initialized), device drivers
register to the vmcore module (via vmcore_add_device_dump()), with
callback function, along with buffer size and log name needed for
firmware/hardware log collection.

2. vmcore module allocates the buffer with requested size. It adds
an Elf note and invokes the device driver's registered callback
function.

3. Device driver collects all hardware/firmware logs into the buffer
and returns control back to vmcore module.

Ensure that the device dump buffer size is always aligned to page size
so that it can be mmaped.

Also, rename alloc_elfnotes_buf() to vmcore_alloc_buf() to make it more
generic and reserve NT_VMCOREDD note type to indicate vmcore device
dump.

Suggested-by: Eric Biederman <ebiederm@xmission.com>.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
---
v7:
- Removed "CHELSIO" vendor identifier in Elf Note name. Instead,
  writing "LINUX".
- Moved vmcoredd_header to new file include/uapi/linux/vmcore.h
- Reworked vmcoredd_header to include Elf Note as part of the header
  itself.
- Removed vmcoredd_get_note_size().
- Renamed vmcoredd_write_note() to vmcoredd_write_header().
- Replaced all "unsigned long" with "unsigned int" for device dump
  size since max size of Elf Word is u32.

v6:
- Reworked device dump elf note name to contain vendor identifier.
- Added vmcoredd_header that precedes actual dump in the Elf Note.
- Device dump's name is moved inside vmcoredd_header.

v5:
- Removed enabling CONFIG_PROC_VMCORE_DEVICE_DUMP by default and
  updated help message to indicate that the driver must be present
  in second kernel's initramfs to collect the underlying device
  snapshot.

v4:
- Made __vmcore_add_device_dump() static.
- Moved compile check to define vmcore_add_device_dump() to
  crash_dump.h to fix compilation when vmcore.c is not compiled in.
- Convert ---help--- to help in Kconfig as indicated by checkpatch.
- Rebased to tip.

v3:
- Dropped sysfs crashdd module.
- Added CONFIG_PROC_VMCORE_DEVICE_DUMP to allow configuring device
  dump support.
- Moved logic related to adding dumps from crashdd to vmcore module.
- Rename all crashdd* to vmcoredd*.

v2:
- Added ABI Documentation for crashdd.
- Directly use octal permission instead of macro.

Changes since rfc v2:
- Moved exporting crashdd from procfs to sysfs.  Suggested by
  Stephen Hemminger <stephen@networkplumber.org>
- Moved code from fs/proc/crashdd.c to fs/crashdd/ directory.
- Replaced all proc API with sysfs API and updated comments.
- Calling driver callback before creating the binary file under
  crashdd sysfs.
- Changed binary dump file permission from S_IRUSR to S_IRUGO.
- Changed module name from CRASH_DRIVER_DUMP to CRASH_DEVICE_DUMP.

rfc v2:
- Collecting logs in 2nd kernel instead of during kernel panic.
  Suggested by Eric Biederman <ebiederm@xmission.com>.
- Patch added in this series.

 fs/proc/Kconfig             |  15 +++++
 fs/proc/vmcore.c            | 140 +++++++++++++++++++++++++++++++++++++++++---
 include/linux/crash_dump.h  |  18 ++++++
 include/linux/kcore.h       |   6 ++
 include/uapi/linux/elf.h    |   1 +
 include/uapi/linux/vmcore.h |  16 +++++
 6 files changed, 187 insertions(+), 9 deletions(-)
 create mode 100644 include/uapi/linux/vmcore.h

diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 1ade1206bb89..0eaeb41453f5 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -43,6 +43,21 @@ config PROC_VMCORE
         help
         Exports the dump image of crashed kernel in ELF format.
 
+config PROC_VMCORE_DEVICE_DUMP
+	bool "Device Hardware/Firmware Log Collection"
+	depends on PROC_VMCORE
+	default n
+	help
+	  After kernel panic, device drivers can collect the device
+	  specific snapshot of their hardware or firmware before the
+	  underlying devices are initialized in crash recovery kernel.
+	  Note that the device driver must be present in the crash
+	  recovery kernel's initramfs to collect its underlying device
+	  snapshot.
+
+	  If you say Y here, the collected device dumps will be added
+	  as ELF notes to /proc/vmcore.
+
 config PROC_SYSCTL
 	bool "Sysctl support (/proc/sys)" if EXPERT
 	depends on PROC_FS
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index a45f0af22a60..60fce8dfe4e3 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -20,6 +20,7 @@
 #include <linux/init.h>
 #include <linux/crash_dump.h>
 #include <linux/list.h>
+#include <linux/mutex.h>
 #include <linux/vmalloc.h>
 #include <linux/pagemap.h>
 #include <linux/uaccess.h>
@@ -44,6 +45,12 @@ static u64 vmcore_size;
 
 static struct proc_dir_entry *proc_vmcore;
 
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+/* Device Dump list and mutex to synchronize access to list */
+static LIST_HEAD(vmcoredd_list);
+static DEFINE_MUTEX(vmcoredd_mutex);
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
 /*
  * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error
  * The called function has to take care of module refcounting.
@@ -302,10 +309,8 @@ static const struct vm_operations_struct vmcore_mmap_ops = {
 };
 
 /**
- * alloc_elfnotes_buf - allocate buffer for ELF note segment in
- *                      vmalloc memory
- *
- * @notes_sz: size of buffer
+ * vmcore_alloc_buf - allocate buffer in vmalloc memory
+ * @sizez: size of buffer
  *
  * If CONFIG_MMU is defined, use vmalloc_user() to allow users to mmap
  * the buffer to user-space by means of remap_vmalloc_range().
@@ -313,12 +318,12 @@ static const struct vm_operations_struct vmcore_mmap_ops = {
  * If CONFIG_MMU is not defined, use vzalloc() since mmap_vmcore() is
  * disabled and there's no need to allow users to mmap the buffer.
  */
-static inline char *alloc_elfnotes_buf(size_t notes_sz)
+static inline char *vmcore_alloc_buf(size_t size)
 {
 #ifdef CONFIG_MMU
-	return vmalloc_user(notes_sz);
+	return vmalloc_user(size);
 #else
-	return vzalloc(notes_sz);
+	return vzalloc(size);
 #endif
 }
 
@@ -665,7 +670,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
 		return rc;
 
 	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
-	*notes_buf = alloc_elfnotes_buf(*notes_sz);
+	*notes_buf = vmcore_alloc_buf(*notes_sz);
 	if (!*notes_buf)
 		return -ENOMEM;
 
@@ -851,7 +856,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
 		return rc;
 
 	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
-	*notes_buf = alloc_elfnotes_buf(*notes_sz);
+	*notes_buf = vmcore_alloc_buf(*notes_sz);
 	if (!*notes_buf)
 		return -ENOMEM;
 
@@ -1145,6 +1150,120 @@ static int __init parse_crash_elf_headers(void)
 	return 0;
 }
 
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+/**
+ * vmcoredd_write_header - Write vmcore device dump header at the
+ * beginning of the dump's buffer.
+ * @buf: Output buffer where the note is written
+ * @data: Dump info
+ * @size: Size of the dump
+ *
+ * Fills beginning of the dump's buffer with vmcore device dump header.
+ */
+static void vmcoredd_write_header(void *buf, struct vmcoredd_data *data,
+				  u32 size)
+{
+	struct vmcoredd_header *vdd_hdr = (struct vmcoredd_header *)buf;
+
+	vdd_hdr->n_namesz = sizeof(vdd_hdr->name);
+	vdd_hdr->n_descsz = size + sizeof(vdd_hdr->dump_name);
+	vdd_hdr->n_type = NT_VMCOREDD;
+
+	strncpy((char *)vdd_hdr->name, VMCOREDD_NOTE_NAME,
+		sizeof(vdd_hdr->name));
+	memcpy(vdd_hdr->dump_name, data->dump_name, sizeof(vdd_hdr->dump_name));
+}
+
+/**
+ * vmcore_add_device_dump - Add a buffer containing device dump to vmcore
+ * @data: dump info.
+ *
+ * Allocate a buffer and invoke the calling driver's dump collect routine.
+ * Write Elf note at the beginning of the buffer to indicate vmcore device
+ * dump and add the dump to global list.
+ */
+static int __vmcore_add_device_dump(struct vmcoredd_data *data)
+{
+	struct vmcoredd_node *dump;
+	void *buf = NULL;
+	size_t data_size;
+	int ret;
+
+	if (!data || !strlen(data->dump_name) ||
+	    !data->vmcoredd_callback || !data->size)
+		return -EINVAL;
+
+	dump = vzalloc(sizeof(*dump));
+	if (!dump) {
+		ret = -ENOMEM;
+		goto out_err;
+	}
+
+	/* Keep size of the buffer page aligned so that it can be mmaped */
+	data_size = roundup(sizeof(struct vmcoredd_header) + data->size,
+			    PAGE_SIZE);
+
+	/* Allocate buffer for driver's to write their dumps */
+	buf = vmcore_alloc_buf(data_size);
+	if (!buf) {
+		ret = -ENOMEM;
+		goto out_err;
+	}
+
+	vmcoredd_write_header(buf, data, data_size -
+			      sizeof(struct vmcoredd_header));
+
+	/* Invoke the driver's dump collection routing */
+	ret = data->vmcoredd_callback(data, buf +
+				      sizeof(struct vmcoredd_header));
+	if (ret)
+		goto out_err;
+
+	dump->buf = buf;
+	dump->size = data_size;
+
+	/* Add the dump to driver sysfs list */
+	mutex_lock(&vmcoredd_mutex);
+	list_add_tail(&dump->list, &vmcoredd_list);
+	mutex_unlock(&vmcoredd_mutex);
+
+	return 0;
+
+out_err:
+	if (buf)
+		vfree(buf);
+
+	if (dump)
+		vfree(dump);
+
+	return ret;
+}
+
+int vmcore_add_device_dump(struct vmcoredd_data *data)
+{
+	return __vmcore_add_device_dump(data);
+}
+EXPORT_SYMBOL(vmcore_add_device_dump);
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
+/* Free all dumps in vmcore device dump list */
+static void vmcore_free_device_dumps(void)
+{
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+	mutex_lock(&vmcoredd_mutex);
+	while (!list_empty(&vmcoredd_list)) {
+		struct vmcoredd_node *dump;
+
+		dump = list_first_entry(&vmcoredd_list, struct vmcoredd_node,
+					list);
+		list_del(&dump->list);
+		vfree(dump->buf);
+		vfree(dump);
+	}
+	mutex_unlock(&vmcoredd_mutex);
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+}
+
 /* Init function for vmcore module. */
 static int __init vmcore_init(void)
 {
@@ -1192,4 +1311,7 @@ void vmcore_cleanup(void)
 		kfree(m);
 	}
 	free_elfcorebuf();
+
+	/* clear vmcore device dump list */
+	vmcore_free_device_dumps();
 }
diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
index f7ac2aa93269..3e4ba9d753c8 100644
--- a/include/linux/crash_dump.h
+++ b/include/linux/crash_dump.h
@@ -5,6 +5,7 @@
 #include <linux/kexec.h>
 #include <linux/proc_fs.h>
 #include <linux/elf.h>
+#include <uapi/linux/vmcore.h>
 
 #include <asm/pgtable.h> /* for pgprot_t */
 
@@ -93,4 +94,21 @@ static inline bool is_kdump_kernel(void) { return 0; }
 #endif /* CONFIG_CRASH_DUMP */
 
 extern unsigned long saved_max_pfn;
+
+/* Device Dump information to be filled by drivers */
+struct vmcoredd_data {
+	char dump_name[VMCOREDD_MAX_NAME_BYTES]; /* Unique name of the dump */
+	unsigned int size;                       /* Size of the dump */
+	/* Driver's registered callback to be invoked to collect dump */
+	int (*vmcoredd_callback)(struct vmcoredd_data *data, void *buf);
+};
+
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+int vmcore_add_device_dump(struct vmcoredd_data *data);
+#else
+static inline int vmcore_add_device_dump(struct vmcoredd_data *data)
+{
+	return -EOPNOTSUPP;
+}
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
 #endif /* LINUX_CRASHDUMP_H */
diff --git a/include/linux/kcore.h b/include/linux/kcore.h
index 80db19d3a505..8de55e4b5ee9 100644
--- a/include/linux/kcore.h
+++ b/include/linux/kcore.h
@@ -28,6 +28,12 @@ struct vmcore {
 	loff_t offset;
 };
 
+struct vmcoredd_node {
+	struct list_head list;	/* List of dumps */
+	void *buf;		/* Buffer containing device's dump */
+	unsigned int size;	/* Size of the buffer */
+};
+
 #ifdef CONFIG_PROC_KCORE
 extern void kclist_add(struct kcore_list *, void *, size_t, int type);
 #else
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index e2535d6dcec7..4e12c423b9fe 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -421,6 +421,7 @@ typedef struct elf64_shdr {
 #define NT_ARM_SYSTEM_CALL	0x404	/* ARM system call number */
 #define NT_ARM_SVE	0x405		/* ARM Scalable Vector Extension registers */
 #define NT_ARC_V2	0x600		/* ARCv2 accumulator/extra registers */
+#define NT_VMCOREDD	0x700		/* Vmcore Device Dump Note */
 
 /* Note header in a PT_NOTE section */
 typedef struct elf32_note {
diff --git a/include/uapi/linux/vmcore.h b/include/uapi/linux/vmcore.h
new file mode 100644
index 000000000000..f9eab9a37370
--- /dev/null
+++ b/include/uapi/linux/vmcore.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _UAPI_VMCORE_H
+#define _UAPI_VMCORE_H
+
+#define VMCOREDD_NOTE_NAME "LINUX"
+#define VMCOREDD_MAX_NAME_BYTES 44
+
+struct vmcoredd_header {
+	__u32 n_namesz; /* Name size */
+	__u32 n_descsz; /* Content size */
+	__u32 n_type;   /* NT_VMCOREDD */
+	__u8 name[8];   /* LINUX\0\0\0 */
+	__u8 dump_name[VMCOREDD_MAX_NAME_BYTES]; /* Device dump's name */
+};
+
+#endif /* _UAPI_VMCORE_H */
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next v7 0/3] kernel: add support to collect hardware logs in crash recovery kernel
From: Rahul Lakkireddy @ 2018-05-01 18:27 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA,
	kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: indranil-ut6Up61K2wZBDgjK7y7TUQ, nirranjan-ut6Up61K2wZBDgjK7y7TUQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	ganeshgr-ut6Up61K2wZBDgjK7y7TUQ, Rahul Lakkireddy,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn

On production servers running variety of workloads over time, kernel
panic can happen sporadically after days or even months. It is
important to collect as much debug logs as possible to root cause
and fix the problem, that may not be easy to reproduce. Snapshot of
underlying hardware/firmware state (like register dump, firmware
logs, adapter memory, etc.), at the time of kernel panic will be very
helpful while debugging the culprit device driver.

This series of patches add new generic framework that enable device
drivers to collect device specific snapshot of the hardware/firmware
state of the underlying device in the crash recovery kernel. In crash
recovery kernel, the collected logs are added as elf notes to
/proc/vmcore, which is copied by user space scripts for post-analysis.

The sequence of actions done by device drivers to append their device
specific hardware/firmware logs to /proc/vmcore are as follows:

1. During probe (before hardware is initialized), device drivers
register to the vmcore module (via vmcore_add_device_dump()), with
callback function, along with buffer size and log name needed for
firmware/hardware log collection.

2. vmcore module allocates the buffer with requested size. It adds
an elf note and invokes the device driver's registered callback
function.

3. Device driver collects all hardware/firmware logs into the buffer
and returns control back to vmcore module.

The device specific hardware/firmware logs can be seen as elf notes
with note type 0x700, as shown below:

# readelf -n /proc/vmcore

Displaying notes found at file offset 0x00001000 with length 0x040032c0:
  Owner                 Data size	Description
  LINUX                0x02000fec	Unknown note type: (0x00000700)
  LINUX                0x02000fec	Unknown note type: (0x00000700)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  VMCOREINFO           0x00000785	Unknown note type: (0x00000000)

Patch 1 adds API to vmcore module to allow drivers to register callback
to collect the device specific hardware/firmware logs.  The logs will
be added to /proc/vmcore as elf notes.

Patch 2 updates read and mmap logic to append device specific hardware/
firmware logs as elf notes.

Patch 3 shows a cxgb4 driver example using the API to collect
hardware/firmware logs in crash recovery kernel, before hardware is
initialized.

Thanks,
Rahul

---
v7:
- Removed "CHELSIO" vendor identifier in Elf Note name. Instead,
  writing "LINUX".
- Moved vmcoredd_header to new file include/uapi/linux/vmcore.h
- Reworked vmcoredd_header to include Elf Note as part of the header
  itself.
- Removed vmcoredd_get_note_size().
- Renamed vmcoredd_write_note() to vmcoredd_write_header().
- Replaced all "unsigned long" with "unsigned int" for device dump
  size since max size of Elf Word is u32.

v6:
- Reworked device dump elf note name to contain vendor identifier.
- Added vmcoredd_header that precedes actual dump in the Elf Note.
- Device dump's name is moved inside vmcoredd_header.
- Added "CHELSIO" string as vendor identifier in the Elf Note name
  for cxgb4 device dumps.

v5:
- Removed enabling CONFIG_PROC_VMCORE_DEVICE_DUMP by default and
  updated help message.

v4:
- Made __vmcore_add_device_dump() static.
- Moved compile check to define vmcore_add_device_dump() to
  crash_dump.h to fix compilation when vmcore.c is not compiled in.
- Convert ---help--- to help in Kconfig as indicated by checkpatch.
- Rebased to tip.

v3:
- Dropped sysfs crashdd module.
- Exported dumps as elf notes. Suggested by Eric Biederman
  <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>.  Added as patch 2 in this version.
- Added CONFIG_PROC_VMCORE_DEVICE_DUMP to allow configuring device
  dump support.
- Moved logic related to adding dumps from crashdd to vmcore module.
- Rename all crashdd* to vmcoredd*.
- Updated comments.

v2:
- Added ABI Documentation for crashdd.
- Directly use octal permission instead of macro.

Changes since rfc v2:
- Moved exporting crashdd from procfs to sysfs. Suggested by
  Stephen Hemminger <stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ@public.gmane.org>
- Moved code from fs/proc/crashdd.c to fs/crashdd/ directory.
- Replaced all proc API with sysfs API and updated comments.
- Calling driver callback before creating the binary file under
  crashdd sysfs.
- Changed binary dump file permission from S_IRUSR to S_IRUGO.
- Changed module name from CRASH_DRIVER_DUMP to CRASH_DEVICE_DUMP.

rfc v2:
- Collecting logs in 2nd kernel instead of during kernel panic.
  Suggested by Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>.
- Added new crashdd module that exports /proc/crashdd/ containing
  driver's registered hardware/firmware logs in patch 1.
- Replaced the API to allow drivers to register their hardware/firmware
  log collect routine in crash recovery kernel in patch 1.
- Updated patch 2 to use the new API in patch 1.

Rahul Lakkireddy (3):
  vmcore: add API to collect hardware dump in second kernel
  vmcore: append device dumps to vmcore as elf notes
  cxgb4: collect hardware dump in second kernel

 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h       |   4 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c |  25 ++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h |   3 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c  |  10 +
 fs/proc/Kconfig                                  |  15 +
 fs/proc/vmcore.c                                 | 387 ++++++++++++++++++++++-
 include/linux/crash_dump.h                       |  18 ++
 include/linux/kcore.h                            |   6 +
 include/uapi/linux/elf.h                         |   1 +
 include/uapi/linux/vmcore.h                      |  16 +
 10 files changed, 472 insertions(+), 13 deletions(-)
 create mode 100644 include/uapi/linux/vmcore.h

-- 
2.14.1

^ permalink raw reply

* Re: [RFC net-next 0/5] Support for PHY test modes
From: Florian Fainelli @ 2018-05-01 18:27 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: David Miller, netdev, rmk, linux-kernel, cphealy, nikita.yoush,
	vivien.didelot, Nisar.Sayed, UNGLinuxDriver
In-Reply-To: <20180501174752.GA1265@lunn.ch>

On 05/01/2018 10:47 AM, Andrew Lunn wrote:
> On Tue, May 01, 2018 at 10:21:54AM -0700, Florian Fainelli wrote:
>> On 04/30/2018 04:24 PM, Andrew Lunn wrote:
>>>> Turning these tests on will typically result in the link partner
>>>> dropping the link with us, and the interface will be non-functional as
>>>> far as the data path is concerned (similar to an isolation mode). This
>>>> might warrant properly reporting that to user-space through e.g: a
>>>> private IFF_* value maybe?
>>>
>>> Hi Florian
>>>
>>> I think a IFF_* value would be a good idea. We want to give the user
>>> some indicate why they don't have working networking. ip link show
>>> showing PHY-TEST-MODE would help.
>>
>> IF_OPER_TESTING as defined in RFC 2863 looks like the correct way to
>> signal that. I did a quick test and setting operstate to
>> IFF_OPER_TESTING seems to correctly get reflected by iproute2/ifconfig
>> which no longer see RUNNING though the interface is still UP.
> 
> Hi Florian
> 
> I should really play with this.... but is the opstate printed by ip
> link show? Not showing RUNNING is not the best hint something else is
> going on. Actually saying TESTING somewhere is much clearer.

The operstate is settable and gettable from iproute2:

# ip link show gphy
4: gphy@eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
switchid 00000000 state UP mode DEFAULT group default qlen 1000

# ip link set gphy state testing

Although the kernel refuses to allow user space to set it to something
different from up or down AFAICT. With my local hack to allow setting
operstate from user-space through sysfs, we do see that iproute2
correctly show it:

# echo 4 > /sys/class/net/gphy/operstate
# ip link show gphy
4: gphy@eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
qdisc noqueue switchid 00000000 state TESTING mode DEFAULT group default
qlen 1000
    link/ether 00:10:18:de:38:1f brd ff:ff:ff:ff:ff:ff

but not from ifconfig. I was not too keen on adding a new IFF_* flag
because it usually means there was nothing else that could be done, and
it is disruptive to include/uapi/linux/if.h which I am afraid to break
(especially for non glibc systems).

Do you think that is acceptable? There are lots of things ifconfig can't
report, but at least, the state of the interface would not be "UP".
-- 
Florian

^ permalink raw reply

* Re: [PATCH RESEND] connector: add parent pid and tgid to coredump and exit events
From: David Miller @ 2018-05-01 18:26 UTC (permalink / raw)
  To: sstrogin
  Cc: zbr, netdev, linux-kernel, xe-linux-external, jderehag,
	matt.helsley
In-Reply-To: <20180430220429.14788-1-sstrogin@cisco.com>

From: Stefan Strogin <sstrogin@cisco.com>
Date: Tue,  1 May 2018 01:04:29 +0300

> The intention is to get notified of process failures as soon
> as possible, before a possible core dumping (which could be very long)
> (e.g. in some process-manager). Coredump and exit process events
> are perfect for such use cases (see 2b5faa4c553f "connector: Added
> coredumping event to the process connector").
> 
> The problem is that for now the process-manager cannot know the parent
> of a dying process using connectors. This could be useful if the
> process-manager should monitor for failures only children of certain
> parents, so we could filter the coredump and exit events by parent
> process and/or thread ID.
> 
> Add parent pid and tgid to coredump and exit process connectors event
> data.
> 
> Signed-off-by: Stefan Strogin <sstrogin@cisco.com>
> Acked-by: Evgeniy Polyakov <zbr@ioremap.net>

Applied to net-next, thank you.

^ permalink raw reply

* Re: [RFC v2 bpf-next 5/9] net/ipv6: Add fib6_lookup
From: David Ahern @ 2018-05-01 18:25 UTC (permalink / raw)
  To: Vincent Bernat
  Cc: netdev, borkmann, ast, davem, shm, roopa, brouer, toke,
	john.fastabend
In-Reply-To: <m3wownckdr.fsf@luffy.cx>

On 5/1/18 12:15 PM, Vincent Bernat wrote:
>  ❦ 29 avril 2018 11:07 -0700, David Ahern <dsahern@gmail.com> :
> 
>> +struct fib6_info *fib6_lookup(struct net *net, int oif, struct flowi6 *fl6,
>> +			      int flags)
> 
> Maybe an EXPORT_SYMBOL_GPL? There is one for __fib_lookup (fib_lookup is
> an inline function).
> 

There needs to be an in-tree user relying on it. My intention is to
convert rt6_lookup to fib6_lookup; as I recall all of those just want
the egress device and do not need a dst based response. That might
provide the in-tree user - depending on how the conversion is done.

And I presume you are asking for your benchmark modules. I can send you
my diffs and results.

^ permalink raw reply

* Re: [PATCH net-next] net: core: Inline netdev_features_size_check()
From: David Miller @ 2018-05-01 18:24 UTC (permalink / raw)
  To: f.fainelli; +Cc: netdev, stephen
In-Reply-To: <20180430212005.28204-1-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Mon, 30 Apr 2018 14:20:05 -0700

> We do not require this inline function to be used in multiple different
> locations, just inline it where it gets used in register_netdevice().
> 
> Suggested-by: David Miller <davem@davemloft.net>
> Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>

Yeah that looks a lot better, applied, thanks Florian.

^ permalink raw reply

* Re: [PATCH] ipv6: Allow non-gateway ECMP for IPv6
From: David Miller @ 2018-05-01 18:23 UTC (permalink / raw)
  To: Thomas.Winter; +Cc: netdev, dsahern, kuznet, yoshfuji
In-Reply-To: <20180430211529.8295-1-Thomas.Winter@alliedtelesis.co.nz>

From: Thomas Winter <Thomas.Winter@alliedtelesis.co.nz>
Date: Tue,  1 May 2018 09:15:29 +1200

> It is valid to have static routes where the nexthop
> is an interface not an address such as tunnels.
> For IPv4 it was possible to use ECMP on these routes
> but not for IPv6.
> 
> Signed-off-by: Thomas Winter <Thomas.Winter@alliedtelesis.co.nz>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net-next] udp: disable gso with no_check_tx
From: David Miller @ 2018-05-01 18:21 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: netdev, willemb
In-Reply-To: <20180430195836.69378-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Mon, 30 Apr 2018 15:58:36 -0400

> From: Willem de Bruijn <willemb@google.com>
> 
> Syzbot managed to send a udp gso packet without checksum offload into
> the gso stack by disabling tx checksum (UDP_NO_CHECK6_TX).

Impressive...

> This triggered the skb_warn_bad_offload.
> 
>   RIP: 0010:skb_warn_bad_offload+0x2bc/0x600 net/core/dev.c:2658
>    skb_gso_segment include/linux/netdevice.h:4038 [inline]
>    validate_xmit_skb+0x54d/0xd90 net/core/dev.c:3120
>    __dev_queue_xmit+0xbf8/0x34c0 net/core/dev.c:3577
>    dev_queue_xmit+0x17/0x20 net/core/dev.c:3618
> 
> UDP_NO_CHECK6_TX sets skb->ip_summed to CHECKSUM_NONE just after the
> udp gso integrity checks in udp_(v6_)send_skb. Extend those checks to
> catch and fail in this case.
> 
> After the integrity checks jump directly to the CHECKSUM_PARTIAL case
> to avoid reading the no_check_tx flags again (a TOCTTOU race).
> 
> Fixes: bec1f6f69736 ("udp: generate gso with UDP_SEGMENT")
> Signed-off-by: Willem de Bruijn <willemb@google.com>

Applied, thanks Willem.

^ permalink raw reply

* Re: [PATCH v2] ethtool: fix a potential missing-check bug
From: David Miller @ 2018-05-01 18:19 UTC (permalink / raw)
  To: wang6495
  Cc: kjlu, f.fainelli, andrew, ecree, rmk+kernel, alan.brady, stephen,
	eugenia, inbark, vidya.chowdary, ynorov, viro, netdev,
	linux-kernel
In-Reply-To: <1525109474-5595-1-git-send-email-wang6495@umn.edu>

From: Wenwen Wang <wang6495@umn.edu>
Date: Mon, 30 Apr 2018 12:31:13 -0500

> In ethtool_get_rxnfc(), the object "info" is firstly copied from
> user-space. If the FLOW_RSS flag is set in the member field flow_type of
> "info" (and cmd is ETHTOOL_GRXFH), info needs to be copied again from
> user-space because FLOW_RSS is newer and has new definition, as mentioned
> in the comment. However, given that the user data resides in user-space, a
> malicious user can race to change the data after the first copy. By doing
> so, the user can inject inconsistent data. For example, in the second
> copy, the FLOW_RSS flag could be cleared in the field flow_type of "info".
> In the following execution, "info" will be used in the function
> ops->get_rxnfc(). Such inconsistent data can potentially lead to unexpected
> information leakage since ops->get_rxnfc() will prepare various types of
> data according to flow_type, and the prepared data will be eventually
> copied to user-space. This inconsistent data may also cause undefined
> behaviors based on how ops->get_rxnfc() is implemented.
> 
> This patch simply re-verifies the flow_type field of "info" after the
> second copy. If the value is not as expected, an error code will be
> returned.
> 
> Signed-off-by: Wenwen Wang <wang6495@umn.edu>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] net/mlx4: fix spelling mistake: "failedi" -> "failed"
From: David Miller @ 2018-05-01 18:17 UTC (permalink / raw)
  To: colin.king; +Cc: tariqt, netdev, linux-rdma, kernel-janitors, linux-kernel
In-Reply-To: <20180430162945.5274-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Mon, 30 Apr 2018 17:29:45 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> trivial fix to spelling mistake in mlx4_warn message.
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied, thanks Colin.

^ permalink raw reply

* Re: [RFC v2 bpf-next 5/9] net/ipv6: Add fib6_lookup
From: Vincent Bernat @ 2018-05-01 18:15 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, borkmann, ast, davem, shm, roopa, brouer, toke,
	john.fastabend
In-Reply-To: <20180429180752.15428-6-dsahern@gmail.com>

 ❦ 29 avril 2018 11:07 -0700, David Ahern <dsahern@gmail.com> :

> +struct fib6_info *fib6_lookup(struct net *net, int oif, struct flowi6 *fl6,
> +			      int flags)

Maybe an EXPORT_SYMBOL_GPL? There is one for __fib_lookup (fib_lookup is
an inline function).
-- 
Use the "telephone test" for readability.
            - The Elements of Programming Style (Kernighan & Plauger)

^ permalink raw reply

* Re: [PATCH net-next V2] cls_flower: Support multiple masks per priority
From: David Miller @ 2018-05-01 18:14 UTC (permalink / raw)
  To: paulb
  Cc: jiri, xiyou.wangcong, jhs, netdev, kliteyn, roid, shahark, markb,
	ogerlitz
In-Reply-To: <1525087710-17495-1-git-send-email-paulb@mellanox.com>

From: Paul Blakey <paulb@mellanox.com>
Date: Mon, 30 Apr 2018 14:28:30 +0300

> Currently flower doesn't support inserting filters with different masks
> on a single priority, even if the actual flows (key + mask) inserted
> aren't overlapping, as with the use case of offloading openvswitch
> datapath flows. Instead one must go up one level, and assign different
> priorities for each mask, which will create a different flower
> instances.
> 
> This patch opens flower to support more than one mask per priority,
> and a single flower instance. It does so by adding another hash table
> on top of the existing one which will store the different masks,
> and the filters that share it.
> 
> The user is left with the responsibility of ensuring non overlapping
> flows, otherwise precedence is not guaranteed.
> 
> Signed-off-by: Paul Blakey <paulb@mellanox.com>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> ---
> 
> Changes:
> V1 -> V2: in fl_init, removed unnessecry err variable, just return direct result instead.
>           in fl_mask_range, removed extra new line.
>           commit msg, spelling.

Applied, thanks.

^ permalink raw reply

* Re: [RFC net-next 0/5] Support for PHY test modes
From: David Miller @ 2018-05-01 18:06 UTC (permalink / raw)
  To: f.fainelli
  Cc: andrew, netdev, rmk, linux-kernel, cphealy, nikita.yoush,
	vivien.didelot, Nisar.Sayed, UNGLinuxDriver
In-Reply-To: <19a6bf90-03d5-aa63-5f35-3b26801b79a9@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Tue, 1 May 2018 10:21:54 -0700

> On 04/30/2018 04:24 PM, Andrew Lunn wrote:
>>> Turning these tests on will typically result in the link partner
>>> dropping the link with us, and the interface will be non-functional as
>>> far as the data path is concerned (similar to an isolation mode). This
>>> might warrant properly reporting that to user-space through e.g: a
>>> private IFF_* value maybe?
>> 
>> Hi Florian
>> 
>> I think a IFF_* value would be a good idea. We want to give the user
>> some indicate why they don't have working networking. ip link show
>> showing PHY-TEST-MODE would help.
> 
> IF_OPER_TESTING as defined in RFC 2863 looks like the correct way to
> signal that. I did a quick test and setting operstate to
> IFF_OPER_TESTING seems to correctly get reflected by iproute2/ifconfig
> which no longer see RUNNING though the interface is still UP. If we
> couple that with a proper phy_stop(), this would IMHO be consistent from
> an user experience perspective.
> 
> David, would that look reasonable to you?

Yes, it does.

^ permalink raw reply

* Re: [PATCH] vhost: make msg padding explicit
From: David Miller @ 2018-05-01 18:05 UTC (permalink / raw)
  To: mst; +Cc: linux-kernel, kevin, jasowang, kvm, virtualization, netdev
In-Reply-To: <20180501201841-mutt-send-email-mst@kernel.org>

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Tue, 1 May 2018 20:19:19 +0300

> On Tue, May 01, 2018 at 11:28:22AM -0400, David Miller wrote:
>> From: "Michael S. Tsirkin" <mst@redhat.com>
>> Date: Fri, 27 Apr 2018 19:02:05 +0300
>> 
>> > There's a 32 bit hole just after type. It's best to
>> > give it a name, this way compiler is forced to initialize
>> > it with rest of the structure.
>> > 
>> > Reported-by: Kevin Easton <kevin@guarana.org>
>> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>> 
>> Michael, will you be sending this directly to Linus or would you like
>> me to apply it to net or net-next?
>> 
>> Thanks.
> 
> I'd prefer you to apply it for net and cc stable if possible.

Ok, applied, and added to my -stable submission queue.

^ permalink raw reply

* Re: [PATCH net-next 0/2] bridge: FDB: Notify about removal of non-user-added entries
From: Andrew Lunn @ 2018-05-01 17:54 UTC (permalink / raw)
  To: Petr Machata
  Cc: netdev, ivecera, davem, stephen, vivien.didelot, f.fainelli, jiri
In-Reply-To: <cover.1525194039.git.petrm@mellanox.com>

On Tue, May 01, 2018 at 07:04:19PM +0200, Petr Machata wrote:
> Device drivers may generally need to keep in sync with bridge's FDB. In
> particular, for its offload of tc mirror action where the mirrored-to
> device is a gretap device, mlxsw needs to listen to a number of events.
> SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE would be a natural notification to
> listen to in order to keep up with FDB updates.
> 
> However, for removal of FDB entries added due to device activity (as
> opposed to explicit addition through "bridge fdb add" or similar), there
> are no notifications.

Hi Petr

Could you explain this some more. Why is mlxsw special in that it
needs this, but other drivers don't. The b53 driver supports tc
mirror, for example.

    Andrew

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox