Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache
From: Eric Dumazet @ 2018-04-21 16:45 UTC (permalink / raw)
  To: Willem de Bruijn, Jesper Dangaard Brouer
  Cc: Paolo Abeni, Network Development, David S. Miller, Tariq Toukan
In-Reply-To: <CAF=yD-LkM3hzcY3B1P_5fW1t+QNtPz6=2YRr4P79t4hZW=0wTA@mail.gmail.com>



On 04/21/2018 08:54 AM, Willem de Bruijn wrote:
> On Fri, Apr 20, 2018 at 9:48 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>>
>> On Thu, 19 Apr 2018 06:47:10 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> On 04/19/2018 12:40 AM, Paolo Abeni wrote:
>>>> On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:
>>>>> On 04/18/2018 10:15 AM, Paolo Abeni wrote:
>> [...]
>>>>
>>>> Any suggestions for better results are more than welcome!
>>>
>>> Yes, remote skb freeing. I mentioned this idea to Jesper and Tariq in
>>> Seoul (netdev conference). Not tied to UDP, but a generic solution.
>>
>> Yes, I remember.  I think... was it the idea, where you basically
>> wanted to queue back SKBs to the CPU that allocated them, right?
>>
>> Freeing an SKB on the same CPU that allocated it, have multiple
>> advantages. (1) the SLUB allocator can use a non-atomic
>> "cpu-local" (double)cmpxchg. (2) the 4 cache-lines memset cleared of
>> the SKB stay local.  (3) the atomic SKB refcnt/users stay local.
>>
>> We just have to avoid that queue back SKB's mechanism, doesn't cost
>> more than the operations we expect to save.  Bulk transfer is an
>> obvious approach.  For storing SKBs until they are returned, we already
>> have a fast mechanism see napi_consume_skb calling _kfree_skb_defer,
>> which SLUB/SLAB-bulk free to amortize cost (1).
>>
>> I guess, the missing information is that we don't know what CPU the SKB
>> were created on...
> 
> For connected sockets, sk->sk_incoming_cpu has this data. It
> records BH cpu on enqueue to udp socket, so one caveat is that
> it may be wrong with rps/rfs.
> 
> Another option is to associate not with source cpu but napi struct
> and have the device driver free in the context of its napi processing.
> This has the additional benefit that skb->napi_id is already stored
> per skb, so this also works for unconnected sockets.
> 
> Third, the skb->napi_id field is unused after setting sk->sk_napi_id
> on sk enqueue, so the BH cpu could be stored here after that,
> essentially extending sk_incoming_cpu to unconnected sockets.

We use at Google something named TXCS, which is what I mentioned to Jesper and Tariq.

(In our case, we wanted to not perform skb destructor/freeing on the cpu handling the TX queue,
but on cpus that originally cooked the skb (running TCP stack))

To accommodate generic needs (both RX and TX), I do not believe we can union any existing fields,
without a lot of pain/bugs.

^ permalink raw reply

* [PATCH bpf-next] bpf: btf: Clean up btf.h in uapi
From: Martin KaFai Lau @ 2018-04-21 16:48 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

This patch cleans up btf.h in uapi:
1) Rename "name" to "name_off" to better reflect it is an offset to the
   string section instead of a char array.
2) Remove unused value BTF_FLAGS_COMPR and BTF_MAGIC_SWAP

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/uapi/linux/btf.h       |  8 +++-----
 kernel/bpf/btf.c               | 20 ++++++++++----------
 tools/include/uapi/linux/btf.h |  8 +++-----
 tools/lib/bpf/btf.c            |  2 +-
 4 files changed, 17 insertions(+), 21 deletions(-)

diff --git a/include/uapi/linux/btf.h b/include/uapi/linux/btf.h
index 74a30b1090df..bcb56ee47014 100644
--- a/include/uapi/linux/btf.h
+++ b/include/uapi/linux/btf.h
@@ -6,9 +6,7 @@
 #include <linux/types.h>
 
 #define BTF_MAGIC	0xeB9F
-#define BTF_MAGIC_SWAP	0x9FeB
 #define BTF_VERSION	1
-#define BTF_FLAGS_COMPR	0x01
 
 struct btf_header {
 	__u16	magic;
@@ -43,7 +41,7 @@ struct btf_header {
 #define BTF_STR_OFFSET(ref)	((ref) & BTF_MAX_NAME_OFFSET)
 
 struct btf_type {
-	__u32 name;
+	__u32 name_off;
 	/* "info" bits arrangement
 	 * bits  0-15: vlen (e.g. # of struct's members)
 	 * bits 16-23: unused
@@ -105,7 +103,7 @@ struct btf_type {
  * info in "struct btf_type").
  */
 struct btf_enum {
-	__u32	name;
+	__u32	name_off;
 	__s32	val;
 };
 
@@ -122,7 +120,7 @@ struct btf_array {
  * "struct btf_type").
  */
 struct btf_member {
-	__u32	name;
+	__u32	name_off;
 	__u32	type;
 	__u32	offset;	/* offset in bits */
 };
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index eb56ac760547..22e1046a1a86 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -473,7 +473,7 @@ __printf(4, 5) static void __btf_verifier_log_type(struct btf_verifier_env *env,
 	__btf_verifier_log(log, "[%u] %s %s%s",
 			   env->log_type_id,
 			   btf_kind_str[kind],
-			   btf_name_by_offset(btf, t->name),
+			   btf_name_by_offset(btf, t->name_off),
 			   log_details ? " " : "");
 
 	if (log_details)
@@ -517,7 +517,7 @@ static void btf_verifier_log_member(struct btf_verifier_env *env,
 		btf_verifier_log_type(env, struct_type, NULL);
 
 	__btf_verifier_log(log, "\t%s type_id=%u bits_offset=%u",
-			   btf_name_by_offset(btf, member->name),
+			   btf_name_by_offset(btf, member->name_off),
 			   member->type, member->offset);
 
 	if (fmt && *fmt) {
@@ -1419,10 +1419,10 @@ static s32 btf_struct_check_meta(struct btf_verifier_env *env,
 	btf_verifier_log_type(env, t, NULL);
 
 	for_each_member(i, t, member) {
-		if (!btf_name_offset_valid(btf, member->name)) {
+		if (!btf_name_offset_valid(btf, member->name_off)) {
 			btf_verifier_log_member(env, t, member,
 						"Invalid member name_offset:%u",
-						member->name);
+						member->name_off);
 			return -EINVAL;
 		}
 
@@ -1605,14 +1605,14 @@ static s32 btf_enum_check_meta(struct btf_verifier_env *env,
 	btf_verifier_log_type(env, t, NULL);
 
 	for (i = 0; i < nr_enums; i++) {
-		if (!btf_name_offset_valid(btf, enums[i].name)) {
+		if (!btf_name_offset_valid(btf, enums[i].name_off)) {
 			btf_verifier_log(env, "\tInvalid name_offset:%u",
-					 enums[i].name);
+					 enums[i].name_off);
 			return -EINVAL;
 		}
 
 		btf_verifier_log(env, "\t%s val=%d\n",
-				 btf_name_by_offset(btf, enums[i].name),
+				 btf_name_by_offset(btf, enums[i].name_off),
 				 enums[i].val);
 	}
 
@@ -1636,7 +1636,7 @@ static void btf_enum_seq_show(const struct btf *btf, const struct btf_type *t,
 	for (i = 0; i < nr_enums; i++) {
 		if (v == enums[i].val) {
 			seq_printf(m, "%s",
-				   btf_name_by_offset(btf, enums[i].name));
+				   btf_name_by_offset(btf, enums[i].name_off));
 			return;
 		}
 	}
@@ -1687,9 +1687,9 @@ static s32 btf_check_meta(struct btf_verifier_env *env,
 		return -EINVAL;
 	}
 
-	if (!btf_name_offset_valid(env->btf, t->name)) {
+	if (!btf_name_offset_valid(env->btf, t->name_off)) {
 		btf_verifier_log(env, "[%u] Invalid name_offset:%u",
-				 env->log_type_id, t->name);
+				 env->log_type_id, t->name_off);
 		return -EINVAL;
 	}
 
diff --git a/tools/include/uapi/linux/btf.h b/tools/include/uapi/linux/btf.h
index 74a30b1090df..bcb56ee47014 100644
--- a/tools/include/uapi/linux/btf.h
+++ b/tools/include/uapi/linux/btf.h
@@ -6,9 +6,7 @@
 #include <linux/types.h>
 
 #define BTF_MAGIC	0xeB9F
-#define BTF_MAGIC_SWAP	0x9FeB
 #define BTF_VERSION	1
-#define BTF_FLAGS_COMPR	0x01
 
 struct btf_header {
 	__u16	magic;
@@ -43,7 +41,7 @@ struct btf_header {
 #define BTF_STR_OFFSET(ref)	((ref) & BTF_MAX_NAME_OFFSET)
 
 struct btf_type {
-	__u32 name;
+	__u32 name_off;
 	/* "info" bits arrangement
 	 * bits  0-15: vlen (e.g. # of struct's members)
 	 * bits 16-23: unused
@@ -105,7 +103,7 @@ struct btf_type {
  * info in "struct btf_type").
  */
 struct btf_enum {
-	__u32	name;
+	__u32	name_off;
 	__s32	val;
 };
 
@@ -122,7 +120,7 @@ struct btf_array {
  * "struct btf_type").
  */
 struct btf_member {
-	__u32	name;
+	__u32	name_off;
 	__u32	type;
 	__u32	offset;	/* offset in bits */
 };
diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index 58b6255abc7a..2bac710e3194 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -281,7 +281,7 @@ int32_t btf__find_by_name(const struct btf *btf, const char *type_name)
 
 	for (i = 1; i <= btf->nr_types; i++) {
 		const struct btf_type *t = btf->types[i];
-		const char *name = btf_name_by_offset(btf, t->name);
+		const char *name = btf_name_by_offset(btf, t->name_off);
 
 		if (name && !strcmp(type_name, name))
 			return i;
-- 
2.9.5

^ permalink raw reply related

* Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue
From: Eric Dumazet @ 2018-04-21 16:55 UTC (permalink / raw)
  To: Christoph Hellwig, Eric Dumazet
  Cc: David S . Miller, netdev, linux-kernel, Soheil Hassas Yeganeh,
	linux-mm, linux-fsdevel
In-Reply-To: <20180421090722.GA11998@infradead.org>



On 04/21/2018 02:07 AM, Christoph Hellwig wrote:
> On Fri, Apr 20, 2018 at 08:55:38AM -0700, Eric Dumazet wrote:
>> This patch series provide a new mmap_hook to fs willing to grab
>> a mutex before mm->mmap_sem is taken, to ensure lockdep sanity.
>>
>> This hook allows us to shorten tcp_mmap() execution time (while mmap_sem
>> is held), and improve multi-threading scalability. 
> 
> Missing CC to linu-fsdevel and linux-mm that will have to decide.
> 
> We've rejected this approach multiple times before, so you better
> make a really good argument for it.
> 

Well, tcp code needs to hold socket lock before mm->mmap_sem, so current
mmap hook can not fit. Or we need to revisit all code doing copyin/copyout while
holding a socket lock. (Not feasible really)


> introducing a multiplexer that overloads a single method certainly
> doesn't help making that case.

Well, if you refer to multiple hooks instead of a single one, I basically
thought that since only TCP needs this hook at the moment,
it was not worth adding extra 8-bytes loads for all other mmap() users.

I have no issue adding more hooks and more memory pressure if this is the blocking factor.

We need two actions at this moment, (to lock the socket or release it)
and a third one would allow us to build the array of pages
before grabbing mmap_sem (as I mentioned in the last patch changelog)

^ permalink raw reply

* [PATCH net-next v5 0/3] kernel: add support to collect hardware logs in crash recovery kernel
From: Rahul Lakkireddy @ 2018-04-21 17:05 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA,
	kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: indranil-ut6Up61K2wZBDgjK7y7TUQ, nirranjan-ut6Up61K2wZBDgjK7y7TUQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	ganeshgr-ut6Up61K2wZBDgjK7y7TUQ, Rahul Lakkireddy,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn

On production servers running variety of workloads over time, kernel
panic can happen sporadically after days or even months. It is
important to collect as much debug logs as possible to root cause
and fix the problem, that may not be easy to reproduce. Snapshot of
underlying hardware/firmware state (like register dump, firmware
logs, adapter memory, etc.), at the time of kernel panic will be very
helpful while debugging the culprit device driver.

This series of patches add new generic framework that enable device
drivers to collect device specific snapshot of the hardware/firmware
state of the underlying device in the crash recovery kernel. In crash
recovery kernel, the collected logs are added as elf notes to
/proc/vmcore, which is copied by user space scripts for post-analysis.

The sequence of actions done by device drivers to append their device
specific hardware/firmware logs to /proc/vmcore are as follows:

1. During probe (before hardware is initialized), device drivers
register to the vmcore module (via vmcore_add_device_dump()), with
callback function, along with buffer size and log name needed for
firmware/hardware log collection.

2. vmcore module allocates the buffer with requested size. It adds
an elf note and invokes the device driver's registered callback
function.

3. Device driver collects all hardware/firmware logs into the buffer
and returns control back to vmcore module.

The device specific hardware/firmware logs can be seen as elf notes:

# readelf -n /proc/vmcore

Displaying notes found at file offset 0x00001000 with length 0x04003288:
  Owner                 Data size	Description
  VMCOREDD_cxgb4_0000:02:00.4 0x02000fd8	Unknown note type: (0x00000700)
  VMCOREDD_cxgb4_0000:04:00.4 0x02000fd8	Unknown note type: (0x00000700)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  VMCOREINFO           0x0000074f	Unknown note type: (0x00000000)

Patch 1 adds API to vmcore module to allow drivers to register callback
to collect the device specific hardware/firmware logs.  The logs will
be added to /proc/vmcore as elf notes.

Patch 2 updates read and mmap logic to append device specific hardware/
firmware logs as elf notes.

Patch 3 shows a cxgb4 driver example using the API to collect
hardware/firmware logs in crash recovery kernel, before hardware is
initialized.

Thanks,
Rahul

---
v5:
- Removed enabling CONFIG_PROC_VMCORE_DEVICE_DUMP by default and
  updated help message.

v4:
- Made __vmcore_add_device_dump() static.
- Moved compile check to define vmcore_add_device_dump() to
  crash_dump.h to fix compilation when vmcore.c is not compiled in.
- Convert ---help--- to help in Kconfig as indicated by checkpatch.
- Rebased to tip.

v3:
- Dropped sysfs crashdd module.
- Exported dumps as elf notes. Suggested by Eric Biederman
  <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>.  Added as patch 2 in this version.
- Added CONFIG_PROC_VMCORE_DEVICE_DUMP to allow configuring device
  dump support.
- Moved logic related to adding dumps from crashdd to vmcore module.
- Rename all crashdd* to vmcoredd*.
- Updated comments.

v2:
- Added ABI Documentation for crashdd.
- Directly use octal permission instead of macro.

Changes since rfc v2:
- Moved exporting crashdd from procfs to sysfs. Suggested by
  Stephen Hemminger <stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ@public.gmane.org>
- Moved code from fs/proc/crashdd.c to fs/crashdd/ directory.
- Replaced all proc API with sysfs API and updated comments.
- Calling driver callback before creating the binary file under
  crashdd sysfs.
- Changed binary dump file permission from S_IRUSR to S_IRUGO.
- Changed module name from CRASH_DRIVER_DUMP to CRASH_DEVICE_DUMP.

rfc v2:
- Collecting logs in 2nd kernel instead of during kernel panic.
  Suggested by Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>.
- Added new crashdd module that exports /proc/crashdd/ containing
  driver's registered hardware/firmware logs in patch 1.
- Replaced the API to allow drivers to register their hardware/firmware
  log collect routine in crash recovery kernel in patch 1.
- Updated patch 2 to use the new API in patch 1.

Rahul Lakkireddy (3):
  vmcore: add API to collect hardware dump in second kernel
  vmcore: append device dumps to vmcore as elf notes
  cxgb4: collect hardware dump in second kernel

 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h       |   4 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c |  25 ++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h |   3 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c  |  10 +
 fs/proc/Kconfig                                  |  15 +
 fs/proc/vmcore.c                                 | 399 ++++++++++++++++++++++-
 include/linux/crash_core.h                       |   4 +
 include/linux/crash_dump.h                       |  17 +
 include/linux/kcore.h                            |   6 +
 include/uapi/linux/elf.h                         |   1 +
 10 files changed, 471 insertions(+), 13 deletions(-)

-- 
2.14.1

^ permalink raw reply

* [PATCH net-next v5 1/3] vmcore: add API to collect hardware dump in second kernel
From: Rahul Lakkireddy @ 2018-04-21 17:05 UTC (permalink / raw)
  To: netdev, kexec, linux-fsdevel, linux-kernel
  Cc: davem, viro, ebiederm, stephen, akpm, torvalds, ganeshgr,
	nirranjan, indranil, Rahul Lakkireddy
In-Reply-To: <cover.1524329561.git.rahul.lakkireddy@chelsio.com>

The sequence of actions done by device drivers to append their device
specific hardware/firmware logs to /proc/vmcore are as follows:

1. During probe (before hardware is initialized), device drivers
register to the vmcore module (via vmcore_add_device_dump()), with
callback function, along with buffer size and log name needed for
firmware/hardware log collection.

2. vmcore module allocates the buffer with requested size. It adds
an Elf note and invokes the device driver's registered callback
function.

3. Device driver collects all hardware/firmware logs into the buffer
and returns control back to vmcore module.

Ensure that the device dump buffer size is always aligned to page size
so that it can be mmaped.

Also, rename alloc_elfnotes_buf() to vmcore_alloc_buf() to make it more
generic and reserve NT_VMCOREDD note type to indicate vmcore device
dump.

Suggested-by: Eric Biederman <ebiederm@xmission.com>.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
---
v5:
- Removed enabling CONFIG_PROC_VMCORE_DEVICE_DUMP by default and
  updated help message to indicate that the driver must be present
  in second kernel's initramfs to collect the underlying device
  snapshot.

v4:
- Made __vmcore_add_device_dump() static.
- Moved compile check to define vmcore_add_device_dump() to
  crash_dump.h to fix compilation when vmcore.c is not compiled in.
- Convert ---help--- to help in Kconfig as indicated by checkpatch.
- Rebased to tip.

v3:
- Dropped sysfs crashdd module.
- Added CONFIG_PROC_VMCORE_DEVICE_DUMP to allow configuring device
  dump support.
- Moved logic related to adding dumps from crashdd to vmcore module.
- Rename all crashdd* to vmcoredd*.

v2:
- Added ABI Documentation for crashdd.
- Directly use octal permission instead of macro.

Changes since rfc v2:
- Moved exporting crashdd from procfs to sysfs.  Suggested by
  Stephen Hemminger <stephen@networkplumber.org>
- Moved code from fs/proc/crashdd.c to fs/crashdd/ directory.
- Replaced all proc API with sysfs API and updated comments.
- Calling driver callback before creating the binary file under
  crashdd sysfs.
- Changed binary dump file permission from S_IRUSR to S_IRUGO.
- Changed module name from CRASH_DRIVER_DUMP to CRASH_DEVICE_DUMP.

rfc v2:
- Collecting logs in 2nd kernel instead of during kernel panic.
  Suggested by Eric Biederman <ebiederm@xmission.com>.
- Patch added in this series.

 fs/proc/Kconfig            |  15 +++++
 fs/proc/vmcore.c           | 152 ++++++++++++++++++++++++++++++++++++++++++---
 include/linux/crash_core.h |   4 ++
 include/linux/crash_dump.h |  17 +++++
 include/linux/kcore.h      |   6 ++
 include/uapi/linux/elf.h   |   1 +
 6 files changed, 186 insertions(+), 9 deletions(-)

diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 1ade1206bb89..0eaeb41453f5 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -43,6 +43,21 @@ config PROC_VMCORE
         help
         Exports the dump image of crashed kernel in ELF format.
 
+config PROC_VMCORE_DEVICE_DUMP
+	bool "Device Hardware/Firmware Log Collection"
+	depends on PROC_VMCORE
+	default n
+	help
+	  After kernel panic, device drivers can collect the device
+	  specific snapshot of their hardware or firmware before the
+	  underlying devices are initialized in crash recovery kernel.
+	  Note that the device driver must be present in the crash
+	  recovery kernel's initramfs to collect its underlying device
+	  snapshot.
+
+	  If you say Y here, the collected device dumps will be added
+	  as ELF notes to /proc/vmcore.
+
 config PROC_SYSCTL
 	bool "Sysctl support (/proc/sys)" if EXPERT
 	depends on PROC_FS
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index a45f0af22a60..7395462d2f86 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -20,6 +20,7 @@
 #include <linux/init.h>
 #include <linux/crash_dump.h>
 #include <linux/list.h>
+#include <linux/mutex.h>
 #include <linux/vmalloc.h>
 #include <linux/pagemap.h>
 #include <linux/uaccess.h>
@@ -44,6 +45,12 @@ static u64 vmcore_size;
 
 static struct proc_dir_entry *proc_vmcore;
 
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+/* Device Dump list and mutex to synchronize access to list */
+static LIST_HEAD(vmcoredd_list);
+static DEFINE_MUTEX(vmcoredd_mutex);
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
 /*
  * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error
  * The called function has to take care of module refcounting.
@@ -302,10 +309,8 @@ static const struct vm_operations_struct vmcore_mmap_ops = {
 };
 
 /**
- * alloc_elfnotes_buf - allocate buffer for ELF note segment in
- *                      vmalloc memory
- *
- * @notes_sz: size of buffer
+ * vmcore_alloc_buf - allocate buffer in vmalloc memory
+ * @sizez: size of buffer
  *
  * If CONFIG_MMU is defined, use vmalloc_user() to allow users to mmap
  * the buffer to user-space by means of remap_vmalloc_range().
@@ -313,12 +318,12 @@ static const struct vm_operations_struct vmcore_mmap_ops = {
  * If CONFIG_MMU is not defined, use vzalloc() since mmap_vmcore() is
  * disabled and there's no need to allow users to mmap the buffer.
  */
-static inline char *alloc_elfnotes_buf(size_t notes_sz)
+static inline char *vmcore_alloc_buf(size_t size)
 {
 #ifdef CONFIG_MMU
-	return vmalloc_user(notes_sz);
+	return vmalloc_user(size);
 #else
-	return vzalloc(notes_sz);
+	return vzalloc(size);
 #endif
 }
 
@@ -665,7 +670,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
 		return rc;
 
 	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
-	*notes_buf = alloc_elfnotes_buf(*notes_sz);
+	*notes_buf = vmcore_alloc_buf(*notes_sz);
 	if (!*notes_buf)
 		return -ENOMEM;
 
@@ -851,7 +856,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
 		return rc;
 
 	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
-	*notes_buf = alloc_elfnotes_buf(*notes_sz);
+	*notes_buf = vmcore_alloc_buf(*notes_sz);
 	if (!*notes_buf)
 		return -ENOMEM;
 
@@ -1145,6 +1150,132 @@ static int __init parse_crash_elf_headers(void)
 	return 0;
 }
 
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+/**
+ * vmcoredd_get_note_size - Get size of the note that will be inserted at
+ * beginning of the dump's buffer.
+ * @name: Note's name
+ *
+ * Gets the overall size of the note that will be inserted at the beginning
+ * of the dump's buffer.  It also adds padding, if necessary to meet
+ * alignment requirements.
+ */
+static inline size_t vmcoredd_get_note_size(const char *name)
+{
+	return CRASH_CORE_NOTE_HEAD_BYTES +
+	       ALIGN(VMCOREDD_NOTE_NAME_BYTES + strlen(name), sizeof(Elf_Word));
+}
+
+/**
+ * vmcoredd_write_note - Write note at the beginning of the dump's buffer
+ * @name: Dump's name
+ * @buf: Output buffer where the note is written
+ * @size: Size of the dump
+ *
+ * Fills beginning of the dump's data with elf note.
+ */
+static void vmcoredd_write_note(const char *name, void *buf, size_t size)
+{
+	struct elf_note *note = (struct elf_note *)buf;
+	Elf_Word *word = (Elf_Word *)note;
+
+	note->n_namesz = ALIGN(VMCOREDD_NOTE_NAME_BYTES + strlen(name),
+			       sizeof(Elf_Word));
+	note->n_descsz = size;
+	note->n_type = NT_VMCOREDD;
+	word += DIV_ROUND_UP(sizeof(*note), sizeof(Elf_Word));
+	snprintf((char *)word, note->n_namesz, "%s_%s", VMCOREDD_NOTE_NAME,
+		 name);
+}
+
+/**
+ * vmcore_add_device_dump - Add a buffer containing device dump to vmcore
+ * @data: dump info.
+ *
+ * Allocate a buffer and invoke the calling driver's dump collect routine.
+ * Write Elf note at the beginning of the buffer to indicate vmcore device
+ * dump and add the dump to global list.
+ */
+static int __vmcore_add_device_dump(struct vmcoredd_data *data)
+{
+	size_t note_size, data_size;
+	struct vmcoredd_node *dump;
+	void *buf = NULL;
+	int ret;
+
+	if (!data || !strlen(data->name) ||
+	    !data->vmcoredd_callback || !data->size)
+		return -EINVAL;
+
+	dump = vzalloc(sizeof(*dump));
+	if (!dump) {
+		ret = -ENOMEM;
+		goto out_err;
+	}
+
+	note_size = vmcoredd_get_note_size(data->name);
+	/* Keep size of the buffer page aligned so that it can be mmaped */
+	data_size = roundup(note_size + data->size, PAGE_SIZE);
+
+	/* Allocate buffer for driver's to write their dumps */
+	buf = vmcore_alloc_buf(data_size);
+	if (!buf) {
+		ret = -ENOMEM;
+		goto out_err;
+	}
+
+	vmcoredd_write_note(data->name, buf, data_size - note_size);
+
+	/* Invoke the driver's dump collection routing */
+	ret = data->vmcoredd_callback(data, buf + note_size);
+	if (ret)
+		goto out_err;
+
+	dump->buf = buf;
+	dump->size = data_size;
+
+	/* Add the dump to driver sysfs list */
+	mutex_lock(&vmcoredd_mutex);
+	list_add_tail(&dump->list, &vmcoredd_list);
+	mutex_unlock(&vmcoredd_mutex);
+
+	return 0;
+
+out_err:
+	if (buf)
+		vfree(buf);
+
+	if (dump)
+		vfree(dump);
+
+	return ret;
+}
+
+int vmcore_add_device_dump(struct vmcoredd_data *data)
+{
+	return __vmcore_add_device_dump(data);
+}
+EXPORT_SYMBOL(vmcore_add_device_dump);
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
+/* Free all dumps in vmcore device dump list */
+static void vmcore_free_device_dumps(void)
+{
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+	mutex_lock(&vmcoredd_mutex);
+	while (!list_empty(&vmcoredd_list)) {
+		struct vmcoredd_node *dump;
+
+		dump = list_first_entry(&vmcoredd_list, struct vmcoredd_node,
+					list);
+		list_del(&dump->list);
+		vfree(dump->buf);
+		vfree(dump);
+	}
+	mutex_unlock(&vmcoredd_mutex);
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+}
+
 /* Init function for vmcore module. */
 static int __init vmcore_init(void)
 {
@@ -1192,4 +1323,7 @@ void vmcore_cleanup(void)
 		kfree(m);
 	}
 	free_elfcorebuf();
+
+	/* clear vmcore device dump list */
+	vmcore_free_device_dumps();
 }
diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index b511f6d24b42..3b6b041d84c8 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -27,6 +27,10 @@
 				     VMCOREINFO_NOTE_NAME_BYTES +	\
 				     VMCOREINFO_BYTES)
 
+#define VMCOREDD_MAX_NAME_BYTES  32
+#define VMCOREDD_NOTE_NAME	 "VMCOREDD"
+#define VMCOREDD_NOTE_NAME_BYTES sizeof(VMCOREDD_NOTE_NAME)
+
 typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES/4];
 
 void crash_update_vmcoreinfo_safecopy(void *ptr);
diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
index f7ac2aa93269..658d508d1ec5 100644
--- a/include/linux/crash_dump.h
+++ b/include/linux/crash_dump.h
@@ -93,4 +93,21 @@ static inline bool is_kdump_kernel(void) { return 0; }
 #endif /* CONFIG_CRASH_DUMP */
 
 extern unsigned long saved_max_pfn;
+
+/* Device Dump information to be filled by drivers */
+struct vmcoredd_data {
+	char name[VMCOREDD_MAX_NAME_BYTES]; /* Unique name of the dump */
+	unsigned long size;                 /* Size of the dump */
+	/* Driver's registered callback to be invoked to collect dump */
+	int (*vmcoredd_callback)(struct vmcoredd_data *data, void *buf);
+};
+
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+int vmcore_add_device_dump(struct vmcoredd_data *data);
+#else
+static inline int vmcore_add_device_dump(struct vmcoredd_data *data)
+{
+	return -EOPNOTSUPP;
+}
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
 #endif /* LINUX_CRASHDUMP_H */
diff --git a/include/linux/kcore.h b/include/linux/kcore.h
index 80db19d3a505..aa26e7199060 100644
--- a/include/linux/kcore.h
+++ b/include/linux/kcore.h
@@ -28,6 +28,12 @@ struct vmcore {
 	loff_t offset;
 };
 
+struct vmcoredd_node {
+	struct list_head list;	/* List of dumps */
+	void *buf;		/* Buffer containing device's dump */
+	unsigned long size;	/* Size of the buffer */
+};
+
 #ifdef CONFIG_PROC_KCORE
 extern void kclist_add(struct kcore_list *, void *, size_t, int type);
 #else
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index e2535d6dcec7..4e12c423b9fe 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -421,6 +421,7 @@ typedef struct elf64_shdr {
 #define NT_ARM_SYSTEM_CALL	0x404	/* ARM system call number */
 #define NT_ARM_SVE	0x405		/* ARM Scalable Vector Extension registers */
 #define NT_ARC_V2	0x600		/* ARCv2 accumulator/extra registers */
+#define NT_VMCOREDD	0x700		/* Vmcore Device Dump Note */
 
 /* Note header in a PT_NOTE section */
 typedef struct elf32_note {
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next v5 2/3] vmcore: append device dumps to vmcore as elf notes
From: Rahul Lakkireddy @ 2018-04-21 17:05 UTC (permalink / raw)
  To: netdev, kexec, linux-fsdevel, linux-kernel
  Cc: davem, viro, ebiederm, stephen, akpm, torvalds, ganeshgr,
	nirranjan, indranil, Rahul Lakkireddy
In-Reply-To: <cover.1524329561.git.rahul.lakkireddy@chelsio.com>

Update read and mmap logic to append device dumps as additional notes
before the other elf notes. We add device dumps before other elf notes
because the other elf notes may not fill the elf notes buffer
completely and we will end up with zero-filled data between the elf
notes and the device dumps. Tools will then try to decode this
zero-filled data as valid notes and we don't want that. Hence, adding
device dumps before the other elf notes ensure that zero-filled data
can be avoided. This also ensures that the device dumps and the
other elf notes can be properly mmaped at page aligned address.

Incorporate device dump size into the total vmcore size. Also update
offsets for other program headers after the device dumps are added.

Suggested-by: Eric Biederman <ebiederm@xmission.com>.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
---
v5:
- No changes.

v4:
- No changes.

v3:
- Patch added in this version.
- Exported dumps as elf notes. Suggested by Eric Biederman
  <ebiederm@xmission.com>.

 fs/proc/vmcore.c | 247 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 243 insertions(+), 4 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 7395462d2f86..ed1ebd85e14e 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -39,6 +39,8 @@ static size_t elfcorebuf_sz_orig;
 
 static char *elfnotes_buf;
 static size_t elfnotes_sz;
+/* Size of all notes minus the device dump notes */
+static size_t elfnotes_orig_sz;
 
 /* Total size of vmcore file. */
 static u64 vmcore_size;
@@ -51,6 +53,9 @@ static LIST_HEAD(vmcoredd_list);
 static DEFINE_MUTEX(vmcoredd_mutex);
 #endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
 
+/* Device Dump Size */
+static size_t vmcoredd_orig_sz;
+
 /*
  * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error
  * The called function has to take care of module refcounting.
@@ -185,6 +190,77 @@ static int copy_to(void *target, void *src, size_t size, int userbuf)
 	return 0;
 }
 
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+static int vmcoredd_copy_dumps(void *dst, u64 start, size_t size, int userbuf)
+{
+	struct vmcoredd_node *dump;
+	u64 offset = 0;
+	int ret = 0;
+	size_t tsz;
+	char *buf;
+
+	mutex_lock(&vmcoredd_mutex);
+	list_for_each_entry(dump, &vmcoredd_list, list) {
+		if (start < offset + dump->size) {
+			tsz = min(offset + (u64)dump->size - start, (u64)size);
+			buf = dump->buf + start - offset;
+			if (copy_to(dst, buf, tsz, userbuf)) {
+				ret = -EFAULT;
+				goto out_unlock;
+			}
+
+			size -= tsz;
+			start += tsz;
+			dst += tsz;
+
+			/* Leave now if buffer filled already */
+			if (!size)
+				goto out_unlock;
+		}
+		offset += dump->size;
+	}
+
+out_unlock:
+	mutex_unlock(&vmcoredd_mutex);
+	return ret;
+}
+
+static int vmcoredd_mmap_dumps(struct vm_area_struct *vma, unsigned long dst,
+			       u64 start, size_t size)
+{
+	struct vmcoredd_node *dump;
+	u64 offset = 0;
+	int ret = 0;
+	size_t tsz;
+	char *buf;
+
+	mutex_lock(&vmcoredd_mutex);
+	list_for_each_entry(dump, &vmcoredd_list, list) {
+		if (start < offset + dump->size) {
+			tsz = min(offset + (u64)dump->size - start, (u64)size);
+			buf = dump->buf + start - offset;
+			if (remap_vmalloc_range_partial(vma, dst, buf, tsz)) {
+				ret = -EFAULT;
+				goto out_unlock;
+			}
+
+			size -= tsz;
+			start += tsz;
+			dst += tsz;
+
+			/* Leave now if buffer filled already */
+			if (!size)
+				goto out_unlock;
+		}
+		offset += dump->size;
+	}
+
+out_unlock:
+	mutex_unlock(&vmcoredd_mutex);
+	return ret;
+}
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
 /* Read from the ELF header and then the crash dump. On error, negative value is
  * returned otherwise number of bytes read are returned.
  */
@@ -222,10 +298,41 @@ static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
 	if (*fpos < elfcorebuf_sz + elfnotes_sz) {
 		void *kaddr;
 
+		/* We add device dumps before other elf notes because the
+		 * other elf notes may not fill the elf notes buffer
+		 * completely and we will end up with zero-filled data
+		 * between the elf notes and the device dumps. Tools will
+		 * then try to decode this zero-filled data as valid notes
+		 * and we don't want that. Hence, adding device dumps before
+		 * the other elf notes ensure that zero-filled data can be
+		 * avoided.
+		 */
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+		/* Read device dumps */
+		if (*fpos < elfcorebuf_sz + vmcoredd_orig_sz) {
+			tsz = min(elfcorebuf_sz + vmcoredd_orig_sz -
+				  (size_t)*fpos, buflen);
+			start = *fpos - elfcorebuf_sz;
+			if (vmcoredd_copy_dumps(buffer, start, tsz, userbuf))
+				return -EFAULT;
+
+			buflen -= tsz;
+			*fpos += tsz;
+			buffer += tsz;
+			acc += tsz;
+
+			/* leave now if filled buffer already */
+			if (!buflen)
+				return acc;
+		}
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
+		/* Read remaining elf notes */
 		tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)*fpos, buflen);
-		kaddr = elfnotes_buf + *fpos - elfcorebuf_sz;
+		kaddr = elfnotes_buf + *fpos - elfcorebuf_sz - vmcoredd_orig_sz;
 		if (copy_to(buffer, kaddr, tsz, userbuf))
 			return -EFAULT;
+
 		buflen -= tsz;
 		*fpos += tsz;
 		buffer += tsz;
@@ -451,11 +558,46 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
 	if (start < elfcorebuf_sz + elfnotes_sz) {
 		void *kaddr;
 
+		/* We add device dumps before other elf notes because the
+		 * other elf notes may not fill the elf notes buffer
+		 * completely and we will end up with zero-filled data
+		 * between the elf notes and the device dumps. Tools will
+		 * then try to decode this zero-filled data as valid notes
+		 * and we don't want that. Hence, adding device dumps before
+		 * the other elf notes ensure that zero-filled data can be
+		 * avoided. This also ensures that the device dumps and
+		 * other elf notes can be properly mmaped at page aligned
+		 * address.
+		 */
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+		/* Read device dumps */
+		if (start < elfcorebuf_sz + vmcoredd_orig_sz) {
+			u64 start_off;
+
+			tsz = min(elfcorebuf_sz + vmcoredd_orig_sz -
+				  (size_t)start, size);
+			start_off = start - elfcorebuf_sz;
+			if (vmcoredd_mmap_dumps(vma, vma->vm_start + len,
+						start_off, tsz))
+				goto fail;
+
+			size -= tsz;
+			start += tsz;
+			len += tsz;
+
+			/* leave now if filled buffer already */
+			if (!size)
+				return 0;
+		}
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
+		/* Read remaining elf notes */
 		tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)start, size);
-		kaddr = elfnotes_buf + start - elfcorebuf_sz;
+		kaddr = elfnotes_buf + start - elfcorebuf_sz - vmcoredd_orig_sz;
 		if (remap_vmalloc_range_partial(vma, vma->vm_start + len,
 						kaddr, tsz))
 			goto fail;
+
 		size -= tsz;
 		start += tsz;
 		len += tsz;
@@ -703,6 +845,11 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
 	/* Modify e_phnum to reflect merged headers. */
 	ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
 
+	/* Store the size of all notes.  We need this to update the note
+	 * header when the device dumps will be added.
+	 */
+	elfnotes_orig_sz = phdr.p_memsz;
+
 	return 0;
 }
 
@@ -889,6 +1036,11 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
 	/* Modify e_phnum to reflect merged headers. */
 	ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
 
+	/* Store the size of all notes.  We need this to update the note
+	 * header when the device dumps will be added.
+	 */
+	elfnotes_orig_sz = phdr.p_memsz;
+
 	return 0;
 }
 
@@ -981,8 +1133,8 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
 }
 
 /* Sets offset fields of vmcore elements. */
-static void __init set_vmcore_list_offsets(size_t elfsz, size_t elfnotes_sz,
-					   struct list_head *vc_list)
+static void set_vmcore_list_offsets(size_t elfsz, size_t elfnotes_sz,
+				    struct list_head *vc_list)
 {
 	loff_t vmcore_off;
 	struct vmcore *m;
@@ -1188,6 +1340,92 @@ static void vmcoredd_write_note(const char *name, void *buf, size_t size)
 		 name);
 }
 
+/**
+ * vmcoredd_update_program_headers - Update all Elf program headers
+ * @elfptr: Pointer to elf header
+ * @elfnotesz: Size of elf notes aligned to page size
+ * @vmcoreddsz: Size of device dumps to be added to elf note header
+ *
+ * Determine type of Elf header (Elf64 or Elf32) and update the elf note size.
+ * Also update the offsets of all the program headers after the elf note header.
+ */
+static void vmcoredd_update_program_headers(char *elfptr, size_t elfnotesz,
+					    size_t vmcoreddsz)
+{
+	unsigned char *e_ident = (unsigned char *)elfptr;
+	u64 start, end, size;
+	loff_t vmcore_off;
+	u32 i;
+
+	vmcore_off = elfcorebuf_sz + elfnotesz;
+
+	if (e_ident[EI_CLASS] == ELFCLASS64) {
+		Elf64_Ehdr *ehdr = (Elf64_Ehdr *)elfptr;
+		Elf64_Phdr *phdr = (Elf64_Phdr *)(elfptr + sizeof(Elf64_Ehdr));
+
+		/* Update all program headers */
+		for (i = 0; i < ehdr->e_phnum; i++, phdr++) {
+			if (phdr->p_type == PT_NOTE) {
+				/* Update note size */
+				phdr->p_memsz = elfnotes_orig_sz + vmcoreddsz;
+				phdr->p_filesz = phdr->p_memsz;
+				continue;
+			}
+
+			start = rounddown(phdr->p_offset, PAGE_SIZE);
+			end = roundup(phdr->p_offset + phdr->p_memsz,
+				      PAGE_SIZE);
+			size = end - start;
+			phdr->p_offset = vmcore_off + (phdr->p_offset - start);
+			vmcore_off += size;
+		}
+	} else {
+		Elf32_Ehdr *ehdr = (Elf32_Ehdr *)elfptr;
+		Elf32_Phdr *phdr = (Elf32_Phdr *)(elfptr + sizeof(Elf32_Ehdr));
+
+		/* Update all program headers */
+		for (i = 0; i < ehdr->e_phnum; i++, phdr++) {
+			if (phdr->p_type == PT_NOTE) {
+				/* Update note size */
+				phdr->p_memsz = elfnotes_orig_sz + vmcoreddsz;
+				phdr->p_filesz = phdr->p_memsz;
+				continue;
+			}
+
+			start = rounddown(phdr->p_offset, PAGE_SIZE);
+			end = roundup(phdr->p_offset + phdr->p_memsz,
+				      PAGE_SIZE);
+			size = end - start;
+			phdr->p_offset = vmcore_off + (phdr->p_offset - start);
+			vmcore_off += size;
+		}
+	}
+}
+
+/**
+ * vmcoredd_update_size - Update the total size of the device dumps and update
+ * Elf header
+ * @dump_size: Size of the current device dump to be added to total size
+ *
+ * Update the total size of all the device dumps and update the Elf program
+ * headers. Calculate the new offsets for the vmcore list and update the
+ * total vmcore size.
+ */
+static void vmcoredd_update_size(size_t dump_size)
+{
+	vmcoredd_orig_sz += dump_size;
+	elfnotes_sz = roundup(elfnotes_orig_sz, PAGE_SIZE) + vmcoredd_orig_sz;
+	vmcoredd_update_program_headers(elfcorebuf, elfnotes_sz,
+					vmcoredd_orig_sz);
+
+	/* Update vmcore list offsets */
+	set_vmcore_list_offsets(elfcorebuf_sz, elfnotes_sz, &vmcore_list);
+
+	vmcore_size = get_vmcore_size(elfcorebuf_sz, elfnotes_sz,
+				      &vmcore_list);
+	proc_vmcore->size = vmcore_size;
+}
+
 /**
  * vmcore_add_device_dump - Add a buffer containing device dump to vmcore
  * @data: dump info.
@@ -1239,6 +1477,7 @@ static int __vmcore_add_device_dump(struct vmcoredd_data *data)
 	list_add_tail(&dump->list, &vmcoredd_list);
 	mutex_unlock(&vmcoredd_mutex);
 
+	vmcoredd_update_size(data_size);
 	return 0;
 
 out_err:
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next v5 3/3] cxgb4: collect hardware dump in second kernel
From: Rahul Lakkireddy @ 2018-04-21 17:05 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA,
	kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: indranil-ut6Up61K2wZBDgjK7y7TUQ, nirranjan-ut6Up61K2wZBDgjK7y7TUQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	ganeshgr-ut6Up61K2wZBDgjK7y7TUQ, Rahul Lakkireddy,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn
In-Reply-To: <cover.1524329561.git.rahul.lakkireddy-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>

Register callback to collect hardware/firmware dumps in second kernel
before hardware/firmware is initialized. The dumps for each device
will be available as elf notes in /proc/vmcore in second kernel.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
Signed-off-by: Ganesh Goudar <ganeshgr-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
---
v5:
- No changes.

v4:
- No changes.

v3:
- Replaced all crashdd* with vmcoredd*.
- Replaced crashdd_add_dump() with vmcore_add_device_dump().
- Updated comments and commit message.

v2:
- No Changes.

Changes since rfc v2:
- Update comments and commit message for sysfs change.

rfc v2:
- Updated dump registration to the new API in patch 1.

 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h       |  4 ++++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c | 25 ++++++++++++++++++++++++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h |  3 +++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c  | 10 ++++++++++
 4 files changed, 42 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 688f95440af2..01e7aad4ce5b 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -50,6 +50,7 @@
 #include <linux/net_tstamp.h>
 #include <linux/ptp_clock_kernel.h>
 #include <linux/ptp_classify.h>
+#include <linux/crash_dump.h>
 #include <asm/io.h>
 #include "t4_chip_type.h"
 #include "cxgb4_uld.h"
@@ -964,6 +965,9 @@ struct adapter {
 	struct hma_data hma;
 
 	struct srq_data *srq;
+
+	/* Dump buffer for collecting logs in kdump kernel */
+	struct vmcoredd_data vmcoredd;
 };
 
 /* Support for "sched-class" command to allow a TX Scheduling Class to be
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
index 143686c60234..76433d4fe483 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
@@ -488,3 +488,28 @@ void cxgb4_init_ethtool_dump(struct adapter *adapter)
 	adapter->eth_dump.version = adapter->params.fw_vers;
 	adapter->eth_dump.len = 0;
 }
+
+static int cxgb4_cudbg_vmcoredd_collect(struct vmcoredd_data *data, void *buf)
+{
+	struct adapter *adap = container_of(data, struct adapter, vmcoredd);
+	u32 len = data->size;
+
+	return cxgb4_cudbg_collect(adap, buf, &len, CXGB4_ETH_DUMP_ALL);
+}
+
+int cxgb4_cudbg_vmcore_add_dump(struct adapter *adap)
+{
+	struct vmcoredd_data *data = &adap->vmcoredd;
+	u32 len;
+
+	len = sizeof(struct cudbg_hdr) +
+	      sizeof(struct cudbg_entity_hdr) * CUDBG_MAX_ENTITY;
+	len += CUDBG_DUMP_BUFF_SIZE;
+
+	data->size = len;
+	snprintf(data->name, sizeof(data->name), "%s_%s", cxgb4_driver_name,
+		 adap->name);
+	data->vmcoredd_callback = cxgb4_cudbg_vmcoredd_collect;
+
+	return vmcore_add_device_dump(data);
+}
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h
index ce1ac9a1c878..ef59ba1ed968 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h
@@ -41,8 +41,11 @@ enum CXGB4_ETHTOOL_DUMP_FLAGS {
 	CXGB4_ETH_DUMP_HW = (1 << 1), /* various FW and HW dumps */
 };
 
+#define CXGB4_ETH_DUMP_ALL (CXGB4_ETH_DUMP_MEM | CXGB4_ETH_DUMP_HW)
+
 u32 cxgb4_get_dump_length(struct adapter *adap, u32 flag);
 int cxgb4_cudbg_collect(struct adapter *adap, void *buf, u32 *buf_size,
 			u32 flag);
 void cxgb4_init_ethtool_dump(struct adapter *adapter);
+int cxgb4_cudbg_vmcore_add_dump(struct adapter *adap);
 #endif /* __CXGB4_CUDBG_H__ */
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 24d2865b8806..32cad0acf76c 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -5544,6 +5544,16 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (err)
 		goto out_free_adapter;
 
+	if (is_kdump_kernel()) {
+		/* Collect hardware state and append to /proc/vmcore */
+		err = cxgb4_cudbg_vmcore_add_dump(adapter);
+		if (err) {
+			dev_warn(adapter->pdev_dev,
+				 "Fail collecting vmcore device dump, err: %d. Continuing\n",
+				 err);
+			err = 0;
+		}
+	}
 
 	if (!is_t4(adapter->params.chip)) {
 		s_qpp = (QUEUESPERPAGEPF0_S +
-- 
2.14.1

^ permalink raw reply related

* Re: BUG: unable to handle kernel paging request in compat_copy_entries
From: Eric Biggers @ 2018-04-21 18:18 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: syzbot, bridge, coreteam, davem, fw, kadlec, linux-kernel, netdev,
	netfilter-devel, pablo, stephen, syzkaller-bugs
In-Reply-To: <1520263080.5898.7.camel@redhat.com>

On Mon, Mar 05, 2018 at 04:18:00PM +0100, Paolo Abeni wrote:
> On Mon, 2018-03-05 at 00:21 -0800, syzbot wrote:
> > Hello,
> > 
> > syzbot hit the following crash on upstream commit
> > 5fbdefcf685defd8bc5a8f37b17538d25c58d77a (Fri Mar 2 21:05:20 2018 +0000)
> > Merge branch 'parisc-4.16-1' of  
> > git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
> > 
> > So far this crash happened 5 times on upstream.
> > syzkaller reproducer is attached.
> > Raw console output is attached.
> > compiler: gcc (GCC) 7.1.1 20170620
> > .config is attached.
> > user-space arch: i386
> > 
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+5705ba91388d7bc30828@syzkaller.appspotmail.com
> > It will help syzbot understand when the bug is fixed. See footer for  
> > details.
> > If you forward the report, please keep this part and the footer.
> > 
> > audit: type=1400 audit(1520098078.492:8): avc:  denied  { map } for   
> > pid=4239 comm="syz-execprog" path="/root/syzkaller-shm255959590" dev="sda1"  
> > ino=16482 scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023  
> > tcontext=unconfined_u:object_r:file_t:s0 tclass=file permissive=1
> > IPVS: ftp: loaded support on port[0] = 21
> > BUG: unable to handle kernel paging request at ffffc90001819e4f
> > IP: ebt_size_mwt net/bridge/netfilter/ebtables.c:2037 [inline]
> > IP: size_entry_mwt net/bridge/netfilter/ebtables.c:2122 [inline]
> > IP: compat_copy_entries+0x49f/0x1050 net/bridge/netfilter/ebtables.c:2160
> > PGD 1dad2f067 P4D 1dad2f067 PUD 1dad30067 PMD 1b2408067 PTE 0
> > Oops: 0000 [#1] SMP KASAN
> > Dumping ftrace buffer:
> >     (ftrace buffer empty)
> > Modules linked in:
> > CPU: 1 PID: 4249 Comm: syz-executor0 Not tainted 4.16.0-rc3+ #248
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
> > Google 01/01/2011
> > RIP: 0010:ebt_size_mwt net/bridge/netfilter/ebtables.c:2037 [inline]
> > RIP: 0010:size_entry_mwt net/bridge/netfilter/ebtables.c:2122 [inline]
> > RIP: 0010:compat_copy_entries+0x49f/0x1050  
> > net/bridge/netfilter/ebtables.c:2160
> > RSP: 0018:ffff8801b34bf7e8 EFLAGS: 00010246
> > RAX: 000000000000000a RBX: ffff8801b34bf9d4 RCX: ffffc90001819e4f
> > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8801b34bf9d8
> > RBP: ffff8801b34bf968 R08: 0000000000000000 R09: 0000000000000000
> > R10: ffffffff88613340 R11: 0000000000000001 R12: 000000000000ee5f
> > R13: dffffc0000000000 R14: ffff8801b34bf9c8 R15: ffffc90001819e2f
> > FS:  0000000000000000(0000) GS:ffff8801db300000(0063) knlGS:00000000085b9900
> > CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> > CR2: ffffc90001819e4f CR3: 00000001b2bd7003 CR4: 00000000001606e0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> >   compat_do_replace+0x398/0x7c0 net/bridge/netfilter/ebtables.c:2249
> >   compat_do_ebt_set_ctl+0x22a/0x2d0 net/bridge/netfilter/ebtables.c:2330
> >   compat_nf_sockopt net/netfilter/nf_sockopt.c:144 [inline]
> >   compat_nf_setsockopt+0x88/0x130 net/netfilter/nf_sockopt.c:156
> >   compat_ip_setsockopt+0x8b/0xd0 net/ipv4/ip_sockglue.c:1285
> >   inet_csk_compat_setsockopt+0x95/0x120 net/ipv4/inet_connection_sock.c:1041
> >   compat_tcp_setsockopt+0x3d/0x70 net/ipv4/tcp.c:2916
> >   compat_sock_common_setsockopt+0xb2/0x140 net/core/sock.c:2986
> >   C_SYSC_setsockopt net/compat.c:403 [inline]
> >   compat_SyS_setsockopt+0x17c/0x410 net/compat.c:386
> >   do_syscall_32_irqs_on arch/x86/entry/common.c:330 [inline]
> >   do_fast_syscall_32+0x3ec/0xf9f arch/x86/entry/common.c:392
> >   entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
> > RIP: 0023:0xf7fbbc99
> > RSP: 002b:00000000ffd5ab8c EFLAGS: 00000286 ORIG_RAX: 000000000000016e
> > RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000000000
> > RDX: 0000000000000080 RSI: 0000000020000280 RDI: 0000000000000208
> > RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > Code: 8d 4f 20 48 89 c8 48 89 8d c8 fe ff ff 48 c1 e8 03 42 0f b6 14 28 48  
> > 89 c8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 b2 0a 00 00 <45> 8b 67 20  
> > 44 39 a5 04 ff ff ff 0f 82 bd 08 00 00 e8 cb 52 56
> > RIP: ebt_size_mwt net/bridge/netfilter/ebtables.c:2037 [inline] RSP:  
> > ffff8801b34bf7e8
> > RIP: size_entry_mwt net/bridge/netfilter/ebtables.c:2122 [inline] RSP:  
> > ffff8801b34bf7e8
> > RIP: compat_copy_entries+0x49f/0x1050 net/bridge/netfilter/ebtables.c:2160  
> > RSP: ffff8801b34bf7e8
> > CR2: ffffc90001819e4f
> > ---[ end trace cf111332eb971f16 ]---
> > 
> > 
> > ---
> > This bug is generated by a dumb bot. It may contain errors.
> > See https://goo.gl/tpsmEJ for details.
> > Direct all questions to syzkaller@googlegroups.com.
> > 
> > syzbot will keep track of this bug report.
> > If you forgot to add the Reported-by tag, once the fix for this bug is  
> > merged
> > into any tree, please reply to this email with:
> > #syz fix: exact-commit-title
> > If you want to test a patch for this bug, please reply with:
> > #syz test: git://repo/address.git branch
> > and provide the patch inline or as an attachment.
> > To mark this as a duplicate of another syzbot report, please reply with:
> > #syz dup: exact-subject-of-another-report
> > If it's a one-off invalid bug report, please reply with:
> > #syz invalid
> > Note: if the crash happens again, it will cause creation of a new bug  
> > report.
> > Note: all commands must start from beginning of the line in the email body.
> 
> #syz fix: netfilter: ebtables: add CONFIG_COMPAT support
>

Wrong commit title.  The fix for this actually was:

#syz fix: netfilter: ebtables: CONFIG_COMPAT: don't trust userland offsets

- Eric

^ permalink raw reply

* [PATCH net-next 0/6] net: Extend availability of PHY statistics
From: Florian Fainelli @ 2018-04-21 18:55 UTC (permalink / raw)
  To: netdev
  Cc: Florian Fainelli, Andrew Lunn, Vivien Didelot, David S. Miller,
	cphealy

Hi all,

This patch series adds support for retrieving PHY statistics with DSA switches
when the CPU port uses a PHY to PHY connection (as opposed to MAC to MAC).
To get there a number of things are done:

- first we move the code dealing with PHY statistics outside of net/core/ethtool.c
  and create helper functions since the same code will be reused
- then we allow network device drivers to provide an ethtool_get_phy_stats callback
  when the standard PHY library helpers are not suitable
- we update the DSA functions dealing with ethtool operations to get passed a
  stringset instead of assuming ETH_SS_STATS like they currently do
- then we provide a set of standard helpers within DSA as a framework and add
  the plumbing to allow retrieving the PHY statistics of the CPU port(s)
- finally plug support for retrieving such PHY statistics with the b53 driver

Florian Fainelli (6):
  net: Move PHY statistics code into PHY library helpers
  net: Allow network devices to have PHY statistics
  net: dsa: Pass stringset to ethtool operations
  net: dsa: Add helper function to obtain PHY device of a given port
  net: dsa: Allow providing PHY statistics from CPU port
  net: dsa: b53: Add support for reading PHY statistics

 drivers/net/dsa/b53/b53_common.c       | 66 ++++++++++++++++++++++---
 drivers/net/dsa/b53/b53_priv.h         |  6 ++-
 drivers/net/dsa/bcm_sf2.c              |  1 +
 drivers/net/dsa/dsa_loop.c             | 11 ++++-
 drivers/net/dsa/lan9303-core.c         | 11 ++++-
 drivers/net/dsa/microchip/ksz_common.c | 11 ++++-
 drivers/net/dsa/mt7530.c               | 11 ++++-
 drivers/net/dsa/mv88e6xxx/chip.c       | 10 +++-
 drivers/net/dsa/qca8k.c                | 10 +++-
 drivers/net/phy/phy.c                  | 48 ++++++++++++++++++
 include/linux/ethtool.h                |  5 ++
 include/linux/phy.h                    |  4 ++
 include/net/dsa.h                      | 12 ++++-
 net/core/ethtool.c                     | 61 ++++++++---------------
 net/dsa/master.c                       | 41 +++++++++++++---
 net/dsa/port.c                         | 90 +++++++++++++++++++++++++++++-----
 net/dsa/slave.c                        |  5 +-
 17 files changed, 320 insertions(+), 83 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [PATCH net-next 1/6] net: Move PHY statistics code into PHY library helpers
From: Florian Fainelli @ 2018-04-21 18:55 UTC (permalink / raw)
  To: netdev
  Cc: Florian Fainelli, Andrew Lunn, Vivien Didelot, David S. Miller,
	cphealy
In-Reply-To: <1524336953-26108-1-git-send-email-f.fainelli@gmail.com>

In order to make it possible for network device drivers that do not
necessarily have a phy_device attached, but still report PHY statistics,
have a preliminary refactoring consisting in creating helper functions
that encapsulate the PHY device driver knowledge within PHYLIB.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/phy/phy.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/phy.h   |  4 ++++
 net/core/ethtool.c    | 38 ++++++++------------------------------
 3 files changed, 60 insertions(+), 30 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 05c1e8ef15e6..a98ed12c0009 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -1277,3 +1277,51 @@ int phy_ethtool_nway_reset(struct net_device *ndev)
 	return phy_restart_aneg(phydev);
 }
 EXPORT_SYMBOL(phy_ethtool_nway_reset);
+
+int phy_ethtool_get_strings(struct phy_device *phydev, u8 *data)
+{
+	if (!phydev->drv)
+		return -EIO;
+
+	mutex_lock(&phydev->lock);
+	phydev->drv->get_strings(phydev, data);
+	mutex_unlock(&phydev->lock);
+
+	return 0;
+}
+EXPORT_SYMBOL(phy_ethtool_get_strings);
+
+int phy_ethtool_get_sset_count(struct phy_device *phydev)
+{
+	int ret;
+
+	if (!phydev->drv)
+		return -EIO;
+
+	if (phydev->drv->get_sset_count &&
+	    phydev->drv->get_strings &&
+	    phydev->drv->get_stats) {
+		mutex_lock(&phydev->lock);
+		ret = phydev->drv->get_sset_count(phydev);
+		mutex_unlock(&phydev->lock);
+
+		return ret;
+	}
+
+	return -EOPNOTSUPP;
+}
+EXPORT_SYMBOL(phy_ethtool_get_sset_count);
+
+int phy_ethtool_get_stats(struct phy_device *phydev,
+			  struct ethtool_stats *stats, u64 *data)
+{
+	if (!phydev->drv)
+		return -EIO;
+
+	mutex_lock(&phydev->lock);
+	phydev->drv->get_stats(phydev, stats, data);
+	mutex_unlock(&phydev->lock);
+
+	return 0;
+}
+EXPORT_SYMBOL(phy_ethtool_get_stats);
diff --git a/include/linux/phy.h b/include/linux/phy.h
index f0b5870a6d40..c16f6d90044f 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -1062,6 +1062,10 @@ int phy_ethtool_get_link_ksettings(struct net_device *ndev,
 int phy_ethtool_set_link_ksettings(struct net_device *ndev,
 				   const struct ethtool_link_ksettings *cmd);
 int phy_ethtool_nway_reset(struct net_device *ndev);
+int phy_ethtool_get_strings(struct phy_device *phydev, u8 *data);
+int phy_ethtool_get_sset_count(struct phy_device *phydev);
+int phy_ethtool_get_stats(struct phy_device *phydev,
+			  struct ethtool_stats *stats, u64 *data);
 
 #if IS_ENABLED(CONFIG_PHYLIB)
 int __init mdio_bus_init(void);
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 03416e6dd5d7..f0d42e093c4a 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -210,23 +210,6 @@ static int ethtool_set_features(struct net_device *dev, void __user *useraddr)
 	return ret;
 }
 
-static int phy_get_sset_count(struct phy_device *phydev)
-{
-	int ret;
-
-	if (phydev->drv->get_sset_count &&
-	    phydev->drv->get_strings &&
-	    phydev->drv->get_stats) {
-		mutex_lock(&phydev->lock);
-		ret = phydev->drv->get_sset_count(phydev);
-		mutex_unlock(&phydev->lock);
-
-		return ret;
-	}
-
-	return -EOPNOTSUPP;
-}
-
 static int __ethtool_get_sset_count(struct net_device *dev, int sset)
 {
 	const struct ethtool_ops *ops = dev->ethtool_ops;
@@ -245,7 +228,7 @@ static int __ethtool_get_sset_count(struct net_device *dev, int sset)
 
 	if (sset == ETH_SS_PHY_STATS) {
 		if (dev->phydev)
-			return phy_get_sset_count(dev->phydev);
+			return phy_ethtool_get_sset_count(dev->phydev);
 		else
 			return -EOPNOTSUPP;
 	}
@@ -272,15 +255,10 @@ static void __ethtool_get_strings(struct net_device *dev,
 	else if (stringset == ETH_SS_PHY_TUNABLES)
 		memcpy(data, phy_tunable_strings, sizeof(phy_tunable_strings));
 	else if (stringset == ETH_SS_PHY_STATS) {
-		struct phy_device *phydev = dev->phydev;
-
-		if (phydev) {
-			mutex_lock(&phydev->lock);
-			phydev->drv->get_strings(phydev, data);
-			mutex_unlock(&phydev->lock);
-		} else {
+		if (dev->phydev)
+			phy_ethtool_get_strings(dev->phydev, data);
+		else
 			return;
-		}
 	} else
 		/* ops->get_strings is valid because checked earlier */
 		ops->get_strings(dev, stringset, data);
@@ -2001,7 +1979,7 @@ static int ethtool_get_phy_stats(struct net_device *dev, void __user *useraddr)
 	if (!phydev)
 		return -EOPNOTSUPP;
 
-	n_stats = phy_get_sset_count(phydev);
+	n_stats = phy_ethtool_get_sset_count(dev->phydev);
 	if (n_stats < 0)
 		return n_stats;
 	if (n_stats > S32_MAX / sizeof(u64))
@@ -2016,9 +1994,9 @@ static int ethtool_get_phy_stats(struct net_device *dev, void __user *useraddr)
 	if (n_stats && !data)
 		return -ENOMEM;
 
-	mutex_lock(&phydev->lock);
-	phydev->drv->get_stats(phydev, &stats, data);
-	mutex_unlock(&phydev->lock);
+	ret = phy_ethtool_get_stats(dev->phydev, &stats, data);
+	if (ret < 0)
+		return ret;
 
 	ret = -EFAULT;
 	if (copy_to_user(useraddr, &stats, sizeof(stats)))
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 2/6] net: Allow network devices to have PHY statistics
From: Florian Fainelli @ 2018-04-21 18:55 UTC (permalink / raw)
  To: netdev
  Cc: Florian Fainelli, Andrew Lunn, Vivien Didelot, David S. Miller,
	cphealy
In-Reply-To: <1524336953-26108-1-git-send-email-f.fainelli@gmail.com>

Add a new callback: get_ethtool_phy_stats() which allows network device
drivers not making use of the PHY library to return PHY statistics.
Update ethtool_get_phy_stats(), __ethtool_get_sset_count() and
__ethtool_get_strings() accordingly to interogate the network device
about ETH_SS_PHY_STATS.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 include/linux/ethtool.h |  5 +++++
 net/core/ethtool.c      | 39 +++++++++++++++++++++------------------
 2 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index ebe41811ed34..9b19556f0156 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -310,6 +310,9 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32,
  *	fields should be ignored (use %__ETHTOOL_LINK_MODE_MASK_NBITS
  *	instead of the latter), any change to them will be overwritten
  *	by kernel. Returns a negative error code or zero.
+ * @get_ethtool_phy_stats: Return extended statistics about the PHY device.
+ *	This is only useful if the device maintains PHY statistics and
+ *	cannot use the standard PHY library helpers.
  *
  * All operations are optional (i.e. the function pointer may be set
  * to %NULL) and callers must take this into account.  Callers must
@@ -405,5 +408,7 @@ struct ethtool_ops {
 				      struct ethtool_fecparam *);
 	int	(*set_fecparam)(struct net_device *,
 				      struct ethtool_fecparam *);
+	void	(*get_ethtool_phy_stats)(struct net_device *,
+					 struct ethtool_stats *, u64 *);
 };
 #endif /* _LINUX_ETHTOOL_H */
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index f0d42e093c4a..4b8992ccf904 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -226,12 +226,9 @@ static int __ethtool_get_sset_count(struct net_device *dev, int sset)
 	if (sset == ETH_SS_PHY_TUNABLES)
 		return ARRAY_SIZE(phy_tunable_strings);
 
-	if (sset == ETH_SS_PHY_STATS) {
-		if (dev->phydev)
-			return phy_ethtool_get_sset_count(dev->phydev);
-		else
-			return -EOPNOTSUPP;
-	}
+	if (sset == ETH_SS_PHY_STATS && dev->phydev &&
+	    !ops->get_ethtool_phy_stats)
+		return phy_ethtool_get_sset_count(dev->phydev);
 
 	if (ops->get_sset_count && ops->get_strings)
 		return ops->get_sset_count(dev, sset);
@@ -254,12 +251,10 @@ static void __ethtool_get_strings(struct net_device *dev,
 		memcpy(data, tunable_strings, sizeof(tunable_strings));
 	else if (stringset == ETH_SS_PHY_TUNABLES)
 		memcpy(data, phy_tunable_strings, sizeof(phy_tunable_strings));
-	else if (stringset == ETH_SS_PHY_STATS) {
-		if (dev->phydev)
-			phy_ethtool_get_strings(dev->phydev, data);
-		else
-			return;
-	} else
+	else if (stringset == ETH_SS_PHY_STATS && dev->phydev &&
+		 !ops->get_ethtool_phy_stats)
+		phy_ethtool_get_strings(dev->phydev, data);
+	else
 		/* ops->get_strings is valid because checked earlier */
 		ops->get_strings(dev, stringset, data);
 }
@@ -1971,15 +1966,19 @@ static int ethtool_get_stats(struct net_device *dev, void __user *useraddr)
 
 static int ethtool_get_phy_stats(struct net_device *dev, void __user *useraddr)
 {
-	struct ethtool_stats stats;
+	const struct ethtool_ops *ops = dev->ethtool_ops;
 	struct phy_device *phydev = dev->phydev;
+	struct ethtool_stats stats;
 	u64 *data;
 	int ret, n_stats;
 
-	if (!phydev)
+	if (!phydev && (!ops->get_ethtool_phy_stats || !ops->get_sset_count))
 		return -EOPNOTSUPP;
 
-	n_stats = phy_ethtool_get_sset_count(dev->phydev);
+	if (dev->phydev && !ops->get_ethtool_phy_stats)
+		n_stats = phy_ethtool_get_sset_count(dev->phydev);
+	else
+		n_stats = ops->get_sset_count(dev, ETH_SS_PHY_STATS);
 	if (n_stats < 0)
 		return n_stats;
 	if (n_stats > S32_MAX / sizeof(u64))
@@ -1994,9 +1993,13 @@ static int ethtool_get_phy_stats(struct net_device *dev, void __user *useraddr)
 	if (n_stats && !data)
 		return -ENOMEM;
 
-	ret = phy_ethtool_get_stats(dev->phydev, &stats, data);
-	if (ret < 0)
-		return ret;
+	if (dev->phydev && !ops->get_ethtool_phy_stats) {
+		ret = phy_ethtool_get_stats(dev->phydev, &stats, data);
+		if (ret < 0)
+			return ret;
+	} else {
+		ops->get_ethtool_phy_stats(dev, &stats, data);
+	}
 
 	ret = -EFAULT;
 	if (copy_to_user(useraddr, &stats, sizeof(stats)))
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 3/6] net: dsa: Pass stringset to ethtool operations
From: Florian Fainelli @ 2018-04-21 18:55 UTC (permalink / raw)
  To: netdev
  Cc: Florian Fainelli, Andrew Lunn, Vivien Didelot, David S. Miller,
	cphealy
In-Reply-To: <1524336953-26108-1-git-send-email-f.fainelli@gmail.com>

Up until now we largely assumed that we were interested in ETH_SS_STATS
type of strings for all ethtool operations, this is about to change with
the introduction of additional string sets, e.g: ETH_SS_PHY_STATS.
Update all functions to take an appropriate stringset argument and act
on it when it is different than ETH_SS_STATS for now.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/dsa/b53/b53_common.c       | 11 +++++++++--
 drivers/net/dsa/b53/b53_priv.h         |  5 +++--
 drivers/net/dsa/dsa_loop.c             | 11 +++++++++--
 drivers/net/dsa/lan9303-core.c         | 11 +++++++++--
 drivers/net/dsa/microchip/ksz_common.c | 11 +++++++++--
 drivers/net/dsa/mt7530.c               | 11 +++++++++--
 drivers/net/dsa/mv88e6xxx/chip.c       | 10 ++++++++--
 drivers/net/dsa/qca8k.c                | 10 ++++++++--
 include/net/dsa.h                      |  5 +++--
 net/dsa/master.c                       | 21 +++++++++++++--------
 net/dsa/slave.c                        |  5 +++--
 11 files changed, 83 insertions(+), 28 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 78616787f2a3..726b2d8c6fe9 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -806,13 +806,17 @@ static unsigned int b53_get_mib_size(struct b53_device *dev)
 		return B53_MIBS_SIZE;
 }
 
-void b53_get_strings(struct dsa_switch *ds, int port, uint8_t *data)
+void b53_get_strings(struct dsa_switch *ds, int port, u32 stringset,
+		     uint8_t *data)
 {
 	struct b53_device *dev = ds->priv;
 	const struct b53_mib_desc *mibs = b53_get_mib(dev);
 	unsigned int mib_size = b53_get_mib_size(dev);
 	unsigned int i;
 
+	if (stringset != ETH_SS_STATS)
+		return;
+
 	for (i = 0; i < mib_size; i++)
 		strlcpy(data + i * ETH_GSTRING_LEN,
 			mibs[i].name, ETH_GSTRING_LEN);
@@ -852,10 +856,13 @@ void b53_get_ethtool_stats(struct dsa_switch *ds, int port, uint64_t *data)
 }
 EXPORT_SYMBOL(b53_get_ethtool_stats);
 
-int b53_get_sset_count(struct dsa_switch *ds, int port)
+int b53_get_sset_count(struct dsa_switch *ds, int port, int sset)
 {
 	struct b53_device *dev = ds->priv;
 
+	if (sset != ETH_SS_STATS)
+		return 0;
+
 	return b53_get_mib_size(dev);
 }
 EXPORT_SYMBOL(b53_get_sset_count);
diff --git a/drivers/net/dsa/b53/b53_priv.h b/drivers/net/dsa/b53/b53_priv.h
index 1187ebd79287..b933d5cb5c2d 100644
--- a/drivers/net/dsa/b53/b53_priv.h
+++ b/drivers/net/dsa/b53/b53_priv.h
@@ -286,9 +286,10 @@ static inline int b53_switch_get_reset_gpio(struct b53_device *dev)
 /* Exported functions towards other drivers */
 void b53_imp_vlan_setup(struct dsa_switch *ds, int cpu_port);
 int b53_configure_vlan(struct dsa_switch *ds);
-void b53_get_strings(struct dsa_switch *ds, int port, uint8_t *data);
+void b53_get_strings(struct dsa_switch *ds, int port, u32 stringset,
+		     uint8_t *data);
 void b53_get_ethtool_stats(struct dsa_switch *ds, int port, uint64_t *data);
-int b53_get_sset_count(struct dsa_switch *ds, int port);
+int b53_get_sset_count(struct dsa_switch *ds, int port, int sset);
 int b53_br_join(struct dsa_switch *ds, int port, struct net_device *bridge);
 void b53_br_leave(struct dsa_switch *ds, int port, struct net_device *bridge);
 void b53_br_set_stp_state(struct dsa_switch *ds, int port, u8 state);
diff --git a/drivers/net/dsa/dsa_loop.c b/drivers/net/dsa/dsa_loop.c
index f77be9f85cb3..9354cc08d3fd 100644
--- a/drivers/net/dsa/dsa_loop.c
+++ b/drivers/net/dsa/dsa_loop.c
@@ -86,16 +86,23 @@ static int dsa_loop_setup(struct dsa_switch *ds)
 	return 0;
 }
 
-static int dsa_loop_get_sset_count(struct dsa_switch *ds, int port)
+static int dsa_loop_get_sset_count(struct dsa_switch *ds, int port, int sset)
 {
+	if (sset != ETH_SS_STATS)
+		return 0;
+
 	return __DSA_LOOP_CNT_MAX;
 }
 
-static void dsa_loop_get_strings(struct dsa_switch *ds, int port, uint8_t *data)
+static void dsa_loop_get_strings(struct dsa_switch *ds, int port,
+				 u32 stringset, uint8_t *data)
 {
 	struct dsa_loop_priv *ps = ds->priv;
 	unsigned int i;
 
+	if (stringset != ETH_SS_STATS)
+		return;
+
 	for (i = 0; i < __DSA_LOOP_CNT_MAX; i++)
 		memcpy(data + i * ETH_GSTRING_LEN,
 		       ps->ports[port].mib[i].name, ETH_GSTRING_LEN);
diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c
index fefa454f3e56..b4f6e1a67dd9 100644
--- a/drivers/net/dsa/lan9303-core.c
+++ b/drivers/net/dsa/lan9303-core.c
@@ -977,10 +977,14 @@ static const struct lan9303_mib_desc lan9303_mib[] = {
 	{ .offset = LAN9303_MAC_TX_LATECOL_0, .name = "TxLateCol", },
 };
 
-static void lan9303_get_strings(struct dsa_switch *ds, int port, uint8_t *data)
+static void lan9303_get_strings(struct dsa_switch *ds, int port,
+				u32 stringset, uint8_t *data)
 {
 	unsigned int u;
 
+	if (stringset != ETH_SS_STATS)
+		return;
+
 	for (u = 0; u < ARRAY_SIZE(lan9303_mib); u++) {
 		strncpy(data + u * ETH_GSTRING_LEN, lan9303_mib[u].name,
 			ETH_GSTRING_LEN);
@@ -1007,8 +1011,11 @@ static void lan9303_get_ethtool_stats(struct dsa_switch *ds, int port,
 	}
 }
 
-static int lan9303_get_sset_count(struct dsa_switch *ds, int port)
+static int lan9303_get_sset_count(struct dsa_switch *ds, int port, int sset)
 {
+	if (sset != ETH_SS_STATS)
+		return 0;
+
 	return ARRAY_SIZE(lan9303_mib);
 }
 
diff --git a/drivers/net/dsa/microchip/ksz_common.c b/drivers/net/dsa/microchip/ksz_common.c
index bcb3e6c734f2..7210c49b7922 100644
--- a/drivers/net/dsa/microchip/ksz_common.c
+++ b/drivers/net/dsa/microchip/ksz_common.c
@@ -439,15 +439,22 @@ static void ksz_disable_port(struct dsa_switch *ds, int port,
 	ksz_port_cfg(dev, port, REG_PORT_CTRL_0, PORT_MAC_LOOPBACK, true);
 }
 
-static int ksz_sset_count(struct dsa_switch *ds, int port)
+static int ksz_sset_count(struct dsa_switch *ds, int port, int sset)
 {
+	if (sset != ETH_SS_STATS)
+		return 0;
+
 	return TOTAL_SWITCH_COUNTER_NUM;
 }
 
-static void ksz_get_strings(struct dsa_switch *ds, int port, uint8_t *buf)
+static void ksz_get_strings(struct dsa_switch *ds, int port,
+			    u32 stringset, uint8_t *buf)
 {
 	int i;
 
+	if (stringset != ETH_SS_STATS)
+		return;
+
 	for (i = 0; i < TOTAL_SWITCH_COUNTER_NUM; i++) {
 		memcpy(buf + i * ETH_GSTRING_LEN, mib_names[i].string,
 		       ETH_GSTRING_LEN);
diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c
index 80a4dbc3a499..62e486652e62 100644
--- a/drivers/net/dsa/mt7530.c
+++ b/drivers/net/dsa/mt7530.c
@@ -573,10 +573,14 @@ static int mt7530_phy_write(struct dsa_switch *ds, int port, int regnum,
 }
 
 static void
-mt7530_get_strings(struct dsa_switch *ds, int port, uint8_t *data)
+mt7530_get_strings(struct dsa_switch *ds, int port, u32 stringset,
+		   uint8_t *data)
 {
 	int i;
 
+	if (stringset != ETH_SS_STATS)
+		return;
+
 	for (i = 0; i < ARRAY_SIZE(mt7530_mib); i++)
 		strncpy(data + i * ETH_GSTRING_LEN, mt7530_mib[i].name,
 			ETH_GSTRING_LEN);
@@ -604,8 +608,11 @@ mt7530_get_ethtool_stats(struct dsa_switch *ds, int port,
 }
 
 static int
-mt7530_get_sset_count(struct dsa_switch *ds, int port)
+mt7530_get_sset_count(struct dsa_switch *ds, int port, int sset)
 {
+	if (sset != ETH_SS_STATS)
+		return 0;
+
 	return ARRAY_SIZE(mt7530_mib);
 }
 
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 3d2091099f7f..8f92ccc0dd54 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -742,11 +742,14 @@ static void mv88e6xxx_atu_vtu_get_strings(uint8_t *data)
 }
 
 static void mv88e6xxx_get_strings(struct dsa_switch *ds, int port,
-				  uint8_t *data)
+				  u32 stringset, uint8_t *data)
 {
 	struct mv88e6xxx_chip *chip = ds->priv;
 	int count = 0;
 
+	if (stringset != ETH_SS_STATS)
+		return;
+
 	mutex_lock(&chip->reg_lock);
 
 	if (chip->info->ops->stats_get_strings)
@@ -789,12 +792,15 @@ static int mv88e6320_stats_get_sset_count(struct mv88e6xxx_chip *chip)
 					      STATS_TYPE_BANK1);
 }
 
-static int mv88e6xxx_get_sset_count(struct dsa_switch *ds, int port)
+static int mv88e6xxx_get_sset_count(struct dsa_switch *ds, int port, int sset)
 {
 	struct mv88e6xxx_chip *chip = ds->priv;
 	int serdes_count = 0;
 	int count = 0;
 
+	if (sset != ETH_SS_STATS)
+		return 0;
+
 	mutex_lock(&chip->reg_lock);
 	if (chip->info->ops->stats_get_sset_count)
 		count = chip->info->ops->stats_get_sset_count(chip);
diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index 600d5ad1fbde..757b6d90ea36 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -600,10 +600,13 @@ qca8k_phy_write(struct dsa_switch *ds, int phy, int regnum, u16 val)
 }
 
 static void
-qca8k_get_strings(struct dsa_switch *ds, int port, uint8_t *data)
+qca8k_get_strings(struct dsa_switch *ds, int port, u32 stringset, uint8_t *data)
 {
 	int i;
 
+	if (stringset != ETH_SS_STATS)
+		return;
+
 	for (i = 0; i < ARRAY_SIZE(ar8327_mib); i++)
 		strncpy(data + i * ETH_GSTRING_LEN, ar8327_mib[i].name,
 			ETH_GSTRING_LEN);
@@ -631,8 +634,11 @@ qca8k_get_ethtool_stats(struct dsa_switch *ds, int port,
 }
 
 static int
-qca8k_get_sset_count(struct dsa_switch *ds, int port)
+qca8k_get_sset_count(struct dsa_switch *ds, int port, int sset)
 {
+	if (sset != ETH_SS_STATS)
+		return 0;
+
 	return ARRAY_SIZE(ar8327_mib);
 }
 
diff --git a/include/net/dsa.h b/include/net/dsa.h
index 60fb4ec8ba61..0bc0aad1b02e 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -356,10 +356,11 @@ struct dsa_switch_ops {
 	/*
 	 * ethtool hardware statistics.
 	 */
-	void	(*get_strings)(struct dsa_switch *ds, int port, uint8_t *data);
+	void	(*get_strings)(struct dsa_switch *ds, int port,
+			       u32 stringset, uint8_t *data);
 	void	(*get_ethtool_stats)(struct dsa_switch *ds,
 				     int port, uint64_t *data);
-	int	(*get_sset_count)(struct dsa_switch *ds, int port);
+	int	(*get_sset_count)(struct dsa_switch *ds, int port, int sset);
 
 	/*
 	 * ethtool Wake-on-LAN
diff --git a/net/dsa/master.c b/net/dsa/master.c
index 90e6df0351eb..d547f3fb44cc 100644
--- a/net/dsa/master.c
+++ b/net/dsa/master.c
@@ -38,11 +38,14 @@ static int dsa_master_get_sset_count(struct net_device *dev, int sset)
 	struct dsa_switch *ds = cpu_dp->ds;
 	int count = 0;
 
-	if (ops && ops->get_sset_count)
-		count += ops->get_sset_count(dev, sset);
+	if (ops && ops->get_sset_count) {
+		count = ops->get_sset_count(dev, sset);
+		if (count < 0)
+			count = 0;
+	}
 
-	if (sset == ETH_SS_STATS && ds->ops->get_sset_count)
-		count += ds->ops->get_sset_count(ds, cpu_dp->index);
+	if (ds->ops->get_sset_count)
+		count += ds->ops->get_sset_count(ds, cpu_dp->index, sset);
 
 	return count;
 }
@@ -65,18 +68,20 @@ static void dsa_master_get_strings(struct net_device *dev, uint32_t stringset,
 	pfx[sizeof(pfx) - 1] = '_';
 
 	if (ops && ops->get_sset_count && ops->get_strings) {
-		mcount = ops->get_sset_count(dev, ETH_SS_STATS);
+		mcount = ops->get_sset_count(dev, stringset);
+		if (mcount < 0)
+			mcount = 0;
 		ops->get_strings(dev, stringset, data);
 	}
 
-	if (stringset == ETH_SS_STATS && ds->ops->get_strings) {
+	if (ds->ops->get_strings) {
 		ndata = data + mcount * len;
 		/* This function copies ETH_GSTRINGS_LEN bytes, we will mangle
 		 * the output after to prepend our CPU port prefix we
 		 * constructed earlier
 		 */
-		ds->ops->get_strings(ds, port, ndata);
-		count = ds->ops->get_sset_count(ds, port);
+		ds->ops->get_strings(ds, port, stringset, ndata);
+		count = ds->ops->get_sset_count(ds, port, stringset);
 		for (i = 0; i < count; i++) {
 			memmove(ndata + (i * len + sizeof(pfx)),
 				ndata + i * len, len - sizeof(pfx));
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 18561af7a8f1..f3fb3a0880b1 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -560,7 +560,8 @@ static void dsa_slave_get_strings(struct net_device *dev,
 		strncpy(data + 2 * len, "rx_packets", len);
 		strncpy(data + 3 * len, "rx_bytes", len);
 		if (ds->ops->get_strings)
-			ds->ops->get_strings(ds, dp->index, data + 4 * len);
+			ds->ops->get_strings(ds, dp->index, stringset,
+					     data + 4 * len);
 	}
 }
 
@@ -605,7 +606,7 @@ static int dsa_slave_get_sset_count(struct net_device *dev, int sset)
 
 		count = 4;
 		if (ds->ops->get_sset_count)
-			count += ds->ops->get_sset_count(ds, dp->index);
+			count += ds->ops->get_sset_count(ds, dp->index, sset);
 
 		return count;
 	}
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 4/6] net: dsa: Add helper function to obtain PHY device of a given port
From: Florian Fainelli @ 2018-04-21 18:55 UTC (permalink / raw)
  To: netdev
  Cc: Florian Fainelli, Andrew Lunn, Vivien Didelot, David S. Miller,
	cphealy
In-Reply-To: <1524336953-26108-1-git-send-email-f.fainelli@gmail.com>

In preparation for having more call sites attempting to obtain a
reference against a PHY device corresponding to a particular port,
introduce a helper function for that purpose.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 net/dsa/port.c | 33 ++++++++++++++++++++++-----------
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/net/dsa/port.c b/net/dsa/port.c
index 7acc1169d75e..5e2a88720a9a 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -273,25 +273,38 @@ int dsa_port_vlan_del(struct dsa_port *dp,
 	return 0;
 }
 
-static int dsa_port_setup_phy_of(struct dsa_port *dp, bool enable)
+static struct phy_device *dsa_port_get_phy_device(struct dsa_port *dp)
 {
-	struct device_node *port_dn = dp->dn;
 	struct device_node *phy_dn;
-	struct dsa_switch *ds = dp->ds;
 	struct phy_device *phydev;
-	int port = dp->index;
-	int err = 0;
 
-	phy_dn = of_parse_phandle(port_dn, "phy-handle", 0);
+	phy_dn = of_parse_phandle(dp->dn, "phy-handle", 0);
 	if (!phy_dn)
-		return 0;
+		return NULL;
 
 	phydev = of_phy_find_device(phy_dn);
 	if (!phydev) {
-		err = -EPROBE_DEFER;
-		goto err_put_of;
+		of_node_put(phy_dn);
+		return ERR_PTR(-EPROBE_DEFER);
 	}
 
+	return phydev;
+}
+
+static int dsa_port_setup_phy_of(struct dsa_port *dp, bool enable)
+{
+	struct dsa_switch *ds = dp->ds;
+	struct phy_device *phydev;
+	int port = dp->index;
+	int err = 0;
+
+	phydev = dsa_port_get_phy_device(dp);
+	if (!phydev)
+		return 0;
+
+	if (IS_ERR(phydev))
+		return PTR_ERR(phydev);
+
 	if (enable) {
 		err = genphy_config_init(phydev);
 		if (err < 0)
@@ -317,8 +330,6 @@ static int dsa_port_setup_phy_of(struct dsa_port *dp, bool enable)
 
 err_put_dev:
 	put_device(&phydev->mdio.dev);
-err_put_of:
-	of_node_put(phy_dn);
 	return err;
 }
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 5/6] net: dsa: Allow providing PHY statistics from CPU port
From: Florian Fainelli @ 2018-04-21 18:55 UTC (permalink / raw)
  To: netdev
  Cc: Florian Fainelli, Andrew Lunn, Vivien Didelot, David S. Miller,
	cphealy
In-Reply-To: <1524336953-26108-1-git-send-email-f.fainelli@gmail.com>

Implement the same type of ethtool diversion that we have for
ETH_SS_STATS and make it work with ETH_SS_PHY_STATS. This allows
providing PHY level statistics for CPU ports that are directly
connecting to a PHY device.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 include/net/dsa.h |  7 +++++++
 net/dsa/master.c  | 20 +++++++++++++++++++
 net/dsa/port.c    | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 84 insertions(+)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 0bc0aad1b02e..462e9741b210 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -361,6 +361,8 @@ struct dsa_switch_ops {
 	void	(*get_ethtool_stats)(struct dsa_switch *ds,
 				     int port, uint64_t *data);
 	int	(*get_sset_count)(struct dsa_switch *ds, int port, int sset);
+	void	(*get_ethtool_phy_stats)(struct dsa_switch *ds,
+					 int port, uint64_t *data);
 
 	/*
 	 * ethtool Wake-on-LAN
@@ -589,4 +591,9 @@ static inline int call_dsa_notifiers(unsigned long val, struct net_device *dev,
 #define BRCM_TAG_GET_PORT(v)		((v) >> 8)
 #define BRCM_TAG_GET_QUEUE(v)		((v) & 0xff)
 
+
+int dsa_port_get_phy_strings(struct dsa_port *dp, uint8_t *data);
+int dsa_port_get_ethtool_phy_stats(struct dsa_port *dp, uint64_t *data);
+int dsa_port_get_phy_sset_count(struct dsa_port *dp);
+
 #endif
diff --git a/net/dsa/master.c b/net/dsa/master.c
index d547f3fb44cc..1fb218c13b36 100644
--- a/net/dsa/master.c
+++ b/net/dsa/master.c
@@ -31,6 +31,25 @@ static void dsa_master_get_ethtool_stats(struct net_device *dev,
 		ds->ops->get_ethtool_stats(ds, port, data + count);
 }
 
+static void dsa_master_get_ethtool_phy_stats(struct net_device *dev,
+					     struct ethtool_stats *stats,
+					     uint64_t *data)
+{
+	struct dsa_port *cpu_dp = dev->dsa_ptr;
+	const struct ethtool_ops *ops = cpu_dp->orig_ethtool_ops;
+	struct dsa_switch *ds = cpu_dp->ds;
+	int port = cpu_dp->index;
+	int count = 0;
+
+	if (ops && ops->get_sset_count && ops->get_ethtool_phy_stats) {
+		count = ops->get_sset_count(dev, ETH_SS_PHY_STATS);
+		ops->get_ethtool_phy_stats(dev, stats, data);
+	}
+
+	if (ds->ops->get_ethtool_phy_stats)
+		ds->ops->get_ethtool_phy_stats(ds, port, data + count);
+}
+
 static int dsa_master_get_sset_count(struct net_device *dev, int sset)
 {
 	struct dsa_port *cpu_dp = dev->dsa_ptr;
@@ -107,6 +126,7 @@ static int dsa_master_ethtool_setup(struct net_device *dev)
 	ops->get_sset_count = dsa_master_get_sset_count;
 	ops->get_ethtool_stats = dsa_master_get_ethtool_stats;
 	ops->get_strings = dsa_master_get_strings;
+	ops->get_ethtool_phy_stats = dsa_master_get_ethtool_phy_stats;
 
 	dev->ethtool_ops = ops;
 
diff --git a/net/dsa/port.c b/net/dsa/port.c
index 5e2a88720a9a..2413beb995be 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -383,3 +383,60 @@ void dsa_port_link_unregister_of(struct dsa_port *dp)
 	else
 		dsa_port_setup_phy_of(dp, false);
 }
+
+int dsa_port_get_phy_strings(struct dsa_port *dp, uint8_t *data)
+{
+	struct phy_device *phydev;
+	int ret = -EOPNOTSUPP;
+
+	if (of_phy_is_fixed_link(dp->dn))
+		return ret;
+
+	phydev = dsa_port_get_phy_device(dp);
+	if (IS_ERR_OR_NULL(phydev))
+		return ret;
+
+	ret = phy_ethtool_get_strings(phydev, data);
+	put_device(&phydev->mdio.dev);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(dsa_port_get_phy_strings);
+
+int dsa_port_get_ethtool_phy_stats(struct dsa_port *dp, uint64_t *data)
+{
+	struct phy_device *phydev;
+	int ret = -EOPNOTSUPP;
+
+	if (of_phy_is_fixed_link(dp->dn))
+		return ret;
+
+	phydev = dsa_port_get_phy_device(dp);
+	if (IS_ERR_OR_NULL(phydev))
+		return ret;
+
+	ret = phy_ethtool_get_stats(phydev, NULL, data);
+	put_device(&phydev->mdio.dev);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(dsa_port_get_ethtool_phy_stats);
+
+int dsa_port_get_phy_sset_count(struct dsa_port *dp)
+{
+	struct phy_device *phydev;
+	int ret = -EOPNOTSUPP;
+
+	if (of_phy_is_fixed_link(dp->dn))
+		return ret;
+
+	phydev = dsa_port_get_phy_device(dp);
+	if (IS_ERR_OR_NULL(phydev))
+		return ret;
+
+	ret = phy_ethtool_get_sset_count(phydev);
+	put_device(&phydev->mdio.dev);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(dsa_port_get_phy_sset_count);
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 6/6] net: dsa: b53: Add support for reading PHY statistics
From: Florian Fainelli @ 2018-04-21 18:55 UTC (permalink / raw)
  To: netdev
  Cc: Florian Fainelli, Andrew Lunn, Vivien Didelot, David S. Miller,
	cphealy
In-Reply-To: <1524336953-26108-1-git-send-email-f.fainelli@gmail.com>

Allow the b53 driver to return PHY statistics when the CPU port used is
different than 5, 7 or 8, because those are typically PHY-less on most
devices. This is useful for debugging link problems between the switch
and an external host when using a non standard CPU port number (e.g: 4).

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/dsa/b53/b53_common.c | 63 +++++++++++++++++++++++++++++++++++-----
 drivers/net/dsa/b53/b53_priv.h   |  1 +
 drivers/net/dsa/bcm_sf2.c        |  1 +
 3 files changed, 57 insertions(+), 8 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 726b2d8c6fe9..a67c100be6eb 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -806,20 +806,39 @@ static unsigned int b53_get_mib_size(struct b53_device *dev)
 		return B53_MIBS_SIZE;
 }
 
+static struct phy_device *b53_get_phy_device(struct dsa_switch *ds, int port)
+{
+	/* These ports typically do not have built-in PHYs */
+	switch (port) {
+	case B53_CPU_PORT_25:
+	case 7:
+	case B53_CPU_PORT:
+		return NULL;
+	}
+
+	return mdiobus_get_phy(ds->slave_mii_bus, port);
+}
+
 void b53_get_strings(struct dsa_switch *ds, int port, u32 stringset,
 		     uint8_t *data)
 {
 	struct b53_device *dev = ds->priv;
 	const struct b53_mib_desc *mibs = b53_get_mib(dev);
 	unsigned int mib_size = b53_get_mib_size(dev);
+	struct phy_device *phydev;
 	unsigned int i;
 
-	if (stringset != ETH_SS_STATS)
-		return;
+	if (stringset == ETH_SS_STATS) {
+		for (i = 0; i < mib_size; i++)
+			strlcpy(data + i * ETH_GSTRING_LEN,
+				mibs[i].name, ETH_GSTRING_LEN);
+	} else if (stringset == ETH_SS_PHY_STATS) {
+		phydev = b53_get_phy_device(ds, port);
+		if (!phydev)
+			return;
 
-	for (i = 0; i < mib_size; i++)
-		strlcpy(data + i * ETH_GSTRING_LEN,
-			mibs[i].name, ETH_GSTRING_LEN);
+		phy_ethtool_get_strings(phydev, data);
+	}
 }
 EXPORT_SYMBOL(b53_get_strings);
 
@@ -856,14 +875,34 @@ void b53_get_ethtool_stats(struct dsa_switch *ds, int port, uint64_t *data)
 }
 EXPORT_SYMBOL(b53_get_ethtool_stats);
 
+void b53_get_ethtool_phy_stats(struct dsa_switch *ds, int port, uint64_t *data)
+{
+	struct phy_device *phydev;
+
+	phydev = b53_get_phy_device(ds, port);
+	if (!phydev)
+		return;
+
+	phy_ethtool_get_stats(phydev, NULL, data);
+}
+EXPORT_SYMBOL(b53_get_ethtool_phy_stats);
+
 int b53_get_sset_count(struct dsa_switch *ds, int port, int sset)
 {
 	struct b53_device *dev = ds->priv;
+	struct phy_device *phydev;
 
-	if (sset != ETH_SS_STATS)
-		return 0;
+	if (sset == ETH_SS_STATS) {
+		return b53_get_mib_size(dev);
+	} else if (sset == ETH_SS_PHY_STATS) {
+		phydev = b53_get_phy_device(ds, port);
+		if (!phydev)
+			return 0;
+
+		return phy_ethtool_get_sset_count(phydev);
+	}
 
-	return b53_get_mib_size(dev);
+	return 0;
 }
 EXPORT_SYMBOL(b53_get_sset_count);
 
@@ -1657,6 +1696,7 @@ static const struct dsa_switch_ops b53_switch_ops = {
 	.get_strings		= b53_get_strings,
 	.get_ethtool_stats	= b53_get_ethtool_stats,
 	.get_sset_count		= b53_get_sset_count,
+	.get_ethtool_phy_stats	= b53_get_ethtool_phy_stats,
 	.phy_read		= b53_phy_read16,
 	.phy_write		= b53_phy_write16,
 	.adjust_link		= b53_adjust_link,
@@ -1961,6 +2001,13 @@ static int b53_switch_init(struct b53_device *dev)
 	dev->num_ports = dev->cpu_port + 1;
 	dev->enabled_ports |= BIT(dev->cpu_port);
 
+	/* Include non standard CPU port built-in PHYs to be probed */
+	for (i = 0; i < dev->num_ports; i++) {
+		if (!(dev->ds->phys_mii_mask & BIT(i)) &&
+		    !b53_can_enable_brcm_tags(dev->ds, i))
+			dev->ds->phys_mii_mask |= BIT(i);
+	}
+
 	dev->ports = devm_kzalloc(dev->dev,
 				  sizeof(struct b53_port) * dev->num_ports,
 				  GFP_KERNEL);
diff --git a/drivers/net/dsa/b53/b53_priv.h b/drivers/net/dsa/b53/b53_priv.h
index b933d5cb5c2d..cc284a514de9 100644
--- a/drivers/net/dsa/b53/b53_priv.h
+++ b/drivers/net/dsa/b53/b53_priv.h
@@ -290,6 +290,7 @@ void b53_get_strings(struct dsa_switch *ds, int port, u32 stringset,
 		     uint8_t *data);
 void b53_get_ethtool_stats(struct dsa_switch *ds, int port, uint64_t *data);
 int b53_get_sset_count(struct dsa_switch *ds, int port, int sset);
+void b53_get_ethtool_phy_stats(struct dsa_switch *ds, int port, uint64_t *data);
 int b53_br_join(struct dsa_switch *ds, int port, struct net_device *bridge);
 void b53_br_leave(struct dsa_switch *ds, int port, struct net_device *bridge);
 void b53_br_set_stp_state(struct dsa_switch *ds, int port, u8 state);
diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index 0378eded31f2..97236cfcbae4 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -859,6 +859,7 @@ static const struct dsa_switch_ops bcm_sf2_ops = {
 	.get_strings		= b53_get_strings,
 	.get_ethtool_stats	= b53_get_ethtool_stats,
 	.get_sset_count		= b53_get_sset_count,
+	.get_ethtool_phy_stats	= b53_get_ethtool_phy_stats,
 	.get_phy_flags		= bcm_sf2_sw_get_phy_flags,
 	.adjust_link		= bcm_sf2_sw_adjust_link,
 	.fixed_link_update	= bcm_sf2_sw_fixed_link_update,
-- 
2.7.4

^ permalink raw reply related

* Re: WARNING in perf_trace_buf_alloc (2)
From: Eric Biggers @ 2018-04-21 19:37 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: linux-kernel, mingo, rostedt, syzkaller-bugs, netdev, syzbot
In-Reply-To: <001a11404e22634fb0055d4f2346@google.com>

[+bpf maintainers and netdev]

On Mon, Nov 06, 2017 at 03:56:01AM -0800, syzbot wrote:
> Hello,
> 
> syzkaller hit the following crash on
> 5cb0512c02ecd7e6214e912e4c150f4219ac78e0
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
> 
> 
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 3008 at kernel/trace/trace_event_perf.c:274
> perf_trace_buf_alloc+0x12d/0x160 kernel/trace/trace_event_perf.c:273
> Kernel panic - not syncing: panic_on_warn set ...
> 
> CPU: 0 PID: 3008 Comm: syzkaller609027 Not tainted 4.14.0-rc7+ #159
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  panic+0x1e4/0x417 kernel/panic.c:181
>  __warn+0x1c4/0x1d9 kernel/panic.c:542
>  report_bug+0x211/0x2d0 lib/bug.c:184
>  fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
>  do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
>  do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
>  do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
>  invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:906
> RIP: 0010:perf_trace_buf_alloc+0x12d/0x160
> kernel/trace/trace_event_perf.c:273
> RSP: 0018:ffff8801c0fdf760 EFLAGS: 00010286
> RAX: 000000000000001c RBX: 1ffff100381fbefe RCX: 0000000000000000
> RDX: 000000000000001c RSI: 1ffff100381fbeac RDI: ffffed00381fbee0
> RBP: ffff8801c0fdf780 R08: 0000000000000001 R09: 0000000000000000
> R10: ffff8801c0fdf7a0 R11: 0000000000000000 R12: 000000000000082c
> R13: ffff8801c0fdf810 R14: ffff8801c0fdf890 R15: ffff8801d8b34b40
>  perf_trace_bpf_map_keyval+0x260/0xbd0 include/trace/events/bpf.h:228
>  trace_bpf_map_update_elem include/trace/events/bpf.h:274 [inline]
>  map_update_elem kernel/bpf/syscall.c:597 [inline]
>  SYSC_bpf kernel/bpf/syscall.c:1478 [inline]
>  SyS_bpf+0x33eb/0x46a0 kernel/bpf/syscall.c:1453
>  entry_SYSCALL_64_fastpath+0x1f/0xbe
> RIP: 0033:0x445c29
> RSP: 002b:00000000007eff68 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
> RAX: ffffffffffffffda RBX: 00007ffe66adb340 RCX: 0000000000445c29
> RDX: 0000000000000020 RSI: 000000002053dfe0 RDI: 0000000000000002
> RBP: 0000000000000082 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000403280
> R13: 0000000000403310 R14: 0000000000000000 R15: 0000000000000000
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
> 
> 
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkaller@googlegroups.com.
> Please credit me with: Reported-by: syzbot <syzkaller@googlegroups.com>
> 
> syzbot will keep track of this bug report.
> Once a fix for this bug is committed, please reply to this email with:
> #syz fix: exact-commit-title
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug
> report.
> Note: all commands must start from beginning of the line.

This still happens on Linus' tree.  It seems one of the BPF tracepoints is
trying to pass a buffer that is too long.  Here's a simplified reproducer that
works on Linus' tree (commit 5e7c7806111ade5).  Note: it's not 100% reliable for
some reason; you may have to run it a couple times.  Daniel or Alexei, can one
of you please look into this more?  Thanks!

#include <linux/bpf.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main()
{
        int tracepoint_id;
        FILE *f;

        f = fopen("/sys/kernel/debug/tracing/events/bpf/bpf_map_update_elem/id",
                  "r");
        fscanf(f, "%d", &tracepoint_id);

        struct perf_event_attr perf_attr = {
                .type = PERF_TYPE_TRACEPOINT,
                .size = sizeof(perf_attr),
                .config = tracepoint_id,
        };
        syscall(__NR_perf_event_open, &perf_attr, 0, 0, -1, 0);

        for (;;) {
                union bpf_attr create_attr = {
                        .map_type = BPF_MAP_TYPE_HASH,
                        .key_size = 4,
                        .value_size = 2048,
                        .max_entries = 1,
                };
                int fd = syscall(__NR_bpf, BPF_MAP_CREATE,
                                 &create_attr, sizeof(create_attr));

                char key[4] = { 0 };
                char value[2048] = { 0 };
                union bpf_attr update_attr = {
                        .map_fd = fd,
                        .key = (unsigned long)key,
                        .value = (unsigned long)value,
                };
                syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM,
                        &update_attr, sizeof(update_attr));
                close(fd);
        }
}

^ permalink raw reply

* Re: pull-request: bpf-next 2018-04-21
From: David Miller @ 2018-04-21 19:56 UTC (permalink / raw)
  To: daniel; +Cc: ast, netdev
In-Reply-To: <20180421010717.5050-1-daniel@iogearbox.net>

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Sat, 21 Apr 2018 03:07:17 +0200

> The following pull-request contains BPF updates for your *net-next* tree.

BTF for the win :)

Pulled, thanks Daniel.

^ permalink raw reply

* Re: [PATCH net-next 0/7] net/ipv6: Another followup to the fib6_info change
From: David Miller @ 2018-04-21 20:06 UTC (permalink / raw)
  To: dsahern; +Cc: netdev, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji
In-Reply-To: <20180420223803.15743-1-dsahern@gmail.com>

From: David Ahern <dsahern@gmail.com>
Date: Fri, 20 Apr 2018 15:37:56 -0700

> Last one - for this week.
> 
> Patches 1, 2 and 7 are more cleanup patches - removing dead code,
> moving code from a header to near its single caller, and updating
> function name.
> 
> Patches 3-5 do some refactoring leading up to patch 6 which fixes
> a NULL dereference. I have only managed to trigger a panic once, so
> I can not definitively confirm it addresses the problem but it seems
> pretty clear that it is a race on removing a 'from' reference on
> an rt6_info and another path using that 'from' value to do
> cookie checking.

Great work, series applied.

^ permalink raw reply

* Re: [PATCH/RFC net-next 2/5] ravb: correct ptp does failure after suspend and resume
From: Sergei Shtylyov @ 2018-04-21 20:33 UTC (permalink / raw)
  To: Simon Horman
  Cc: Magnus Damm, netdev, linux-renesas-soc, Wolfram Sang,
	Kazuya Mizuguchi
In-Reply-To: <20180417085030.32650-3-horms+renesas@verge.net.au>

Hello!

   s/failure/fail/ in the subject.

On 04/17/2018 11:50 AM, Simon Horman wrote:

> From: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
> 
> This patch fixes the problem that ptp4l command does not work after
> suspend and resume.
> Add the initial setting in ravb_suspend() and ravb_resume(),
> because ptp does not work.
> 
> Fixes: a0d2f20650e8 ("Renesas Ethernet AVB PTP clock driver")
> Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
> Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
[...]

   Other than the earlier comments, this seems good:

Reviewed-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>

MBR, Sergei

^ permalink raw reply

* Re: [pci PATCH v8 0/4] Add support for unmanaged SR-IOV
From: Bjorn Helgaas @ 2018-04-21 20:34 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: bhelgaas, linux-pci, virtio-dev, kvm, netdev, dan.daly,
	linux-kernel, linux-nvme, keith.busch, netanel, ddutile, mheyne,
	liang-min.wang, mark.d.rustad, dwmw2, hch, dwmw
In-Reply-To: <20180420162633.46077.49012.stgit@ahduyck-green-test.jf.intel.com>

On Fri, Apr 20, 2018 at 12:28:08PM -0400, Alexander Duyck wrote:
> This series is meant to add support for SR-IOV on devices when the VFs are
> not managed by the kernel. Examples of recent patches attempting to do this
> include:
> virto - https://patchwork.kernel.org/patch/10241225/
> pci-stub - https://patchwork.kernel.org/patch/10109935/
> vfio - https://patchwork.kernel.org/patch/10103353/
> uio - https://patchwork.kernel.org/patch/9974031/
> 
> Since this is quickly blowing up into a multi-driver problem it is probably
> best to implement this solution as generically as possible.
> 
> This series is an attempt to do that. What we do with this patch set is
> provide a generic framework to enable SR-IOV in the case that the PF driver
> doesn't support managing the VFs itself.
> 
> I based my patch set originally on the patch by Mark Rustad but there isn't
> much left after going through and cleaning out the bits that were no longer
> needed, and after incorporating the feedback from David Miller. At this point
> the only items to be fully reused was his patch description which is now
> present in patch 3 of the set.
> 
> This solution is limited in scope to just adding support for devices that
> provide no functionality for SR-IOV other than allocating the VFs by
> calling pci_enable_sriov. Previous sets had included patches for VFIO, but
> for now I am dropping that as the scope of that work is larger then I
> think I can take on at this time.
> 
> v2: Reduced scope back to just virtio_pci and vfio-pci
>     Broke into 3 patch set from single patch
>     Changed autoprobe behavior to always set when num_vfs is set non-zero
> v3: Updated Documentation to clarify when sriov_unmanaged_autoprobe is used
>     Wrapped vfio_pci_sriov_configure to fix build errors w/o SR-IOV in kernel
> v4: Dropped vfio-pci patch
>     Added ena and nvme to drivers now using pci_sriov_configure_unmanaged
>     Dropped pci_disable_sriov call in virtio_pci to be consistent with ena
> v5: Dropped sriov_unmanaged_autoprobe and pci_sriov_conifgure_unmanaged
>     Added new patch that enables pci_sriov_configure_simple
>     Updated drivers to use pci_sriov_configure_simple
> v6: Defined pci_sriov_configure_simple as NULL when SR-IOV is not enabled
>     Updated drivers to drop "#ifdef" checks for IOV
>     Added pci-pf-stub as place for PF-only drivers to add support
> v7: Dropped pci_id table explanation from pci-pf-stub driver
>     Updated pci_sriov_configure_simple to drop need for err value
>     Fixed comment explaining why pci_sriov_configure_simple is NULL
> v8: Dropped virtio from the set, support to be added later after TC approval
> 
> Cc: Mark Rustad <mark.d.rustad@intel.com>
> Cc: Maximilian Heyne <mheyne@amazon.de>
> Cc: Liang-Min Wang <liang-min.wang@intel.com>
> Cc: David Woodhouse <dwmw@amazon.co.uk>
> 
> ---
> 
> Alexander Duyck (4):
>       pci: Add pci_sriov_configure_simple for PFs that don't manage VF resources
>       ena: Migrate over to unmanaged SR-IOV support
>       nvme: Migrate over to unmanaged SR-IOV support
>       pci-pf-stub: Add PF driver stub for PFs that function only to enable VFs
> 
> 
>  drivers/net/ethernet/amazon/ena/ena_netdev.c |   28 -------------
>  drivers/nvme/host/pci.c                      |   20 ----------
>  drivers/pci/Kconfig                          |   12 ++++++
>  drivers/pci/Makefile                         |    2 +
>  drivers/pci/iov.c                            |   31 +++++++++++++++
>  drivers/pci/pci-pf-stub.c                    |   54 ++++++++++++++++++++++++++
>  include/linux/pci.h                          |    3 +
>  include/linux/pci_ids.h                      |    2 +
>  8 files changed, 106 insertions(+), 46 deletions(-)
>  create mode 100644 drivers/pci/pci-pf-stub.c

I tentatively applied these to pci/virtualization-review.

The code changes look fine, but I want to flesh out the changelogs a
little bit before merging them.

For example, I'm not sure what you mean by "devices where the PF is
not capable of managing VF resources."

It *sounds* like you're saying the hardware works differently on some
devices, but I don't think that's what you mean.  I think you're
saying something about which drivers are used for the PF and the VF.

I think a trivial example of how this will be used might help.  I
assume this involves a virtualization scenario where the host uses the
PF to enable several VFs, but the host doesn't use the PF for much
else.  Then you assign the VFs to guests, and drivers in the guest
OSes use the VFs.

Since .sriov_configure() is only used by sriov_numvfs_store(), I
assume the usage model involves writing to the sysfs sriov_numvfs
attribute to enable the VFs, then assigning them to guests?

Bjorn

^ permalink raw reply

* Re: [PATCH/RFC net-next 3/5] ravb: do not write 1 to reserved bits
From: Sergei Shtylyov @ 2018-04-21 20:44 UTC (permalink / raw)
  To: David Miller, horms+renesas
  Cc: magnus.damm, netdev, linux-renesas-soc, wsa+renesas,
	kazuya.mizuguchi.ks
In-Reply-To: <20180417.101553.1512039208518397234.davem@davemloft.net>

Hello!

On 04/17/2018 05:15 PM, David Miller wrote:

>> From: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
>>
>> This patch corrects writing 1 to reserved bits.
>> The write value should be 0.
>>
>> Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
>> Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
> 
> How are we ending up in situations where the driver is trying to write
> non-zero values to those fields in the first place?

   The brain damaged AVB core design is to blame here. You have to write 0 to
clear the set bits and, at the same time, you can't write 1 to the reserved
bits... :-/

> The places creating those values should be making sure that the
> reserved bits are never set.

   That's basically what this patch is doing.

> If you mask out the reserved bits in the register writing locations,
> this just hides bugs.

   There are no *other* locations in some cases... 
   And I don't think that forcing the reserved bits to 1 after a register is
read would look better. :-(

MBR, Sergei

^ permalink raw reply

* Re: [PATCH/RFC net-next 3/5] ravb: do not write 1 to reserved bits
From: Sergei Shtylyov @ 2018-04-21 20:53 UTC (permalink / raw)
  To: Simon Horman
  Cc: Magnus Damm, netdev, linux-renesas-soc, Wolfram Sang,
	Kazuya Mizuguchi
In-Reply-To: <20180417085030.32650-4-horms+renesas@verge.net.au>

On 04/17/2018 11:50 AM, Simon Horman wrote:

> From: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
> 
> This patch corrects writing 1 to reserved bits.
> The write value should be 0.
> 
> Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
> Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
> ---
>  drivers/net/ethernet/renesas/ravb.h      | 12 ++++++++++++
>  drivers/net/ethernet/renesas/ravb_main.c |  9 +++++----
>  drivers/net/ethernet/renesas/ravb_ptp.c  |  2 +-
>  3 files changed, 18 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
> index b81f4faf7b10..57eea4a77826 100644
> --- a/drivers/net/ethernet/renesas/ravb.h
> +++ b/drivers/net/ethernet/renesas/ravb.h
> @@ -433,6 +433,8 @@ enum EIS_BIT {
>  	EIS_QFS		= 0x00010000,
>  };
>  
> +#define EIS_RESERVED_BITS	(u32)(GENMASK(31, 17) | GENMASK(15, 11))
> +
>  /* RIC0 */
>  enum RIC0_BIT {
>  	RIC0_FRE0	= 0x00000001,
> @@ -477,6 +479,8 @@ enum RIS0_BIT {
>  	RIS0_FRF17	= 0x00020000,
>  };
>  
> +#define RIS0_RESERVED_BITS	(u32)GENMASK(31, 18)
> +
>  /* RIC1 */
>  enum RIC1_BIT {
>  	RIC1_RFWE	= 0x80000000,
> @@ -533,6 +537,8 @@ enum RIS2_BIT {
>  	RIS2_RFFF	= 0x80000000,
>  };
>  
> +#define RIS2_RESERVED_BITS	(u32)GENMASK_ULL(30, 18)
> +
>  /* TIC */
>  enum TIC_BIT {
>  	TIC_FTE0	= 0x00000001,	/* Undocumented? */
> @@ -549,6 +555,10 @@ enum TIS_BIT {
>  	TIS_TFWF	= 0x00000200,
>  };
>  
> +#define TIS_RESERVED_BITS	(u32)(GENMASK_ULL(31, 20) | \
> +				      GENMASK_ULL(15, 12) | \
> +				      GENMASK_ULL(7, 4))
> +
>  /* ISS */
>  enum ISS_BIT {
>  	ISS_FRS		= 0x00000001,	/* Undocumented? */
> @@ -622,6 +632,8 @@ enum GIS_BIT {
>  	GIS_PTMF	= 0x00000004,
>  };
>  
> +#define GIS_RESERVED_BITS	(u32)GENMASK(15, 10)
> +
>  /* GIE (R-Car Gen3 only) */
>  enum GIE_BIT {
>  	GIE_PTCS	= 0x00000001,
[...]

   Perhaps we can do what the MUSB driver does: declare the writing-zero-clears
masks (inside *enum*!), and then set them before writing th register... 

MBR, Sergei

^ permalink raw reply

* Re: [PATCH/RFC net-next 4/5] ravb: remove undocumented processing
From: Sergei Shtylyov @ 2018-04-21 20:59 UTC (permalink / raw)
  To: Simon Horman
  Cc: Magnus Damm, netdev, linux-renesas-soc, Wolfram Sang,
	Kazuya Mizuguchi
In-Reply-To: <20180417085030.32650-5-horms+renesas@verge.net.au>

On 04/17/2018 11:50 AM, Simon Horman wrote:

> From: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>

   How about the description (or 2)?

> Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
> Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
> ---
>  drivers/net/ethernet/renesas/ravb.h      |  5 -----
>  drivers/net/ethernet/renesas/ravb_main.c | 15 ---------------
>  2 files changed, 20 deletions(-)
> 
> diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
> index 57eea4a77826..fcd04dbc7dde 100644
> --- a/drivers/net/ethernet/renesas/ravb.h
> +++ b/drivers/net/ethernet/renesas/ravb.h
> @@ -197,15 +197,11 @@ enum ravb_reg {
>  	MAHR	= 0x05c0,
>  	MALR	= 0x05c8,
>  	TROCR	= 0x0700,	/* Undocumented? */
> -	CDCR	= 0x0708,	/* Undocumented? */
> -	LCCR	= 0x0710,	/* Undocumented? */
>  	CEFCR	= 0x0740,
>  	FRECR	= 0x0748,
>  	TSFRCR	= 0x0750,
>  	TLFRCR	= 0x0758,
>  	RFCR	= 0x0760,
> -	CERCR	= 0x0768,	/* Undocumented? */
> -	CEECR	= 0x0770,	/* Undocumented? */
>  	MAFCR	= 0x0778,
>  };
>  

   I think this is a material of patch #1... 

> @@ -223,7 +219,6 @@ enum CCC_BIT {
>  	CCC_CSEL_HPB	= 0x00010000,
>  	CCC_CSEL_ETH_TX	= 0x00020000,
>  	CCC_CSEL_GMII_REF = 0x00030000,
> -	CCC_BOC		= 0x00100000,	/* Undocumented? */
>  	CCC_LBME	= 0x01000000,

   ... and this -- of patch #2. Or vice versa. :-)

[...]

MBR, Sergei

^ permalink raw reply

* [PATCH net] net: ethtool: Add missing kernel doc for FEC parameters
From: Florian Fainelli @ 2018-04-21 23:18 UTC (permalink / raw)
  To: netdev; +Cc: davem, vidya.chowdary, dustin, roopa, Florian Fainelli

While adding support for ethtool::get_fecparam and set_param, kernel doc for
these functions was missed, add those.

Fixes: 1a5f3da20bd9 ("net: ethtool: add support for forward error correction modes")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 include/linux/ethtool.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index ebe41811ed34..f95d47b4d545 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -310,6 +310,10 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32,
  *	fields should be ignored (use %__ETHTOOL_LINK_MODE_MASK_NBITS
  *	instead of the latter), any change to them will be overwritten
  *	by kernel. Returns a negative error code or zero.
+ * @get_fecparam: Get the network device Forward Error Correction
+ *	parameters.
+ * @sec_fecparam: Set the network device Forward Error Correction
+ *	parameters.
  *
  * All operations are optional (i.e. the function pointer may be set
  * to %NULL) and callers must take this into account.  Callers must
-- 
2.14.1

^ permalink raw reply related

* Re: [pci PATCH v8 0/4] Add support for unmanaged SR-IOV
From: Alexander Duyck @ 2018-04-22  0:22 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Alexander Duyck, Bjorn Helgaas, linux-pci, virtio-dev, kvm,
	Netdev, Daly, Dan, LKML, linux-nvme, Keith Busch, netanel,
	Don Dutile, Maximilian Heyne, Wang, Liang-min, Rustad, Mark D,
	David Woodhouse, Christoph Hellwig, dwmw
In-Reply-To: <20180421203437.GW28657@bhelgaas-glaptop.roam.corp.google.com>

On Sat, Apr 21, 2018 at 1:34 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> On Fri, Apr 20, 2018 at 12:28:08PM -0400, Alexander Duyck wrote:
>> This series is meant to add support for SR-IOV on devices when the VFs are
>> not managed by the kernel. Examples of recent patches attempting to do this
>> include:
>> virto - https://patchwork.kernel.org/patch/10241225/
>> pci-stub - https://patchwork.kernel.org/patch/10109935/
>> vfio - https://patchwork.kernel.org/patch/10103353/
>> uio - https://patchwork.kernel.org/patch/9974031/
>>
>> Since this is quickly blowing up into a multi-driver problem it is probably
>> best to implement this solution as generically as possible.
>>
>> This series is an attempt to do that. What we do with this patch set is
>> provide a generic framework to enable SR-IOV in the case that the PF driver
>> doesn't support managing the VFs itself.
>>
>> I based my patch set originally on the patch by Mark Rustad but there isn't
>> much left after going through and cleaning out the bits that were no longer
>> needed, and after incorporating the feedback from David Miller. At this point
>> the only items to be fully reused was his patch description which is now
>> present in patch 3 of the set.
>>
>> This solution is limited in scope to just adding support for devices that
>> provide no functionality for SR-IOV other than allocating the VFs by
>> calling pci_enable_sriov. Previous sets had included patches for VFIO, but
>> for now I am dropping that as the scope of that work is larger then I
>> think I can take on at this time.
>>
>> v2: Reduced scope back to just virtio_pci and vfio-pci
>>     Broke into 3 patch set from single patch
>>     Changed autoprobe behavior to always set when num_vfs is set non-zero
>> v3: Updated Documentation to clarify when sriov_unmanaged_autoprobe is used
>>     Wrapped vfio_pci_sriov_configure to fix build errors w/o SR-IOV in kernel
>> v4: Dropped vfio-pci patch
>>     Added ena and nvme to drivers now using pci_sriov_configure_unmanaged
>>     Dropped pci_disable_sriov call in virtio_pci to be consistent with ena
>> v5: Dropped sriov_unmanaged_autoprobe and pci_sriov_conifgure_unmanaged
>>     Added new patch that enables pci_sriov_configure_simple
>>     Updated drivers to use pci_sriov_configure_simple
>> v6: Defined pci_sriov_configure_simple as NULL when SR-IOV is not enabled
>>     Updated drivers to drop "#ifdef" checks for IOV
>>     Added pci-pf-stub as place for PF-only drivers to add support
>> v7: Dropped pci_id table explanation from pci-pf-stub driver
>>     Updated pci_sriov_configure_simple to drop need for err value
>>     Fixed comment explaining why pci_sriov_configure_simple is NULL
>> v8: Dropped virtio from the set, support to be added later after TC approval
>>
>> Cc: Mark Rustad <mark.d.rustad@intel.com>
>> Cc: Maximilian Heyne <mheyne@amazon.de>
>> Cc: Liang-Min Wang <liang-min.wang@intel.com>
>> Cc: David Woodhouse <dwmw@amazon.co.uk>
>>
>> ---
>>
>> Alexander Duyck (4):
>>       pci: Add pci_sriov_configure_simple for PFs that don't manage VF resources
>>       ena: Migrate over to unmanaged SR-IOV support
>>       nvme: Migrate over to unmanaged SR-IOV support
>>       pci-pf-stub: Add PF driver stub for PFs that function only to enable VFs
>>
>>
>>  drivers/net/ethernet/amazon/ena/ena_netdev.c |   28 -------------
>>  drivers/nvme/host/pci.c                      |   20 ----------
>>  drivers/pci/Kconfig                          |   12 ++++++
>>  drivers/pci/Makefile                         |    2 +
>>  drivers/pci/iov.c                            |   31 +++++++++++++++
>>  drivers/pci/pci-pf-stub.c                    |   54 ++++++++++++++++++++++++++
>>  include/linux/pci.h                          |    3 +
>>  include/linux/pci_ids.h                      |    2 +
>>  8 files changed, 106 insertions(+), 46 deletions(-)
>>  create mode 100644 drivers/pci/pci-pf-stub.c
>
> I tentatively applied these to pci/virtualization-review.
>
> The code changes look fine, but I want to flesh out the changelogs a
> little bit before merging them.

Thanks.

> For example, I'm not sure what you mean by "devices where the PF is
> not capable of managing VF resources."
>
> It *sounds* like you're saying the hardware works differently on some
> devices, but I don't think that's what you mean.  I think you're
> saying something about which drivers are used for the PF and the VF.

That is sort of what I am saying.

So for example with ixgbe there is functionality which is controlled
in the MMIO space of the PF that affects the functionality of the VFs
that are generated on the device. The PF has to rearrange the
resources such as queues and interrupts on the device before it can
enable SR-IOV, and it could alter those later to limit what the VF is
capable of doing.

The model I am dealing with via this patch set has a PF that is not
much different than the VFs other than the fact that it has some
extended configuration space bits in place for SR-IOV, ARI, ACS, and
whatever other bits are needed in order to support spawning isolated
VFs.

> I think a trivial example of how this will be used might help.  I
> assume this involves a virtualization scenario where the host uses the
> PF to enable several VFs, but the host doesn't use the PF for much
> else.  Then you assign the VFs to guests, and drivers in the guest
> OSes use the VFs.

So this description would work for the pci-pf-stub driver. Basically
the idea here is that you have a device that is nothing more than a
function there to spawn VFs.

The other scenario that I am supporting are the ena and nvme models.
They match what I have described above where the PF is really just
another VF with some extra configuration space bits that add support
for spawning what amount to peer VFs to the already present PF.

> Since .sriov_configure() is only used by sriov_numvfs_store(), I
> assume the usage model involves writing to the sysfs sriov_numvfs
> attribute to enable the VFs, then assigning them to guests?
>
> Bjorn

Yes. The assumption for this function is that we are only supporting
the sysfs approach for spawning VFs.

- Alex

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox