Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 0/8] arm: renesas: Change platform dependency to ARCH_RENESAS
From: Arnd Bergmann @ 2018-04-20 13:40 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Simon Horman, Magnus Damm, Russell King, Catalin Marinas,
	Will Deacon, Dan Williams, Vinod Koul, Mauro Carvalho Chehab,
	Sergei Shtylyov, David S . Miller, Greg Kroah-Hartman,
	Liam Girdwood, Mark Brown, Jaroslav Kysela, Takashi Iwai,
	Kuninori Morimoto, Laurent Pinchart, Linux-Renesas, Linux
In-Reply-To: <1524230914-10175-1-git-send-email-geert+renesas@glider.be>

On Fri, Apr 20, 2018 at 3:28 PM, Geert Uytterhoeven
<geert+renesas@glider.be> wrote:
>         Hi all,
>
> Commit 9b5ba0df4ea4f940 ("ARM: shmobile: Introduce ARCH_RENESAS")
> started the conversion from ARCH_SHMOBILE to ARCH_RENESAS for Renesas
> ARM SoCs.  This patch series completes the conversion, by:
>   1. Updating dependencies for drivers that weren't converted yet,
>   2. Removing the ARCH_SHMOBILE Kconfig symbols on ARM and ARM64.
>
> The first 6 patches can be applied independently by subsystem
> maintainers.
> The last two patches depend on the first 6 patches, and are thus marked
> RFC.

This all looks fine to me.

Acked-by: Arnd Bergmann <arnd@arndb.de>

      Arnd

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Matthew Wilcox @ 2018-04-20 13:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mikulas Patocka, David Miller, Andrew Morton, linux-mm,
	eric.dumazet, edumazet, bhutchings, netdev, linux-kernel, mst,
	jasowang, virtualization, dm-devel, Vlastimil Babka
In-Reply-To: <20180420130852.GC16083@dhcp22.suse.cz>

On Fri, Apr 20, 2018 at 03:08:52PM +0200, Michal Hocko wrote:
> > In order to detect these bugs reliably I submit this patch that changes
> > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> 
> No way. This is just wrong! First of all, you will explode most likely
> on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> enabled quite often.

I think it'll still suit Mikulas' debugging needs if we always use
vmalloc for sizes above PAGE_SIZE?

^ permalink raw reply

* [PATCH] iptables: Per-net ns lock
From: Kirill Tkhai @ 2018-04-20 13:42 UTC (permalink / raw)
  To: fw, netdev, pablo, rstoyanov1, ptikhomirov, avagin, ktkhai

Containers want to restore their own net ns,
while they may have no their own mnt ns.
This case they share host's /run/xtables.lock
file, but they may not have permission to open
it.

Patch makes /run/xtables.lock to be per-namespace,
i.e., to refer to the caller task's net ns.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 iptables/xshared.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/iptables/xshared.c b/iptables/xshared.c
index 06db72d4..b6dbe4e7 100644
--- a/iptables/xshared.c
+++ b/iptables/xshared.c
@@ -254,7 +254,12 @@ static int xtables_lock(int wait, struct timeval *wait_interval)
 	time_left.tv_sec = wait;
 	time_left.tv_usec = 0;
 
-	fd = open(XT_LOCK_NAME, O_CREAT, 0600);
+	if (symlink("/proc/self/ns/net", XT_LOCK_NAME) != 0 &&
+	    errno != EEXIST) {
+		fprintf(stderr, "Fatal: can't create lock file\n");
+		return XT_LOCK_FAILED;
+	}
+	fd = open(XT_LOCK_NAME, O_RDONLY);
 	if (fd < 0) {
 		fprintf(stderr, "Fatal: can't open lock file %s: %s\n",
 			XT_LOCK_NAME, strerror(errno));

^ permalink raw reply related

* Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache
From: Jesper Dangaard Brouer @ 2018-04-20 13:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: brouer, Paolo Abeni, netdev, David S. Miller, Tariq Toukan
In-Reply-To: <0e3abeb5-8081-f9ea-4de6-cc1a7edfc5a5@gmail.com>

On Thu, 19 Apr 2018 06:47:10 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On 04/19/2018 12:40 AM, Paolo Abeni wrote:
> > On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:  
> >> On 04/18/2018 10:15 AM, Paolo Abeni wrote:
[...]
> > 
> > Any suggestions for better results are more than welcome!  
> 
> Yes, remote skb freeing. I mentioned this idea to Jesper and Tariq in
> Seoul (netdev conference). Not tied to UDP, but a generic solution.

Yes, I remember.  I think... was it the idea, where you basically
wanted to queue back SKBs to the CPU that allocated them, right?

Freeing an SKB on the same CPU that allocated it, have multiple
advantages. (1) the SLUB allocator can use a non-atomic
"cpu-local" (double)cmpxchg. (2) the 4 cache-lines memset cleared of
the SKB stay local.  (3) the atomic SKB refcnt/users stay local.

We just have to avoid that queue back SKB's mechanism, doesn't cost
more than the operations we expect to save.  Bulk transfer is an
obvious approach.  For storing SKBs until they are returned, we already
have a fast mechanism see napi_consume_skb calling _kfree_skb_defer,
which SLUB/SLAB-bulk free to amortize cost (1).

I guess, the missing information is that we don't know what CPU the SKB
were created on...

Where to store this CPU info?

(a) In struct sk_buff, in a cache-line that is already read on remote
CPU in UDP code?

(b) In struct page, as SLUB alloc hand-out objects/SKBs on a per page
basis, we could have SLUB store a hint about the CPU it was allocated
on, and bet on returning to that CPU ? (might be bad to read the
struct-page cache-line)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Michal Hocko @ 2018-04-20 13:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mikulas Patocka, David Miller, Andrew Morton, linux-mm,
	eric.dumazet, edumazet, bhutchings, netdev, linux-kernel, mst,
	jasowang, virtualization, dm-devel, Vlastimil Babka
In-Reply-To: <20180420134136.GD10788@bombadil.infradead.org>

On Fri 20-04-18 06:41:36, Matthew Wilcox wrote:
> On Fri, Apr 20, 2018 at 03:08:52PM +0200, Michal Hocko wrote:
> > > In order to detect these bugs reliably I submit this patch that changes
> > > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> > 
> > No way. This is just wrong! First of all, you will explode most likely
> > on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> > enabled quite often.
> 
> I think it'll still suit Mikulas' debugging needs if we always use
> vmalloc for sizes above PAGE_SIZE?

Even if that was the case then this doesn't sounds like CONFIG_DEBUG_VM
material. We do not want a completely different behavior when the config
is enabled. If we really need some better fallback testing coverage
then the fault injection, as suggested by Vlastimil, sounds much more
reasonable to me

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH bpf-next 1/5] samples/bpf: Fix typo in comment
From: Jesper Dangaard Brouer @ 2018-04-20 13:52 UTC (permalink / raw)
  To: Daniel Thompson
  Cc: Leo Yan, Alexei Starovoitov, Daniel Borkmann, netdev,
	linux-kernel, brouer
In-Reply-To: <20180420132116.uswpqniteogfu4zz@holly.lan>

On Fri, 20 Apr 2018 14:21:16 +0100
Daniel Thompson <daniel.thompson@linaro.org> wrote:

> On Fri, Apr 20, 2018 at 02:10:04PM +0200, Jesper Dangaard Brouer wrote:
> > 
> > On Thu, 19 Apr 2018 09:34:02 +0800 Leo Yan <leo.yan@linaro.org> wrote:
> >   
> > > Fix typo by replacing 'iif' with 'if'.
> > > 
> > > Signed-off-by: Leo Yan <leo.yan@linaro.org>
> > > ---
> > >  samples/bpf/bpf_load.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
> > > index bebe418..28e4678 100644
> > > --- a/samples/bpf/bpf_load.c
> > > +++ b/samples/bpf/bpf_load.c
> > > @@ -393,7 +393,7 @@ static int load_elf_maps_section(struct bpf_map_data *maps, int maps_shndx,
> > >  			continue;
> > >  		if (sym[nr_maps].st_shndx != maps_shndx)
> > >  			continue;
> > > -		/* Only increment iif maps section */
> > > +		/* Only increment if maps section */
> > >  		nr_maps++;
> > >  	}  
> > 
> > This was actually not a typo from my side.
> > 
> > With 'iif' I mean 'if and only if' ... but it doesn't matter much.  
> 
> I think 'if and only if' is more commonly abbreviated 'iff' isn't it?

Ah, yes![1]  -- then it *is* actually a typo! - LOL

I'm fine with changing this to "if" :-)


[1] https://en.wikipedia.org/wiki/If_and_only_if

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [PATCH iproute2 1/1] tc: return on invalid smac or dmac in ife action
From: Roman Mashak @ 2018-04-20 13:52 UTC (permalink / raw)
  To: stephen; +Cc: netdev, kernel, jhs, xiyou.wangcong, jiri, Roman Mashak

Return on invalid smac/dmac and use invarg consistently for invalid
arguments report.

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
---
 tc/m_ife.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/tc/m_ife.c b/tc/m_ife.c
index d7e61703f666..ed0913a379aa 100644
--- a/tc/m_ife.c
+++ b/tc/m_ife.c
@@ -94,9 +94,7 @@ static int parse_ife(struct action_util *a, int *argc_p, char ***argv_p,
 			} else if (matches(*argv, "tcindex") == 0) {
 				ife_tcindex = IFE_META_TCINDEX;
 			} else {
-				fprintf(stderr, "Illegal meta define <%s>\n",
-					*argv);
-				return -1;
+				invarg("Illegal meta define", *argv);
 			}
 		} else if (matches(*argv, "use") == 0) {
 			NEXT_ARG();
@@ -116,9 +114,7 @@ static int parse_ife(struct action_util *a, int *argc_p, char ***argv_p,
 					invarg("ife tcindex val is invalid",
 					       *argv);
 			} else {
-				fprintf(stderr, "Illegal meta use type <%s>\n",
-					*argv);
-				return -1;
+				invarg("Illegal meta use type", *argv);
 			}
 		} else if (matches(*argv, "type") == 0) {
 			NEXT_ARG();
@@ -132,8 +128,7 @@ static int parse_ife(struct action_util *a, int *argc_p, char ***argv_p,
 			if (sscanf(daddr, "%hhx:%hhx:%hhx:%hhx:%hhx:%hhx",
 				   dbuf, dbuf + 1, dbuf + 2,
 				   dbuf + 3, dbuf + 4, dbuf + 5) != 6) {
-				fprintf(stderr, "Invalid mac address %s\n",
-					daddr);
+				invarg("Invalid mac address", *argv);
 			}
 			fprintf(stderr, "dst MAC address <%s>\n", daddr);
 
@@ -143,8 +138,7 @@ static int parse_ife(struct action_util *a, int *argc_p, char ***argv_p,
 			if (sscanf(saddr, "%hhx:%hhx:%hhx:%hhx:%hhx:%hhx",
 				   sbuf, sbuf + 1, sbuf + 2,
 				   sbuf + 3, sbuf + 4, sbuf + 5) != 6) {
-				fprintf(stderr, "Invalid mac address %s\n",
-					saddr);
+				invarg("Invalid mac address", *argv);
 			}
 			fprintf(stderr, "src MAC address <%s>\n", saddr);
 		} else if (matches(*argv, "help") == 0) {
-- 
2.7.4

^ permalink raw reply related

* Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Michael S. Tsirkin @ 2018-04-20 13:52 UTC (permalink / raw)
  To: Liang, Cunming
  Cc: Bie, Tiwei, Jason Wang, alex.williamson@redhat.com,
	ddutile@redhat.com, Duyck, Alexander H,
	virtio-dev@lists.oasis-open.org, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
	netdev@vger.kernel.org, Daly, Dan, Wang, Zhihong, Tan, Jianfeng,
	Wang, Xiao W, Tian, Kevin
In-Reply-To: <D0158A423229094DA7ABF71CF2FA0DA34E9511D5@SHSMSX104.ccr.corp.intel.com>

On Fri, Apr 20, 2018 at 03:50:41AM +0000, Liang, Cunming wrote:
> 
> 
> > -----Original Message-----
> > From: Bie, Tiwei
> > Sent: Friday, April 20, 2018 11:28 AM
> > To: Michael S. Tsirkin <mst@redhat.com>
> > Cc: Jason Wang <jasowang@redhat.com>; alex.williamson@redhat.com;
> > ddutile@redhat.com; Duyck, Alexander H <alexander.h.duyck@intel.com>;
> > virtio-dev@lists.oasis-open.org; linux-kernel@vger.kernel.org;
> > kvm@vger.kernel.org; virtualization@lists.linux-foundation.org;
> > netdev@vger.kernel.org; Daly, Dan <dan.daly@intel.com>; Liang, Cunming
> > <cunming.liang@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>; Tan,
> > Jianfeng <jianfeng.tan@intel.com>; Wang, Xiao W <xiao.w.wang@intel.com>;
> > Tian, Kevin <kevin.tian@intel.com>
> > Subject: Re: [RFC] vhost: introduce mdev based hardware vhost backend
> > 
> > On Thu, Apr 19, 2018 at 09:40:23PM +0300, Michael S. Tsirkin wrote:
> > > On Tue, Apr 10, 2018 at 03:25:45PM +0800, Jason Wang wrote:
> > > > > > > One problem is that, different virtio ring compatible devices
> > > > > > > may have different device interfaces. That is to say, we will
> > > > > > > need different drivers in QEMU. It could be troublesome. And
> > > > > > > that's what this patch trying to fix. The idea behind this
> > > > > > > patch is very simple: mdev is a standard way to emulate device
> > > > > > > in kernel.
> > > > > > So you just move the abstraction layer from qemu to kernel, and
> > > > > > you still need different drivers in kernel for different device
> > > > > > interfaces of accelerators. This looks even more complex than
> > > > > > leaving it in qemu. As you said, another idea is to implement
> > > > > > userspace vhost backend for accelerators which seems easier and
> > > > > > could co-work with other parts of qemu without inventing new type of
> > messages.
> > > > > I'm not quite sure. Do you think it's acceptable to add various
> > > > > vendor specific hardware drivers in QEMU?
> > > > >
> > > >
> > > > I don't object but we need to figure out the advantages of doing it
> > > > in qemu too.
> > > >
> > > > Thanks
> > >
> > > To be frank kernel is exactly where device drivers belong.  DPDK did
> > > move them to userspace but that's merely a requirement for data path.
> > > *If* you can have them in kernel that is best:
> > > - update kernel and there's no need to rebuild userspace
> > > - apps can be written in any language no need to maintain multiple
> > >   libraries or add wrappers
> > > - security concerns are much smaller (ok people are trying to
> > >   raise the bar with IOMMUs and such, but it's already pretty
> > >   good even without)
> > >
> > > The biggest issue is that you let userspace poke at the device which
> > > is also allowed by the IOMMU to poke at kernel memory (needed for
> > > kernel driver to work).
> > 
> > I think the device won't and shouldn't be allowed to poke at kernel memory. Its
> > kernel driver needs some kernel memory to work. But the device doesn't have
> > the access to them. Instead, the device only has the access to:
> > 
> > (1) the entire memory of the VM (if vIOMMU isn't used) or
> > (2) the memory belongs to the guest virtio device (if
> >     vIOMMU is being used).
> > 
> > Below is the reason:
> > 
> > For the first case, we should program the IOMMU for the hardware device based
> > on the info in the memory table which is the entire memory of the VM.
> > 
> > For the second case, we should program the IOMMU for the hardware device
> > based on the info in the shadow page table of the vIOMMU.
> > 
> > So the memory can be accessed by the device is limited, it should be safe
> > especially for the second case.
> > 
> > My concern is that, in this RFC, we don't program the IOMMU for the mdev
> > device in the userspace via the VFIO API directly. Instead, we pass the memory
> > table to the kernel driver via the mdev device (BAR0) and ask the driver to do the
> > IOMMU programming. Someone may don't like it. The main reason why we don't
> > program IOMMU via VFIO API in userspace directly is that, currently IOMMU
> > drivers don't support mdev bus.
> > 
> > >
> > > Yes, maybe if device is not buggy it's all fine, but it's better if we
> > > do not have to trust the device otherwise the security picture becomes
> > > more murky.
> > >
> > > I suggested attaching a PASID to (some) queues - see my old post
> > > "using PASIDs to enable a safe variant of direct ring access".
> > 
> Ideally we can have a device binding with normal driver in host, meanwhile support to allocate a few queues attaching with PASID on-demand. By vhost mdev transport channel, the data path ability of queues(as a device) can expose to qemu vhost adaptor as a vDPA instance. Then we can avoid VF number limitation, providing vhost data path acceleration in a small granularity.

Exactly my point.

> > It's pretty cool. We also have some similar ideas.
> > Cunming will talk more about this.
> > 
> > Best regards,
> > Tiwei Bie
> > 
> > >
> > > Then using IOMMU with VFIO to limit access through queue to corrent
> > > ranges of memory.
> > >
> > >
> > > --
> > > MST

^ permalink raw reply

* Re: [PATCH net-next 2/2] netns: isolate seqnums to use per-netns locks
From: Christian Brauner @ 2018-04-20 13:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: davem, netdev, linux-kernel, avagin, ktkhai, serge, gregkh
In-Reply-To: <20180418215246.GA24000@gmail.com>

On Wed, Apr 18, 2018 at 11:52:47PM +0200, Christian Brauner wrote:
> On Wed, Apr 18, 2018 at 11:55:52AM -0500, Eric W. Biederman wrote:
> > Christian Brauner <christian.brauner@ubuntu.com> writes:
> > 
> > > Now that it's possible to have a different set of uevents in different
> > > network namespaces, per-network namespace uevent sequence numbers are
> > > introduced. This increases performance as locking is now restricted to the
> > > network namespace affected by the uevent rather than locking
> > > everything.
> > 
> > Numbers please.  I personally expect that the netlink mc_list issues
> > will swamp any benefit you get from this.
> 
> I wouldn't see how this would be the case. The gist of this is:
> Everytime you send a uevent into a network namespace *not* owned by
> init_user_ns you currently *have* to take mutex_lock(uevent_sock_list)
> effectively blocking the host from processing uevents even though
> - the uevent you're receiving might be totally different from the
>   uevent that you're sending
> - the uevent socket of the non-init_user_ns owned network namespace
>   isn't even recorded in the list.
> 
> The other argument is that we now have properly isolated network
> namespaces wrt to uevents such that each netns can have its own set of
> uevents. This can either happen by a sufficiently privileged userspace
> process sending it uevents that are only dedicated to that specific
> netns. Or - and this *has been true for a long time* - because network
> devices are *properly namespaced*. Meaning a uevent for that network
> device is *tied to a network namespace*. For both cases the uevent
> sequence numbering will be absolutely misleading. For example, whenever
> you create e.g. a new veth device in a new network namespace it
> shouldn't be accounted against the initial network namespace but *only*
> against the network namespace that has that device added to it.

Eric, I did the testing. Here's what I did:

I compiled two 4.17-rc1 Kernels:
- one with per netns uevent seqnums with decoupled locking
- one without per netns uevent seqnums with decoupled locking

# Testcase 1:
Only Injecting Uevents into network namespaces not owned by the initial user
namespace.
- created 1000 new user namespace + network namespace pairs
- opened a uevent listener in each of those namespace pairs
- injected uevents into each of those network namespaces 10,000 times meaning
  10,000,000 (10 million) uevents were injected. (The high number of
  uevent injections should get rid of a lot of jitter.)
- Calculated the mean transaction time.
- *without* uevent sequence number namespacing:
  67 μs
- *with* uevent sequence number namespacing:
  55 μs
- makes a difference of 12 μs

# Testcase 2:
Injecting Uevents into network namespaces not owned by the initial user
namespace and network namespaces owned by the initial user namespace.
- created 500 new user namespace + network namespace pairs
- created 500 new network namespace pairs
- opened a uevent listener in each of those namespace pairs
- injected uevents into each of those network namespaces 10,000 times meaning
  10,000,000 (10 million) uevents were injected. (The high number of
  uevent injections should get rid of a lot of jitter.)
- Calculated the mean transaction time.
- *without* uevent sequence number namespacing:
  572 μs
- *with* uevent sequence number namespacing:
  514 μs
- makes a difference of 58 μs

So there's performance gain. The third case would be to create a bunch
of hanging processes that send SIGSTOP to themselves but do not actually
open a uevent socket in their respective namespaces and then inject
uevents into them. I expect there to be an even more performance
benefits since the rtnl_table_lock() isn't hit in this case because
there are no listeners.

Christian

^ permalink raw reply

* [PATCH net] tcp: don't read out-of-bounds opsize
From: Jann Horn @ 2018-04-20 13:57 UTC (permalink / raw)
  To: davem, kuznet, yoshfuji, netdev, linux-kernel, jannh

The old code reads the "opsize" variable from out-of-bounds memory (first
byte behind the segment) if a broken TCP segment ends directly after an
opcode that is neither EOL nor NOP.

The result of the read isn't used for anything, so the worst thing that
could theoretically happen is a pagefault; and since the physmap is usually
mostly contiguous, even that seems pretty unlikely.

The following C reproducer triggers the uninitialized read - however, you
can't actually see anything happen unless you put something like a
pr_warn() in tcp_parse_md5sig_option() to print the opsize.

====================================
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <stdlib.h>
#include <errno.h>
#include <stdarg.h>
#include <net/if.h>
#include <linux/if.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/in.h>
#include <linux/if_tun.h>
#include <err.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <assert.h>

void systemf(const char *command, ...) {
  char *full_command;
  va_list ap;
  va_start(ap, command);
  if (vasprintf(&full_command, command, ap) == -1)
    err(1, "vasprintf");
  va_end(ap);
  printf("systemf: <<<%s>>>\n", full_command);
  system(full_command);
}

char *devname;

int tun_alloc(char *name) {
  int fd = open("/dev/net/tun", O_RDWR);
  if (fd == -1)
    err(1, "open tun dev");
  static struct ifreq req = { .ifr_flags = IFF_TUN|IFF_NO_PI };
  strcpy(req.ifr_name, name);
  if (ioctl(fd, TUNSETIFF, &req))
    err(1, "TUNSETIFF");
  devname = req.ifr_name;
  printf("device name: %s\n", devname);
  return fd;
}

#define IPADDR(a,b,c,d) (((a)<<0)+((b)<<8)+((c)<<16)+((d)<<24))

void sum_accumulate(unsigned int *sum, void *data, int len) {
  assert((len&2)==0);
  for (int i=0; i<len/2; i++) {
    *sum += ntohs(((unsigned short *)data)[i]);
  }
}

unsigned short sum_final(unsigned int sum) {
  sum = (sum >> 16) + (sum & 0xffff);
  sum = (sum >> 16) + (sum & 0xffff);
  return htons(~sum);
}

void fix_ip_sum(struct iphdr *ip) {
  unsigned int sum = 0;
  sum_accumulate(&sum, ip, sizeof(*ip));
  ip->check = sum_final(sum);
}

void fix_tcp_sum(struct iphdr *ip, struct tcphdr *tcp) {
  unsigned int sum = 0;
  struct {
    unsigned int saddr;
    unsigned int daddr;
    unsigned char pad;
    unsigned char proto_num;
    unsigned short tcp_len;
  } fakehdr = {
    .saddr = ip->saddr,
    .daddr = ip->daddr,
    .proto_num = ip->protocol,
    .tcp_len = htons(ntohs(ip->tot_len) - ip->ihl*4)
  };
  sum_accumulate(&sum, &fakehdr, sizeof(fakehdr));
  sum_accumulate(&sum, tcp, tcp->doff*4);
  tcp->check = sum_final(sum);
}

int main(void) {
  int tun_fd = tun_alloc("inject_dev%d");
  systemf("ip link set %s up", devname);
  systemf("ip addr add 192.168.42.1/24 dev %s", devname);

  struct {
    struct iphdr ip;
    struct tcphdr tcp;
    unsigned char tcp_opts[20];
  } __attribute__((packed)) syn_packet = {
    .ip = {
      .ihl = sizeof(struct iphdr)/4,
      .version = 4,
      .tot_len = htons(sizeof(syn_packet)),
      .ttl = 30,
      .protocol = IPPROTO_TCP,
      /* FIXUP check */
      .saddr = IPADDR(192,168,42,2),
      .daddr = IPADDR(192,168,42,1)
    },
    .tcp = {
      .source = htons(1),
      .dest = htons(1337),
      .seq = 0x12345678,
      .doff = (sizeof(syn_packet.tcp)+sizeof(syn_packet.tcp_opts))/4,
      .syn = 1,
      .window = htons(64),
      .check = 0 /*FIXUP*/
    },
    .tcp_opts = {
      /* INVALID: trailing MD5SIG opcode after NOPs */
      1, 1, 1, 1, 1,
      1, 1, 1, 1, 1,
      1, 1, 1, 1, 1,
      1, 1, 1, 1, 19
    }
  };
  fix_ip_sum(&syn_packet.ip);
  fix_tcp_sum(&syn_packet.ip, &syn_packet.tcp);
  while (1) {
    int write_res = write(tun_fd, &syn_packet, sizeof(syn_packet));
    if (write_res != sizeof(syn_packet))
      err(1, "packet write failed");
  }
}
====================================

Fixes: cfb6eeb4c860 ("[TCP]: MD5 Signature Option (RFC2385) support.")
Signed-off-by: Jann Horn <jannh@google.com>
---
 net/ipv4/tcp_input.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 367def6ddeda..e51c644484dc 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3868,11 +3868,8 @@ const u8 *tcp_parse_md5sig_option(const struct tcphdr *th)
 	int length = (th->doff << 2) - sizeof(*th);
 	const u8 *ptr = (const u8 *)(th + 1);
 
-	/* If the TCP option is too short, we can short cut */
-	if (length < TCPOLEN_MD5SIG)
-		return NULL;
-
-	while (length > 0) {
+	/* If not enough data remaining, we can short cut */
+	while (length >= TCPOLEN_MD5SIG) {
 		int opcode = *ptr++;
 		int opsize;
 
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related

* [PATCH] [net] ipv6: sr: fix NULL pointer dereference in seg6_do_srh_encap()- v4 pkts
From: Ahmed Abdelsalam @ 2018-04-20 13:58 UTC (permalink / raw)
  To: davem, dlebrun, kuznet, yoshfuji, netdev, linux-kernel; +Cc: amsalam20

In case of seg6 in encap mode, seg6_do_srh_encap() calls set_tun_src()
in order to set the src addr of outer IPv6 header.

The net_device is required for set_tun_src(). However calling ip6_dst_idev()
on dst_entry in case of IPv4 traffic results on the following bug.

Using just dst->dev should fix this BUG.

[  196.242461] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  196.242975] PGD 800000010f076067 P4D 800000010f076067 PUD 10f060067 PMD 0
[  196.243329] Oops: 0000 [#1] SMP PTI
[  196.243468] Modules linked in: nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd input_leds glue_helper led_class pcspkr serio_raw mac_hid video autofs4 hid_generic usbhid hid e1000 i2c_piix4 ahci pata_acpi libahci
[  196.244362] CPU: 2 PID: 1089 Comm: ping Not tainted 4.16.0+ #1
[  196.244606] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  196.244968] RIP: 0010:seg6_do_srh_encap+0x1ac/0x300
[  196.245236] RSP: 0018:ffffb2ce00b23a60 EFLAGS: 00010202
[  196.245464] RAX: 0000000000000000 RBX: ffff8c7f53eea300 RCX: 0000000000000000
[  196.245742] RDX: 0000f10000000000 RSI: ffff8c7f52085a6c RDI: ffff8c7f41166850
[  196.246018] RBP: ffffb2ce00b23aa8 R08: 00000000000261e0 R09: ffff8c7f41166800
[  196.246294] R10: ffffdce5040ac780 R11: ffff8c7f41166828 R12: ffff8c7f41166808
[  196.246570] R13: ffff8c7f52085a44 R14: ffffffffb73211c0 R15: ffff8c7e69e44200
[  196.246846] FS:  00007fc448789700(0000) GS:ffff8c7f59d00000(0000) knlGS:0000000000000000
[  196.247286] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  196.247526] CR2: 0000000000000000 CR3: 000000010f05a000 CR4: 00000000000406e0
[  196.247804] Call Trace:
[  196.247972]  seg6_do_srh+0x15b/0x1c0
[  196.248156]  seg6_output+0x3c/0x220
[  196.248341]  ? prandom_u32+0x14/0x20
[  196.248526]  ? ip_idents_reserve+0x6c/0x80
[  196.248723]  ? __ip_select_ident+0x90/0x100
[  196.248923]  ? ip_append_data.part.50+0x6c/0xd0
[  196.249133]  lwtunnel_output+0x44/0x70
[  196.249328]  ip_send_skb+0x15/0x40
[  196.249515]  raw_sendmsg+0x8c3/0xac0
[  196.249701]  ? _copy_from_user+0x2e/0x60
[  196.249897]  ? rw_copy_check_uvector+0x53/0x110
[  196.250106]  ? _copy_from_user+0x2e/0x60
[  196.250299]  ? copy_msghdr_from_user+0xce/0x140
[  196.250508]  sock_sendmsg+0x36/0x40
[  196.250690]  ___sys_sendmsg+0x292/0x2a0
[  196.250881]  ? _cond_resched+0x15/0x30
[  196.251074]  ? copy_termios+0x1e/0x70
[  196.251261]  ? _copy_to_user+0x22/0x30
[  196.251575]  ? tty_mode_ioctl+0x1c3/0x4e0
[  196.251782]  ? _cond_resched+0x15/0x30
[  196.251972]  ? mutex_lock+0xe/0x30
[  196.252152]  ? vvar_fault+0xd2/0x110
[  196.252337]  ? __do_fault+0x1f/0xc0
[  196.252521]  ? __handle_mm_fault+0xc1f/0x12d0
[  196.252727]  ? __sys_sendmsg+0x63/0xa0
[  196.252919]  __sys_sendmsg+0x63/0xa0
[  196.253107]  do_syscall_64+0x72/0x200
[  196.253305]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  196.253530] RIP: 0033:0x7fc4480b0690
[  196.253715] RSP: 002b:00007ffde9f252f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[  196.254053] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 00007fc4480b0690
[  196.254331] RDX: 0000000000000000 RSI: 000000000060a360 RDI: 0000000000000003
[  196.254608] RBP: 00007ffde9f253f0 R08: 00000000002d1e81 R09: 0000000000000002
[  196.254884] R10: 00007ffde9f250c0 R11: 0000000000000246 R12: 0000000000b22070
[  196.255205] R13: 20c49ba5e353f7cf R14: 431bde82d7b634db R15: 00007ffde9f278fe
[  196.255484] Code: a5 0f b6 45 c0 41 88 41 28 41 0f b6 41 2c 48 c1 e0 04 49 8b 54 01 38 49 8b 44 01 30 49 89 51 20 49 89 41 18 48 8b 83 b0 00 00 00 <48> 8b 30 49 8b 86 08 0b 00 00 48 8b 40 20 48 8b 50 08 48 0b 10
[  196.256190] RIP: seg6_do_srh_encap+0x1ac/0x300 RSP: ffffb2ce00b23a60
[  196.256445] CR2: 0000000000000000
[  196.256676] ---[ end trace 71af7d093603885c ]---

Fixes: 8936ef7604c11 ipv6: sr: fix NULL pointer dereference when setting encap source address
Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
---
I tested the patch for IPv6 and IPv4 traffic 

 net/ipv6/seg6_iptunnel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c
index f343e6f..5fe1394 100644
--- a/net/ipv6/seg6_iptunnel.c
+++ b/net/ipv6/seg6_iptunnel.c
@@ -136,7 +136,7 @@ int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto)
 	isrh->nexthdr = proto;
 
 	hdr->daddr = isrh->segments[isrh->first_segment];
-	set_tun_src(net, ip6_dst_idev(dst)->dev, &hdr->daddr, &hdr->saddr);
+	set_tun_src(net, dst->dev, &hdr->daddr, &hdr->saddr);
 
 #ifdef CONFIG_IPV6_SEG6_HMAC
 	if (sr_has_hmac(isrh)) {
-- 
2.1.4

^ permalink raw reply related

* Re: Q: force netif ON even when there is no real link ?
From: Ran Shalit @ 2018-04-20 14:01 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev
In-Reply-To: <CAJ2oMhKCq73QSPH7zWCBMbxOMxcij9vE+ovdhN=ueEBKXsZabA@mail.gmail.com>

On Fri, Apr 20, 2018 at 3:14 PM, Ran Shalit <ranshalit@gmail.com> wrote:
> On Fri, Apr 20, 2018 at 3:05 PM, Andrew Lunn <andrew@lunn.ch> wrote:
>> On Fri, Apr 20, 2018 at 03:01:09PM +0300, Ran Shalit wrote:
>>> On Fri, Apr 20, 2018 at 2:55 PM, Andrew Lunn <andrew@lunn.ch> wrote:
>>> > On Fri, Apr 20, 2018 at 11:44:14AM +0300, Ran Shalit wrote:
>>> >> Hello,
>>> >>
>>> >> We configure external switch in u-boot.
>>> >> The configuration is through mdio (cpu is mac and switch is phy).
>>> >>
>>> >> But in Linux we rather not implement any communication in mdio to
>>> >> switch, but it means that we then don't have the information of link
>>> >> state.
>>> >>
>>> >> Is it possible to force in Linux (by default in startup) Ethernet
>>> >> connectivity (netif_carrier_on, netif_wake_queue) even if there is no
>>> >> information of real link state ?
>>> >
>>> > Hi Ran
>>> >
>>> > Use a fixed-phy.
>>> >
>>>
>>> Hi Andrew,
>>>
>>> I'll check about fixed phy,
>>> but in general, is it a problem to have always netif_carrier_on, even
>>> when there is no link ?
>>
>> The link between the CPU and the switch should be up all the
>> time. That is the point of fixed-link.
>>
>
> I understand.
> But what about the mac driver,  does it just do netif_start_queue ?
>

By saying "mac driver", I mean Ethernet driver with fixed phy.

Regards,
Ranran

> Thanks
>
>
>>     Andrew

^ permalink raw reply

* Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Michael S. Tsirkin @ 2018-04-20 14:12 UTC (permalink / raw)
  To: Jason Wang
  Cc: Tiwei Bie, alex.williamson, ddutile, alexander.h.duyck,
	virtio-dev, linux-kernel, kvm, virtualization, netdev, dan.daly,
	cunming.liang, zhihong.wang, jianfeng.tan, xiao.w.wang
In-Reply-To: <060e2b5f-2e93-c53f-387b-5baaa33e87cd@redhat.com>

On Fri, Apr 20, 2018 at 11:52:47AM +0800, Jason Wang wrote:
> > The biggest issue is that you let userspace poke at the
> > device which is also allowed by the IOMMU to poke at
> > kernel memory (needed for kernel driver to work).
> 
> I don't quite get. The userspace driver could be built on top of VFIO for
> sure. So kernel memory were perfectly isolated in this case.

VFIO does what it can but it mostly just has the IOMMU to play with.
So don't overestimate what it can do - it assumes a high level
of spec compliance for protections to work. For example,
ATS is enabled by default if device has it, and that
treats translated requests are trusted. FLS is assumed to reset
the device for when VFIO is unbound from the device. etc.


> > 
> > Yes, maybe if device is not buggy it's all fine, but
> > it's better if we do not have to trust the device
> > otherwise the security picture becomes more murky.
> > 
> > I suggested attaching a PASID to (some) queues - see my old post "using
> > PASIDs to enable a safe variant of direct ring access".
> > 
> > Then using IOMMU with VFIO to limit access through queue to corrent
> > ranges of memory.
> 
> Well userspace driver could benefit from this too. And we can even go
> further by using nested IO page tables to share IOVA address space between
> devices and a VM.
> 
> Thanks

Yes I suggested this separately.

-- 
MST

^ permalink raw reply

* Re: [PATCH net] tcp: don't read out-of-bounds opsize
From: Eric Dumazet @ 2018-04-20 14:21 UTC (permalink / raw)
  To: Jann Horn, davem, kuznet, yoshfuji, netdev, linux-kernel
In-Reply-To: <20180420135730.44921-1-jannh@google.com>



On 04/20/2018 06:57 AM, Jann Horn wrote:
> The old code reads the "opsize" variable from out-of-bounds memory (first
> byte behind the segment) if a broken TCP segment ends directly after an
> opcode that is neither EOL nor NOP.
> 
> The result of the read isn't used for anything, so the worst thing that
> could theoretically happen is a pagefault; and since the physmap is usually
> mostly contiguous, even that seems pretty unlikely.
>

No page fault possible, because tcp headers are in skb->head

And we have 'struct skb_shared_info'  at the end of skb->head anyway.

But, yes, reading some extra bytes with random content is possible.

^ permalink raw reply

* Re: [PATCH net-next] net: phy: mdio-boardinfo: Allow recursive mdiobus_register()
From: David Miller @ 2018-04-20 14:34 UTC (permalink / raw)
  To: andrew; +Cc: netdev, f.fainelli, vivien.didelot
In-Reply-To: <1524096047-16823-1-git-send-email-andrew@lunn.ch>

From: Andrew Lunn <andrew@lunn.ch>
Date: Thu, 19 Apr 2018 02:00:47 +0200

> mdiobus_register will search for any mdiobus board info registered for
> the bus being registered. If found, it will probe devices on the bus.
> That device, if for example it is an ethernet switch, may then try to
> register an mdio bus. Thus we need to allow recursive calls to
> mdiobus_register.
> 
> Holding the mdio_board_lock will cause a deadlock during this
> recursion. Release the lock and use list_for_each_entry_safe.
> 
> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

Applied.

While looking over this code I see that we currently never unregister
mdio boardinfo objects.

If we have drivers that can be unloaded, as it seems the one you plan
to add that needs this change should be, the situation could get more
tricky here.

^ permalink raw reply

* Re: [PATCH] net: net_cls: remove a NULL check for css_cls_state
From: David Miller @ 2018-04-20 14:37 UTC (permalink / raw)
  To: lirongqing; +Cc: netdev
In-Reply-To: <1524113961-30166-1-git-send-email-lirongqing@baidu.com>

From: Li RongQing <lirongqing@baidu.com>
Date: Thu, 19 Apr 2018 12:59:21 +0800

> The input of css_cls_state() is impossible to NULL except
> cgrp_css_online, so simplify it
> 
> Signed-off-by: Li RongQing <lirongqing@baidu.com>

I don't view this as an improvement.  Just let the helper always check
NULL and that way there are less situations to audit.

And it's not like this is a critical fast path either.

I'm not applying this, sorry.

^ permalink raw reply

* Re: [PATCH] [net] ipv6: sr: fix NULL pointer dereference in seg6_do_srh_encap()- v4 pkts
From: David Lebrun @ 2018-04-20 14:38 UTC (permalink / raw)
  To: Ahmed Abdelsalam, davem, dlebrun, kuznet, yoshfuji, netdev,
	linux-kernel
In-Reply-To: <1524232685-1203-1-git-send-email-amsalam20@gmail.com>

On 04/20/2018 02:58 PM, Ahmed Abdelsalam wrote:
> In case of seg6 in encap mode, seg6_do_srh_encap() calls set_tun_src()
> in order to set the src addr of outer IPv6 header.
> 
> The net_device is required for set_tun_src(). However calling ip6_dst_idev()
> on dst_entry in case of IPv4 traffic results on the following bug.
> 
> Using just dst->dev should fix this BUG.
> 

Good catch, thanks for spotting this. If you actually tested your fix 
with IPv4 and IPv6 traffic, you should mention it in the commit message. 
Your current formulation suggests that you just guessed a fix without 
testing.

> 
> Fixes: 8936ef7604c11 ipv6: sr: fix NULL pointer dereference when setting encap source address
> Signed-off-by: Ahmed Abdelsalam<amsalam20@gmail.com>

Acked-by: David Lebrun <dlebrun@google.com>

^ permalink raw reply

* Re: [PATCH] net: qrtr: Expose tunneling endpoint to user space
From: David Miller @ 2018-04-20 14:40 UTC (permalink / raw)
  To: bjorn.andersson; +Cc: linux-kernel, netdev, linux-arm-msm, clew
In-Reply-To: <20180419050346.17054-1-bjorn.andersson@linaro.org>

From: Bjorn Andersson <bjorn.andersson@linaro.org>
Date: Wed, 18 Apr 2018 22:03:46 -0700

> +struct qrtr_tun {
> +	struct qrtr_endpoint ep;
> +
> +	struct mutex queue_lock;
> +	struct sk_buff_head queue;
> +	wait_queue_head_t readq;
> +};

The queue lock is surperfluous.  sk_buff_head and all of the helpers you
are using does it's own locking.  So you are essentially using two sets
of locks to protect the same object.

^ permalink raw reply

* Re: [PATCH] [net] ipv6: sr: fix NULL pointer dereference in seg6_do_srh_encap()- v4 pkts
From: Ahmed Abdelsalam @ 2018-04-20 14:46 UTC (permalink / raw)
  To: David Lebrun; +Cc: davem, dlebrun, kuznet, yoshfuji, netdev, linux-kernel
In-Reply-To: <3627f25a-0c47-6428-aa76-5baf96993a4c@gmail.com>

On Fri, 20 Apr 2018 15:38:08 +0100
David Lebrun <dav.lebrun@gmail.com> wrote:

> On 04/20/2018 02:58 PM, Ahmed Abdelsalam wrote:
> > In case of seg6 in encap mode, seg6_do_srh_encap() calls set_tun_src()
> > in order to set the src addr of outer IPv6 header.
> > 
> > The net_device is required for set_tun_src(). However calling ip6_dst_idev()
> > on dst_entry in case of IPv4 traffic results on the following bug.
> > 
> > Using just dst->dev should fix this BUG.
> > 
> 
> Good catch, thanks for spotting this. If you actually tested your fix 
> with IPv4 and IPv6 traffic, you should mention it in the commit message. 
> Your current formulation suggests that you just guessed a fix without 
> testing.
> 

Yes, I did two tests for both IPv4 and IPv6.
Sorry for this Language Bug. 

> > 
> > Fixes: 8936ef7604c11 ipv6: sr: fix NULL pointer dereference when setting encap source address
> > Signed-off-by: Ahmed Abdelsalam<amsalam20@gmail.com>
> 
> Acked-by: David Lebrun <dlebrun@google.com>


-- 
Ahmed Abdelsalam <amsalam20@gmail.com>

^ permalink raw reply

* Re: [PATCH net-next v2 0/3] ave: fix the activation issues for some UniPhier SoCs
From: David Miller @ 2018-04-20 14:50 UTC (permalink / raw)
  To: hayashi.kunihiko
  Cc: netdev, andrew, f.fainelli, robh+dt, mark.rutland,
	linux-arm-kernel, linux-kernel, devicetree, yamada.masahiro,
	masami.hiramatsu, jaswinder.singh
In-Reply-To: <1524122695-19597-1-git-send-email-hayashi.kunihiko@socionext.com>

From: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
Date: Thu, 19 Apr 2018 16:24:52 +0900

> This add the following stuffs to fix the activation issues and satisfy
> requirements for AVE ethernet driver implemented on some UniPhier SoCs.
> 
> - Add support for additional necessary clocks and resets, because the kernel
>   is stalled on Pro4 due to lack of them.
> 
> - Check whether the SoC supports the specified phy-mode
> 
> - Add DT property support indicating system controller that has the feature
>   for configurating phy-mode including built-in phy on LD11.
> 
> v1: https://www.spinics.net/lists/netdev/msg494904.html
> 
> Changes since v1:
> - Add 'Reviewed-by' lines

Series applied to net-next, thank you.

^ permalink raw reply

* Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
From: Rahul Lakkireddy @ 2018-04-20 14:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Dave Young, netdev@vger.kernel.org, kexec@lists.infradead.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Indranil Choudhury, Nirranjan Kirubaharan,
	stephen@networkplumber.org, Ganesh GR, akpm@linux-foundation.org,
	torvalds@linux-foundation.org, davem@davemloft.net,
	viro@zeniv.linux.org.uk
In-Reply-To: <87po2uhueu.fsf@xmission.com>

On Friday, April 04/20/18, 2018 at 19:06:09 +0530, Eric W. Biederman wrote:
> Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:
> 
> > On Thursday, April 04/19/18, 2018 at 20:23:37 +0530, Eric W. Biederman wrote:
> >> Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:
> >> 
> >> > On Thursday, April 04/19/18, 2018 at 07:10:30 +0530, Dave Young wrote:
> >> >> On 04/18/18 at 06:01pm, Rahul Lakkireddy wrote:
> >> >> > On Wednesday, April 04/18/18, 2018 at 11:45:46 +0530, Dave Young wrote:
> >> >> > > Hi Rahul,
> >> >> > > On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
> >> >> > > > On production servers running variety of workloads over time, kernel
> >> >> > > > panic can happen sporadically after days or even months. It is
> >> >> > > > important to collect as much debug logs as possible to root cause
> >> >> > > > and fix the problem, that may not be easy to reproduce. Snapshot of
> >> >> > > > underlying hardware/firmware state (like register dump, firmware
> >> >> > > > logs, adapter memory, etc.), at the time of kernel panic will be very
> >> >> > > > helpful while debugging the culprit device driver.
> >> >> > > > 
> >> >> > > > This series of patches add new generic framework that enable device
> >> >> > > > drivers to collect device specific snapshot of the hardware/firmware
> >> >> > > > state of the underlying device in the crash recovery kernel. In crash
> >> >> > > > recovery kernel, the collected logs are added as elf notes to
> >> >> > > > /proc/vmcore, which is copied by user space scripts for post-analysis.
> >> >> > > > 
> >> >> > > > The sequence of actions done by device drivers to append their device
> >> >> > > > specific hardware/firmware logs to /proc/vmcore are as follows:
> >> >> > > > 
> >> >> > > > 1. During probe (before hardware is initialized), device drivers
> >> >> > > > register to the vmcore module (via vmcore_add_device_dump()), with
> >> >> > > > callback function, along with buffer size and log name needed for
> >> >> > > > firmware/hardware log collection.
> >> >> > > 
> >> >> > > I assumed the elf notes info should be prepared while kexec_[file_]load
> >> >> > > phase. But I did not read the old comment, not sure if it has been discussed
> >> >> > > or not.
> >> >> > > 
> >> >> > 
> >> >> > We must not collect dumps in crashing kernel. Adding more things in
> >> >> > crash dump path risks not collecting vmcore at all. Eric had
> >> >> > discussed this in more detail at:
> >> >> > 
> >> >> > https://lkml.org/lkml/2018/3/24/319
> >> >> > 
> >> >> > We are safe to collect dumps in the second kernel. Each device dump
> >> >> > will be exported as an elf note in /proc/vmcore.
> >> >> 
> >> >> I understand that we should avoid adding anything in crash path.  And I also
> >> >> agree to collect device dump in second kernel.  I just assumed device
> >> >> dump use some memory area to store the debug info and the memory
> >> >> is persistent so that this can be done in 2 steps, first register the
> >> >> address in elf header in kexec_load, then collect the dump in 2nd
> >> >> kernel.  But it seems the driver is doing some other logic to collect
> >> >> the info instead of just that simple like I thought. 
> >> >> 
> >> >
> >> > It seems simpler, but I'm concerned with waste of memory area, if
> >> > there are no device dumps being collected in second kernel. In
> >> > approach proposed in these series, we dynamically allocate memory
> >> > for the device dumps from second kernel's available memory.
> >> 
> >> Don't count that kernel having more than about 128MiB.
> >> 
> >
> > If large dump is expected, Administrator can increase the memory
> > allocated to the second kernel (using crashkernel boot param), to
> > ensure device dumps get collected.
> 
> Except 128MiB is already a already a huge amount to reserve.  I
> typically have run crash dumps with 16MiB of memory and thought it was
> overkill.  Looking below 32MiB seems a bit high but it is small enough
> that it is still doable.  I am baffled at how 2GiB can be guaranteed to fit
> in 32MiB (sparse register space?) but if it works reliably.
> 

Yes, we skip portions in on-chip memory dump that do not add to debug
value (such as the large regions reserved for holding Payload data
going through the device) and hence the overall dump size reduces
significantly.

> >> For that reason if for no other it would be nice if it was possible to
> >> have the driver to not initialize the device and just stand there
> >> handing out the data a piece at a time as it is read from /proc/vmcore.
> >> 
> >
> > Since cxgb4 is a network driver, it can be used to transfer the dumps
> > over the network. So we must ensure the dumps get collected and
> > stored, before device gets initialized to transfer dumps over
> > the network.
> 
> Good point.  For some reason I was thinking it was an infiniband and not
> an 10GiB ethernet device.
> 
> >> The 2GiB number I read earlier concerns me for working in a limited
> >> environment.
> >> 
> >
> > All dumps, including the 2GB on-chip memory dump, is compressed by
> > the cxgb4 driver as they are collected. The overall compressed dump
> > comes out at max 32 MB.
> >
> >> It might even make sense to separate this into a completely separate
> >> module (depended upon the main driver if it makes sense to share
> >> the functionality) so that people performing crash dumps would not
> >> hesitate to include the code in their initramfs images.
> >> 
> >> I can see splitting a device up into a portion only to be used in case
> >> of a crash dump and a normal portion like we do for main memory but I
> >> doubt that makes sense in practice.
> >> 
> >
> > This is not required, especially in case of network drivers, which
> > must collect underlying device dump and initialize the device to
> > transfer dumps over the network.
> 
> I have a practical concern.  What happens if the previous kernel left
> the device in such a bad stat the driver can not successfully initialize
> it.
> 
> Does failure to initialize cxgb4 after a crash now mean that you can not
> capture the crash dump to see the crazy state the device was in?
> 
> Typically the initramfs for a crash dump does not include unnecessary
> drivers so that hardware in states the drivers can't handle won't
> prevent taking a crash dump.
> 
> I understand the issue if you are taking a dump over your 10GiB ethernet
> it is a moot point.  But if you are writing your dump to disk, or
> writing it over a management gigabit ethernet then it is still an issue.
> 
> Is there a decoupling so that a totally b0rked device can't prevent
> taking it's own dump?
> 

As long as we can safely map and access the BAR registers, we
can collect the dumps regardless of whatever state the device and
firmware were left in and store it as part of /proc/vmcore. After
that, we attempt to re-initialize the device and restart the
firmware. So, even if driver initialization fails at this point,
we still have the dumps as part of vmcore.

Thanks,
Rahul

^ permalink raw reply

* Re: [PATCH 1/8] arm: shmobile: Change platform dependency to ARCH_RENESAS
From: Sergei Shtylyov @ 2018-04-20 14:53 UTC (permalink / raw)
  To: Geert Uytterhoeven, Simon Horman, Magnus Damm, Russell King,
	Catalin Marinas, Will Deacon, Dan Williams, Vinod Koul,
	Mauro Carvalho Chehab, David S . Miller, Greg Kroah-Hartman,
	Liam Girdwood, Mark Brown, Jaroslav Kysela, Takashi Iwai,
	Arnd Bergmann, Kuninori Morimoto, Laurent Pinchart
  Cc: linux-renesas-soc, linux-arm-kernel, dmaengine, linux-media,
	netdev, devel, alsa-devel, linux-kernel
In-Reply-To: <1524230914-10175-2-git-send-email-geert+renesas@glider.be>

On 04/20/2018 04:28 PM, Geert Uytterhoeven wrote:

> Since commit 9b5ba0df4ea4f940 ("ARM: shmobile: Introduce ARCH_RENESAS")
> is ARCH_RENESAS a more appropriate platform dependency than the legacy

   "ARCH_RENESAS is", no?

> ARCH_SHMOBILE, hence use the former.
> 
> This will allow to drop ARCH_SHMOBILE on ARM in the near future.
> 
> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
[...]

MBR, Sergei

^ permalink raw reply

* Re: [PATCH net-next] liquidio: Added ndo_get_vf_stats support
From: David Miller @ 2018-04-20 14:54 UTC (permalink / raw)
  To: felix.manlunas
  Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla,
	intiyaz.basha
In-Reply-To: <20180419061828.GA4292@felix-thinkpad.cavium.com>

From: Felix Manlunas <felix.manlunas@cavium.com>
Date: Wed, 18 Apr 2018 23:18:28 -0700

> From: Intiyaz Basha <intiyaz.basha@cavium.com>
> 
> Added the ndo to gather VF statistics through the PF.
> 
> Collect VF statistics via mailbox from VF.
> 
> Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
> Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>

Applied, thank you.

^ permalink raw reply

* Re: [virtio-dev] [pci PATCH v7 2/5] virtio_pci: Add support for unmanaged SR-IOV on virtio_pci devices
From: Alexander Duyck @ 2018-04-20 14:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Daly, Dan, Bjorn Helgaas, Duyck, Alexander H, linux-pci,
	virtio-dev, kvm, Netdev, LKML, linux-nvme, Keith Busch, netanel,
	Don Dutile, Maximilian Heyne, Wang, Liang-min, Rustad, Mark D,
	David Woodhouse, Christoph Hellwig, dwmw
In-Reply-To: <20180420030640-mutt-send-email-mst@kernel.org>

On Thu, Apr 19, 2018 at 5:40 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Apr 03, 2018 at 12:06:03PM -0700, Alexander Duyck wrote:
>> On Tue, Apr 3, 2018 at 11:27 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Tue, Apr 03, 2018 at 10:32:00AM -0700, Alexander Duyck wrote:
>> >> On Tue, Apr 3, 2018 at 6:12 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> > On Fri, Mar 16, 2018 at 09:40:34AM -0700, Alexander Duyck wrote:
>> >> >> On Fri, Mar 16, 2018 at 9:34 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> >> > On Thu, Mar 15, 2018 at 11:42:41AM -0700, Alexander Duyck wrote:
>> >> >> >> From: Alexander Duyck <alexander.h.duyck@intel.com>
>> >> >> >>
>> >> >> >> Hardware-realized virtio_pci devices can implement SR-IOV, so this
>> >> >> >> patch enables its use. The device in question is an upcoming Intel
>> >> >> >> NIC that implements both a virtio_net PF and virtio_net VFs. These
>> >> >> >> are hardware realizations of what has been up to now been a software
>> >> >> >> interface.
>> >> >> >>
>> >> >> >> The device in question has the following 4-part PCI IDs:
>> >> >> >>
>> >> >> >> PF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 15fe
>> >> >> >> VF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 05fe
>> >> >> >>
>> >> >> >> The patch currently needs no check for device ID, because the callback
>> >> >> >> will never be made for devices that do not assert the capability or
>> >> >> >> when run on a platform incapable of SR-IOV.
>> >> >> >>
>> >> >> >> One reason for this patch is because the hardware requires the
>> >> >> >> vendor ID of a VF to be the same as the vendor ID of the PF that
>> >> >> >> created it. So it seemed logical to simply have a fully-functioning
>> >> >> >> virtio_net PF create the VFs. This patch makes that possible.
>> >> >> >>
>> >> >> >> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> >> >> >> Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
>> >> >> >> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> >> >> >
>> >> >> > So if and when virtio PFs can manage the VFs, then we can
>> >> >> > add a feature bit for that?
>> >> >> > Seems reasonable.
>> >> >>
>> >> >> Yes. If nothing else you may not even need a feature bit depending on
>> >> >> how things go.
>> >> >
>> >> > OTOH if the interface is changed in an incompatible way,
>> >> > and old Linux will attempt to drive the new device
>> >> > since there is no check.
>> >> >
>> >> > I think we should add a feature bit right away.
>> >>
>> >> I'm not sure why you would need a feature bit. The capability is
>> >> controlled via PCI configuration space. If it is present the device
>> >> has the capability. If it is not then it does not.
>> >>
>> >> Basically if the PCI configuration space is not present then the sysfs
>> >> entries will not be spawned and nothing will attempt to use this
>> >> function.
>> >>
>> >> - ALex
>> >
>> > It's about compability with older guests which ignore the
>> > capability.
>> >
>> > The feature is thus helpful so host knows whether guest supports VFs.
>>
>> The thing is if the capability is ignored then the feature isn't used.
>> So for SR-IOV it isn't an uncommon thing for there to be drivers for
>> the PF floating around that do not support SR-IOV. In such cases
>> SR-IOV just isn't used while the hardware could support it.
>
> Right but how come there are VF drivers but PF driver does not
> know about these?

I'm not sure what you mean here. The VF and PF drivers are the same
driver. The only difference is that the PF has the extra SR-IOV
configuration space.

What this code is meant to enable is a form of SR-IOV where the VFs
are essentially pre-allocated resources. So for example in our case
the MMIO space is identical for a PF versus any of the VFs. It doesn't
have any special controls in place to allow the PF to manipulate any
of the resources belonging to the VFs.

> And are there PF drivers that intentially do not enable SRIOV
> because it's known to be broken in some way?

In the Virtio IO case right now are there any devices that support
SR-IOV? For now this is just an add-on bit to a function that is
already emulating the Virtio in hardware.

> Case in point I do think virtio want to limit this
> depending on a feature bit on general principles
> (the principle being that all extensions have feature bits).

This part has me kind of scratching my head. In our setup the "PF" is
really nothing more than a "VF" with the SR-IOV configuration space
attached to it. There are already examples of similar designs for NVMe
and the Amazon ENA devices. Giving the "PF" any functionality in MMIO
space that controls the SR-IOV kind of defeats the whole point of
allowing this function in the first place. Basically the PF isn't
really controlling things, it is the kernel that is doing it.

> There are security implications here - we previously relied on
> whitelisting after all.

Yes and no. The original patch set had issues as you could have a PF
assigned to user space and the VFs managed by the host. When I changed
things so that the function had to be in a kernel driver that issue
went away.

> Wouldn't it be safer to be a bit more careful and update the
> actual PF drivers? It's just one line per driver, but it
> can be done with an ack by driver maintainer.
> If/once we find out all drivers do have it, we can then
> change the default.

I have no clue what you are talking about here. This is the more
careful approach. Are you sure you are reviewing the v7 of the
patches?

My understanding is that no paravirtual interfaces currently expose
SR-IOV. What we are looking at is hardware will want to emulate
Virtio, specifically virtio_net in the future and as a part of that
the PF ends up emulating it as well. What we would need to watch for
going forward is that any device that enables SR-IOV support would
need to also provide a 4 tuple ID so that if something goes wrong with
it we could disable SR-IOV on the device via a PCI quirk later.

>> I would think in the case of virtio it would be the same kind of
>> thing. Basically if SR-IOV is supported by the host then the
>> capability would be present. If SR-IOV is supported by the guest then
>> it would make use of the capability to spawn VFs. If either the
>> capability isn't present, or the driver doesn't use it then you won't
>> be able to spawn VFs in the guest.
>
>> Maybe I am missing something. Do you support dynamically changing the
>> PCI configuration space for Virtio devices based on the presence of
>> feature bits provided by the guest?
>
> No. The point is that IMHO at least virtio - in absence of feature bit -
> to ignore VFs rather than assume they are safe to drive
> in an unmanaged way.
>
>> Also are you saying this patch set should wait on the feature bit to
>> be added, or are you talking about doing this as some sort of
>> follow-up?
>>
>> - Alex
>
> I think for virtio it should include the feature bit, yes.
> Adding feature bit is very easy - post a patch to the virtio TC mailing
> list, wait about a week to give people time to respond (two weeks if it
> is around holidays and such).

The problem is we are talking about hardware/FPGA, not software.
Adding a feature bit means going back and updating RTL. The software
side of things is easy, re-validating things after a hardware/FPGA
change not so much.

If this is a hard requirement I may just drop the virtio patch, push
what I have, and leave it to Mark/Dan to deal with the necessary RTL
and code changes needed to support Virtio as I don't expect the
turnaround to be as easy as just a patch.

Thanks.

- Alex

^ permalink raw reply

* Re: [net-next 0/3] tipc: Confgiuration of MTU for media UDP
From: David Miller @ 2018-04-20 15:04 UTC (permalink / raw)
  To: mohan.krishna.ghanta.krishnamurthy
  Cc: tipc-discussion, jon.maloy, maloy, ying.xue, netdev
In-Reply-To: <1524128780-2550-1-git-send-email-mohan.krishna.ghanta.krishnamurthy@ericsson.com>

From: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
Date: Thu, 19 Apr 2018 11:06:17 +0200

> Systematic measurements have shown that an emulated MTU of 14k for
> UDP bearers is the optimal value for maximal throughput. Accordingly,
> the default MTU of UDP bearers is changed to 14k.
> 
> We also provide users with a fallback option from this value,
> by providing support to configure MTU for UDP bearers. The following
> options are introduced which are symmetrical to the design of
> confguring link tolerance.
> 
> - Configure media with new MTU value, which will take effect on
> links going up after the moment it was configured. Alternatively,
> the bearer has to be disabled and re-enabled, for existing links to
> reflect the configured value.
> 
> - Configure bearer with new MTU value, which take effect on 
> running links dynamically.
> 
> Please note:
> - User has to change MTU at both endpoints, otherwise the link 
> will fall back to smallest MTU after a reset.
> - Failover from a link with higher MTU to a link with lower MTU

There are many negatives to using UDP in a way which causes
fragmentation, like this code now does.

But whatever, you guys can do whatever you want and get to keep the
pieces I guess :-)

Series applied.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox