Netdev List

Netdev List
 help / color / mirror / Atom feed

* Packet corrupts to guest system system from host kernel 4.16.x
From: Алексей Болдырев @ 2018-04-06  0:49 UTC (permalink / raw)
  To: qemu-discuss, netdev

Why, when using the 4.16 kernel on the host system, is there such a strange behavior of virtual machines? What is the problem? He rolled back to 4.9.c - the problem was gone.

tcpdump on router:
tcpdump: listening on vlan-00110013, link-type EN10MB (Ethernet), capture size 262144 bytes
23:59:08.331875 52:54:00:38:cf:94 > 52:54:00:4e:9c:5f, ethertype ARP (0x0806), length 38: [|ARP]
	0x0000:  0001 0800 0604 0001 5254 0038 cf94 c0a8  ........RT.8....
	0x0010:  b402 0000 0000 0000                      ........
23:59:08.454364 52:54:00:4e:9c:5f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.180.2 tell 192.168.180.1, length 28
23:59:08.454754 52:54:00:38:cf:94 > 52:54:00:4e:9c:5f, ethertype ARP (0x0806), length 38: [|ARP]
	0x0000:  0001 0800 0604 0002 5254 0038 cf94 c0a8  ........RT.8....
	0x0010:  b402 5254 004e 9c5f                      ..RT.N._
23:59:09.462383 52:54:00:4e:9c:5f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.180.2 tell 192.168.180.1, length 28
23:59:09.463607 52:54:00:38:cf:94 > 52:54:00:4e:9c:5f, ethertype ARP (0x0806), length 38: [|ARP]
	0x0000:  0001 0800 0604 0002 5254 0038 cf94 c0a8  ........RT.8....
	0x0010:  b402 5254 004e 9c5f                      ..RT.N._
23:59:10.486352 52:54:00:4e:9c:5f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.180.2 tell 192.168.180.1, length 28
23:59:10.487303 52:54:00:38:cf:94 > 52:54:00:4e:9c:5f, ethertype ARP (0x0806), length 38: [|ARP]
	0x0000:  0001 0800 0604 0002 5254 0038 cf94 c0a8  ........RT.8....
	0x0010:  b402 5254 004e 9c5f                      ..RT.N._
23:59:11.051570 52:54:00:38:cf:94 > 52:54:00:4e:9c:5f, ethertype IPv4 (0x0800), length 77: truncated-ip - 4 bytes missing! (tos 0x0, ttl 64, id 54539, offset 0, flags [DF], proto UDP (17), length 67)
    192.168.180.2.59917 > 198.18.120.1.53: 1196+[|domain]
23:59:11.051585 52:54:00:38:cf:94 > 52:54:00:4e:9c:5f, ethertype IPv4 (0x0800), length 77: truncated-ip - 4 bytes missing! (tos 0x0, ttl 64, id 54540, offset 0, flags [DF], proto UDP (17), length 67)
    192.168.180.2.59917 > 198.18.120.1.53: 5849+[|domain]
23:59:11.527109 52:54:00:4e:9c:5f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.180.2 tell 192.168.180.1, length 28
23:59:11.527707 52:54:00:38:cf:94 > 52:54:00:4e:9c:5f, ethertype ARP (0x0806), length 38: [|ARP]
	0x0000:  0001 0800 0604 0002 5254 0038 cf94 c0a8  ........RT.8....
	0x0010:  b402 5254 004e 9c5f                      ..RT.N._
^C
11 packets captured
11 packets received by filter
0 packets dropped by kernel

^ permalink raw reply

* Re: [PATCH 3/9] kbuild: Do not pass arguments to link-vmlinux.sh
From: Masahiro Yamada @ 2018-04-06  0:59 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, lkml, netdev,
	Linux Kbuild mailing list, Quentin Monnet, Eugene Syromiatnikov,
	Jiri Benc, Stanislav Kozina, Jerome Marchand,
	Arnaldo Carvalho de Melo, Michal Marek, Jiri Kosina
In-Reply-To: <20180405185950.GA3668@krava>

2018-04-06 3:59 GMT+09:00 Jiri Olsa <jolsa@redhat.com>:
> On Fri, Apr 06, 2018 at 12:50:00AM +0900, Masahiro Yamada wrote:
>> 2018-04-06 0:16 GMT+09:00 Jiri Olsa <jolsa@kernel.org>:
>> > There's no need to pass LD* arguments to link-vmlinux.sh,
>> > because they are passed as variables. The only argument
>> > the link-vmlinux.sh supports is the 'clean' argument.
>> >
>> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
>> > ---
>>
>> Wrong.
>>
>> $(LD) $(LDFLAGS) $(LDFLAGS_vmlinux)
>> exist here so that any change in them
>> invokes scripts/linkk-vmlinux.sh
>
> sry, I can't see that.. but it's just a side fix,
> which is actually not needed for the rest
>
> I'll check on more and address this separately


The link command is recorded in .vmlinux.cmd
then, it is checked by if_changed in the next rebuild
because link-vmlinux is invoked by $(call if_changed,...)

     +$(call if_changed,link-vmlinux)



> thanks,
> jirka
>
>>
>>
>>
>>
>> >  Makefile | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/Makefile b/Makefile
>> > index d3300e46f925..a65a3919c6ad 100644
>> > --- a/Makefile
>> > +++ b/Makefile
>> > @@ -1027,7 +1027,7 @@ ARCH_POSTLINK := $(wildcard $(srctree)/arch/$(SRCARCH)/Makefile.postlink)
>> >
>> >  # Final link of vmlinux with optional arch pass after final link
>> >  cmd_link-vmlinux =                                                 \
>> > -       $(CONFIG_SHELL) $< $(LD) $(LDFLAGS) $(LDFLAGS_vmlinux) ;    \
>> > +       $(CONFIG_SHELL) $< ;                                       \
>> >         $(if $(ARCH_POSTLINK), $(MAKE) -f $(ARCH_POSTLINK) $@, true)
>> >
>> >  vmlinux: scripts/link-vmlinux.sh vmlinux_prereq $(vmlinux-deps) FORCE
>>
>>
>>
>>
>>
>>
>> --
>> Best Regards
>> Masahiro Yamada
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kbuild" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards
Masahiro Yamada

^ permalink raw reply

* Re: [PATCH v2 bpf-next 0/3] bpf/verifier: subprog/func_call simplifications
From: Alexei Starovoitov @ 2018-04-06  1:20 UTC (permalink / raw)
  To: Jiong Wang; +Cc: Edward Cree, Daniel Borkmann, netdev
In-Reply-To: <2ff89131-c6ea-5ddf-156c-c6f6e455fbdd@netronome.com>

On Thu, Apr 05, 2018 at 04:50:03PM +0100, Jiong Wang wrote:
> On 03/04/2018 02:08, Alexei Starovoitov wrote:
> > Combining subprog pass with do_check is going into opposite direction
> > of this long term work. Divide and conquer. Combining more things into
> > do_check is the opposite of this programming principle.
> 
> Agree. And for the redundant insn traversal issue in check_subprogs that
> Edward trying to fix, I am thinking we could do it by utilizing the
> existing DFS traversal in check_cfg.
> 
> The current DFS probably could be improved into an generic instruction
> information collection pass.
> 
> This won't make the existing DFS complexer as it only does information
> collection as a side job during traversal. These collected information
> then could be used to build any other information to be consumed later,
> for example subprog, basic blocks etc.
> 
> For the redundant insn traversal issue during subprog detection, the
> Like how we mark STATE_LIST_MARK in DFS, we could just call add_subprog
> for BPF_PSEUDO_CALL insn during DFS.
> 
> i.e we change the code logic of check_cfg into:
> 
> check_cfg
> {
>   * DFS traversal:
>     - detect back-edge.
>     - collect STATE_LIST_MARK.
>     - collect subprog destination.
> 
>   * check all insns are reachable.
>   * check_subprogs (insn traversal removed).
> }

I don't think that will work.
Functions are independent from CFG.
With bpf libraries we will have disjoint functions sitting in the kernel
and check_cfg would need to be done separately from function boundaries.

^ permalink raw reply

* (unknown), 
From: venkatvenkatsubra @ 2018-04-06  1:18 UTC (permalink / raw)
  To: netdev

Hi Netdev    https://goo.gl/5bDZtk

^ permalink raw reply

* Re: [Cluster-devel] [PATCH v2 0/2] gfs2: Stop using rhashtable_walk_peek
From: NeilBrown @ 2018-04-06  1:33 UTC (permalink / raw)
  To: Andreas Grünbacher, Herbert Xu
  Cc: Andreas Gruenbacher, Bob Peterson, Thomas Graf, Tom Herbert,
	cluster-devel, linux-kernel, netdev
In-Reply-To: <CAHpGcMKxDNOkMMKB1_9H0ob-502cD89E-94a18prp91y_a52GA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2687 bytes --]

On Wed, Apr 04 2018, Andreas Grünbacher wrote:

> Herbert Xu <herbert@gondor.apana.org.au> schrieb am Mi. 4. Apr. 2018 um
> 17:51:
>
>> On Wed, Apr 04, 2018 at 11:46:28AM -0400, Bob Peterson wrote:
>> >
>> > The patches look good. The big question is whether to add them to this
>> > merge window while it's still open. Opinions?
>>
>> We're still hashing out the rhashtable interface so I don't think now is
>> the time to rush things.
>
>
> Fair enough. No matter how rhashtable_walk_peek changes, we‘ll still need
> these two patches to fix the glock dump though.

Those two patches look fine to me and don't depend on changes to
rhashtable, so it is up to you when they go upstream.

However, I think the code can be substantially simplified, particularly
once we make rhashtable a little cleverer.
So this is what I'll probably be doing for a similar situation in
lustre....

Having examined seqfile closely, it is apparent that if ->start never
changes *ppos, and if ->next always increments it (except maybe on error)
then

1/ ->next is only ever given a 'pos' that was returned by the previous
   call to ->start or ->next.  So it should *always* return the next
   object, after the one previously returned by ->start or ->next.  It
   never needs to 'seek'. The 'traverse()' function in seq_file.c does
   any seeking needed.  ->next needs to increase 'pos' and when walking
   a dynamic list, it is easiest if it just increments it.

2/ ->start is only called with a pos of:
    0 - in this case it should rewind to the start
    the last pos passed to ->start of ->next
         In this case it should return the same thing that was
         returned last time.  If it no longer exists, then
         the following one should be returned.
    one more than the last pos passed to ->start or ->next
         In this case it should return the object after the
         last one returned.

The proposed enhancement to rhashtable_walk* is to add a
rhashtable_walk_prev() which returns the previously returned object,
if it is still in the table, or NULL. It also enhances
rhashtable_walk_start() so that if the previously returned object is
still in the table, it is preserved as the current cursor.
This means that if you take some action to ensure that the
previously returned object remains in the table until the next ->start,
then you can reliably walk the table with no duplicates or omissions
(unless a concurrent rehash causes duplicates)
If you don't keep the object in the table and it gets removed, then
the 'skip' counter is used to find your place, and you might get
duplicates or omissions.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [RFC 0/9] bpf: Add buildid check support
From: Alexei Starovoitov @ 2018-04-06  1:37 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Alexei Starovoitov, Daniel Borkmann, lkml, netdev, linux-kbuild,
	Quentin Monnet, Eugene Syromiatnikov, Jiri Benc, Stanislav Kozina,
	Jerome Marchand, Arnaldo Carvalho de Melo, Masahiro Yamada,
	Michal Marek, Jiri Kosina
In-Reply-To: <20180405151645.19130-1-jolsa@kernel.org>

On Thu, Apr 05, 2018 at 05:16:36PM +0200, Jiri Olsa wrote:
> hi,
> eBPF programs loaded for kprobes are allowed to read kernel
> internal structures. We check the provided kernel version
> to ensure that the program is loaded for the proper kernel. 
> 
> The problem is that the version check is not enough, because
> it only follows the version setup from kernel's Makefile.
> However, the internal kernel structures change based on the
> .config data, so in practise we have different kernels with
> same version.
> 
> The eBPF kprobe program thus then get loaded for different
> kernel than it's been built for, get wrong data (silently)
> and provide misleading output.
> 
> This patchset implements additional check in eBPF loading code
> on provided build ID (from kernel's elf image, .notes section
> GNU build ID) to ensure we load the eBPF program on correct
> kernel.
> 
> Also available in here (based on bpf-next/master):
>   https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
>   bpf/checksum
> 
> This patchset consists of several changes:
> 
> - adding CONFIG_BUILDID_H option that instructs the build
>   to generate uapi header file with build ID data, that
>   will be included by eBPF program
> 
> - adding CONFIG_BPF_BUILDID_CHECK option and new bpf_attr
>   field to allow build ID checking when loading the eBPF
>   program
> 
> - changing libbpf to read and pass build ID to the kernel
> 
> - several small side fixes
> 
> - example perf eBPF code in bpf-samples/bpf-stdout-example.c
>   to show the build ID support/usage.
> 
>     # perf record -vv  -e ./bpf-samples/bpf-stdout-example.c kill 2>&1 | grep buildid
>     libbpf: section(7) buildid, size 21, link 0, flags 3, type=1
>     libbpf: kernel buildid of ./bpf-samples/bpf-stdout-example.c is: 6e25edeb408513184e2753bebad25d42314501a0
> 
>   The buildid is provided the same way we provide kernel
>   version, in a special "buildid" section:
> 
>     # cat ./bpf-samples/bpf-stdout-example.c
>     ...
>     #include <linux/buildid.h>
> 
>     char _buildid[] SEC("buildid") = LINUX_BUILDID_DATA;
>     ...
> 
>   where LINUX_BUILDID_DATA is defined in the generated buildid.h.
> 
> please note it's an RFC ;-) any comments and suggestions are welcome

I think this is overkill.

We're very heavy users of kprobe+bpf. It's used for lots
of different cases and usage is constantly growing,
but I haven't seen a single case of :

> The eBPF kprobe program thus then get loaded for different
> kernel than it's been built for, get wrong data (silently)
> and provide misleading output.

but I saw plenty of the opposite. People pre-compile the program
and hack kernel version when they load, since they know in advance
that kprobe+bpf doesn't use any kernel specific things.
The existing kernel version check for kprobe+bpf is already annoying
to them.
This buildid check they can easily bypass the same way.
imo the whole thing just adds complexity and doesn't solve anything.
Sorry, this is a Nack.

^ permalink raw reply

* [PATCH] make net_gso_ok return false when gso_type is zero(invalid)
From: Wenhua Shi @ 2018-04-06  1:43 UTC (permalink / raw)
  Cc: David S. Miller, netdev, linux-kernel

Signed-off-by: Wenhua Shi <march511@gmail.com>
---
 include/linux/netdevice.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cf44503e..1f26cbcf 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -4187,7 +4187,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
 	BUILD_BUG_ON(SKB_GSO_ESP != (NETIF_F_GSO_ESP >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_UDP != (NETIF_F_GSO_UDP >> NETIF_F_GSO_SHIFT));
 
-	return (features & feature) == feature;
+	return feature && (features & feature) == feature;
 }
 
 static inline bool skb_gso_ok(struct sk_buff *skb, netdev_features_t features)
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH net] netns: filter uevents correctly
From: David Miller @ 2018-04-06  2:02 UTC (permalink / raw)
  To: christian.brauner
  Cc: ebiederm, gregkh, netdev, linux-kernel, avagin, ktkhai, serge
In-Reply-To: <CAPP7u0VQc33MiYP8Ene6K1Bqkx4LtBtQ0ugFYp_36UsxuusT1g@mail.gmail.com>

From: Christian Brauner <christian.brauner@canonical.com>
Date: Thu, 05 Apr 2018 01:27:16 +0000

> David, is it ok to queue this or would you prefer I resend when net-next
> reopens?

This definitely needs more discussion, and after the discussion some
further clarification in the commit log message based upon that
discussion if we move forward with this change at all.

Thank you.

^ permalink raw reply

* Re: [PATCH net] net: dsa: Discard frames from unused ports
From: David Miller @ 2018-04-06  2:04 UTC (permalink / raw)
  To: andrew; +Cc: netdev, f.fainelli, vivien.didelot
In-Reply-To: <1522886204-1545-1-git-send-email-andrew@lunn.ch>

From: Andrew Lunn <andrew@lunn.ch>
Date: Thu,  5 Apr 2018 01:56:44 +0200

> The Marvell switches under some conditions will pass a frame to the
> host with the port being the CPU port. Such frames are invalid, and
> should be dropped. Not dropping them can result in a crash when
> incrementing the receive statistics for an invalid port.
> 
> Reported-by: Chris Healy <cphealy@gmail.com>
> Fixes: 5f6b4e14cada ("net: dsa: User per-cpu 64-bit statistics")
> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
 ...
> +	slave_port = &ds->ports[port];
> +
> +	if (slave_port->type != DSA_PORT_TYPE_USER)
> +		return NULL;

Look like we need a Fixes: update and an adjustment to use unlikely()
here based upon Florian's feedback.

^ permalink raw reply

* Re: [PATCH net] arp: fix arp_filter on l3slave devices
From: David Miller @ 2018-04-06  2:05 UTC (permalink / raw)
  To: dsahern; +Cc: mfadon, netdev
In-Reply-To: <8ea161a0-b164-8f05-5026-d416f17fbdfb@gmail.com>

From: David Ahern <dsahern@gmail.com>
Date: Thu, 5 Apr 2018 08:40:48 -0600

> On 4/5/18 2:25 AM, Miguel Fadon Perlines wrote:
>> arp_filter performs an ip_route_output search for arp source address and
>> checks if output device is the same where the arp request was received,
>> if it is not, the arp request is not answered.
>> 
>> This route lookup is always done on main route table so l3slave devices
>> never find the proper route and arp is not answered.
>> 
>> Passing l3mdev_master_ifindex_rcu(dev) return value as oif fixes the
>> lookup for l3slave devices while maintaining same behavior for non
>> l3slave devices as this function returns 0 in that case.
>> 
>> Signed-off-by: Miguel Fadon Perlines <mfadon@teldat.com>
>> ---
>>  net/ipv4/arp.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
> 
> Acked-by: David Ahern <dsa@cumulusnetworks.com>
> 
> DaveM: this is a day 1 bug for VRF. Best guess at a Fixes tag would be:
> 
> Fixes: 613d09b30f8b ("net: Use VRF device index for lookups on TX")
> 
> It would be good to get this into stable releases 4.9 and up. Thanks,

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH net] net: mvpp2: Fix parser entry init boundary check
From: David Miller @ 2018-04-06  2:14 UTC (permalink / raw)
  To: maxime.chevallier
  Cc: netdev, linux-kernel, antoine.tenart, thomas.petazzoni,
	gregory.clement, miquel.raynal, nadavh, stefanc, ymarkman, mw
In-Reply-To: <20180405095548.14433-1-maxime.chevallier@bootlin.com>

From: Maxime Chevallier <maxime.chevallier@bootlin.com>
Date: Thu,  5 Apr 2018 11:55:48 +0200

> Boundary check in mvpp2_prs_init_from_hw must be done according to the
> passed "tid" parameter, not the mvpp2_prs_entry index, which is not yet
> initialized at the time of the check.
> 
> Fixes: 47e0e14eb1a6 ("net: mvpp2: Make mvpp2_prs_hw_read a parser entry init function")
> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH 08/32] aio: replace kiocb_set_cancel_fn with a cancel_kiocb file operation
From: Al Viro @ 2018-04-06  2:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Avi Kivity, linux-aio, linux-fsdevel, netdev, linux-api,
	linux-kernel
In-Reply-To: <20180330150809.28094-9-hch@lst.de>

On Fri, Mar 30, 2018 at 05:07:45PM +0200, Christoph Hellwig wrote:
> The current kiocb_set_cancel_fn implementation assumes the kiocb is
> embedded into an aio_kiocb, which is fundamentally unsafe as it might
> have been submitted by non-aio callers.  Instead add a cancel_kiocb
> file operation that replaced the ki_cancel function pointer set by
> kiocb_set_cancel_fn, and only adds iocbs to the active list when
> the read/write_iter methods return -EIOCBQUEUED and the file has
> a cancel_kiocb method.

> @@ -1440,6 +1423,16 @@ static inline ssize_t aio_rw_ret(struct kiocb *req, ssize_t ret)
>  {
>  	switch (ret) {
>  	case -EIOCBQUEUED:
> +		if (req->ki_filp->f_op->cancel_kiocb) {

... and by that point req might've been already freed by IO completion on
another CPU.
> +			struct aio_kiocb *iocb =
> +				container_of(req, struct aio_kiocb, rw);
> +			struct kioctx *ctx = iocb->ki_ctx;
> +			unsigned long flags;
> +
> +			spin_lock_irqsave(&ctx->ctx_lock, flags);
> +			list_add_tail(&iocb->ki_list, &ctx->active_reqs);

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply

* Re: [PATCH net-next] hv_netvsc: Add NetVSP v6 into version negotiation
From: David Miller @ 2018-04-06  2:17 UTC (permalink / raw)
  To: haiyangz, haiyangz
  Cc: netdev, kys, sthemmin, olaf, vkuznets, devel, linux-kernel
In-Reply-To: <20180405184222.3167-1-haiyangz@linuxonhyperv.com>

From: Haiyang Zhang <haiyangz@linuxonhyperv.com>
Date: Thu,  5 Apr 2018 11:42:22 -0700

> From: Haiyang Zhang <haiyangz@microsoft.com>
> 
> This patch adds the NetVSP v6 message structures, and includes this
> version into NetVSC/NetVSP version negotiation process.
> 
> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>

The net-next tree is closed, please resubmit this when the net-next
tree reopens.

Thank you.

^ permalink raw reply

* Re: [PATCH 0/4] hv_netvsc: Fix shutdown issues on older Windows hosts
From: David Miller @ 2018-04-06  2:21 UTC (permalink / raw)
  To: mgamal
  Cc: netdev, sthemmin, devel, linux-kernel, kys, haiyangz, vkuznets,
	otubo
In-Reply-To: <1522955361-14704-1-git-send-email-mgamal@redhat.com>

From: Mohammed Gamal <mgamal@redhat.com>
Date: Thu,  5 Apr 2018 21:09:17 +0200

> Guests running on WS2012 hosts would not shutdown when changing network
> interface setting (e.g. Number of channels, MTU ... etc). 
> 
> This patch series addresses these shutdown issues we enecountered with WS2012
> hosts. It's essentialy a rework of the series sent in 
> https://lkml.org/lkml/2018/1/23/111 on top of latest upstream
> 
> Fixes: 0ef58b0a05c1 ("hv_netvsc: change GPAD teardown order on older versions")

Series applied, thank you.

^ permalink raw reply

* Re: [PATCH 2/2] net: phy: dp83640: Read strapped configuration settings
From: David Miller @ 2018-04-06  2:24 UTC (permalink / raw)
  To: andrew
  Cc: esben, f.fainelli, richardcochran, netdev, linux-kernel,
	rasmus.villemoes
In-Reply-To: <20180405204049.GD17495@lunn.ch>

From: Andrew Lunn <andrew@lunn.ch>
Date: Thu, 5 Apr 2018 22:40:49 +0200

> Or could it still contain whatever state the last boot of Linux, or
> maybe the bootloader, left the PHY in?

Right, this is my concern as well.

^ permalink raw reply

* Re: [PATCH net] net/ipv6: Increment OUTxxx counters after netfilter hook
From: David Miller @ 2018-04-06  2:24 UTC (permalink / raw)
  To: 0xeffeff; +Cc: netdev, dsahern
In-Reply-To: <20180405212947.17858-1-0xeffeff@gmail.com>

From: Jeff Barnhill <0xeffeff@gmail.com>
Date: Thu,  5 Apr 2018 21:29:47 +0000

> At the end of ip6_forward(), IPSTATS_MIB_OUTFORWDATAGRAMS and
> IPSTATS_MIB_OUTOCTETS are incremented immediately before the NF_HOOK call
> for NFPROTO_IPV6 / NF_INET_FORWARD.  As a result, these counters get
> incremented regardless of whether or not the netfilter hook allows the
> packet to continue being processed.  This change increments the counters
> in ip6_forward_finish() so that it will not happen if the netfilter hook
> chooses to terminate the packet, which is similar to how IPv4 works.
> 
> Signed-off-by: Jeff Barnhill <0xeffeff@gmail.com>

Yep, this is more consistent with ipv4.

Applied, thank you.

^ permalink raw reply

* Re: [PATCH v15 ] net/veth/XDP: Line-rate packet forwarding in kernel
From: David Ahern @ 2018-04-06  2:55 UTC (permalink / raw)
  To: Md. Islam
  Cc: netdev, David Miller, Stephen Hemminger, Anton Gary Ceph,
	Pavel Emelyanov, Eric Dumazet, alexei.starovoitov, brouer
In-Reply-To: <CAFgPn1D2SE_wvCvLJU9uRJbwa4hy4NVj2as_C=VzmDLqyBeGbQ@mail.gmail.com>

On 4/3/18 9:15 PM, Md. Islam wrote:
>> Have you looked at what I would consider a more interesting use case of
>> packets into a node and delivered to a namespace via veth?
>>
>>    +--------------------------+---------------
>>    | Host                     | container
>>    |                          |
>>    |        +-------{ veth1 }-|-{veth2}----
>>    |       |                  |
>>    +----{ eth1 }------------------
>>
>> Can xdp / bpf on eth1 be used to speed up delivery to the container?
> 
> I didn't consider that, but it sounds like an important use case. How
> do we determine which namespace gets the packet?
> 

FIB lookups of course. Starting with my patch set that handles
forwarding on eth1, what is needed for XDP with veth? ie., a program on
eth1 does the lookup and redirects the packet to veth1 for Tx.
ndo_xdp_xmit for veth knows the packet needs to be forwarded to veth2
internally and there is no skb allocated for the packet yet.

^ permalink raw reply

* Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.
From: NeilBrown @ 2018-04-06  3:11 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Thomas Graf, netdev, linux-kernel, David S. Miller, Eric Dumazet
In-Reply-To: <20180329052231.GA21028@gondor.apana.org.au>

[-- Attachment #1: Type: text/plain, Size: 5902 bytes --]

On Thu, Mar 29 2018, Herbert Xu wrote:

> On Thu, Mar 29, 2018 at 08:26:21AM +1100, NeilBrown wrote:
>>
>> I say "astronomically unlikely", you say "probability .. is extremely
>> low".  I think we are in agreement here.
>> 
>> The point remains that if an error *can* be returned then I have to
>> write code to handle it and test that code.  I'd rather not.
>
> You have be able to handle errors anyway because of memory allocation
> failures.  Ultimately if you keep inserting you will eventually
> fail with ENOMEM.  So I don't see the issue with an additional
> error value.

You don't need to handle memory allocation failures at the point where
you insert into the table - adding to a linked list requires no new
memory.

If you are running out of memory, you will fail to allocate new objects,
so you won't be able to add them to the rhashtable.  Obviously you have
to handle that failure, and that is likely to save rhashtable from
getting many mem-alloc failures.

But once you have allocated a new object, rhashtable should just accept
it.  It doesn't help anyone to have the insertion fail.
The insertion can currently fail even when allocating a new object would
succeed, as assertion can require a GFP_ATOMIC allocation to succeed,
while allocating a new object might only need a GFP_KERNEL allocation to
succeed. 

>
>> > Even if it does happen we won't fail because we will perform
>> > an immediate rehash.  We only fail if it happens right away
>> > after the rehash (that is, at least another 16 elements have
>> > been inserted and you're trying to insert a 17th element, all
>> > while the new hash table has not been completely populated),
>> > which means that somebody has figured out our hash secret and
>> > failing in that case makes sense.
>
> BTW, you didn't acknowledge this bit which I think is crucial to
> how likely such an error is.

The likelihood of the error isn't really the issue - it either can
happen or it cannot.  If it can, it requires code to handle it.

>
>> I never suggested retrying, but I would have to handle it somehow.  I'd
>> rather not.
>
> ...
>
>> While I have no doubt that there are hashtables where someone could try
>> to attack the hash, I am quite sure there are others where is such an
>> attack is meaningless - any code which could generate the required range of
>> keys, could do far worse things more easily.
>
> Our network hashtable has to be secure against adversaries.  I
> understand that this may not be important to your use-case.  However,
> given the fact that the failure would only occur if an adversary
> is present and actively attacking your hash table, I don't think
> it has that much of a negative effect on your use-case either.

I certainly accept that there are circumstances where it is a real
possibility that an adversary might succeed in attacking the hash
function, and changing the seed for each table is an excellent idea.
It isn't clear to me that failing insertions is needed - it is done
rather late and so doesn't throttle much, and could give the attacker
feedback that they had succeeded (?).

Not all hash tables are susceptible to attack.  It would be nice if
developers working in less exposed areas could use rhashtable without ever
having to check for errors.

Error codes are messages from the implementation to the caller.
They should be chosen to help the caller make useful choices, not to
allow the implementation to avoid awkward situations.

>
> Of course if you can reproduce the EBUSY error without your
> disable_count patch or even after you have fixed the issue I
> have pointed out in your disable_count patch you can still reproduce
> it then that would suggest a real bug and we would need to fix it,
> for everyone.

I suspect that I cannot.  However experience tells me that customers do
things that I cannot and would never expect - they can often trigger
errors that I cannot.  It is best if the error cannot be returned.

>
>> Yes, storing a sharded count in the spinlock table does seem like an
>> appropriate granularity.  However that leads me to ask: why do we have
>> the spinlock table?  Why not bit spinlocks in the hashchain head like
>> include/linux/list_bl uses?
>
> The spinlock table predates rhashtable.  Perhaps Thomas/Eric/Dave
> can elucidate this.

Maybe I'll submit an RFC patch to change it - if I can make it work.

>
>> I don't understand how it can ever be "too late", though I appreciate
>> that in some cases "sooner" is better than "later"
>> If we give up on the single atomic_t counter, then we must accept that
>> the number of elements could exceed any given value.  The only promise
>> we can provide is that it wont exceed N% of the table size for more than
>> T seconds.
>
> Sure.  However, assuming we use an estimate that is hash-based,
> that *should* be fairly accurate assuming that your hash function
> is working in the first place.  It's completely different vs.
> estimating based on a per-cpu count which could be wildly inaccurate.

Ahhh... I see what you are thinking now.  You aren't suggestion a
sharded count that is summed periodically.  Rather you are suggesting that
we divide the hash space into N regions each with their own independent
counter, and resize the table if any one region reaches 70% occupancy -
on the assumption that the other regions are likely to be close.  Is
that right?
It would probably be dangerous to allow automatic shrinking (when just
one drops below 30%) in that case, but it might be OK for growing.

I'm not sure I like the idea, but it is worth thinking about.

Thanks,
NeilBrown

>
> Cheers,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: KASAN: use-after-free Read in worker_thread (2)
From: Eric Biggers @ 2018-04-06  3:12 UTC (permalink / raw)
  To: syzbot
  Cc: davem, dvyukov, ebiggers, jiangshanlai, linux-kernel, mingo,
	netdev, syzkaller-bugs, tj, tklauser, tom, xiyou.wangcong
In-Reply-To: <001a1140ccfee824f8055db71210@google.com>

On Sat, Nov 11, 2017 at 07:56:01AM -0800, syzbot wrote:
> syzkaller has found reproducer for the following crash on
> d9e0e63d9a6f88440eb201e1491fcf730272c706
> git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> 
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
> 
> 
> BUG: KASAN: use-after-free in worker_thread+0x15bb/0x1990
> kernel/workqueue.c:2244
> Read of size 8 at addr ffff88002d0e3de0 by task kworker/u8:1/1209
> 
> CPU: 0 PID: 1209 Comm: kworker/u8:1 Not tainted 4.14.0-rc8-next-20171110+
> #12
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  print_address_description+0x73/0x250 mm/kasan/report.c:252
>  kasan_report_error mm/kasan/report.c:351 [inline]
>  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
>  worker_thread+0x15bb/0x1990 kernel/workqueue.c:2244
>  kthread+0x37a/0x440 kernel/kthread.c:238
>  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:437
> 
> Allocated by task 11866:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
>  kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
>  kmem_cache_alloc+0x12e/0x760 mm/slab.c:3548
>  kmem_cache_zalloc include/linux/slab.h:693 [inline]
>  kcm_attach net/kcm/kcmsock.c:1394 [inline]
>  kcm_attach_ioctl net/kcm/kcmsock.c:1460 [inline]
>  kcm_ioctl+0x2d1/0x1610 net/kcm/kcmsock.c:1695
>  sock_do_ioctl+0x65/0xb0 net/socket.c:960
>  sock_ioctl+0x2c2/0x440 net/socket.c:1057
>  vfs_ioctl fs/ioctl.c:46 [inline]
>  do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:686
>  SYSC_ioctl fs/ioctl.c:701 [inline]
>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
>  entry_SYSCALL_64_fastpath+0x1f/0x96
> 
> Freed by task 11867:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
>  __cache_free mm/slab.c:3492 [inline]
>  kmem_cache_free+0x77/0x280 mm/slab.c:3750
>  kcm_unattach+0xe50/0x1510 net/kcm/kcmsock.c:1563
>  kcm_unattach_ioctl net/kcm/kcmsock.c:1608 [inline]
>  kcm_ioctl+0xdf0/0x1610 net/kcm/kcmsock.c:1705
>  sock_do_ioctl+0x65/0xb0 net/socket.c:960
>  sock_ioctl+0x2c2/0x440 net/socket.c:1057
>  vfs_ioctl fs/ioctl.c:46 [inline]
>  do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:686
>  SYSC_ioctl fs/ioctl.c:701 [inline]
>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
>  entry_SYSCALL_64_fastpath+0x1f/0x96
> 
> The buggy address belongs to the object at ffff88002d0e3d00
>  which belongs to the cache kcm_psock_cache of size 576
> The buggy address is located 224 bytes inside of
>  576-byte region [ffff88002d0e3d00, ffff88002d0e3f40)
> The buggy address belongs to the page:
> page:ffffea0000b43880 count:1 mapcount:0 mapping:ffff88002d0e2180 index:0x0
> compound_mapcount: 0
> flags: 0x100000000008100(slab|head)
> raw: 0100000000008100 ffff88002d0e2180 0000000000000000 000000010000000b
> raw: ffffea0000b14920 ffffea0000b27e20 ffff88002b0089c0 0000000000000000
> page dumped because: kasan: bad access detected
> 
> Memory state around the buggy address:
>  ffff88002d0e3c80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>  ffff88002d0e3d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > ffff88002d0e3d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>                                                        ^
>  ffff88002d0e3e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  ffff88002d0e3e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ==================================================================

No longer occurring, the fix seems to have been commit 7e9964574ee97:

#syz fix: kcm: Only allow TCP sockets to be attached to a KCM mux

- Eric

^ permalink raw reply

* Re: [PATCH net-next] netns: filter uevents correctly
From: Eric W. Biederman @ 2018-04-06  3:59 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <20180405144130.GB26043@gmail.com>

Christian Brauner <christian.brauner@canonical.com> writes:

> On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
>> On 05.04.2018 17:07, Christian Brauner wrote:
>> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
>> >> On 04.04.2018 22:48, Christian Brauner wrote:
>> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>> >>>
>> >>> enabled sending hotplug events into all network namespaces back in 2010.
>> >>> Over time the set of uevents that get sent into all network namespaces has
>> >>> shrunk. We have now reached the point where hotplug events for all devices
>> >>> that carry a namespace tag are filtered according to that namespace.
>> >>>
>> >>> Specifically, they are filtered whenever the namespace tag of the kobject
>> >>> does not match the namespace tag of the netlink socket. One example are
>> >>> network devices. Uevents for network devices only show up in the network
>> >>> namespaces these devices are moved to or created in.
>> >>>
>> >>> However, any uevent for a kobject that does not have a namespace tag
>> >>> associated with it will not be filtered and we will *try* to broadcast it
>> >>> into all network namespaces.
>> >>>
>> >>> The original patchset was written in 2010 before user namespaces were a
>> >>> thing. With the introduction of user namespaces sending out uevents became
>> >>> partially isolated as they were filtered by user namespaces:
>> >>>
>> >>> net/netlink/af_netlink.c:do_one_broadcast()
>> >>>
>> >>> if (!net_eq(sock_net(sk), p->net)) {
>> >>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>> >>>                 return;
>> >>>
>> >>>         if (!peernet_has_id(sock_net(sk), p->net))
>> >>>                 return;
>> >>>
>> >>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> >>>                              CAP_NET_BROADCAST))
>> >>>         j       return;
>> >>> }
>> >>>
>> >>> The file_ns_capable() check will check whether the caller had
>> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
>> >>> namespace of interest. This check is fine in general but seems insufficient
>> >>> to me when paired with uevents. The reason is that devices always belong to
>> >>> the initial user namespace so uevents for kobjects that do not carry a
>> >>> namespace tag should never be sent into another user namespace. This has
>> >>> been the intention all along. But there's one case where this breaks,
>> >>> namely if a new user namespace is created by root on the host and an
>> >>> identity mapping is established between root on the host and root in the
>> >>> new user namespace. Here's a reproducer:
>> >>>
>> >>>  sudo unshare -U --map-root
>> >>>  udevadm monitor -k
>> >>>  # Now change to initial user namespace and e.g. do
>> >>>  modprobe kvm
>> >>>  # or
>> >>>  rmmod kvm
>> >>>
>> >>> will allow the non-initial user namespace to retrieve all uevents from the
>> >>> host. This seems very anecdotal given that in the general case user
>> >>> namespaces do not see any uevents and also can't really do anything useful
>> >>> with them.
>> >>>
>> >>> Additionally, it is now possible to send uevents from userspace. As such we
>> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
>> >>> namespace of the network namespace of the netlink socket) userspace process
>> >>> make a decision what uevents should be sent.
>> >>>
>> >>> This makes me think that we should simply ensure that uevents for kobjects
>> >>> that do not carry a namespace tag are *always* filtered by user namespace
>> >>> in kobj_bcast_filter(). Specifically:
>> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
>> >>>   event will always be filtered.
>> >>> - If the network namespace the uevent socket belongs to was created in the
>> >>>   initial user namespace but was opened from a non-initial user namespace
>> >>>   the event will be filtered as well.
>> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
>> >>> always only sent to the initial user namespace. The regression potential
>> >>> for this is near to non-existent since user namespaces can't really do
>> >>> anything with interesting devices.
>> >>>
>> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
>> >>> ---
>> >>>  lib/kobject_uevent.c | 10 +++++++++-
>> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
>> >>>
>> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>> >>> index 15ea216a67ce..cb98cddb6e3b 100644
>> >>> --- a/lib/kobject_uevent.c
>> >>> +++ b/lib/kobject_uevent.c
>> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>> >>>  		return sock_ns != ns;
>> >>>  	}
>> >>>  
>> >>> -	return 0;
>> >>> +	/*
>> >>> +	 * The kobject does not carry a namespace tag so filter by user
>> >>> +	 * namespace below.
>> >>> +	 */
>> >>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
>> >>> +		return 1;
>> >>> +
>> >>> +	/* Check if socket was opened from non-initial user namespace. */
>> >>> +	return sk_user_ns(dsk) != &init_user_ns;
>> >>>  }
>> >>>  #endif
>> >>
>> >> So, this prohibits to listen events of all devices except network-related
>> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
>> > 
>> > No, this is not correct: As it is right now *without my patch* no
>> > non-initial user namespace is receiving *any uevents* but those
>> > specifically namespaced such as those for network devices. This patch
>> > doesn't change that at all. The commit message outlines this in detail
>> > how this comes about.
>> > There is only one case where this currently breaks and this is as I
>> > outlined explicitly in my commit message when you create a new user
>> > namespace and map container(0) -> host(0). This patch fixes this.
>> 
>> Could you please point the place, where non-initial user namespaces are filtered?
>> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
>> Now it will return 1 sometimes.
>
> Oh sure, it's in the commit message though. The callchain is
> lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
> nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
> net/netlink/af_netlink.c:do_one_broadcast():
>
> This codepiece will check whether the openened socket holds
> CAP_NET_BROADCAST in the user namespace of the target network namespace
> which it won't because we don't have device namespaces and all devices
> belong to the initial set of namespaces.
>
>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>                              CAP_NET_BROADCAST))
>         j       return;
>

The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
on their socket and has had someone with the appropriate privileges
assign a peerrnetid.

All of which is to say that file_ns_capable is not nearly as applicable
as it might be, and if you can pass the other two checks I think it is
pointless (because the peernet attributes are not generated for
kobj_uevents) but valid to receive events from outside your network
namespace.


I might be missing something but I don't see anything excluding network
namespaces owned by !init_user_ns excluded from the kobject_uevent
logic.

The uevent_sock_list has one entry per network namespace. And
kobject_uevent_net_broacast appears to walk each one.

I had a memory of filtering uevent messages and I had a memory
that the netlink_has_listeners had a per network namespace effect.
Neither seems true from my inspection of the code tonight.

If we are not filtering ordinary uevents at least at the user namespace
level that does seem to be at least a nuisance.


Christian can you dig a little deeper into this.  I have a feeling that
there are some real efficiency improvements that we could make to
kobject_uevent_net_broadcast if nothing else.

Perhaps you could see where uevents are broadcast by poking
the sysfs uevent of an existing device, and triggering a retransmit.

Eric

^ permalink raw reply

* Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.
From: Herbert Xu @ 2018-04-06  4:13 UTC (permalink / raw)
  To: NeilBrown
  Cc: Thomas Graf, netdev, linux-kernel, David S. Miller, Eric Dumazet
In-Reply-To: <87y3i1vxj7.fsf@notabene.neil.brown.name>

On Fri, Apr 06, 2018 at 01:11:56PM +1000, NeilBrown wrote:
>
> You don't need to handle memory allocation failures at the point where
> you insert into the table - adding to a linked list requires no new
> memory.

You do actually.  The r in rhashtable stands for resizable.  We
cannot completely rely on a background hash table resize because
the insertions can be triggered from softirq context and be without
rate-limits of any kind (e.g., incoming TCP connections).

Therefore at some point you either have to wait for the resize or
fail the insertion.  Since we cannot sleep then failing is the only
option.

> The likelihood of the error isn't really the issue - it either can
> happen or it cannot.  If it can, it requires code to handle it.

Sure, but you just have to handle it the same way you would handle
a memory allocation failure, because that's what it essentially is.

I'm sorry if that means you have to restructure your code to do that
but that's what you pay for having a resizable hash table because
at some point you just have to perform a resize.

> Ahhh... I see what you are thinking now.  You aren't suggestion a
> sharded count that is summed periodically.  Rather you are suggesting that
> we divide the hash space into N regions each with their own independent
> counter, and resize the table if any one region reaches 70% occupancy -
> on the assumption that the other regions are likely to be close.  Is
> that right?

Yes.

> It would probably be dangerous to allow automatic shrinking (when just
> one drops below 30%) in that case, but it might be OK for growing.

At the greatest granularity it would be a counter per bucket, so
clearly we need to maintain some kind of bound on the granularity
so our sample space is not too small.

Automatic shrinking shouldn't be an issue because that's the slow
kind of resizing that we punt to kthread context and it can afford
to do a careful count.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 1/2] af_key: Use DIV_ROUND_UP() instead of open-coded equivalent
From: Steffen Klassert @ 2018-04-06  4:31 UTC (permalink / raw)
  To: Kevin Easton; +Cc: Herbert Xu, David S. Miller, netdev, linux-kernel
In-Reply-To: <20180329013526.GA27752@la.guarana.org>

On Wed, Mar 28, 2018 at 09:35:26PM -0400, Kevin Easton wrote:
> On Wed, Mar 28, 2018 at 07:59:25AM +0200, Steffen Klassert wrote:
> > On Mon, Mar 26, 2018 at 07:39:16AM -0400, Kevin Easton wrote:
> > > Several places use (x + 7) / 8 to convert from a number of bits to a number
> > > of bytes.  Replace those with DIV_ROUND_UP(x, 8) instead, for consistency
> > > with other parts of the same file.
> > > 
> > > Signed-off-by: Kevin Easton <kevin@guarana.org>
> > 
> > Is this a fix or just a cleanup?
> > If it is just a cleanup, please resent it based on ipsec-next.
> 
> Hi Steffen,
> 
> This patch (1/2) is just a cleanup, but it's in preparation for patch 2/2
> which is a fix and modifies some of the same lines.

So please resend the fix without the cleanup, otherwise we can
get conflicts when backporting the fix to the stable trees.

^ permalink raw reply

* Re: [PATCH 05/32] fs: introduce new ->get_poll_head and ->poll_mask methods
From: Al Viro @ 2018-04-06  4:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Avi Kivity, linux-aio, linux-fsdevel, netdev, linux-api,
	linux-kernel
In-Reply-To: <20180330150809.28094-6-hch@lst.de>

On Fri, Mar 30, 2018 at 05:07:42PM +0200, Christoph Hellwig wrote:

> +  get_poll_head: Returns the struct wait_queue_head that poll, select,
> +  epoll or aio poll should wait on in case this instance only has single
> +  waitqueue.  Can return NULL to indicate polling is not supported,
> +  or a POLL* value using the POLL_TO_PTR helper in case a grave error
> +  occured and ->poll_mask shall not be called.

> +		if (IS_ERR(head))
> +			return PTR_TO_POLL(head);

> + * ->get_poll_head can return a __poll_t in the PTR_ERR, use these macros
> + * to return the value and recover it.  It takes care of the negation as
> + * well as off the annotations.
> + */
> +#define POLL_TO_PTR(mask)	(ERR_PTR(-(__force int)(mask)))

Uh-oh...
static inline bool __must_check IS_ERR(__force const void *ptr)
{
        return IS_ERR_VALUE((unsigned long)ptr);
}
#define IS_ERR_VALUE(x) unlikely((unsigned long)(void *)(x) >= (unsigned long)-MAX_ERRNO)
#define MAX_ERRNO       4095

IOW, your trick relies upon arguments of PTR_TO_POLL being no greater
than 4095.  Now, consider
#define EPOLLRDHUP      (__force __poll_t)0x00002000
which is to say, 8192...

So anything that tries e.g. POLL_TO_PTR(EPOLLRDHUP | EPOLLERR) will be in
for a quiet unpleasant surprise...

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply

* Re: [PATCH net-next 6/6] netdevsim: Add simple FIB resource controller via devlink
From: Jiri Pirko @ 2018-04-06  5:35 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, davem, roopa, shm, jiri, idosch, jakub.kicinski,
	andy.roulin
In-Reply-To: <a916f016-5d8b-3f56-0a84-95d1712bec4c@cumulusnetworks.com>

Thu, Apr 05, 2018 at 10:10:29PM CEST, dsa@cumulusnetworks.com wrote:
>On 4/5/18 11:27 AM, Jiri Pirko wrote:
>> Wed, Mar 28, 2018 at 03:22:00AM CEST, dsa@cumulusnetworks.com wrote:
>>> Add devlink support to netdevsim and use it to implement a simple,
>>> profile based resource controller. Only one controller is needed
>>> per namespace, so the first netdevsim netdevice in a namespace
>>> registers with devlink. If that device is deleted, the resource
>>> settings are deleted.
>> 
>> I don't understand why you add 1:1 fixed relationship between
>> netnamespace and devlink instance. That is highly misleading and reader
>> might think that those 2 are somehow related. They are not. You can have
>> multiple devlink instances for many ports in a single namespace.
>
>The netdevsim devlink instance is an example of limiting the number of
>FIB entries and FIB rules for a namespace. It is currently limited to
>the init_net based on past discussion.
>
>It does not make sense to have multiple resource controllers for the
>same network namespace, hence the limit of only registering with devlink
>on the first device create.

Devlink instance represents an ASIC. 1:1. There is no relation with
network namespaces and should not be. I have no clue why you think so.

The model looks as I described it down below in the picture.


>
>> 
>> Could you please clarify?
>> 
>> Also, to see the relationship between individual netdevsim netdevices
>> and the parent devlink instance, we should use devlink_port
>> instances, like this: 
>> 
>>       devlink1              devlink2
>>        |    |                |    |
>>  dl_port1_1 dlport1_2   dlport2_1 dlport2_2
>>        |    |                |    |
>>      eth0  eth1             eth2 eth3
>> 
>> Note that "devlink instance" reprensents one ASIC.
>> The address of the devlink instance is the bus address of the ASIC.
>> Here, you use address of some/first netdevsim netdev instance.
>
>The ASIC here is the kernel tables in a namespace. It does not make
>sense to have 2 devlink instances for a single namespace.

Again. No clue why you build relationship with namespace.


>
>> 
>> The way it is implemented in netdevsim by this patch is wrong on
>> so many levels :(
>> 
>> Could you please fix this? I'm more than happy to help you with this,
>> please say so. Thanks!
>
>What is there to fix?
>
>Not creating a netdevsim device per netdevsim netdevice? That is
>completely unrelated to the devlink change.

To fit the model. Multiple devlink instances, each representing one
"virtual" ASIC, devlink_port instances, 1 for each netdevsim port.
Netdevsim port should simulate real devices. No real device should have
1:1 relation with network namespace. That is just simply wrong.

^ permalink raw reply

* Re: [PATCH net-next 6/6] netdevsim: Add simple FIB resource controller via devlink
From: Jiri Pirko @ 2018-04-06  5:52 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, davem, roopa, shm, jiri, idosch, jakub.kicinski,
	andy.roulin
In-Reply-To: <e9c59b0c-328e-d343-6e8d-d19f643d2e9d@cumulusnetworks.com>

Thu, Apr 05, 2018 at 11:06:41PM CEST, dsa@cumulusnetworks.com wrote:
>On 4/5/18 2:10 PM, David Ahern wrote:
>> 
>> The ASIC here is the kernel tables in a namespace. It does not make
>> sense to have 2 devlink instances for a single namespace.
>
>I put this example controller in netdevsim per a suggestion from Ido.
>The netdevsim seemed like a good idea given that modules intention --
>testing network facilities. Perhaps I should have done this as a
>completely standalone module ...
>
>The intention is to treat the kernel's tables *per namespace* as a
>standalone entity that can be managed very similar to ASIC resources.

So you say you want to treat a namespace as an ASIC? That sounds very
odd to me :/


>Given that I can add a resource controller module
>(drivers/net/kern_res_mgr.c?) that creates a 'struct device' per network
>namespace with a devlink instance. In this case the device would very
>much be tied to the namespace 1:1.

That sounds more reasonable and accurate, yet still odd. You would not
have any netdevices there? Any ports?

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox