Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: switchdev offload & ecmp
From: Ido Schimmel @ 2017-05-15 16:40 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: Jiri Pirko, Nikolay Aleksandrov, Roopa Prabhu, netdev
In-Reply-To: <84b9801f-0c2a-cd55-61a4-411261943024@6wind.com>

Hi,

On Mon, May 15, 2017 at 04:25:43PM +0200, Nicolas Dichtel wrote:
> Hi Jiri and Ido,
> 
> I'm trying to understand how ecmp offloading works. It seems that rocker doesn't
> support it:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/rocker/rocker_ofdpa.c#n2409.
> But I saw that the support was added in spectrum:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=684a95c064fc.
> 
> Is there a consistency between the ecmp algorithm of the kernel and the one from
> spectrum?

We currently use the hardware's defaults for ECMP hashing, which include
both L3 and L4 fields. I'm aware of Nik's patch, but we've yet to
reflect that. Note that the L4 fields aren't considered for fragmented
packets.

> I suspect that there can be scenarii where some packets of a flow are forwarded
> by the driver and some other are forwarded by the kernel.

Can you elaborate? The kernel only sees specific packets, which were
trapped to the CPU. See:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/tree/drivers/net/ethernet/mellanox/mlxsw/spectrum.c#n2996

> For example, an ecmp route with two nexthops: a connected route and a gw? 

Not sure I'm following you. A packet will either hit a remote route or a
directly connected one. We distinguish between the two based on the
scope of the first nexthop in the group. See:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c#n2043

> In that case, the periodic nexthops update
> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c#n987)
> won't help. How do you ensure that all packets of the flow are always forwarded
> through the same nexthop?

I don't think we can ensure that for a flow in which some packets are
forwarded by the kernel and some by the device, but I failed to
understand your example of such a flow.

Thanks

^ permalink raw reply

* Re: [GIT PULL 0/1] IPVS Fixes for v4.12
From: Pablo Neira Ayuso @ 2017-05-15 16:45 UTC (permalink / raw)
  To: Simon Horman
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov
In-Reply-To: <1494236923-8015-1-git-send-email-horms@verge.net.au>

On Mon, May 08, 2017 at 11:48:42AM +0200, Simon Horman wrote:
> Hi Pablo,
> 
> please consider this fix to IPVS for v4.12.
> 
> * It is a fix from Julian Anastasov to only SNAT SNAT packet replies only for
>   NATed connections
> 
> 
> My understanding is that this fix is appropriate for 4.9.25, 4.10.13, 4.11
> as well as the nf tree. Julian has separately posted backports for other
> -stable kernels; please see:
> 
> * [PATCH 3.2.88,3.4.113 -stable 1/3] ipvs: SNAT packet replies only for
>         NATed connections
> * [PATCH 3.10.105,3.12.73,3.16.43,4.1.39 -stable 2/3] ipvs: SNAT packet
>         replies only for NATed connections 
> * [PATCH 4.4.65 -stable 3/3] ipvs: SNAT packet replies only for NATed
>         connections

Pulled, thanks.

Please, resubmit your stable backport patches once this patch hits
Linus' linux.git tree, Cc: stable@lists.kernel.org, I'll be glad to
ack them.

^ permalink raw reply

* Re: [PATCH nf] xtables: zero padding in data_to_user
From: Pablo Neira Ayuso @ 2017-05-15 16:52 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: netfilter-devel, netdev, rgb, fwestpha, pmoore, pvrabec, davem,
	Willem de Bruijn
In-Reply-To: <20170509201737.102987-1-willemdebruijn.kernel@gmail.com>

On Tue, May 09, 2017 at 04:17:37PM -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> When looking up an iptables rule, the iptables binary compares the
> aligned match and target data (XT_ALIGN). In some cases this can
> exceed the actual data size to include padding bytes.
> 
> Before commit f77bc5b23fb1 ("iptables: use match, target and data
> copy_to_user helpers") the malloc()ed bytes were overwritten by the
> kernel with kzalloced contents, zeroing the padding and making the
> comparison succeed. After this patch, the kernel copies and clears
> only data, leaving the padding bytes undefined.
> 
> Extend the clear operation from data size to aligned data size to
> include the padding bytes, if any.
> 
> Padding bytes can be observed in both match and target, and the bug
> triggered, by issuing a rule with match icmp and target ACCEPT:
> 
>   iptables -t mangle -A INPUT -i lo -p icmp --icmp-type 1 -j ACCEPT
>   iptables -t mangle -D INPUT -i lo -p icmp --icmp-type 1 -j ACCEPT

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net v2] i40e/i40evf: proper update of the page_offset field
From: Duyck, Alexander H @ 2017-05-15 16:57 UTC (permalink / raw)
  To: netdev@vger.kernel.org, intel-wired-lan@lists.osuosl.org,
	bjorn.topel@gmail.com
  Cc: Topel, Bjorn, Kirsher, Jeffrey T, alexander.duyck@gmail.com
In-Reply-To: <20170515045200.27789-1-bjorn.topel@gmail.com>

On Mon, 2017-05-15 at 06:52 +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> In f8b45b74cc62 ("i40e/i40evf: Use build_skb to build frames")
> i40e_build_skb updates the page_offset field with an incorrect offset,
> which can lead to data corruption. This patch updates page_offset
> correctly, by properly setting truesize.
> 
> Note that the bug only appears on architectures where PAGE_SIZE is
> 8192 or larger.
> 
> Fixes: f8b45b74cc62 ("i40e/i40evf: Use build_skb to build frames")
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>

Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>

> ---
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 3 ++-
>  drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 3 ++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index 29321a6167a6..cd894f4023b1 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -1854,7 +1854,8 @@ static struct sk_buff *i40e_build_skb(struct i40e_ring *rx_ring,
>  #if (PAGE_SIZE < 8192)
>  	unsigned int truesize = i40e_rx_pg_size(rx_ring) / 2;
>  #else
> -	unsigned int truesize = SKB_DATA_ALIGN(size);
> +	unsigned int truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) +
> +				SKB_DATA_ALIGN(I40E_SKB_PAD + size);
>  #endif
>  	struct sk_buff *skb;
>  
> diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
> index dfe241a12ad0..12b02e530503 100644
> --- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
> @@ -1190,7 +1190,8 @@ static struct sk_buff *i40e_build_skb(struct i40e_ring *rx_ring,
>  #if (PAGE_SIZE < 8192)
>  	unsigned int truesize = i40e_rx_pg_size(rx_ring) / 2;
>  #else
> -	unsigned int truesize = SKB_DATA_ALIGN(size);
> +	unsigned int truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) +
> +				SKB_DATA_ALIGN(I40E_SKB_PAD + size);
>  #endif
>  	struct sk_buff *skb;
>  

^ permalink raw reply

* Re: [PATCH] kmod: don't load module unless req process has CAP_SYS_MODULE
From: Kees Cook @ 2017-05-15 17:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mahesh Bandewar (महेश बंडेवार),
	Greg Kroah-Hartman, Mahesh Bandewar, Ingo Molnar, LKML, netdev,
	Eric W . Biederman, David Miller
In-Reply-To: <CANn89i+iuT0xPwiOspxA-_7YpY8CXjWYh4o5MskFj6S=v=K8rA@mail.gmail.com>

On Mon, May 15, 2017 at 6:12 AM, Eric Dumazet <edumazet@google.com> wrote:
> On Sun, May 14, 2017 at 7:42 PM, Mahesh Bandewar (महेश बंडेवार)
> <maheshb@google.com> wrote:
>> On Sun, May 14, 2017 at 3:45 AM, Greg Kroah-Hartman
>> <gregkh@linuxfoundation.org> wrote:
>>> On Fri, May 12, 2017 at 04:22:59PM -0700, Mahesh Bandewar wrote:
>>>> From: Mahesh Bandewar <maheshb@google.com>
>>>>
>> [...]
>>>>   Now try to create a bridge inside this newly created net-ns which would
>>>>   mean bridge module need to be loaded.
>>>>   # ip link add br0 type bridge
>>>>   # echo $?
>>>>   0
>>>>   # lsmod | grep bridge
>>>>   bridge                110592  0
>>>>   stp                    16384  1 bridge
>>>>   llc                    16384  2 bridge,stp
>>>>   #
>>>>
>>>>   After this patch -
>>>>   # ip link add br0 type bridge
>>>>   RTNETLINK answers: Operation not supported
>>>>   # echo $?
>>>>   2
>>>>   # lsmod | grep bridge
>>>>   #
>>>
>>> Well, it only loads this because the kernel asked for it to be loaded,
>>> right?
>>>
>> Yes, kernel asked for it because of a user action.
>>
>>>>
>>>> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
>>>> ---
>>>>  kernel/kmod.c | 3 +++
>>>>  1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/kernel/kmod.c b/kernel/kmod.c
>>>> index 563f97e2be36..ac30157169b7 100644
>>>> --- a/kernel/kmod.c
>>>> +++ b/kernel/kmod.c
>>>> @@ -133,6 +133,9 @@ int __request_module(bool wait, const char *fmt, ...)
>>>>  #define MAX_KMOD_CONCURRENT 50       /* Completely arbitrary value - KAO */
>>>>       static int kmod_loop_msg;
>>>>
>>>> +     if (!capable(CAP_SYS_MODULE))
>>>> +             return -EPERM;
>>>
>>> At first glance this looks right, but I'm worried what this will break
>>> that currently relies on this.  There might be lots of systems that are
>>> used to this being the method that the needed module is requested.  What
>>> about when userspace asks for a random char device and that module is
>>> then loaded?  Does this patch break that functionality?
>>>
>> Any module when loaded gets loaded system-wide as we can't allow
>> module loading per-ns. To validate the behavior I was comparing it
>> with insmod/modprobe, if that doesn't allow because of lack of this
>> capability in default-ns, then this *indirect* method of loading
>> module should not allow the same action and the behavior should be
>> consistent. So with that logic if userspace asks for a random
>> char-device if insmod/modprobe cannot load it, then this method should
>> not load it either for the consistency, right?
>
>
> This patch will break applications that expected modules being auto loaded.

I would prefer that we continue to look at the autoloading
restrictions series, since that will be more flexible and cover a
wider set of cases:

https://lkml.org/lkml/2017/4/19/1086

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply

* Re: [PATCH] neighbour: update neigh timestamps iff update is effective
From: David Miller @ 2017-05-15 17:10 UTC (permalink / raw)
  To: ihrachys; +Cc: ja, hchunhui, netdev
In-Reply-To: <20170510000605.6799-1-ihrachys@redhat.com>

From: Ihar Hrachyshka <ihrachys@redhat.com>
Date: Tue,  9 May 2017 17:06:05 -0700

> Sometimes neigh_update calls won't touch neither lladdr nor state, for
> example if an update arrives in locktime interval. Then we effectively
> ignore the update request, bailing out of touching the neigh entry,
> except that we still bump its timestamps.

So, in order to understand this, one has to know that the ->updated
value is tested by the protocol specific neigh code, which in turn
will thus influence whether NEIGH_UPDATE_F_OVERRIDE gets set in the
call to neigh_update() or not.

Please update your commit message to explain that this is how the
locktime mechanism influences neigh_update()'s behavior.

Thank you.

^ permalink raw reply

* Re: [PATCH 00/36] Convert DocBook documents to ReST
From: Jonathan Corbet @ 2017-05-15 17:11 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Andrew Lunn, alsa-devel, Linux Doc Mailing List, Takashi Iwai,
	Jan Kiszka, Herton R. Krzesinski, Alexei Starovoitov,
	Takashi Iwai, J. Bruce Fields, linux-ide, Eric Dumazet, netdev,
	Jeff Layton, Jan Kara, Soheil Hassas Yeganeh, linux-s390,
	Florian Fainelli, James E.J. Bottomley, Herbert Xu, linux-scsi,
	Ursula Braun, Rafael J. Wysocki, Peter Zijlstra, Julian Anastasov,
	Ingo
In-Reply-To: <cover.1494596071.git.mchehab@s-opensource.com>

On Fri, 12 May 2017 10:59:43 -0300
Mauro Carvalho Chehab <mchehab@s-opensource.com> wrote:

> This patch series convert the following books from
> DocBook to ReST:
> 
>    - filesystems
>    - kernel-hacking
>    - kernel-locking
>    - kgdb
>    - libata
>    - networking
>    - rapidio
>    - s390-drivers
>    - scsi
>    - w1
>    - z8530book
> 
> It also adjusts some Sphinx-pedantic errors/warnings on
> some kernel-doc markups.
> 
> I also added some patches here to add PDF output for all
> existing ReST books.

So I've been through the series (including digging out the parts that
weren't sent to me).  

> I did my best to check if what's there is not too outdated, but
> the best is if the subsystem maintainers could check it.

That has been my real concern with those remaining books; many of them
have not been touched in any significant way in at least ten years.  Just
shoveling a bunch of stuff into RST doesn't really solve the problem that
Documentation/ is an unorganized jumble of sometimes highly outdated
documentation.

But, then, I guess there's value in having a disorganized jumble that
depends on only one fragile toolchain rather than two :)  So maybe we
should just do this.

I only had one real comment with the series beyond the general stuff
here.  I see Markus had a few.  When the tweaks are done, can you send me
a series for the stuff I can apply, and I'll do it?

Thanks,

jon

^ permalink raw reply

* Your first payment of $5000  ,
From: Mrs. Linda Jones @ 2017-05-15 17:16 UTC (permalink / raw)


Attn Beneficiary,

We have deposited the check of your fund ($2.5m USD) through western union money transfer
department after our finally meeting today regarding your fund, Now all you
will do is to contact western union director Mis Rose Kelly  ,And She will give you
the direction on how you will be receiving your funds daily. Remember to send
her your Full information to avoid wrong transfer such as,

Your Receiver Name--------------
Your Country--------------------
Your City-----------------------
Your Phone No-----------------------
Your Address--------------------
Your Id card or pasport :.......

Therefore you are advised to contact western union accountant Manager 
 to her bellow information and tell She to give you the Mtcn, sender name
and question/answer to pick the money,

CONTACT Name: Mis Rose Kelly reply to (  wstun.office123@gmail.com  )
Phone: +229-99374614 


Get back to us once you receive your total fund of $2.5m.
Thanks and God bless you.

Best Regards
WESTERN UNION AGENT

^ permalink raw reply

* Re: [PATCH] mdio: mux: fix device_node_continue.cocci warnings
From: Jon Mason @ 2017-05-15 17:37 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: David Miller, Julia Lawall, Network Development, Andrew Lunn,
	kbuild-all, open list
In-Reply-To: <872f3980-9faa-718f-3260-9e4b22946140@gmail.com>

On Fri, May 12, 2017 at 6:52 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
> On 05/12/2017 09:22 AM, David Miller wrote:
>> From: Julia Lawall <julia.lawall@lip6.fr>
>> Date: Fri, 12 May 2017 22:54:23 +0800 (SGT)
>>
>>> Device node iterators put the previous value of the index variable, so an
>>> explicit put causes a double put.
>>  ...
>>> @@ -169,7 +169,6 @@ int mdio_mux_init(struct device *dev,
>>>              if (r) {
>>>                      mdiobus_free(cb->mii_bus);
>>>                      devm_kfree(dev, cb);
>>> -                    of_node_put(child_bus_node);
>>>              } else {
>>
>> I think we're instead simply missing a break; statement here.
>
> It's kind of questionable, if we have an error initializing one of our
> child MDIO bus controller (child from the perspective of the MDIO mux,
> boy this is getting complicated...), should we keep on going, or should
> we abort entirely and rollback what we have successfully registered?
>
> I don't think Julia's patch makes thing worse, in that if we had to
> rollback, we would not be doing this correctly now anyway.
>
> Jon, what do you think?

If every other case is fatal, then it is odd that this one is
permissive.  I think we should go 100% one way or the other.  So, the
options here are to:
1.  Encounter an error, unroll any mallocs, etc created by this entry,
but continue on to the next entry and return success if any are
created
2.  Encounter an error, unroll any mallocs, etc created by this entry
and any others that were created, and return an error
3.  Encounter an error, unroll any mallocs, etc created by this entry,
exit and return success if any are created

#1 would be the most accepting of any errors encountered
#2 would identify any poorly written DTs by breaking their currently
working functionality (though we should add some error messages to let
them know why)
#3 matches the suggestion by David Miller, and would be a hybrid of #1
and #2 in outcome

I would prefer #1, as I would not want to break something that was
currently working.  However, I think we should add much error logging
here to let people know their DT is hosed (instead of silently
working).  So, this would mean applying Julia's patch, and I'll do a
follow-on to change the breaks to continues and add the error logging
(assuming others agree with me).

Thanks,
Jon

^ permalink raw reply

* Re: [PATCH 00/36] Convert DocBook documents to ReST
From: Mauro Carvalho Chehab @ 2017-05-15 17:41 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Andrew Lunn, alsa-devel, Linux Doc Mailing List, Takashi Iwai,
	Jan Kiszka, Herton R. Krzesinski, Alexei Starovoitov,
	Takashi Iwai, J. Bruce Fields, linux-ide, Eric Dumazet, netdev,
	Jeff Layton, Jan Kara, Soheil Hassas Yeganeh, linux-s390,
	Florian Fainelli, James E.J. Bottomley, Herbert Xu, linux-scsi,
	Ursula Braun, Rafael J. Wysocki, Peter Zijlstra, Julian Anastasov,
	Ingo
In-Reply-To: <20170515111141.0fcd5ee6@lwn.net>

Em Mon, 15 May 2017 11:11:41 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:

> On Fri, 12 May 2017 10:59:43 -0300
> Mauro Carvalho Chehab <mchehab@s-opensource.com> wrote:
> 
> > This patch series convert the following books from
> > DocBook to ReST:
> > 
> >    - filesystems
> >    - kernel-hacking
> >    - kernel-locking
> >    - kgdb
> >    - libata
> >    - networking
> >    - rapidio
> >    - s390-drivers
> >    - scsi
> >    - w1
> >    - z8530book
> > 
> > It also adjusts some Sphinx-pedantic errors/warnings on
> > some kernel-doc markups.
> > 
> > I also added some patches here to add PDF output for all
> > existing ReST books.  
> 
> So I've been through the series (including digging out the parts that
> weren't sent to me).  
> 
> > I did my best to check if what's there is not too outdated, but
> > the best is if the subsystem maintainers could check it.  
> 
> That has been my real concern with those remaining books; many of them
> have not been touched in any significant way in at least ten years. Just
> shoveling a bunch of stuff into RST doesn't really solve the problem that
> Documentation/ is an unorganized jumble of sometimes highly outdated
> documentation.

True. Yet, on the checks I did, on the books that have API descriptions,
the C domain references still exist. On the books that just have
kernel-doc tags, I wouldn't expect any changes there, as the API
changes should be, instead, at the C code.

So, I guess that it is not that bad, and, by having them in ReST will
make them easier to be updated, as ReST is basically ascii with benefits.

> But, then, I guess there's value in having a disorganized jumble that
> depends on only one fragile toolchain rather than two :)  So maybe we
> should just do this.
> 
> I only had one real comment with the series beyond the general stuff
> here.  I see Markus had a few.  When the tweaks are done, can you send me
> a series for the stuff I can apply, and I'll do it?

Sure, I'm addressing the comments and will send you a new series.

Thanks,
Mauro

^ permalink raw reply

* Re: arch: arm: bpf: Converting cBPF to eBPF for arm 32 bit
From: Kees Cook @ 2017-05-15 17:44 UTC (permalink / raw)
  To: Shubham Bansal
  Cc: David Miller, Mircea Gherzan, Network Development,
	kernel-hardening@lists.openwall.com,
	linux-arm-kernel@lists.infradead.org, ast, Daniel Borkmann
In-Reply-To: <CAHgaXdL7GcVzs+ANPke_NywhcgHbe_fzi1sTDEy+Ni1-o82GYQ@mail.gmail.com>

On Sat, May 13, 2017 at 2:38 PM, Shubham Bansal
<illusionist.neo@gmail.com> wrote:
> Finally finished testing.
>
> "test_bpf: Summary: 314 PASSED, 0 FAILED, [274/306 JIT'ed]"

Nice work! Glad you've been chipping away at this. Thanks!

-Kees

>
> Will send the patch after code refactoring. Thanks for all the help
> you guys. I really really appreciate it.
>
> Special thanks to Kees and Daniel. :)
>
> Best,
> Shubham Bansal
>
>
> On Thu, May 11, 2017 at 9:00 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Thu, May 11, 2017 at 2:32 AM, Shubham Bansal
>> <illusionist.neo@gmail.com> wrote:
>>> What do you guys suggest i should implement it? I am almost done with
>>> my current implementation but if you think I should change it to the
>>> way David suggested, its better to suggest now before I send the
>>> patch.
>>
>> I'd say send what you have right now, as it's a good starting point
>> for future work. I'll be curious to see the benchmarks, etc. It can be
>> a base for further optimization.
>>
>> Thanks for chipping away at this!
>>
>> -Kees
>>
>> --
>> Kees Cook
>> Pixel Security



-- 
Kees Cook
Pixel Security

^ permalink raw reply

* [PATCH net-next 0/2] ldmvsw: port removal stability
From: Shannon Nelson @ 2017-05-15 17:51 UTC (permalink / raw)
  To: davem, netdev; +Cc: sparclinux, Shannon Nelson

Under heavy reboot stress testing we found a couple of timing issues
when removing the device that could cause the kernel great heartburn,
addressed by these two patches.

Shannon Nelson (1):
  ldmvsw: stop the clean timer at beginning of remove

Thomas Tai (1):
  ldmvsw: unregistering netdev before disable hardware

 drivers/net/ethernet/sun/ldmvsw.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

^ permalink raw reply

* [PATCH net-next 1/2] ldmvsw: unregistering netdev before disable hardware
From: Shannon Nelson @ 2017-05-15 17:51 UTC (permalink / raw)
  To: davem, netdev; +Cc: sparclinux, Thomas Tai, Shannon Nelson
In-Reply-To: <1494870668-65047-1-git-send-email-shannon.nelson@oracle.com>

From: Thomas Tai <thomas.tai@oracle.com>

When running LDom binding/unbinding test, kernel may panic
in ldmvsw_open(). It is more likely that because we're removing
the ldc connection before unregistering the netdev in vsw_port_remove(),
we set up a window of time where one process could be removing the
device while another trying to UP the device. This also sometimes causes
vio handshake error due to opening a device without closing it completely.
We should unregister the netdev before we disable the "hardware".

Orabug: 25980913, 25925306

Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 drivers/net/ethernet/sun/ldmvsw.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/sun/ldmvsw.c b/drivers/net/ethernet/sun/ldmvsw.c
index 5a90fed..309747c 100644
--- a/drivers/net/ethernet/sun/ldmvsw.c
+++ b/drivers/net/ethernet/sun/ldmvsw.c
@@ -413,6 +413,7 @@ static int vsw_port_remove(struct vio_dev *vdev)
 		del_timer_sync(&port->vio.timer);
 
 		napi_disable(&port->napi);
+		unregister_netdev(port->dev);
 
 		list_del_rcu(&port->list);
 
@@ -427,7 +428,6 @@ static int vsw_port_remove(struct vio_dev *vdev)
 
 		dev_set_drvdata(&vdev->dev, NULL);
 
-		unregister_netdev(port->dev);
 		free_netdev(port->dev);
 	}
 
-- 
1.7.1

^ permalink raw reply related

* [PATCH net-next 2/2] ldmvsw: stop the clean timer at beginning of remove
From: Shannon Nelson @ 2017-05-15 17:51 UTC (permalink / raw)
  To: davem, netdev; +Cc: sparclinux, Shannon Nelson
In-Reply-To: <1494870668-65047-1-git-send-email-shannon.nelson@oracle.com>

Stop the clean timer earlier to be sure there's no asynchronous
interference while stopping the port.

Orabug: 25748241

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 drivers/net/ethernet/sun/ldmvsw.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/sun/ldmvsw.c b/drivers/net/ethernet/sun/ldmvsw.c
index 309747c..5b56c24 100644
--- a/drivers/net/ethernet/sun/ldmvsw.c
+++ b/drivers/net/ethernet/sun/ldmvsw.c
@@ -411,6 +411,7 @@ static int vsw_port_remove(struct vio_dev *vdev)
 
 	if (port) {
 		del_timer_sync(&port->vio.timer);
+		del_timer_sync(&port->clean_timer);
 
 		napi_disable(&port->napi);
 		unregister_netdev(port->dev);
@@ -418,7 +419,6 @@ static int vsw_port_remove(struct vio_dev *vdev)
 		list_del_rcu(&port->list);
 
 		synchronize_rcu();
-		del_timer_sync(&port->clean_timer);
 		spin_lock_irqsave(&port->vp->lock, flags);
 		sunvnet_port_rm_txq_common(port);
 		spin_unlock_irqrestore(&port->vp->lock, flags);
-- 
1.7.1

^ permalink raw reply related

* [PATCH net-next] geneve: add rtnl changelink support
From: Girish Moodalbail @ 2017-05-15 17:47 UTC (permalink / raw)
  To: netdev; +Cc: davem, pshelar, joe, jbenc

This patch adds changelink rtnl operation support for geneve devices.
Code changes involve:
  - refactor geneve_newlink into geneve_nl2info to be used by both
    geneve_newlink and geneve_changelink
  - geneve_nl2info takes a changelink boolean argument to isolate
    changelink checks and updates.
  - Allow changing only a few attributes:
    - return -EOPNOTSUPP for attributes that cannot be changed for
      now. Incremental patches can make the non-supported one
      available in the future if needed.

Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
---
 drivers/net/geneve.c | 149 ++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 117 insertions(+), 32 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index dec5d56..6528910 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -1112,6 +1112,18 @@ static bool is_tnl_info_zero(const struct ip_tunnel_info *info)
 		return true;
 }
 
+static inline bool geneve_dst_addr_equal(struct ip_tunnel_info *a,
+					 struct ip_tunnel_info *b)
+{
+	if (ip_tunnel_info_af(a) != ip_tunnel_info_af(b))
+		return false;
+
+	if (ip_tunnel_info_af(a) == AF_INET)
+		return a->key.u.ipv4.dst == b->key.u.ipv4.dst;
+	else
+		return ipv6_addr_equal(&a->key.u.ipv6.dst, &b->key.u.ipv6.dst);
+}
+
 static int geneve_configure(struct net *net, struct net_device *dev,
 			    const struct ip_tunnel_info *info,
 			    bool metadata, bool ipv6_rx_csum)
@@ -1169,45 +1181,58 @@ static void init_tnl_info(struct ip_tunnel_info *info, __u16 dst_port)
 	info->key.tp_dst = htons(dst_port);
 }
 
-static int geneve_newlink(struct net *net, struct net_device *dev,
-			  struct nlattr *tb[], struct nlattr *data[])
+static int geneve_nl2info(struct net_device *dev, struct nlattr *tb[],
+			  struct nlattr *data[], struct ip_tunnel_info *info,
+			  bool *metadata, bool *use_udp6_rx_checksums,
+			  bool changelink)
 {
-	bool use_udp6_rx_checksums = false;
-	struct ip_tunnel_info info;
-	bool metadata = false;
+	struct geneve_dev *geneve = netdev_priv(dev);
 
-	init_tnl_info(&info, GENEVE_UDP_PORT);
+	if (changelink) {
+		/* if changelink operation, start with old existing info */
+		memcpy(info, &geneve->info, sizeof(*info));
+		*metadata = geneve->collect_md;
+		*use_udp6_rx_checksums = geneve->use_udp6_rx_checksums;
+	} else {
+		init_tnl_info(info, GENEVE_UDP_PORT);
+	}
 
 	if (data[IFLA_GENEVE_REMOTE] && data[IFLA_GENEVE_REMOTE6])
 		return -EINVAL;
 
 	if (data[IFLA_GENEVE_REMOTE]) {
-		info.key.u.ipv4.dst =
+		info->key.u.ipv4.dst =
 			nla_get_in_addr(data[IFLA_GENEVE_REMOTE]);
 
-		if (IN_MULTICAST(ntohl(info.key.u.ipv4.dst))) {
+		if (IN_MULTICAST(ntohl(info->key.u.ipv4.dst))) {
 			netdev_dbg(dev, "multicast remote is unsupported\n");
 			return -EINVAL;
 		}
+		if (changelink &&
+		    ip_tunnel_info_af(&geneve->info) == AF_INET6) {
+			info->mode &= ~IP_TUNNEL_INFO_IPV6;
+			info->key.tun_flags &= ~TUNNEL_CSUM;
+			*use_udp6_rx_checksums = false;
+		}
 	}
 
 	if (data[IFLA_GENEVE_REMOTE6]) {
  #if IS_ENABLED(CONFIG_IPV6)
-		info.mode = IP_TUNNEL_INFO_IPV6;
-		info.key.u.ipv6.dst =
+		info->mode = IP_TUNNEL_INFO_IPV6;
+		info->key.u.ipv6.dst =
 			nla_get_in6_addr(data[IFLA_GENEVE_REMOTE6]);
 
-		if (ipv6_addr_type(&info.key.u.ipv6.dst) &
+		if (ipv6_addr_type(&info->key.u.ipv6.dst) &
 		    IPV6_ADDR_LINKLOCAL) {
 			netdev_dbg(dev, "link-local remote is unsupported\n");
 			return -EINVAL;
 		}
-		if (ipv6_addr_is_multicast(&info.key.u.ipv6.dst)) {
+		if (ipv6_addr_is_multicast(&info->key.u.ipv6.dst)) {
 			netdev_dbg(dev, "multicast remote is unsupported\n");
 			return -EINVAL;
 		}
-		info.key.tun_flags |= TUNNEL_CSUM;
-		use_udp6_rx_checksums = true;
+		info->key.tun_flags |= TUNNEL_CSUM;
+		*use_udp6_rx_checksums = true;
 #else
 		return -EPFNOSUPPORT;
 #endif
@@ -1216,48 +1241,107 @@ static int geneve_newlink(struct net *net, struct net_device *dev,
 	if (data[IFLA_GENEVE_ID]) {
 		__u32 vni;
 		__u8 tvni[3];
+		__be64 tunid;
 
 		vni = nla_get_u32(data[IFLA_GENEVE_ID]);
 		tvni[0] = (vni & 0x00ff0000) >> 16;
 		tvni[1] = (vni & 0x0000ff00) >> 8;
 		tvni[2] =  vni & 0x000000ff;
 
-		info.key.tun_id = vni_to_tunnel_id(tvni);
+		tunid = vni_to_tunnel_id(tvni);
+		if (changelink && (tunid != info->key.tun_id))
+			return -EOPNOTSUPP;
+		info->key.tun_id = tunid;
 	}
+
 	if (data[IFLA_GENEVE_TTL])
-		info.key.ttl = nla_get_u8(data[IFLA_GENEVE_TTL]);
+		info->key.ttl = nla_get_u8(data[IFLA_GENEVE_TTL]);
 
 	if (data[IFLA_GENEVE_TOS])
-		info.key.tos = nla_get_u8(data[IFLA_GENEVE_TOS]);
+		info->key.tos = nla_get_u8(data[IFLA_GENEVE_TOS]);
 
 	if (data[IFLA_GENEVE_LABEL]) {
-		info.key.label = nla_get_be32(data[IFLA_GENEVE_LABEL]) &
+		info->key.label = nla_get_be32(data[IFLA_GENEVE_LABEL]) &
 				  IPV6_FLOWLABEL_MASK;
-		if (info.key.label && (!(info.mode & IP_TUNNEL_INFO_IPV6)))
+		if (info->key.label && (!(info->mode & IP_TUNNEL_INFO_IPV6)))
 			return -EINVAL;
 	}
 
-	if (data[IFLA_GENEVE_PORT])
-		info.key.tp_dst = nla_get_be16(data[IFLA_GENEVE_PORT]);
+	if (data[IFLA_GENEVE_PORT]) {
+		if (changelink)
+			return -EOPNOTSUPP;
+		info->key.tp_dst = nla_get_be16(data[IFLA_GENEVE_PORT]);
+	}
+
+	if (data[IFLA_GENEVE_COLLECT_METADATA]) {
+		if (changelink)
+			return -EOPNOTSUPP;
+		*metadata = true;
+	}
+
+	if (data[IFLA_GENEVE_UDP_CSUM]) {
+		if (changelink)
+			return -EOPNOTSUPP;
+		if (nla_get_u8(data[IFLA_GENEVE_UDP_CSUM]))
+			info->key.tun_flags |= TUNNEL_CSUM;
+	}
+
+	if (data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX]) {
+		if (changelink)
+			return -EOPNOTSUPP;
+		if (nla_get_u8(data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX]))
+			info->key.tun_flags &= ~TUNNEL_CSUM;
+	}
 
-	if (data[IFLA_GENEVE_COLLECT_METADATA])
-		metadata = true;
+	if (data[IFLA_GENEVE_UDP_ZERO_CSUM6_RX]) {
+		if (changelink)
+			return -EOPNOTSUPP;
+		if (nla_get_u8(data[IFLA_GENEVE_UDP_ZERO_CSUM6_RX]))
+			*use_udp6_rx_checksums = false;
+	}
 
-	if (data[IFLA_GENEVE_UDP_CSUM] &&
-	    nla_get_u8(data[IFLA_GENEVE_UDP_CSUM]))
-		info.key.tun_flags |= TUNNEL_CSUM;
+	return 0;
+}
 
-	if (data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX] &&
-	    nla_get_u8(data[IFLA_GENEVE_UDP_ZERO_CSUM6_TX]))
-		info.key.tun_flags &= ~TUNNEL_CSUM;
+static int geneve_newlink(struct net *net, struct net_device *dev,
+			  struct nlattr *tb[], struct nlattr *data[])
+{
+	bool use_udp6_rx_checksums = false;
+	struct ip_tunnel_info info;
+	bool metadata = false;
+	int err;
 
-	if (data[IFLA_GENEVE_UDP_ZERO_CSUM6_RX] &&
-	    nla_get_u8(data[IFLA_GENEVE_UDP_ZERO_CSUM6_RX]))
-		use_udp6_rx_checksums = false;
+	err = geneve_nl2info(dev, tb, data, &info, &metadata,
+			     &use_udp6_rx_checksums, false);
+	if (err)
+		return err;
 
 	return geneve_configure(net, dev, &info, metadata, use_udp6_rx_checksums);
 }
 
+static int geneve_changelink(struct net_device *dev, struct nlattr *tb[],
+			     struct nlattr *data[])
+{
+	struct geneve_dev *geneve = netdev_priv(dev);
+	struct ip_tunnel_info info;
+	bool metadata = false;
+	bool use_udp6_rx_checksums = false;
+	int err;
+
+	err = geneve_nl2info(dev, tb, data, &info, &metadata,
+			     &use_udp6_rx_checksums, true);
+	if (err)
+		return err;
+
+	if (!geneve_dst_addr_equal(&geneve->info, &info))
+		dst_cache_reset(&info.dst_cache);
+	geneve->info = info;
+	geneve->collect_md = metadata;
+	geneve->use_udp6_rx_checksums = use_udp6_rx_checksums;
+
+	return 0;
+}
+
 static void geneve_dellink(struct net_device *dev, struct list_head *head)
 {
 	struct geneve_dev *geneve = netdev_priv(dev);
@@ -1344,6 +1428,7 @@ static int geneve_fill_info(struct sk_buff *skb, const struct net_device *dev)
 	.setup		= geneve_setup,
 	.validate	= geneve_validate,
 	.newlink	= geneve_newlink,
+	.changelink	= geneve_changelink,
 	.dellink	= geneve_dellink,
 	.get_size	= geneve_get_size,
 	.fill_info	= geneve_fill_info,
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH] kmod: don't load module unless req process has CAP_SYS_MODULE
From: Mahesh Bandewar (महेश बंडेवार) @ 2017-05-15 17:59 UTC (permalink / raw)
  To: David Miller
  Cc: gregkh, ebiederm, mahesh, mingo, linux-kernel, linux-netdev,
	keescook, Eric Dumazet
In-Reply-To: <20170515.095228.1483686375235860235.davem@davemloft.net>

On Mon, May 15, 2017 at 6:52 AM, David Miller <davem@davemloft.net> wrote:
> From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Date: Mon, 15 May 2017 08:10:59 +0200
>
>> On Sun, May 14, 2017 at 08:57:34AM -0500, Eric W. Biederman wrote:
>>> Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes:
>>>
>>> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
>>> index bcb0f610ee42..6b72528a4636 100644
>>> --- a/net/core/rtnetlink.c
>>> +++ b/net/core/rtnetlink.c
>>> @@ -2595,7 +2595,7 @@ static int rtnl_newlink(struct sk_buff *skb, struct nlmsghdr *nlh,
>>>
>>>                 if (!ops) {
>>>  #ifdef CONFIG_MODULES
>>> -                       if (kind[0]) {
>>> +                       if (kind[0] && capable(CAP_NET_ADMIN)) {
>>>                                 __rtnl_unlock();
>>>                                 request_module("rtnl-link-%s", kind);
>>>                                 rtnl_lock();
>>
>> I don't object to this if the networking developers don't mind the
>> change in functionality.  They can handle the fallout :)
>
> As I've said in another email, I am pretty sure this can break things.

The current behavior is already breaking things. e.g. unprivileged
process can be root inside it's own user-ns. This will allow it to
create IPtable rules causing contracking module to be loaded in
default-ns affecting every flow on the server (not just the namespace
that user or an unprivileged process is attached to). Cases that I
mentioned above are just the tip of an iceberg.

In a non-namespace world this wouldn't happen as capability checks are
performed correctly but the moment an unprivileged user can create
it's own user-ns and becomes root inside, it could make use of these
things and perform privileged operations in default-ns. So to protect
"global namespace" from making such things happen, we have to protect
using global capability check.

Alternatively we can preserve the existing behavior by adding this
check for non-default-user-ns only. e.g.

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 6e67315ec368..263f0d175091 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2595,7 +2595,9 @@ static int rtnl_newlink(struct sk_buff *skb,
struct nlmsghdr *nlh,

                if (!ops) {
 #ifdef CONFIG_MODULES
-                       if (kind[0]) {
+                       if (kind[0] &&
+                           ((net->user_ns == &init_user_ns) ||
+                            capable(CAP_SYS_MODULE))) {
                                __rtnl_unlock();
                                request_module("rtnl-link-%s", kind);
                                rtnl_lock();

if we have to do this in net-subsystem then it's not just this call
site and there are lot more. But if this is an acceptable alternative,
I can think of better implementation for all those sites.

^ permalink raw reply related

* Re: [PATCH] arp: honour gratuitous ARP _replies_
From: David Miller @ 2017-05-15 18:08 UTC (permalink / raw)
  To: ihrachys; +Cc: jmorris, yoshfuji, kaber, netdev
In-Reply-To: <20170510001607.9716-1-ihrachys@redhat.com>

From: Ihar Hrachyshka <ihrachys@redhat.com>
Date: Tue,  9 May 2017 17:16:07 -0700

> @@ -842,8 +844,20 @@ static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb)
>  		   It is possible, that this option should be enabled for some
>  		   devices (strip is candidate)
>  		 */
> -		is_garp = arp->ar_op == htons(ARPOP_REQUEST) && tip == sip &&
> -			  addr_type == RTN_UNICAST;
> +		is_garp = tip == sip && addr_type == RTN_UNICAST;
> +
> +		/* Unsolicited ARP _replies_ also require target hwaddr to be
> +		 * the same as source.
> +		 */
> +		if (is_garp && arp->ar_op == htons(ARPOP_REPLY))
> +			is_garp =
> +#if IS_ENABLED(CONFIG_FIREWIRE_NET)
> +				/* IPv4 over IEEE 1394 doesn't provide target
> +				 * hardware address field in its ARP payload.
> +				 */
> +				tha &&
> +#endif
> +				!memcmp(tha, sha, dev->addr_len);
>  

The ifdefs here make the test harder to understand.

I would suggest removing the ifdef and letting the compiler remove the 'tha'
check if it can.

Thank you.

^ permalink raw reply

* Re: [PATCH] kmod: don't load module unless req process has CAP_SYS_MODULE
From: David Miller @ 2017-05-15 18:14 UTC (permalink / raw)
  To: maheshb
  Cc: gregkh, ebiederm, mahesh, mingo, linux-kernel, netdev, keescook,
	edumazet
In-Reply-To: <CAF2d9jjdouSQYVk3kbWyOUEUe5b8S_Q6_Zx2WE_EBZB1cVeEvA@mail.gmail.com>

From: Mahesh Bandewar (महेश बंडेवार) <maheshb@google.com>
Date: Mon, 15 May 2017 10:59:55 -0700

> The current behavior is already breaking things. e.g. unprivileged
> process can be root inside it's own user-ns. This will allow it to
> create IPtable rules causing contracking module to be loaded in
> default-ns affecting every flow on the server (not just the namespace
> that user or an unprivileged process is attached to). Cases that I
> mentioned above are just the tip of an iceberg.

Yes, that is certainly undesirable.

But is it really a module loading problem?  Perhaps we need to look
more deeply into how conntract behaves by default wrt. namespaces.

If we've given the user the ability to be root in his or her own
namespace, then we should let them do root stuff in there.

The only problem is when "doing root stuff in there" has an
undesirable impact upon the rest of the system.

And that's needs to be looked into on a facility by facility basis,
rather then just sprinkling "no module loading" test here and there,
or even unconditionally.

^ permalink raw reply

* Re: [PATCH net] macvlan: Fix performance issues with vlan tagged packets
From: David Miller @ 2017-05-15 18:17 UTC (permalink / raw)
  To: vyasevich; +Cc: netdev, jasowang, mst, vyasevic
In-Reply-To: <1494515392-30826-1-git-send-email-vyasevic@redhat.com>

From: Vladislav Yasevich <vyasevich@gmail.com>
Date: Thu, 11 May 2017 11:09:52 -0400

> Macvlan always turns on offload features that have sofware
> fallback (NETIF_GSO_SOFTWARE).  This allows much higher guest-guest
> communications over macvtap.
> 
> However, macvtap does not turn on these features for vlan tagged traffic.
> As a result, depending on the HW that mactap is configured on, the
> performance of guest-guest communication over a vlan is very
> inconsistent.  If the HW supports TSO/UFO over vlans, then the
> performance will be fine.  If not, the the performance will suffer
> greatly since the VM may continue using TSO/UFO, and will force the host
> segment the traffic and possibly overlow the macvtap queue.
> 
> This patch adds the always on offloads to vlan_features.  This
> makes sure that any vlan tagged traffic between 2 guest will not
> be segmented needlessly.
> 
> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>

Indeed, this makes us more consistent with how we handle non-vlan
features in macvtap.

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] kmod: don't load module unless req process has CAP_SYS_MODULE
From: Eric W. Biederman @ 2017-05-15 18:20 UTC (permalink / raw)
  To: Mahesh Bandewar (महेश बंडेवार)
  Cc: David Miller, gregkh, mahesh, mingo, linux-kernel, linux-netdev,
	keescook, Eric Dumazet
In-Reply-To: <CAF2d9jjdouSQYVk3kbWyOUEUe5b8S_Q6_Zx2WE_EBZB1cVeEvA@mail.gmail.com>

"Mahesh Bandewar (महेश बंडेवार)" <maheshb@google.com> writes:

> On Mon, May 15, 2017 at 6:52 AM, David Miller <davem@davemloft.net> wrote:
>> From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>> Date: Mon, 15 May 2017 08:10:59 +0200
>>
>>> On Sun, May 14, 2017 at 08:57:34AM -0500, Eric W. Biederman wrote:
>>>> Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes:
>>>>
>>>> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
>>>> index bcb0f610ee42..6b72528a4636 100644
>>>> --- a/net/core/rtnetlink.c
>>>> +++ b/net/core/rtnetlink.c
>>>> @@ -2595,7 +2595,7 @@ static int rtnl_newlink(struct sk_buff *skb, struct nlmsghdr *nlh,
>>>>
>>>>                 if (!ops) {
>>>>  #ifdef CONFIG_MODULES
>>>> -                       if (kind[0]) {
>>>> +                       if (kind[0] && capable(CAP_NET_ADMIN)) {
>>>>                                 __rtnl_unlock();
>>>>                                 request_module("rtnl-link-%s", kind);
>>>>                                 rtnl_lock();
>>>
>>> I don't object to this if the networking developers don't mind the
>>> change in functionality.  They can handle the fallout :)
>>
>> As I've said in another email, I am pretty sure this can break things.
>
> The current behavior is already breaking things. e.g. unprivileged
> process can be root inside it's own user-ns. This will allow it to
> create IPtable rules causing contracking module to be loaded in
> default-ns affecting every flow on the server (not just the namespace
> that user or an unprivileged process is attached to). Cases that I
> mentioned above are just the tip of an iceberg.

If loading the conntrack module changes the semantics of packet
processing when nothing is configured that is a bug in the conntrack
module.

> In a non-namespace world this wouldn't happen as capability checks are
> performed correctly but the moment an unprivileged user can create
> it's own user-ns and becomes root inside, it could make use of these
> things and perform privileged operations in default-ns. So to protect
> "global namespace" from making such things happen, we have to protect
> using global capability check.
>
> Alternatively we can preserve the existing behavior by adding this
> check for non-default-user-ns only. e.g.

I believe last time this was discussed the compromise was that a prefix
would be prepended to request_module calls so that what each call
allows to be loaded would be limited in scope to what is sensible
in that location.

I don't think anyone made any arguments about increasing the
attack surface at that time.  So there may be reason to go back
and reexamine the decision on security grounds, but it needs
to be a clearly made argument.  Explaining to people the pros and cons
of the reason to perform the work.

> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index 6e67315ec368..263f0d175091 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -2595,7 +2595,9 @@ static int rtnl_newlink(struct sk_buff *skb,
> struct nlmsghdr *nlh,
>
>                 if (!ops) {
>  #ifdef CONFIG_MODULES
> -                       if (kind[0]) {
> +                       if (kind[0] &&
> +                           ((net->user_ns == &init_user_ns) ||
> +                            capable(CAP_SYS_MODULE))) {
>                                 __rtnl_unlock();
>                                 request_module("rtnl-link-%s", kind);
>                                 rtnl_lock();

This patch is definitely wrong.  CAP_NET_ADMIN had always guarded this
request_module call.  CAP_SYS_MODULE means you can request any module
you like dropping does not mean you can't request modules.

Adding a capable(CAP_NET_ADMIN) at this call site would be the least
breaking solution available, as it would only break things for callers
in non-initial network namespaces.  Your change would definitely things
for ordinary network administration tools with capabilities.

> if we have to do this in net-subsystem then it's not just this call
> site and there are lot more. But if this is an acceptable alternative,
> I can think of better implementation for all those sites.

Eric

^ permalink raw reply

* Re: [PATCH v4] net/mlx4_core: Use min3 to select number of MSI-X vectors
From: David Miller @ 2017-05-15 18:20 UTC (permalink / raw)
  To: yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA
  Cc: yishaih-VPRAkNaXOzVWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1494569451-2567-1-git-send-email-yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

From: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Date: Fri, 12 May 2017 09:10:51 +0300

> Signed-off-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> ---
> v0 -> v1:
> 	* s/"min_t("/"min_t(int"
> v1 -> v2:
> 	* Use min3 instead of min_t twice
> v2 -> v3:
> 	* Change commit log header message to reflect the changes made in
> 	  v2
> v3 -> v4:
> 	* Cast return value from num_online_cpus to int to avoid
> 	  compilation errors from "sparse"

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] net/packet: fix missing net_device reference release
From: David Miller @ 2017-05-15 18:25 UTC (permalink / raw)
  To: douglascs
  Cc: netdev, edumazet, daniel, willemb, jarno, andreyknvl, anoob.soman,
	sowmini.varadhan, philip.pettersson, rppt
In-Reply-To: <7daebe52-8542-87cb-551a-1a5b4912f140@taghos.com.br>

From: Douglas Caetano dos Santos <douglascs@taghos.com.br>
Date: Fri, 12 May 2017 15:19:15 -0300

> When using a TX ring buffer, if an error occurs processing a control
> message (e.g. invalid message), the net_device reference is not
> released.
> 
> Fixes c14ac9451c348 ("sock: enable timestamping using control messages")
> Signed-off-by: Douglas Caetano dos Santos <douglascs@taghos.com.br>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net] ipv6: avoid dad-failures for addresses with NODAD
From: David Miller @ 2017-05-15 18:26 UTC (permalink / raw)
  To: mahesh; +Cc: kuznet, jmorris, yoshfuji, kaber, netdev, edumazet, maheshb
In-Reply-To: <20170513000339.15843-1-mahesh@bandewar.net>

From: Mahesh Bandewar <mahesh@bandewar.net>
Date: Fri, 12 May 2017 17:03:39 -0700

> From: Mahesh Bandewar <maheshb@google.com>
> 
> Every address gets added with TENTATIVE flag even for the addresses with
> IFA_F_NODAD flag and dad-work is scheduled for them. During this DAD process
> we realize it's an address with NODAD and complete the process without
> sending any probe. However the TENTATIVE flags stays on the
> address for sometime enough to cause misinterpretation when we receive a NS.
> While processing NS, if the address has TENTATIVE flag, we mark it DADFAILED
> and endup with an address that was originally configured as NODAD with
> DADFAILED.
> 
> We can't avoid scheduling dad_work for addresses with NODAD but we can
> avoid adding TENTATIVE flag to avoid this racy situation.
> 
> Signed-off-by: Mahesh Bandewar <maheshb@google.com>

This doesn't apply cleanly to the net tree, please respin.

Thank you.

^ permalink raw reply

* Re: [PATCH] mdio: mux: fix device_node_continue.cocci warnings
From: David Miller @ 2017-05-15 18:29 UTC (permalink / raw)
  To: jon.mason
  Cc: f.fainelli, julia.lawall, netdev, andrew, kbuild-all,
	linux-kernel
In-Reply-To: <CAC3K-4qtjNOfseuENeg2Q7-xKVCK-Ytz2wPQMcX_LiwPLuTAiw@mail.gmail.com>

From: Jon Mason <jon.mason@broadcom.com>
Date: Mon, 15 May 2017 13:37:09 -0400

> I would prefer #1, as I would not want to break something that was
> currently working.  However, I think we should add much error logging
> here to let people know their DT is hosed (instead of silently
> working).  So, this would mean applying Julia's patch, and I'll do a
> follow-on to change the breaks to continues and add the error logging
> (assuming others agree with me).

Ok, I've applied Julia's patch.

I agree that we shouldn't fail the whole list just because one does.
And yes, we should emit enough diagnostics so that people can figure
out what the problem is.

^ permalink raw reply

* Re: [PATCH net] ipv6: avoid dad-failures for addresses with NODAD
From: David Miller @ 2017-05-15 18:32 UTC (permalink / raw)
  To: mahesh; +Cc: kuznet, jmorris, yoshfuji, kaber, netdev, edumazet, maheshb
In-Reply-To: <20170515.142655.1279898304018190599.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Mon, 15 May 2017 14:26:55 -0400 (EDT)

> From: Mahesh Bandewar <mahesh@bandewar.net>
> Date: Fri, 12 May 2017 17:03:39 -0700
> 
>> From: Mahesh Bandewar <maheshb@google.com>
>> 
>> Every address gets added with TENTATIVE flag even for the addresses with
>> IFA_F_NODAD flag and dad-work is scheduled for them. During this DAD process
>> we realize it's an address with NODAD and complete the process without
>> sending any probe. However the TENTATIVE flags stays on the
>> address for sometime enough to cause misinterpretation when we receive a NS.
>> While processing NS, if the address has TENTATIVE flag, we mark it DADFAILED
>> and endup with an address that was originally configured as NODAD with
>> DADFAILED.
>> 
>> We can't avoid scheduling dad_work for addresses with NODAD but we can
>> avoid adding TENTATIVE flag to avoid this racy situation.
>> 
>> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
> 
> This doesn't apply cleanly to the net tree, please respin.

Ignore this, I was trying to apply the wrong patch.

Applied, thank you.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox