Netdev List


* Re: [PATCH] ixgbe: drop zero length frame segments during a packet split rx
From: Alexander Duyck @ 2011-09-02 16:17 UTC (permalink / raw)
  To: Neil Horman
  Cc: netdev, Thadeu Lima de Souza Cascardo, Jesse Brandeburg,
	John Fastabend, Jeff Kirsher, David S. Miller
In-Reply-To: <1314972197-31557-1-git-send-email-nhorman@tuxdriver.com>

This kind of fix just opens up a whole can of security related worms.  
If you are going to discard a packet you should do it after we have 
reached the EOP in the series.  My advice would be to determine what 
traits identify this packet and add those to the check for the 
IXGBE_RXDADV_ERR_FRAME_ERR_MASK check further down in the code.  Likely 
what you are seeing is skb_headlen(skb) will be equal to 0.
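
To make the "discard after EOP" suggestion concrete, here is a minimal user-space sketch in C.  It is not ixgbe code: struct demo_seg and demo_count_good_packets() are hypothetical names invented for this illustration, and treating any zero-length segment as the error trait is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one receive descriptor. */
struct demo_seg {
	uint16_t len;	/* bytes reported for this buffer */
	bool eop;	/* true on the last buffer of a packet */
};

/*
 * Count the packets that would be passed up the stack.  A zero-length
 * segment only marks the packet as bad; the drop decision is taken once
 * the EOP segment has been consumed, so the ring never stops mid-packet.
 */
size_t demo_count_good_packets(const struct demo_seg *segs, size_t n)
{
	size_t good = 0;
	bool bad = false;

	for (size_t i = 0; i < n; i++) {
		if (segs[i].len == 0)
			bad = true;	/* remember the error, keep walking */
		if (!segs[i].eop)
			continue;
		if (!bad)
			good++;
		bad = false;		/* next packet starts clean */
	}
	return good;
}
```

The point of the design is that the error is remembered per packet rather than acted on per descriptor, so every buffer belonging to the bad packet is consumed before anything is discarded.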

I'm suspecting this is some sort of read corruption.  It looks like in 
order to trigger it you have to either be reading rx_buffer_info->dma as 
0, or the header length is being read as 0.  Do you know if you actually 
have header split enabled when this is occurring?  Are you running with 
jumbo frames enabled to see the issue?  If not then packet split 
wouldn't be enabled.

Is this occurring on net-next or on an older kernel?  I just want to be 
sure since we added a read memory barrier in 2.6.34 to address the fact 
that the length and descriptor DD bits were being read in the wrong 
order resulting in the length being corrupted on PowerPC systems.  The 
fact that we are now seeing another length error on PowerPC seems very odd.
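
For reference, the ordering problem described above can be sketched in user-space C11.  This is an analogy only: the driver uses a kernel read barrier, while this sketch uses an acquire load in its place, and struct demo_rx_desc, DEMO_STAT_DD and demo_read_desc() are hypothetical names, not driver symbols.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_STAT_DD 0x1u	/* hypothetical "descriptor done" bit */

/* Hypothetical write-back descriptor: the device sets DD last. */
struct demo_rx_desc {
	_Atomic uint32_t status;
	uint16_t length;	/* only meaningful once DD is set */
};

/*
 * Return true and store the length if the descriptor is complete.
 * The acquire load keeps the length read from being ordered before
 * the DD check, which is the reordering a read barrier prevents.
 */
bool demo_read_desc(struct demo_rx_desc *desc, uint16_t *len)
{
	uint32_t status = atomic_load_explicit(&desc->status,
					       memory_order_acquire);

	if (!(status & DEMO_STAT_DD))
		return false;

	*len = desc->length;
	return true;
}
```

Without the ordering guarantee, a weakly ordered CPU could fetch the length before the DD bit and observe a stale value even though the status check passes.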

Thanks,

Alex

On 09/02/2011 07:03 AM, Neil Horman wrote:
> This oops was reported recently on ppc64 hardware:
> Unable to handle kernel paging request for data at address 0x00000000
> Faulting instruction address: 0xc0000000004dda0c
> Oops: Kernel access of bad area, sig: 11 [#1]
> SMP NR_CPUS=1024 NUMA pSeries
> Modules linked in: sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4
> iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state
> nf_conntrack ip6table_filter ip6_tables ipv6 jsm ses enclosure sg ixgbe
> mdio e1000 ehea ext4 jbd2 mbcache sd_mod crc_t10dif ipr dm_mod
> NIP: c0000000004dda0c LR: c0000000004e3e50 CTR: c0000000004e3e20
> REGS: c0000001bffeb8d0 TRAP: 0300   Not tainted  (3.1.0-rc2-10121-gab7e2db)
> MSR: 8000000000009032<EE,ME,IR,DR>   CR: 28002042  XER: 20000000
> CFAR: c000000000004d70
> DAR: 0000000000000000, DSISR: 40000000
> TASK = c000000000d548e0[0] 'swapper' THREAD: c000000000dfc000 CPU: 0
> GPR04: c0000000010f4d80 c0000001bffebd80 0000000000000000 c0000001b18a8200
> GPR08: 0000000000000280 c0000001bcc517a8 c0000001b18a7f80 0000000000000000
> GPR12: d0000000047e5bb0 c000000001f10000 c0000001b19c8700 0000000000000000
> GPR16: c0000001bffebd80 0000000000000083 c00000018f2447a0 0000000000000002
> GPR20: 0000000000000000 c0000001ba860010 c0000001ba860000 d000000003d40000
> GPR24: 0000000000000000 0000000000000083 d000000003d40000 0000000000000001
> GPR28: c00000018f244780 c0000001b2b94310 c000000000da95f0 c0000001bcc51780
> NIP [c0000000004dda0c] .skb_gro_reset_offset+0x5c/0xe0
> LR [c0000000004e3e50] .napi_gro_receive+0x30/0x120
> Call Trace:
> [c0000001bffebb50] [c000000000da95f0] perf_callchain_user+0x0/0x10 (unreliable)
> [c0000001bffebbf0] [d0000000047bd118] .ixgbe_clean_rx_irq+0x7a8/0x8a0 [ixgbe]
> [c0000001bffebd10] [d0000000047bd414] .ixgbe_poll+0x64/0x160 [ixgbe]
> [c0000001bffebdd0] [c0000000004e3358] .net_rx_action+0x108/0x2a0
> [c0000001bffebea0] [c00000000009b220] .__do_softirq+0x110/0x2a0
> [c0000001bffebf90] [c000000000023798] .call_do_softirq+0x14/0x24
> [c000000000dff830] [c000000000011148] .do_softirq+0xf8/0x130
> [c000000000dff8d0] [c00000000009aeb4] .irq_exit+0xb4/0xc0
> [c000000000dff950] [c000000000011254] .do_IRQ+0xd4/0x300
> [c000000000dffa10] [c000000000005024] hardware_interrupt_entry+0x18/0x74
> --- Exception: 501 at .pseries_dedicated_idle_sleep+0xe4/0x210
> LR = .pseries_dedicated_idle_sleep+0x8c/0x210
> [c000000000dffd00] [c00000000005b194] .pseries_dedicated_idle_sleep+0x194/0x210
> (unreliable)
> [c000000000dffdc0] [c000000000018c84] .cpu_idle+0x164/0x210
> [c000000000dffe70] [c00000000000b0d0] .rest_init+0x90/0xb0
> [c000000000dffef0] [c000000000830bc0] .start_kernel+0x54c/0x56c
> [c000000000dfff90] [c00000000000953c] .start_here_common+0x1c/0x60
>
> It's caused when skb_gro_reset_offset attempts to call PageHighMem on
> skb_shinfo(skb)->frags[0].page, when the frags array was left uninitialized.
> This can happen in the ixgbe driver if the hardware reports a zero length rx
> descriptor in the middle of a packet split receive transaction.  I've consulted
> with Jesse Brandeburg on this, who is attempting to root cause the issue at
> Intel, but it seems prudent to add this check to the driver to discard frames
> that encounter this error to avoid the oops
>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
> CC: Jesse Brandeburg <jesse.brandeburg@intel.com>
> CC: Alexander Duyck <alexander.h.duyck@intel.com>
> CC: John Fastabend <john.r.fastabend@intel.com>
> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> CC: David S. Miller <davem@davemloft.net>
> ---
>   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   17 +++++++++++------
>   1 files changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index d20e804..6d59185 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -1326,6 +1326,13 @@ static void ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
>
>   		rx_buffer_info = &rx_ring->rx_buffer_info[i];
>
> +		i++;
> +		if (i == rx_ring->count)
> +			i = 0;
> +
> +		next_rxd = IXGBE_RX_DESC_ADV(rx_ring, i);
> +		prefetch(next_rxd);
> +
>   		skb = rx_buffer_info->skb;
>   		rx_buffer_info->skb = NULL;
>   		prefetch(skb->data);
> @@ -1367,6 +1374,10 @@ static void ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
>   		} else {
>   			/* assume packet split since header is unmapped */
>   			upper_len = le16_to_cpu(rx_desc->wb.upper.length);
> +			if (!upper_len) {
> +				rx_buffer_info->skb = skb;
> +				goto next_desc;
> +			}
>   		}
>
>   		if (upper_len) {
> @@ -1391,12 +1402,6 @@ static void ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
>   			skb->truesize += upper_len;
>   		}
>
> -		i++;
> -		if (i == rx_ring->count)
> -			i = 0;
> -
> -		next_rxd = IXGBE_RX_DESC_ADV(rx_ring, i);
> -		prefetch(next_rxd);
>   		cleaned_count++;
>
>   		if (pkt_is_rsc) {

* Re: [PATCH] ixgbe: drop zero length frame segments during a packet split rx
From: Neil Horman @ 2011-09-02 16:43 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: netdev@vger.kernel.org, Thadeu Lima de Souza Cascardo,
	Brandeburg, Jesse, Duyck, Alexander H, Fastabend, John R,
	David S. Miller
In-Reply-To: <1314979493.3532.4.camel@jtkirshe-linux>

On Fri, Sep 02, 2011 at 09:04:53AM -0700, Jeff Kirsher wrote:
> On Fri, 2011-09-02 at 07:03 -0700, Neil Horman wrote:
> > [...]
> 
> Thanks Neil, I have added the patch to my queue of ixgbe patches.
> 
> This patch was made against net-next, was this issue only seen on the
> net-next kernel?
You can see all the details here:
https://bugzilla.redhat.com/buglist.cgi?quicksearch=683611

My understanding was that it was seen in RHEL, net-next, and with the
sourceforge driver.

Regards
Neil

* Re: [PATCH net-next v4 4/4] r8169: support new chips of RTL8111F
From: Francois Romieu @ 2011-09-02 16:28 UTC (permalink / raw)
  To: Hayes Wang; +Cc: netdev, linux-kernel
In-Reply-To: <1314956953-1568-4-git-send-email-hayeswang@realtek.com>

Hayes Wang <hayeswang@realtek.com> :
> Support new chips of RTL8111F.

Amongst other things :o)

> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index 175c769..8e6a200 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
[...]
> @@ -711,7 +719,10 @@ MODULE_FIRMWARE(FIRMWARE_8168D_1);
>  MODULE_FIRMWARE(FIRMWARE_8168D_2);
>  MODULE_FIRMWARE(FIRMWARE_8168E_1);
>  MODULE_FIRMWARE(FIRMWARE_8168E_2);
> +MODULE_FIRMWARE(FIRMWARE_8168E_3);

This one is relevant for Linus's tree.

Don't worry about submitting again, I'll send it separately.

No opinion regarding the jumbo fixes patches I sent on 2011/07/17 ?

-- 
Ueimor

* Re: FW: [PATCH] af_packet: flush complete kernel cache in packet_sendmsg
From: chetan loke @ 2011-09-02 16:49 UTC (permalink / raw)
  To: Phil Sutter; +Cc: linux-arm-kernel, netdev, linux, davem
In-Reply-To: <20110902153147.GB29025@philter>

On Fri, Sep 2, 2011 at 11:31 AM, Phil Sutter <phil.sutter@viprinet.com> wrote:

> So far we haven't noticed problems in that direction. I just tried some
> explicit test: having tcpdump print local timestamps (not the pcap-ones)
> on every received packet, activating icmp_echo_ignore_all and pinging
> the host on a dedicated line. I expected to sometimes see a second
> difference between the two timestamps, as like with sending from time to
> time a packet should get "lost" in the cache, and then occur to
> userspace after the next one arrived. Maybe my test is broken, or RX is
> indeed unaffected.
>

You will need a high traffic rate. If interested, you could try
pktgen (with varying packet load). Keep the packet payload under 1500
bytes (don't send jumbo frames) unless you have the following fix:
commit cc9f01b246ca8e4fa245991840b8076394f86707

Your Tx path is working because flush_cache_call gets triggered before
flush_dcache_page. On the Rx path, since you don't have that
workaround, you will eventually (it's just a matter of time) see this
problem.

Or, delete your patch and try this workaround (in
__packet_get/set_status) and you may be able to cover both Tx and Rx
paths.


diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 2ea3d63..35d71dc 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -412,11 +412,19 @@ static void __packet_set_status(struct packet_sock *po, void *frame, int status)
        switch (po->tp_version) {
        case TPACKET_V1:
                h.h1->tp_status = status;
-               flush_dcache_page(pgv_to_page(&h.h1->tp_status));
+               #ifndef ENABLE_CACHEPROB_WORKAROUND
+                       flush_dcache_page(pgv_to_page(&h.h1->tp_status));
+               #else
+                       kw_extra_cache_flush();
+               #endif
                break;
        case TPACKET_V2:
                h.h2->tp_status = status;
-               flush_dcache_page(pgv_to_page(&h.h2->tp_status));
+               #ifndef ENABLE_CACHEPROB_WORKAROUND
+                       flush_dcache_page(pgv_to_page(&h.h2->tp_status));
+               #else
+                       kw_extra_cache_flush();
+               #endif
                break;
        case TPACKET_V3:
        default:
@@ -437,13 +445,19 @@ static int __packet_get_status(struct packet_sock *po, void *frame)

        smp_rmb();

+       kw_extra_cache_flush();
+
        h.raw = frame;
        switch (po->tp_version) {
        case TPACKET_V1:
-               flush_dcache_page(pgv_to_page(&h.h1->tp_status));
+               #ifndef ENABLE_CACHEPROB_WORKAROUND
+                       flush_dcache_page(pgv_to_page(&h.h1->tp_status));
+               #endif
                return h.h1->tp_status;
        case TPACKET_V2:
-               flush_dcache_page(pgv_to_page(&h.h2->tp_status));
+               #ifndef ENABLE_CACHEPROB_WORKAROUND
+                       flush_dcache_page(pgv_to_page(&h.h2->tp_status));
+               #endif
                return h.h2->tp_status;
        case TPACKET_V3:
        default:


> Greetings and thanks for the hints, Phil

Chetan Loke

* Re: [PATCH] ixgbe: drop zero length frame segments during a packet split rx
From: Neil Horman @ 2011-09-02 16:55 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: netdev, Thadeu Lima de Souza Cascardo, Jesse Brandeburg,
	John Fastabend, Jeff Kirsher, David S. Miller
In-Reply-To: <4E6101A4.4060802@intel.com>

On Fri, Sep 02, 2011 at 09:17:40AM -0700, Alexander Duyck wrote:
> This kind of fix just opens up a whole can of security related
> worms.  If you are going to discard a packet you should do it after
> we have reached the EOP in the series.  My advice would be to
> determine what traits identify this packet and add those to the
> check for the IXGBE_RXDADV_ERR_FRAME_ERR_MASK check further down in
> the code.  Likely what you are seeing is skb_headlen(skb) will be
> equal to 0.
> 
Well, the traits of the bogus descriptor are almost exactly as you describe
them, i.e. rx_buffer_info->dma is zero, which the driver takes to mean packet
split is enabled, and this is a buffer in the middle of that operation
(according to the comments in ixgbe_clean_rx_irq), and the upper_len value we
read from the rx descriptor (rx_desc->wb.upper.length) is zero.  This implies we
have a frame which is in the middle of a packet split receive, and one of the
page-long buffers has a length value of zero, which is nonsensical.  I suppose
we could wait until the next frame with EOP set to discard the whole thing, but
I'm not sure how that amounts to anything different than just skipping to the
next descriptor.
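
As a rough illustration of the two traits named above, the check can be written as a small classifier.  This is a simplified user-space sketch, not the logic in ixgbe_clean_rx_irq: enum demo_buf_kind and demo_classify() are hypothetical names, and the model only covers the dma and upper_len fields discussed in this thread.

```c
#include <stdint.h>

/* Hypothetical classification of one receive buffer. */
enum demo_buf_kind {
	DEMO_BUF_HEADER,	/* header buffer is still DMA-mapped */
	DEMO_BUF_PAGE,		/* packet-split page buffer carrying data */
	DEMO_BUF_BOGUS		/* packet-split page buffer reporting 0 bytes */
};

/*
 * dma == 0 is taken to mean "middle of a packet split receive", in
 * which case a zero upper_len cannot describe a valid page buffer.
 */
enum demo_buf_kind demo_classify(uint64_t dma, uint16_t upper_len)
{
	if (dma)
		return DEMO_BUF_HEADER;

	return upper_len ? DEMO_BUF_PAGE : DEMO_BUF_BOGUS;
}
```

A caller could use the DEMO_BUF_BOGUS result either to skip the descriptor immediately or to flag the packet for discard once EOP is reached, which is the choice being debated here.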

> I'm suspecting this is some sort of read corruption.  It looks like
> in order to trigger it you have to either be reading
> rx_buffer_info->dma as 0, or the header length is being read as 0.
Correct, which drops us into the else clause of the if(rx_buffer_info->dma)
conditional in ixgbe_clean_rx_irq.

> Do you know if you actually have header split enabled when this is
> occurring?  Are you running with jumbo frames enabled to see the
Yes, packet split is enabled, and no, jumbo frames are not in use.

> issue?  If not then packet split wouldn't be enabled.
> 
> Is this occurring on net-next or on an older kernel?  I just want to
> be sure since we added a read memory barrier in 2.6.34 to address
> the fact that the length and descriptor DD bits were being read in
> the wrong order resulting in the length being corrupted on PowerPC
> systems.  The fact that we are now seeing another length error on
> PowerPC seems very odd.
> 
According to the bz:
https://bugzilla.redhat.com/show_bug.cgi?id=683611
This appears to be happening on RHEL, and on upstream kernels, as well as the
sourceforge driver.  Don't quote me on the SF driver though, because I never got
a clear answer on that.  Although, fwiw, the RHEL version of the driver in which
we were definitely seeing this problem has a read memory barrier at the top of
the loop in ixgbe_clean_rx_irq, pulled in from commit
3c945e5b3719bcc18c6ddd31bbcae8ef94f3d19a, so I think that's handled.


Regards
Neil

* Re: [PATCH] af_packet: flush complete kernel cache in packet_sendmsg
From: Russell King - ARM Linux @ 2011-09-02 17:28 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev, Phil Sutter, David S. Miller, linux-arm-kernel
In-Reply-To: <1314971179.3092.159.camel@deadeye>

On Fri, Sep 02, 2011 at 02:46:17PM +0100, Ben Hutchings wrote:
> On Fri, 2011-09-02 at 13:08 +0200, Phil Sutter wrote:
> > This flushes the cache before and after accessing the mmapped packet
> > buffer. It seems like the call to flush_dcache_page from inside
> > __packet_get_status is not enough on Kirkwood (or ARM in general).
> > ---
> > I know this is far from an optimal solution, but it's in fact the only working
> > one I found.
> [...]
> 
> This is ridiculous.  If flush_dcache_page() isn't doing everything it
> should, you need to fix that.

It does do everything it should - which is to perform maintenance on
page cache pages.  It flushes the kernel mapping of the page.  It
also flushes the userspace mappings of the page which it finds by
walking the mmap list via the associated struct page.  It does not
touch vmalloc mappings because it has no way to know whether they
exist or not.

It doesn't do so much for anonymous pages - to do so would only
duplicate what flush_anon_page() does at the very same callsites.
Plus the mmap list isn't available for such pages so there's no
way to find out what userspace addresses to flush.

If the AF_PACKET buffers are created from anonymous pages and it's
using flush_dcache_page(), it's using the wrong interface.

* [PATCH 1/2] bridge: leave carrier on for empty bridge
From: Stephen Hemminger @ 2011-09-02 17:22 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev
In-Reply-To: <20110902172220.830228928@vyatta.com>

[-- Attachment #1: br-carrier-default.patch --]
[-- Type: text/plain, Size: 955 bytes --]

This resolves a regression seen by some users of bridging.
Some users use the bridge like a dummy device. 
They expect to be able to put an IPv6 address on the device
with no ports attached during boot.

Note: the bridge still will reflect the state of ports in the
bridge if there are any added.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
This fix needs to go to stable for 3.0 and 2.6.39


--- a/net/bridge/br_device.c	2011-09-01 08:52:27.596631192 -0700
+++ b/net/bridge/br_device.c	2011-09-01 09:01:03.256611801 -0700
@@ -91,7 +91,6 @@ static int br_dev_open(struct net_device
 {
 	struct net_bridge *br = netdev_priv(dev);
 
-	netif_carrier_off(dev);
 	netdev_update_features(dev);
 	netif_start_queue(dev);
 	br_stp_enable_bridge(br);
@@ -108,8 +107,6 @@ static int br_dev_stop(struct net_device
 {
 	struct net_bridge *br = netdev_priv(dev);
 
-	netif_carrier_off(dev);
-
 	br_stp_disable_bridge(br);
 	br_multicast_stop(br);
 

* [PATCH 2/2] bridge: set flags in RTM_NEWNEIGH message correctly
From: Stephen Hemminger @ 2011-09-02 17:22 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev
In-Reply-To: <20110902172220.830228928@vyatta.com>

[-- Attachment #1: bridge-newneigh-state.patch --]
[-- Type: text/plain, Size: 1409 bytes --]

The functionality for notification was added with the 3.0 kernel,
but the bridge would always send the new neighbour message with state == 0.
The problem is that the notify needs to be done after
the flags on the forwarding entry are set.
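
The ordering issue can be shown with a small user-space sketch.  It is not bridge code: struct demo_fdb, struct demo_event and the demo_* functions are hypothetical, and the "notification" is modelled simply as a snapshot of the entry's flags at the time of the call.

```c
#include <stdbool.h>

/* Hypothetical forwarding entry and the event built from it. */
struct demo_fdb {
	bool is_local;
	bool is_static;
};

struct demo_event {
	bool is_local;
	bool is_static;
};

/* A notification copies whatever the flags are at call time. */
struct demo_event demo_notify(const struct demo_fdb *fdb)
{
	struct demo_event ev = { fdb->is_local, fdb->is_static };

	return ev;
}

/* Old ordering: notify first, so listeners see cleared flags. */
struct demo_event demo_insert_notify_early(struct demo_fdb *fdb)
{
	struct demo_event ev = demo_notify(fdb);

	fdb->is_local = fdb->is_static = true;
	return ev;
}

/* Fixed ordering: set the flags, then notify. */
struct demo_event demo_insert_notify_late(struct demo_fdb *fdb)
{
	fdb->is_local = fdb->is_static = true;
	return demo_notify(fdb);
}
```

Both variants leave the entry in the same final state; only the event seen by listeners differs.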

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
---

--- a/net/bridge/br_fdb.c	2011-08-26 09:41:25.966304883 -0700
+++ b/net/bridge/br_fdb.c	2011-09-01 17:21:43.755481630 -0700
@@ -347,7 +347,6 @@ static struct net_bridge_fdb_entry *fdb_
 		fdb->is_static = 0;
 		fdb->updated = fdb->used = jiffies;
 		hlist_add_head_rcu(&fdb->hlist, head);
-		fdb_notify(fdb, RTM_NEWNEIGH);
 	}
 	return fdb;
 }
@@ -379,6 +378,7 @@ static int fdb_insert(struct net_bridge
 		return -ENOMEM;
 
 	fdb->is_local = fdb->is_static = 1;
+	fdb_notify(fdb, RTM_NEWNEIGH);
 	return 0;
 }
 
@@ -424,8 +424,11 @@ void br_fdb_update(struct net_bridge *br
 		}
 	} else {
 		spin_lock(&br->hash_lock);
-		if (likely(!fdb_find(head, addr)))
-			fdb_create(head, source, addr);
+		if (!fdb_find(head, addr)) {
+			fdb = fdb_create(head, source, addr);
+			if (fdb)
+				fdb_notify(fdb, RTM_NEWNEIGH);
+		}
 
 		/* else  we lose race and someone else inserts
 		 * it first, don't bother updating
@@ -576,6 +579,8 @@ static int fdb_add_entry(struct net_brid
 		fdb->is_local = fdb->is_static = 1;
 	else if (state & NUD_NOARP)
 		fdb->is_static = 1;
+
+	fdb_notify(fdb, RTM_NEWNEIGH);
 	return 0;
 }
 

* Re: [Bugme-new] [Bug 42132] New: Support BCM5750M in tg3
From: Matt Carlson @ 2011-09-02 17:43 UTC (permalink / raw)
  To: Francesco Piccinno
  Cc: Matthew Carlson, Andrew Morton, netdev@vger.kernel.org,
	bugme-daemon@bugzilla.kernel.org, Benjamin Li, Michael Chan
In-Reply-To: <CAA7bCn5DJjrdZtyYN2g75Ty1=5s1Zcs5A4x0WksOM4oRLLaGOQ@mail.gmail.com>

The output shows that the device's firmware isn't running.  Since the
firmware version also doesn't show up in the 'ethtool -i' output, it
might mean that firmware is completely missing.

You probably don't want to leave the device the way it is now.  I'd
contact your vendor to see if you can get your firmware reprogrammed.

On Fri, Sep 02, 2011 at 02:20:10AM -0700, Francesco Piccinno wrote:
> The patch did not apply cleanly. BTW I have figured out an alternative
> method. I modified by hand pci_ids.h and tg3.c files. The device seems
> to work now.
> 
> The output of ethtool -i eth0 gives me:
> driver: tg3
> version: 3.119
> firmware-version:
> bus-info: 0000:08:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> 
> Messages produced by the driver:
> 
> [  728.741487] tg3 0000:08:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
> [  728.741498] tg3 0000:08:00.0: setting latency timer to 64
> [  728.819963] tg3 0000:08:00.0: vpd r/w failed.  This is likely a
> firmware bug on this device.  Contact the card vendor for a firmware
> update.
> [  728.879960] tg3 0000:08:00.0: vpd r/w failed.  This is likely a
> firmware bug on this device.  Contact the card vendor for a firmware
> update.
> [  728.939957] tg3 0000:08:00.0: vpd r/w failed.  This is likely a
> firmware bug on this device.  Contact the card vendor for a firmware
> update.
> [  728.942680] tg3 0000:08:00.0: eth0: Tigon3 [partno(none) rev 4201]
> (PCI Express) MAC address 00:1b:38:38:c6:60
> [  728.942685] tg3 0000:08:00.0: eth0: attached PHY is 5750
> (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[0])
> [  728.942689] tg3 0000:08:00.0: eth0: RXcsums[1] LinkChgREG[0]
> MIirq[0] ASF[0] TSOcap[1]
> [  728.942692] tg3 0000:08:00.0: eth0: dma_rwctrl[76180000] dma_mask[64-bit]
> [  728.949503] tg3 0000:08:00.0: irq 45 for MSI/MSI-X
> [  730.633610] tg3 0000:08:00.0: eth0: No firmware running
> [  730.650658] ADDRCONF(NETDEV_UP): eth0: link is not ready
> [  811.811298] tg3 0000:08:00.0: eth0: Link is up at 100 Mbps, full duplex
> [  811.811306] tg3 0000:08:00.0: eth0: Flow control is on for TX and on for RX
> 
> --
> Best regards,
> Francesco Piccinno
> 
> 
> 
> On Fri, Sep 2, 2011 at 3:25 AM, Matt Carlson <mcarlson@broadcom.com> wrote:
> > Yes.  Sorry.  Please revert that patch.  If you really had a bcm5750,
> > you'd need to revert another patch too, but let's see where we stand
> > before going down that road.
> >
> > On Thu, Sep 01, 2011 at 06:14:57PM -0700, Francesco Piccinno wrote:
> >> The only message I get regarding the firmware is the following:
> >>
> >> [51503.038205] pci 0000:08:00.0: vpd r/w failed.  This is likely a
> >> firmware bug on this device.  Contact the card vendor for a firmware
> >> update.
> >>
> >> Unfortunately I can not post the output of ethtool since the interface
> >> is not available. Shall I recompile the tg3 module with the proper
> >> patch and post the output?
> >>
> >> --
> >> Best regards,
> >> Francesco Piccinno
> >>
> >> On Fri, Sep 2, 2011 at 3:04 AM, Matt Carlson <mcarlson@broadcom.com> wrote:
> >> > It's showing up on lspci as a PCIe device, so it can't be the 5750M.
> >> > The bcm5750M is a pci device.
> >> >
> >> > I'm wondering if bootcode is failing.  Do you see any messages in your
> >> > syslogs that say "No firmware running"?
> >> >
> >> > Can you post the output of 'ethtool -i ethX'?
> >> >
> >> > On Thu, Sep 01, 2011 at 05:48:50PM -0700, Francesco Piccinno wrote:
> >> >> Yes sure.
> >> >>
> >> >> # lspci -vvv -s 08:00.0
> >> >> 08:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5750M
> >> >> Gigabit Ethernet
> >> >>       Subsystem: Broadcom Corporation NetXtreme BCM5750M Gigabit Ethernet
> >> >>       Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> >> >> Stepping- SERR- FastB2B- DisINTx-
> >> >>       Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> >> >> <TAbort- <MAbort- >SERR- <PERR- INTx-
> >> >>       Latency: 0, Cache Line Size: 64 bytes
> >> >>       Interrupt: pin A routed to IRQ 10
> >> >>       Region 0: Memory at f4100000 (64-bit, non-prefetchable) [size=64K]
> >> >>       Capabilities: [48] Power Management version 2
> >> >>               Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
> >> >>               Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
> >> >>       Capabilities: [50] Vital Product Data
> >> >> pcilib: sysfs_read_vpd: read failed: Connection timed out
> >> >>               Not readable
> >> >>       Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
> >> >>               Address: 5149526521410124  Data: 8b60
> >> >>       Capabilities: [d0] Express (v1) Endpoint, MSI 00
> >> >>               DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
> >> >>                       ExtTag+ AttnBtn- AttnInd- PwrInd- RBE- FLReset-
> >> >>               DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> >> >>                       RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> >> >>                       MaxPayload 128 bytes, MaxReadReq 512 bytes
> >> >>               DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
> >> >>               LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Latency L0 <4us, L1 <64us
> >> >>                       ClockPM- Surprise- LLActRep- BwNot-
> >> >>               LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
> >> >>                       ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> >> >>               LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive-
> >> >> BWMgmt- ABWMgmt-
> >> >>       Capabilities: [100 v1] Advanced Error Reporting
> >> >>               UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
> >> >> MalfTLP- ECRC- UnsupReq- ACSViol-
> >> >>               UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
> >> >> MalfTLP- ECRC- UnsupReq- ACSViol-
> >> >>               UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+
> >> >> MalfTLP+ ECRC- UnsupReq- ACSViol-
> >> >>               CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> >> >>               CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> >> >>               AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
> >> >>       Capabilities: [13c v1] Virtual Channel
> >> >>               Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
> >> >>               Arb:    Fixed- WRR32- WRR64- WRR128-
> >> >>               Ctrl:   ArbSelect=Fixed
> >> >>               Status: InProgress-
> >> >>               VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
> >> >>                       Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
> >> >>                       Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
> >> >>                       Status: NegoPending- InProgress-
> >> >>       Capabilities: [160 v1] Device Serial Number 00-00-00-ff-fe-00-00-00
> >> >>
> >> >> Serial number is CND71700K6.
> >> >> --
> >> >> Best regards,
> >> >> Francesco Piccinno
> >> >>
> >> >>
> >> >>
> >> >> On Fri, Sep 2, 2011 at 2:06 AM, Matt Carlson <mcarlson@broadcom.com> wrote:
> >> >> > On Thu, Sep 01, 2011 at 04:40:11PM -0700, Andrew Morton wrote:
> >> >> >>
> >> >> >> (switched to email.  Please respond via emailed reply-to-all, not via the
> >> >> >> bugzilla web interface).
> >> >> >>
> >> >> >> On Wed, 31 Aug 2011 18:18:40 GMT
> >> >> >> bugzilla-daemon@bugzilla.kernel.org wrote:
> >> >> >>
> >> >> >> > https://bugzilla.kernel.org/show_bug.cgi?id=42132
> >> >> >> >
> >> >> >> >            Summary: Support BCM5750M in tg3
> >> >> >> >            Product: Drivers
> >> >> >> >            Version: 2.5
> >> >> >> >     Kernel Version: 3.0.3
> >> >> >> >           Platform: All
> >> >> >> >         OS/Version: Linux
> >> >> >> >               Tree: Mainline
> >> >> >> >             Status: NEW
> >> >> >> >           Severity: normal
> >> >> >> >           Priority: P1
> >> >> >> >          Component: Network
> >> >> >> >         AssignedTo: drivers_network@kernel-bugs.osdl.org
> >> >> >> >         ReportedBy: stack.box@gmail.com
> >> >> >> >         Regression: Yes
> >> >> >> >
> >> >> >> >
> >> >> >> > I have a notebook (HP TC4400) which has a BCM5750 ethernet card inside. The
> >> >> >> > output of lspci is:
> >> >> >> >
> >> >> >> > 08:00.0 Ethernet controller [0200]: Broadcom Corporation NetXtreme BCM5750M
> >> >> >> > Gigabit Ethernet [14e4:167c]
> >> >> >> >
> >> >> >> > Commit 67b284d476bcb3d100e946da23d6cf9acfd0465c removed the support for this
> >> >> >> > device.
> >> >> >> >
> >> >> >>
> >> >> >> 67b284d476bcb3d100 says "These devices were never released to the public".
> >> >> >>
> >> >> >> > I wish to have the support for this network card back again. Thanks!
> >> >> >>
> >> >> >> oops ;)
> >> >> >
> >> >> > Really?  All the TC4400 documentation I find says it uses a bcm5753M on a
> >> >> > PCIe bus.  Can you post the full output of 'lspci -vvv -s 08:00.0' ?
> >> >> >
> >> >> >
> >> >>
> >> >
> >> >
> >>
> >
> >
> 

^ permalink raw reply

* Re: [PATCH] ixgbe: drop zero length frame segments during a packet split rx
From: Alexander Duyck @ 2011-09-02 17:54 UTC (permalink / raw)
  To: Neil Horman
  Cc: netdev, Thadeu Lima de Souza Cascardo, Jesse Brandeburg,
	John Fastabend, Jeff Kirsher, David S. Miller
In-Reply-To: <20110902165523.GB27571@hmsreliant.think-freely.org>

On 09/02/2011 09:55 AM, Neil Horman wrote:
> On Fri, Sep 02, 2011 at 09:17:40AM -0700, Alexander Duyck wrote:
>> This kind of fix just opens up a whole can of security related
>> worms.  If you are going to discard a packet you should do it after
>> we have reached the EOP in the series.  My advice would be to
>> determine what traits identify this packet and add those to the
>> check for the IXGBE_RXDADV_ERR_FRAME_ERR_MASK check further down in
>> the code.  Likely what you are seeing is skb_headlen(skb) will be
>> equal to 0.
>>
> Well, the traits of the bogus descriptor are almost exactly as you describe
> them, i.e. rx_buffer_info->dma is zero, which the driver takes to mean packet
> split is enabled, and this is a buffer in the middle of that operation
> (according to the comments in ixgbe_clean_rx_irq), and the upper_len value we
> read from the rx_descriptor rx_desc->wb.upper.length is zero.  This implies we
> have a frame which is in the middle of a packet split receive, and one of the
> page long buffers has a length value of zero, which is nonsensical.  I suppose
> we could wait until the next frame with EOP set to discard the whole thing, but
> I'm not sure how that amounts to anything different than just skipping to the
> next descriptor.
>
>> I'm suspecting this is some sort of read corruption.  It looks like
>> in order to trigger it you have to either be reading
>> rx_buffer_info->dma as 0, or the header length is being read as 0.
> Correct, which drops us into the else clause of the if(rx_buffer_info->dma)
> conditional in ixgbe_clean_rx_irq.
>
>> Do you know if you actually have header split enabled when this is
>> occurring?  Are you running with jumbo frames enabled to see the
> Yes, packet split is enabled, and no, jumbo frames are not in use.
>
>> issue?  If not then packet split wouldn't be enabled.
>>
>> Is this occurring on net-next or on an older kernel?  I just want to
>> be sure since we added a read memory barrier in 2.6.34 to address
>> the fact that the length and descriptor DD bits were being read in
>> the wrong order resulting in the length being corrupted on PowerPC
>> systems.  The fact that we are now seeing another length error on
>> PowerPC seems very odd.
>>
> According to the bz:
> https://bugzilla.redhat.com/show_bug.cgi?id=683611
> This appears to be happening on RHEL, and on upstream kernels, as well as the
> sourceforge driver.  Don't quote me on the SF driver though, because I never got
> a clear answer on that.  Although, fwiw, the RHEL version of the driver in which
> we were definitely seeing this problem has a read memory barrier at the top of
> the loop in ixgbe_clean_rx_irq, pulled in from commit
> 3c945e5b3719bcc18c6ddd31bbcae8ef94f3d19a, so I think that's handled.
>
>
> Regards
> Neil
I'll review the bugzilla and submit my comments there.

Thanks,

Alex
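
To make the alternative discussed in this exchange concrete, here is a
minimal userspace sketch of the deferred drop, where a packet with a
zero-length segment is discarded only once its EOP descriptor has been
reached. This is not ixgbe code: struct rx_seg and count_good_packets()
are hypothetical names, and each descriptor is reduced to a length and
an EOP flag purely for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one RX descriptor. */
struct rx_seg {
	unsigned int len;	/* bytes the hardware reported for this buffer */
	bool eop;		/* true on the last descriptor of a packet */
};

/*
 * Walk n descriptors and count the packets that would be passed up the
 * stack.  A zero-length segment only marks the current packet as bad;
 * the packet is dropped when its EOP descriptor is reached, so the
 * remaining segments are never mistaken for the start of a new packet.
 */
static size_t count_good_packets(const struct rx_seg *ring, size_t n)
{
	size_t good = 0;
	bool bad = false;

	for (size_t i = 0; i < n; i++) {
		if (ring[i].len == 0)
			bad = true;
		if (!ring[i].eop)
			continue;
		if (!bad)
			good++;
		bad = false;	/* the next descriptor starts a new packet */
	}
	return good;
}
```

With a ring holding a three-descriptor packet whose middle segment has
length 0, followed by a single-descriptor packet, only the second
packet is counted.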

^ permalink raw reply

* Re: [patch net-next-2.6 v2] net: consolidate and fix ethtool_ops->get_settings calling
From: Ben Hutchings @ 2011-09-02 18:46 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, ralf, fubar, andy, kaber, bprakash, JBottomley,
	robert.w.love, davem, shemminger, decot, mirq-linux,
	alexander.h.duyck, amit.salecha, eric.dumazet, therbert, paulmck,
	laijs, xiaosuo, greearb, loke.chetan, linux-mips, linux-scsi,
	devel, bridge
In-Reply-To: <20110902122630.GC1991@minipsycho>

On Fri, 2011-09-02 at 14:26 +0200, Jiri Pirko wrote:
> This patch does several things:
> - introduces __ethtool_get_settings which is called from ethtool code and
>   from dev_ethtool_get_settings() as well.
> - dev_ethtool_get_settings() becomes rtnl wrapper for
>   __ethtool_get_settings()
[...]

I don't like this locking change.  Most other dev_*() functions require
the caller to hold RTNL, and it will break any OOT module calling
dev_ethtool_get_settings() without producing any warning at compile
time.  Why not put an ASSERT_RTNL() in it instead?

The rest of this looks fine.
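
As an illustration of the suggestion above, here is a small userspace
model of a function that asserts the lock is held instead of taking it.
It is only a sketch: rtnl_held, MODEL_ASSERT_RTNL() and
model_get_settings() are made-up stand-ins, not the kernel's RTNL
implementation, and the model assumes the assertion merely reports the
problem at run time rather than failing the call.

```c
#include <stdbool.h>
#include <stdio.h>

static bool rtnl_held;		/* models whether the caller holds RTNL */
static int rtnl_warnings;	/* counts failed assertions */

static void model_rtnl_lock(void)   { rtnl_held = true; }
static void model_rtnl_unlock(void) { rtnl_held = false; }

/* Models ASSERT_RTNL(): complain, but carry on, if the lock is not held. */
#define MODEL_ASSERT_RTNL() do {					\
	if (!rtnl_held) {						\
		rtnl_warnings++;					\
		fprintf(stderr, "RTNL: assertion failed at %s (%d)\n",	\
			__FILE__, __LINE__);				\
	}								\
} while (0)

/* Stand-in for dev_ethtool_get_settings(): the caller must hold RTNL. */
static int model_get_settings(int *speed)
{
	MODEL_ASSERT_RTNL();
	*speed = 1000;	/* placeholder value, not read from any device */
	return 0;
}
```

A caller that forgets the lock still gets its answer, but leaves a
warning behind; a caller that brackets the call with model_rtnl_lock()
and model_rtnl_unlock() does not.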

Ben. 

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* [PATCH] net: change capability used by socket options IP{,V6}_TRANSPARENT
From: Maciej Żenczykowski @ 2011-09-02 19:10 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: netdev, Maciej Żenczykowski, Balazs Scheidler
In-Reply-To: <1314953022.26692.182.camel@bzorp>

From: Maciej Żenczykowski <maze@google.com>

Up till now the IP{,V6}_TRANSPARENT socket options (which actually set
the same bit in the socket struct) have required CAP_NET_ADMIN
privileges to set or clear the option.

- we make clearing the bit not require any privileges.
- we deprecate using CAP_NET_ADMIN for this purpose.
- we introduce a new capability CAP_NET_TRANSPARENT,
  which is tailored to allow setting just this bit.
- we allow either one of CAP_NET_TRANSPARENT or CAP_NET_RAW
  to set this bit, because raw sockets already effectively
  allow you to emulate socket transparency, and make the
  transition easier for apps not desiring to use a brand
  new capability (because of header file or glibc support)
- we print a warning (but allow it) if you try to set
  the socket option with CAP_NET_ADMIN privs, but without
  either one of CAP_NET_TRANSPARENT or CAP_NET_RAW.

The reason for introducing a new capability is that while
transparent sockets are potentially dangerous (and can let you
spoof your source IP on traffic), they don't normally give you
the full 'freedom' of eavesdropping and/or spoofing that raw sockets
give you.

Signed-off-by: Maciej Żenczykowski <maze@google.com>
CC: Balazs Scheidler <bazsi@balabit.hu>
---
 include/linux/capability.h |   13 +++++++++----
 net/ipv4/ip_sockglue.c     |   26 ++++++++++++++++++++++----
 net/ipv6/ipv6_sockglue.c   |   29 ++++++++++++++++++++++++-----
 3 files changed, 55 insertions(+), 13 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index c421123..a115ed4 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -198,7 +198,7 @@ struct cpu_vfs_cap_data {
 /* Allow modification of routing tables */
 /* Allow setting arbitrary process / process group ownership on
    sockets */
-/* Allow binding to any address for transparent proxying */
+/* Allow binding to any address for transparent proxying (deprecated) */
 /* Allow setting TOS (type of service) */
 /* Allow setting promiscuous mode */
 /* Allow clearing driver statistics */
@@ -210,6 +210,7 @@ struct cpu_vfs_cap_data {
 
 /* Allow use of RAW sockets */
 /* Allow use of PACKET sockets */
+/* Allow binding to any address for transparent proxying */
 
 #define CAP_NET_RAW          13
 
@@ -332,7 +333,7 @@ struct cpu_vfs_cap_data {
 
 #define CAP_AUDIT_CONTROL    30
 
-#define CAP_SETFCAP	     31
+#define CAP_SETFCAP          31
 
 /* Override MAC access.
    The base kernel enforces no MAC policy.
@@ -357,10 +358,14 @@ struct cpu_vfs_cap_data {
 
 /* Allow triggering something that will wake the system */
 
-#define CAP_WAKE_ALARM            35
+#define CAP_WAKE_ALARM       35
+
+/* Allow binding to any address for transparent proxying */
+
+#define CAP_NET_TRANSPARENT  36
 
 
-#define CAP_LAST_CAP         CAP_WAKE_ALARM
+#define CAP_LAST_CAP         CAP_NET_TRANSPARENT
 
 #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
 
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 8905e92..44efa39 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -961,12 +961,30 @@ mc_msf_out:
 		break;
 
 	case IP_TRANSPARENT:
-		if (!capable(CAP_NET_ADMIN)) {
-			err = -EPERM;
-			break;
-		}
 		if (optlen < 1)
 			goto e_inval;
+		/* Always allow clearing the transparent proxy socket option.
+		 * The pre-3.2 permission for setting this was CAP_NET_ADMIN,
+		 * and this is still supported - but deprecated.  As of Linux
+		 * 3.2 the proper permission is one of CAP_NET_TRANSPARENT
+		 * (preferred, a new capability) or CAP_NET_RAW.  The latter
+		 * is supported to make the transition easier (and because
+		 * raw sockets already effectively allow one to emulate
+		 * socket transparency).
+		 */
+		if (!!val && !capable(CAP_NET_TRANSPARENT)
+		          && !capable(CAP_NET_RAW)) {
+			if (!capable(CAP_NET_ADMIN)) {
+				err = -EPERM;
+				break;
+			}
+			printk_once(KERN_WARNING "%s (%d): "
+				 "deprecated: attempt to set socket option "
+				 "IP_TRANSPARENT with CAP_NET_ADMIN but "
+				 "without either one of CAP_NET_TRANSPARENT "
+				 "or CAP_NET_RAW.\n",
+				 current->comm, task_pid_nr(current));
+		}
 		inet->transparent = !!val;
 		break;
 
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 147ede38..c840098 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -343,13 +343,32 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 		break;
 
 	case IPV6_TRANSPARENT:
-		if (!capable(CAP_NET_ADMIN)) {
-			retv = -EPERM;
-			break;
-		}
 		if (optlen < sizeof(int))
 			goto e_inval;
-		/* we don't have a separate transparent bit for IPV6 we use the one in the IPv4 socket */
+		/* Always allow clearing the transparent proxy socket option.
+		 * The pre-3.2 permission for setting this was CAP_NET_ADMIN,
+		 * and this is still supported - but deprecated.  As of Linux
+		 * 3.2 the proper permission is one of CAP_NET_TRANSPARENT
+		 * (preferred, a new capability) or CAP_NET_RAW.  The latter
+		 * is supported to make the transition easier (and because
+		 * raw sockets already effectively allow one to emulate
+		 * socket transparency).
+		 */
+		if (valbool && !capable(CAP_NET_TRANSPARENT)
+		            && !capable(CAP_NET_RAW)) {
+			if (!capable(CAP_NET_ADMIN)) {
+				retv = -EPERM;
+				break;
+			}
+			printk_once(KERN_WARNING "%s (%d): "
+				 "deprecated: attempt to set socket option "
+				 "IPV6_TRANSPARENT with CAP_NET_ADMIN but "
+				 "without either one of CAP_NET_TRANSPARENT "
+				 "or CAP_NET_RAW.\n",
+				 current->comm, task_pid_nr(current));
+		}
+		/* we don't have a separate transparent bit for IPV6 we use the
+		 * one in the IPv4 socket */
 		inet_sk(sk)->transparent = valbool;
 		retv = 0;
 		break;
-- 
1.7.3.1
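
The permission rules listed in the description above can be restated
as a small standalone decision function. This is a userspace sketch
for illustration only: struct caps and transparent_permitted() are
hypothetical names, and the capability checks are reduced to booleans
instead of calls to capable().

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical snapshot of the caller's relevant capabilities. */
struct caps {
	bool net_admin;
	bool net_raw;
	bool net_transparent;
};

/*
 * Returns 0 if the option may be changed, -EPERM otherwise.
 * *deprecated is set when only CAP_NET_ADMIN allowed the change,
 * which is the case the patch prints a warning for.
 */
static int transparent_permitted(const struct caps *c, bool val,
				 bool *deprecated)
{
	*deprecated = false;
	if (!val)
		return 0;		/* clearing is always allowed */
	if (c->net_transparent || c->net_raw)
		return 0;		/* preferred capabilities */
	if (!c->net_admin)
		return -EPERM;
	*deprecated = true;		/* allowed, but via the old capability */
	return 0;
}
```

Under this model a caller holding only CAP_NET_ADMIN can still set the
option but is flagged as deprecated, while a caller with no relevant
capability can clear the option but not set it.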

^ permalink raw reply related

* [PATCH 08/15] af_netlink.c: make netlink_capable userns-aware
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap
  Cc: Serge E. Hallyn, Eric Dumazet
In-Reply-To: <1314993400-6910-1-git-send-email-serge@hallyn.com>

From: "Serge E. Hallyn" <serge.hallyn@canonical.com>

netlink_capable should check for permissions against the user
namespace owning the socket in question.

Changelog:
  Per Eric Dumazet's advice, use sock_net(sk) instead of #ifdef.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/netlink/af_netlink.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 0a4db02..3cc0bbe 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -580,8 +580,9 @@ retry:
 
 static inline int netlink_capable(struct socket *sock, unsigned int flag)
 {
-	return (nl_table[sock->sk->sk_protocol].nl_nonroot & flag) ||
-	       capable(CAP_NET_ADMIN);
+	if (nl_table[sock->sk->sk_protocol].nl_nonroot & flag)
+		return 1;
+	return ns_capable(sock_net(sock->sk)->user_ns, CAP_NET_ADMIN);
 }
 
 static void
-- 
1.7.5.4
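
For readers unfamiliar with targeted capability checks, the following
userspace model sketches the control flow of the patched
netlink_capable(). It is a deliberate simplification and not the
kernel's implementation: the struct layouts, the model_* names and the
MODEL_CAP_NET_ADMIN bit are invented for illustration, and the model
assumes only that a capability raised in a namespace also applies to
that namespace's descendants (any special treatment of a namespace's
creator is left out).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user namespace: just a link to its parent. */
struct user_ns {
	const struct user_ns *parent;	/* NULL for the initial namespace */
};

/* Hypothetical task credentials. */
struct task {
	const struct user_ns *ns;	/* namespace the credentials live in */
	unsigned int caps;		/* capabilities raised in that namespace */
};

#define MODEL_CAP_NET_ADMIN (1u << 12)	/* illustrative bit only */

/* True if the task holds cap in the target namespace or an ancestor of it. */
static bool model_ns_capable(const struct task *t,
			     const struct user_ns *target, unsigned int cap)
{
	for (const struct user_ns *ns = target; ns; ns = ns->parent)
		if (ns == t->ns)
			return (t->caps & cap) != 0;
	return false;
}

/* Mirrors the patched flow: non-root flag first, then the targeted check. */
static bool model_netlink_capable(const struct task *t,
				  const struct user_ns *sock_ns,
				  unsigned int nl_nonroot, unsigned int flag)
{
	if (nl_nonroot & flag)
		return true;
	return model_ns_capable(t, sock_ns, MODEL_CAP_NET_ADMIN);
}
```

In this model a task that is privileged only inside a child namespace
passes the check for sockets owned by that child namespace, but not
for sockets owned by the initial namespace unless the protocol's
non-root flag permits the operation.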

^ permalink raw reply related

* [PATCH 13/15] userns: net: make many network capable calls targeted
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap
  Cc: Serge E. Hallyn
In-Reply-To: <1314993400-6910-1-git-send-email-serge@hallyn.com>

From: "Serge E. Hallyn" <serge.hallyn@canonical.com>

When privilege is protecting a namespaced network resource, then having
the required privilege targeted toward the user namespace which owns the
resource suffices.

As with other patches, a big concern here is that we be cleanly separating
the cases where privilege protects a network resource from cases where
privilege can lead to laxer constraints on input and, subsequently,
the ability to corrupt, crash, or own the host kernel.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---
 net/8021q/vlan.c                  |   12 ++++++------
 net/bridge/br_ioctl.c             |   22 +++++++++++-----------
 net/bridge/br_sysfs_br.c          |    8 ++++----
 net/bridge/br_sysfs_if.c          |    2 +-
 net/bridge/netfilter/ebtables.c   |    8 ++++----
 net/core/ethtool.c                |    2 +-
 net/ipv4/arp.c                    |    2 +-
 net/ipv4/devinet.c                |    4 ++--
 net/ipv4/fib_frontend.c           |    2 +-
 net/ipv4/ip_options.c             |    6 +++---
 net/ipv4/ip_sockglue.c            |    4 ++--
 net/ipv4/ipip.c                   |    4 ++--
 net/ipv4/ipmr.c                   |    2 +-
 net/ipv4/netfilter/arp_tables.c   |    8 ++++----
 net/ipv4/netfilter/ip_tables.c    |    8 ++++----
 net/netfilter/ipset/ip_set_core.c |    2 +-
 net/netfilter/ipvs/ip_vs_ctl.c    |    4 ++--
 net/packet/af_packet.c            |    2 +-
 18 files changed, 51 insertions(+), 51 deletions(-)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 8970ba1..7d12f63 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -558,7 +558,7 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg)
 	switch (args.cmd) {
 	case SET_VLAN_INGRESS_PRIORITY_CMD:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		vlan_dev_set_ingress_priority(dev,
 					      args.u.skb_priority,
@@ -568,7 +568,7 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg)
 
 	case SET_VLAN_EGRESS_PRIORITY_CMD:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		err = vlan_dev_set_egress_priority(dev,
 						   args.u.skb_priority,
@@ -577,7 +577,7 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg)
 
 	case SET_VLAN_FLAG_CMD:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		err = vlan_dev_change_flags(dev,
 					    args.vlan_qos ? args.u.flag : 0,
@@ -586,7 +586,7 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg)
 
 	case SET_VLAN_NAME_TYPE_CMD:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		if ((args.u.name_type >= 0) &&
 		    (args.u.name_type < VLAN_NAME_TYPE_HIGHEST)) {
@@ -602,14 +602,14 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg)
 
 	case ADD_VLAN_CMD:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		err = register_vlan_device(dev, args.u.VID);
 		break;
 
 	case DEL_VLAN_CMD:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		unregister_vlan_dev(dev, NULL);
 		err = 0;
diff --git a/net/bridge/br_ioctl.c b/net/bridge/br_ioctl.c
index 7222fe1..c82f9cb 100644
--- a/net/bridge/br_ioctl.c
+++ b/net/bridge/br_ioctl.c
@@ -88,7 +88,7 @@ static int add_del_if(struct net_bridge *br, int ifindex, int isadd)
 	struct net_device *dev;
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	dev = __dev_get_by_index(dev_net(br->dev), ifindex);
@@ -178,25 +178,25 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 	}
 
 	case BRCTL_SET_BRIDGE_FORWARD_DELAY:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		return br_set_forward_delay(br, args[1]);
 
 	case BRCTL_SET_BRIDGE_HELLO_TIME:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		return br_set_hello_time(br, args[1]);
 
 	case BRCTL_SET_BRIDGE_MAX_AGE:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		return br_set_max_age(br, args[1]);
 
 	case BRCTL_SET_AGEING_TIME:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		br->ageing_time = clock_t_to_jiffies(args[1]);
@@ -236,14 +236,14 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 	}
 
 	case BRCTL_SET_BRIDGE_STP_STATE:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		br_stp_set_enabled(br, args[1]);
 		return 0;
 
 	case BRCTL_SET_BRIDGE_PRIORITY:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		spin_lock_bh(&br->lock);
@@ -256,7 +256,7 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 		struct net_bridge_port *p;
 		int ret;
 
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		spin_lock_bh(&br->lock);
@@ -273,7 +273,7 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 		struct net_bridge_port *p;
 		int ret;
 
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		spin_lock_bh(&br->lock);
@@ -330,7 +330,7 @@ static int old_deviceless(struct net *net, void __user *uarg)
 	{
 		char buf[IFNAMSIZ];
 
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		if (copy_from_user(buf, (void __user *)args[1], IFNAMSIZ))
@@ -360,7 +360,7 @@ int br_ioctl_deviceless_stub(struct net *net, unsigned int cmd, void __user *uar
 	{
 		char buf[IFNAMSIZ];
 
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		if (copy_from_user(buf, uarg, IFNAMSIZ))
diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
index 68b893e..7f4fa3a 100644
--- a/net/bridge/br_sysfs_br.c
+++ b/net/bridge/br_sysfs_br.c
@@ -36,7 +36,7 @@ static ssize_t store_bridge_parm(struct device *d,
 	unsigned long val;
 	int err;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	val = simple_strtoul(buf, &endp, 0);
@@ -132,7 +132,7 @@ static ssize_t store_stp_state(struct device *d,
 	char *endp;
 	unsigned long val;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	val = simple_strtoul(buf, &endp, 0);
@@ -267,7 +267,7 @@ static ssize_t store_group_addr(struct device *d,
 	unsigned new_addr[6];
 	int i;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	if (sscanf(buf, "%x:%x:%x:%x:%x:%x",
@@ -304,7 +304,7 @@ static ssize_t store_flush(struct device *d,
 {
 	struct net_bridge *br = to_bridge(d);
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	br_fdb_flush(br);
diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c
index 6229b62..9cb4d2e 100644
--- a/net/bridge/br_sysfs_if.c
+++ b/net/bridge/br_sysfs_if.c
@@ -209,7 +209,7 @@ static ssize_t brport_store(struct kobject * kobj,
 	char *endp;
 	unsigned long val;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(dev_net(p->br->dev)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	val = simple_strtoul(buf, &endp, 0);
diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
index 5864cc4..cc1198b 100644
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -1463,7 +1463,7 @@ static int do_ebt_set_ctl(struct sock *sk,
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch(cmd) {
@@ -1485,7 +1485,7 @@ static int do_ebt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 	struct ebt_replace tmp;
 	struct ebt_table *t;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	if (copy_from_user(&tmp, user, sizeof(tmp)))
@@ -2276,7 +2276,7 @@ static int compat_do_ebt_set_ctl(struct sock *sk,
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -2299,7 +2299,7 @@ static int compat_do_ebt_get_ctl(struct sock *sk, int cmd,
 	struct compat_ebt_replace tmp;
 	struct ebt_table *t;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	/* try real handler in case userland supplied needed padding */
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 6cdba5f..56878bf 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1676,7 +1676,7 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
 	case ETHTOOL_GFEATURES:
 		break;
 	default:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 	}
 
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 96a164a..023ad24 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -1175,7 +1175,7 @@ int arp_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	switch (cmd) {
 	case SIOCDARP:
 	case SIOCSARP:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 	case SIOCGARP:
 		err = copy_from_user(&r, arg, sizeof(struct arpreq));
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index bc19bd0..93b5b0b 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -728,7 +728,7 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 
 	case SIOCSIFFLAGS:
 		ret = -EACCES;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			goto out;
 		break;
 	case SIOCSIFADDR:	/* Set interface address (and family) */
@@ -736,7 +736,7 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	case SIOCSIFDSTADDR:	/* Set the destination address */
 	case SIOCSIFNETMASK: 	/* Set the netmask for the interface */
 		ret = -EACCES;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			goto out;
 		ret = -EINVAL;
 		if (sin->sin_family != AF_INET)
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 92fc5f6..8f34a07 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -437,7 +437,7 @@ int ip_rt_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	switch (cmd) {
 	case SIOCADDRT:		/* Add a route */
 	case SIOCDELRT:		/* Delete a route */
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 
 		if (copy_from_user(&rt, arg, sizeof(rt)))
diff --git a/net/ipv4/ip_options.c b/net/ipv4/ip_options.c
index ec93335..21df700 100644
--- a/net/ipv4/ip_options.c
+++ b/net/ipv4/ip_options.c
@@ -396,7 +396,7 @@ int ip_options_compile(struct net *net,
 					optptr[2] += 8;
 					break;
 				      default:
-					if (!skb && !capable(CAP_NET_RAW)) {
+					if (!skb && !ns_capable(net->user_ns, CAP_NET_RAW)) {
 						pp_ptr = optptr + 3;
 						goto error;
 					}
@@ -432,7 +432,7 @@ int ip_options_compile(struct net *net,
 				opt->router_alert = optptr - iph;
 			break;
 		      case IPOPT_CIPSO:
-			if ((!skb && !capable(CAP_NET_RAW)) || opt->cipso) {
+			if ((!skb && !ns_capable(net->user_ns, CAP_NET_RAW)) || opt->cipso) {
 				pp_ptr = optptr;
 				goto error;
 			}
@@ -445,7 +445,7 @@ int ip_options_compile(struct net *net,
 		      case IPOPT_SEC:
 		      case IPOPT_SID:
 		      default:
-			if (!skb && !capable(CAP_NET_RAW)) {
+			if (!skb && !ns_capable(net->user_ns, CAP_NET_RAW)) {
 				pp_ptr = optptr;
 				goto error;
 			}
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 8905e92..6408507 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -955,13 +955,13 @@ mc_msf_out:
 	case IP_IPSEC_POLICY:
 	case IP_XFRM_POLICY:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 			break;
 		err = xfrm_user_policy(sk, optname, optval, optlen);
 		break;
 
 	case IP_TRANSPARENT:
-		if (!capable(CAP_NET_ADMIN)) {
+		if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
 			err = -EPERM;
 			break;
 		}
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index 378b20b..6725832 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -629,7 +629,7 @@ ipip_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 	case SIOCADDTUNNEL:
 	case SIOCCHGTUNNEL:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			goto done;
 
 		err = -EFAULT;
@@ -689,7 +689,7 @@ ipip_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 
 	case SIOCDELTUNNEL:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			goto done;
 
 		if (dev == ipn->fb_tunnel_dev) {
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 58e8791..309aa0c 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1204,7 +1204,7 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi
 
 	if (optname != MRT_INIT) {
 		if (sk != rcu_dereference_raw(mrt->mroute_sk) &&
-		    !capable(CAP_NET_ADMIN))
+		    !ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EACCES;
 	}
 
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index fd7a3f6..acc908f 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -1534,7 +1534,7 @@ static int compat_do_arpt_set_ctl(struct sock *sk, int cmd, void __user *user,
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -1678,7 +1678,7 @@ static int compat_do_arpt_get_ctl(struct sock *sk, int cmd, void __user *user,
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -1699,7 +1699,7 @@ static int do_arpt_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -1723,7 +1723,7 @@ static int do_arpt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 24e556e..72f2cde 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -1847,7 +1847,7 @@ compat_do_ipt_set_ctl(struct sock *sk,	int cmd, void __user *user,
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -1962,7 +1962,7 @@ compat_do_ipt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -1984,7 +1984,7 @@ do_ipt_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len)
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -2009,7 +2009,7 @@ do_ipt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index d7e86ef..38d69a5 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -1596,7 +1596,7 @@ ip_set_sockfn_get(struct sock *sk, int optval, void __user *user, int *len)
 	void *data;
 	int copylen = *len, ret = 0;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 	if (optval != SO_IP_SET)
 		return -EBADF;
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 2b771dc..db224ef 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -2284,7 +2284,7 @@ do_ip_vs_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len)
 	struct ip_vs_dest_user *udest_compat;
 	struct ip_vs_dest_user_kern udest;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	if (cmd < IP_VS_BASE_CTL || cmd > IP_VS_SO_SET_MAX)
@@ -2566,7 +2566,7 @@ do_ip_vs_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 	struct netns_ipvs *ipvs = net_ipvs(net);
 
 	BUG_ON(!net);
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	if (cmd < IP_VS_BASE_CTL || cmd > IP_VS_SO_GET_MAX)
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index c698cec..c2e6bb6 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1793,7 +1793,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
 	__be16 proto = (__force __be16)protocol; /* weird, but documented */
 	int err;
 
-	if (!capable(CAP_NET_RAW))
+	if (!ns_capable(net->user_ns, CAP_NET_RAW))
 		return -EPERM;
 	if (sock->type != SOCK_DGRAM && sock->type != SOCK_RAW &&
 	    sock->type != SOCK_PACKET)
-- 
1.7.5.4

^ permalink raw reply related

* user namespaces v3: continue targetting capabilities
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap

This was last sent Jul 26, and incorporates feedback from that thread.
The last patch, 0015-make-kernel-signal.c-user-ns-safe-v2.patch, is new,
so could stand extra scrutiny.

This patchset is a basis for Eric's set which allows assigning a
filesystem to a user namespace
(http://git.kernel.org/?p=linux/kernel/git/ebiederm/linux-userns-devel.git),
which is the last hurdle to starting to employ user namespaces to help
constrain root in a container.  So if there is no more major feedback,
I'd love to see this get a spin in -mm so we can proceed with that.

[ v2 intro message: ]

Here is a set of patches to continue targeting capabilities
where appropriate.  This set goes about as far as is possible
without making the VFS user namespace aware, meaning that the
VFS can provide a namespaced view of userids, i.e. init_user_ns
sees file owner 500, while child user ns sees file owner 0 or
1000.  (There are a few other things, like siginfos, which can
be addressed before we address the VFS).

With this set applied, you can create and configure veth netdevs
if your user namespace owns your network namespace (and you are
privileged), but not otherwise.
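
[ Editorial illustration, not part of the patchset: a minimal userspace
sketch of the ownership rule described above.  All model_* names and the
bit chosen for MODEL_CAP_NET_ADMIN are hypothetical; the walk up the
namespace hierarchy only loosely follows the targeted-capability check
and is an assumption about its shape, not the kernel implementation. ]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these types are the kernel's. */
struct model_user_ns;

struct model_user {
	int uid;
	struct model_user_ns *ns;	/* namespace this user lives in */
};

struct model_user_ns {
	struct model_user *creator;	/* NULL only for the initial namespace */
};

struct model_net {
	struct model_user_ns *user_ns;	/* user namespace owning this netns */
};

struct model_task {
	struct model_user *user;
	unsigned int cap_effective;	/* bitmask of model capabilities */
};

#define MODEL_CAP_NET_ADMIN (1u << 12)

/* Does task t hold cap when the check is targeted at namespace target? */
static bool model_ns_capable(const struct model_task *t,
			     const struct model_user_ns *target,
			     unsigned int cap)
{
	for (;;) {
		/* The creator of a non-initial namespace is privileged over it. */
		if (target->creator && target->creator == t->user)
			return true;
		/* Target is the task's own namespace: consult its effective set. */
		if (target == t->user->ns)
			return (t->cap_effective & cap) != 0;
		/* Reached the initial namespace without a match. */
		if (!target->creator)
			return false;
		/* A capability held in a parent namespace covers its children. */
		target = target->creator->ns;
	}
}

/* May task t create or configure devices in network namespace net? */
static bool model_may_configure(const struct model_task *t,
				const struct model_net *net)
{
	return model_ns_capable(t, net->user_ns, MODEL_CAP_NET_ADMIN);
}
```

In this model, root inside a child user namespace can configure a network
namespace owned by that child namespace, but not one still owned by the
initial user namespace.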

Some simple testcases can be found at
https://code.launchpad.net/~serge-hallyn/+junk/usernstests with
packages at
https://launchpad.net/~serge-hallyn/+archive/userns-natty

Feedback very much appreciated.

^ permalink raw reply

* (unknown), 
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap
  Cc: Serge Hallyn
In-Reply-To: <1314993400-6910-1-git-send-email-serge@hallyn.com>

GIT: [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)
GIT: [PATCH 02/15] user ns: setns: move capable checks into per-ns attach
GIT: [PATCH 03/15] keyctl: check capabilities against key's user_ns
GIT: [PATCH 04/15] user_ns: convert fs/attr.c to targeted capabilities
GIT: [PATCH 05/15] userns: clamp down users of cap_raised
GIT: [PATCH 06/15] user namespace: make each net (net_ns) belong to a
GIT: [PATCH 07/15] user namespace: use net->user_ns for some capable
GIT: [PATCH 08/15] af_netlink.c: make netlink_capable userns-aware
GIT: [PATCH 09/15] user ns: convert ipv6 to targeted capabilities
GIT: [PATCH 10/15] net/core/scm.c: target capable() calls to user_ns
GIT: [PATCH 11/15] userns: make some net-sysfs capable calls targeted
GIT: [PATCH 12/15] user_ns: target af_key capability check
GIT: [PATCH 13/15] userns: net: make many network capable calls targeted
GIT: [PATCH 14/15] net: pass user_ns to cap_netlink_recv()
GIT: [PATCH 15/15] make kernel/signal.c user ns safe (v2)

^ permalink raw reply

* [PATCH 01/15] add Documentation/namespaces/user_namespace.txt (v3)
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap
  Cc: Serge E. Hallyn, Serge E. Hallyn
In-Reply-To: <1314993400-6910-1-git-send-email-serge@hallyn.com>

From: "Serge E. Hallyn" <serge@hallyn.com>

Quoting David Howells (dhowells@redhat.com):
> Randy Dunlap <rdunlap@xenotime.net> wrote:
>
> > > +Any task in or resource belonging to the initial user namespace will, to this
> > > +new task, appear to belong to UID and GID -1 - which is usually known as
> >
> > that extra hyphen is confusing.  how about:
> >
> >                               to UID and GID -1, which is
>
> 'which are'.
>
> David

This will hold some info about the design.  Currently it contains
future todos, issues and questions.

Changelog:
   jul 26: incorporate feedback from David Howells.
   jul 29: incorporate feedback from Randy Dunlap.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
---
 Documentation/namespaces/user_namespace.txt |  107 +++++++++++++++++++++++++++
 1 files changed, 107 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/namespaces/user_namespace.txt

diff --git a/Documentation/namespaces/user_namespace.txt b/Documentation/namespaces/user_namespace.txt
new file mode 100644
index 0000000..b0bc480
--- /dev/null
+++ b/Documentation/namespaces/user_namespace.txt
@@ -0,0 +1,107 @@
+Description
+===========
+
+Traditionally, each task is owned by a user ID (UID) and belongs to one or more
+groups (GID).  Both are simple numeric IDs, though userspace usually translates
+them to names.  The user namespace allows tasks to have different views of the
+UIDs and GIDs associated with tasks and other resources.  (See 'UID mapping'
+below for more.)
+
+The user namespace is a simple hierarchical one.  The system starts with all
+tasks belonging to the initial user namespace.  A task creates a new user
+namespace by passing the CLONE_NEWUSER flag to clone(2).  This requires the
+creating task to have the CAP_SETUID, CAP_SETGID, and CAP_CHOWN capabilities,
+but it does not need to be running as root.  The clone(2) call will result in a
+new task which to itself appears to be running as UID and GID 0, but to its
+creator seems to have the creator's credentials.
+
+To this new task, any resource belonging to the initial user namespace will
+appear to belong to user and group 'nobody', which are UID and GID -1.
+Permission to open such files will be granted according to world access
+permissions.  UID comparisons and group membership checks will return false,
+and privilege will be denied.
+
+When a task belonging to (for example) userid 500 in the initial user namespace
+creates a new user namespace, even though the new task will see itself as
+belonging to UID 0, any task in the initial user namespace will see it as
+belonging to UID 500.  Therefore, UID 500 in the initial user namespace will be
+able to kill the new task.  Files created by the new user will (eventually) be
+seen by tasks in its own user namespace as belonging to UID 0, but to tasks in
+the initial user namespace as belonging to UID 500.
+
+Note that this userid mapping for the VFS is not yet implemented, though the
+lkml and containers mailing list archives will show several previous
+prototypes.  In the end, those got hung up waiting for the concept of targeted
+capabilities to be developed, which, thanks to the insight of Eric Biederman,
+it finally was.
+
+Relationship between the User namespace and other namespaces
+============================================================
+
+Other namespaces, such as UTS and network, are owned by a user namespace.  When
+such a namespace is created, it is assigned to the user namespace of the task
+by which it was created.  Therefore, attempts to exercise privilege to
+resources in, for instance, a particular network namespace, can be properly
+validated by checking whether the caller has the needed privilege (e.g.
+CAP_NET_ADMIN) targeted to the user namespace which owns the network namespace.
+This is done using the ns_capable() function.
+
+As an example, if a new task is cloned with a private user namespace but
+no private network namespace, then the task's network namespace is owned
+by the parent user namespace.  The new task has no privilege to the
+parent user namespace, so it will not be able to create or configure
+network devices.  If, instead, the task were cloned with both private
+user and network namespaces, then the private network namespace is owned
+by the private user namespace, and so root in the new user namespace
+will have privilege targeted to the network namespace.  It will be able
+to create and configure network devices.
+
+UID Mapping
+===========
+The current plan (see 'flexible UID mapping' at
+https://wiki.ubuntu.com/UserNamespace) is:
+
+The UID/GID stored on disk will be that in the init_user_ns.  Most likely
+UID/GID in other namespaces will be stored in xattrs.  But Eric was advocating
+(a few years ago) leaving the details up to filesystems while providing a lib/
+stock implementation.  See the thread around here:
+http://www.mail-archive.com/devel@openvz.org/msg09331.html
+
+
+Working notes
+=============
+Capability checks for actions related to syslog must be against the
+init_user_ns until syslog is containerized.
+
+The same is true for reboot and power, control groups, devices, and time.
+
+Perf actions (kernel/event/core.c for instance) will always be constrained to
+init_user_ns.
+
+Q:
+Is accounting considered properly containerized with respect to pidns?  (it
+appears to be).  If so, then we can change the capable() check in
+kernel/acct.c to 'ns_capable(current_pid_ns()->user_ns, CAP_PACCT)'
+
+Q:
+For things like nice and schedaffinity, we could allow root in a container to
+control those, and leave only cgroups to constrain the container.  I'm not sure
+whether that is right, or whether it violates admin expectations.
+
+I deferred some of commoncap.c.  I'm punting on xattr stuff as they take
+dentries, not inodes.
+
+For drivers/tty/tty_io.c and drivers/tty/vt/vt.c, we'll want to (for some of
+them) target the capability checks at the user_ns owning the tty.  That will
+have to wait until we get userns owning files straightened out.
+
+We need to figure out how to label devices.  Should we just toss a user_ns
+right into struct device?
+
+capable(CAP_MAC_ADMIN) checks are always to be against init_user_ns, unless
+LSMs are some day containerized, which has a near zero chance of happening.
+
+inode_owner_or_capable() should probably take an optional ns and cap parameter.
+If cap is 0, then CAP_FOWNER is checked.  If ns is NULL, we derive the ns from
+inode.  But if ns is provided, then callers who need to derive
+inode_userns(inode) anyway can save a few cycles.
-- 
1.7.5.4

^ permalink raw reply related

* [PATCH 02/15] user ns: setns: move capable checks into per-ns attach helper
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w
In-Reply-To: <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>

From: "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>

Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 ipc/namespace.c          |    7 +++++++
 kernel/fork.c            |    5 +++++
 kernel/nsproxy.c         |   11 ++++++++---
 kernel/utsname.c         |    7 +++++++
 net/core/net_namespace.c |    7 +++++++
 5 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/ipc/namespace.c b/ipc/namespace.c
index ce0a647..a0a7609 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -163,6 +163,13 @@ static void ipcns_put(void *ns)
 
 static int ipcns_install(struct nsproxy *nsproxy, void *ns)
 {
+#if 0
+	struct ipc_namespace *newns = ns;
+	if (!ns_capable(newns->user_ns, CAP_SYS_ADMIN))
+#else
+	if (!capable(CAP_SYS_ADMIN))
+#endif
+		return -1;
 	/* Ditch state from the old ipc namespace */
 	exit_sem(current);
 	put_ipc_ns(nsproxy->ipc_ns);
diff --git a/kernel/fork.c b/kernel/fork.c
index 8e6b6f4..ca712f5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1489,8 +1489,13 @@ long do_fork(unsigned long clone_flags,
 		/* hopefully this check will go away when userns support is
 		 * complete
 		 */
+#if 0
+		if (!nsown_capable(CAP_SYS_ADMIN) || !nsown_capable(CAP_SETUID) ||
+				!nsown_capable(CAP_SETGID))
+#else
 		if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SETUID) ||
 				!capable(CAP_SETGID))
+#endif
 			return -EPERM;
 	}
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 9aeab4b..e274577 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -134,7 +134,11 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 				CLONE_NEWPID | CLONE_NEWNET)))
 		return 0;
 
+#if 0
+	if (!nsown_capable(CAP_SYS_ADMIN)) {
+#else
 	if (!capable(CAP_SYS_ADMIN)) {
+#endif
 		err = -EPERM;
 		goto out;
 	}
@@ -191,7 +195,11 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 			       CLONE_NEWNET)))
 		return 0;
 
+#if 0
+	if (!nsown_capable(CAP_SYS_ADMIN))
+#else
 	if (!capable(CAP_SYS_ADMIN))
+#endif
 		return -EPERM;
 
 	*new_nsp = create_new_namespaces(unshare_flags, current,
@@ -241,9 +249,6 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
 	struct file *file;
 	int err;
 
-	if (!capable(CAP_SYS_ADMIN))
-		return -EPERM;
-
 	file = proc_ns_fget(fd);
 	if (IS_ERR(file))
 		return PTR_ERR(file);
diff --git a/kernel/utsname.c b/kernel/utsname.c
index bff131b..4638a54 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -104,6 +104,13 @@ static void utsns_put(void *ns)
 
 static int utsns_install(struct nsproxy *nsproxy, void *ns)
 {
+#if 0
+	struct uts_namespace *newns = ns;
+	if (!ns_capable(newns->user_ns, CAP_SYS_ADMIN))
+#else
+	if (!capable(CAP_SYS_ADMIN))
+#endif
+		return -1;
 	get_uts_ns(ns);
 	put_uts_ns(nsproxy->uts_ns);
 	nsproxy->uts_ns = ns;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 5bbdbf0..6f6698d 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -620,6 +620,13 @@ static void netns_put(void *ns)
 
 static int netns_install(struct nsproxy *nsproxy, void *ns)
 {
+#if 0
+	struct net *net = ns;
+	if (!ns_capable(net->user_ns, CAP_SYS_ADMIN))
+#else
+	if (!capable(CAP_SYS_ADMIN))
+#endif
+		return -1;
 	put_net(nsproxy->net_ns);
 	nsproxy->net_ns = get_net(ns);
 	return 0;
-- 
1.7.5.4

^ permalink raw reply related

* [PATCH 03/15] keyctl: check capabilities against key's user_ns
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w
In-Reply-To: <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>

From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

At the moment, a task should only be able to get its own user_ns's keys
anyway, so nsown_capable should also work, but there is no
advantage to doing that, while using the key's user_ns is clearer.

changelog: jun 6:
	compile fix: keyctl.c (key_user, not key, has user_ns)

Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Acked-by: David Howells <dhowells-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 security/keys/keyctl.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index eca5191..fa7d420 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -745,7 +745,7 @@ long keyctl_chown_key(key_serial_t id, uid_t uid, gid_t gid)
 	ret = -EACCES;
 	down_write(&key->sem);
 
-	if (!capable(CAP_SYS_ADMIN)) {
+	if (!ns_capable(key->user->user_ns, CAP_SYS_ADMIN)) {
 		/* only the sysadmin can chown a key to some other UID */
 		if (uid != (uid_t) -1 && key->uid != uid)
 			goto error_put;
@@ -852,7 +852,8 @@ long keyctl_setperm_key(key_serial_t id, key_perm_t perm)
 	down_write(&key->sem);
 
 	/* if we're not the sysadmin, we can only change a key that we own */
-	if (capable(CAP_SYS_ADMIN) || key->uid == current_fsuid()) {
+	if (ns_capable(key->user->user_ns, CAP_SYS_ADMIN) ||
+	    key->uid == current_fsuid()) {
 		key->perm = perm;
 		ret = 0;
 	}
-- 
1.7.5.4

^ permalink raw reply related

* [PATCH 04/15] user_ns: convert fs/attr.c to targeted capabilities
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w
In-Reply-To: <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>

From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/attr.c |   20 +++++++++++++-------
 1 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 538e279..e0cf46a 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -29,6 +29,7 @@
 int inode_change_ok(const struct inode *inode, struct iattr *attr)
 {
 	unsigned int ia_valid = attr->ia_valid;
+	struct user_namespace *ns;
 
 	/*
 	 * First check size constraints.  These can't be overriden using
@@ -44,26 +45,28 @@ int inode_change_ok(const struct inode *inode, struct iattr *attr)
 	if (ia_valid & ATTR_FORCE)
 		return 0;
 
+	ns = inode_userns(inode);
 	/* Make sure a caller can chown. */
 	if ((ia_valid & ATTR_UID) &&
-	    (current_fsuid() != inode->i_uid ||
-	     attr->ia_uid != inode->i_uid) && !capable(CAP_CHOWN))
+	    (ns != current_user_ns() || current_fsuid() != inode->i_uid ||
+	     attr->ia_uid != inode->i_uid) && !ns_capable(ns, CAP_CHOWN))
 		return -EPERM;
 
 	/* Make sure caller can chgrp. */
 	if ((ia_valid & ATTR_GID) &&
-	    (current_fsuid() != inode->i_uid ||
+	    (ns != current_user_ns() || current_fsuid() != inode->i_uid ||
 	    (!in_group_p(attr->ia_gid) && attr->ia_gid != inode->i_gid)) &&
-	    !capable(CAP_CHOWN))
+	    !ns_capable(ns, CAP_CHOWN))
 		return -EPERM;
 
 	/* Make sure a caller can chmod. */
 	if (ia_valid & ATTR_MODE) {
+		gid_t gid = (ia_valid & ATTR_GID) ? attr->ia_gid : inode->i_gid;
 		if (!inode_owner_or_capable(inode))
 			return -EPERM;
 		/* Also check the setgid bit! */
-		if (!in_group_p((ia_valid & ATTR_GID) ? attr->ia_gid :
-				inode->i_gid) && !capable(CAP_FSETID))
+		if ((ns != current_user_ns() || !in_group_p(gid)) &&
+		    !ns_capable(ns, CAP_FSETID))
 			attr->ia_mode &= ~S_ISGID;
 	}
 
@@ -154,9 +157,12 @@ void setattr_copy(struct inode *inode, const struct iattr *attr)
 						inode->i_sb->s_time_gran);
 	if (ia_valid & ATTR_MODE) {
 		umode_t mode = attr->ia_mode;
+		struct user_namespace *ns = inode_userns(inode);
 
-		if (!in_group_p(inode->i_gid) && !capable(CAP_FSETID))
+		if ((ns != current_user_ns() || !in_group_p(inode->i_gid)) &&
+		    !ns_capable(ns, CAP_FSETID))
 			mode &= ~S_ISGID;
+
 		inode->i_mode = mode;
 	}
 }
-- 
1.7.5.4

^ permalink raw reply related

* [PATCH 05/15] userns: clamp down users of cap_raised
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w
In-Reply-To: <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>

From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

A few modules use cap_raised(current_cap(), cap) to authorize
actions, but that privilege should only count when held in the initial
user namespace.  Refuse privilege if the caller is not in init_user_ns.
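
[ Editorial illustration, not part of the patch: a minimal userspace
sketch of the guard order this patch adds to the connector callbacks.
The model_* names and the capability bit are hypothetical; the point is
only that a raised capability bit is ignored unless the caller sits in
the initial user namespace. ]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's user namespace objects. */
struct model_user_ns {
	int id;
};

static struct model_user_ns model_init_user_ns = { 0 };

struct model_caller {
	const struct model_user_ns *user_ns;	/* namespace of the caller */
	unsigned int cap_effective;		/* bitmask of model capabilities */
};

#define MODEL_CAP_SYS_ADMIN (1u << 21)

/* Returns true only if the callback should go on to act on the message. */
static bool model_cn_callback_allowed(const struct model_caller *c)
{
	/* New check: a capability held in a child namespace does not count. */
	if (c->user_ns != &model_init_user_ns)
		return false;

	/* Existing check: the raw effective-set test (cap_raised in the patch). */
	return (c->cap_effective & MODEL_CAP_SYS_ADMIN) != 0;
}
```

Without the first test, a task that is root only inside a child user
namespace would pass the raw bit test, since its effective set has the
bit raised relative to its own namespace.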

Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 drivers/block/drbd/drbd_nl.c           |    5 +++++
 drivers/md/dm-log-userspace-transfer.c |    3 +++
 drivers/staging/pohmelfs/config.c      |    3 +++
 drivers/video/uvesafb.c                |    3 +++
 4 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 0feab26..9a87a14 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -2297,6 +2297,11 @@ static void drbd_connector_callback(struct cn_msg *req, struct netlink_skb_parms
 		return;
 	}
 
+	if (current_user_ns() != &init_user_ns) {
+		retcode = ERR_PERM;
+		goto fail;
+	}
+
 	if (!cap_raised(current_cap(), CAP_SYS_ADMIN)) {
 		retcode = ERR_PERM;
 		goto fail;
diff --git a/drivers/md/dm-log-userspace-transfer.c b/drivers/md/dm-log-userspace-transfer.c
index 1f23e04..140ca81 100644
--- a/drivers/md/dm-log-userspace-transfer.c
+++ b/drivers/md/dm-log-userspace-transfer.c
@@ -134,6 +134,9 @@ static void cn_ulog_callback(struct cn_msg *msg, struct netlink_skb_parms *nsp)
 {
 	struct dm_ulog_request *tfr = (struct dm_ulog_request *)(msg + 1);
 
+	if (current_user_ns() != &init_user_ns)
+		return;
+
 	if (!cap_raised(current_cap(), CAP_SYS_ADMIN))
 		return;
 
diff --git a/drivers/staging/pohmelfs/config.c b/drivers/staging/pohmelfs/config.c
index b6c42cb..cd259d0 100644
--- a/drivers/staging/pohmelfs/config.c
+++ b/drivers/staging/pohmelfs/config.c
@@ -525,6 +525,9 @@ static void pohmelfs_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *n
 {
 	int err;
 
+	if (current_user_ns() != &init_user_ns)
+		return;
+
 	if (!cap_raised(current_cap(), CAP_SYS_ADMIN))
 		return;
 
diff --git a/drivers/video/uvesafb.c b/drivers/video/uvesafb.c
index 7f8472c..71dab8e 100644
--- a/drivers/video/uvesafb.c
+++ b/drivers/video/uvesafb.c
@@ -73,6 +73,9 @@ static void uvesafb_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *ns
 	struct uvesafb_task *utask;
 	struct uvesafb_ktask *task;
 
+	if (current_user_ns() != &init_user_ns)
+		return;
+
 	if (!cap_raised(current_cap(), CAP_SYS_ADMIN))
 		return;
 
-- 
1.7.5.4

^ permalink raw reply related

* [PATCH 06/15] user namespace: make each net (net_ns) belong to a user_ns
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm-3NddpPZAyC0, segooon-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w
In-Reply-To: <1314993400-6910-1-git-send-email-serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>

From: "Serge E. Hallyn" <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

This way we can target capabilities at the user_ns which created the
net ns.
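
[ Editorial illustration, not part of the patch: a minimal userspace
sketch of the reference pairing the diff below relies on.  A network
namespace takes a reference on its owning user namespace when it is
created and drops it when it is freed.  All model_* names are
hypothetical and the counter is a plain int, not the kernel's
refcounting. ]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model types; not the kernel structures. */
struct model_user_ns {
	int refcount;
};

struct model_net {
	struct model_user_ns *user_ns;	/* owner of this network namespace */
};

/* Take a reference and return the namespace, mirroring get_user_ns(). */
static struct model_user_ns *model_get_user_ns(struct model_user_ns *ns)
{
	ns->refcount++;
	return ns;
}

/* Drop a reference, mirroring put_user_ns(). */
static void model_put_user_ns(struct model_user_ns *ns)
{
	ns->refcount--;
}

/* On creation the new netns pins the creating task's user namespace. */
static void model_net_create(struct model_net *net,
			     struct model_user_ns *creator_ns)
{
	net->user_ns = model_get_user_ns(creator_ns);
}

/* On teardown the netns releases its owner. */
static void model_net_free(struct model_net *net)
{
	model_put_user_ns(net->user_ns);
	net->user_ns = NULL;
}
```

The pairing matters because capability checks dereference net->user_ns
for as long as the network namespace exists, so the owner must not be
freed first.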

Changelog:
   jul 8: nsproxy: don't assign netns->userns if not cloning.

Signed-off-by: Serge E. Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 include/net/net_namespace.h |    2 ++
 kernel/nsproxy.c            |    2 ++
 net/core/net_namespace.c    |    3 +++
 3 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 3bb6fa0..d91fe5f 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -29,6 +29,7 @@ struct ctl_table_header;
 struct net_generic;
 struct sock;
 struct netns_ipvs;
+struct user_namespace;
 
 
 #define NETDEV_HASHBITS    8
@@ -101,6 +102,7 @@ struct net {
 	struct netns_xfrm	xfrm;
 #endif
 	struct netns_ipvs	*ipvs;
+	struct user_namespace	*user_ns;
 };
 
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index e274577..752b477 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -95,6 +95,8 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 		err = PTR_ERR(new_nsp->net_ns);
 		goto out_net;
 	}
+	if (flags & CLONE_NEWNET)
+		new_nsp->net_ns->user_ns = get_user_ns(task_cred_xxx(tsk, user_ns));
 
 	return new_nsp;
 
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 6f6698d..5ca95cc 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -10,6 +10,7 @@
 #include <linux/nsproxy.h>
 #include <linux/proc_fs.h>
 #include <linux/file.h>
+#include <linux/user_namespace.h>
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 
@@ -209,6 +210,7 @@ static void net_free(struct net *net)
 	}
 #endif
 	kfree(net->gen);
+	put_user_ns(net->user_ns);
 	kmem_cache_free(net_cachep, net);
 }
 
@@ -389,6 +391,7 @@ static int __init net_ns_init(void)
 	rcu_assign_pointer(init_net.gen, ng);
 
 	mutex_lock(&net_mutex);
+	init_net.user_ns = &init_user_ns;
 	if (setup_net(&init_net))
 		panic("Could not setup the initial network namespace");
 
-- 
1.7.5.4

^ permalink raw reply related

* [PATCH 07/15] user namespace: use net->user_ns for some capable calls under net/
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap
  Cc: Serge Hallyn
In-Reply-To: <1314993400-6910-1-git-send-email-serge@hallyn.com>

From: Serge Hallyn <serge.hallyn@ubuntu.com>

Just a partial conversion to show how the previous patch is expected to
be used.
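
[ Editorial illustration, not part of the patch: a minimal userspace
sketch of the SO_PRIORITY rule after conversion.  The function name and
the boolean parameter are hypothetical; the boolean stands in for the
result of ns_capable(net->user_ns, CAP_NET_ADMIN) on the socket's
network namespace. ]

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Model of the SO_PRIORITY check: priorities 0..6 are open to everyone,
 * anything else needs CAP_NET_ADMIN targeted at the user namespace that
 * owns the socket's network namespace.
 */
static int model_check_so_priority(int val, bool net_admin_over_netns)
{
	if ((val >= 0 && val <= 6) || net_admin_over_netns)
		return 0;
	return -EPERM;
}
```

The only change from the unconverted code is which namespace the
privilege is evaluated against; the range test is untouched.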

Changelog:
  6/28/11: fix typo in net/core/sock.c
  7/08/11: don't target capability which authorizes module loading

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---
 net/core/dev.c  |    4 ++--
 net/core/sock.c |   14 ++++++++------
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 17d67b5..6ae955f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5014,7 +5014,7 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	case SIOCGMIIPHY:
 	case SIOCGMIIREG:
 	case SIOCSIFNAME:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 		dev_load(net, ifr.ifr_name);
 		rtnl_lock();
@@ -5053,7 +5053,7 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	case SIOCBRADDIF:
 	case SIOCBRDELIF:
 	case SIOCSHWTSTAMP:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 		/* fall through */
 	case SIOCBONDSLAVEINFOQUERY:
diff --git a/net/core/sock.c b/net/core/sock.c
index bc745d0..0f31675 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -420,7 +420,7 @@ static int sock_bindtodevice(struct sock *sk, char __user *optval, int optlen)
 
 	/* Sorry... */
 	ret = -EPERM;
-	if (!capable(CAP_NET_RAW))
+	if (!ns_capable(net->user_ns, CAP_NET_RAW))
 		goto out;
 
 	ret = -EINVAL;
@@ -488,6 +488,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 	int valbool;
 	struct linger ling;
 	int ret = 0;
+	struct net *net = sock_net(sk);
 
 	/*
 	 *	Options without arguments
@@ -508,7 +509,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 
 	switch (optname) {
 	case SO_DEBUG:
-		if (val && !capable(CAP_NET_ADMIN))
+		if (val && !ns_capable(net->user_ns, CAP_NET_ADMIN))
 			ret = -EACCES;
 		else
 			sock_valbool_flag(sk, SOCK_DBG, valbool);
@@ -551,7 +552,7 @@ set_sndbuf:
 		break;
 
 	case SO_SNDBUFFORCE:
-		if (!capable(CAP_NET_ADMIN)) {
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) {
 			ret = -EPERM;
 			break;
 		}
@@ -589,7 +590,7 @@ set_rcvbuf:
 		break;
 
 	case SO_RCVBUFFORCE:
-		if (!capable(CAP_NET_ADMIN)) {
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) {
 			ret = -EPERM;
 			break;
 		}
@@ -612,7 +613,8 @@ set_rcvbuf:
 		break;
 
 	case SO_PRIORITY:
-		if ((val >= 0 && val <= 6) || capable(CAP_NET_ADMIN))
+		if ((val >= 0 && val <= 6) ||
+		     ns_capable(net->user_ns, CAP_NET_ADMIN))
 			sk->sk_priority = val;
 		else
 			ret = -EPERM;
@@ -729,7 +731,7 @@ set_rcvbuf:
 			clear_bit(SOCK_PASSSEC, &sock->flags);
 		break;
 	case SO_MARK:
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			ret = -EPERM;
 		else
 			sk->sk_mark = val;
-- 
1.7.5.4

^ permalink raw reply related

* [PATCH 09/15] user ns: convert ipv6 to targeted capabilities
From: Serge Hallyn @ 2011-09-02 19:56 UTC (permalink / raw)
  To: akpm, segooon, linux-kernel, netdev, containers, dhowells,
	ebiederm, rdunlap
  Cc: Serge E. Hallyn
In-Reply-To: <1314993400-6910-1-git-send-email-serge@hallyn.com>

From: "Serge E. Hallyn" <serge.hallyn@canonical.com>

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---
 net/ipv6/addrconf.c             |    4 ++--
 net/ipv6/af_inet6.c             |    6 ++++--
 net/ipv6/datagram.c             |    6 +++---
 net/ipv6/ip6_flowlabel.c        |   24 ++++++++++++++----------
 net/ipv6/ip6_tunnel.c           |    4 ++--
 net/ipv6/ip6mr.c                |    2 +-
 net/ipv6/ipv6_sockglue.c        |    7 ++++---
 net/ipv6/netfilter/ip6_tables.c |    8 ++++----
 net/ipv6/route.c                |    2 +-
 net/ipv6/sit.c                  |   10 +++++-----
 10 files changed, 40 insertions(+), 33 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index f012ebd..871e5cf 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2230,7 +2230,7 @@ int addrconf_add_ifaddr(struct net *net, void __user *arg)
 	struct in6_ifreq ireq;
 	int err;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	if (copy_from_user(&ireq, arg, sizeof(struct in6_ifreq)))
@@ -2249,7 +2249,7 @@ int addrconf_del_ifaddr(struct net *net, void __user *arg)
 	struct in6_ifreq ireq;
 	int err;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	if (copy_from_user(&ireq, arg, sizeof(struct in6_ifreq)))
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 3b5669a..1854ffe 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -160,7 +160,8 @@ lookup_protocol:
 	}
 
 	err = -EPERM;
-	if (sock->type == SOCK_RAW && !kern && !capable(CAP_NET_RAW))
+	if (sock->type == SOCK_RAW && !kern &&
+	    !ns_capable(net->user_ns, CAP_NET_RAW))
 		goto out_rcu_unlock;
 
 	sock->ops = answer->ops;
@@ -281,7 +282,8 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		return -EINVAL;
 
 	snum = ntohs(addr->sin6_port);
-	if (snum && snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE))
+	if (snum && snum < PROT_SOCK &&
+	    !ns_capable(sock_net(sk)->user_ns, CAP_NET_BIND_SERVICE))
 		return -EACCES;
 
 	lock_sock(sk);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 9ef1831..33b1b0f 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -701,7 +701,7 @@ int datagram_send_ctl(struct net *net,
 				err = -EINVAL;
 				goto exit_f;
 			}
-			if (!capable(CAP_NET_RAW)) {
+			if (!ns_capable(net->user_ns, CAP_NET_RAW)) {
 				err = -EPERM;
 				goto exit_f;
 			}
@@ -721,7 +721,7 @@ int datagram_send_ctl(struct net *net,
 				err = -EINVAL;
 				goto exit_f;
 			}
-			if (!capable(CAP_NET_RAW)) {
+			if (!ns_capable(net->user_ns, CAP_NET_RAW)) {
 				err = -EPERM;
 				goto exit_f;
 			}
@@ -746,7 +746,7 @@ int datagram_send_ctl(struct net *net,
 				err = -EINVAL;
 				goto exit_f;
 			}
-			if (!capable(CAP_NET_RAW)) {
+			if (!ns_capable(net->user_ns, CAP_NET_RAW)) {
 				err = -EPERM;
 				goto exit_f;
 			}
diff --git a/net/ipv6/ip6_flowlabel.c b/net/ipv6/ip6_flowlabel.c
index f3caf1b..4726c02 100644
--- a/net/ipv6/ip6_flowlabel.c
+++ b/net/ipv6/ip6_flowlabel.c
@@ -294,21 +294,22 @@ struct ipv6_txoptions *fl6_merge_options(struct ipv6_txoptions * opt_space,
 	return opt_space;
 }
 
-static unsigned long check_linger(unsigned long ttl)
+static unsigned long check_linger(unsigned long ttl, struct user_namespace *ns)
 {
 	if (ttl < FL_MIN_LINGER)
 		return FL_MIN_LINGER*HZ;
-	if (ttl > FL_MAX_LINGER && !capable(CAP_NET_ADMIN))
+	if (ttl > FL_MAX_LINGER && !ns_capable(ns, CAP_NET_ADMIN))
 		return 0;
 	return ttl*HZ;
 }
 
-static int fl6_renew(struct ip6_flowlabel *fl, unsigned long linger, unsigned long expires)
+static int fl6_renew(struct ip6_flowlabel *fl, unsigned long linger,
+		     unsigned long expires, struct user_namespace *ns)
 {
-	linger = check_linger(linger);
+	linger = check_linger(linger, ns);
 	if (!linger)
 		return -EPERM;
-	expires = check_linger(expires);
+	expires = check_linger(expires, ns);
 	if (!expires)
 		return -EPERM;
 	fl->lastuse = jiffies;
@@ -375,7 +376,7 @@ fl_create(struct net *net, struct in6_flowlabel_req *freq, char __user *optval,
 
 	fl->fl_net = hold_net(net);
 	fl->expires = jiffies;
-	err = fl6_renew(fl, freq->flr_linger, freq->flr_expires);
+	err = fl6_renew(fl, freq->flr_linger, freq->flr_expires, net->user_ns);
 	if (err)
 		goto done;
 	fl->share = freq->flr_share;
@@ -425,7 +426,7 @@ static int mem_check(struct sock *sk)
 	if (room <= 0 ||
 	    ((count >= FL_MAX_PER_SOCK ||
 	      (count > 0 && room < FL_MAX_SIZE/2) || room < FL_MAX_SIZE/4) &&
-	     !capable(CAP_NET_ADMIN)))
+	     !ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)))
 		return -ENOBUFS;
 
 	return 0;
@@ -507,17 +508,20 @@ int ipv6_flowlabel_opt(struct sock *sk, char __user *optval, int optlen)
 		read_lock_bh(&ip6_sk_fl_lock);
 		for (sfl = np->ipv6_fl_list; sfl; sfl = sfl->next) {
 			if (sfl->fl->label == freq.flr_label) {
-				err = fl6_renew(sfl->fl, freq.flr_linger, freq.flr_expires);
+				err = fl6_renew(sfl->fl, freq.flr_linger, freq.flr_expires,
+						net->user_ns);
 				read_unlock_bh(&ip6_sk_fl_lock);
 				return err;
 			}
 		}
 		read_unlock_bh(&ip6_sk_fl_lock);
 
-		if (freq.flr_share == IPV6_FL_S_NONE && capable(CAP_NET_ADMIN)) {
+		if (freq.flr_share == IPV6_FL_S_NONE &&
+		    ns_capable(net->user_ns, CAP_NET_ADMIN)) {
 			fl = fl_lookup(net, freq.flr_label);
 			if (fl) {
-				err = fl6_renew(fl, freq.flr_linger, freq.flr_expires);
+				err = fl6_renew(fl, freq.flr_linger, freq.flr_expires,
+						net->user_ns);
 				fl_release(fl);
 				return err;
 			}
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 0bc9888..c430d69 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1269,7 +1269,7 @@ ip6_tnl_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 	case SIOCADDTUNNEL:
 	case SIOCCHGTUNNEL:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			break;
 		err = -EFAULT;
 		if (copy_from_user(&p, ifr->ifr_ifru.ifru_data, sizeof (p)))
@@ -1304,7 +1304,7 @@ ip6_tnl_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 		break;
 	case SIOCDELTUNNEL:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			break;
 
 		if (dev == ip6n->fb_tnl_dev) {
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 705c828..1649ccd 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -1582,7 +1582,7 @@ int ip6_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, uns
 		return -ENOENT;
 
 	if (optname != MRT6_INIT) {
-		if (sk != mrt->mroute6_sk && !capable(CAP_NET_ADMIN))
+		if (sk != mrt->mroute6_sk && !ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EACCES;
 	}
 
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 147ede38..485e181 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -343,7 +343,7 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 		break;
 
 	case IPV6_TRANSPARENT:
-		if (!capable(CAP_NET_ADMIN)) {
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) {
 			retv = -EPERM;
 			break;
 		}
@@ -381,7 +381,8 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 
 		/* hop-by-hop / destination options are privileged option */
 		retv = -EPERM;
-		if (optname != IPV6_RTHDR && !capable(CAP_NET_RAW))
+		if (optname != IPV6_RTHDR &&
+		    !ns_capable(net->user_ns, CAP_NET_RAW))
 			break;
 
 		opt = ipv6_renew_options(sk, np->opt, optname,
@@ -725,7 +726,7 @@ done:
 	case IPV6_IPSEC_POLICY:
 	case IPV6_XFRM_POLICY:
 		retv = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			break;
 		retv = xfrm_user_policy(sk, optname, optval, optlen);
 		break;
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 94874b0..7fce7d8 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -1869,7 +1869,7 @@ compat_do_ip6t_set_ctl(struct sock *sk, int cmd, void __user *user,
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -1984,7 +1984,7 @@ compat_do_ip6t_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -2006,7 +2006,7 @@ do_ip6t_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len)
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
@@ -2031,7 +2031,7 @@ do_ip6t_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 {
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
+	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
 
 	switch (cmd) {
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 9e69eb0..f00c18d 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1938,7 +1938,7 @@ int ipv6_route_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	switch(cmd) {
 	case SIOCADDRT:		/* Add a route */
 	case SIOCDELRT:		/* Delete a route */
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 		err = copy_from_user(&rtmsg, arg,
 				     sizeof(struct in6_rtmsg));
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 00b15ac..7438711 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -308,7 +308,7 @@ static int ipip6_tunnel_get_prl(struct ip_tunnel *t,
 	/* For simple GET or for root users,
 	 * we try harder to allocate.
 	 */
-	kp = (cmax <= 1 || capable(CAP_NET_ADMIN)) ?
+	kp = (cmax <= 1 || ns_capable(dev_net(t->dev)->user_ns, CAP_NET_ADMIN)) ?
 		kcalloc(cmax, sizeof(*kp), GFP_KERNEL) :
 		NULL;
 
@@ -929,7 +929,7 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 	case SIOCADDTUNNEL:
 	case SIOCCHGTUNNEL:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			goto done;
 
 		err = -EFAULT;
@@ -988,7 +988,7 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 
 	case SIOCDELTUNNEL:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			goto done;
 
 		if (dev == sitn->fb_tunnel_dev) {
@@ -1021,7 +1021,7 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 	case SIOCDELPRL:
 	case SIOCCHGPRL:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			goto done;
 		err = -EINVAL;
 		if (dev == sitn->fb_tunnel_dev)
@@ -1050,7 +1050,7 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 	case SIOCCHG6RD:
 	case SIOCDEL6RD:
 		err = -EPERM;
-		if (!capable(CAP_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			goto done;
 
 		err = -EFAULT;
-- 
1.7.5.4
