Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: rfkill-input to be removed
From: Johannes Berg @ 2011-04-21 16:22 UTC (permalink / raw)
  To: Marco Chiappero; +Cc: netdev
In-Reply-To: <4DAFEAA7.5090003@absence.it>

On Thu, 2011-04-21 at 10:28 +0200, Marco Chiappero wrote:
> While working on the the sony-laptop driver, adding support for 
> persistent rfkill state storing and adding the SW_RFKILL_ALL switch 
> event forwarding to the input core to notify userspace, I realized that 
> rfkill-input interferes with correct behavior of the driver, vanishing 
> the hardware device state storing. 

Yeah we noticed this before with some other drivers. The persistent
stuff seems to only be suitable for a small number of semantics.

> Then, looking at 
> Documentation/feature-removal-schedule.txt I realized that rfkill-input 
> was scheduled to be removed in 2.6.33, but it's still there in 2.6.39. 
> Please remove that code as soon as possible, rfkill input events should 
> be handled by user space tools.

Frankly, I don't think we're ready for this yet, most distros don't yet
ship the rfkill daemon.

johannes


^ permalink raw reply

* Re: RPS will assign different smp_processor_id for the same packet?
From: Eric Dumazet @ 2011-04-21 16:25 UTC (permalink / raw)
  To: zhou rui; +Cc: netdev
In-Reply-To: <1303402094.3685.54.camel@edumazet-laptop>

Le jeudi 21 avril 2011 à 18:08 +0200, Eric Dumazet a écrit :
> Le jeudi 21 avril 2011 à 23:50 +0800, zhou rui a écrit :
> > kernel 2.6.36.4
> > CONFIG_RPS=y but not set the cpu mask
> > 
> > /sys/class/net/eth1/queues/rx-0 # cat rps_cpus
> > 00
> > 
> > register a hook func:
> >   prot_hook.func = packet_rcv;
> >   prot_hook.type = htons(ETH_P_ALL);
> >   dev_add_pack(&prot_hook);
> > 
> > 
> > replay the same traffic in very slow speed, printk the
> > smp_processor_id in packet_rcv():
> > first time:
> > cpu=4
> > cpu=3
> > cpu=6
> > cpu=7
> > 
> > second time:
> > cpu=7
> > cpu=1
> > cpu=5
> > cpu=2
> > 
> > is it normal?
> 
> Yes it is.
> 
> What would you expect ?
> 

If rps_cpus contains only '0' bits, it basically means RPS is not active
for this input queue.

CPU is therefore not changed : The cpu handling NAPI on your network
device directly calls upper linux stack.

Seeing your traces, it also means your device spreads its interrupts on
many different cpus, this might be not optimal.

Check /proc/irq/{irq_number}/smp_affinity, it probably contains "ff"

^ permalink raw reply

* Re: [PATCHv4] usbnet: Resubmit interrupt URB once if halted
From: Alan Stern @ 2011-04-21 16:27 UTC (permalink / raw)
  To: Paul Stewart; +Cc: netdev, linux-usb, davem, greg
In-Reply-To: <BANLkTik7Da6K4Fn7=dpo8RwwctCp8kV_iw@mail.gmail.com>

On Thu, 21 Apr 2011, Paul Stewart wrote:

> I'm trying to handle two separate issues, one of which I can't say I
> fully understand.  The first, which is the most straightforward, is
> for systems in which USB devices remained powered across
> suspend-resume.  For this case for sure, we don't need a flag.  The
> interrupt URBs are halted (either done so by the HCD as I've observed,
> or the drive can choose to kill them in usbnet_suspend()).  On system
> resume, we're guaranteed URBs have stopped, and we can just submit
> one.

Okay, good.

> In a second scenario, for other systems USB devices go unpowered
> during suspend.

As happens during hibernation, for example.

>  At resume time, there's a quick succession where the
> device appears to detach and reattach and enumerate.

Right.  It's called reset-resume, and drivers have a special method for 
it, distinct from regular resume.  In theory it shouldn't make any 
difference.

>  This is where
> things get strange.  It appears that since the enumeration happens in
> the course of system resume, when usbnet_open() gets called, and
> usb_autopm_get_interface(), there's a call path that leads to
> usbnet_resume().

Only if the interface was suspended when usbnet_open() was called.  It
might have gotten suspended automatically following the system resume,
if it wasn't in use.  But this should work out the same whether or not
there was a reset-rseume.

>  If there's no flag, then we submit the interrupt urb
> from usbnet_resume(), so the submit_urb() in usbnet_open() fails in an
> error.  This makes actions like "ifconfig eth0 up" fail on the
> interface after resume from suspend.

The driver needs better coordination between open/stop and
resume/suspend.  The interrupt and receive URBs are supposed to be
active whenever the interface is up and not suspended, right?  Which
means that usbnet_resume() shouldn't submit anything if the interface 
isn't up.

Alan Stern

^ permalink raw reply

* Re: RPS will assign different smp_processor_id for the same packet?
From: zhou rui @ 2011-04-21 16:27 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev@vger.kernel.org
In-Reply-To: <1303402094.3685.54.camel@edumazet-laptop>

On Friday, April 22, 2011, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le jeudi 21 avril 2011 à 23:50 +0800, zhou rui a écrit :
>> kernel 2.6.36.4
>> CONFIG_RPS=y but not set the cpu mask
>>
>> /sys/class/net/eth1/queues/rx-0 # cat rps_cpus
>> 00
>>
>> register a hook func:
>>   prot_hook.func = packet_rcv;
>>   prot_hook.type = htons(ETH_P_ALL);
>>   dev_add_pack(&prot_hook);
>>
>>
>> replay the same traffic in very slow speed, printk the
>> smp_processor_id in packet_rcv():
>> first time:
>> cpu=4
>> cpu=3
>> cpu=6
>> cpu=7
>>
>> second time:
>> cpu=7
>> cpu=1
>> cpu=5
>> cpu=2
>>
>> is it normal?
>
> Yes it is.
>
> What would you expect ?
>
>
>
I want a same CPU for same packet,
If I echo ff >rps_cpu,will I get it?
And the design idea for different CPU in rps?
I understand nic will assign same rxq for packet has same hash

^ permalink raw reply

* Re: RPS will assign different smp_processor_id for the same packet?
From: zhou rui @ 2011-04-21 16:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev@vger.kernel.org
In-Reply-To: <1303403112.3685.61.camel@edumazet-laptop>

On Friday, April 22, 2011, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le jeudi 21 avril 2011 à 18:08 +0200, Eric Dumazet a écrit :
>> Le jeudi 21 avril 2011 à 23:50 +0800, zhou rui a écrit :
>> > kernel 2.6.36.4
>> > CONFIG_RPS=y but not set the cpu mask
>> >
>> > /sys/class/net/eth1/queues/rx-0 # cat rps_cpus
>> > 00
>> >
>> > register a hook func:
>> >   prot_hook.func = packet_rcv;
>> >   prot_hook.type = htons(ETH_P_ALL);
>> >   dev_add_pack(&prot_hook);
>> >
>> >
>> > replay the same traffic in very slow speed, printk the
>> > smp_processor_id in packet_rcv():
>> > first time:
>> > cpu=4
>> > cpu=3
>> > cpu=6
>> > cpu=7
>> >
>> > second time:
>> > cpu=7
>> > cpu=1
>> > cpu=5
>> > cpu=2
>> >
>> > is it normal?
>>
>> Yes it is.
>>
>> What would you expect ?
>>
>
> If rps_cpus contains only '0' bits, it basically means RPS is not active
> for this input queue.
>
> CPU is therefore not changed : The cpu handling NAPI on your network
> device directly calls upper linux stack.
>
> Seeing your traces, it also means your device spreads its interrupts on
> many different cpus, this might be not optimal.
>
> Check /proc/irq/{irq_number}/smp_affinity, it probably contains "ff"
>
>
>
>
Thanks,just saw this email

^ permalink raw reply

* Re: rfkill-input to be removed
From: Marco Chiappero @ 2011-04-21 16:45 UTC (permalink / raw)
  To: Johannes Berg; +Cc: netdev
In-Reply-To: <1303402951.3597.27.camel@jlt3.sipsolutions.net>

Il 21/04/2011 18:22, Johannes Berg ha scritto:
> Yeah we noticed this before with some other drivers. The persistent
> stuff seems to only be suitable for a small number of semantics.

Sorry, what do you mean exactly? The persistent stuff seems to work well 
with those notebooks.

> Frankly, I don't think we're ready for this yet, most distros don't yet
> ship the rfkill daemon.

Ok, in the meantime I'm going to avoid the SW_RFKILL_ALL switch event 
forwarding, unless the master_switch_mode=1 is going to be changed to 
honor the stored power state on drivers loading as well, not only when 
moving the "kill all" switch (it seems that on boot every device is 
always turned on, which is very annoying).

^ permalink raw reply

* Re: [PATCH] tg3: Convert u32 flag,flg2,flg3 uses to bitmap
From: Joe Perches @ 2011-04-21 16:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Matt Carlson, Michael Chan, netdev, linux-kernel
In-Reply-To: <1303374696.3685.14.camel@edumazet-laptop>

On Thu, 2011-04-21 at 10:31 +0200, Eric Dumazet wrote:
> Le mercredi 20 avril 2011 à 23:39 -0700, Joe Perches a écrit :
> > Using a bitmap instead of separate u32 flags allows a consistent, simpler
[]
> Use an enum ?

No strong preference.
If it's an enum .c file will change.
	
> Why first value is 1 and not 0 ?

Should be 0.

> > +#define TG3_FLAGS			74	/* Set to number of flags */
> Also you need to make TG3_FLAGS be (last_flag_value + 1) or you could
> miss one long in bitmap.

Right.  Thanks for comments Eric.
I'll wait for Matt to comment before resubmitting.

^ permalink raw reply

* RE: rtnetlink and many VFs
From: Rose, Gregory V @ 2011-04-21 17:02 UTC (permalink / raw)
  To: Ben Hutchings, David Miller; +Cc: netdev, sf-linux-drivers
In-Reply-To: <1303396576.3165.13.camel@bwh-desktop>

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> On Behalf Of Ben Hutchings
> Sent: Thursday, April 21, 2011 7:36 AM
> To: David Miller
> Cc: netdev; sf-linux-drivers
> Subject: rtnetlink and many VFs
> 
> My colleagues have been working on SR-IOV support for sfc.  The hardware
> supports up to 127 VFs per port.
> 
> If we configure all 127 VFs through the net device, an RTM_GETLINK dump
> will need to include messages describing them, with a total size of:
> 
> 127 * (sizeof(struct ifla_vf_mac) + sizeof(struct ifla_vf_vlan) +
>        sizeof(struct ifla_vf_tx_rate) + protocol overhead)
> > 7112
> 
> These messages are nested within the message describing the device as a
> whole, so they cannot be split.  The maximum size of an outgoing netlink
> message, based on NLMSG_GOODSIZE, seems to be min(PAGE_SIZE, 8192).  So
> when PAGE_SIZE = 4096 it is simply impossible to dump information about
> such a device!
> 
> I think it needs to be made possible to grow a netlink skb during
> generation of the first message.  Userspace may still be unable to
> receive the large message but at least it has a chance.

I've been looking at this one too.  The limit seems to be about 40 or so in the most common case.  My netlink fu is weak but I've been looking at the code in iproute2/ip and netlink to see what we can do about it.

As more VFs become possible it really needs a fix.  I was thinking about something along the lines of this:

# ip link show eth(x) vf (n)

Where eth(x) is the physical function that owns the VFs and (n) is the specific VF you want information for.  That way one could easily script something that loops through the VFs and gets the information for each.  This really becomes necessary when we start adding additional MAC and VLAN filters for each VF that need to be displayed.  In that case you can only show a few VFs before you run out of space.

In any case I've been working on an RFC patch for this and hope to have it soon.  I consider this a pretty serious limitation and one could even view it as a bug.

- Greg
Greg Rose
LAD Division
Intel Corp.

^ permalink raw reply

* Re: [PATCH] tg3: Convert u32 flag,flg2,flg3 uses to bitmap
From: Michael Chan @ 2011-04-21 17:18 UTC (permalink / raw)
  To: Joe Perches
  Cc: Matthew Carlson, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <02bf2aa5c08514641ecbe7c39ef976918fad036c.1303367730.git.joe@perches.com>


On Wed, 2011-04-20 at 23:39 -0700, Joe Perches wrote:
> @@ -4622,7 +4611,7 @@ static void tg3_tx_recover(struct tg3 *tp)
>                     "and include system chipset information.\n");
> 
>         spin_lock(&tp->lock);
> -       tp->tg3_flags |= TG3_FLAG_TX_RECOVERY_PENDING;
> +       tg3_flag_set(tp, TX_RECOVERY_PENDING);
>         spin_unlock(&tp->lock);
>  }
> 

By using set_bit() now, we can eliminate the spin_lock() here.  This
flag is checked much later when tg3_reset_task() is scheduled to run in
workqueue, so no memory barrier is needed either.

Thanks.

^ permalink raw reply

* [PATCH] [RFC] davinci_emac: don't WARN_ON cpdma_chan_submit -ENOMEM
From: Ben Gardiner @ 2011-04-21 17:30 UTC (permalink / raw)
  To: netdev; +Cc: davinci-linux-open-source, Sriramakrishnan A G

The current implementation of emac_rx_handler, when the host is
flooded, will result in a great deal of WARNs on the console; due to
the return value of cpdma_chan_submit. This function can error with
EINVAL and ENOMEM; the former when the channel is in an invalid
state, in which case the caller is in error. The latter when a cpdma
descriptor cannot be allocated.

When flooded, cpdma_chan_submit will return -ENOMEM;
treat the inability to allocate a cpdma descriptor as an rx error
similar in behaviour to when emac_rx_alloc returns NULL.

No Signed-off-by yet; not complete fix (see below).
CC: Sriramakrishnan A G <srk@ti.com>

---
I'm new to network drivers -- and kernel development, really. I'd be
happy to receive feedback on this approach of resolving the -ENOMEM
when flooded. Is there a more conventional approach? Shoud these frames
be recorded as 'dropped'?

Testing was performed on da850evm both with and without  "net:
davinci_emac: fix spinlock bug with dma channel cleanup" from
Sriramakrishnan A G applied. The behaviour was the same: the emac is
not able to receive any frames after being flooded -- but it can still
send. I would appreciate any insight into the potential causes of the
lockup.

Best Regards,
Ben Gardiner

Nanometrics Inc.
http://www.nanometrics.ca

---
 drivers/net/davinci_emac.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/drivers/net/davinci_emac.c b/drivers/net/davinci_emac.c
index 7018bfe..17c48d6 100644
--- a/drivers/net/davinci_emac.c
+++ b/drivers/net/davinci_emac.c
@@ -1037,8 +1037,12 @@ static void emac_rx_handler(void *token, int len, int status)
 recycle:
 	ret = cpdma_chan_submit(priv->rxchan, skb, skb->data,
 			skb_tailroom(skb), GFP_KERNEL);
-	if (WARN_ON(ret < 0))
+	WARN_ON(ret == -EINVAL);
+	if (ret < 0) {
+		if (netif_msg_rx_err(priv) && net_ratelimit())
+			dev_err(emac_dev, "failed cpdma submit\n");
 		dev_kfree_skb_any(skb);
+	}
 }

 static void emac_tx_handler(void *token, int len, int status)
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH net-next-2.6 0/3] new SCTP sockets APIs
From: David Miller @ 2011-04-21 17:36 UTC (permalink / raw)
  To: yjwei; +Cc: netdev, linux-sctp
In-Reply-To: <4DABAEF1.1010506@cn.fujitsu.com>

From: Wei Yongjun <yjwei@cn.fujitsu.com>
Date: Mon, 18 Apr 2011 11:24:33 +0800

> This patchset implement some new sockets APIs for net-next-2.6.
> 
> Wei Yongjun (3):
>   sctp: implement socket option SCTP_GET_ASSOC_ID_LIST
>   sctp: change auth event type name to SCTP_AUTHENTICATION_EVENT
>   sctp: implement event notification SCTP_SENDER_DRY_EVENT

All 3 patches applied, thanks.

^ permalink raw reply

* Re: [PATCH BUG-FIX] ipv6: udp: fix the wrong headroom check
From: David Miller @ 2011-04-21 17:39 UTC (permalink / raw)
  To: herbert; +Cc: shanwei, kuznet, pekkas, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <20110420105007.GA16248@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 20 Apr 2011 18:50:07 +0800

> On Wed, Apr 20, 2011 at 04:52:49PM +0800, Shan Wei wrote:
>> At this point, skb->data points to skb_transport_header.
>> So, headroom check is wrong. 
>> 
>> For some case:bridge(UFO is on) + eth device(UFO is off),
>> there is no enough headroom for IPv6 frag head.
>> But headroom check is always false.
>> 
>> This will bring about data be moved to there prior to skb->head,
>> when adding IPv6 frag header to skb.
>> 
>> Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
> 
> Ouch.
> 
> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied, thanks.

^ permalink raw reply

* RE: rtnetlink and many VFs
From: Ben Hutchings @ 2011-04-21 17:40 UTC (permalink / raw)
  To: Rose, Gregory V; +Cc: David Miller, netdev, sf-linux-drivers
In-Reply-To: <43F901BD926A4E43B106BF17856F07550145FC7526@orsmsx508.amr.corp.intel.com>

On Thu, 2011-04-21 at 10:02 -0700, Rose, Gregory V wrote:
> > -----Original Message-----
> > From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> > On Behalf Of Ben Hutchings
> > Sent: Thursday, April 21, 2011 7:36 AM
> > To: David Miller
> > Cc: netdev; sf-linux-drivers
> > Subject: rtnetlink and many VFs
> > 
> > My colleagues have been working on SR-IOV support for sfc.  The hardware
> > supports up to 127 VFs per port.
> > 
> > If we configure all 127 VFs through the net device, an RTM_GETLINK dump
> > will need to include messages describing them, with a total size of:
> > 
> > 127 * (sizeof(struct ifla_vf_mac) + sizeof(struct ifla_vf_vlan) +
> >        sizeof(struct ifla_vf_tx_rate) + protocol overhead)
> > > 7112
> > 
> > These messages are nested within the message describing the device as a
> > whole, so they cannot be split.  The maximum size of an outgoing netlink
> > message, based on NLMSG_GOODSIZE, seems to be min(PAGE_SIZE, 8192).  So
> > when PAGE_SIZE = 4096 it is simply impossible to dump information about
> > such a device!
> > 
> > I think it needs to be made possible to grow a netlink skb during
> > generation of the first message.  Userspace may still be unable to
> > receive the large message but at least it has a chance.
> 
> I've been looking at this one too.  The limit seems to be about 40 or
> so in the most common case.

Right.  When Steve Hodgson investigated this here, he found that 46 VFs
would fit.

> My netlink fu is weak but I've been looking at the code in iproute2/ip
> and netlink to see what we can do about it.
> 
> As more VFs become possible it really needs a fix.  I was thinking
> about something along the lines of this:
> 
> # ip link show eth(x) vf (n)
> 
> Where eth(x) is the physical function that owns the VFs and (n) is the
> specific VF you want information for.  That way one could easily
> script something that loops through the VFs and gets the information
> for each.  This really becomes necessary when we start adding
> additional MAC and VLAN filters for each VF that need to be displayed.
> In that case you can only show a few VFs before you run out of space.

I think that what 'ip link show' is doing now seems to be perfectly
valid.  It allocates a 16K buffer which would be enough if netlink
didn't apply this PAGE_SIZE limit to single messages.

> In any case I've been working on an RFC patch for this and hope to
> have it soon.  I consider this a pretty serious limitation and one
> could even view it as a bug.

It is certainly a bug.  rtnetlink is the currently favoured API for
querying and configuring network device settings, but there are now
valid device settings that cannot be queried.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* RE: rtnetlink and many VFs
From: Rose, Gregory V @ 2011-04-21 17:50 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, netdev, sf-linux-drivers
In-Reply-To: <1303407628.3165.30.camel@bwh-desktop>

> -----Original Message-----
> From: Ben Hutchings [mailto:bhutchings@solarflare.com]
> Sent: Thursday, April 21, 2011 10:40 AM
> To: Rose, Gregory V
> Cc: David Miller; netdev; sf-linux-drivers
> Subject: RE: rtnetlink and many VFs
> 
> On Thu, 2011-04-21 at 10:02 -0700, Rose, Gregory V wrote:
> > > -----Original Message-----
> > > From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org]
> > > On Behalf Of Ben Hutchings
> > > Sent: Thursday, April 21, 2011 7:36 AM
> > > To: David Miller
> > > Cc: netdev; sf-linux-drivers
> > > Subject: rtnetlink and many VFs
> > >
> >
> > As more VFs become possible it really needs a fix.  I was thinking
> > about something along the lines of this:
> >
> > # ip link show eth(x) vf (n)
> >
> > Where eth(x) is the physical function that owns the VFs and (n) is the
> > specific VF you want information for.  That way one could easily
> > script something that loops through the VFs and gets the information
> > for each.  This really becomes necessary when we start adding
> > additional MAC and VLAN filters for each VF that need to be displayed.
> > In that case you can only show a few VFs before you run out of space.
> 
> I think that what 'ip link show' is doing now seems to be perfectly
> valid.  It allocates a 16K buffer which would be enough if netlink
> didn't apply this PAGE_SIZE limit to single messages.

Ah, I hadn't seen that it was allocating 16K, that would then be enough for 128 VFs but in the future would not be enough for 40Gig (or higher speed) devices that might support two or 4 times that many VFs.  I still feel like eventually the number of VFs will outgrow the capability of a single message to handle, especially when VFs will have the capability of having multiple MAC address and VLAN filters assigned to them.  And it seems orthogonal to me to mirror the 'ip link set eth(x) vf (n)' syntax with a 'ip link show eth(x) vf (n)' syntax.

That's just me though.

> 
> It is certainly a bug.  rtnetlink is the currently favoured API for
> querying and configuring network device settings, but there are now
> valid device settings that cannot be queried.

I'll look into just fixing the bug for now and reserve (at least what I consider to be) future improvements for later.

- Greg


^ permalink raw reply

* Re: [PATCH] tg3: Convert u32 flag,flg2,flg3 uses to bitmap
From: Joe Perches @ 2011-04-21 17:51 UTC (permalink / raw)
  To: Michael Chan
  Cc: Matthew Carlson, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <1303406280.8857.13.camel@HP1>

On Thu, 2011-04-21 at 10:18 -0700, Michael Chan wrote:
> On Wed, 2011-04-20 at 23:39 -0700, Joe Perches wrote:
> > @@ -4622,7 +4611,7 @@ static void tg3_tx_recover(struct tg3 *tp)
> >                     "and include system chipset information.\n");
> > 
> >         spin_lock(&tp->lock);
> > -       tp->tg3_flags |= TG3_FLAG_TX_RECOVERY_PENDING;
> > +       tg3_flag_set(tp, TX_RECOVERY_PENDING);
> >         spin_unlock(&tp->lock);
> >  }
> By using set_bit() now, we can eliminate the spin_lock() here.  This
> flag is checked much later when tg3_reset_task() is scheduled to run in
> workqueue, so no memory barrier is needed either.

Sure, but as a separate patch.

^ permalink raw reply

* RE: rtnetlink and many VFs
From: Ben Hutchings @ 2011-04-21 18:11 UTC (permalink / raw)
  To: Rose, Gregory V; +Cc: David Miller, netdev, sf-linux-drivers
In-Reply-To: <43F901BD926A4E43B106BF17856F07550145FC75D6@orsmsx508.amr.corp.intel.com>

On Thu, 2011-04-21 at 10:50 -0700, Rose, Gregory V wrote:
> > -----Original Message-----
> > From: Ben Hutchings [mailto:bhutchings@solarflare.com]
> > Sent: Thursday, April 21, 2011 10:40 AM
> > To: Rose, Gregory V
> > Cc: David Miller; netdev; sf-linux-drivers
> > Subject: RE: rtnetlink and many VFs
> > 
> > On Thu, 2011-04-21 at 10:02 -0700, Rose, Gregory V wrote:
> > > > -----Original Message-----
> > > > From: netdev-owner@vger.kernel.org [mailto:netdev-
> > owner@vger.kernel.org]
> > > > On Behalf Of Ben Hutchings
> > > > Sent: Thursday, April 21, 2011 7:36 AM
> > > > To: David Miller
> > > > Cc: netdev; sf-linux-drivers
> > > > Subject: rtnetlink and many VFs
> > > >
> > >
> > > As more VFs become possible it really needs a fix.  I was thinking
> > > about something along the lines of this:
> > >
> > > # ip link show eth(x) vf (n)
> > >
> > > Where eth(x) is the physical function that owns the VFs and (n) is the
> > > specific VF you want information for.  That way one could easily
> > > script something that loops through the VFs and gets the information
> > > for each.  This really becomes necessary when we start adding
> > > additional MAC and VLAN filters for each VF that need to be displayed.
> > > In that case you can only show a few VFs before you run out of space.
> > 
> > I think that what 'ip link show' is doing now seems to be perfectly
> > valid.  It allocates a 16K buffer which would be enough if netlink
> > didn't apply this PAGE_SIZE limit to single messages.
> 
> Ah, I hadn't seen that it was allocating 16K, that would then be
> enough for 128 VFs but in the future would not be enough for 40Gig (or
> higher speed) devices that might support two or 4 times that many VFs.

There are only 256 available function addresses per PCIe link, so there
can be at most 255 VFs associated with a single PF.

(The function number could be extended further by effectively assigning
multiple bus numbers to a link, but that would be a significantly more
disruptive change than ARI.)

> I still feel like eventually the number of VFs will outgrow the
> capability of a single message to handle, especially when VFs will
> have the capability of having multiple MAC address and VLAN filters
> assigned to them.  And it seems orthogonal to me to mirror the 'ip
> link set eth(x) vf (n)' syntax with a 'ip link show eth(x) vf (n)'
> syntax.
> 
> That's just me though.
[...]

I think it would be a useful extension, but we have to keep the current
API working as far as possible.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* RE: rtnetlink and many VFs
From: Rose, Gregory V @ 2011-04-21 18:28 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, netdev, sf-linux-drivers
In-Reply-To: <1303409517.3165.39.camel@bwh-desktop>

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> On Behalf Of Ben Hutchings
> Sent: Thursday, April 21, 2011 11:12 AM
> To: Rose, Gregory V
> Cc: David Miller; netdev; sf-linux-drivers
> Subject: RE: rtnetlink and many VFs
> 
> On Thu, 2011-04-21 at 10:50 -0700, Rose, Gregory V wrote:
> > > -----Original Message-----
> > > From: Ben Hutchings [mailto:bhutchings@solarflare.com]
> > > Sent: Thursday, April 21, 2011 10:40 AM
> > > To: Rose, Gregory V
> > > Cc: David Miller; netdev; sf-linux-drivers
> > > Subject: RE: rtnetlink and many VFs
> > >
> 
> > I still feel like eventually the number of VFs will outgrow the
> > capability of a single message to handle, especially when VFs will
> > have the capability of having multiple MAC address and VLAN filters
> > assigned to them.  And it seems orthogonal to me to mirror the 'ip
> > link set eth(x) vf (n)' syntax with a 'ip link show eth(x) vf (n)'
> > syntax.
> >
> > That's just me though.
> [...]
> 
> I think it would be a useful extension, but we have to keep the current
> API working as far as possible.

Sure, sounds good.  I'm  working on something that will patch things up for the present use case right now.

- Greg


^ permalink raw reply

* [PATCH] bonding: fix bridged bonds in 802.3ad mode
From: Jiri Bohac @ 2011-04-21 18:47 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger, Jay Vosburgh, Andy Gospodarek,
	Benjamin Poirier

802.3ad bonding inside a bridge is broken again. Originally fixed by
43aa1920117801fe9ae3d1fad886b62511e09bee, the bug was re-introduced by
1e253c3b8a1aeed51eef6fc366812f219b97de65.

LACP frames must not have their skb->dev changed by the bridging hook.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>

--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1514,6 +1514,11 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
 		memcpy(eth_hdr(skb)->h_dest, bond->dev->dev_addr, ETH_ALEN);
 	}
 
+	/* prevent bridging code from mangling and forwarding LACP frames */
+	if (bond->params.mode == BOND_MODE_8023AD &&
+	    skb->protocol == htons(ETH_P_SLOW))
+		return RX_HANDLER_PASS;
+
 	return RX_HANDLER_ANOTHER;
 }
 
-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply

* Re: [PATCHv4] usbnet: Resubmit interrupt URB once if halted
From: Alan Stern @ 2011-04-21 18:48 UTC (permalink / raw)
  To: Paul Stewart; +Cc: netdev, USB list
In-Reply-To: <BANLkTinhmAfe1V2SvoY+J6Tu_DdnZMoYgw@mail.gmail.com>

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: TEXT/PLAIN; charset=UTF-8, Size: 1715 bytes --]

On Thu, 21 Apr 2011, Paul Stewart wrote:

> > The driver needs better coordination between open/stop and
> > resume/suspend.  The interrupt and receive URBs are supposed to be
> > active whenever the interface is up and not suspended, right?  Which
> > means that usbnet_resume() shouldn't submit anything if the interface
> > isn't up.
> 
> How do we define "up" here (from a network perspective there are many
> ways to interpret that)?  How does this concept compare to the user's
> "ifconfig up/down" state?

I don't know the details of how network drivers are supposed to work.  
But it doesn't matter -- for your purposes you should define "up" to 
mean "whenever the URBs are supposed to be active (unless the interface 
is suspended)".

>  What call do I use in usbnet_resume() to
> tell that the interface isn't up?  Currently I'm using netif_running()
> which responds true in this condition, which is why I'm resorting to
> the flag.

Again, I don't know.  However, the URBs get submitted from within
usbnet_open() and killed within usbnet_stop(), right?  Therefore you
can use any condition which gets set to True in usbnet_open() and set
to False in usbnet_stop().  (If nothing else is suitable, use a flag of
your own.)  And be careful of the edge case: Since usbnet_open() itself
performs a resume operation, you need to make sure the resume takes
place before the condition becomes True -- otherwise the URBs will get
submitted twice.

One more thing to keep in mind: If the kernel is built without PM
support, the resume and suspend routines will never get called.  
Therefore they must not be the only places where URBs are submitted and
killed.

Alan Stern

^ permalink raw reply

* Re: [PATCH v5] net: bnx2x: convert to hw_features
From: Vladislav Zolotarov @ 2011-04-21 18:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michał Mirosław, netdev@vger.kernel.org,
	Eilon Greenstein
In-Reply-To: <1303397531.3685.16.camel@edumazet-laptop>

On Thu, 2011-04-21 at 07:52 -0700, Eric Dumazet wrote:
> Le mardi 12 avril 2011  21:38 +0200, Micha Mirosaw a crit :
> > Since ndo_fix_features callback is postponing features change when
> > bp->recovery_state != BNX2X_RECOVERY_DONE, netdev_update_features()
> > has to be called again when this condition changes. Previously,
> > ethtool_ops->set_flags callback returned -EBUSY in that case
> > (it's not possible in the new model).
> > 
> > Signed-off-by: Micha Mirosaw <mirq-linux@rere.qmqm.pl>
> > 
> > v5: - don't delay set_features, as it's rtnl_locked - same as recovery process
> > v4: - complete bp->rx_csum -> NETIF_F_RXCSUM conversion
> >     - add check for failed ndo_set_features in ndo_open callback
> > v3: - include NETIF_F_LRO in hw_features
> >     - don't call netdev_update_features() if bnx2x_nic_load() failed
> > v2: - comment in ndo_fix_features callback
> > ---
> 
> Hi guys
> 
> I am not sure its related to these changes, but I now have in
> net-next-2.6 :
> 
> [   23.674263] ------------[ cut here ]------------
> [   23.674266] WARNING: at net/core/dev.c:1318 dev_disable_lro+0x83/0x90()
> [   23.674270] Hardware name: ProLiant BL460c G6
> [   23.674273] Modules linked in: tg3 libphy sg
> [   23.674280] Pid: 3070, comm: sysctl Tainted: G        W   2.6.39-rc2-01242-g3ef22b9-dirty #669
> [   23.674282] Call Trace:
> [   23.674285]  [<ffffffff813b94f3>] ? dev_disable_lro+0x83/0x90
> [   23.674291]  [<ffffffff81042c9b>] warn_slowpath_common+0x8b/0xc0
> [   23.674298]  [<ffffffff81042ce5>] warn_slowpath_null+0x15/0x20
> [   23.674304]  [<ffffffff813b94f3>] dev_disable_lro+0x83/0x90
> [   23.674309]  [<ffffffff81429789>] devinet_sysctl_forward+0x199/0x210
> [   23.674313]  [<ffffffff814296e4>] ? devinet_sysctl_forward+0xf4/0x210
> [   23.674318]  [<ffffffff8104e712>] ? capable+0x12/0x20
> [   23.674324]  [<ffffffff81168f45>] proc_sys_call_handler+0xb5/0xd0
> [   23.674328]  [<ffffffff81168f6f>] proc_sys_write+0xf/0x20
> [   23.674334]  [<ffffffff81105f39>] vfs_write+0xc9/0x170
> [   23.674339]  [<ffffffff81106550>] sys_write+0x50/0x90
> [   23.674345]  [<ffffffff814b95a0>] sysenter_dispatch+0x7/0x33
> [   23.674350] ---[ end trace 051ec497c66b228e ]---
> 
> Thanks
> 

More than that, in addition it is impossible to disable the LRO with 
the current bnx2x upstream driver (ethtool -K ethX lro off) and this is
because dev_disable_lro passes to __ethtool_set_flags() flags based on
the current value of dev->features while __ethtool_set_flags() expects
only the flags set in dev->hw_features. bnx2x has NETIF_F_HW_VLAN_RX
that is set in dev->features and not set in dev->hw_features and it's
passed down to the __ethtool_set_flags(). 

Regarding the 'ethtool -K ethX lro off' I noticed that there is the same
problem: the flags that are passed to the __ethtool_set_flags() include
NETIF_F_HW_VLAN_RX. The userspace app gets the flags set in
dev->features to be able to display them, then clears the needed bits
and passes the resulting value down to kernel. So, I don't know how to
fix this without changing __ethtool_set_flags() not to check the bits or
by the taking care of the bit mask before it's passed to the
__ethtool_set_flags(). Below is the implementation of the second
option: 

diff --git a/net/core/dev.c b/net/core/dev.c
index 3871bf6..fd9175b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1314,7 +1314,7 @@ void dev_disable_lro(struct net_device *dev)
 	if (!(flags & ETH_FLAG_LRO))
 		return;
 
-	__ethtool_set_flags(dev, flags & ~ETH_FLAG_LRO);
+	__ethtool_set_flags(dev, flags & ~ETH_FLAG_LRO & dev->hw_features);
 	WARN_ON(dev->features & NETIF_F_LRO);
 }
 EXPORT_SYMBOL(dev_disable_lro);
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 13d79f5..bbf75fc 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1810,7 +1810,7 @@ static int ethtool_set_value(struct net_device *dev, char __user *useraddr,
 	if (copy_from_user(&edata, useraddr, sizeof(edata)))
 		return -EFAULT;
 
-	return actor(dev, edata.data);
+	return actor(dev, edata.data & dev->hw_features);
 }
 
 static noinline_for_stack int ethtool_flash_device(struct net_device *dev,

Pls., comment.

Thanks,
vlad



> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 




^ permalink raw reply related

* Re: [PATCH] bonding: fix bridged bonds in 802.3ad mode
From: Ben Hutchings @ 2011-04-21 19:08 UTC (permalink / raw)
  To: Jiri Bohac
  Cc: netdev, Stephen Hemminger, Jay Vosburgh, Andy Gospodarek,
	Benjamin Poirier
In-Reply-To: <20110421184748.GD23756@midget.suse.cz>

On Thu, 2011-04-21 at 20:47 +0200, Jiri Bohac wrote:
> 802.3ad bonding inside a bridge is broken again. Originally fixed by
> 43aa1920117801fe9ae3d1fad886b62511e09bee, the bug was re-introduced by
> 1e253c3b8a1aeed51eef6fc366812f219b97de65.
> 
> LACP frames must not have their skb->dev changed by the bridging hook.
> 
> Signed-off-by: Jiri Bohac <jbohac@suse.cz>
> 
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -1514,6 +1514,11 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
>  		memcpy(eth_hdr(skb)->h_dest, bond->dev->dev_addr, ETH_ALEN);
>  	}
>  
> +	/* prevent bridging code from mangling and forwarding LACP frames */
> +	if (bond->params.mode == BOND_MODE_8023AD &&
> +	    skb->protocol == htons(ETH_P_SLOW))
> +		return RX_HANDLER_PASS;
> +
>  	return RX_HANDLER_ANOTHER;
>  }
>  

It seems to me that 1e253c3b8a1aeed51eef6fc366812f219b97de65 is bogus
and should be reverted, rather than worked around by other drivers.  We
shouldn't enable non-conformant forwarding behaviour by default just
because some people find it useful.  The administrator should have to
explicitly enable it.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH v5] net: bnx2x: convert to hw_features
From: Ben Hutchings @ 2011-04-21 19:19 UTC (permalink / raw)
  To: Vladislav Zolotarov
  Cc: Eric Dumazet, Michał Mirosław, netdev@vger.kernel.org,
	Eilon Greenstein
In-Reply-To: <1303411750.19212.123.camel@lb-tlvb-vladz>

On Thu, 2011-04-21 at 21:49 +0300, Vladislav Zolotarov wrote:
[...]
> More than that, in addition it is impossible to disable the LRO with 
> the current bnx2x upstream driver (ethtool -K ethX lro off) and this is
> because dev_disable_lro passes to __ethtool_set_flags() flags based on
> the current value of dev->features while __ethtool_set_flags() expects
> only the flags set in dev->hw_features. bnx2x has NETIF_F_HW_VLAN_RX
> that is set in dev->features and not set in dev->hw_features and it's
> passed down to the __ethtool_set_flags(). 
[...]

The problem is here in register_netdevice():

	/* Transfer changeable features to wanted_features and enable
	 * software offloads (GSO and GRO).
	 */
	dev->hw_features |= NETIF_F_SOFT_FEATURES;
	dev->features |= NETIF_F_SOFT_FEATURES;
	dev->wanted_features = dev->features & dev->hw_features;

This doesn't work correctly for features that are always enabled, like
NETIF_F_HW_VLAN_RX in bnx2x, which are set in dev->features but not in
dev->hw_features.

The name 'hw_features' really wasn't a good choice - the obvious meaning
and the meaning assumed by this code is 'hardware-supported features'
and not 'hardware-supported features that can be toggled'.  And since we
add NETIF_F_SOFT_FEATURES, it really only means 'features that can be
toggled'.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH] bonding: fix bridged bonds in 802.3ad mode
From: Jiri Bohac @ 2011-04-21 19:27 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Jiri Bohac, netdev, Stephen Hemminger, Jay Vosburgh,
	Andy Gospodarek, Benjamin Poirier
In-Reply-To: <1303412899.3165.47.camel@bwh-desktop>

On Thu, Apr 21, 2011 at 08:08:19PM +0100, Ben Hutchings wrote:
> It seems to me that 1e253c3b8a1aeed51eef6fc366812f219b97de65 is bogus
> and should be reverted, rather than worked around by other drivers.  We
> shouldn't enable non-conformant forwarding behaviour by default just
> because some people find it useful.  The administrator should have to
> explicitly enable it.

This is what I thought as well. I find it even more awkward to 
make this behaviour dependend on the STP setting of the bridge
(turning on STP works around this bonding problem, btw).

But even if forwarding of link-local frames is made optional in
some way, it will still break bonding when turned on. So unless
this it is completely reverted, we still need some kind of fix
for bonding.

Btw, this could also be fixed by checking for 
skb->protocol == htons(ETH_P_SLOW) directly in the bridging code.
However, I think the wonderful rx_handler infrastructure makes it
much cleaner this way...

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ

^ permalink raw reply

* [PATCH] inet: add RCU protection to inet->opt
From: Eric Dumazet @ 2011-04-21 19:45 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, netdev
In-Reply-To: <1302881994.3613.34.camel@edumazet-laptop>

Le vendredi 15 avril 2011 à 17:39 +0200, Eric Dumazet a écrit :
> In commit 903ab86d19 (udp: Add lockless transmit path), we added a
> fastpath to avoid taking socket lock if we dont use corking.
> 
> Prior work were commit 1c32c5ad6fac8c (inet: Add ip_make_skb and
> ip_finish_skb) and commit 1470ddf7f8cecf776921e5 (inet: Remove explicit
> write references to sk/inet in ip_append_data)
> 
> Problem is ip_make_skb() calls ip_setup_cork() and
> ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
> without any protection against another thread manipulating inet->opt.
> 
> Another thread can change inet->opt pointer and free old one... kaboom.
> 
> This was discovered by code analysis (I am trying to remove the zeroing
> of cork variable in ip_make_skb(), since its a bit expensive and
> probably useless)
> 
> Note : race was there before Herbert patches.
> 
> My plan is to add RCU protection on inet->opt, unless someone has better
> idea ?
> 
> 

OK, here is the patch I cooked to address this problem, on top of
net-next-2.6

[PATCH] inet: add RCU protection to inet->opt

We lack proper synchronization to manipulate inet->opt ip_options

Problem is ip_make_skb() calls ip_setup_cork() and
ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
without any protection against another thread manipulating inet->opt.

Another thread can change inet->opt pointer and free old one under us.

Use RCU to protect inet->opt (changed to inet->inet_opt).

Instead of handling atomic refcounts, just copy ip_options when
necessary, to avoid cache line dirtying.

We cant insert an rcu_head in struct ip_options since its included in
skb->cb[], so this patch is large because I had to introduce a new
ip_options_rcu structure.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
---
 include/net/inet_sock.h         |   14 ++-
 include/net/ip.h                |   11 +-
 net/dccp/ipv4.c                 |   16 ++--
 net/dccp/ipv6.c                 |    2 
 net/ipv4/af_inet.c              |   17 +++-
 net/ipv4/cipso_ipv4.c           |  113 ++++++++++++++++--------------
 net/ipv4/icmp.c                 |   23 ++----
 net/ipv4/inet_connection_sock.c |    6 -
 net/ipv4/ip_options.c           |   38 ++++------
 net/ipv4/ip_output.c            |   44 +++++------
 net/ipv4/ip_sockglue.c          |   35 ++++++---
 net/ipv4/raw.c                  |   20 ++++-
 net/ipv4/syncookies.c           |    4 -
 net/ipv4/tcp_ipv4.c             |   34 +++++----
 net/ipv4/udp.c                  |   22 ++++-
 net/ipv6/tcp_ipv6.c             |    2 
 16 files changed, 236 insertions(+), 165 deletions(-)

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 7a37369..ed2ba6e 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -57,7 +57,15 @@ struct ip_options {
 	unsigned char	__data[0];
 };
 
-#define optlength(opt) (sizeof(struct ip_options) + opt->optlen)
+struct ip_options_rcu {
+	struct rcu_head rcu;
+	struct ip_options opt;
+};
+
+struct ip_options_data {
+	struct ip_options_rcu	opt;
+	char			data[40];
+};
 
 struct inet_request_sock {
 	struct request_sock	req;
@@ -78,7 +86,7 @@ struct inet_request_sock {
 				acked	   : 1,
 				no_srccheck: 1;
 	kmemcheck_bitfield_end(flags);
-	struct ip_options	*opt;
+	struct ip_options_rcu	*opt;
 };
 
 static inline struct inet_request_sock *inet_rsk(const struct request_sock *sk)
@@ -140,7 +148,7 @@ struct inet_sock {
 	__be16			inet_sport;
 	__u16			inet_id;
 
-	struct ip_options	*opt;
+	struct ip_options_rcu __rcu	*inet_opt;
 	__u8			tos;
 	__u8			min_ttl;
 	__u8			mc_ttl;
diff --git a/include/net/ip.h b/include/net/ip.h
index 7c41658..3a59bf9 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -52,7 +52,7 @@ static inline unsigned int ip_hdrlen(const struct sk_buff *skb)
 struct ipcm_cookie {
 	__be32			addr;
 	int			oif;
-	struct ip_options	*opt;
+	struct ip_options_rcu	*opt;
 	__u8			tx_flags;
 };
 
@@ -92,7 +92,7 @@ extern int		igmp_mc_proc_init(void);
 
 extern int		ip_build_and_send_pkt(struct sk_buff *skb, struct sock *sk,
 					      __be32 saddr, __be32 daddr,
-					      struct ip_options *opt);
+					      struct ip_options_rcu *opt);
 extern int		ip_rcv(struct sk_buff *skb, struct net_device *dev,
 			       struct packet_type *pt, struct net_device *orig_dev);
 extern int		ip_local_deliver(struct sk_buff *skb);
@@ -416,14 +416,15 @@ extern int ip_forward(struct sk_buff *skb);
  *	Functions provided by ip_options.c
  */
  
-extern void ip_options_build(struct sk_buff *skb, struct ip_options *opt, __be32 daddr, struct rtable *rt, int is_frag);
+extern void ip_options_build(struct sk_buff *skb, struct ip_options *opt,
+			     __be32 daddr, struct rtable *rt, int is_frag);
 extern int ip_options_echo(struct ip_options *dopt, struct sk_buff *skb);
 extern void ip_options_fragment(struct sk_buff *skb);
 extern int ip_options_compile(struct net *net,
 			      struct ip_options *opt, struct sk_buff *skb);
-extern int ip_options_get(struct net *net, struct ip_options **optp,
+extern int ip_options_get(struct net *net, struct ip_options_rcu **optp,
 			  unsigned char *data, int optlen);
-extern int ip_options_get_from_user(struct net *net, struct ip_options **optp,
+extern int ip_options_get_from_user(struct net *net, struct ip_options_rcu **optp,
 				    unsigned char __user *data, int optlen);
 extern void ip_options_undo(struct ip_options * opt);
 extern void ip_forward_options(struct sk_buff *skb);
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index ae451c6..1f54ee2 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -47,6 +47,7 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	struct rtable *rt;
 	__be32 daddr, nexthop;
 	int err;
+	struct ip_options_rcu *inet_opt;
 
 	dp->dccps_role = DCCP_ROLE_CLIENT;
 
@@ -57,10 +58,13 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 		return -EAFNOSUPPORT;
 
 	nexthop = daddr = usin->sin_addr.s_addr;
-	if (inet->opt != NULL && inet->opt->srr) {
+
+	inet_opt = rcu_dereference_protected(inet->inet_opt,
+					     sock_owned_by_user(sk));
+	if (inet_opt != NULL && inet_opt->opt.srr) {
 		if (daddr == 0)
 			return -EINVAL;
-		nexthop = inet->opt->faddr;
+		nexthop = inet_opt->opt.faddr;
 	}
 
 	orig_sport = inet->inet_sport;
@@ -77,7 +81,7 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 		return -ENETUNREACH;
 	}
 
-	if (inet->opt == NULL || !inet->opt->srr)
+	if (inet_opt == NULL || !inet_opt->opt.srr)
 		daddr = rt->rt_dst;
 
 	if (inet->inet_saddr == 0)
@@ -88,8 +92,8 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	inet->inet_daddr = daddr;
 
 	inet_csk(sk)->icsk_ext_hdr_len = 0;
-	if (inet->opt != NULL)
-		inet_csk(sk)->icsk_ext_hdr_len = inet->opt->optlen;
+	if (inet_opt)
+		inet_csk(sk)->icsk_ext_hdr_len = inet_opt->opt.optlen;
 	/*
 	 * Socket identity is still unknown (sport may be zero).
 	 * However we set state to DCCP_REQUESTING and not releasing socket
@@ -405,7 +409,7 @@ struct sock *dccp_v4_request_recv_sock(struct sock *sk, struct sk_buff *skb,
 	newinet->inet_daddr	= ireq->rmt_addr;
 	newinet->inet_rcv_saddr = ireq->loc_addr;
 	newinet->inet_saddr	= ireq->loc_addr;
-	newinet->opt	   = ireq->opt;
+	newinet->inet_opt	= ireq->opt;
 	ireq->opt	   = NULL;
 	newinet->mc_index  = inet_iif(skb);
 	newinet->mc_ttl	   = ip_hdr(skb)->ttl;
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index de1b7e3..43eba99 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -573,7 +573,7 @@ static struct sock *dccp_v6_request_recv_sock(struct sock *sk,
 
 	   First: no IPv4 options.
 	 */
-	newinet->opt = NULL;
+	newinet->inet_opt = NULL;
 
 	/* Clone RX bits */
 	newnp->rxopt.all = np->rxopt.all;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 807d83c..01ae9b5 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -153,7 +153,7 @@ void inet_sock_destruct(struct sock *sk)
 	WARN_ON(sk->sk_wmem_queued);
 	WARN_ON(sk->sk_forward_alloc);
 
-	kfree(inet->opt);
+	kfree(rcu_dereference_protected(inet->inet_opt, 1));
 	dst_release(rcu_dereference_check(sk->sk_dst_cache, 1));
 	sk_refcnt_debug_dec(sk);
 }
@@ -1105,9 +1105,12 @@ static int inet_sk_reselect_saddr(struct sock *sk)
 	__be32 daddr = inet->inet_daddr;
 	struct rtable *rt;
 	__be32 new_saddr;
+	struct ip_options_rcu *inet_opt;
 
-	if (inet->opt && inet->opt->srr)
-		daddr = inet->opt->faddr;
+	inet_opt = rcu_dereference_protected(inet->inet_opt,
+					     sock_owned_by_user(sk));
+	if (inet_opt && inet_opt->opt.srr)
+		daddr = inet_opt->opt.faddr;
 
 	/* Query new route. */
 	rt = ip_route_connect(daddr, 0, RT_CONN_FLAGS(sk),
@@ -1147,6 +1150,7 @@ int inet_sk_rebuild_header(struct sock *sk)
 	struct inet_sock *inet = inet_sk(sk);
 	struct rtable *rt = (struct rtable *)__sk_dst_check(sk, 0);
 	__be32 daddr;
+	struct ip_options_rcu *inet_opt;
 	int err;
 
 	/* Route is OK, nothing to do. */
@@ -1154,9 +1158,12 @@ int inet_sk_rebuild_header(struct sock *sk)
 		return 0;
 
 	/* Reroute. */
+	rcu_read_lock();
+	inet_opt = rcu_dereference(inet->inet_opt);
 	daddr = inet->inet_daddr;
-	if (inet->opt && inet->opt->srr)
-		daddr = inet->opt->faddr;
+	if (inet_opt && inet_opt->opt.srr)
+		daddr = inet_opt->opt.faddr;
+	rcu_read_unlock();
 	rt = ip_route_output_ports(sock_net(sk), sk, daddr, inet->inet_saddr,
 				   inet->inet_dport, inet->inet_sport,
 				   sk->sk_protocol, RT_CONN_FLAGS(sk),
diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
index a0af7ea..2b3c23c 100644
--- a/net/ipv4/cipso_ipv4.c
+++ b/net/ipv4/cipso_ipv4.c
@@ -1857,6 +1857,11 @@ static int cipso_v4_genopt(unsigned char *buf, u32 buf_len,
 	return CIPSO_V4_HDR_LEN + ret_val;
 }
 
+static void opt_kfree_rcu(struct rcu_head *head)
+{
+	kfree(container_of(head, struct ip_options_rcu, rcu));
+}
+
 /**
  * cipso_v4_sock_setattr - Add a CIPSO option to a socket
  * @sk: the socket
@@ -1879,7 +1884,7 @@ int cipso_v4_sock_setattr(struct sock *sk,
 	unsigned char *buf = NULL;
 	u32 buf_len;
 	u32 opt_len;
-	struct ip_options *opt = NULL;
+	struct ip_options_rcu *old, *opt = NULL;
 	struct inet_sock *sk_inet;
 	struct inet_connection_sock *sk_conn;
 
@@ -1915,22 +1920,25 @@ int cipso_v4_sock_setattr(struct sock *sk,
 		ret_val = -ENOMEM;
 		goto socket_setattr_failure;
 	}
-	memcpy(opt->__data, buf, buf_len);
-	opt->optlen = opt_len;
-	opt->cipso = sizeof(struct iphdr);
+	memcpy(opt->opt.__data, buf, buf_len);
+	opt->opt.optlen = opt_len;
+	opt->opt.cipso = sizeof(struct iphdr);
 	kfree(buf);
 	buf = NULL;
 
 	sk_inet = inet_sk(sk);
+
+	old = rcu_dereference_protected(sk_inet->inet_opt, sock_owned_by_user(sk));
 	if (sk_inet->is_icsk) {
 		sk_conn = inet_csk(sk);
-		if (sk_inet->opt)
-			sk_conn->icsk_ext_hdr_len -= sk_inet->opt->optlen;
-		sk_conn->icsk_ext_hdr_len += opt->optlen;
+		if (old)
+			sk_conn->icsk_ext_hdr_len -= old->opt.optlen;
+		sk_conn->icsk_ext_hdr_len += opt->opt.optlen;
 		sk_conn->icsk_sync_mss(sk, sk_conn->icsk_pmtu_cookie);
 	}
-	opt = xchg(&sk_inet->opt, opt);
-	kfree(opt);
+	rcu_assign_pointer(sk_inet->inet_opt, opt);
+	if (old)
+		call_rcu(&old->rcu, opt_kfree_rcu);
 
 	return 0;
 
@@ -1960,7 +1968,7 @@ int cipso_v4_req_setattr(struct request_sock *req,
 	unsigned char *buf = NULL;
 	u32 buf_len;
 	u32 opt_len;
-	struct ip_options *opt = NULL;
+	struct ip_options_rcu *opt = NULL;
 	struct inet_request_sock *req_inet;
 
 	/* We allocate the maximum CIPSO option size here so we are probably
@@ -1988,15 +1996,16 @@ int cipso_v4_req_setattr(struct request_sock *req,
 		ret_val = -ENOMEM;
 		goto req_setattr_failure;
 	}
-	memcpy(opt->__data, buf, buf_len);
-	opt->optlen = opt_len;
-	opt->cipso = sizeof(struct iphdr);
+	memcpy(opt->opt.__data, buf, buf_len);
+	opt->opt.optlen = opt_len;
+	opt->opt.cipso = sizeof(struct iphdr);
 	kfree(buf);
 	buf = NULL;
 
 	req_inet = inet_rsk(req);
 	opt = xchg(&req_inet->opt, opt);
-	kfree(opt);
+	if (opt)
+		call_rcu(&opt->rcu, opt_kfree_rcu);
 
 	return 0;
 
@@ -2016,34 +2025,34 @@ req_setattr_failure:
  * values on failure.
  *
  */
-static int cipso_v4_delopt(struct ip_options **opt_ptr)
+static int cipso_v4_delopt(struct ip_options_rcu **opt_ptr)
 {
 	int hdr_delta = 0;
-	struct ip_options *opt = *opt_ptr;
+	struct ip_options_rcu *opt = *opt_ptr;
 
-	if (opt->srr || opt->rr || opt->ts || opt->router_alert) {
+	if (opt->opt.srr || opt->opt.rr || opt->opt.ts || opt->opt.router_alert) {
 		u8 cipso_len;
 		u8 cipso_off;
 		unsigned char *cipso_ptr;
 		int iter;
 		int optlen_new;
 
-		cipso_off = opt->cipso - sizeof(struct iphdr);
-		cipso_ptr = &opt->__data[cipso_off];
+		cipso_off = opt->opt.cipso - sizeof(struct iphdr);
+		cipso_ptr = &opt->opt.__data[cipso_off];
 		cipso_len = cipso_ptr[1];
 
-		if (opt->srr > opt->cipso)
-			opt->srr -= cipso_len;
-		if (opt->rr > opt->cipso)
-			opt->rr -= cipso_len;
-		if (opt->ts > opt->cipso)
-			opt->ts -= cipso_len;
-		if (opt->router_alert > opt->cipso)
-			opt->router_alert -= cipso_len;
-		opt->cipso = 0;
+		if (opt->opt.srr > opt->opt.cipso)
+			opt->opt.srr -= cipso_len;
+		if (opt->opt.rr > opt->opt.cipso)
+			opt->opt.rr -= cipso_len;
+		if (opt->opt.ts > opt->opt.cipso)
+			opt->opt.ts -= cipso_len;
+		if (opt->opt.router_alert > opt->opt.cipso)
+			opt->opt.router_alert -= cipso_len;
+		opt->opt.cipso = 0;
 
 		memmove(cipso_ptr, cipso_ptr + cipso_len,
-			opt->optlen - cipso_off - cipso_len);
+			opt->opt.optlen - cipso_off - cipso_len);
 
 		/* determining the new total option length is tricky because of
 		 * the padding necessary, the only thing i can think to do at
@@ -2052,21 +2061,21 @@ static int cipso_v4_delopt(struct ip_options **opt_ptr)
 		 * from there we can determine the new total option length */
 		iter = 0;
 		optlen_new = 0;
-		while (iter < opt->optlen)
-			if (opt->__data[iter] != IPOPT_NOP) {
-				iter += opt->__data[iter + 1];
+		while (iter < opt->opt.optlen)
+			if (opt->opt.__data[iter] != IPOPT_NOP) {
+				iter += opt->opt.__data[iter + 1];
 				optlen_new = iter;
 			} else
 				iter++;
-		hdr_delta = opt->optlen;
-		opt->optlen = (optlen_new + 3) & ~3;
-		hdr_delta -= opt->optlen;
+		hdr_delta = opt->opt.optlen;
+		opt->opt.optlen = (optlen_new + 3) & ~3;
+		hdr_delta -= opt->opt.optlen;
 	} else {
 		/* only the cipso option was present on the socket so we can
 		 * remove the entire option struct */
 		*opt_ptr = NULL;
-		hdr_delta = opt->optlen;
-		kfree(opt);
+		hdr_delta = opt->opt.optlen;
+		call_rcu(&opt->rcu, opt_kfree_rcu);
 	}
 
 	return hdr_delta;
@@ -2083,15 +2092,15 @@ static int cipso_v4_delopt(struct ip_options **opt_ptr)
 void cipso_v4_sock_delattr(struct sock *sk)
 {
 	int hdr_delta;
-	struct ip_options *opt;
+	struct ip_options_rcu *opt;
 	struct inet_sock *sk_inet;
 
 	sk_inet = inet_sk(sk);
-	opt = sk_inet->opt;
-	if (opt == NULL || opt->cipso == 0)
+	opt = rcu_dereference_protected(sk_inet->inet_opt, 1);
+	if (opt == NULL || opt->opt.cipso == 0)
 		return;
 
-	hdr_delta = cipso_v4_delopt(&sk_inet->opt);
+	hdr_delta = cipso_v4_delopt(&sk_inet->inet_opt);
 	if (sk_inet->is_icsk && hdr_delta > 0) {
 		struct inet_connection_sock *sk_conn = inet_csk(sk);
 		sk_conn->icsk_ext_hdr_len -= hdr_delta;
@@ -2109,12 +2118,12 @@ void cipso_v4_sock_delattr(struct sock *sk)
  */
 void cipso_v4_req_delattr(struct request_sock *req)
 {
-	struct ip_options *opt;
+	struct ip_options_rcu *opt;
 	struct inet_request_sock *req_inet;
 
 	req_inet = inet_rsk(req);
 	opt = req_inet->opt;
-	if (opt == NULL || opt->cipso == 0)
+	if (opt == NULL || opt->opt.cipso == 0)
 		return;
 
 	cipso_v4_delopt(&req_inet->opt);
@@ -2184,14 +2193,18 @@ getattr_return:
  */
 int cipso_v4_sock_getattr(struct sock *sk, struct netlbl_lsm_secattr *secattr)
 {
-	struct ip_options *opt;
+	struct ip_options_rcu *opt;
+	int res = -ENOMSG;
 
-	opt = inet_sk(sk)->opt;
-	if (opt == NULL || opt->cipso == 0)
-		return -ENOMSG;
-
-	return cipso_v4_getattr(opt->__data + opt->cipso - sizeof(struct iphdr),
-				secattr);
+	rcu_read_lock();
+	opt = rcu_dereference(inet_sk(sk)->inet_opt);
+	if (opt && opt->opt.cipso)
+		res = cipso_v4_getattr(opt->opt.__data +
+						opt->opt.cipso -
+						sizeof(struct iphdr),
+				       secattr);
+	rcu_read_unlock();
+	return res;
 }
 
 /**
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index e5f8a71..831e7dc 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -108,8 +108,7 @@ struct icmp_bxm {
 		__be32	       times[3];
 	} data;
 	int head_len;
-	struct ip_options replyopts;
-	unsigned char  optbuf[40];
+	struct ip_options_data replyopts;
 };
 
 /* An array of errno for error messages from dest unreach. */
@@ -333,7 +332,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	struct inet_sock *inet;
 	__be32 daddr;
 
-	if (ip_options_echo(&icmp_param->replyopts, skb))
+	if (ip_options_echo(&icmp_param->replyopts.opt.opt, skb))
 		return;
 
 	sk = icmp_xmit_lock(net);
@@ -347,10 +346,10 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	daddr = ipc.addr = rt->rt_src;
 	ipc.opt = NULL;
 	ipc.tx_flags = 0;
-	if (icmp_param->replyopts.optlen) {
-		ipc.opt = &icmp_param->replyopts;
-		if (ipc.opt->srr)
-			daddr = icmp_param->replyopts.faddr;
+	if (icmp_param->replyopts.opt.opt.optlen) {
+		ipc.opt = &icmp_param->replyopts.opt;
+		if (ipc.opt->opt.srr)
+			daddr = icmp_param->replyopts.opt.opt.faddr;
 	}
 	{
 		struct flowi4 fl4 = {
@@ -379,8 +378,8 @@ static struct rtable *icmp_route_lookup(struct net *net, struct sk_buff *skb_in,
 					struct icmp_bxm *param)
 {
 	struct flowi4 fl4 = {
-		.daddr = (param->replyopts.srr ?
-			  param->replyopts.faddr : iph->saddr),
+		.daddr = (param->replyopts.opt.opt.srr ?
+			  param->replyopts.opt.opt.faddr : iph->saddr),
 		.saddr = saddr,
 		.flowi4_tos = RT_TOS(tos),
 		.flowi4_proto = IPPROTO_ICMP,
@@ -581,7 +580,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 					   IPTOS_PREC_INTERNETCONTROL) :
 					  iph->tos;
 
-	if (ip_options_echo(&icmp_param.replyopts, skb_in))
+	if (ip_options_echo(&icmp_param.replyopts.opt.opt, skb_in))
 		goto out_unlock;
 
 
@@ -597,7 +596,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 	icmp_param.offset = skb_network_offset(skb_in);
 	inet_sk(sk)->tos = tos;
 	ipc.addr = iph->saddr;
-	ipc.opt = &icmp_param.replyopts;
+	ipc.opt = &icmp_param.replyopts.opt;
 	ipc.tx_flags = 0;
 
 	rt = icmp_route_lookup(net, skb_in, iph, saddr, tos,
@@ -613,7 +612,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 	room = dst_mtu(&rt->dst);
 	if (room > 576)
 		room = 576;
-	room -= sizeof(struct iphdr) + icmp_param.replyopts.optlen;
+	room -= sizeof(struct iphdr) + icmp_param.replyopts.opt.opt.optlen;
 	room -= sizeof(struct icmphdr);
 
 	icmp_param.data_len = skb_in->len - icmp_param.offset;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 8514db5..3282cb2 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -354,20 +354,20 @@ struct dst_entry *inet_csk_route_req(struct sock *sk,
 {
 	struct rtable *rt;
 	const struct inet_request_sock *ireq = inet_rsk(req);
-	struct ip_options *opt = inet_rsk(req)->opt;
+	struct ip_options_rcu *opt = inet_rsk(req)->opt;
 	struct net *net = sock_net(sk);
 	struct flowi4 fl4;
 
 	flowi4_init_output(&fl4, sk->sk_bound_dev_if, sk->sk_mark,
 			   RT_CONN_FLAGS(sk), RT_SCOPE_UNIVERSE,
 			   sk->sk_protocol, inet_sk_flowi_flags(sk),
-			   (opt && opt->srr) ? opt->faddr : ireq->rmt_addr,
+			   (opt && opt->opt.srr) ? opt->opt.faddr : ireq->rmt_addr,
 			   ireq->loc_addr, ireq->rmt_port, inet_sk(sk)->inet_sport);
 	security_req_classify_flow(req, flowi4_to_flowi(&fl4));
 	rt = ip_route_output_flow(net, &fl4, sk);
 	if (IS_ERR(rt))
 		goto no_route;
-	if (opt && opt->is_strictroute && rt->rt_dst != rt->rt_gateway)
+	if (opt && opt->opt.is_strictroute && rt->rt_dst != rt->rt_gateway)
 		goto route_err;
 	return &rt->dst;
 
diff --git a/net/ipv4/ip_options.c b/net/ipv4/ip_options.c
index 2391b24..01fc409 100644
--- a/net/ipv4/ip_options.c
+++ b/net/ipv4/ip_options.c
@@ -36,7 +36,7 @@
  * saddr is address of outgoing interface.
  */
 
-void ip_options_build(struct sk_buff * skb, struct ip_options * opt,
+void ip_options_build(struct sk_buff *skb, struct ip_options *opt,
 			    __be32 daddr, struct rtable *rt, int is_frag)
 {
 	unsigned char *iph = skb_network_header(skb);
@@ -83,9 +83,9 @@ void ip_options_build(struct sk_buff * skb, struct ip_options * opt,
  * NOTE: dopt cannot point to skb.
  */
 
-int ip_options_echo(struct ip_options * dopt, struct sk_buff * skb)
+int ip_options_echo(struct ip_options *dopt, struct sk_buff *skb)
 {
-	struct ip_options *sopt;
+	const struct ip_options *sopt;
 	unsigned char *sptr, *dptr;
 	int soffset, doffset;
 	int	optlen;
@@ -95,10 +95,8 @@ int ip_options_echo(struct ip_options * dopt, struct sk_buff * skb)
 
 	sopt = &(IPCB(skb)->opt);
 
-	if (sopt->optlen == 0) {
-		dopt->optlen = 0;
+	if (sopt->optlen == 0)
 		return 0;
-	}
 
 	sptr = skb_network_header(skb);
 	dptr = dopt->__data;
@@ -157,7 +155,7 @@ int ip_options_echo(struct ip_options * dopt, struct sk_buff * skb)
 		dopt->optlen += optlen;
 	}
 	if (sopt->srr) {
-		unsigned char * start = sptr+sopt->srr;
+		unsigned char *start = sptr+sopt->srr;
 		__be32 faddr;
 
 		optlen  = start[1];
@@ -499,19 +497,19 @@ void ip_options_undo(struct ip_options * opt)
 	}
 }
 
-static struct ip_options *ip_options_get_alloc(const int optlen)
+static struct ip_options_rcu *ip_options_get_alloc(const int optlen)
 {
-	return kzalloc(sizeof(struct ip_options) + ((optlen + 3) & ~3),
+	return kzalloc(sizeof(struct ip_options_rcu) + ((optlen + 3) & ~3),
 		       GFP_KERNEL);
 }
 
-static int ip_options_get_finish(struct net *net, struct ip_options **optp,
-				 struct ip_options *opt, int optlen)
+static int ip_options_get_finish(struct net *net, struct ip_options_rcu **optp,
+				 struct ip_options_rcu *opt, int optlen)
 {
 	while (optlen & 3)
-		opt->__data[optlen++] = IPOPT_END;
-	opt->optlen = optlen;
-	if (optlen && ip_options_compile(net, opt, NULL)) {
+		opt->opt.__data[optlen++] = IPOPT_END;
+	opt->opt.optlen = optlen;
+	if (optlen && ip_options_compile(net, &opt->opt, NULL)) {
 		kfree(opt);
 		return -EINVAL;
 	}
@@ -520,29 +518,29 @@ static int ip_options_get_finish(struct net *net, struct ip_options **optp,
 	return 0;
 }
 
-int ip_options_get_from_user(struct net *net, struct ip_options **optp,
+int ip_options_get_from_user(struct net *net, struct ip_options_rcu **optp,
 			     unsigned char __user *data, int optlen)
 {
-	struct ip_options *opt = ip_options_get_alloc(optlen);
+	struct ip_options_rcu *opt = ip_options_get_alloc(optlen);
 
 	if (!opt)
 		return -ENOMEM;
-	if (optlen && copy_from_user(opt->__data, data, optlen)) {
+	if (optlen && copy_from_user(opt->opt.__data, data, optlen)) {
 		kfree(opt);
 		return -EFAULT;
 	}
 	return ip_options_get_finish(net, optp, opt, optlen);
 }
 
-int ip_options_get(struct net *net, struct ip_options **optp,
+int ip_options_get(struct net *net, struct ip_options_rcu **optp,
 		   unsigned char *data, int optlen)
 {
-	struct ip_options *opt = ip_options_get_alloc(optlen);
+	struct ip_options_rcu *opt = ip_options_get_alloc(optlen);
 
 	if (!opt)
 		return -ENOMEM;
 	if (optlen)
-		memcpy(opt->__data, data, optlen);
+		memcpy(opt->opt.__data, data, optlen);
 	return ip_options_get_finish(net, optp, opt, optlen);
 }
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index bdad3d6..362e66f 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -140,14 +140,14 @@ static inline int ip_select_ttl(struct inet_sock *inet, struct dst_entry *dst)
  *
  */
 int ip_build_and_send_pkt(struct sk_buff *skb, struct sock *sk,
-			  __be32 saddr, __be32 daddr, struct ip_options *opt)
+			  __be32 saddr, __be32 daddr, struct ip_options_rcu *opt)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct rtable *rt = skb_rtable(skb);
 	struct iphdr *iph;
 
 	/* Build the IP header. */
-	skb_push(skb, sizeof(struct iphdr) + (opt ? opt->optlen : 0));
+	skb_push(skb, sizeof(struct iphdr) + (opt ? opt->opt.optlen : 0));
 	skb_reset_network_header(skb);
 	iph = ip_hdr(skb);
 	iph->version  = 4;
@@ -163,9 +163,9 @@ int ip_build_and_send_pkt(struct sk_buff *skb, struct sock *sk,
 	iph->protocol = sk->sk_protocol;
 	ip_select_ident(iph, &rt->dst, sk);
 
-	if (opt && opt->optlen) {
-		iph->ihl += opt->optlen>>2;
-		ip_options_build(skb, opt, daddr, rt, 0);
+	if (opt && opt->opt.optlen) {
+		iph->ihl += opt->opt.optlen>>2;
+		ip_options_build(skb, &opt->opt, daddr, rt, 0);
 	}
 
 	skb->priority = sk->sk_priority;
@@ -316,7 +316,7 @@ int ip_queue_xmit(struct sk_buff *skb)
 {
 	struct sock *sk = skb->sk;
 	struct inet_sock *inet = inet_sk(sk);
-	struct ip_options *opt = inet->opt;
+	struct ip_options_rcu *inet_opt;
 	struct rtable *rt;
 	struct iphdr *iph;
 	int res;
@@ -325,6 +325,7 @@ int ip_queue_xmit(struct sk_buff *skb)
 	 * f.e. by something like SCTP.
 	 */
 	rcu_read_lock();
+	inet_opt = rcu_dereference(inet->inet_opt);
 	rt = skb_rtable(skb);
 	if (rt != NULL)
 		goto packet_routed;
@@ -336,8 +337,8 @@ int ip_queue_xmit(struct sk_buff *skb)
 
 		/* Use correct destination address if we have options. */
 		daddr = inet->inet_daddr;
-		if(opt && opt->srr)
-			daddr = opt->faddr;
+		if (inet_opt && inet_opt->opt.srr)
+			daddr = inet_opt->opt.faddr;
 
 		/* If this fails, retransmit mechanism of transport layer will
 		 * keep trying until route appears or the connection times
@@ -357,11 +358,11 @@ int ip_queue_xmit(struct sk_buff *skb)
 	skb_dst_set_noref(skb, &rt->dst);
 
 packet_routed:
-	if (opt && opt->is_strictroute && rt->rt_dst != rt->rt_gateway)
+	if (inet_opt && inet_opt->opt.is_strictroute && rt->rt_dst != rt->rt_gateway)
 		goto no_route;
 
 	/* OK, we know where to send it, allocate and build IP header. */
-	skb_push(skb, sizeof(struct iphdr) + (opt ? opt->optlen : 0));
+	skb_push(skb, sizeof(struct iphdr) + (inet_opt ? inet_opt->opt.optlen : 0));
 	skb_reset_network_header(skb);
 	iph = ip_hdr(skb);
 	*((__be16 *)iph) = htons((4 << 12) | (5 << 8) | (inet->tos & 0xff));
@@ -375,9 +376,9 @@ packet_routed:
 	iph->daddr    = rt->rt_dst;
 	/* Transport layer set skb->h.foo itself. */
 
-	if (opt && opt->optlen) {
-		iph->ihl += opt->optlen >> 2;
-		ip_options_build(skb, opt, inet->inet_daddr, rt, 0);
+	if (inet_opt && inet_opt->opt.optlen) {
+		iph->ihl += inet_opt->opt.optlen >> 2;
+		ip_options_build(skb, &inet_opt->opt, inet->inet_daddr, rt, 0);
 	}
 
 	ip_select_ident_more(iph, &rt->dst, sk,
@@ -1033,7 +1034,7 @@ static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
 			 struct ipcm_cookie *ipc, struct rtable **rtp)
 {
 	struct inet_sock *inet = inet_sk(sk);
-	struct ip_options *opt;
+	struct ip_options_rcu *opt;
 	struct rtable *rt;
 
 	/*
@@ -1047,7 +1048,7 @@ static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
 			if (unlikely(cork->opt == NULL))
 				return -ENOBUFS;
 		}
-		memcpy(cork->opt, opt, sizeof(struct ip_options) + opt->optlen);
+		memcpy(cork->opt, &opt->opt, sizeof(struct ip_options) + opt->opt.optlen);
 		cork->flags |= IPCORK_OPT;
 		cork->addr = ipc->addr;
 	}
@@ -1451,26 +1452,23 @@ void ip_send_reply(struct sock *sk, struct sk_buff *skb, struct ip_reply_arg *ar
 		   unsigned int len)
 {
 	struct inet_sock *inet = inet_sk(sk);
-	struct {
-		struct ip_options	opt;
-		char			data[40];
-	} replyopts;
+	struct ip_options_data replyopts;
 	struct ipcm_cookie ipc;
 	__be32 daddr;
 	struct rtable *rt = skb_rtable(skb);
 
-	if (ip_options_echo(&replyopts.opt, skb))
+	if (ip_options_echo(&replyopts.opt.opt, skb))
 		return;
 
 	daddr = ipc.addr = rt->rt_src;
 	ipc.opt = NULL;
 	ipc.tx_flags = 0;
 
-	if (replyopts.opt.optlen) {
+	if (replyopts.opt.opt.optlen) {
 		ipc.opt = &replyopts.opt;
 
-		if (ipc.opt->srr)
-			daddr = replyopts.opt.faddr;
+		if (replyopts.opt.opt.srr)
+			daddr = replyopts.opt.opt.faddr;
 	}
 
 	{
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 3948c86..68cdf2b 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -451,6 +451,11 @@ out:
 }
 
 
+static void opt_kfree_rcu(struct rcu_head *head)
+{
+	kfree(container_of(head, struct ip_options_rcu, rcu));
+}
+
 /*
  *	Socket option code for IP. This is the end of the line after any
  *	TCP,UDP etc options on an IP socket.
@@ -497,13 +502,16 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 	switch (optname) {
 	case IP_OPTIONS:
 	{
-		struct ip_options *opt = NULL;
+		struct ip_options_rcu *old, *opt = NULL;
+
 		if (optlen > 40)
 			goto e_inval;
 		err = ip_options_get_from_user(sock_net(sk), &opt,
 					       optval, optlen);
 		if (err)
 			break;
+		old = rcu_dereference_protected(inet->inet_opt,
+						sock_owned_by_user(sk));
 		if (inet->is_icsk) {
 			struct inet_connection_sock *icsk = inet_csk(sk);
 #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
@@ -512,17 +520,18 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 			       (TCPF_LISTEN | TCPF_CLOSE)) &&
 			     inet->inet_daddr != LOOPBACK4_IPV6)) {
 #endif
-				if (inet->opt)
-					icsk->icsk_ext_hdr_len -= inet->opt->optlen;
+				if (old)
+					icsk->icsk_ext_hdr_len -= old->opt.optlen;
 				if (opt)
-					icsk->icsk_ext_hdr_len += opt->optlen;
+					icsk->icsk_ext_hdr_len += opt->opt.optlen;
 				icsk->icsk_sync_mss(sk, icsk->icsk_pmtu_cookie);
 #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
 			}
 #endif
 		}
-		opt = xchg(&inet->opt, opt);
-		kfree(opt);
+		rcu_assign_pointer(inet->inet_opt, opt);
+		if (old)
+			call_rcu(&old->rcu, opt_kfree_rcu);
 		break;
 	}
 	case IP_PKTINFO:
@@ -1081,12 +1090,16 @@ static int do_ip_getsockopt(struct sock *sk, int level, int optname,
 	case IP_OPTIONS:
 	{
 		unsigned char optbuf[sizeof(struct ip_options)+40];
-		struct ip_options * opt = (struct ip_options *)optbuf;
+		struct ip_options *opt = (struct ip_options *)optbuf;
+		struct ip_options_rcu *inet_opt;
+
+		inet_opt = rcu_dereference_protected(inet->inet_opt,
+						     sock_owned_by_user(sk));
 		opt->optlen = 0;
-		if (inet->opt)
-			memcpy(optbuf, inet->opt,
-			       sizeof(struct ip_options)+
-			       inet->opt->optlen);
+		if (inet_opt)
+			memcpy(optbuf, &inet_opt->opt,
+			       sizeof(struct ip_options) +
+			       inet_opt->opt.optlen);
 		release_sock(sk);
 
 		if (opt->optlen == 0)
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 2b50cc2..c577d25 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -460,6 +460,7 @@ static int raw_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	__be32 saddr;
 	u8  tos;
 	int err;
+	struct ip_options_data opt_copy;
 
 	err = -EMSGSIZE;
 	if (len > 0xFFFF)
@@ -520,8 +521,18 @@ static int raw_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	saddr = ipc.addr;
 	ipc.addr = daddr;
 
-	if (!ipc.opt)
-		ipc.opt = inet->opt;
+	if (!ipc.opt) {
+		struct ip_options_rcu *inet_opt;
+
+		rcu_read_lock();
+		inet_opt = rcu_dereference(inet->inet_opt);
+		if (inet_opt) {
+			memcpy(&opt_copy, inet_opt,
+			       sizeof(*inet_opt) + inet_opt->opt.optlen);
+			ipc.opt = &opt_copy.opt;
+		}
+		rcu_read_unlock();
+	}
 
 	if (ipc.opt) {
 		err = -EINVAL;
@@ -530,10 +541,10 @@ static int raw_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		 */
 		if (inet->hdrincl)
 			goto done;
-		if (ipc.opt->srr) {
+		if (ipc.opt->opt.srr) {
 			if (!daddr)
 				goto done;
-			daddr = ipc.opt->faddr;
+			daddr = ipc.opt->opt.faddr;
 		}
 	}
 	tos = RT_CONN_FLAGS(sk);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 71e0296..2646149 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -321,10 +321,10 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 	 * the ACK carries the same options again (see RFC1122 4.2.3.8)
 	 */
 	if (opt && opt->optlen) {
-		int opt_size = sizeof(struct ip_options) + opt->optlen;
+		int opt_size = sizeof(struct ip_options_rcu) + opt->optlen;
 
 		ireq->opt = kmalloc(opt_size, GFP_ATOMIC);
-		if (ireq->opt != NULL && ip_options_echo(ireq->opt, skb)) {
+		if (ireq->opt != NULL && ip_options_echo(&ireq->opt->opt, skb)) {
 			kfree(ireq->opt);
 			ireq->opt = NULL;
 		}
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f7e6c2c..d120f6f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -153,6 +153,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	struct rtable *rt;
 	__be32 daddr, nexthop;
 	int err;
+	struct ip_options_rcu *inet_opt;
 
 	if (addr_len < sizeof(struct sockaddr_in))
 		return -EINVAL;
@@ -161,10 +162,12 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 		return -EAFNOSUPPORT;
 
 	nexthop = daddr = usin->sin_addr.s_addr;
-	if (inet->opt && inet->opt->srr) {
+	inet_opt = rcu_dereference_protected(inet->inet_opt,
+					     sock_owned_by_user(sk));
+	if (inet_opt && inet_opt->opt.srr) {
 		if (!daddr)
 			return -EINVAL;
-		nexthop = inet->opt->faddr;
+		nexthop = inet_opt->opt.faddr;
 	}
 
 	orig_sport = inet->inet_sport;
@@ -185,7 +188,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 		return -ENETUNREACH;
 	}
 
-	if (!inet->opt || !inet->opt->srr)
+	if (!inet_opt || !inet_opt->opt.srr)
 		daddr = rt->rt_dst;
 
 	if (!inet->inet_saddr)
@@ -221,8 +224,8 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	inet->inet_daddr = daddr;
 
 	inet_csk(sk)->icsk_ext_hdr_len = 0;
-	if (inet->opt)
-		inet_csk(sk)->icsk_ext_hdr_len = inet->opt->optlen;
+	if (inet_opt)
+		inet_csk(sk)->icsk_ext_hdr_len = inet_opt->opt.optlen;
 
 	tp->rx_opt.mss_clamp = TCP_MSS_DEFAULT;
 
@@ -820,17 +823,18 @@ static void syn_flood_warning(const struct sk_buff *skb)
 /*
  * Save and compile IPv4 options into the request_sock if needed.
  */
-static struct ip_options *tcp_v4_save_options(struct sock *sk,
-					      struct sk_buff *skb)
+static struct ip_options_rcu *tcp_v4_save_options(struct sock *sk,
+						  struct sk_buff *skb)
 {
-	struct ip_options *opt = &(IPCB(skb)->opt);
-	struct ip_options *dopt = NULL;
+	const struct ip_options *opt = &(IPCB(skb)->opt);
+	struct ip_options_rcu *dopt = NULL;
 
 	if (opt && opt->optlen) {
-		int opt_size = optlength(opt);
+		int opt_size = sizeof(*dopt) + opt->optlen;
+
 		dopt = kmalloc(opt_size, GFP_ATOMIC);
 		if (dopt) {
-			if (ip_options_echo(dopt, skb)) {
+			if (ip_options_echo(&dopt->opt, skb)) {
 				kfree(dopt);
 				dopt = NULL;
 			}
@@ -1411,6 +1415,7 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 #ifdef CONFIG_TCP_MD5SIG
 	struct tcp_md5sig_key *key;
 #endif
+	struct ip_options_rcu *inet_opt;
 
 	if (sk_acceptq_is_full(sk))
 		goto exit_overflow;
@@ -1431,13 +1436,14 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 	newinet->inet_daddr   = ireq->rmt_addr;
 	newinet->inet_rcv_saddr = ireq->loc_addr;
 	newinet->inet_saddr	      = ireq->loc_addr;
-	newinet->opt	      = ireq->opt;
+	inet_opt	      = ireq->opt;
+	rcu_assign_pointer(newinet->inet_opt, inet_opt);
 	ireq->opt	      = NULL;
 	newinet->mc_index     = inet_iif(skb);
 	newinet->mc_ttl	      = ip_hdr(skb)->ttl;
 	inet_csk(newsk)->icsk_ext_hdr_len = 0;
-	if (newinet->opt)
-		inet_csk(newsk)->icsk_ext_hdr_len = newinet->opt->optlen;
+	if (inet_opt)
+		inet_csk(newsk)->icsk_ext_hdr_len = inet_opt->opt.optlen;
 	newinet->inet_id = newtp->write_seq ^ jiffies;
 
 	tcp_mtup_init(newsk);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index a15c8fb..75d24ce 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -804,6 +804,7 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	int corkreq = up->corkflag || msg->msg_flags&MSG_MORE;
 	int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
 	struct sk_buff *skb;
+	struct ip_options_data opt_copy;
 
 	if (len > 0xFFFF)
 		return -EMSGSIZE;
@@ -877,22 +878,32 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			free = 1;
 		connected = 0;
 	}
-	if (!ipc.opt)
-		ipc.opt = inet->opt;
+	if (!ipc.opt) {
+		struct ip_options_rcu *inet_opt;
+
+		rcu_read_lock();
+		inet_opt = rcu_dereference(inet->inet_opt);
+		if (inet_opt) {
+			memcpy(&opt_copy, inet_opt,
+			       sizeof(*inet_opt) + inet_opt->opt.optlen);
+			ipc.opt = &opt_copy.opt;
+		}
+		rcu_read_unlock();
+	}
 
 	saddr = ipc.addr;
 	ipc.addr = faddr = daddr;
 
-	if (ipc.opt && ipc.opt->srr) {
+	if (ipc.opt && ipc.opt->opt.srr) {
 		if (!daddr)
 			return -EINVAL;
-		faddr = ipc.opt->faddr;
+		faddr = ipc.opt->opt.faddr;
 		connected = 0;
 	}
 	tos = RT_TOS(inet->tos);
 	if (sock_flag(sk, SOCK_LOCALROUTE) ||
 	    (msg->msg_flags & MSG_DONTROUTE) ||
-	    (ipc.opt && ipc.opt->is_strictroute)) {
+	    (ipc.opt && ipc.opt->opt.is_strictroute)) {
 		tos |= RTO_ONLINK;
 		connected = 0;
 	}
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 4f49e5d..2c6e606 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1469,7 +1469,7 @@ static struct sock * tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 
 	   First: no IPv4 options.
 	 */
-	newinet->opt = NULL;
+	newinet->inet_opt = NULL;
 	newnp->ipv6_fl_list = NULL;
 
 	/* Clone RX bits */



^ permalink raw reply related

* Re: [PATCHv4] usbnet: Resubmit interrupt URB once if halted
From: Oliver Neukum @ 2011-04-21 20:00 UTC (permalink / raw)
  To: Alan Stern; +Cc: Paul Stewart, netdev, linux-usb, davem, greg
In-Reply-To: <Pine.LNX.4.44L0.1104210958560.1939-100000@iolanthe.rowland.org>

Am Donnerstag, 21. April 2011, 16:03:34 schrieb Alan Stern:
> On Tue, 19 Apr 2011, Paul Stewart wrote:

> > This version of the patch moves the urb submit directly into
> > usbnet_resume.  Is it okay to submit a GFP_KERNEL urb from
> > usbnet_resume()?

Suppose a device of two interfaces one of them storage is autosuspended.
GFP_KERNEL in the first device to be resumed triggers a pageout to the
suspended storage device.

	Regards
		Oliver

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox