Netdev List
 help / color / mirror / Atom feed
* NAPI documentation needed
From: Rafał Miłecki @ 2012-12-20 19:39 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller

I wanted to report some problem I've encouraged during bgmac driver development.

At the very beginning I've implemented IRQ using threaded IRQ
(request_threaded_irq). I didn't know about NAPI until someone pointed
me that mistake. So I decided to rewrite IRQs handling to use NAPI.
I've found following documents:
http://www.linuxfoundation.org/collaborate/workgroups/networking/napi
ftp://robur.slu.se/pub/Linux/net-development/NAPI/README
ftp://robur.slu.se/pub/Linux/net-development/NAPI/NAPI_HOWTO.txt
ftp://robur.slu.se/pub/Linux/net-development/NAPI/converting-to-NAPI.txt~
but nothing really official sitting in kernel's Documentation dir.

So I started to using found documents, but then noticed they are quite outdated.

1) We don't have netif_rx_schedule and netif_rx_complete anymore.
2) We don't set poll and weight manually anymore but use netif_napi_add
3) Return type and arguments has changed in poll. None of the
following is up-to-date:
static void my_poll (struct net_device *dev, int *budget)
int (*poll)(struct net_device *dev, int *budget);

It would be great if someone with NAPI knowledge could document it in
a kernel. Would be really helpful for new network drivers developers.

-- 
Rafał

^ permalink raw reply

* Re: NAPI documentation needed
From: Eric Dumazet @ 2012-12-20 20:04 UTC (permalink / raw)
  To: Rafał Miłecki; +Cc: netdev, David S. Miller
In-Reply-To: <CACna6rzG3KcMYUpMZv-wv6e62J9j1NLuZpTRSQHLpyfTTbKyUg@mail.gmail.com>

On Thu, 2012-12-20 at 20:39 +0100, Rafał Miłecki wrote:
> I wanted to report some problem I've encouraged during bgmac driver development.
> 
> At the very beginning I've implemented IRQ using threaded IRQ
> (request_threaded_irq). I didn't know about NAPI until someone pointed
> me that mistake. So I decided to rewrite IRQs handling to use NAPI.
> I've found following documents:
> http://www.linuxfoundation.org/collaborate/workgroups/networking/napi
> ftp://robur.slu.se/pub/Linux/net-development/NAPI/README
> ftp://robur.slu.se/pub/Linux/net-development/NAPI/NAPI_HOWTO.txt
> ftp://robur.slu.se/pub/Linux/net-development/NAPI/converting-to-NAPI.txt~
> but nothing really official sitting in kernel's Documentation dir.
> 
> So I started to using found documents, but then noticed they are quite outdated.
> 
> 1) We don't have netif_rx_schedule and netif_rx_complete anymore.
> 2) We don't set poll and weight manually anymore but use netif_napi_add
> 3) Return type and arguments has changed in poll. None of the
> following is up-to-date:
> static void my_poll (struct net_device *dev, int *budget)
> int (*poll)(struct net_device *dev, int *budget);
> 
> It would be great if someone with NAPI knowledge could document it in
> a kernel. Would be really helpful for new network drivers developers.

I think you might be the one to update/create the official NAPI
documentation, now ideas are clear for you.

That would be really great indeed.

^ permalink raw reply

* Re: NAPI documentation needed
From: Ben Hutchings @ 2012-12-20 20:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Rafał Miłecki, netdev, David S. Miller
In-Reply-To: <1356033878.21834.3163.camel@edumazet-glaptop>

On Thu, 2012-12-20 at 12:04 -0800, Eric Dumazet wrote:
> On Thu, 2012-12-20 at 20:39 +0100, Rafał Miłecki wrote:
> > I wanted to report some problem I've encouraged during bgmac driver development.
> > 
> > At the very beginning I've implemented IRQ using threaded IRQ
> > (request_threaded_irq). I didn't know about NAPI until someone pointed
> > me that mistake. So I decided to rewrite IRQs handling to use NAPI.
> > I've found following documents:
> > http://www.linuxfoundation.org/collaborate/workgroups/networking/napi
> > ftp://robur.slu.se/pub/Linux/net-development/NAPI/README
> > ftp://robur.slu.se/pub/Linux/net-development/NAPI/NAPI_HOWTO.txt
> > ftp://robur.slu.se/pub/Linux/net-development/NAPI/converting-to-NAPI.txt~
> > but nothing really official sitting in kernel's Documentation dir.
> > 
> > So I started to using found documents, but then noticed they are quite outdated.
> > 
> > 1) We don't have netif_rx_schedule and netif_rx_complete anymore.
> > 2) We don't set poll and weight manually anymore but use netif_napi_add
> > 3) Return type and arguments has changed in poll. None of the
> > following is up-to-date:
> > static void my_poll (struct net_device *dev, int *budget)
> > int (*poll)(struct net_device *dev, int *budget);
> > 
> > It would be great if someone with NAPI knowledge could document it in
> > a kernel. Would be really helpful for new network drivers developers.
> 
> I think you might be the one to update/create the official NAPI
> documentation, now ideas are clear for you.
> 
> That would be really great indeed.

It would.  Last time someone asked, my answer was:

> The initial change to napi_struct is explained in
> <http://lwn.net/Articles/244640/>.
> 
> Since then there have been further changes:
> 
> - netif_napi_del() has been added.  You must call it to clean up NAPI
> contexts before freeing the associated net device(s).
> 
> - Instead of netif_rx_schedule(), netif_rx_complete(), etc. you must use
> napi_schedule(), napi_complete() etc. which just take a napi_struct
> pointer.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* RE: [PATCH] wireless: mwifiex: remove unreachable paths
From: Bing Zhao @ 2012-12-20 20:13 UTC (permalink / raw)
  To: Sasha Levin
  Cc: John W. Linville, linux-wireless@vger.kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <1356030701-16284-17-git-send-email-sasha.levin@oracle.com>

Hi Sasha,

Thanks for the patch.

> We know 'firmware' is non-NULL from the beginning of mwifiex_prog_fw_w_helper,
> remove all !firmware paths from the rest of the function.

After removing all !firmware paths the function 'mwifiex_get_fw_data' becomes an orphan.

Could you please remove that function as well and resend a v2 patch?

Thanks,
Bing

> 
> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
> ---
>  drivers/net/wireless/mwifiex/usb.c | 15 +++------------
>  1 file changed, 3 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/net/wireless/mwifiex/usb.c b/drivers/net/wireless/mwifiex/usb.c
> index 63ac9f2..8bd7098 100644
> --- a/drivers/net/wireless/mwifiex/usb.c
> +++ b/drivers/net/wireless/mwifiex/usb.c
> @@ -836,23 +836,14 @@ static int mwifiex_prog_fw_w_helper(struct mwifiex_adapter *adapter,
>  			dlen = 0;
>  		} else {
>  			/* copy the header of the fw_data to get the length */
> -			if (firmware)
> -				memcpy(&fwdata->fw_hdr, &firmware[tlen],
> -				       sizeof(struct fw_header));
> -			else
> -				mwifiex_get_fw_data(adapter, tlen,
> -						    sizeof(struct fw_header),
> -						    (u8 *)&fwdata->fw_hdr);
> +			memcpy(&fwdata->fw_hdr, &firmware[tlen],
> +			       sizeof(struct fw_header));
> 
>  			dlen = le32_to_cpu(fwdata->fw_hdr.data_len);
>  			dnld_cmd = le32_to_cpu(fwdata->fw_hdr.dnld_cmd);
>  			tlen += sizeof(struct fw_header);
> 
> -			if (firmware)
> -				memcpy(fwdata->data, &firmware[tlen], dlen);
> -			else
> -				mwifiex_get_fw_data(adapter, tlen, dlen,
> -						    (u8 *)fwdata->data);
> +			memcpy(fwdata->data, &firmware[tlen], dlen);
> 
>  			fwdata->seq_num = cpu_to_le32(fw_seqnum);
>  			tlen += dlen;
> --
> 1.8.0

^ permalink raw reply

* Re: [PATCH net-next V4 03/13] bridge: Validate that vlan is permitted on ingress
From: Shmulik Ladkani @ 2012-12-20 20:13 UTC (permalink / raw)
  To: vyasevic; +Cc: netdev, shemminger, davem, or.gerlitz, jhs, mst, erdnetdev, jiri
In-Reply-To: <50D342B7.9010901@redhat.com>

Hi Vlad,

On Thu, 20 Dec 2012 11:54:15 -0500 Vlad Yasevich <vyasevic@redhat.com> wrote:
> > (2) frame ingress on a "non-vlan" port may egress on a "vlan enabled"
> >    port, depending on the ingress VID and the port-membership map of the
> >    egress port
> >    (and thus, PVID should be defined even to "non-vlan" ports, for the
> >    case where untagged frame is received on the non-vlan port)
> 
> Sort of.  The way I did it (testing now), is like this:
>     if there is egress policy
> 	apply policy and forward.
>     else if there was ingress policy (pvid)
> 	apply it and forward
>     else
> 	forward as is (old bridge behavior).
> 
> This way if there was a pvid on an ingress port and nothing on egress,
> then pvid applies.  If there was nothing configured on ingress port,
> but we have a egress policy, we'll apply any vlan information from
> the frame to egress policy.  In this case, ingress untagged traffic 
> would be assigned vlan 0.

Sorry, got too cryptic too me ;)
We're probably aligned, but if you don't mind I'd like to make sure I
got it right.

I'd expect the following logic if the bridge is a vlan bridge:

1. Frame ingress on a port
  Frame's VID is collected: if frame was tagged, its the VID found in
  the tag; if frame was untagged (or prio-tagged), the VID would be
  port's PVID.
2. Ingress membership verification
  Verify the ingress port is a member of the frame's VID vlan (collected
  on step 1).
  (Usually policy is 'drop' in case port is not a member).
  Easily calculated by testing if the port bit is set in vlan's port
  membership map.
3. Switching logic
  Consult FDB for a forward/flood/drop decision, resulting in a map of
  candidate ports the frame might egress upon (e.g. in the common case,
  a valid existing unicast entry, the result is just one candidate
  port).
4. Egress membership verification
  For each candidate port found on step 3, verify it is a member of the
  frame's VID vlan.
  (Usually, candidate ports that aren't members of the vlan will not be
  selected for actual egress).
  This can be easily calculated by masking the candidates port map
  (found on step 3) with the vlan's port membership map. The resulting
  masked map is final egress portmap.
5. Frame tag construction and egress 
  For each final egress port (found on step 4), verify its
  tagged/untagged policy in the vlan's egress-policy map.
  Properly add/remove the vlan tag (if needed) according to port's
  egress policy, and transmit.

To my best understanding, if all the ports are "vlan-enabled" (having a
non-empty vid list, i.e. belonging to at least one vlan), the behavior
of the implemented bridge is aligned with the above scheme.

For "non-vlan" ports (having en empty vid list), we treat them as if
they belong to ANY POSSIBLE vlan (as if their bit is always set in every
vlan port membeship map). Meaning, in step (2) verification always
suceeds for such ports, and in step (4) such ports will never be masked
out of the egress candidates portmap.

Please let me know if the implementation fits this.

> I'll try to document things sufficiently.  This hybrid approach may
> produce some unintended results.  We could always remove it or introduce
> the tunable to change default policy to drop once vlan configuration is
> in effect.

Ok.

Regards,
Shmulik

^ permalink raw reply

* Re: [PATCH 4/4] net/smsc911x: Provide common clock functionality
From: Lee Jones @ 2012-12-20 20:35 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Linus Walleij, Steve Glendinning, Robert Marklund, linus.walleij,
	arnd, netdev, linux-kernel, linux-arm-kernel
In-Reply-To: <20121220192441.GC14363@n2100.arm.linux.org.uk>

On Thu, 20 Dec 2012, Russell King - ARM Linux wrote:

> On Thu, Dec 20, 2012 at 08:12:08PM +0100, Linus Walleij wrote:
> > On Wed, Dec 19, 2012 at 6:19 PM, Lee Jones <lee.jones@linaro.org> wrote:
> > 
> > > Some platforms provide clocks which require enabling before the
> > > SMSC911x chip will power on. This patch uses the new common clk
> > > framework to do just that. If no clock is provided, it will just
> > > be ignored and the driver will continue to assume that no clock
> > > is required for the chip to run successfully.
> > >
> > > Cc: Steve Glendinning <steve.glendinning@shawell.net>
> > > Cc: netdev@vger.kernel.org
> > > Signed-off-by: Lee Jones <lee.jones@linaro.org>
> > 
> > Seems to me like it'll do the trick.
> > Acked-by: Linus Walleij <linus.walleij@linaro.org>
> 
> This looks fairly dangerous.  What about those platforms which use this
> driver, but don't provide a clock for it?
> 
> It looks like this will result in those platforms losing their ethernet
> support.  There's at least a bunch of the ARM evaluation boards which
> make use of this driver...

Right, but nothing should regress. If no clock is provided the driver
moves on during the request and will refuse to prepare, enable and
disable there after. 

Unless I've made a mistake somewhere? If so, I'd be happy to fixup.

-- 
Lee Jones
Linaro ST-Ericsson Landing Team Lead
Linaro.org │ Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog

^ permalink raw reply

* bonding driver - how to recognize the active slave
From: Erez Shitrit @ 2012-12-20 20:38 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Hi,
Is there any way to know who is the active slave in active-passive mode?

Is there any indication for that in netdevice flags? Or some other way?
(I would like to get in my network interface driver which can be a slave of the bonding interface)

I saw that there is the flag IFF_SLAVE_INACTIVE but it is not used as far as I can see.

Thanks, Erez

^ permalink raw reply

* Re: [PATCH 4/4] net/smsc911x: Provide common clock functionality
From: Russell King - ARM Linux @ 2012-12-20 20:51 UTC (permalink / raw)
  To: Lee Jones
  Cc: Linus Walleij, Steve Glendinning, Robert Marklund, linus.walleij,
	arnd, netdev, linux-kernel, linux-arm-kernel
In-Reply-To: <20121220203514.GN2691@gmail.com>

On Thu, Dec 20, 2012 at 08:35:14PM +0000, Lee Jones wrote:
> On Thu, 20 Dec 2012, Russell King - ARM Linux wrote:
> 
> > On Thu, Dec 20, 2012 at 08:12:08PM +0100, Linus Walleij wrote:
> > > On Wed, Dec 19, 2012 at 6:19 PM, Lee Jones <lee.jones@linaro.org> wrote:
> > > 
> > > > Some platforms provide clocks which require enabling before the
> > > > SMSC911x chip will power on. This patch uses the new common clk
> > > > framework to do just that. If no clock is provided, it will just
> > > > be ignored and the driver will continue to assume that no clock
> > > > is required for the chip to run successfully.
> > > >
> > > > Cc: Steve Glendinning <steve.glendinning@shawell.net>
> > > > Cc: netdev@vger.kernel.org
> > > > Signed-off-by: Lee Jones <lee.jones@linaro.org>
> > > 
> > > Seems to me like it'll do the trick.
> > > Acked-by: Linus Walleij <linus.walleij@linaro.org>
> > 
> > This looks fairly dangerous.  What about those platforms which use this
> > driver, but don't provide a clock for it?
> > 
> > It looks like this will result in those platforms losing their ethernet
> > support.  There's at least a bunch of the ARM evaluation boards which
> > make use of this driver...
> 
> Right, but nothing should regress. If no clock is provided the driver
> moves on during the request and will refuse to prepare, enable and
> disable there after. 
> 
> Unless I've made a mistake somewhere? If so, I'd be happy to fixup.

No, but... don't use NULL for that.  Use IS_ERR(pdata->clk) instead.

^ permalink raw reply

* Re: bonding driver - how to recognize the active slave
From: Eric Dumazet @ 2012-12-20 21:19 UTC (permalink / raw)
  To: Erez Shitrit; +Cc: netdev@vger.kernel.org
In-Reply-To: <BA0B6857ABCA474FB07F0DA6F50A3077918B2450@MTRDAG01.mtl.com>

On Thu, 2012-12-20 at 20:38 +0000, Erez Shitrit wrote:
> Hi,
> Is there any way to know who is the active slave in active-passive mode?
> 
> Is there any indication for that in netdevice flags? Or some other way?
> (I would like to get in my network interface driver which can be a slave of the bonding interface)
> 
> I saw that there is the flag IFF_SLAVE_INACTIVE but it is not used as far as I can see.


cat /sys/class/net/bond0/bonding/active_slave

^ permalink raw reply

* RE: [PATCH net-next] be2net: fix INTx ISR for interrupt behaviour on BE2
From: Ben Hutchings @ 2012-12-20 21:20 UTC (permalink / raw)
  To: Perla, Sathya; +Cc: netdev@vger.kernel.org
In-Reply-To: <CF9D1877D81D214CB0CA0669EFAE020C0E0C422D@CMEXMB1.ad.emulex.com>

On Tue, 2012-12-18 at 12:57 +0000, Perla, Sathya wrote:
> >-----Original Message-----
> >From: Ben Hutchings [mailto:bhutchings@solarflare.com]
> 
> >On Wed, 2012-11-28 at 20:20 +0000, Ben Hutchings wrote:
> >> On Wed, 2012-11-28 at 11:20 +0530, Sathya Perla wrote:
> >> > On BE2 chip, an interrupt may be raised even when EQ is in un-armed state.
> >> > As a result be_intx()::events_get() and be_poll:events_get() can race and
> >> > notify an EQ wrongly.
> >> >
> >> > Fix this by counting events only in be_poll(). Commit 0b545a629 fixes
> >> > the same issue in the MSI-x path.
> >> >
> >> > But, on Lancer, INTx can be de-asserted only by notifying num evts. This
> >> > is not an issue as the above BE2 behavior doesn't exist/has never been
> >> > seen on Lancer.
> >> [...]
> >> > @@ -2014,15 +1996,23 @@ static int be_rx_cqs_create(struct be_adapter
> >*adapter)
> >> >
> >> >  static irqreturn_t be_intx(int irq, void *dev)
> >> >  {
> >> > -	struct be_adapter *adapter = dev;
> >> > -	int num_evts;
> >> > +	struct be_eq_obj *eqo = dev;
> >> > +	struct be_adapter *adapter = eqo->adapter;
> >> > +	int num_evts = 0;
> >> >
> >> > -	/* With INTx only one EQ is used */
> >> > -	num_evts = event_handle(&adapter->eq_obj[0]);
> >> > -	if (num_evts)
> >> > -		return IRQ_HANDLED;
> >> > -	else
> >> > -		return IRQ_NONE;
> >> > +	/* On Lancer, clear-intr bit of the EQ DB does not work.
> >> > +	 * INTx is de-asserted only on notifying num evts.
> >> > +	 */
> >> > +	if (lancer_chip(adapter))
> >> > +		num_evts = events_get(eqo);
> >> > +
> >> > +	/* The EQ-notify may not de-assert INTx rightaway, causing
> >> > +	 * the ISR to be invoked again. So, return HANDLED even when
> >> > +	 * num_evts is zero.
> >> > +	 */
> >> > +	be_eq_notify(adapter, eqo->q.id, false, true, num_evts);
> >> > +	napi_schedule(&eqo->napi);
> >> > +	return IRQ_HANDLED;
> >> >  }
> >> [...]
> >>
> >> You shouldn't unconditionally return IRQ_HANDLED.  This prevents
> >> interrupt storm detection from working, not just for your device but for
> >> anything else sharing its IRQ.
> >>
> >> I understand there is a real problem to be fixed (PCIe write completions
> >> overtaking INTx deassertion, and maybe a specific hardware bug).
> >[...]
> >
> >I was thinking of read completions; there are no write completions to
> >wait for so you're pretty much guaranteed to get called a second time.
> >Maybe you should add an MMIO read after calling be_eq_notify().
> 
> Ben, I'm very sorry for not replying to this thread earlier. I needed time to play with
> your suggested solution and better understand the HW behavior in this case.
> 
> Adding an extra PCI memory read after the EQ-notify helps in syncing the PCI de-assert
> message *only* if it was already issued.
> But, there are cases when the HW block takes some time (longer) to initiate the INTx de-assert.
> The PCI memory read would complete but the de-assert wouldn't have happened yet.
> The PCI read sync will work only if the de-assert was issued *before* the PCI-read request was seen by the HW.

Yes, I've seen the exact same problem with our controllers.  At the PCIe
transaction level, legacy interrupt messages and read completions are
queued and flow-controlled separately.  If the chip's PCIe core doesn't
provide a way to restrict reordering of the two TLPs, this seems to be
inevitable.

What we do in sfc is to count the number of times in a row we've seen no
reason for the interrupt (irq_zero_count), and return IRQ_HANDLED only
if this is the first time.  This might work for you too.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [RFC PATCH v3 2/2] tun: fix LSM/SELinux labeling of tun/tap devices
From: Eric Paris @ 2012-12-20 21:58 UTC (permalink / raw)
  To: Paul Moore
  Cc: Jason Wang, Michael S. Tsirkin, Linux Netdev List, LSM List,
	SE-Linux
In-Reply-To: <2374585.ppB8RWyPuB@sifl>

You may add my Ack to the series.

-Eric

On Wed, Dec 19, 2012 at 11:58 AM, Paul Moore <pmoore@redhat.com> wrote:
> On Wednesday, December 19, 2012 01:46:25 PM Jason Wang wrote:
>> On 12/19/2012 07:08 AM, Michael S. Tsirkin wrote:
>> > On Tue, Dec 18, 2012 at 05:53:52PM -0500, Paul Moore wrote:
>> >> This patch corrects some problems with LSM/SELinux that were introduced
>> >> with the multiqueue patchset.  The problem stems from the fact that the
>> >> multiqueue work changed the relationship between the tun device and its
>> >> associated socket; before the socket persisted for the life of the
>> >> device, however after the multiqueue changes the socket only persisted
>> >> for the life of the userspace connection (fd open).  For non-persistent
>> >> devices this is not an issue, but for persistent devices this can cause
>> >> the tun device to lose its SELinux label.
>> >>
>> >> We correct this problem by adding an opaque LSM security blob to the
>> >> tun device struct which allows us to have the LSM security state, e.g.
>> >> SELinux labeling information, persist for the lifetime of the tun
>> >> device.  In the process we tweak the LSM hooks to work with this new
>> >> approach to TUN device/socket labeling and introduce a new LSM hook,
>> >> security_tun_dev_attach_queue(), to approve requests to attach to a
>> >> TUN queue via TUNSETQUEUE.
>> >>
>> >> The SELinux code has been adjusted to match the new LSM hooks, the
>> >> other LSMs do not make use of the LSM TUN controls.  This patch makes
>> >> use of the recently added "tun_socket:attach_queue" permission to
>> >> restrict access to the TUNSETQUEUE operation.  On older SELinux
>> >> policies which do not define the "tun_socket:attach_queue" permission
>> >> the access control decision for TUNSETQUEUE will be handled according
>> >> to the SELinux policy's unknown permission setting.
>> >>
>> >> Signed-off-by: Paul Moore <pmoore@redhat.com>
>> >
>> > Looks good to me. A comment not directly related to this patch, below.
>>
>> Good to me too, will do some test on this.
>
> Great.  I'll do some more testing and make sure the LSM and SELinux crowd are
> okay with the changes.
>
> --
> paul moore
> security and virtualization @ redhat
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: bonding driver - how to recognize the active slave
From: Or Gerlitz @ 2012-12-20 22:21 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Erez Shitrit, netdev@vger.kernel.org
In-Reply-To: <1356038370.21834.3315.camel@edumazet-glaptop>

On Thu, Dec 20, 2012 at 11:19 PM, Eric Dumazet <erdnetdev@gmail.com> wrote:
> cat /sys/class/net/bond0/bonding/active_slave

sure, I think Erez would like to know that from within a network
device kernel code, e.g maybe register to netdev kernel events and on
the event of bonding fail-over identify the active slave for
active-backup mode, etc.

Or.

^ permalink raw reply

* Re: bonding driver - how to recognize the active slave
From: Eric Dumazet @ 2012-12-20 22:52 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Erez Shitrit, netdev@vger.kernel.org
In-Reply-To: <CAJZOPZLOj2QCWN75hAsQnmB=YiSGRhdhzf4ODhAD-QoU+Y0ZGg@mail.gmail.com>

On Fri, 2012-12-21 at 00:21 +0200, Or Gerlitz wrote:
> On Thu, Dec 20, 2012 at 11:19 PM, Eric Dumazet <erdnetdev@gmail.com> wrote:
> > cat /sys/class/net/bond0/bonding/active_slave
> 
> sure, I think Erez would like to know that from within a network
> device kernel code, e.g maybe register to netdev kernel events and on
> the event of bonding fail-over identify the active slave for
> active-backup mode, etc.

OK, but then why a network driver should be aware of this bonding
detail ?

It seems part of this commit could be reverted,
ie setting IFF_SLAVE_INACTIVE again.

commit 2d7011ca79f1a8792e04d131b8ea21db179ab917
Author: Jiri Pirko <jpirko@redhat.com>
Date:   Wed Mar 16 08:46:43 2011 +0000

    bonding: get rid of IFF_SLAVE_INACTIVE netdev->priv_flag
    
    Since bond-related code was moved from net/core/dev.c into bonding,
    IFF_SLAVE_INACTIVE is no longer needed. Replace is with flag "inactive"
    stored in slave structure
    
    Signed-off-by: Jiri Pirko <jpirko@redhat.com>
    Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
    Signed-off-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply

* TUN problems (regression?)
From: Paul Moore @ 2012-12-20 23:16 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev

[-- Attachment #1: Type: text/plain, Size: 528 bytes --]

[CC'ing netdev in case this is a known problem I just missed ...]

Hi Jason,

I started doing some more testing with the multiqueue TUN changes and I ran 
into a problem when running tunctl: running it once w/o arguments works as 
expected, but running it a second time results in failure and a 
kmem_cache_sanity_check() failure.  The problem appears to be very repeatable 
on my test VM and happens independent of the LSM/SELinux fixup patches.

Have you seen this before?

-- 
paul moore
security and virtualization @ redhat

[-- Attachment #2: tun_problem.png --]
[-- Type: image/png, Size: 13740 bytes --]

^ permalink raw reply

* Re: TUN problems (regression?)
From: Eric Dumazet @ 2012-12-20 23:38 UTC (permalink / raw)
  To: Paul Moore; +Cc: Jason Wang, netdev
In-Reply-To: <4151394.nMo40zlg68@sifl>

On Thu, 2012-12-20 at 18:16 -0500, Paul Moore wrote:
> [CC'ing netdev in case this is a known problem I just missed ...]
> 
> Hi Jason,
> 
> I started doing some more testing with the multiqueue TUN changes and I ran 
> into a problem when running tunctl: running it once w/o arguments works as 
> expected, but running it a second time results in failure and a 
> kmem_cache_sanity_check() failure.  The problem appears to be very repeatable 
> on my test VM and happens independent of the LSM/SELinux fixup patches.
> 
> Have you seen this before?
> 

Obviously code in tun_flow_init() is wrong...

static int tun_flow_init(struct tun_struct *tun)
{
        int i;

        tun->flow_cache = kmem_cache_create("tun_flow_cache",
                                            sizeof(struct tun_flow_entry), 0, 0,
                                            NULL);
        if (!tun->flow_cache)
                return -ENOMEM;
...
}


I have no idea why we would need a kmem_cache per tun_struct,
and why we even need a kmem_cache.


I would try following patch :

 drivers/net/tun.c |   24 +++---------------------
 1 file changed, 3 insertions(+), 21 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 504f7f1..fbd106e 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -180,7 +180,6 @@ struct tun_struct {
 	int debug;
 #endif
 	spinlock_t lock;
-	struct kmem_cache *flow_cache;
 	struct hlist_head flows[TUN_NUM_FLOW_ENTRIES];
 	struct timer_list flow_gc_timer;
 	unsigned long ageing_time;
@@ -209,8 +208,8 @@ static struct tun_flow_entry *tun_flow_create(struct tun_struct *tun,
 					      struct hlist_head *head,
 					      u32 rxhash, u16 queue_index)
 {
-	struct tun_flow_entry *e = kmem_cache_alloc(tun->flow_cache,
-						    GFP_ATOMIC);
+	struct tun_flow_entry *e = kmalloc(sizeof(*e), GFP_ATOMIC);
+
 	if (e) {
 		tun_debug(KERN_INFO, tun, "create flow: hash %u index %u\n",
 			  rxhash, queue_index);
@@ -223,19 +222,12 @@ static struct tun_flow_entry *tun_flow_create(struct tun_struct *tun,
 	return e;
 }
 
-static void tun_flow_free(struct rcu_head *head)
-{
-	struct tun_flow_entry *e
-		= container_of(head, struct tun_flow_entry, rcu);
-	kmem_cache_free(e->tun->flow_cache, e);
-}
-
 static void tun_flow_delete(struct tun_struct *tun, struct tun_flow_entry *e)
 {
 	tun_debug(KERN_INFO, tun, "delete flow: hash %u index %u\n",
 		  e->rxhash, e->queue_index);
 	hlist_del_rcu(&e->hash_link);
-	call_rcu(&e->rcu, tun_flow_free);
+	kfree_rcu(e, rcu);
 }
 
 static void tun_flow_flush(struct tun_struct *tun)
@@ -833,12 +825,6 @@ static int tun_flow_init(struct tun_struct *tun)
 {
 	int i;
 
-	tun->flow_cache = kmem_cache_create("tun_flow_cache",
-					    sizeof(struct tun_flow_entry), 0, 0,
-					    NULL);
-	if (!tun->flow_cache)
-		return -ENOMEM;
-
 	for (i = 0; i < TUN_NUM_FLOW_ENTRIES; i++)
 		INIT_HLIST_HEAD(&tun->flows[i]);
 
@@ -854,10 +840,6 @@ static void tun_flow_uninit(struct tun_struct *tun)
 {
 	del_timer_sync(&tun->flow_gc_timer);
 	tun_flow_flush(tun);
-
-	/* Wait for completion of call_rcu()'s */
-	rcu_barrier();
-	kmem_cache_destroy(tun->flow_cache);
 }
 
 /* Initialize net device. */

^ permalink raw reply related

* Re: TUN problems (regression?)
From: Stephen Hemminger @ 2012-12-20 23:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Paul Moore, Jason Wang, netdev
In-Reply-To: <1356046697.21834.3606.camel@edumazet-glaptop>

On Thu, 20 Dec 2012 15:38:17 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2012-12-20 at 18:16 -0500, Paul Moore wrote:
> > [CC'ing netdev in case this is a known problem I just missed ...]
> > 
> > Hi Jason,
> > 
> > I started doing some more testing with the multiqueue TUN changes and I ran 
> > into a problem when running tunctl: running it once w/o arguments works as 
> > expected, but running it a second time results in failure and a 
> > kmem_cache_sanity_check() failure.  The problem appears to be very repeatable 
> > on my test VM and happens independent of the LSM/SELinux fixup patches.
> > 
> > Have you seen this before?
> > 
> 
> Obviously code in tun_flow_init() is wrong...
> 
> static int tun_flow_init(struct tun_struct *tun)
> {
>         int i;
> 
>         tun->flow_cache = kmem_cache_create("tun_flow_cache",
>                                             sizeof(struct tun_flow_entry), 0, 0,
>                                             NULL);
>         if (!tun->flow_cache)
>                 return -ENOMEM;
> ...
> }
> 
> 
> I have no idea why we would need a kmem_cache per tun_struct,
> and why we even need a kmem_cache.

Normally flow malloc/free should be good enough.
It might make sense to use private kmem_cache if doing hlist_nulls.


Acked-by: Stephen Hemminger <shemminger@vyatta.com>

^ permalink raw reply

* [PATCH] ipv4/ip_gre: make ipgre_tunnel_xmit() not parse network header as IP unconditionally
From: Isaku Yamahata @ 2012-12-21  1:12 UTC (permalink / raw)
  To: netdev; +Cc: yamahata

ipgre_tunnel_xmit() parses network header as IP unconditionally.
But transmitting packets are not always IP packet. For example such packet
can be sent by packet socket with sockaddr_ll.sll_protocol set.
So make the function check if skb->protocol is IP.

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 net/ipv4/ip_gre.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index a85ae2f..8fcf0ed 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -760,7 +760,10 @@ static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device *dev
 
 	if (dev->header_ops && dev->type == ARPHRD_IPGRE) {
 		gre_hlen = 0;
-		tiph = (const struct iphdr *)skb->data;
+		if (skb->protocol == htons(ETH_P_IP))
+			tiph = (const struct iphdr *)skb->data;
+		else
+			tiph = &tunnel->parms.iph;
 	} else {
 		gre_hlen = tunnel->hlen;
 		tiph = &tunnel->parms.iph;
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH] ipv4/ip_gre: make ipgre_tunnel_xmit() not parse network header as IP unconditionally
From: Eric Dumazet @ 2012-12-21  1:48 UTC (permalink / raw)
  To: Isaku Yamahata; +Cc: netdev
In-Reply-To: <9ee0cd8f94e1ee866b40ee7b6755e8d8705325c9.1356052319.git.yamahata@valinux.co.jp>

On Fri, 2012-12-21 at 10:12 +0900, Isaku Yamahata wrote:
> ipgre_tunnel_xmit() parses network header as IP unconditionally.
> But transmitting packets are not always IP packet. For example such packet
> can be sent by packet socket with sockaddr_ll.sll_protocol set.
> So make the function check if skb->protocol is IP.
> 
> Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
> ---
>  net/ipv4/ip_gre.c |    5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
> index a85ae2f..8fcf0ed 100644
> --- a/net/ipv4/ip_gre.c
> +++ b/net/ipv4/ip_gre.c
> @@ -760,7 +760,10 @@ static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device *dev
>  
>  	if (dev->header_ops && dev->type == ARPHRD_IPGRE) {
>  		gre_hlen = 0;
> -		tiph = (const struct iphdr *)skb->data;
> +		if (skb->protocol == htons(ETH_P_IP))
> +			tiph = (const struct iphdr *)skb->data;
> +		else
> +			tiph = &tunnel->parms.iph;
>  	} else {
>  		gre_hlen = tunnel->hlen;
>  		tiph = &tunnel->parms.iph;

Seems good to me thanks !

Acked-by: Eric Dumazet <edumazet@google.com>

BTW, it seems another bug exists at line 931 : We dereference tiph while
it could point to freed memory because of the skb_realloc_headroom() at
line 893

I'll send a patch.

^ permalink raw reply

* [PATCH] ip_gre: fix possible use after free
From: Eric Dumazet @ 2012-12-21  2:00 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Isaku Yamahata

From: Eric Dumazet <edumazet@google.com>

Once skb_realloc_headroom() is called, tiph might point to freed memory.

Cache tiph->ttl value before the reallocation, to avoid unexpected
behavior.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Isaku Yamahata <yamahata@valinux.co.jp>
---
 net/ipv4/ip_gre.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index a85ae2f..4c22e70 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -750,6 +750,7 @@ static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device *dev
 	int    gre_hlen;
 	__be32 dst;
 	int    mtu;
+	u8     ttl;
 
 	if (skb->ip_summed == CHECKSUM_PARTIAL &&
 	    skb_checksum_help(skb))
@@ -812,6 +813,7 @@ static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device *dev
 			goto tx_error;
 	}
 
+	ttl = tiph->ttl;
 	tos = tiph->tos;
 	if (tos == 1) {
 		tos = 0;
@@ -904,6 +906,7 @@ static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device *dev
 		dev_kfree_skb(skb);
 		skb = new_skb;
 		old_iph = ip_hdr(skb);
+		/* Warning : tiph value might point to freed memory */
 	}
 
 	skb_reset_transport_header(skb);
@@ -927,8 +930,9 @@ static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device *dev
 	iph->tos		=	ipgre_ecn_encapsulate(tos, old_iph, skb);
 	iph->daddr		=	fl4.daddr;
 	iph->saddr		=	fl4.saddr;
+	iph->ttl		=	ttl;
 
-	if ((iph->ttl = tiph->ttl) == 0) {
+	if (ttl == 0) {
 		if (skb->protocol == htons(ETH_P_IP))
 			iph->ttl = old_iph->ttl;
 #if IS_ENABLED(CONFIG_IPV6)

^ permalink raw reply related

* sock_ioctl sleeping while atomic warning during boot.
From: Dave Jones @ 2012-12-21  2:24 UTC (permalink / raw)
  To: netdev

Seen on Linus tree as of 810a4855513b9cb1a191301eb5e4e28b276cc318


 BUG: sleeping function called from invalid context at mm/slub.c:925
 in_atomic(): 1, irqs_disabled(): 0, pid: 364, name: systemd-udevd
 2 locks on stack by systemd-udevd/364:
  #0: held:     (rtnl_mutex){+.+.+.}, instance: ffffffff81d1c038, at: [<ffffffff815d82d7>] rtnl_lock+0x17/0x20
  #1: held:     (devnet_rename_seq){+.+.+.}, instance: ffffffff81d1ba60, at: [<ffffffff815c9737>] dev_change_name+0x67/0x2a0
 Pid: 364, comm: systemd-udevd Not tainted 3.7.0+ #32
 Call Trace:
  [<ffffffff810a892b>] __might_sleep+0x14b/0x210
  [<ffffffff811ce833>] __kmalloc_track_caller+0x63/0x2d0
  [<ffffffff8146810f>] ? device_rename+0x5f/0x130
  [<ffffffff8119190a>] kstrdup+0x3a/0x70
  [<ffffffff8146810f>] device_rename+0x5f/0x130
  [<ffffffff815c97d3>] dev_change_name+0x103/0x2a0
  [<ffffffff815cb851>] dev_ifsioc+0x251/0x3b0
  [<ffffffff815cbb5b>] dev_ioctl+0x1ab/0x840
  [<ffffffff81367588>] ? __const_udelay+0x28/0x30
  [<ffffffff812f7a49>] ? avc_has_perm_flags+0x29/0x2b0
  [<ffffffff815aa86d>] sock_do_ioctl+0x5d/0x70
  [<ffffffff815aa8fd>] sock_ioctl+0x7d/0x2c0
  [<ffffffff812f9d05>] ? inode_has_perm.isra.47.constprop.60+0x65/0xa0
  [<ffffffff811f1079>] do_vfs_ioctl+0x99/0x5a0
  [<ffffffff812f9dd7>] ? file_has_perm+0x97/0xb0
  [<ffffffff811f1611>] sys_ioctl+0x91/0xb0
  [<ffffffff81711480>] tracesys+0xdd/0xe2
 systemd-udevd[364]: renamed network interface eth0 to p6p1

^ permalink raw reply

* Re: [PATCH 4/5] net: sfc: fix return value check in efx_ptp_probe_channel().
From: Ben Hutchings @ 2012-12-21  3:09 UTC (permalink / raw)
  To: David Miller, tipecaml
  Cc: linux-kernel, kernel-janitors, linux-net-drivers, netdev
In-Reply-To: <20121212.001550.741463250196231082.davem@davemloft.net>

On Wed, 2012-12-12 at 00:15 -0500, David Miller wrote:
> From: Cyril Roelandt <tipecaml@gmail.com>
> Date: Wed, 12 Dec 2012 01:24:53 +0100
> 
> > The ptp_clock_register() returns ERR_PTR() and never returns NULL. Replace the
> > NULL check by a call to IS_ERR().
> > 
> > Signed-off-by: Cyril Roelandt <tipecaml@gmail.com>
> 
> I'll let Ben queue this up.
> 
> Probably he'll want to avoid potentially leaving an ERR_PTR
> in ptp->phc_clock even if, with this fix, that would be
> harmless.

Sorry for the delay in looking at this.  This change is fine, as the
entire structure *ptp will be freed on the failure path.

I'm now on vacation until the new year.  Cyril, please re-send your
patch with the addition of:

Acked-by: Ben Hutchings <bhutchings@solarflare.com>

so it will show up on 'patchwork' again and David can apply it from
there.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: sock_ioctl sleeping while atomic warning during boot.
From: Eric Dumazet @ 2012-12-21  3:25 UTC (permalink / raw)
  To: Dave Jones; +Cc: netdev, Brian Haley
In-Reply-To: <20121221022409.GA1927@redhat.com>

From: Eric Dumazet <edumazet@google.com>

On Thu, 2012-12-20 at 21:24 -0500, Dave Jones wrote:
> Seen on Linus tree as of 810a4855513b9cb1a191301eb5e4e28b276cc318
> 
> 
>  BUG: sleeping function called from invalid context at mm/slub.c:925
>  in_atomic(): 1, irqs_disabled(): 0, pid: 364, name: systemd-udevd
>  2 locks on stack by systemd-udevd/364:
>   #0: held:     (rtnl_mutex){+.+.+.}, instance: ffffffff81d1c038, at: [<ffffffff815d82d7>] rtnl_lock+0x17/0x20
>   #1: held:     (devnet_rename_seq){+.+.+.}, instance: ffffffff81d1ba60, at: [<ffffffff815c9737>] dev_change_name+0x67/0x2a0
>  Pid: 364, comm: systemd-udevd Not tainted 3.7.0+ #32
>  Call Trace:
>   [<ffffffff810a892b>] __might_sleep+0x14b/0x210
>   [<ffffffff811ce833>] __kmalloc_track_caller+0x63/0x2d0
>   [<ffffffff8146810f>] ? device_rename+0x5f/0x130
>   [<ffffffff8119190a>] kstrdup+0x3a/0x70
>   [<ffffffff8146810f>] device_rename+0x5f/0x130
>   [<ffffffff815c97d3>] dev_change_name+0x103/0x2a0
>   [<ffffffff815cb851>] dev_ifsioc+0x251/0x3b0
>   [<ffffffff815cbb5b>] dev_ioctl+0x1ab/0x840
>   [<ffffffff81367588>] ? __const_udelay+0x28/0x30
>   [<ffffffff812f7a49>] ? avc_has_perm_flags+0x29/0x2b0
>   [<ffffffff815aa86d>] sock_do_ioctl+0x5d/0x70
>   [<ffffffff815aa8fd>] sock_ioctl+0x7d/0x2c0
>   [<ffffffff812f9d05>] ? inode_has_perm.isra.47.constprop.60+0x65/0xa0
>   [<ffffffff811f1079>] do_vfs_ioctl+0x99/0x5a0
>   [<ffffffff812f9dd7>] ? file_has_perm+0x97/0xb0
>   [<ffffffff811f1611>] sys_ioctl+0x91/0xb0
>   [<ffffffff81711480>] tracesys+0xdd/0xe2
>  systemd-udevd[364]: renamed network interface eth0 to p6p1
> 
> --

OK, thanks for the report.

We need a seqcount, not a seqlock, as RTNL already protects multiple
writers.

Please try following fix :


[PATCH] net: devnet_rename_seq should be a seqcount

Using a seqlock for devnet_rename_seq is not a good idea,
as device_rename() can sleep.

As we hold RTNL, we dont need a protection for writers,
and only need a seqcount so that readers can catch a change done
by a writer.

Bug added in commit c91f6df2db4972d3 (sockopt: Change getsockopt() of
SO_BINDTODEVICE to return an interface name)

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Brian Haley <brian.haley@hp.com>
---
 include/linux/netdevice.h |    2 +-
 net/core/dev.c            |   18 +++++++++---------
 net/core/sock.c           |    4 ++--
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 02e0f6b..c599e47 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1576,7 +1576,7 @@ extern int call_netdevice_notifiers(unsigned long val, struct net_device *dev);
 
 extern rwlock_t				dev_base_lock;		/* Device list lock */
 
-extern seqlock_t	devnet_rename_seq;	/* Device rename lock */
+extern seqcount_t	devnet_rename_seq;	/* Device rename seq */
 
 
 #define for_each_netdev(net, d)		\
diff --git a/net/core/dev.c b/net/core/dev.c
index d0cbc93..26d73e2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -203,7 +203,7 @@ static struct list_head offload_base __read_mostly;
 DEFINE_RWLOCK(dev_base_lock);
 EXPORT_SYMBOL(dev_base_lock);
 
-DEFINE_SEQLOCK(devnet_rename_seq);
+seqcount_t devnet_rename_seq;
 
 static inline void dev_base_seq_inc(struct net *net)
 {
@@ -1093,10 +1093,10 @@ int dev_change_name(struct net_device *dev, const char *newname)
 	if (dev->flags & IFF_UP)
 		return -EBUSY;
 
-	write_seqlock(&devnet_rename_seq);
+	write_seqcount_begin(&devnet_rename_seq);
 
 	if (strncmp(newname, dev->name, IFNAMSIZ) == 0) {
-		write_sequnlock(&devnet_rename_seq);
+		write_seqcount_end(&devnet_rename_seq);
 		return 0;
 	}
 
@@ -1104,7 +1104,7 @@ int dev_change_name(struct net_device *dev, const char *newname)
 
 	err = dev_get_valid_name(net, dev, newname);
 	if (err < 0) {
-		write_sequnlock(&devnet_rename_seq);
+		write_seqcount_end(&devnet_rename_seq);
 		return err;
 	}
 
@@ -1112,11 +1112,11 @@ rollback:
 	ret = device_rename(&dev->dev, dev->name);
 	if (ret) {
 		memcpy(dev->name, oldname, IFNAMSIZ);
-		write_sequnlock(&devnet_rename_seq);
+		write_seqcount_end(&devnet_rename_seq);
 		return ret;
 	}
 
-	write_sequnlock(&devnet_rename_seq);
+	write_seqcount_end(&devnet_rename_seq);
 
 	write_lock_bh(&dev_base_lock);
 	hlist_del_rcu(&dev->name_hlist);
@@ -1135,7 +1135,7 @@ rollback:
 		/* err >= 0 after dev_alloc_name() or stores the first errno */
 		if (err >= 0) {
 			err = ret;
-			write_seqlock(&devnet_rename_seq);
+			write_seqcount_begin(&devnet_rename_seq);
 			memcpy(dev->name, oldname, IFNAMSIZ);
 			goto rollback;
 		} else {
@@ -4180,7 +4180,7 @@ static int dev_ifname(struct net *net, struct ifreq __user *arg)
 		return -EFAULT;
 
 retry:
-	seq = read_seqbegin(&devnet_rename_seq);
+	seq = read_seqcount_begin(&devnet_rename_seq);
 	rcu_read_lock();
 	dev = dev_get_by_index_rcu(net, ifr.ifr_ifindex);
 	if (!dev) {
@@ -4190,7 +4190,7 @@ retry:
 
 	strcpy(ifr.ifr_name, dev->name);
 	rcu_read_unlock();
-	if (read_seqretry(&devnet_rename_seq, seq))
+	if (read_seqcount_retry(&devnet_rename_seq, seq))
 		goto retry;
 
 	if (copy_to_user(arg, &ifr, sizeof(struct ifreq)))
diff --git a/net/core/sock.c b/net/core/sock.c
index a692ef4..bc131d4 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -583,7 +583,7 @@ static int sock_getbindtodevice(struct sock *sk, char __user *optval,
 		goto out;
 
 retry:
-	seq = read_seqbegin(&devnet_rename_seq);
+	seq = read_seqcount_begin(&devnet_rename_seq);
 	rcu_read_lock();
 	dev = dev_get_by_index_rcu(net, sk->sk_bound_dev_if);
 	ret = -ENODEV;
@@ -594,7 +594,7 @@ retry:
 
 	strcpy(devname, dev->name);
 	rcu_read_unlock();
-	if (read_seqretry(&devnet_rename_seq, seq))
+	if (read_seqcount_retry(&devnet_rename_seq, seq))
 		goto retry;
 
 	len = strlen(devname) + 1;

^ permalink raw reply related

* Re: sock_ioctl sleeping while atomic warning during boot.
From: Brian Haley @ 2012-12-21  3:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Dave Jones, netdev
In-Reply-To: <1356060308.21834.4455.camel@edumazet-glaptop>

On 12/20/2012 10:25 PM, Eric Dumazet wrote:
> OK, thanks for the report.
> 
> We need a seqcount, not a seqlock, as RTNL already protects multiple
> writers.
> 
> Please try following fix :
> 
> 
> [PATCH] net: devnet_rename_seq should be a seqcount
> 
> Using a seqlock for devnet_rename_seq is not a good idea,
> as device_rename() can sleep.
> 
> As we hold RTNL, we dont need a protection for writers,
> and only need a seqcount so that readers can catch a change done
> by a writer.
> 
> Bug added in commit c91f6df2db4972d3 (sockopt: Change getsockopt() of
> SO_BINDTODEVICE to return an interface name)
> 
> Reported-by: Dave Jones <davej@redhat.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Brian Haley <brian.haley@hp.com>

Sorry about that, thanks for the quick fix Eric.

-Brian

^ permalink raw reply

* Re: TUN problems (regression?)
From: Jason Wang @ 2012-12-21  3:32 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Eric Dumazet, Paul Moore, netdev
In-Reply-To: <20121220155001.538bbdb0@nehalam.linuxnetplumber.net>

On 12/21/2012 07:50 AM, Stephen Hemminger wrote:
> On Thu, 20 Dec 2012 15:38:17 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> On Thu, 2012-12-20 at 18:16 -0500, Paul Moore wrote:
>>> [CC'ing netdev in case this is a known problem I just missed ...]
>>>
>>> Hi Jason,
>>>
>>> I started doing some more testing with the multiqueue TUN changes and I ran 
>>> into a problem when running tunctl: running it once w/o arguments works as 
>>> expected, but running it a second time results in failure and a 
>>> kmem_cache_sanity_check() failure.  The problem appears to be very repeatable 
>>> on my test VM and happens independent of the LSM/SELinux fixup patches.
>>>
>>> Have you seen this before?
>>>
>> Obviously code in tun_flow_init() is wrong...
>>
>> static int tun_flow_init(struct tun_struct *tun)
>> {
>>         int i;
>>
>>         tun->flow_cache = kmem_cache_create("tun_flow_cache",
>>                                             sizeof(struct tun_flow_entry), 0, 0,
>>                                             NULL);
>>         if (!tun->flow_cache)
>>                 return -ENOMEM;
>> ...
>> }
>>
>>
>> I have no idea why we would need a kmem_cache per tun_struct,
>> and why we even need a kmem_cache.
> Normally flow malloc/free should be good enough.
> It might make sense to use private kmem_cache if doing hlist_nulls.
>
>
> Acked-by: Stephen Hemminger <shemminger@vyatta.com>

Should be at least a global cache, I thought I can get some speed-up by
using kmem_cache.

Acked-by: Jason Wang <jasowang@redhat.com>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: TUN problems (regression?)
From: Eric Dumazet @ 2012-12-21  3:39 UTC (permalink / raw)
  To: Jason Wang; +Cc: Stephen Hemminger, Paul Moore, netdev
In-Reply-To: <50D3D85B.1070605@redhat.com>

On Fri, 2012-12-21 at 11:32 +0800, Jason Wang wrote:
> On 12/21/2012 07:50 AM, Stephen Hemminger wrote:
> > On Thu, 20 Dec 2012 15:38:17 -0800
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> >> On Thu, 2012-12-20 at 18:16 -0500, Paul Moore wrote:
> >>> [CC'ing netdev in case this is a known problem I just missed ...]
> >>>
> >>> Hi Jason,
> >>>
> >>> I started doing some more testing with the multiqueue TUN changes and I ran 
> >>> into a problem when running tunctl: running it once w/o arguments works as 
> >>> expected, but running it a second time results in failure and a 
> >>> kmem_cache_sanity_check() failure.  The problem appears to be very repeatable 
> >>> on my test VM and happens independent of the LSM/SELinux fixup patches.
> >>>
> >>> Have you seen this before?
> >>>
> >> Obviously code in tun_flow_init() is wrong...
> >>
> >> static int tun_flow_init(struct tun_struct *tun)
> >> {
> >>         int i;
> >>
> >>         tun->flow_cache = kmem_cache_create("tun_flow_cache",
> >>                                             sizeof(struct tun_flow_entry), 0, 0,
> >>                                             NULL);
> >>         if (!tun->flow_cache)
> >>                 return -ENOMEM;
> >> ...
> >> }
> >>
> >>
> >> I have no idea why we would need a kmem_cache per tun_struct,
> >> and why we even need a kmem_cache.
> > Normally flow malloc/free should be good enough.
> > It might make sense to use private kmem_cache if doing hlist_nulls.
> >
> >
> > Acked-by: Stephen Hemminger <shemminger@vyatta.com>
> 
> Should be at least a global cache, I thought I can get some speed-up by
> using kmem_cache.
> 
> Acked-by: Jason Wang <jasowang@redhat.com>

Was it with SLUB or SLAB ?

Using generic kmalloc-64 is better than a dedicated kmem_cache of 48
bytes per object, as we guarantee each object is on a single cache line.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox