Netdev List


* Re: tg3 and Broadcom PHY driver
From: David Miller @ 2009-10-09 21:25 UTC (permalink / raw)
  To: bhutchings; +Cc: felix, mcarlson, netdev
In-Reply-To: <1254185639.27790.3.camel@localhost>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Tue, 29 Sep 2009 01:53:59 +0100

> On Mon, 2009-09-28 at 14:55 -0700, David Miller wrote:
>> From: Felix Radensky <felix@embedded-sol.com>
>> Date: Mon, 28 Sep 2009 23:52:54 +0200
>> 
>> > Yes, moving CONFIG_TIGON3 right after CONFIG_PHYLIB in
>> > drivers/net/Makefile fixes the problem for me.
>> 
>> Thanks for testing.
>> 
>> We really need to fix this generically.
>> 
>> Does anyone think that moving the MDIO/MII/PHY layer objects
>> to the top of drivers/net/Makefile will break anything?
>> 
>> If not, that's what we should do I think.
> 
> Only the phylib drivers actually need to be moved to fix the
> initialisation order, but moving the others shouldn't hurt.

Ok, I'm adding the following to net-2.6 to resolve this and
will queue it up for -stable too.

Thanks everyone.

net: Link in PHY drivers before others.

We need PHY drivers to initialize in a static kernel before
the MAC drivers that use them.  So link them in first.

Based upon a report by Felix Radensky.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/Makefile |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index d866b8c..48d82e9 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -2,6 +2,10 @@
 # Makefile for the Linux network (ethercard) device drivers.
 #
 
+obj-$(CONFIG_MII) += mii.o
+obj-$(CONFIG_MDIO) += mdio.o
+obj-$(CONFIG_PHYLIB) += phy/
+
 obj-$(CONFIG_TI_DAVINCI_EMAC) += davinci_emac.o
 
 obj-$(CONFIG_E1000) += e1000/
@@ -100,10 +104,6 @@ obj-$(CONFIG_SH_ETH) += sh_eth.o
 # end link order section
 #
 
-obj-$(CONFIG_MII) += mii.o
-obj-$(CONFIG_MDIO) += mdio.o
-obj-$(CONFIG_PHYLIB) += phy/
-
 obj-$(CONFIG_SUNDANCE) += sundance.o
 obj-$(CONFIG_HAMACHI) += hamachi.o
 obj-$(CONFIG_NET) += Space.o loopback.o
-- 
1.6.4.4
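
The ordering problem described in the commit message can be modelled in a few lines of userspace C. This is only a sketch under the assumption that same-level init functions in a static kernel run in link order; the names reset_model(), phy_driver_init(), mac_driver_init() and run_initcalls() are hypothetical and are not kernel APIs.

```c
#include <stddef.h>
#include <string.h>

#define MAX_PHY_DRIVERS 4

/* Hypothetical registry standing in for the PHY driver list. */
static const char *registered_phy[MAX_PHY_DRIVERS];
static size_t nr_registered_phy;

typedef int (*initcall_fn)(void);

/* Forget all registrations so another ordering can be tried. */
void reset_model(void)
{
	nr_registered_phy = 0;
}

/* Stand-in for a PHY driver's init function: registers itself. */
int phy_driver_init(void)
{
	if (nr_registered_phy == MAX_PHY_DRIVERS)
		return -1;
	registered_phy[nr_registered_phy++] = "example-phy";
	return 0;
}

/* Stand-in for a MAC driver's init: fails if its PHY driver is missing. */
int mac_driver_init(void)
{
	size_t i;

	for (i = 0; i < nr_registered_phy; i++)
		if (strcmp(registered_phy[i], "example-phy") == 0)
			return 0;
	return -1;
}

/* Run the init functions in array order and count the failures. */
size_t run_initcalls(const initcall_fn *calls, size_t n)
{
	size_t i, failed = 0;

	for (i = 0; i < n; i++)
		if (calls[i]() != 0)
			failed++;
	return failed;
}
```

Running the array { mac_driver_init, phy_driver_init } reports one failure in this model, while { phy_driver_init, mac_driver_init } reports none, which is the effect the Makefile reordering is after.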


^ permalink raw reply related

* Re: [RFCv4 PATCH 1/2] net: Introduce recvmmsg socket syscall
From: David Miller @ 2009-10-09 21:27 UTC (permalink / raw)
  To: acme
  Cc: caitlin.bestler, vanhoof, williams, nhorman, nir.tzachar, niv,
	paul.moore, remi.denis-courmont, steve, netdev
In-Reply-To: <20091009193520.GD12982@ghostprotocols.net>

From: Arnaldo Carvalho de Melo <acme@redhat.com>
Date: Fri, 9 Oct 2009 16:35:20 -0300

> 	The second patch in this series has issues, I still have to
> investigate it properly, study removing the skb_queue_head lock like TCP
> does, but the first patch seems to be OK and already providing good
> results at least as reported by Nir, if there aren't any other concerns
> about the API, can we get it into net-next-2.6?

Please make a formal submission of that first patch with all proper
signoffs and without the "RFC" in the subject line and I'll apply it.

Thanks!

^ permalink raw reply

* Re: [PATCH] Generalize socket rx gap / receive queue overflow cmsg (v2)
From: Eric Dumazet @ 2009-10-09 21:31 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, davem, socketcan
In-Reply-To: <20091009193515.GA28196@hmsreliant.think-freely.org>

Neil Horman a écrit :

>  
> +extern void __sock_recv_ts_and_drops(struct msghdr *msg, struct sock *sk,
> +	struct sk_buff *skb);

Surely you meant __sock_recv_drops() ? It only deals with drops.


> +	case SO_RXQ_OVFL:
> +		v.val = sock_flag(sk, SOCK_RXQ_OVFL);
> +		break;
> +

Hmm, I advise using v.val = !!sock_flag(sk, SOCK_RXQ_OVFL);
so that the application gets 0 or 1, not 0 or some big value.
It's better because it allows us to change the internal SOCK_RXQ_OVFL if necessary in the future.
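
A minimal userspace sketch of the normalization being suggested. The helpers flag_test_raw() and flag_test_bool() are hypothetical and do not reproduce sock_flag(); they only show how a mask-style test can return a large value and how !! collapses it to 0 or 1.

```c
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Mask-style test: returns 0, or the mask value of the bit if it is set. */
static unsigned long flag_test_raw(const unsigned long *flags, unsigned int bit)
{
	return flags[bit / BITS_PER_LONG] & (1UL << (bit % BITS_PER_LONG));
}

/* Normalized test: always returns exactly 0 or 1. */
static int flag_test_bool(const unsigned long *flags, unsigned int bit)
{
	return !!flag_test_raw(flags, bit);
}
```

With bit 5 set, the raw test yields 32 while the normalized one yields 1, so the value seen by the application no longer depends on which bit the flag occupies.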

>  drop_n_acct:
> -	spin_lock(&sk->sk_receive_queue.lock);
> -	po->stats.tp_drops++;
> -	spin_unlock(&sk->sk_receive_queue.lock);
> +	po->stats.tp_drops = atomic_inc_return(&sk->sk_drops);

Yes :)
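
For readers without the kernel tree at hand, the change above swaps a lock-protected increment for a single atomic increment whose result is then stored. A rough userspace analogue using C11 atomics might look like this; record_drop() is a hypothetical name, and atomic_fetch_add() stands in for the kernel's atomic_inc_return(), which is not available in userspace.

```c
#include <stdatomic.h>

/*
 * Atomically increment the drop counter and return the new value,
 * with no lock held around the read-modify-write.
 */
unsigned int record_drop(atomic_uint *drops)
{
	/* atomic_fetch_add() returns the old value, so add 1 for the new one. */
	return atomic_fetch_add(drops, 1) + 1;
}
```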

>  EXPORT_SYMBOL_GPL(__sock_recv_timestamp);
>  
> +void __sock_recv_ts_and_drops(struct msghdr *msg, struct sock *sk,
> +	struct sk_buff *skb)
> +{
> +	put_cmsg(msg, SOL_SOCKET, SO_RXQ_OVFL, sizeof(__u32), &skb->dropcount);
> +}
> +EXPORT_SYMBOL_GPL(__sock_recv_ts_and_drops);
> +

Just change the name.

And is it really too large to be inlined ?

On the contrary, sock_recv_timestamp() is so large that I suspect
your sock_recv_ts_and_drops() should *not* be inlined, and should only call inlined helpers:

I suggest something more orthogonal, like:

static inline void sock_recv_drops(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
{
	if (sock_flag(sk, SOCK_RXQ_OVFL) && skb && skb->dropcount)
		put_cmsg(msg, SOL_SOCKET, SO_RXQ_OVFL,
			 sizeof(__u32), &skb->dropcount);
}

void sock_recv_ts_and_drops(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
{
	sock_recv_timestamp(msg, sk, skb); /* inlined */
	sock_recv_drops(msg, sk, skb); /* inlined */
}
EXPORT_SYMBOL_GPL(sock_recv_ts_and_drops);


^ permalink raw reply

* Re: [PATCH 2.6.32-rc3] net: VMware virtual Ethernet NIC driver: vmxnet3
From: Stephen Hemminger @ 2009-10-09 21:35 UTC (permalink / raw)
  To: Shreyas Bhatewara
  Cc: Jeff Garzik, pv-drivers, netdev, linux-kernel, Andrew Morton,
	Chris Wright, Anthony Liguori, Greg Kroah-Hartman, virtualization,
	David S. Miller
In-Reply-To: <alpine.LRH.2.00.0910081053460.19107@localhost.localdomain>

On Thu, 8 Oct 2009 10:59:26 -0700 (PDT)
Shreyas Bhatewara <sbhatewara@vmware.com> wrote:

> Hello all,
> 
> I do not mean to be bothersome but this thread has been unusually silent.
> Could you please review the patch for me and reply with your comments / 
> acks ?
> 
> Thanks.
> ->Shreyas  


Looks fine, but just a minor style nit (can be changed after insertion in mainline).

The code:

static void
vmxnet3_do_poll(struct vmxnet3_adapter *adapter, int budget, int *txd_done,
		int *rxd_done)
{
	if (unlikely(adapter->shared->ecr))
		vmxnet3_process_events(adapter);

	*txd_done = vmxnet3_tq_tx_complete(&adapter->tx_queue, adapter);
	*rxd_done = vmxnet3_rq_rx_complete(&adapter->rx_queue, adapter, budget);
}


static int
vmxnet3_poll(struct napi_struct *napi, int budget)
{
	struct vmxnet3_adapter *adapter = container_of(napi,
					  struct vmxnet3_adapter, napi);
	int rxd_done, txd_done;

	vmxnet3_do_poll(adapter, budget, &txd_done, &rxd_done);

	if (rxd_done < budget) {
		napi_complete(napi);
		vmxnet3_enable_intr(adapter, 0);
	}
	return rxd_done;
}


It is simpler if you just have do_poll return the rx done value. GCC
probably inlines it all anyway.

static int
vmxnet3_do_poll(struct vmxnet3_adapter *adapter, int budget)
{
	if (unlikely(adapter->shared->ecr))
		vmxnet3_process_events(adapter);

	vmxnet3_tq_tx_complete(&adapter->tx_queue, adapter);
	return vmxnet3_rq_rx_complete(&adapter->rx_queue, adapter, budget);
}


static int
vmxnet3_poll(struct napi_struct *napi, int budget)
{
	struct vmxnet3_adapter *adapter = container_of(napi,
					  struct vmxnet3_adapter, napi);
	int rxd_done;

	rxd_done = vmxnet3_do_poll(adapter, budget);
	if (rxd_done < budget) {
		napi_complete(napi);
		vmxnet3_enable_intr(adapter, 0);
	}
	return rxd_done;
}

^ permalink raw reply

* Re: pull request: wireless-next-2.6 2009-10-09
From: David Miller @ 2009-10-09 21:40 UTC (permalink / raw)
  To: linville; +Cc: linux-wireless, netdev
In-Reply-To: <20091009210555.GC22861@tuxdriver.com>

From: "John W. Linville" <linville@tuxdriver.com>
Date: Fri, 9 Oct 2009 17:05:55 -0400

> Here is the usual big first post-window pull request for -next...
> Mostly it is the usual suspects, lots of iwlwifi and ath* along
> with a smattering of other bits.  There are even a few from me! :-)
> Most of these have spent several days banging-around in -next (which
> helped to find some Kconfig problems).
> 
> Please let me know if there are problems!

Pulled, thanks a lot John!

^ permalink raw reply

* Re: netconf notes and materials
From: Bill Fink @ 2009-10-09 21:49 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, peter.p.waskiewicz.jr
In-Reply-To: <20091007.044718.233642056.davem@davemloft.net>

On Wed, 07 Oct 2009, David Miller wrote:

> 
> Just a note that all of the available notes and slide etc.
> materials are available for netconf2009 at:
> 
> 	http://vger.kernel.org/netconf2009.html
> 
> Enjoy.

Thanks very much for the URL!

A question for Peter P Waskiewicz Jr, who presented on "NUMA scaling
issues in 10GbE":

How did you do the NUMA memory performance monitoring that was
presented on one of your slides?  This could be useful to me in
further pursuing an issue I recently raised with the subject
"Receive side performance issue with multi-10-GigE and NUMA"
(see http://article.gmane.org/gmane.linux.network/134658).

					-Thanks again

					-Bill

^ permalink raw reply

* Re: netconf notes and materials
From: David Miller @ 2009-10-09 21:51 UTC (permalink / raw)
  To: billfink; +Cc: netdev, peter.p.waskiewicz.jr
In-Reply-To: <20091009174949.467ddc50.billfink@mindspring.com>

From: Bill Fink <billfink@mindspring.com>
Date: Fri, 9 Oct 2009 17:49:49 -0400

> How did you do the NUMA memory performance monitoring that was
> presented on one of your slides?

Using proprietary internal tools Intel is unlikely to release.

On the bright side, some of those metrics will make their way into the
'perf' facilities in the kernel so they can be monitored, but not all
of them.

^ permalink raw reply

* [PATCH] Re: PACKET_TX_RING: packet size is too long
From: Gabor Gombas @ 2009-10-09 22:05 UTC (permalink / raw)
  To: netdev; +Cc: johann.baudy
In-Reply-To: <20091009090711.GG23133@boogie.lpds.sztaki.hu>

Hi,

Digging through the list archives, I suspect the current value of size_max is a
remnant of the zero-copy mode that was not merged. So I propose the
following patch, which IMHO makes the value of size_max consistent with
how the frame is actually handled in tpacket_fill_skb().

If the zero-copy mode is ever to be resurrected, then the user should
explicitly request it, and either the length of the extra padding
should be the same for 32-bit and 64-bit kernels or there must be a way
to query the value at run time.

Gabor

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index f9f7177..745a016 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -985,10 +985,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		goto out_put;

 	size_max = po->tx_ring.frame_size
-		- sizeof(struct skb_shared_info)
-		- po->tp_hdrlen
-		- LL_ALLOCATED_SPACE(dev)
-		- sizeof(struct sockaddr_ll);
+		- (po->tp_hdrlen - sizeof(struct sockaddr_ll));

 	if (size_max > dev->mtu + reserve)
 		size_max = dev->mtu + reserve;

-- 
     ---------------------------------------------------------
     MTA SZTAKI Computer and Automation Research Institute
                Hungarian Academy of Sciences
     ---------------------------------------------------------
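
To see what the patch changes numerically, the two size_max formulas can be written as plain functions. This is a sketch only: every size is passed in as a parameter, the function names are hypothetical, and the figures used in the note after the block are made-up values, not real structure sizes from any kernel.

```c
/* size_max as computed before the patch. */
long size_max_old(long frame_size, long shinfo_size, long tp_hdrlen,
		  long ll_allocated, long sll_size)
{
	return frame_size - shinfo_size - tp_hdrlen - ll_allocated - sll_size;
}

/* size_max as computed after the patch. */
long size_max_new(long frame_size, long tp_hdrlen, long sll_size)
{
	return frame_size - (tp_hdrlen - sll_size);
}

/* The clamp that follows in tpacket_snd(): never exceed mtu + reserve. */
long clamp_to_mtu(long size_max, long mtu, long reserve)
{
	long limit = mtu + reserve;

	return size_max > limit ? limit : size_max;
}
```

Assuming a 1600-byte frame, 320 bytes of shared info, a 52-byte header, 32 bytes of link-layer space and a 20-byte sockaddr_ll, the old formula gives 1176 and the new one gives 1568. With an MTU of 1500 and a reserve of 14, only the new value reaches the 1514-byte clamp, so a full-sized frame would fit only after the patch.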

^ permalink raw reply related

* Re: Real networking namespace
From: Paul Moore @ 2009-10-09 22:12 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: Stephen Hemminger, linux-security-module, Al Viro, netdev,
	James Morris
In-Reply-To: <1255106692.2182.224.camel@moss-pluto.epoch.ncsc.mil>

On Friday 09 October 2009 12:44:52 pm Stephen Smalley wrote:
> On Fri, 2009-10-09 at 12:37 -0400, Stephen Smalley wrote:
> > On Fri, 2009-10-09 at 08:38 -0700, Stephen Hemminger wrote:
> > > The existing networking namespace model is unattractive for what I
> > > want, has anyone investigated better alternatives?
> > >
> > > I would like to be able to allow access to a network interface and
> > > associated objects (routing tables etc), to be controlled by Mandatory
> > > Access Control API's. I.e grant access to eth0 and to only certain
> > > processes.  Some the issues with the existing models are:
> > >   * eth0 and associated objects don't really exist in filesystem so
> > >     not subject to LSM style control (SeLinux/SMACK/TOMOYO)

As Stephen points out, SELinux does have the ability to assign security labels 
to network interfaces, check out the 'semanage' command.  A while back I wrote 
up something about the SELinux network "ingress/egress" access controls:

 * http://paulmoore.livejournal.com/2128.html

Smack doesn't support controlling network access at the interface level, but 
that is due to a Smack design decision and not an inherent functionality gap 
in the LSM.  TOMOYO is currently working on improved network access controls 
(see patches posted earlier this week), I haven't had a chance to review them 
yet so I don't know the state of TOMOYO's network access controls.

> > >   * network namespaces do not allow object to exist in multiple
> > > namespaces. The current model is more restrictive than chroot jails. At
> > > least with chroot, put filesystem objects in multiple jails.

Perhaps I don't fully understand what you are getting at here, but I don't 
think this should be an issue with a flexible LSM.

> > Is there something that prevents you from using the existing SELinux
> > network access controls?  netif is a security class governed by SELinux
> > policy, and routing table operations would be covered by the SELinux
> > checks on netlink_route_socket.  SELinux uses a combination of LSM hooks
> > and netfilter hooks to mediate network operations.
> 
> Also, depending on what you want to do, SECMARK may be useful to you.
> That allows you to mark packets with security contexts via iptables, and
> then use SELinux policy to control their flow.
> http://paulmoore.livejournal.com/4281.html
> http://james-morris.livejournal.com/11010.html

While we're at it, a few more links ... here is a presentation from last year 
on Linux's labeled networking capabilities (which hits at a lot of your 
questions):

 * http://paulmoore.livejournal.com/964.html

... and there is a video too:

 * http://paulmoore.livejournal.com/1329.html

-- 
paul moore
linux @ hp

^ permalink raw reply

* Re: [PATCH 0/8] SECURITY ISSUE with connector
From: Greg KH @ 2009-10-09 22:25 UTC (permalink / raw)
  To: Philipp Reisner
  Cc: linux-fbdev-devel, netdev, linux-kernel, dm-devel,
	Evgeniy Polyakov, Andrew Morton, David S. Miller
In-Reply-To: <1254487211-11810-1-git-send-email-philipp.reisner@linbit.com>

On Fri, Oct 02, 2009 at 02:40:03PM +0200, Philipp Reisner wrote:
> Affected: All code that uses connector, in kernel and out of mainline
> 
> The connector, as it is today, does not allow the in kernel receiving
> parts to do any checks on privileges of a message's sender.
> 
> I know, there are not many out there that like connector, but as
> long as it is in the kernel, we have to fix the security issues it has!
> 
> Please either drop connector, or someone who feels a bit responsible
> and has our beloved dictator's blessing, PLEASE PLEASE PLEASE take 
> this into your tree, and send the pull request to Linus.
> 
> Patches 1 to 4 are already Acked-by Evgeny, the connector's maintainer.
> Patches 5 to 7 are the obvious fixes to the connector user's code.

These don't apply to the 2.6.31-stable tree at all.

Could you provide them backported to that tree if you want to see them
go into a .31-stable release?

thanks,

greg k-h

^ permalink raw reply

* Re: behaviour question for igb on nehalem box
From: Chris Friesen @ 2009-10-09 22:31 UTC (permalink / raw)
  To: Brandeburg, Jesse
  Cc: e1000-list, Linux Network Development list, Allan, Bruce W,
	Ronciak, John, Kirsher, Jeffrey T
In-Reply-To: <alpine.WNT.2.00.0910091250440.5328@jbrandeb-desk1.amr.corp.intel.com>

On 10/09/2009 02:22 PM, Brandeburg, Jesse wrote:
> On Fri, 9 Oct 2009, Chris Friesen wrote:
>> I've got some general questions around the expected behaviour of the
>> 82576 igb net device.  (On a dual quad-core Nehalem box, if it matters.)

> the hardware you have only supports 8 
> queues (rx and tx) and the driver is configured to only set up 4 max.

The datasheet for the 82576 says 16 tx queues and 16 rx queues.  Is that
a typo or do we have the economy version?

>> My second question is around how the rx queues are mapped to interrupts.
>>  According to /proc/interrupts there appears to be a 1:1 mapping between
>> queues and interrupts.  However, I've set up at test with a given amount
>> of traffic coming in to the device (from 4 different IP addresses and 4
>> ports).  Under this scenario, "ethtool -S" shows the number of packets
>> increasing for only rx queue 0, but I see the interrupt count going up
>> for two interrupts.
> 
> one transmit interrupt and one receive interrupt?

No, two rx interrupts.  (Can't remember if the tx interrupt was going up
as well or no...was only looking at rx.)

> RSS will spread the 
> receive work out in a flow based way, based on ip/xDP header.  Your test 
> as described should be using more than one flow (and therefore more than 
> one rx queue) unless you got caught out by the default arp_filter 
> behavior (check arp -an).

I was surprised as well since it didn't match what I expected.  What's
the story around the arp_filter?  I just logged onto the test box and
"arp -an" gives:

? (47.135.251.129) at 00:00:5E:00:01:08 [ether] on eth0

but I'm not sure that's worth anything since someone is running a test
and it's currently using all four rx queues and all four rx interrupt
counts are increasing.  I'll have to see if they changed anything.


> Hope this helps,

That's great, thanks.

Chris

^ permalink raw reply

* Re: netconf notes and materials
From: Peter P Waskiewicz Jr @ 2009-10-09 23:01 UTC (permalink / raw)
  To: David Miller; +Cc: billfink@mindspring.com, netdev@vger.kernel.org
In-Reply-To: <20091009.145151.130967664.davem@davemloft.net>

On Fri, 2009-10-09 at 14:51 -0700, David Miller wrote:
> From: Bill Fink <billfink@mindspring.com>
> Date: Fri, 9 Oct 2009 17:49:49 -0400
> 
> > How did you do the NUMA memory performance monitoring that was
> > presented on one of your slides?
> 
> Using proprietary internal tools Intel is unlikely to release.
> 
> On the bright side, some of those metrics will make their way into the
> 'perf' facilities in the kernel so they can be monitored, but not all
> of them.

Yes, they are tools written by our CPU and chipset teams to assist in
debug.  To reinforce what David just said, I know Jesse Barnes is
working hard with the Intel powers-that-be to get the "approved" public
PMU counters into the perf utility.  I'd imagine the PMU counters for
IOH memory throughput will be deemed ok, since it's not uncovering any
IP.

If I hear anything about any of the performance counters, I'll be sure
to forward that on.

Cheers,
-PJ

^ permalink raw reply

* Re: behaviour question for igb on nehalem box
From: Alexander Duyck @ 2009-10-09 23:20 UTC (permalink / raw)
  To: Chris Friesen
  Cc: e1000-list, gospo@redhat.com, Linux Network Development list,
	Allan, Bruce W, Brandeburg, Jesse, Ronciak, John,
	Kirsher, Jeffrey T
In-Reply-To: <4ACFB9DF.9080909@nortel.com>

Chris Friesen wrote:
> On 10/09/2009 02:22 PM, Brandeburg, Jesse wrote:
>> On Fri, 9 Oct 2009, Chris Friesen wrote:
>>> I've got some general questions around the expected behaviour of the
>>> 82576 igb net device.  (On a dual quad-core Nehalem box, if it matters.)
> 
>> the hardware you have only supports 8 
>> queues (rx and tx) and the driver is configured to only set up 4 max.
> 
> The datasheet for the 82576 says 16 tx queues and 16 rx queues.  Is that
> a typo or do we have the economy version?

Actually the limitation is due to the fact that there are only 10 
interrupts available.  On kernels that support TX multi-queue the number 
of queues would be 4 TX and 4 RX, which would consume 8 interrupts 
leaving 1 for the link status change and one unused.
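
The vector arithmetic in the paragraph above can be checked with a small helper. This is only a worked example of the numbers given (10 vectors, 1 reserved for link status); max_queue_pairs() and spare_vectors() are hypothetical names and do not describe how the igb driver actually chooses its queue count.

```c
/* Number of TX/RX queue pairs that fit once 'other' vectors are reserved. */
unsigned int max_queue_pairs(unsigned int vectors, unsigned int other)
{
	if (vectors <= other)
		return 0;
	return (vectors - other) / 2;
}

/* Vectors left over after the reserved ones and the queue pairs are assigned. */
unsigned int spare_vectors(unsigned int vectors, unsigned int other)
{
	if (vectors <= other)
		return 0;
	return vectors - other - 2 * max_queue_pairs(vectors, other);
}
```

With 10 vectors and 1 reserved, this gives 4 queue pairs (8 vectors) and 1 spare vector, matching the figures quoted.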

However on the kernel you are using I don't believe multi-queue NAPI is 
enabled so you shouldn't have multiple RX queues either.  On a 2.6.18 
kernel you should have only 1 RX and 1 TX queue unless you are using the 
driver provided on e1000.sourceforge.net which uses fake netdevs to 
support multi-queue NAPI.  I believe this may be a bug that was 
introduced when SR-IOV support was back-ported from the 2.6.30 kernel.

>>> My second question is around how the rx queues are mapped to interrupts.
>>>  According to /proc/interrupts there appears to be a 1:1 mapping between
>>> queues and interrupts.  However, I've set up at test with a given amount
>>> of traffic coming in to the device (from 4 different IP addresses and 4
>>> ports).  Under this scenario, "ethtool -S" shows the number of packets
>>> increasing for only rx queue 0, but I see the interrupt count going up
>>> for two interrupts.
>> one transmit interrupt and one receive interrupt?
> 
> No, two rx interrupts.  (Can't remember if the tx interrupt was going up
> as well or no...was only looking at rx.)

This may be due to the bug I mentioned above.  Multiple RX queues 
shouldn't be present on the 2.6.18 kernel as I do not believe 
multi-queue NAPI has been back-ported and it could have negative effects.

>> RSS will spread the 
>> receive work out in a flow based way, based on ip/xDP header.  Your test 
>> as described should be using more than one flow (and therefore more than 
>> one rx queue) unless you got caught out by the default arp_filter 
>> behavior (check arp -an).
> 
> I was surprised as well since it didn't match what I expected.  What's
> the story around the arp_filter?  I just logged onto the test box and
> "arp -an" gives:
> 
> ? (47.135.251.129) at 00:00:5E:00:01:08 [ether] on eth0
> 
> but I'm not sure that's worth anything since someone is running a test
> and it's currently using all four rx queues and all four rx interrupt
> counts are increasing.  I'll have to see if they changed anything.
> 
> 
>> Hope this helps,
> 
> That's great, thanks.
> 
> Chris
> 

^ permalink raw reply

* Re: [PATCH] Generalize socket rx gap / receive queue overflow cmsg (v2)
From: Neil Horman @ 2009-10-09 23:21 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem, socketcan
In-Reply-To: <4ACFABAE.5050003@gmail.com>

On Fri, Oct 09, 2009 at 11:31:26PM +0200, Eric Dumazet wrote:
> Neil Horman a écrit :
> 
> >  
> > +extern void __sock_recv_ts_and_drops(struct msghdr *msg, struct sock *sk,
> > +	struct sk_buff *skb);
> 
> Surely you meant __sock_recv_drops() ? It only deals with drops.
> 
No, I certainly meant both.  The definition clearly handles both the timestamp
cmsg and the drops cmsg.  That way we don't need to make two calls in the
receive path for these.

> 
> > +	case SO_RXQ_OVFL:
> > +		v.val = sock_flag(sk, SOCK_RXQ_OVFL);
> > +		break;
> > +
> 
> Hmm, I advise to use v.val = !!sock_flag(sk, SOCK_RXQ_OVFL);
> So that application gets 0 or 1, not 0 or some big value.
> Its better because it allows us to change internal SOCK_RXQ_OVFL if necessary in the future.
> 
I don't really see any difference; sock_flag is simply a wrapper around
test_bit.  I can change it if you really need it, but it just looks like additional
operations to me.

> >  drop_n_acct:
> > -	spin_lock(&sk->sk_receive_queue.lock);
> > -	po->stats.tp_drops++;
> > -	spin_unlock(&sk->sk_receive_queue.lock);
> > +	po->stats.tp_drops = atomic_inc_return(&sk->sk_drops);
> 
> Yes :)
> 
> >  EXPORT_SYMBOL_GPL(__sock_recv_timestamp);
> >  
> > +void __sock_recv_ts_and_drops(struct msghdr *msg, struct sock *sk,
> > +	struct sk_buff *skb)
> > +{
> > +	put_cmsg(msg, SOL_SOCKET, SO_RXQ_OVFL, sizeof(__u32), &skb->dropcount);
> > +}
> > +EXPORT_SYMBOL_GPL(__sock_recv_ts_and_drops);
> > +
> 
> Just change the name.
> 
No.  I'm differentiating from sock_recv_timestamp here, as I'm concerned about
the case in which we enqueue to sk_error_queue.  For those cases, in which
recvmsg is called with MSG_ERRQUEUE, I don't think it's right to apply a cmsg
based on skb->dropcount, since the frame may have been from a tx path, or may
not have had dropcount set in the first place.  timestamp is still recorded
there, but I don't think we should mark the dropcount.

> And is it really too large to be inlined ?
> 
No, I was just following the style of sock_recv_timestamp.  

> In the contrary, sock_recv_timestamp() is so large that I suspect
> your sock_recv_ts_and_drops should *not* be inlined, and include inlined versions only :
> 
> I suggest something more orthogonal like :
> 
> void inline sock_recv_drops(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
> {
> 	if (sock_flag(sk, SOCK_RXQ_OVFL) && skb && skb->dropcount)
> 		put_cmsg(msg, SOL_SOCKET, SO_RXQ_OVFL,
> 			 sizeof(__u32), &skb->dropcount);
> }
> 
> void sock_recv_ts_and_drops(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
> {
> 	sock_recv_timestamp(msg, sk, skb); // inlined
> 	sock_recv_drops(msg, sk, skb); // inlined
> }
> EXPORT_SYMBOL_GPL(sock_recv_ts_and_drops)
Fine.

^ permalink raw reply

* Re: netconf notes and materials
From: Krzysztof Halasa @ 2009-10-09 23:31 UTC (permalink / raw)
  To: Peter P Waskiewicz Jr
  Cc: David Miller, billfink@mindspring.com, netdev@vger.kernel.org
In-Reply-To: <1255129289.8937.39.camel@localhost.localdomain>

Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> writes:

> Yes, they are tools written by our CPU and chipset teams to assist in
> debug.  To reinforce what David just said, I know Jesse Barnes is
> working hard with the Intel powers-that-be to get the "approved" public
> PMU counters into the perf utility.  I'd imagine the PMU counters for
> IOH memory throughput will be deemed ok, since it's not uncovering any
> IP.

Ehm, perhaps something similar could also happen to the source of the IXP4xx
microcode and/or the NPE docs? Not sure about IP, though isn't any IP there
already patented (and thus available)?

This is especially needed WRT HSS (sync serial port).

:-)
-- 
Krzysztof Halasa

^ permalink raw reply

* Re: netconf notes and materials
From: Rick Jones @ 2009-10-09 23:35 UTC (permalink / raw)
  To: Peter P Waskiewicz Jr
  Cc: David Miller, billfink@mindspring.com, netdev@vger.kernel.org
In-Reply-To: <1255129289.8937.39.camel@localhost.localdomain>

Peter P Waskiewicz Jr wrote:
> On Fri, 2009-10-09 at 14:51 -0700, David Miller wrote:
> 
>>From: Bill Fink <billfink@mindspring.com>
>>Date: Fri, 9 Oct 2009 17:49:49 -0400
>>
>>
>>>How did you do the NUMA memory performance monitoring that was
>>>presented on one of your slides?
>>
>>Using proprietary internal tools Intel is unlikely to release.
>>
>>On the bright side, some of those metrics will make their way into the
>>'perf' facilities in the kernel so they can be monitored, but not all
>>of them.
> 
> 
> Yes, they are tools written by our CPU and chipset teams to assist in
> debug.  To reinforce what David just said, I know Jesse Barnes is
> working hard with the Intel powers-that-be to get the "approved" public
> PMU counters into the perf utility.  I'd imagine the PMU counters for
> IOH memory throughput will be deemed ok, since it's not uncovering any
> IP.
> 
> If I hear anything about any of the performance counters, I'll be sure
> to forward that on.

From the standpoint of I/O, anything that might enable a port of:

http://pcitop.berlios.de/

would be goodness.

rick jones

^ permalink raw reply

* Re: [PATCH 2/8] bitmap: Introduce bitmap_set, bitmap_clear, bitmap_find_next_zero_area
From: Andrew Morton @ 2009-10-09 23:41 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: Fenghua Yu, Greg Kroah-Hartman, linux-ia64, Tony Luck, x86,
	netdev, linux-kernel, linux-altix, Yevgeny Petrilin,
	FUJITA Tomonori, linuxppc-dev, Ingo Molnar, Paul Mackerras,
	H. Peter Anvin, sparclinux, Thomas Gleixner, linux-usb,
	David S. Miller, Lothar Wassmann
In-Reply-To: <1255076961-21325-2-git-send-email-akinobu.mita@gmail.com>

On Fri,  9 Oct 2009 17:29:15 +0900
Akinobu Mita <akinobu.mita@gmail.com> wrote:

> This introduces new bitmap functions:
> 
> bitmap_set: Set specified bit area
> bitmap_clear: Clear specified bit area
> bitmap_find_next_zero_area: Find free bit area
> 
> These are stolen from iommu helper.
> 
> I changed the return value of bitmap_find_next_zero_area if there is
> no zero area.
> 
> find_next_zero_area in iommu helper: returns -1
> bitmap_find_next_zero_area: return >= bitmap size

I'll plan to merge this patch into 2.6.32 so we can trickle all the
other patches into subsystems in an orderly fashion.

> +void bitmap_set(unsigned long *map, int i, int len)
> +{
> +	int end = i + len;
> +
> +	while (i < end) {
> +		__set_bit(i, map);
> +		i++;
> +	}
> +}

This is really inefficient, isn't it?  It's a pretty trivial matter to
romp through memory 32 or 64 bits at a time.
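
One way to do what is suggested here is to handle the partial first and last words with masks and fill whole words in between. The following is a userspace sketch, not the code that was eventually merged; bitmap_set_words() and the two mask helpers are hypothetical names, and the caller is assumed to guarantee that the range fits in the map.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Bits from (start % BITS_PER_LONG) up to the top of the word. */
static unsigned long first_word_mask(unsigned int start)
{
	return ~0UL << (start % BITS_PER_LONG);
}

/* Bits below (end % BITS_PER_LONG); the whole word if end is word aligned. */
static unsigned long last_word_mask(unsigned int end)
{
	unsigned int rem = end % BITS_PER_LONG;

	return rem ? ~0UL >> (BITS_PER_LONG - rem) : ~0UL;
}

/* Set bits [start, start + nr) in map, one word at a time. */
void bitmap_set_words(unsigned long *map, unsigned int start, unsigned int nr)
{
	unsigned int end = start + nr;
	size_t first, last, i;

	if (nr == 0)
		return;

	first = start / BITS_PER_LONG;
	last = (end - 1) / BITS_PER_LONG;

	if (first == last) {
		/* The whole range lives inside a single word. */
		map[first] |= first_word_mask(start) & last_word_mask(end);
		return;
	}

	map[first] |= first_word_mask(start);
	for (i = first + 1; i < last; i++)
		map[i] = ~0UL;		/* full words need no masking */
	map[last] |= last_word_mask(end);
}
```

A clearing variant would follow the same shape, using &= ~mask for the edge words and 0 for the full words.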

> +EXPORT_SYMBOL(bitmap_set);
> +
> +void bitmap_clear(unsigned long *map, int start, int nr)
> +{
> +	int end = start + nr;
> +
> +	while (start < end) {
> +		__clear_bit(start, map);
> +		start++;
> +	}
> +}
> +EXPORT_SYMBOL(bitmap_clear);

Ditto.

> +unsigned long bitmap_find_next_zero_area(unsigned long *map,
> +					 unsigned long size,
> +					 unsigned long start,
> +					 unsigned int nr,
> +					 unsigned long align_mask)
> +{
> +	unsigned long index, end, i;
> +again:
> +	index = find_next_zero_bit(map, size, start);
> +
> +	/* Align allocation */
> +	index = (index + align_mask) & ~align_mask;
> +
> +	end = index + nr;
> +	if (end >= size)
> +		return end;
> +	i = find_next_bit(map, end, index);
> +	if (i < end) {
> +		start = i + 1;
> +		goto again;
> +	}
> +	return index;
> +}
> +EXPORT_SYMBOL(bitmap_find_next_zero_area);

This needs documentation, please.  It appears that `size' is the size
of the bitmap and `nr' is the number of zeroed bits we're looking for,
but an inattentive programmer could get those reversed.

Also the semantics of `align_mask' could benefit from spelling out.  Is
the alignment with respect to memory boundaries or with respect to
`map' or with respect to map+start or what?

And why does align_mask exist at all?  I was a bit surprised to see it
there.  In which scenarios will it be non-zero?
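
Reading only the quoted code, the alignment step rounds the candidate bit index itself up, so the result is aligned relative to bit 0 of `map`, and the arithmetic only works if align_mask has the form 2^k - 1. A tiny sketch of that one step, with align_index() as a hypothetical name:

```c
/*
 * Round index up to the next multiple of (align_mask + 1).
 * Assumes align_mask is of the form 2^k - 1; a mask of 0 means
 * no alignment is applied.
 */
unsigned long align_index(unsigned long index, unsigned long align_mask)
{
	return (index + align_mask) & ~align_mask;
}
```

For example, with align_mask = 7 a first free bit at index 5 is moved up to index 8, while an index that is already a multiple of 8 is left unchanged.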

^ permalink raw reply

* Re: behaviour question for igb on nehalem box
From: Alexander Duyck @ 2009-10-09 23:48 UTC (permalink / raw)
  To: Chris Friesen
  Cc: e1000-list, Linux Network Development list, Allan, Bruce W,
	Brandeburg, Jesse, Ronciak, John, Kirsher, Jeffrey T,
	gospo@redhat.com
In-Reply-To: <4ACFC52F.4050509@intel.com>

Alexander Duyck wrote:
> Chris Friesen wrote:
>> On 10/09/2009 02:22 PM, Brandeburg, Jesse wrote:
>>> On Fri, 9 Oct 2009, Chris Friesen wrote:
>>>> I've got some general questions around the expected behaviour of the
>>>> 82576 igb net device.  (On a dual quad-core Nehalem box, if it matters.)
>>> the hardware you have only supports 8 
>>> queues (rx and tx) and the driver is configured to only set up 4 max.
>> The datasheet for the 82576 says 16 tx queues and 16 rx queues.  Is that
>> a typo or do we have the economy version?
> 
> Actually the limitation is due to the fact that there are only 10 
> interrupts available.  On kernels that support TX multi-queue the number 
> of queues would be 4 TX and 4 RX, which would consume 8 interrupts 
> leaving 1 for the link status change and one unused.
> 
> However on the kernel you are using I don't believe multi-queue NAPI is 
> enabled so you shouldn't have multiple RX queues either.  On a 2.6.18 
> kernel you should have only 1 RX and 1 TX queue unless you are using the 
> driver provided on e1000.sourceforge.net which uses fake netdevs to 
> support multi-queue NAPI.  I believe this may be a bug that was 
> introduced when SR-IOV support was back-ported from the 2.6.30 kernel.

Actually, after looking closer at the Red Hat source, it looks like they 
have done the fake netdev workaround in their own code, so I guess the igb 
driver in the RHEL kernel does support multiple RX queues.

>>>> My second question is around how the rx queues are mapped to interrupts.
>>>>  According to /proc/interrupts there appears to be a 1:1 mapping between
>>>> queues and interrupts.  However, I've set up at test with a given amount
>>>> of traffic coming in to the device (from 4 different IP addresses and 4
>>>> ports).  Under this scenario, "ethtool -S" shows the number of packets
>>>> increasing for only rx queue 0, but I see the interrupt count going up
>>>> for two interrupts.
>>> one transmit interrupt and one receive interrupt?
>> No, two rx interrupts.  (Can't remember if the tx interrupt was going up
>> as well or no...was only looking at rx.)
> 
> This may be due to the bug I mentioned above.  Multiple RX queues 
> shouldn't be present on the 2.6.18 kernel as I do not believe 
> multi-queue NAPI has been back-ported and it could have negative effects.

The odds of any 2 flows landing on the same queue when you are only using 
4 flows are pretty high, especially if the addresses/ports are close in 
range.  You typically need something on the order of about 16 flows over 
a wide range of port numbers in order to get a good distribution.
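
A rough back-of-the-envelope check of that claim, as a sketch only: it
assumes each flow is hashed uniformly and independently onto one of 4 RX
queues, which real RSS hashing of closely spaced addresses/ports will not
quite achieve. The function names are hypothetical. Under that assumption
4 flows land on 4 distinct queues only about 9% of the time, and any given
queue stays idle about 32% of the time with 4 flows versus about 1% with
16 flows.

```c
/*
 * Probability that `flows` flows, each hashed uniformly and independently
 * onto one of `queues` RX queues, all land on distinct queues.
 */
double p_all_distinct(unsigned int flows, unsigned int queues)
{
	double p = 1.0;
	unsigned int i;

	if (flows > queues)
		return 0.0;	/* pigeonhole: some queue must be shared */
	for (i = 0; i < flows; i++)
		p *= (double)(queues - i) / queues;
	return p;
}

/*
 * Probability that one particular queue receives none of the `flows`
 * flows under the same uniform-hash assumption.
 */
double p_queue_idle(unsigned int flows, unsigned int queues)
{
	double p = 1.0;
	unsigned int i;

	for (i = 0; i < flows; i++)
		p *= 1.0 - 1.0 / queues;
	return p;
}
```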

Thanks,

Alex




^ permalink raw reply

* RE: [PATCH 2.6.32-rc3] net: VMware virtual Ethernet NIC driver: vmxnet3
From: Shreyas Bhatewara @ 2009-10-09 23:52 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Jeff, pv-drivers, netdev, linux-kernel, Andrew, Wright,
	Anthony Liguori, Greg Kroah-Hartman, Chris, Morton,
	virtualization, Garzik, David S. Miller
In-Reply-To: <20091009143538.644844aa@nehalam>

> -----Original Message-----
> From: Stephen Hemminger [mailto:shemminger@linux-foundation.org]
> Sent: Friday, October 09, 2009 2:36 PM
> To: Shreyas Bhatewara
> Cc: linux-kernel; netdev; David S. Miller; Jeff Garzik; Anthony
> Liguori; Chris Wright; Greg Kroah-Hartman; Andrew Morton;
> virtualization; pv-drivers
> Subject: Re: [PATCH 2.6.32-rc3] net: VMware virtual Ethernet NIC
> driver: vmxnet3
> 
> On Thu, 8 Oct 2009 10:59:26 -0700 (PDT)
> Shreyas Bhatewara <sbhatewara@vmware.com> wrote:
> 
> > Hello all,
> >
> > I do not mean to be bothersome but this thread has been unusually
> silent.
> > Could you please review the patch for me and reply with your comments
> /
> > acks ?
> >
> > Thanks.
> > ->Shreyas
> 
> 
> Looks fine, but just a minor style nit (can be changed after insertion
> in mainline).
> 
> The code:
> 
> static void
> vmxnet3_do_poll(struct vmxnet3_adapter *adapter, int budget, int
> *txd_done,
> 		int *rxd_done)
> {
> 	if (unlikely(adapter->shared->ecr))
> 		vmxnet3_process_events(adapter);
> 
> 	*txd_done = vmxnet3_tq_tx_complete(&adapter->tx_queue, adapter);
> 	*rxd_done = vmxnet3_rq_rx_complete(&adapter->rx_queue, adapter,
> budget);
> }
> 
> 
> static int
> vmxnet3_poll(struct napi_struct *napi, int budget)
> {
> 	struct vmxnet3_adapter *adapter = container_of(napi,
> 					  struct vmxnet3_adapter, napi);
> 	int rxd_done, txd_done;
> 
> 	vmxnet3_do_poll(adapter, budget, &txd_done, &rxd_done);
> 
> 	if (rxd_done < budget) {
> 		napi_complete(napi);
> 		vmxnet3_enable_intr(adapter, 0);
> 	}
> 	return rxd_done;
> }
> 
> 
> Is simpler if you just have do_poll return rx done value. Probably Gcc
> inline's it all anyway.
> 
> static int
> vmxnet3_do_poll(struct vmxnet3_adapter *adapter, int budget)
> {
> 	if (unlikely(adapter->shared->ecr))
> 		vmxnet3_process_events(adapter);
> 
> 	vmxnet3_tq_tx_complete(&adapter->tx_queue, adapter);
> 	return vmxnet3_rq_rx_complete(&adapter->rx_queue, adapter,
> budget);
> }
> 
> 
> static int
> vmxnet3_poll(struct napi_struct *napi, int budget)
> {
> 	struct vmxnet3_adapter *adapter = container_of(napi,
> 					  struct vmxnet3_adapter, napi);
> 	int rxd_done;
> 
> 	rxd_done = vmxnet3_do_poll(adapter, budget);
> 	if (rxd_done < budget) {
> 		napi_complete(napi);
> 		vmxnet3_enable_intr(adapter, 0);
> 	}
> 	return rxd_done;
> }



Thanks Stephen.

Yes, the vmxnet3_do_poll() was an inline function in the very first patch. It was thought of as a better idea to let gcc handle the inlining.
I will piggyback this nit on a forthcoming change.

->Shreyas

^ permalink raw reply

* Re: [PATCH] Generalize socket rx gap / receive queue overflow cmsg (v3)
From: Neil Horman @ 2009-10-09 23:56 UTC (permalink / raw)
  To: netdev; +Cc: eric.dumazet, davem, socketcan, nhorman
In-Reply-To: <20091007180835.GB20524@hmsreliant.think-freely.org>

Ok, take 3, incorporating Eric's new notes

Change Notes:

1) Modified inlining of sock_recv_ts_and_drops to be more efficient

2) Modified getsockopt for SO_RXQ_OVFL to guarantee that it returns only 1 or 0

=============================================================


Create a new socket level option to report number of queue overflows

Recently I augmented the AF_PACKET protocol to report the number of frames lost
on the socket receive queue between any two enqueued frames.  This value was
exported via a SOL_PACKET level cmsg.  After I completed that work it was
requested that this feature be generalized so that any datagram oriented socket
could make use of this option.  As such I've created this patch.  It creates a
new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
SOL_SOCKET level cmsg that reports the number of times the sk_receive_queue
overflowed between any two given frames.  It also augments the AF_PACKET
protocol to take advantage of this new feature (as it previously did not touch
sk->sk_drops, which this patch uses to record the overflow count).  Tested
successfully by me.

Notes:

1) Unlike my previous patch, this patch simply records the sk_drops value, which
is not a number of drops between packets, but rather a total number of drops.
Deltas must be computed in user space.

2) While this patch currently works with datagram oriented protocols, it will
also be accepted by non-datagram oriented protocols. I'm not sure if that's
agreeable to everyone, but my argument in favor of doing so is that, for those
protocols which aren't applicable to this option, sk_drops will always be zero,
and reporting no drops on a receive queue that isn't used for those
non-participating protocols seems reasonable to me.  This also saves us having
to code in a per-protocol opt in mechanism.

3) This applies cleanly to net-next assuming that commit
977750076d98c7ff6cbda51858bb5a5894a9d9ab (my af packet cmsg patch) is reverted

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>


 include/asm-generic/socket.h |    1 +
 include/linux/skbuff.h       |    6 ++++--
 include/net/sock.h           |    3 +++
 net/atm/common.c             |    2 +-
 net/bluetooth/af_bluetooth.c |    2 +-
 net/bluetooth/rfcomm/sock.c  |    2 +-
 net/can/bcm.c                |    2 +-
 net/can/raw.c                |    2 +-
 net/core/sock.c              |   17 ++++++++++++++++-
 net/ieee802154/dgram.c       |    2 +-
 net/ieee802154/raw.c         |    2 +-
 net/ipv4/raw.c               |    2 +-
 net/ipv4/udp.c               |    2 +-
 net/ipv6/raw.c               |    2 +-
 net/ipv6/udp.c               |    2 +-
 net/key/af_key.c             |    2 +-
 net/packet/af_packet.c       |    7 +++----
 net/rxrpc/ar-recvmsg.c       |    2 +-
 net/sctp/socket.c            |    2 +-
 net/socket.c                 |   16 ++++++++++++++++
 20 files changed, 57 insertions(+), 21 deletions(-)

diff --git a/include/asm-generic/socket.h b/include/asm-generic/socket.h
index 538991c..9a6115e 100644
--- a/include/asm-generic/socket.h
+++ b/include/asm-generic/socket.h
@@ -63,4 +63,5 @@
 #define SO_PROTOCOL		38
 #define SO_DOMAIN		39
 
+#define SO_RXQ_OVFL             40
 #endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -389,8 +389,10 @@ struct sk_buff {
 #ifdef CONFIG_NETWORK_SECMARK
 	__u32			secmark;
 #endif
-
-	__u32			mark;
+	union {
+		__u32		mark;
+		__u32		dropcount;
+	};
 
 	__u16			vlan_tci;
 
diff --git a/include/net/sock.h b/include/net/sock.h
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -505,6 +505,7 @@ enum sock_flags {
 	SOCK_TIMESTAMPING_RAW_HARDWARE, /* %SOF_TIMESTAMPING_RAW_HARDWARE */
 	SOCK_TIMESTAMPING_SYS_HARDWARE, /* %SOF_TIMESTAMPING_SYS_HARDWARE */
 	SOCK_FASYNC, /* fasync() active */
+	SOCK_RXQ_OVFL,
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -1493,6 +1494,8 @@ sock_recv_timestamp(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
 		sk->sk_stamp = kt;
 }
 
+extern void sock_recv_ts_and_drops(struct msghdr *msg, struct sock *sk, struct sk_buff *skb);
+
 /**
  * sock_tx_timestamp - checks whether the outgoing packet is to be time stamped
  * @msg:	outgoing packet
diff --git a/net/atm/common.c b/net/atm/common.c
--- a/net/atm/common.c
+++ b/net/atm/common.c
@@ -496,7 +496,7 @@ int vcc_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 	error = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
 	if (error)
 		return error;
-	sock_recv_timestamp(msg, sk, skb);
+	sock_recv_ts_and_drops(msg, sk, skb);
 	pr_debug("RcvM %d -= %d\n", atomic_read(&sk->sk_rmem_alloc), skb->truesize);
 	atm_return(vcc, skb->truesize);
 	skb_free_datagram(sk, skb);
diff --git a/net/bluetooth/af_bluetooth.c b/net/bluetooth/af_bluetooth.c
--- a/net/bluetooth/af_bluetooth.c
+++ b/net/bluetooth/af_bluetooth.c
@@ -257,7 +257,7 @@ int bt_sock_recvmsg(struct kiocb *iocb, struct socket *sock,
 	skb_reset_transport_header(skb);
 	err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
 	if (err == 0)
-		sock_recv_timestamp(msg, sk, skb);
+		sock_recv_ts_and_drops(msg, sk, skb);
 
 	skb_free_datagram(sk, skb);
 
diff --git a/net/bluetooth/rfcomm/sock.c b/net/bluetooth/rfcomm/sock.c
--- a/net/bluetooth/rfcomm/sock.c
+++ b/net/bluetooth/rfcomm/sock.c
@@ -703,7 +703,7 @@ static int rfcomm_sock_recvmsg(struct kiocb *iocb, struct socket *sock,
 		copied += chunk;
 		size   -= chunk;
 
-		sock_recv_timestamp(msg, sk, skb);
+		sock_recv_ts_and_drops(msg, sk, skb);
 
 		if (!(flags & MSG_PEEK)) {
 			atomic_sub(chunk, &sk->sk_rmem_alloc);
diff --git a/net/can/bcm.c b/net/can/bcm.c
--- a/net/can/bcm.c
+++ b/net/can/bcm.c
@@ -1534,7 +1534,7 @@ static int bcm_recvmsg(struct kiocb *iocb, struct socket *sock,
 		return err;
 	}
 
-	sock_recv_timestamp(msg, sk, skb);
+	sock_recv_ts_and_drops(msg, sk, skb);
 
 	if (msg->msg_name) {
 		msg->msg_namelen = sizeof(struct sockaddr_can);
diff --git a/net/can/raw.c b/net/can/raw.c
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -702,7 +702,7 @@ static int raw_recvmsg(struct kiocb *iocb, struct socket *sock,
 		return err;
 	}
 
-	sock_recv_timestamp(msg, sk, skb);
+	sock_recv_ts_and_drops(msg, sk, skb);
 
 	if (msg->msg_name) {
 		msg->msg_namelen = sizeof(struct sockaddr_can);
diff --git a/net/core/sock.c b/net/core/sock.c
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -276,6 +276,8 @@ int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	int err = 0;
 	int skb_len;
+	unsigned long flags;
+	struct sk_buff_head *list = &sk->sk_receive_queue;
 
 	/* Cast sk->rcvbuf to unsigned... It's pointless, but reduces
 	   number of warnings when compiling with -W --ANK
@@ -305,7 +307,10 @@ int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	 */
 	skb_len = skb->len;
 
-	skb_queue_tail(&sk->sk_receive_queue, skb);
+	spin_lock_irqsave(&list->lock, flags);
+	skb->dropcount = atomic_read(&sk->sk_drops);
+	__skb_queue_tail(list, skb);
+	spin_unlock_irqrestore(&list->lock, flags);
 
 	if (!sock_flag(sk, SOCK_DEAD))
 		sk->sk_data_ready(sk, skb_len);
@@ -702,6 +707,12 @@ set_rcvbuf:
 
 		/* We implement the SO_SNDLOWAT etc to
 		   not be settable (1003.1g 5.3) */
+	case SO_RXQ_OVFL:
+		if (valbool)
+			sock_set_flag(sk, SOCK_RXQ_OVFL);
+		else
+			sock_reset_flag(sk, SOCK_RXQ_OVFL);
+		break;
 	default:
 		ret = -ENOPROTOOPT;
 		break;
@@ -901,6 +912,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
 		v.val = sk->sk_mark;
 		break;
 
+	case SO_RXQ_OVFL:
+		v.val = !!sock_flag(sk, SOCK_RXQ_OVFL);
+		break;
+
 	default:
 		return -ENOPROTOOPT;
 	}
diff --git a/net/ieee802154/dgram.c b/net/ieee802154/dgram.c
--- a/net/ieee802154/dgram.c
+++ b/net/ieee802154/dgram.c
@@ -303,7 +303,7 @@ static int dgram_recvmsg(struct kiocb *iocb, struct sock *sk,
 	if (err)
 		goto done;
 
-	sock_recv_timestamp(msg, sk, skb);
+	sock_recv_ts_and_drops(msg, sk, skb);
 
 	if (flags & MSG_TRUNC)
 		copied = skb->len;
diff --git a/net/ieee802154/raw.c b/net/ieee802154/raw.c
--- a/net/ieee802154/raw.c
+++ b/net/ieee802154/raw.c
@@ -191,7 +191,7 @@ static int raw_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	if (err)
 		goto done;
 
-	sock_recv_timestamp(msg, sk, skb);
+	sock_recv_ts_and_drops(msg, sk, skb);
 
 	if (flags & MSG_TRUNC)
 		copied = skb->len;
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -682,7 +682,7 @@ static int raw_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	if (err)
 		goto done;
 
-	sock_recv_timestamp(msg, sk, skb);
+	sock_recv_ts_and_drops(msg, sk, skb);
 
 	/* Copy the address. */
 	if (sin) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -951,7 +951,7 @@ try_again:
 		UDP_INC_STATS_USER(sock_net(sk),
 				UDP_MIB_INDATAGRAMS, is_udplite);
 
-	sock_recv_timestamp(msg, sk, skb);
+	sock_recv_ts_and_drops(msg, sk, skb);
 
 	/* Copy the address. */
 	if (sin) {
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -497,7 +497,7 @@ static int rawv6_recvmsg(struct kiocb *iocb, struct sock *sk,
 			sin6->sin6_scope_id = IP6CB(skb)->iif;
 	}
 
-	sock_recv_timestamp(msg, sk, skb);
+	sock_recv_ts_and_drops(msg, sk, skb);
 
 	if (np->rxopt.all)
 		datagram_recv_ctl(sk, msg, skb);
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -252,7 +252,7 @@ try_again:
 					UDP_MIB_INDATAGRAMS, is_udplite);
 	}
 
-	sock_recv_timestamp(msg, sk, skb);
+	sock_recv_ts_and_drops(msg, sk, skb);
 
 	/* Copy the address. */
 	if (msg->msg_name) {
diff --git a/net/key/af_key.c b/net/key/af_key.c
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -3606,7 +3606,7 @@ static int pfkey_recvmsg(struct kiocb *kiocb,
 	if (err)
 		goto out_free;
 
-	sock_recv_timestamp(msg, sk, skb);
+	sock_recv_ts_and_drops(msg, sk, skb);
 
 	err = (flags & MSG_TRUNC) ? skb->len : copied;
 
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -627,15 +627,14 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
 
 	spin_lock(&sk->sk_receive_queue.lock);
 	po->stats.tp_packets++;
+	skb->dropcount = atomic_read(&sk->sk_drops);
 	__skb_queue_tail(&sk->sk_receive_queue, skb);
 	spin_unlock(&sk->sk_receive_queue.lock);
 	sk->sk_data_ready(sk, skb->len);
 	return 0;
 
 drop_n_acct:
-	spin_lock(&sk->sk_receive_queue.lock);
-	po->stats.tp_drops++;
-	spin_unlock(&sk->sk_receive_queue.lock);
+	po->stats.tp_drops = atomic_inc_return(&sk->sk_drops);
 
 drop_n_restore:
 	if (skb_head != skb->data && skb_shared(skb)) {
@@ -1478,7 +1477,7 @@ static int packet_recvmsg(struct kiocb *iocb, struct socket *sock,
 	if (err)
 		goto out_free;
 
-	sock_recv_timestamp(msg, sk, skb);
+	sock_recv_ts_and_drops(msg, sk, skb);
 
 	if (msg->msg_name)
 		memcpy(msg->msg_name, &PACKET_SKB_CB(skb)->sa,
diff --git a/net/rxrpc/ar-recvmsg.c b/net/rxrpc/ar-recvmsg.c
--- a/net/rxrpc/ar-recvmsg.c
+++ b/net/rxrpc/ar-recvmsg.c
@@ -146,7 +146,7 @@ int rxrpc_recvmsg(struct kiocb *iocb, struct socket *sock,
 				memcpy(msg->msg_name,
 				       &call->conn->trans->peer->srx,
 				       sizeof(call->conn->trans->peer->srx));
-			sock_recv_timestamp(msg, &rx->sk, skb);
+			sock_recv_ts_and_drops(msg, &rx->sk, skb);
 		}
 
 		/* receive the message */
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1958,7 +1958,7 @@ SCTP_STATIC int sctp_recvmsg(struct kiocb *iocb, struct sock *sk,
 	if (err)
 		goto out_free;
 
-	sock_recv_timestamp(msg, sk, skb);
+	sock_recv_ts_and_drops(msg, sk, skb);
 	if (sctp_ulpevent_is_notification(event)) {
 		msg->msg_flags |= MSG_NOTIFICATION;
 		sp->pf->event_msgname(event, msg->msg_name, addr_len);
diff --git a/net/socket.c b/net/socket.c
--- a/net/socket.c
+++ b/net/socket.c
@@ -668,6 +668,22 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 
 EXPORT_SYMBOL_GPL(__sock_recv_timestamp);
 
+inline void sock_recv_drops(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
+{
+	if (sock_flag(sk, SOCK_RXQ_OVFL) && skb && skb->dropcount)
+		put_cmsg(msg, SOL_SOCKET, SO_RXQ_OVFL,
+			sizeof(__u32), &skb->dropcount);
+}
+
+void sock_recv_ts_and_drops(struct msghdr *msg, struct sock *sk,
+	struct sk_buff *skb)
+{
+	sock_recv_timestamp(msg, sk, skb);
+	sock_recv_drops(msg, sk, skb);
+	/* the drop count cmsg is emitted by sock_recv_drops() when enabled */
+}
+EXPORT_SYMBOL_GPL(sock_recv_ts_and_drops);
+
 static inline int __sock_recvmsg(struct kiocb *iocb, struct socket *sock,
 				 struct msghdr *msg, size_t size, int flags)
 {

^ permalink raw reply related

* Re: bisect results of MSI-X related panic (help!)
From: Jesse Brandeburg @ 2009-10-10  0:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Frans Pop, Jesse Brandeburg, linux-kernel, netdev, Ingo Molnar,
	hpa
In-Reply-To: <4AAE105E.1080005@kernel.org>

On Mon, Sep 14, 2009 at 2:43 AM, Tejun Heo <tj@kernel.org> wrote:
> Tejun Heo wrote:
>> Frans Pop wrote:
>>> Jesse Brandeburg wrote:
>>>> I've bisected, here is my bisect log, problem is that the commit
>>>> identified is a merge commit, and *I don't know what to revert to test*.
>>>> It appears the parent of the merge:
>>>> 6e15cf04860074ad032e88c306bea656bbdd0f22 is marked good, but looks to be
>>>> in a possibly related area to the panic.
>>> That merge does contain quite a few merge fixups, so it's quite possible
>>> one of them is the cause of the failure.
>>> Maybe the simplest way to verify that is to compile both parents of the
>>> merge to doublecheck that they work OK. Then, if a compile of the merge
>>> itself is bad, the problem really is in the merge commit itself.
>>>
>>> That commit is the "percpu" merge, so I've added Tejun (author of most of
>>> that branch) and Ingo (merger) in CC.
>>
>> Sorry, the oops doesn't ring a bell, well, not yet at least.  It would
>> be great if the bisection can be narrowed down more.
>
> Also, building w/ debug option on, capturing more oops traces and
> pasting gdb output of l *<oops address> might shed some more light.

Okay, it has been a while and I have an update on this issue.  The
actual panic seems to have disappeared in 2.6.32-rc1(2), however, with
CONFIG_CC_STACKPROTECTOR=y, I am still panicking, the stack protector
fault shows only this message, no backtrace is listed:

Kernel stack is corrupted in: ffffffff810b5b31

I've built with a full debug kernel before this crash, so I did:

(gdb) l *0xffffffff810b5b31
0xffffffff810b5b31 is in move_native_irq (kernel/irq/migration.c:67).
62			return;
63	
64		desc->chip->mask(irq);
65		move_masked_irq(irq);
66		desc->chip->unmask(irq);
>>> 67	}
68	
(gdb) l move_native_irq
54	void move_native_irq(int irq)
55	{
56		struct irq_desc *desc = irq_to_desc(irq);
57	
58		if (likely(!(desc->status & IRQ_MOVE_PENDING)))
59			return;
60	
61		if (unlikely(desc->status & IRQ_DISABLED))
62			return;
63	
64		desc->chip->mask(irq);
65		move_masked_irq(irq);
66		desc->chip->unmask(irq);
67	}

So this looks closely related to my panic: irqbalance or something else
is likely moving my interrupt from one core to another, and both the
original issue and this one reproduce with LOTS of MSI-X vectors
active.

- I tried connecting after the panic with kgdboc, no connection
- I tried kdump, but the same kernel I am using panics/hangs during
boot right after udev during the kexec() kernel boot (should I try
harder to get this working given it got so far?)
- I have ftrace function tracer running but no way to get at the log
post panic (wouldn't it be great if the kernel just dumped the ftrace
log on __stack_chk_fail?)

any other debugging tricks/ideas?

^ permalink raw reply

* Re: Real networking namespace
From: Stephen Hemminger @ 2009-10-10  2:08 UTC (permalink / raw)
  To: Paul Moore
  Cc: Stephen Smalley, linux-security-module, Al Viro, netdev,
	James Morris
In-Reply-To: <200910091812.16046.paul.moore@hp.com>

On Fri, 9 Oct 2009 18:12:15 -0400
Paul Moore <paul.moore@hp.com> wrote:

> On Friday 09 October 2009 12:44:52 pm Stephen Smalley wrote:
> > On Fri, 2009-10-09 at 12:37 -0400, Stephen Smalley wrote:
> > > On Fri, 2009-10-09 at 08:38 -0700, Stephen Hemminger wrote:
> > > > The existing networking namespace model is unattractive for what I
> > > > want, has anyone investigated better alternatives?
> > > >
> > > > I would like to be able to allow access to a network interface and
> > > > associated objects (routing tables etc), to be controlled by Mandatory
> > > > Access Control API's. I.e grant access to eth0 and to only certain
> > > > processes.  Some the issues with the existing models are:
> > > >   * eth0 and associated objects don't really exist in filesystem so
> > > >     not subject to LSM style control (SeLinux/SMACK/TOMOYO)
> 
> As Stephen points out, SELinux does have the ability to assign security labels 
> to network interfaces, check out the 'semanage' command.  A while back I wrote 
> up something about the SELinux network "ingress/egress" access controls:
> 
>  * http://paulmoore.livejournal.com/2128.html

I was hoping that inaccessible interfaces would not be visible at all.
Is it possible to keep interfaces from showing up in commands like:
  ip link show
or in sysfs?


-- 

^ permalink raw reply

* [PATCH] net,bonding: Add return statement in bond_create_proc_entry.
From: Rakib Mullick @ 2009-10-10  2:10 UTC (permalink / raw)
  To: Jay Vosburgh, netdev, linux-kernel, Andrew Morton; +Cc: bonding-devel

The function bond_create_proc_entry is declared to return int, but its empty
body has no return statement.  Add one, which fixes the following compilation
warning.

drivers/net/bonding/bond_main.c: In function `bond_create_proc_entry':
drivers/net/bonding/bond_main.c:3393: warning: control reaches end of
non-void function

---
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>

--- linus/drivers/net/bonding/bond_main.c	2009-10-09 17:38:35.000000000 +0600
+++ rakib/drivers/net/bonding/bond_main.c	2009-10-09 17:47:46.000000000 +0600
@@ -3391,6 +3391,7 @@ static void bond_destroy_proc_dir(void)

 static int bond_create_proc_entry(struct bonding *bond)
 {
+	return 0;
 }

 static void bond_remove_proc_entry(struct bonding *bond)

^ permalink raw reply

* Re: PATCH: Network Device Naming mechanism and policy
From: Stephen Hemminger @ 2009-10-10  2:44 UTC (permalink / raw)
  To: Matt Domsch; +Cc: netdev, linux-hotplug, Narendra_K, jordan_hargrave
In-Reply-To: <20091009210909.GA9836@auslistsprd01.us.dell.com>

On Fri, 9 Oct 2009 16:09:09 -0500
Matt Domsch <Matt_Domsch@dell.com> wrote:

> On Fri, Oct 09, 2009 at 09:00:01AM -0500, Narendra K wrote:
> > On Fri, Oct 09, 2009 at 07:12:07PM +0530, K, Narendra wrote:
> > > > example udev config:
> > > > SUBSYSTEM=="net",
> > > SYMLINK+="net/by-mac/$sysfs{ifindex}.$sysfs{address}"
> > > 
> > > work as well.  But coupling the ifindex to the MAC address like this
> > > doesn't work.  (In general, coupling any two unrelated attributes when
> > > trying to do persistent names doesn't work.)
> > > 
> > Attaching the latest patch incorporating review comments.
> 
> Same patch, rebased to linux-next.
> 
> By creating character devices for every network device, we can use
> udev to maintain alternate naming policies for devices, including
> additional names for the same device, without interfering with the
> name that the kernel assigns a device.
> 
> This is conditionalized on CONFIG_NET_CDEV.  If enabled (the default),
> device nodes will automatically be created in /dev/netdev/ for each
> network device.  (/dev/net/ is already populated by the tun device.)
> 
> These device nodes are not functional at the moment - open() returns
> -ENOSYS.  Their only purpose is to provide userspace with a kernel
> name to ifindex mapping, in a form that udev can easily manage.
> 
> Signed-off-by: Jordan Hargrave <Jordan_Hargrave@dell.com>
> Signed-off-by: Narendra K <Narendra_K@dell.com>
> Signed-off-by: Matt Domsch <Matt_Domsch@dell.com>

Maybe I'm dense, but I can't see why having useless /dev/net/ symlinks
is a good interface choice. Perhaps you should explain the race between
PCI scan and udev in more detail, and why solving it in either of those
places won't work. As it stands, you are proposing yet another wart on
the already complex set of network interface APIs, one which has
implications for security as well as increasing the number of possible bugs.

-- 

^ permalink raw reply

* Re: [PATCH] net: Add netdev_alloc_skb_ip_align() helper
From: David Miller @ 2009-10-10  3:43 UTC (permalink / raw)
  To: eric.dumazet; +Cc: thomas, netdev, thierry.reding, nios2-dev, linux-kernel
In-Reply-To: <4ACF6562.9040109@gmail.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 9 Oct 2009 18:31:30 +0200

> David Miller a écrit :
> 
>> Looks ok, but I want to look at how often this exact sequence
>> would match.  If it applies to a lot of cases, I'll add this
>> but I know of many exceptions in my head already :-)
> 
> Well, it was more as a reference. I believe about 20-30 call sites
> could use it. Do you want me to provide a combo patch ?

No, that's not necessary.

^ permalink raw reply
