Netdev List
* Re: [PATCH] firewire: net: rate-limit log spam at transmit failure
From: Stefan Richter @ 2010-11-08  8:12 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: linux1394-devel, netdev
In-Reply-To: <1289180517.4318.3.camel@maxim-laptop>

Maxim Levitsky wrote:
> But why is the timeout never set?

It is set to the default 0.1s per IEEE 1394 in core-card.c::fw_card_initialize.

If card->split_timeout_jiffies or card->split_timeout_cycles /ever/ become
zero, then only due to a memory corrupting bug.
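
For reference, here is a minimal sketch of the arithmetic behind that 0.1s
default. It assumes the encoding described elsewhere in this thread: whole
seconds in the register at offset 0x18, and a count of (1/8000)s cycles
shifted left by 19 in the register at offset 0x1c. The function name and the
3-bit mask on the seconds field are my own assumptions for illustration, not
code taken from core-card.c.

```c
#include <stdint.h>

/*
 * Decode a split timeout into milliseconds (illustrative helper only).
 * hi: whole-seconds register; the low 3 bits are assumed to hold the value.
 * lo: sub-second register; a count of 1/8000 s cycles shifted left by 19.
 */
static unsigned int split_timeout_ms(uint32_t hi, uint32_t lo)
{
	unsigned int seconds = hi & 0x7;
	unsigned int cycles = lo >> 19;	/* undo the shift to get the cycle count */

	/* 8000 cycles make one second, so 800 cycles are 100 ms. */
	return seconds * 1000 + cycles * 1000 / 8000;
}
```

With the seconds part at 0 and the sub-second part at 800 << 19 this gives
100 ms; raising the seconds part to 3 gives 3100 ms.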

> Also, note that I see here that if I send a TCP stream from one system
> to another then the system that receives the packets (and sends TCP
> acks), still overflows the queue (error 10, and confirmed by printks).

OK, I'll send a stricter version of "firewire: net: throttle TX queue before
running out of tlabels".
-- 
Stefan Richter
-=====-==-=- =-== -=---
http://arcgraph.de/sr/

^ permalink raw reply

* Re: [PATCH 0/9] Fix leaking of kernel heap addresses in net/
From: Rémi Denis-Courmont @ 2010-11-08  8:04 UTC (permalink / raw)
  To: ext Dan Rosenberg
  Cc: chas@cmf.nrl.navy.mil, davem@davemloft.net, kuznet@ms2.inr.ac.ru,
	pekkas@netcore.fi, jmorris@namei.org, yoshfuji@linux-ipv6.org,
	kaber@trash.net, netdev@vger.kernel.org, security@kernel.org,
	stable@kernel.org
In-Reply-To: <1289147492.3090.137.camel@Dan>

On Sunday 07 November 2010 18:31:32 ext Dan Rosenberg, you wrote:
> This patch series resolves the leakage of kernel heap addresses to
> userspace via network protocol /proc interfaces and public error
> messages.  Revealing this information is a bad idea from a security
> perspective for a number of reasons, the most obvious of which is it
> provides unprivileged users a mechanism by which to create a structure
> in the kernel heap containing function pointers, obtain the address of
> that structure, and overwrite those function pointers by leveraging
> other vulnerabilities.  It is my hope that by eliminating this
> information leakage, in conjunction with making statically-declared
> function pointer tables read-only (to be done in a separate patch
> series), we can at least add a small hurdle for the exploitation of a
> subset of kernel vulnerabilities.

Seems like this patch series is incomplete to me as far as /proc/net is 
concerned.

-- 
Rémi Denis-Courmont
Nokia Devices R&D, Maemo Software, Helsinki

^ permalink raw reply

* Re: [PATCH v14 06/17] Use callback to deal with skb_release_data() specially.
From: xiaohui.xin @ 2010-11-08  8:03 UTC (permalink / raw)
  To: eric.dumazet, netdev, kvm, linux-kernel, mst, mingo, davem,
	herbert, jdi
  Cc: Xin Xiaohui
In-Reply-To: <1288861663.2659.47.camel@edumazet-laptop>

From: Xin Xiaohui <xiaohui.xin@intel.com>

>> Hmm, I suggest you read the comment two lines above.
>>
>> If destructor_arg is now cleared each time we allocate a new skb, then,
>> please move it before dataref in shinfo structure, so that the following
>> memset() does the job efficiently...
>
>
>Something like :
>
>diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>index e6ba898..2dca504 100644
>--- a/include/linux/skbuff.h
>+++ b/include/linux/skbuff.h
>@@ -195,6 +195,9 @@ struct skb_shared_info {
> 	__be32          ip6_frag_id;
> 	__u8		tx_flags;
> 	struct sk_buff	*frag_list;
>+	/* Intermediate layers must ensure that destructor_arg
>+	 * remains valid until skb destructor */
>+	void		*destructor_arg;
> 	struct skb_shared_hwtstamps hwtstamps;
>
> 	/*
>@@ -202,9 +205,6 @@ struct skb_shared_info {
> 	 */
> 	atomic_t	dataref;
>
>-	/* Intermediate layers must ensure that destructor_arg
>-	 * remains valid until skb destructor */
>-	void *		destructor_arg;
> 	/* must be last field, see pskb_expand_head() */
> 	skb_frag_t	frags[MAX_SKB_FRAGS];
> };
>
>

Will that affect cache line usage?
Alternatively, we can clear destructor_arg at the end of __alloc_skb().
That would look like the following. Which one do you prefer?

Thanks
Xiaohui

---
 net/core/skbuff.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c83b421..df852f2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -224,6 +224,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
 	}
+	shinfo->destructor_arg = NULL;
 out:
 	return skb;
 nodata:
@@ -343,6 +344,13 @@ static void skb_release_data(struct sk_buff *skb)
 		if (skb_has_frags(skb))
 			skb_drop_fraglist(skb);
 
+		if (skb->dev && dev_is_mpassthru(skb->dev)) {
+			struct skb_ext_page *ext_page =
+				skb_shinfo(skb)->destructor_arg;
+			if (ext_page && ext_page->dtor)
+				ext_page->dtor(ext_page);
+		}
+
 		kfree(skb->head);
 	}
 }
-- 
1.7.3
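
To illustrate the first option (placing destructor_arg before dataref so a
single memset() clears it), here is a minimal userspace sketch. The struct is
a hypothetical stand-in with made-up fields, not the kernel's
skb_shared_info, and shinfo_model_init() is an assumed name; it only shows
the offsetof()-bounded memset() pattern the suggestion relies on.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for skb_shared_info; not the kernel definition. */
struct shinfo_model {
	unsigned short	nr_frags;
	void		*frag_list;
	void		*destructor_arg;	/* before dataref: covered by the memset */
	int		dataref;		/* first field that is NOT cleared */
	char		frags[64];		/* stays last, left untouched */
};

static void shinfo_model_init(struct shinfo_model *sh)
{
	/* One memset clears every field laid out before dataref. */
	memset(sh, 0, offsetof(struct shinfo_model, dataref));
	sh->dataref = 1;
}
```

With this layout no separate "destructor_arg = NULL" store is needed; the
cost is that the field moves within the structure.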

^ permalink raw reply related

* RE: [PATCH v13 10/16] Add a hook to intercept external buffers from NIC driver.
From: Xin, Xiaohui @ 2010-11-08  7:43 UTC (permalink / raw)
  To: David Miller
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mst@redhat.com, mingo@elte.hu,
	herbert@gondor.apana.org.au, jdike@linux.intel.com
In-Reply-To: <20101029.132836.115944599.davem@davemloft.net>

I have addressed this issue in v14 patch set.

Thanks
Xiaohui

>-----Original Message-----
>From: David Miller [mailto:davem@davemloft.net]
>Sent: Saturday, October 30, 2010 4:29 AM
>To: Xin, Xiaohui
>Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
>mst@redhat.com; mingo@elte.hu; herbert@gondor.apana.org.au; jdike@linux.intel.com
>Subject: Re: [PATCH v13 10/16] Add a hook to intercept external buffers from NIC driver.
>
>From: "Xin, Xiaohui" <xiaohui.xin@intel.com>
>Date: Wed, 27 Oct 2010 09:33:12 +0800
>
>> It does not seem trivial to support it now. Can we support it
>> later, as a to-do item alongside our current work?
>
>I would prefer the feature work properly, rather than only in specific
>cases, before being integrated.

^ permalink raw reply

* Re: [Security] [SECURITY] Fix leaking of kernel heap addresses via /proc
From: Eric Dumazet @ 2010-11-08  7:33 UTC (permalink / raw)
  To: David Miller
  Cc: andi, drosenberg, chas3, tytso, torvalds, kuznet, pekkas, jmorris,
	yoshfuji, kaber, remi.denis-courmont, netdev, security
In-Reply-To: <20101107.180108.71121019.davem@davemloft.net>

On Sunday, 7 November 2010 at 18:01 -0800, David Miller wrote:
> From: Andi Kleen <andi@firstfloor.org>
> Date: Mon, 8 Nov 2010 00:56:10 +0100
> 
> > I would just remove the pointers from /proc and supply 
> > gdb macros that extract the equivalent information from /proc/kcore.
> > This is a bit racy, but for debugging it should be no
> > problem to run them multiple times as needed.
> 
> I do not think at all that this is tenable for the kind of
> things people use the socket pointers for when debugging
> problems.
> 
> I definitely prefer the inode number to this idea.

We currently have no guarantee that socket inode numbers are unique.
I admit the chances of a clash are low.

When a printk() happens right before a BUG(), how are we going to check
whether the dumped registers are close to the socket involved, if we don't
have access to the machine and have only the crash log?

BTW, any local user can look at "dmesg" and crash reports. These
reports are even published on a remote site (bugzilla), so hostile
hackers can be fed that way.

I am OK with deleting socket pointers from /proc files for non-root users
(after checking that things like lsof continue to work correctly).
I don't remember using them while debugging.

BTW, rtnetlink also exposes socket pointers to non-root users:

$ ss -e dst 192.168.20.108
State      Recv-Q Send-Q    Local Address:Port    Peer Address:Port   
ESTAB      0      0         10.150.51.210:46979   192.168.20.108:ssh 
timer:(keepalive,119min,0) ino:136919 sk:ffff88002129d7c0


Mixing /proc pointer removal and printk() pointer removal in the same
patch seems wrong to me. They are very different problems.




^ permalink raw reply

* Re: [RFC v2] ipvs: allow transmit of GRO aggregated skbs
From: Simon Horman @ 2010-11-08  7:31 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: lvs-devel, netdev, Herbert Xu
In-Reply-To: <20101106142817.GA27212@verge.net.au>

[ CCing Herbert Xu ]

On Sat, Nov 06, 2010 at 11:28:21PM +0900, Simon Horman wrote:
> On Sat, Nov 06, 2010 at 04:18:21PM +0200, Julian Anastasov wrote:
> > 
> > 	Hello,
> > 
> > On Sat, 6 Nov 2010, Simon Horman wrote:
> > 
> > >This is a first attempt at allowing LVS to transmit
> > >skbs of greater than MTU length that have been aggregated by GRO.
> > >
> > >I have lightly tested the ip_vs_dr_xmit() portion of this patch and
> > >although it seems to work I am unsure that netif_needs_gso() is the correct
> > >test to use.
> > 
> > 	ip_forward() uses !skb_is_gso(skb), so maybe it is
> > enough to check for GRO instead of using netif_needs_gso?
> 
> Thanks, I'll look into that.

Hi Julian,

just to clarify, you think that !skb_is_gso(skb) should be
used in ip_vs_xmit.c? If so, yes I think that makes sense
and I'll re-spin my patch accordingly.
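
To make the question concrete, here is a rough userspace model of the check
as I understand it. The struct and function names are hypothetical stand-ins,
not the real sk_buff API; the assumption is only that a GRO-aggregated packet
carries a non-zero gso_size and will be resegmented on transmit, so it should
not be rejected as too big.

```c
#include <stdbool.h>

/* Hypothetical model of the few skb properties the check looks at. */
struct pkt_model {
	unsigned int	len;		/* total packet length */
	unsigned int	gso_size;	/* non-zero for GSO/GRO aggregated packets */
	bool		dont_fragment;	/* IP_DF set in the header */
};

/* Models skb_is_gso(): true when the packet carries segmentation info. */
static bool pkt_is_gso(const struct pkt_model *p)
{
	return p->gso_size != 0;
}

/* True when the packet should be rejected with "fragmentation needed". */
static bool pkt_too_big(const struct pkt_model *p, unsigned int mtu)
{
	return p->len > mtu && p->dont_fragment && !pkt_is_gso(p);
}
```

An aggregated packet longer than the MTU then passes, while an ordinary
oversized packet with DF set is still rejected.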

^ permalink raw reply

* how to read one udp packet with more than one recvfrom() calls?
From: ranjith kumar @ 2010-11-08  7:08 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 582 bytes --]

Hi,

I have implemented client and server programs using the UDP
protocol (files are attached).
The UDP packet size is 500 bytes.

I want to read these 500 bytes in two calls to recvfrom(): the first
reading 100 bytes and the second reading 400 bytes.
How can I do this?

When I changed the third argument of recvfrom() (size_t len)
from 500 to 100, the first 100 bytes were read correctly.
But when I call recvfrom() a second time with len=400, it reads the
first 400 bytes of the next UDP packet.
Why? Isn't it possible to read one UDP packet in two calls to
recvfrom()/read()?

Thanks in advance.

[-- Attachment #2: client.c --]
[-- Type: application/octet-stream, Size: 1169 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define BUFLEN 500
#define PORT  5000
#define NPACK  5

 
#define SRV_IP "107.109.38.32"
 /* #includes and #defines as in the server */

 int main(void)
 {
   struct sockaddr_in si_other;
   int s, i, slen=sizeof(si_other);
   char buf[BUFLEN];

   if ((s=socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP))==-1) {
     perror("socket");
     exit(1);
   }

   memset((char *) &si_other, 0, sizeof(si_other));
   si_other.sin_family = AF_INET;
   si_other.sin_port = htons(PORT);
   if (inet_aton(SRV_IP, &si_other.sin_addr)==0) {
     fprintf(stderr, "inet_aton() failed\n");
     exit(1);
   }

   for (i=0; i<NPACK; i++) {
     printf("Sending packet %d\n", i);
     memset(buf, 0, BUFLEN);  /* define the unused tail of the datagram */
     snprintf(buf, BUFLEN, "This is packet %d\n", i);
//	write(s,buf,BUFLEN);
     if (sendto(s, buf, BUFLEN, 0, (struct sockaddr *) &si_other, slen)==-1)
      perror("sendto()");
   }

   close(s);
   return 0;
 }


[-- Attachment #3: server.c --]
[-- Type: application/octet-stream, Size: 1083 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define BUFLEN 500
#define PORT  5000
#define NPACK  5



void diep(char *s)
{
	perror(s);
	exit(1);
}

int main(void)
{
	struct sockaddr_in si_me, si_other;
	int s, i;
	socklen_t slen = sizeof(si_other);	/* recvfrom() expects a socklen_t * */
	char buf[BUFLEN];

	if ((s=socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP))==-1)
		diep("socket");

	memset((char *) &si_me, 0, sizeof(si_me));
	si_me.sin_family = AF_INET;
	si_me.sin_port = htons(PORT);
	si_me.sin_addr.s_addr = htonl(INADDR_ANY);
	if (bind(s, (struct sockaddr *) &si_me, sizeof(si_me))==-1)
		diep("bind");

	for (i=0; i<NPACK; i++) {
		if (recvfrom(s, buf, BUFLEN, 0, (struct sockaddr *) &si_other, (void *) &slen)==-1)
			diep("recvfrom()");
//	read(s,buf,BUFLEN);
		printf("Received packet from %s:%d\nData: %s\n\n", 
				inet_ntoa(si_other.sin_addr), ntohs(si_other.sin_port), buf);
	}

	close(s);
	return 0;
}
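
A note on the behaviour described above: on a datagram socket each
recvfrom()/recv() call returns at most one datagram, and any bytes that do
not fit in the supplied buffer are discarded rather than kept for the next
call. A minimal sketch of one workaround follows, assuming it is acceptable
to peek at the first 100 bytes and then consume the whole datagram. The
function name and the HEAD_LEN constant are made up for illustration.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

#define HEAD_LEN 100
#define BUFLEN   500

/*
 * Read one datagram in two steps. Returns the datagram length, or -1 on error.
 * head: receives the first HEAD_LEN bytes (peeked; the datagram stays queued).
 * full: receives the entire datagram (this call removes it from the queue).
 */
static ssize_t recv_in_two_steps(int s, char *head, char *full)
{
	/* Step 1: MSG_PEEK copies data but leaves the datagram in the queue. */
	if (recv(s, head, HEAD_LEN, MSG_PEEK) == -1)
		return -1;

	/* Step 2: a normal receive consumes the whole datagram. */
	return recv(s, full, BUFLEN, 0);
}
```

After the call, the remaining 400 bytes are simply full + HEAD_LEN. The other
option is to skip the peek, read all 500 bytes once, and split the buffer in
user space.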

^ permalink raw reply

* Re: [ovs-dev] Flow Control and Port Mirroring
From: Simon Horman @ 2010-11-08  4:59 UTC (permalink / raw)
  To: Rusty Russell
  Cc: virtualization, Jesse Gross, dev, virtualization, netdev, kvm,
	Michael S. Tsirkin
In-Reply-To: <201011081341.23529.rusty@rustcorp.com.au>

On Mon, Nov 08, 2010 at 01:41:23PM +1030, Rusty Russell wrote:
> On Sat, 30 Oct 2010 01:29:33 pm Simon Horman wrote:
> > [ CCed VHOST contacts ]
> > 
> > On Thu, Oct 28, 2010 at 01:22:02PM -0700, Jesse Gross wrote:
> > > On Thu, Oct 28, 2010 at 4:54 AM, Simon Horman <horms@verge.net.au> wrote:
> > > > My reasoning is that in the non-mirroring case the guest is
> > > > limited by the external interface through which the packets
> > > > eventually flow - that is 1Gbit/s. But in the mirrored case either
> > > > there is no flow control or the flow control is acting on the
> > > > rate of dummy0, which is essentially infinite.
> > > >
> > > > Before investigating this any further I wanted to ask if
> > > > this behaviour is intentional.
> > > 
> > > It's not intentional but I can take a guess at what is happening.
> > > 
> > > When we send the packet to a mirror, the skb is cloned but only the
> > > original skb is charged to the sender.  If the original packet is
> > > delivered to localhost then it will be freed quickly and no longer
> > > accounted for, despite the fact that the "real" packet is still
> > > sitting in the transmit queue on the NIC.  The UDP stack will then
> > > send the next packet, limited only by the speed of the CPU.
> > 
> > That would explain what I have observed.
> 
> I can't find the thread (what is ovs-dev?),

Sorry, yes, it's on ovs-dev.
http://openvswitch.org/pipermail/dev_openvswitch.org/2010-October/003806.html

> but I think the tap device
> has this fundamental feature: you can blast as many packets as you want
> through it.
> 
> If that's a bad thing, we have to look harder...

There does seem to be flow control in the non-mirrored case.
So I suspect it's occurring at the skb level but that breaks down when
a clone occurs. It would seem that fragment level flow control would
help this problem (which is basically what Xen's netback/netfront has),
but by this point I am speculating wildly.  I'll try and find out exactly
where the problem is occurring in order for us to have a more informed
discussion.

^ permalink raw reply

* Re: [PATCH] net dst: need linux/cache.h for ____cacheline_aligned_in_smp.
From: David Miller @ 2010-11-08  3:58 UTC (permalink / raw)
  To: lethal; +Cc: linville, netdev
In-Reply-To: <20101108035130.GA11477@linux-sh.org>

From: Paul Mundt <lethal@linux-sh.org>
Date: Mon, 8 Nov 2010 12:51:30 +0900

> Presently the b43legacy build fails on an sh randconfig:
 ...
> Signed-off-by: Paul Mundt <lethal@linux-sh.org>

Applied, thanks Paul.

^ permalink raw reply

* [PATCH] net dst: need linux/cache.h for ____cacheline_aligned_in_smp.
From: Paul Mundt @ 2010-11-08  3:51 UTC (permalink / raw)
  To: David Miller; +Cc: John W. Linville, netdev

Presently the b43legacy build fails on an sh randconfig:

In file included from include/net/dst.h:12,
                 from drivers/net/wireless/b43legacy/xmit.c:32:
include/net/dst_ops.h:28: error: expected ':', ',', ';', '}' or '__attribute__' before '____cacheline_aligned_in_smp'
include/net/dst_ops.h: In function 'dst_entries_get_fast':
include/net/dst_ops.h:33: error: 'struct dst_ops' has no member named 'pcpuc_entries'
include/net/dst_ops.h: In function 'dst_entries_get_slow':
include/net/dst_ops.h:41: error: 'struct dst_ops' has no member named 'pcpuc_entries'
include/net/dst_ops.h: In function 'dst_entries_add':
include/net/dst_ops.h:49: error: 'struct dst_ops' has no member named 'pcpuc_entries'
include/net/dst_ops.h: In function 'dst_entries_init':
include/net/dst_ops.h:55: error: 'struct dst_ops' has no member named 'pcpuc_entries'
include/net/dst_ops.h: In function 'dst_entries_destroy':
include/net/dst_ops.h:60: error: 'struct dst_ops' has no member named 'pcpuc_entries'
make[5]: *** [drivers/net/wireless/b43legacy/xmit.o] Error 1
make[5]: *** Waiting for unfinished jobs....

Signed-off-by: Paul Mundt <lethal@linux-sh.org>

---

 include/net/dst_ops.h |    1 +
 1 file changed, 1 insertion(+)

diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h
index 1fa5306..51665b3 100644
--- a/include/net/dst_ops.h
+++ b/include/net/dst_ops.h
@@ -2,6 +2,7 @@
 #define _NET_DST_OPS_H
 #include <linux/types.h>
 #include <linux/percpu_counter.h>
+#include <linux/cache.h>
 
 struct dst_entry;
 struct kmem_cachep;

^ permalink raw reply related

* Re: Sky2 2.6.36-09934-g2aab243 DMAR error with tcp timestamp enabled
From: Michael Breuer @ 2010-11-08  3:38 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Stephen Hemminger, Jarek Poplawski, David Miller, netdev
In-Reply-To: <20101107191304.4a6cdfa4@s6510>

On 11/7/2010 10:13 PM, Stephen Hemminger wrote:
> On Sat, 06 Nov 2010 12:57:53 -0400
> Michael Breuer<mbreuer@majjas.com>  wrote:
>
>> Basically, if I enable tcp timestamps (now disabled) I get a sky2 hang.
>> As with the earlier issue the effects are not seen until after a couple
>> days of uptime and seem exacerbated by load.
>>
>> I can't 100% confirm that the problem is not occurring without tcp
>> timestamps, but will leave the system up for a while to try to confirm.
>> This didn't occur previously without tcp timestamps enabled, but I also
>> pulled git changes between the two events.
>>
>> I'm now also on 2.6.37-rc1.... I did a quick scan and didn't see any
>> obvious commits between 2.6.36-09934 and -rc1 that would have affected this.
>>
>>   From the log:
>> Nov  2 05:41:54 mail kernel: DRHD: handling fault status reg 2
>> Nov  2 05:41:54 mail kernel: DMAR:[DMA Read] Request device [06:00.0]
>> fault addr ffea3000
>> Nov  2 05:41:54 mail kernel: DMAR:[fault reason 06] PTE Read access is
>> not set
>> Nov  2 05:41:54 mail kernel: sky2 0000:06:00.0: error interrupt
>> status=0x80000000
>> Nov  2 05:41:54 mail kernel: sky2 0000:06:00.0: PCI hardware error (0x2010)
>> Nov  2 05:42:01 mail clamd[9755]: SelfCheck: Database status OK.
>> Nov  2 05:42:11 mail root: ping of potter failed
>> Nov  2 05:42:16 mail kernel: ------------[ cut here ]------------
>> Nov  2 05:42:16 mail kernel: WARNING: at net/sched/sch_generic.c:258
>> dev_watchdog+0x251/0x260()
>> Nov  2 05:42:16 mail kernel: Hardware name: System Product Name
>> Nov  2 05:42:16 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit
>> queue 0 timed out
>> Nov  2 05:42:16 mail kernel: Modules linked in: cpufreq_stats
>> ip6table_filter ip6table_mangle ip6_tables ipt_MASQUERADE iptable_nat
>> nf_nat iptable_mangle iptable_raw ebtable_nat ebtables bridge stp
>> appletalk psnap llc nfsd lockd nfs_acl auth_rpcgss exportfs coretemp
>> sunrpc acpi_cpufreq mperf sit tunnel4 ipt_LOG nf_conntrack_netbios_ns
>> nf_conntrack_ftp xt_DSCP xt_dscp xt_mark nf_conntrack_ipv6
>> nf_defrag_ipv6 xt_state xt_multiport ipv6 kvm_intel kvm
>> snd_hda_codec_analog snd_ens1371 gameport snd_rawmidi snd_ac97_codec
>> snd_hda_intel snd_hda_codec ac97_bus snd_hwdep snd_seq snd_seq_device
>> snd_pcm gspca_spca505 gspca_main snd_timer videodev snd v4l1_compat
>> i2c_i801 sky2 v4l2_compat_ioctl32 iTCO_wdt pcspkr asus_atk0110
>> i7core_edac edac_core soundcore iTCO_vendor_support snd_page_alloc
>> microcode raid456 async_raid6_recov async_pq raid6_pq async_xor xor
>> async_memcpy async_tx raid1 ata_generic firewire_ohci pata_acpi
>> firewire_core crc_itu_t pata_marvell nouveau ttm drm_kms_helper drm
>> i2c_algo_bit i2c_core video output [
>> Nov  2 05:42:16 mail kernel: last unloaded: ip6_tables]
>> Nov  2 05:42:16 mail kernel: Pid: 0, comm: swapper Tainted: G        W
>> 2.6.36-09934-g2aab243 #44
>> Nov  2 05:42:16 mail kernel: Call Trace:
>> Nov  2 05:42:16 mail kernel:<IRQ>   [<ffffffff81058a4f>]
>> warn_slowpath_common+0x7f/0xc0
>> Nov  2 05:42:16 mail kernel: [<ffffffff81058b46>]
>> warn_slowpath_fmt+0x46/0x50
>> Nov  2 05:42:16 mail kernel: [<ffffffff814603d1>] dev_watchdog+0x251/0x260
>> Nov  2 05:42:16 mail kernel: [<ffffffff8108a4a6>] ?
>> tick_program_event+0x26/0x30
>> Nov  2 05:42:16 mail kernel: [<ffffffff8107eed4>] ?
>> hrtimer_interrupt+0x134/0x240
>> Nov  2 05:42:16 mail kernel: [<ffffffff81068ab0>]
>> run_timer_softirq+0x160/0x390
>> Nov  2 05:42:16 mail kernel: [<ffffffff8108a368>] ?
>> tick_dev_program_event+0x48/0x110
>> Nov  2 05:42:16 mail kernel: [<ffffffff81460180>] ? dev_watchdog+0x0/0x260
>> Nov  2 05:42:16 mail kernel: [<ffffffff8105f981>] __do_softirq+0xb1/0x220
>> Nov  2 05:42:16 mail kernel: [<ffffffff8100cfdc>] call_softirq+0x1c/0x30
>> Nov  2 05:42:16 mail kernel: [<ffffffff8100ea15>] do_softirq+0x65/0xa0
>> Nov  2 05:42:16 mail kernel: [<ffffffff8105f845>] irq_exit+0x85/0x90
>> Nov  2 05:42:16 mail kernel: [<ffffffff81511d61>] do_IRQ+0x71/0xf0
>> Nov  2 05:42:16 mail kernel: [<ffffffff8150a7d3>] ret_from_intr+0x0/0x11
>> Nov  2 05:42:16 mail kernel:<EOI>   [<ffffffff812e4165>] ?
>> intel_idle+0xd5/0x170
>> Nov  2 05:42:16 mail kernel: [<ffffffff812e4148>] ? intel_idle+0xb8/0x170
>> Nov  2 05:42:16 mail kernel: [<ffffffff81425b51>]
>> cpuidle_idle_call+0x91/0x150
>> Nov  2 05:42:16 mail kernel: [<ffffffff8100aa8b>] cpu_idle+0xbb/0x150
>> Nov  2 05:42:16 mail kernel: [<ffffffff814f1785>] rest_init+0x75/0x80
>> Nov  2 05:42:16 mail kernel: [<ffffffff81b4ae9b>] start_kernel+0x3dc/0x3e7
>> Nov  2 05:42:16 mail kernel: [<ffffffff81b4a346>]
>> x86_64_start_reservations+0x131/0x135
>> Nov  2 05:42:16 mail kernel: [<ffffffff81b4a450>]
>> x86_64_start_kernel+0x106/0x115
>> Nov  2 05:42:16 mail kernel: ---[ end trace d9d3a1889f8925bf ]---
>> Nov  2 05:42:16 mail kernel: sky2 0000:06:00.0: eth0: tx timeout
>> Nov  2 05:42:16 mail kernel: sky2 0000:06:00.0: eth0: transmit ring 29
>> .. 117 report=29 done=29
>>
> Looks like a hardware issue, never saw it before.
> Are you running MTU > 1500?
> Does turning off TSO help?
>
> One possibility is that NET_IP_ALIGN changed. Now the ethernet header is
> aligned and the IP header is not.
>
MTU=1500
TCP timestamps seem to be the culprit - no issues with them disabled. I 
hit the problem after running about 18 hours with TCP timestamps 
enabled. Has been stable since rebuilding without timestamps... but 
another day would be more telling.

Didn't look into the header alignment - but would that be inconsistent 
with tcp timestamps being involved?

^ permalink raw reply

* Re: Sky2 2.6.36-09934-g2aab243 DMAR error with tcp timestamp enabled
From: Stephen Hemminger @ 2010-11-08  3:13 UTC (permalink / raw)
  To: Michael Breuer; +Cc: Stephen Hemminger, Jarek Poplawski, David Miller, netdev
In-Reply-To: <4CD58911.3050201@majjas.com>

On Sat, 06 Nov 2010 12:57:53 -0400
Michael Breuer <mbreuer@majjas.com> wrote:

> Basically, if I enable tcp timestamps (now disabled) I get a sky2 hang. 
> As with the earlier issue the effects are not seen until after a couple 
> days of uptime and seem exacerbated by load.
> 
> I can't 100% confirm that the problem is not occurring without tcp 
> timestamps, but will leave the system up for a while to try to confirm. 
> This didn't occur previously without tcp timestamps enabled, but I also 
> pulled git changes between the two events.
> 
> I'm now also on 2.6.37-rc1.... I did a quick scan and didn't see any 
> obvious commits between 2.6.36-09934 and -rc1 that would have affected this.
> 
>  From the log:
> Nov  2 05:41:54 mail kernel: DRHD: handling fault status reg 2
> Nov  2 05:41:54 mail kernel: DMAR:[DMA Read] Request device [06:00.0] 
> fault addr ffea3000
> Nov  2 05:41:54 mail kernel: DMAR:[fault reason 06] PTE Read access is 
> not set
> Nov  2 05:41:54 mail kernel: sky2 0000:06:00.0: error interrupt 
> status=0x80000000
> Nov  2 05:41:54 mail kernel: sky2 0000:06:00.0: PCI hardware error (0x2010)
> Nov  2 05:42:01 mail clamd[9755]: SelfCheck: Database status OK.
> Nov  2 05:42:11 mail root: ping of potter failed
> Nov  2 05:42:16 mail kernel: ------------[ cut here ]------------
> Nov  2 05:42:16 mail kernel: WARNING: at net/sched/sch_generic.c:258 
> dev_watchdog+0x251/0x260()
> Nov  2 05:42:16 mail kernel: Hardware name: System Product Name
> Nov  2 05:42:16 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit 
> queue 0 timed out
> Nov  2 05:42:16 mail kernel: Modules linked in: cpufreq_stats 
> ip6table_filter ip6table_mangle ip6_tables ipt_MASQUERADE iptable_nat 
> nf_nat iptable_mangle iptable_raw ebtable_nat ebtables bridge stp 
> appletalk psnap llc nfsd lockd nfs_acl auth_rpcgss exportfs coretemp 
> sunrpc acpi_cpufreq mperf sit tunnel4 ipt_LOG nf_conntrack_netbios_ns 
> nf_conntrack_ftp xt_DSCP xt_dscp xt_mark nf_conntrack_ipv6 
> nf_defrag_ipv6 xt_state xt_multiport ipv6 kvm_intel kvm 
> snd_hda_codec_analog snd_ens1371 gameport snd_rawmidi snd_ac97_codec 
> snd_hda_intel snd_hda_codec ac97_bus snd_hwdep snd_seq snd_seq_device 
> snd_pcm gspca_spca505 gspca_main snd_timer videodev snd v4l1_compat 
> i2c_i801 sky2 v4l2_compat_ioctl32 iTCO_wdt pcspkr asus_atk0110 
> i7core_edac edac_core soundcore iTCO_vendor_support snd_page_alloc 
> microcode raid456 async_raid6_recov async_pq raid6_pq async_xor xor 
> async_memcpy async_tx raid1 ata_generic firewire_ohci pata_acpi 
> firewire_core crc_itu_t pata_marvell nouveau ttm drm_kms_helper drm 
> i2c_algo_bit i2c_core video output [
> Nov  2 05:42:16 mail kernel: last unloaded: ip6_tables]
> Nov  2 05:42:16 mail kernel: Pid: 0, comm: swapper Tainted: G        W   
> 2.6.36-09934-g2aab243 #44
> Nov  2 05:42:16 mail kernel: Call Trace:
> Nov  2 05:42:16 mail kernel: <IRQ>  [<ffffffff81058a4f>] 
> warn_slowpath_common+0x7f/0xc0
> Nov  2 05:42:16 mail kernel: [<ffffffff81058b46>] 
> warn_slowpath_fmt+0x46/0x50
> Nov  2 05:42:16 mail kernel: [<ffffffff814603d1>] dev_watchdog+0x251/0x260
> Nov  2 05:42:16 mail kernel: [<ffffffff8108a4a6>] ? 
> tick_program_event+0x26/0x30
> Nov  2 05:42:16 mail kernel: [<ffffffff8107eed4>] ? 
> hrtimer_interrupt+0x134/0x240
> Nov  2 05:42:16 mail kernel: [<ffffffff81068ab0>] 
> run_timer_softirq+0x160/0x390
> Nov  2 05:42:16 mail kernel: [<ffffffff8108a368>] ? 
> tick_dev_program_event+0x48/0x110
> Nov  2 05:42:16 mail kernel: [<ffffffff81460180>] ? dev_watchdog+0x0/0x260
> Nov  2 05:42:16 mail kernel: [<ffffffff8105f981>] __do_softirq+0xb1/0x220
> Nov  2 05:42:16 mail kernel: [<ffffffff8100cfdc>] call_softirq+0x1c/0x30
> Nov  2 05:42:16 mail kernel: [<ffffffff8100ea15>] do_softirq+0x65/0xa0
> Nov  2 05:42:16 mail kernel: [<ffffffff8105f845>] irq_exit+0x85/0x90
> Nov  2 05:42:16 mail kernel: [<ffffffff81511d61>] do_IRQ+0x71/0xf0
> Nov  2 05:42:16 mail kernel: [<ffffffff8150a7d3>] ret_from_intr+0x0/0x11
> Nov  2 05:42:16 mail kernel: <EOI>  [<ffffffff812e4165>] ? 
> intel_idle+0xd5/0x170
> Nov  2 05:42:16 mail kernel: [<ffffffff812e4148>] ? intel_idle+0xb8/0x170
> Nov  2 05:42:16 mail kernel: [<ffffffff81425b51>] 
> cpuidle_idle_call+0x91/0x150
> Nov  2 05:42:16 mail kernel: [<ffffffff8100aa8b>] cpu_idle+0xbb/0x150
> Nov  2 05:42:16 mail kernel: [<ffffffff814f1785>] rest_init+0x75/0x80
> Nov  2 05:42:16 mail kernel: [<ffffffff81b4ae9b>] start_kernel+0x3dc/0x3e7
> Nov  2 05:42:16 mail kernel: [<ffffffff81b4a346>] 
> x86_64_start_reservations+0x131/0x135
> Nov  2 05:42:16 mail kernel: [<ffffffff81b4a450>] 
> x86_64_start_kernel+0x106/0x115
> Nov  2 05:42:16 mail kernel: ---[ end trace d9d3a1889f8925bf ]---
> Nov  2 05:42:16 mail kernel: sky2 0000:06:00.0: eth0: tx timeout
> Nov  2 05:42:16 mail kernel: sky2 0000:06:00.0: eth0: transmit ring 29 
> .. 117 report=29 done=29
> 

Looks like a hardware issue, never saw it before.
Are you running MTU > 1500?
Does turning off TSO help?

One possibility is that NET_IP_ALIGN changed. Now the ethernet header is
aligned and the IP header is not.



^ permalink raw reply

* Re: [ovs-dev] Flow Control and Port Mirroring
From: Rusty Russell @ 2010-11-08  3:11 UTC (permalink / raw)
  To: virtualization
  Cc: dev, kvm, Michael S. Tsirkin, netdev, Jesse Gross, virtualization
In-Reply-To: <20101030025932.GG12842@verge.net.au>

On Sat, 30 Oct 2010 01:29:33 pm Simon Horman wrote:
> [ CCed VHOST contacts ]
> 
> On Thu, Oct 28, 2010 at 01:22:02PM -0700, Jesse Gross wrote:
> > On Thu, Oct 28, 2010 at 4:54 AM, Simon Horman <horms@verge.net.au> wrote:
> > > My reasoning is that in the non-mirroring case the guest is
> > > limited by the external interface through which the packets
> > > eventually flow - that is 1Gbit/s. But in the mirrored case either
> > > there is no flow control or the flow control is acting on the
> > > rate of dummy0, which is essentially infinite.
> > >
> > > Before investigating this any further I wanted to ask if
> > > this behaviour is intentional.
> > 
> > It's not intentional but I can take a guess at what is happening.
> > 
> > When we send the packet to a mirror, the skb is cloned but only the
> > original skb is charged to the sender.  If the original packet is
> > delivered to localhost then it will be freed quickly and no longer
> > accounted for, despite the fact that the "real" packet is still
> > sitting in the transmit queue on the NIC.  The UDP stack will then
> > send the next packet, limited only by the speed of the CPU.
> 
> That would explain what I have observed.

I can't find the thread (what is ovs-dev?), but I think the tap device
has this fundamental feature: you can blast as many packets as you want
through it.

If that's a bad thing, we have to look harder...

Cheers,
Rusty.

^ permalink raw reply

* Re: [Security] [SECURITY] Fix leaking of kernel heap addresses via /proc
From: David Miller @ 2010-11-08  2:01 UTC (permalink / raw)
  To: andi
  Cc: drosenberg, chas3, tytso, torvalds, kuznet, pekkas, jmorris,
	yoshfuji, kaber, remi.denis-courmont, netdev, security
In-Reply-To: <20101107235610.GE17592@basil.fritz.box>

From: Andi Kleen <andi@firstfloor.org>
Date: Mon, 8 Nov 2010 00:56:10 +0100

> I would just remove the pointers from /proc and supply 
> gdb macros that extract the equivalent information from /proc/kcore.
> This is a bit racy, but for debugging it should be no
> problem to run them multiple times as needed.

I do not think at all that this is tenable for the kind of
things people use the socket pointers for when debugging
problems.

I definitely prefer the inode number to this idea.

^ permalink raw reply

* Re: netconf notes and materials
From: David Miller @ 2010-11-08  1:53 UTC (permalink / raw)
  To: roszenrami; +Cc: netdev
In-Reply-To: <AANLkTimZFfft66L983BbOXb4+LN9yzx2e3=t-kfQyM-_@mail.gmail.com>

From: Rami Rosen <roszenrami@gmail.com>
Date: Sun, 7 Nov 2010 21:06:20 +0200

> David,
> 1)  Great, thanks for the link!
> 
> 2)  Regarding your "Linux Networking Futures 2010" slides :
>  You mention in the fifth slide :
> "XFS is in review state, 2.6.38 likely".
> 
> I suppose you probably mean "XPS" patches,
> the transmit Packet Steering patches by
> Tom Herbert. Or am I wrong and don't know something ?

You're correct, and the typo is mine :-)

^ permalink raw reply

* Re: [PATCH] firewire: net: rate-limit log spam at transmit failure
From: Maxim Levitsky @ 2010-11-08  1:41 UTC (permalink / raw)
  To: Stefan Richter; +Cc: linux1394-devel, netdev@vger.kernel.org
In-Reply-To: <4CD6957C.5030504@s5r6.in-berlin.de>

On Sun, 2010-11-07 at 13:03 +0100, Stefan Richter wrote:
> Maxim Levitsky wrote:
> > I have here my own hack to set the transaction timeout,
> 
> You can use firecontrol to set it on-the-fly.  To set it on node ffc2 i.e. the
> node with phy ID 2, and controller 1 (i.e. libraw1394 "port" 1):
> # echo "w . 2 0xfffff0000018 4 3" | firecontrol 1
> 
> This sets the whole-seconds part to 3, which gives you 3.1 seconds timeout if
> the fractional part at 0xfffff000001c is still at its default value of 0.1
> seconds, i.e. 800 << 19 (subsecond part in units of (1/8000)s shifted by 19).
> 
> # { echo "r . 2 0xfffff0000018 4"; echo "r . 2 0xfffff000001c 4"; } |
> firecontrol 1
> 
> displays the current register value on node ffc2 at port 1.  Type "help" in
> firecontrol for more available firecontrol commands.
Agreed.
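
The register arithmetic quoted above can be sketched as follows. This is only an illustration: the function name is hypothetical, and the 3-bit mask on the whole-seconds register is an assumption based on IEEE 1394, not something stated in this thread. The subsecond part is taken from the description above (units of 1/8000 s, stored shifted left by 19).

```c
#include <stdint.h>

/*
 * Hypothetical helper: combine the two split-timeout register values
 * (whole seconds at 0xfffff0000018, subsecond part at 0xfffff000001c)
 * into a timeout in microseconds.
 */
static uint32_t split_timeout_usecs(uint32_t hi, uint32_t lo)
{
	uint32_t seconds = hi & 0x7;	/* assumption: 3-bit seconds field */
	uint32_t cycles = lo >> 19;	/* units of 1/8000 s, i.e. 125 us each */

	return seconds * 1000000u + cycles * 125u;
}
```

With hi = 3 and lo = 800 << 19 this gives 3100000 us, matching the 3.1 seconds mentioned above; with hi = 0 it gives the 0.1 s default.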

But why is the timeout never set?
What is the default?

I think that timeout_jiffies is never initialized, thus, it is 0 due to
kzalloc.

Also, note that I see here that if I send a TCP stream from one system
to another then the system that receives the packets (and sends TCP
acks), still overflows the queue (error 10, and confirmed by printks).

Best regards,
	Maxim Levitsky


^ permalink raw reply

* Re: [Security] [SECURITY] Fix leaking of kernel heap addresses via /proc
From: Willy Tarreau @ 2010-11-08  1:00 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, chas, security, pekkas, yoshfuji, netdev, drosenberg,
	jmorris, remi.denis-courmont, kuznet, kaber
In-Reply-To: <20101106.165703.193714684.davem@davemloft.net>

On Sat, Nov 06, 2010 at 04:57:03PM -0700, David Miller wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Sat, 6 Nov 2010 13:50:32 -0700
> 
> > On Saturday, November 6, 2010, Dan Rosenberg <drosenberg@vsecurity.com> wrote:
> >>
> >> Clearly, in most cases we cannot just remove the field from the /proc
> >> output, as this would break a number of userspace programs that rely on
> >> consistency.  However, I propose that we replace the address with a "0"
> >> rather than leaking this information.
> > 
> > I really think it would be much better to use the unidentified number
> > or similar.
> > 
> > Just replacing with zeroes is annoying, and has the potential of
> > losing actual information.
> 
> I would really like to see the specific examples of where this is
> happening, it sounds like something very silly to me.

Several times I have used a hex editor to check some socket's
parameters (e.g. backlog) based on the pointer. Sometimes I have
even changed some parameters at runtime as part of debugging sessions.

In fact we could consider that many places that return pointers could
return 0 to normal users and the real value only to root (or any special
capability). I find it important not to reduce the observability of the
kernel for the sake of security.
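
A minimal sketch of that idea, in plain userspace C rather than kernel code. The function name and the boolean "privileged" flag are hypothetical; how privilege would actually be checked (root, or some capability) is left open, as in the text above.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a pointer field for a /proc-style line.
 * Unprivileged readers get "0" so the column stays in place for
 * existing parsers; privileged readers get the real value in hex.
 */
static int format_pointer_field(char *buf, size_t len,
				const void *ptr, bool privileged)
{
	uintptr_t shown = privileged ? (uintptr_t)ptr : 0;

	return snprintf(buf, len, "%" PRIxPTR, shown);
}
```

The field is always emitted, so userspace programs that rely on the column layout keep working either way.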

Regards,
Willy


^ permalink raw reply

* Re: [Security] [SECURITY] Fix leaking of kernel heap addresses via /proc
From: Andi Kleen @ 2010-11-07 23:56 UTC (permalink / raw)
  To: Dan Rosenberg
  Cc: chas3, Andi Kleen, Ted Ts'o, Linus Torvalds,
	davem@davemloft.net, kuznet@ms2.inr.ac.ru, pekkas@netcore.fi,
	jmorris@namei.org, yoshfuji@linux-ipv6.org, kaber@trash.net,
	remi.denis-courmont@nokia.com, netdev@vger.kernel.org,
	security@kernel.org
In-Reply-To: <1289172456.3090.184.camel@Dan>

> The criticism raised so far is that cutting out the pointers entirely
> results in the omission of potentially useful debugging information.  I
> see two viable options to address this: either print out or omit
> addresses based on privileges (CAP_NET_ADMIN, for example), or have it
> controllable via sysctl.  I'm leaning towards the sysctl
> option...thoughts?

I would just remove the pointers from /proc and supply 
gdb macros that extract the equivalent information from /proc/kcore.
This is a bit racy, but for debugging it should be no
problem to run them multiple times as needed.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: [Security] [SECURITY] Fix leaking of kernel heap addresses via /proc
From: Andi Kleen @ 2010-11-07 23:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: chas3, Andi Kleen, Ted Ts'o, Dan Rosenberg,
	davem@davemloft.net, kuznet@ms2.inr.ac.ru, pekkas@netcore.fi,
	jmorris@namei.org, yoshfuji@linux-ipv6.org, kaber@trash.net,
	remi.denis-courmont@nokia.com, netdev@vger.kernel.org,
	security@kernel.org
In-Reply-To: <AANLkTikDeVXcYRXqcq-qD_mf19awnKPTYGZbZguxJLDm@mail.gmail.com>

> So the only question is whether kernel pointers are an information
> leak worth worrying about. Personally, I think it's just damn stupid
> to expose a kernel pointer unless you _have_ to. There is absolutely

Agreed.

> no reason to expose the address of a socket in /proc, perhaps unless
> you're in some super-duper-debugging mode that no sane person would
> ever use outside of specialized network debugging (and I bet that in
> that case you still shouldn't expose it in a _normal_ proc file, but
> in debugfs or something).

In most cases you can get them by running gdb or crash on /proc/kcore 
anyways.

Still overall I suspect there may be a case that making
dmesg root only is a good idea.  At least optionally for the non 
kernel hackers out there.

The early memory map information can be likely used to guess a lot of 
locations and perhaps really make that heap overflow easier to write.
And perhaps some more mapping randomization inside the kernel.

I'm not sure the case is very strong though.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: [Security] [SECURITY] Fix leaking of kernel heap addresses via /proc
From: Dan Rosenberg @ 2010-11-07 23:27 UTC (permalink / raw)
  To: chas3
  Cc: Andi Kleen, Ted Ts'o, Linus Torvalds, davem@davemloft.net,
	kuznet@ms2.inr.ac.ru, pekkas@netcore.fi, jmorris@namei.org,
	yoshfuji@linux-ipv6.org, kaber@trash.net,
	remi.denis-courmont@nokia.com, netdev@vger.kernel.org,
	security@kernel.org
In-Reply-To: <201011072248.oA7MmjKg025857@cmf.nrl.navy.mil>

Based on the feedback so far, I think a few things need to be worked
out.  Firstly, I'm going to separate out printk leaks as an issue to be
worked on at a later time - it seems there is some interest in
evaluating whether dmesg should be restricted, but that is out of scope
of my plans for this patch series.

That leaves the /proc leakage.  I don't think XOR-ing is a viable
option, since it would be relatively easy to infer the constant value.
The criticism raised so far is that cutting out the pointers entirely
results in the omission of potentially useful debugging information.  I
see two viable options to address this: either print out or omit
addresses based on privileges (CAP_NET_ADMIN, for example), or have it
controllable via sysctl.  I'm leaning towards the sysctl
option...thoughts?
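
To illustrate why XOR-ing does not help, here is a small sketch with hypothetical function names and made-up addresses. A single address that can be guessed, together with its obfuscated form, yields the constant, and from then on every other obfuscated value can be reversed.

```c
#include <stdint.h>

/* The proposed scheme: hide an address by XOR-ing it with a boot-time key. */
static uint64_t obfuscate(uint64_t addr, uint64_t key)
{
	return addr ^ key;
}

/* The attack: one guessed address plus its leaked form reveals the key. */
static uint64_t recover_key(uint64_t leaked, uint64_t guessed_addr)
{
	return leaked ^ guessed_addr;
}
```

Since XOR is its own inverse, applying obfuscate() again with the recovered key returns the real address of any other leaked value.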

-Dan


^ permalink raw reply

* Re: [Security] [SECURITY] Fix leaking of kernel heap addresses via /proc
From: Linus Torvalds @ 2010-11-07 23:22 UTC (permalink / raw)
  To: chas3
  Cc: Andi Kleen, Ted Ts'o, Dan Rosenberg, davem@davemloft.net,
	kuznet@ms2.inr.ac.ru, pekkas@netcore.fi, jmorris@namei.org,
	yoshfuji@linux-ipv6.org, kaber@trash.net,
	remi.denis-courmont@nokia.com, netdev@vger.kernel.org,
	security@kernel.org
In-Reply-To: <201011072248.oA7MmjKg025857@cmf.nrl.navy.mil>

On Sun, Nov 7, 2010 at 2:48 PM, Chas Williams (CONTRACTOR)
<chas@cmf.nrl.navy.mil> wrote:
>
> i suppose one could use idr to map the pointers to unique values.  the
> infiniband code uses this technique.

We already _have_ the unique value that /proc uses - the inode number.
It's what lsof and friends use to match things across different files
anyway.

Why are people arguing about stupid things? If it's wrong to expose
kernel pointers in /proc (and I do think it generally is something we
should try to avoid), then we already do have the natural alternative.
Which happens to be what the patch already does.

So the only question is whether kernel pointers are an information
leak worth worrying about. Personally, I think it's just damn stupid
to expose a kernel pointer unless you _have_ to. There is absolutely
no reason to expose the address of a socket in /proc, perhaps unless
you're in some super-duper-debugging mode that no sane person would
ever use outside of specialized network debugging (and I bet that in
that case you still shouldn't expose it in a _normal_ proc file, but
in debugfs or something).

                          Linus

^ permalink raw reply

* Re: [Security] [SECURITY] Fix leaking of kernel heap addresses via /proc
From: Andi Kleen @ 2010-11-07 23:14 UTC (permalink / raw)
  To: chas3
  Cc: Andi Kleen, Ted Ts'o, Linus Torvalds, Dan Rosenberg,
	davem@davemloft.net, kuznet@ms2.inr.ac.ru, pekkas@netcore.fi,
	jmorris@namei.org, yoshfuji@linux-ipv6.org, kaber@trash.net,
	remi.denis-courmont@nokia.com, netdev@vger.kernel.org,
	security@kernel.org
In-Reply-To: <201011072248.oA7MmjKg025857@cmf.nrl.navy.mil>

On Sun, Nov 07, 2010 at 05:48:45PM -0500, Chas Williams (CONTRACTOR) wrote:
> In message <87sjzcssx5.fsf@basil.nowhere.org>,Andi Kleen writes:
> >Ted Ts'o <tytso@mit.edu> writes:
> >
> >> Are there any userspace programs that might be reasonably expected to
> >> _use_ this information?  If there is, we could just pick a random
> >> number at boot time, and then XOR the heap address with that random
> >> number.
> >
> >If any of the addresses can be guessed ever (and that is likely if it's
> >allocated at boot) determining the random value will be trivial
> >for everyone.
> 
> i suppose one could use idr to map the pointers to unique values.  the
> infiniband code uses this technique.

idr requires allocating memory, and it's unclear you can do that
in all the situations where debugging printks are used. And
how would you get the idr table out of a broken kernel? And further
the memory allocations would eventually fill up your memory
if they go on. I don't think idr is a solution.

I don't really have a good solution either. Even if the
individual pointers were removed from printks (which
probably doesn't make much difference anyways because those
printks usually happen only in unlikely debug situations):

The information about the memory layout early on in dmesg
is probably enough to make a good educated guess about
the locations of standard slab caches on a known kernel
image. For example the first N sockets opened are very likely
easy to guess this way.

I guess one could make dmesg root only, although I personally
use it often as non root.

Or maybe add some more randomization to the buddy allocator.
The drawback of that is that they tend to make
benchmarks unstable due to cache coloring differences.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: [PATCH 2/2 v5] xps: Transmit Packet Steering
From: Tom Herbert @ 2010-11-07 23:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev
In-Reply-To: <1289162455.2478.295.camel@edumazet-laptop>

> Tom, I still don't understand why *xps_maps is here, and not in
> net_device ?
>

Originally it was in the net_device, but that necessitated a reference
be held on the device since we can't destroy the map until all
kobjects were released, but the open reference prevents the kobjects
from being released in the first place.

> I am asking because netdev_get_xps_maps(dev) might be slowed down
> because queue 0 state might change often (__QUEUE_STATE_XOFF)
>
We could add a pointer in the net_device as well.

> This means _tx[0] becomes a very hot cache line, needed to access all
> queues (from get_xps_queue())
>
> Other than that, your patch seems fine (not tested yet)
>
> Thanks
>
>
>

^ permalink raw reply

* Re: [RESEND PATCH] virtio-net: init link state correctly
From: Rusty Russell @ 2010-11-07 23:11 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, davem, markmc, kvm, mst
In-Reply-To: <20101105094718.19101.58898.stgit@dhcp-91-158.nay.redhat.com>

On Fri, 5 Nov 2010 08:17:18 pm Jason Wang wrote:
> For a device that supports VIRTIO_NET_F_STATUS, there's no need to
> assume the link is up, and we need to call netif_carrier_off() before
> querying device status; otherwise we may get the wrong operstate after
> the driver is loaded because the link watch event was not fired as
> expected.
> 
> For a device that does not support VIRTIO_NET_F_STATUS, we cannot get
> its status through virtnet_update_status(), so all we can do is
> always assume the link is up.
> 
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Acked-by: Rusty Russell <rusty@rustcorp.com.au>

Thanks!
Rusty.

^ permalink raw reply

* Re: [PATCH] virtio_net: Fix queue full check
From: Rusty Russell @ 2010-11-07 23:08 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Krishna Kumar2, davem, netdev, yvugenfi
In-Reply-To: <20101104122424.GA29830@redhat.com>

On Thu, 4 Nov 2010 10:54:24 pm Michael S. Tsirkin wrote:
> I thought about this some more.  I think the original
> code is actually correct in returning ENOSPC: indirect
> buffers are nice, but it's a mistake
> to rely on them as a memory allocation might fail.
> 
> And if you look at virtio-net, it is dropping packets
> under memory pressure which is not really a happy outcome:
> the packet will get freed, reallocated and we get another one,
> adding pressure on the allocator instead of releasing it
> until we free up some buffers.
> 
> So I now think we should calculate the capacity
> assuming non-indirect entries, and if we manage to
> use indirect, all the better.

I've long said it's a weakness in the network stack that it insists
drivers stop the tx queue before they *might* run out of room, leading to
worst-case assumptions and underutilization of the tx ring.

However, I lost that debate, and so your patch is the way it's supposed to
work.  The other main indirect user (block) doesn't care as its queue
allows for post-attempt blocking.

I enhanced your commentary a little:

Subject: virtio: return correct capacity to users
Date: Thu, 4 Nov 2010 14:24:24 +0200
From: "Michael S. Tsirkin" <mst@redhat.com>

We can't rely on indirect buffers for capacity
calculations because they need a memory allocation
which might fail.  In particular, virtio_net can get
into this situation under stress, and it drops packets
and performs badly.

So return the number of buffers we can guarantee users.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reported-By: Krishna Kumar2 <krkumar2@in.ibm.com>
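
A minimal sketch of the capacity reasoning above. All names here are hypothetical and this is not the actual virtio code: the point is only that the capacity reported to the driver counts free descriptors directly and takes no credit for indirect buffers, whose allocation might fail.

```c
#include <stdbool.h>

/* Hypothetical ring state, for illustration only. */
struct ring_state {
	unsigned int num_free;		/* free descriptors in the ring */
	bool indirect_supported;	/* deliberately ignored below */
};

/* Capacity that holds even if every indirect allocation fails. */
static unsigned int guaranteed_capacity(const struct ring_state *r)
{
	return r->num_free;
}

/* Stop the tx queue unless a worst-case packet is sure to fit. */
static bool should_stop_queue(const struct ring_state *r,
			      unsigned int worst_case_descs)
{
	return guaranteed_capacity(r) < worst_case_descs;
}
```

If indirect buffers do succeed, a packet simply uses fewer descriptors than budgeted, which is the "all the better" case.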

Thanks!
Rusty.

^ permalink raw reply

