Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: Network performance with small packets
From: Shirley Ma @ 2011-02-02 17:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev, netdev-owner,
	Sridhar Samudrala, Steve Dobbelstein
In-Reply-To: <20110202154706.GA12738@redhat.com>

On Wed, 2011-02-02 at 17:47 +0200, Michael S. Tsirkin wrote:
> On Wed, Feb 02, 2011 at 07:39:45AM -0800, Shirley Ma wrote:
> > On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote:
> > > Yes, I think doing this in the host is much simpler,
> > > just send an interrupt after there's a decent amount
> > > of space in the queue.
> > > 
> > > Having said that the simple heuristic that I coded
> > > might be a bit too simple.
> > 
> > >From the debugging out what I have seen so far (a single small
> message
> > TCP_STEAM test), I think the right approach is to patch both guest
> and
> > vhost.
> 
> One problem is slowing down the guest helps here.
> So there's a chance that just by adding complexity
> in guest driver we get a small improvement :(
> 
> We can't rely on a patched guest anyway, so
> I think it is best to test guest and host changes separately.
> 
> And I do agree something needs to be done in guest too,
> for example when vqs share an interrupt, we
> might invoke a callback when we see vq is not empty
> even though it's not requested. Probably should
> check interrupts enabled here?

Yes, I modified xmit callback something like below:

static void skb_xmit_done(struct virtqueue *svq)
{
        struct virtnet_info *vi = svq->vdev->priv;

        /* Suppress further interrupts. */
        virtqueue_disable_cb(svq);

        /* We were probably waiting for more output buffers. */
        if (netif_queue_stopped(vi->dev)) {
                free_old_xmit_skbs(vi);
                if (virtqueue_free_size(svq)  <= svq->vring.num / 2) {
                        virtqueue_enable_cb(svq);
			return;
		}
        }
	netif_wake_queue(vi->dev);
}

> > The problem I have found is a regression for single  small
> > message TCP_STEAM test. Old kernel works well for TCP_STREAM, only
> new
> > kernel has problem.
> 
> Likely new kernel is faster :)

> > For Steven's problem, it's multiple stream TCP_RR issues, the old
> guest
> > doesn't perform well, so does new guest kernel. We tested reducing
> vhost
> > signaling patch before, it didn't help the performance at all.
> > 
> > Thanks
> > Shirley
> 
> Yes, it seems unrelated to tx interrupts. 

The issue is more likely related to latency. Do you have anything in
mind on how to reduce vhost latency?

Thanks
Shirley


^ permalink raw reply

* Re: Network performance with small packets
From: Shirley Ma @ 2011-02-02 17:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <20110202154841.GB12738@redhat.com>

On Wed, 2011-02-02 at 17:48 +0200, Michael S. Tsirkin wrote:
> And this is with sndbuf=0 in host, yes?
> And do you see a lot of tx interrupts?
> How packets per interrupt?

Nope, sndbuf doens't matter since I never hit reaching sock wmem
condition in vhost. I am still playing around, let me know what data you
would like to collect.

Thanks
Shirley


^ permalink raw reply

* Re: [PATCH v5] Gemini: Gigabit ethernet driver
From: Michał Mirosław @ 2011-02-02 17:18 UTC (permalink / raw)
  To: David Miller; +Cc: ulli.kroll, gemini-board-dev, netdev, linux-kernel.bfrz
In-Reply-To: <20110128.143159.48506419.davem@davemloft.net>

On Fri, Jan 28, 2011 at 02:31:59PM -0800, David Miller wrote:
> From: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> Date: Thu, 27 Jan 2011 00:24:19 +0100 (CET)
> > +#define NETIF_TSO_FEATURES	\
> > +	(NETIF_F_TSO|NETIF_F_TSO_ECN|NETIF_F_TSO6)
> > +#define GMAC_TX_OFFLOAD_FEATURES	\
> > +	(NETIF_TSO_FEATURES|NETIF_F_ALL_CSUM)
> 
> Please, when definiting macros locally for your driver, do not name
> them with prefixes that match those defined generically by the
> network stack.  Otherwise it is confusing for people reading the
> driver.
> 
> One should be able to see "NETIF_XXX" somewhere and expect to find
> it's definition somewhere in the generic networking driver interfaces,
> not in the driver itself.

Sure. Renamed to GMAC_TSO_FEATURES for next versions.

> > +static struct toe_private *netdev_to_toe(struct net_device *dev)
> > +{
> > +	return dev->ml_priv;
> > +}
> 
> There is no reason to use ->ml_priv just to have a common backpointer
> to a structure shared between multiple interfaces.
> 
> Simply add a "struct toe_private *" to your "struct gmac_private" and
> stick it there.
> 
> The cost of the dereference is identical in both cases, so there is not
> even a performance incentive to use ->ml_priv.

What is the correct use of ml_priv? I saw that a few drivers use it for
same or similar purpose (pointer to state shared between network devices).

> > +static void __iomem *gmac_ctl_reg(struct net_device *dev, unsigned int reg)
> > +{
> > +	return (void __iomem *)dev->base_addr + reg;
> > +}
> Please do not abuse dev->base_addr in this way, simply define another
> "void __iomem *" pointer in your gmac_private and use that.

The field is unused for memory-mapped devices. Still, that's why I wrap
this kind of things in one-liners - I can easily change it.

> > +	page = pfn_to_page(dma_to_pfn(toe->dev, rx->word2.buf_adr));
> Please do not use non-portable routines such as dma_to_pfn() unless it
> is absolutely unavoidable.  Instead, use schemes for page struct
> lookup like those used by drivers such as drivers/net/niu.c, which uses
> a hash table to find pages based upon DMA address.
> 
> I'd like you to be able to enable this driver on as many platforms as
> possible, not just ARM, so we can be build testing your driver as we
> make changes to various network driver APIs, and we can't do that if
> you put ARM specific stuff in here.

To make it build I need to add a bunch of #ifdefs to compile out other
platform-specific code. I see no reason to make it inefficient on the
only platform it's supposed to work. I have isolated the code to another
one-liner functions in case this happens to be bigger problem.

If I make it buildable (NOT working) on other archs is it enough I add
Kconfig dependency on (ARCH_GEMINI || BROKEN) to allow it to be
build-tested there?

> > +		dev_err(&dev->dev, "Unsupported MII interface\n");
> Please use "netdev_err(dev, ..."
> Please use netdev_*() when possible elsewhere in this driver too.

Fixed. This is called before the netdev is registered, though.

> > +	writel(
> > +		(GMAC0_SWTQ00_EOF_INT_BIT|GMAC0_SWTQ00_FIN_INT_BIT)
> > +			<< (6 * dev->dev_id + txq_num),
> > +		toe_reg(toe, GLOBAL_INTERRUPT_STATUS_0_REG));
> Please format this more reasonably, this looks awful.

Fixed.

> > +	txq->ring[w].word0.bits32 = skb_headlen(skb);
> > +	txq->ring[w].word1.bits32 = skb->len | tss_flags;
> > +	txq->ring[w].word2.bits32 = mapping;
> > +	txq->ring[w].word3.bits32 = tss_pkt_len(skb) | SOF_BIT;
> What is the endinness of the RX and TX descriptors of this chipset?
> Please use "__be32", "__le32", and the endianness conversion interfaces
> as needed.

StorLink people who wrote original spaghetti-code probably didn't know
either.  The SoC has beed used only in little-endian mode so it isn't even
possible for me to test how would it work in big-endian mode.

Best Regards,
Michał Mirosław


^ permalink raw reply

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Jay Vosburgh @ 2011-02-02 17:30 UTC (permalink / raw)
  To: Oleg V. Ukhno
  Cc: Nicolas de Pesloüan, John Fastabend, netdev@vger.kernel.org
In-Reply-To: <4D4833EA.50407@yandex-team.ru>

Oleg V. Ukhno <olegu@yandex-team.ru> wrote:

>On 01/29/2011 05:28 AM, Jay Vosburgh wrote:
>> Oleg V. Ukhno<olegu@yandex-team.ru>  wrote:
>>
>> 	I've thought about this whole thing, and here's what I view as
>> the proper way to do this.
>>
>> 	In my mind, this proposal is two separate pieces:
>>
>> 	First, a piece to make round-robin a selectable hash for
>> xmit_hash_policy.  The documentation for this should follow the pattern
>> of the "layer3+4" hash policy, in particular noting that the new
>> algorithm violates the 802.3ad standard in exciting ways, will result in
>> out of order delivery, and that other 802.3ad implementations may or may
>> not tolerate this.
>>
>> 	Second, a piece to make certain transmitted packets use the
>> source MAC of the sending slave instead of the bond's MAC.  This should
>> be a separate option from the round-robin hash policy.  I'd call it
>> something like "mac_select" with two values: "default" (what we do now)
>> and "slave_src_mac" to use the slave's real MAC for certain types of
>> traffic (I'm open to better names; that's just what I came up with while
>> writing this).  I believe that "certain types" means "everything but
>> ARP," but might be "only IP and IPv6."  Structuring the option in this
>> manner leaves the option open for additional selections in the future,
>> which a simple "on/off" option wouldn't.  This option should probably
>> only affect a subset of modes; I'm thinking anything except balance-tlb
>> or -alb (because they do funky MAC things already) and active-backup (it
>> doesn't balance traffic, and already uses fail_over_mac to control
>> this).  I think this option also needs a whole new section down in the
>> bottom explaining how to exploit it (the "pick special MACs on slaves to
>> trick switch hash" business).
>>
>> 	Comments?
>>
>> 	-J
>>
>Jay,
>As for me splitting my initial proposal into two logically diffent pieces
>is ok, this will provide more flexible configuration.
>Do I understand correctly, that after I rewrite  patch in splitted form,
>as you described above, and enhance documentation it will be /can be
>applied to kernel?

	Yes, although the patches may have to go through a few
revisions.


>Then what should I do: rewrite patch and resubmit it as a new one?

	Yes.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


^ permalink raw reply

* Re: Network performance with small packets
From: Michael S. Tsirkin @ 2011-02-02 17:32 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev, netdev-owner,
	Sridhar Samudrala, Steve Dobbelstein
In-Reply-To: <1296666635.25430.35.camel@localhost.localdomain>

On Wed, Feb 02, 2011 at 09:10:35AM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 17:47 +0200, Michael S. Tsirkin wrote:
> > On Wed, Feb 02, 2011 at 07:39:45AM -0800, Shirley Ma wrote:
> > > On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote:
> > > > Yes, I think doing this in the host is much simpler,
> > > > just send an interrupt after there's a decent amount
> > > > of space in the queue.
> > > > 
> > > > Having said that the simple heuristic that I coded
> > > > might be a bit too simple.
> > > 
> > > >From the debugging out what I have seen so far (a single small
> > message
> > > TCP_STEAM test), I think the right approach is to patch both guest
> > and
> > > vhost.
> > 
> > One problem is slowing down the guest helps here.
> > So there's a chance that just by adding complexity
> > in guest driver we get a small improvement :(
> > 
> > We can't rely on a patched guest anyway, so
> > I think it is best to test guest and host changes separately.
> > 
> > And I do agree something needs to be done in guest too,
> > for example when vqs share an interrupt, we
> > might invoke a callback when we see vq is not empty
> > even though it's not requested. Probably should
> > check interrupts enabled here?
> 
> Yes, I modified xmit callback something like below:
> 
> static void skb_xmit_done(struct virtqueue *svq)
> {
>         struct virtnet_info *vi = svq->vdev->priv;
> 
>         /* Suppress further interrupts. */
>         virtqueue_disable_cb(svq);
> 
>         /* We were probably waiting for more output buffers. */
>         if (netif_queue_stopped(vi->dev)) {
>                 free_old_xmit_skbs(vi);
>                 if (virtqueue_free_size(svq)  <= svq->vring.num / 2) {
>                         virtqueue_enable_cb(svq);
> 			return;
> 		}
>         }
> 	netif_wake_queue(vi->dev);
> }

OK, but this should have no effect with a vhost patch
which should ensure that we don't get an interrupt
until the queue is at least half empty.
Right?

> > > The problem I have found is a regression for single  small
> > > message TCP_STEAM test. Old kernel works well for TCP_STREAM, only
> > new
> > > kernel has problem.
> > 
> > Likely new kernel is faster :)
> 
> > > For Steven's problem, it's multiple stream TCP_RR issues, the old
> > guest
> > > doesn't perform well, so does new guest kernel. We tested reducing
> > vhost
> > > signaling patch before, it didn't help the performance at all.
> > > 
> > > Thanks
> > > Shirley
> > 
> > Yes, it seems unrelated to tx interrupts. 
> 
> The issue is more likely related to latency.

Could be. Why do you think so?

> Do you have anything in
> mind on how to reduce vhost latency?
> 
> Thanks
> Shirley

Hmm, bypassing the bridge might help a bit.
Are you using tap+bridge or macvtap?

^ permalink raw reply

* non-symmetric Unix dgram sockets and poll
From: Alban Crequy @ 2011-02-02 17:36 UTC (permalink / raw)
  To: netdev

Hi,

I have 3 Unix dgram sockets (sockA, sockB, sockC):
- sockA is connected to sockB.
- sockB is connected to sockC.
- sockC is not connected.

SockA cannot send any message to sockB because net/unix/af_unix.c::unix_may_send()
prevents it. Is there any reason for that restriction?

If it was possible to send that message, unix_dgram_poll() on sockB could be in a
situation where it needs to add the current process in 2 different wait queues:
- sk_sleep(sk) for POLLIN (messages coming from sockA)
- &unix_sk(other)->peer_wait for POLLOUT (to send messages to sockC)

Is it correct?

Thanks,
Alban

^ permalink raw reply

* af_unix unix_getname: return size for unnamed sockets too small?
From: Marcus Meissner @ 2011-02-02 17:40 UTC (permalink / raw)
  To: davem, eric.dumazet, ebiederm, netdev, linux-kernel, gorcunov

Hi,

In net/unix/af_unix.c::unix_getname() there is a small problem:

        if (!u->addr) {
                sunaddr->sun_family = AF_UNIX;
                sunaddr->sun_path[0] = 0;		// not copied out
                *uaddr_len = sizeof(short);
        } else {
                struct unix_address *addr = u->addr;

                *uaddr_len = addr->len;
                memcpy(sunaddr, addr->name, *uaddr_len);
        }

The if (!u->addr) case will not copy out the \0 in the sun_path, as
uaddr_len is just the size of sun_family.

(Shown by socat crashing after decoding gethostname return and expected
sun_path to be a valid string (and not seeing the \0)).

Should it perhaps be *uaddr_len = sizeof(short)+sizeof(char)?

Ciao, Marcus

^ permalink raw reply

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Jay Vosburgh @ 2011-02-02 17:57 UTC (permalink / raw)
  To: =?UTF-8?B?Tmljb2xhcyBkZSBQZXNsb8O8YW4=?=
  Cc: Oleg V. Ukhno, John Fastabend, netdev@vger.kernel.org
In-Reply-To: <4D4929BB.2000403@gmail.com>

Nicolas de Pesloüan <nicolas.2p.debian@gmail.com> wrote:

>Le 29/01/2011 03:28, Jay Vosburgh a écrit :
>> 	I've thought about this whole thing, and here's what I view as
>> the proper way to do this.
>>
>> 	In my mind, this proposal is two separate pieces:
>>
>> 	First, a piece to make round-robin a selectable hash for
>> xmit_hash_policy.  The documentation for this should follow the pattern
>> of the "layer3+4" hash policy, in particular noting that the new
>> algorithm violates the 802.3ad standard in exciting ways, will result in
>> out of order delivery, and that other 802.3ad implementations may or may
>> not tolerate this.
>>
>> 	Second, a piece to make certain transmitted packets use the
>> source MAC of the sending slave instead of the bond's MAC.  This should
>> be a separate option from the round-robin hash policy.  I'd call it
>> something like "mac_select" with two values: "default" (what we do now)
>> and "slave_src_mac" to use the slave's real MAC for certain types of
>> traffic (I'm open to better names; that's just what I came up with while
>> writing this).  I believe that "certain types" means "everything but
>> ARP," but might be "only IP and IPv6."  Structuring the option in this
>> manner leaves the option open for additional selections in the future,
>> which a simple "on/off" option wouldn't.  This option should probably
>> only affect a subset of modes; I'm thinking anything except balance-tlb
>> or -alb (because they do funky MAC things already) and active-backup (it
>> doesn't balance traffic, and already uses fail_over_mac to control
>> this).  I think this option also needs a whole new section down in the
>> bottom explaining how to exploit it (the "pick special MACs on slaves to
>> trick switch hash" business).
>>
>> 	Comments?
>
>Looks really sensible to me.
>
>I just propose the following option and option values : "src_mac_select"
>(instead of mac_select), with "default" and "slave_mac" (instead of
>slave_src_mac) as possible values. In the future, we might need a
>"dst_mac_select" option... :-)

	I originally thought of using the nomenclature you propose; my
thinking for doing it the way I ended up with is to minimize the number
of tunable knobs that bonding has (so, the dst_mac would be a setting
for mac_select).  That works as long as there aren't a lot of settings
that would be turned on simultaneously, since each combination would
have to be a separate option, or the options parser would have to handle
multiple settings (e.g., mac_select=src+dst or something like that).

	Anyway, after thinking about it some more, in the long run it's
probably safer to separate these two, so, Oleg, use the above naming
("src_mac_select" with "default" and "slave_mac").

>Also, are there any risks that this kind of session load-balancing won't
>properly cooperate with multiqueue (as explained in "Overriding
>Configuration for Special Cases" in Documentation/networking/bonding.txt)?
>I think it is important to ensure we keep the ability to fine tune the
>egress path selection

	I think the logic for the mac_select (or src_mac_select or
whatever) just has to be done last, after the slave selection is done by
the multiqueue stuff.  That's probably a good tidbit to put in the
documentation as well.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* Re: kernel 2.6.37 : oops in cleanup_once
From: Yann Dupont @ 2011-02-02 17:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netdev
In-Reply-To: <1296659336.20445.22.camel@edumazet-laptop>

Le 02/02/2011 16:08, Eric Dumazet a écrit :
> Le mercredi 02 février 2011 à 16:04 +0100, Yann Dupont a écrit :
>> Ok, will do it at 18:30 CET (to minimize impact)
>> It the suspected bug SLUB related ?
>>
> no : It can be a corruption from another part of kernel.
>
>> The 2.6.34.2 kernel previously used on that server used SLAB.
>>
>>
>> 2 questions :
>> -How can I be sure slub_nomerge is active ? Boot message ?
>
> # ls -l /sys/kernel/slab/
>
> If you have symlinks : merge is on (default)
>
> If you dont have symlinks : nomerge is in action
>
>> -Is there a very severe impact on performance ?
>>
> not at all
>
>> Regards,
>>
>
well. The server had the good taste to oops at 18H05, 25 minutes before 
the planned reboot :)

here is the oops (I think it's quite the same) :


Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128042] 
BUG: unable to handle kernel NULL pointer dereference at 000000000000000d
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128097] 
IP: [<ffffffff8130e6bf>] cleanup_once+0x3f/0xa0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128146] PGD 0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128173] 
Oops: 0002 [#1] SMP
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128200] 
last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128250] CPU 7
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128260] 
Modules linked in: dell_rbu acpi_cpufreq freq_table mperf nls_utf8 
nls_cp437 btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix 
ntfs vfat msdos fat jfs rei
serfs ext4 jbd2 crc16 ext3 jbd tun ipt_MASQUERADE iptable_nat nf_nat 
ipt_REJECT kvm_intel kvm xt_physdev ip6t_LOG nf_conntrack_ipv6 
nf_defrag_ipv6 ip6table_filter ip6_tables ipt_LOG xt_multiport xt_limit 
xt_tcpudp xt_state iptable_filter
  ip_tables x_tables nf_conntrack_tftp nf_conntrack_ftp 
nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ipv6 8021q bridge stp ext2 
mbcache fuse snd_pcm snd_timer ghes hed button snd soundcore i5000_edac 
edac_core processor shpchp tpm_tis pc
i_hotplug tpm rng_core snd_page_alloc i5k_amb dcdbas tpm_bios joydev 
evdev psmouse pcspkr serio_raw thermal_sys xfs exportfs dm_mod sg sr_mod 
cdrom sd_mod usbhid hid usb_storage qla2xxx scsi_transport_fc scsi_tgt 
uhci_hcd mptsas mptscsih
  mptbase bnx2 scsi_transport_sas scsi_mod ehci_hcd [last unloaded: 
scsi_wait_scan]
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128834]
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128855] 
Pid: 0, comm: kworker/0:1 Not tainted 2.6.37-dsiun-110105 #17 
0MY736/PowerEdge M600
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128901] 
RIP: 0010:[<ffffffff8130e6bf>]  [<ffffffff8130e6bf>] cleanup_once+0x3f/0xa0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128948] 
RSP: 0018:ffff8800cfdc3e20  EFLAGS: 00010206
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.128974] 
RAX: ffff8803a7e0ea18 RBX: ffff8803a7e0ea00 RCX: 0000000000000005
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129003] 
RDX: adde806c0d860b00 RSI: 0000000000000096 RDI: ffffffff8152a970
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129032] 
RBP: 00000000000248f6 R08: 00000000003d0900 R09: 0000000000000000
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129062] 
R10: dead000000200200 R11: 0000000000000000 R12: ffff8800cfdc3ea0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129091] 
R13: 0000000000000100 R14: ffff88040fd29fd8 R15: 0000000000000000
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129121] 
FS:  0000000000000000(0000) GS:ffff8800cfdc0000(0000) knlGS:0000000000000000
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129166] 
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129193] 
CR2: 000000000000000d CR3: 00000000014f1000 CR4: 00000000000026e0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129223] 
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129252] 
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129282] 
Process kworker/0:1 (pid: 0, threadinfo ffff88040fd28000, task 
ffff88040fce6450)
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129327] Stack:
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129347]  
0000000000000082 00000001008d3b66 00000000000248f6 ffffffff8130e988
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129397]  
ffff88040fd24000 ffff88040fd24000 ffffffff8152a9a0 ffffffff8105e95f
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129446]  
ffff8800cfdc3e58 ffff88040fd25020 ffffffff8130e950 ffff88040fd29fd8
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129496] 
Call Trace:
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129523] <IRQ>
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129551]  
[<ffffffff8130e988>] ? peer_check_expire+0x38/0x110
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129581]  
[<ffffffff8105e95f>] ? run_timer_softirq+0x16f/0x350
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129609]  
[<ffffffff8130e950>] ? peer_check_expire+0x0/0x110
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129638]  
[<ffffffff81079c6b>] ? ktime_get+0x5b/0xe0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129666]  
[<ffffffff8105685a>] ? __do_softirq+0xaa/0x1e0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129694]  
[<ffffffff81003ddc>] ? call_softirq+0x1c/0x30
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129722]  
[<ffffffff81005f75>] ? do_softirq+0x65/0xa0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129748]  
[<ffffffff81056745>] ? irq_exit+0x85/0x90
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129776]  
[<ffffffff8102137a>] ? smp_apic_timer_interrupt+0x6a/0xa0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129806]  
[<ffffffff81003893>] ? apic_timer_interrupt+0x13/0x20
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129833] <EOI>
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129857]  
[<ffffffff8123f5ce>] ? acpi_hw_register_read+0x54/0xe2
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129890]  
[<ffffffffa01c52b8>] ? acpi_idle_enter_simple+0xf4/0x126 [processor]
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.129936]  
[<ffffffffa01c52b1>] ? acpi_idle_enter_simple+0xed/0x126 [processor]
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.131555]  
[<ffffffffa01c5034>] ? acpi_idle_enter_bm+0xeb/0x27b [processor]
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.131591]  
[<ffffffff812c0deb>] ? cpuidle_idle_call+0x8b/0x140
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.131619]  
[<ffffffff8100208a>] ? cpu_idle+0x6a/0xf0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.131645] 
Code: 00 48 8b 05 c4 c2 21 00 48 3d 60 a9 52 81 74 5c 48 8d 58 e8 48 8b 
15 11 02 24 00 2b 53 28 48 39 ea 72 49 48 8b 4b 18 48 8b 53 20 <48> 89 
51 08 48 89 0a 48 89 43 18 48 89 43 20 f0 ff 40 14 48 c7
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.131847] 
RIP  [<ffffffff8130e6bf>] cleanup_once+0x3f/0xa0
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.131876]  
RSP <ffff8800cfdc3e20>
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.131898] 
CR2: 000000000000000d
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.132280] 
---[ end trace a9f45436c3b7c143 ]---
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.132350] 
Kernel panic - not syncing: Fatal exception in interrupt
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.132422] 
Pid: 0, comm: kworker/0:1 Tainted: G      D     2.6.37-dsiun-110105 #17
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.132510] 
Call Trace:
Feb  2 18:05:33 linkwood.u11.univ-nantes.prive kernel: [37323.132574] 
<IRQ>  [<ffffffff8137c75e>] ? panic+0x92/0x1a2

and I also have a screenshot with more details. I'll send it in a 
private message.



Since 18H30, the server runs with slub_nomerge.

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr

^ permalink raw reply

* [PATCH] be2net: use device model DMA API
From: Ivan Vecera @ 2011-02-02 18:05 UTC (permalink / raw)
  To: netdev; +Cc: sathyap, subbus, sarveshwarb, ajitk

Use DMA API as PCI equivalents will be deprecated.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
---
 drivers/net/benet/be_ethtool.c |   25 +++++-----
 drivers/net/benet/be_main.c    |   98 +++++++++++++++++++++-------------------
 2 files changed, 64 insertions(+), 59 deletions(-)

diff --git a/drivers/net/benet/be_ethtool.c b/drivers/net/benet/be_ethtool.c
index b4be027..0c99314 100644
--- a/drivers/net/benet/be_ethtool.c
+++ b/drivers/net/benet/be_ethtool.c
@@ -376,8 +376,9 @@ static int be_get_settings(struct net_device *netdev, struct ethtool_cmd *ecmd)
 		}
 
 		phy_cmd.size = sizeof(struct be_cmd_req_get_phy_info);
-		phy_cmd.va = pci_alloc_consistent(adapter->pdev, phy_cmd.size,
-					&phy_cmd.dma);
+		phy_cmd.va = dma_alloc_coherent(&adapter->pdev->dev,
+						phy_cmd.size, &phy_cmd.dma,
+						GFP_KERNEL);
 		if (!phy_cmd.va) {
 			dev_err(&adapter->pdev->dev, "Memory alloc failure\n");
 			return -ENOMEM;
@@ -416,8 +417,8 @@ static int be_get_settings(struct net_device *netdev, struct ethtool_cmd *ecmd)
 		adapter->port_type = ecmd->port;
 		adapter->transceiver = ecmd->transceiver;
 		adapter->autoneg = ecmd->autoneg;
-		pci_free_consistent(adapter->pdev, phy_cmd.size,
-					phy_cmd.va, phy_cmd.dma);
+		dma_free_coherent(&adapter->pdev->dev, phy_cmd.size, phy_cmd.va,
+				  phy_cmd.dma);
 	} else {
 		ecmd->speed = adapter->link_speed;
 		ecmd->port = adapter->port_type;
@@ -554,8 +555,8 @@ be_test_ddr_dma(struct be_adapter *adapter)
 	};
 
 	ddrdma_cmd.size = sizeof(struct be_cmd_req_ddrdma_test);
-	ddrdma_cmd.va = pci_alloc_consistent(adapter->pdev, ddrdma_cmd.size,
-					&ddrdma_cmd.dma);
+	ddrdma_cmd.va = dma_alloc_coherent(&adapter->pdev->dev, ddrdma_cmd.size,
+					   &ddrdma_cmd.dma, GFP_KERNEL);
 	if (!ddrdma_cmd.va) {
 		dev_err(&adapter->pdev->dev, "Memory allocation failure\n");
 		return -ENOMEM;
@@ -569,8 +570,8 @@ be_test_ddr_dma(struct be_adapter *adapter)
 	}
 
 err:
-	pci_free_consistent(adapter->pdev, ddrdma_cmd.size,
-			ddrdma_cmd.va, ddrdma_cmd.dma);
+	dma_free_coherent(&adapter->pdev->dev, ddrdma_cmd.size, ddrdma_cmd.va,
+			  ddrdma_cmd.dma);
 	return ret;
 }
 
@@ -662,8 +663,8 @@ be_read_eeprom(struct net_device *netdev, struct ethtool_eeprom *eeprom,
 
 	memset(&eeprom_cmd, 0, sizeof(struct be_dma_mem));
 	eeprom_cmd.size = sizeof(struct be_cmd_req_seeprom_read);
-	eeprom_cmd.va = pci_alloc_consistent(adapter->pdev, eeprom_cmd.size,
-				&eeprom_cmd.dma);
+	eeprom_cmd.va = dma_alloc_coherent(&adapter->pdev->dev, eeprom_cmd.size,
+					   &eeprom_cmd.dma, GFP_KERNEL);
 
 	if (!eeprom_cmd.va) {
 		dev_err(&adapter->pdev->dev,
@@ -677,8 +678,8 @@ be_read_eeprom(struct net_device *netdev, struct ethtool_eeprom *eeprom,
 		resp = (struct be_cmd_resp_seeprom_read *) eeprom_cmd.va;
 		memcpy(data, resp->seeprom_data + eeprom->offset, eeprom->len);
 	}
-	pci_free_consistent(adapter->pdev, eeprom_cmd.size, eeprom_cmd.va,
-			eeprom_cmd.dma);
+	dma_free_coherent(&adapter->pdev->dev, eeprom_cmd.size, eeprom_cmd.va,
+			  eeprom_cmd.dma);
 
 	return status;
 }
diff --git a/drivers/net/benet/be_main.c b/drivers/net/benet/be_main.c
index 28a32a6..82b2df8 100644
--- a/drivers/net/benet/be_main.c
+++ b/drivers/net/benet/be_main.c
@@ -125,8 +125,8 @@ static void be_queue_free(struct be_adapter *adapter, struct be_queue_info *q)
 {
 	struct be_dma_mem *mem = &q->dma_mem;
 	if (mem->va)
-		pci_free_consistent(adapter->pdev, mem->size,
-			mem->va, mem->dma);
+		dma_free_coherent(&adapter->pdev->dev, mem->size, mem->va,
+				  mem->dma);
 }
 
 static int be_queue_alloc(struct be_adapter *adapter, struct be_queue_info *q,
@@ -138,7 +138,8 @@ static int be_queue_alloc(struct be_adapter *adapter, struct be_queue_info *q,
 	q->len = len;
 	q->entry_size = entry_size;
 	mem->size = len * entry_size;
-	mem->va = pci_alloc_consistent(adapter->pdev, mem->size, &mem->dma);
+	mem->va = dma_alloc_coherent(&adapter->pdev->dev, mem->size, &mem->dma,
+				     GFP_KERNEL);
 	if (!mem->va)
 		return -1;
 	memset(mem->va, 0, mem->size);
@@ -484,7 +485,7 @@ static void wrb_fill_hdr(struct be_adapter *adapter, struct be_eth_hdr_wrb *hdr,
 	AMAP_SET_BITS(struct amap_eth_hdr_wrb, len, hdr, len);
 }
 
-static void unmap_tx_frag(struct pci_dev *pdev, struct be_eth_wrb *wrb,
+static void unmap_tx_frag(struct device *dev, struct be_eth_wrb *wrb,
 		bool unmap_single)
 {
 	dma_addr_t dma;
@@ -494,11 +495,10 @@ static void unmap_tx_frag(struct pci_dev *pdev, struct be_eth_wrb *wrb,
 	dma = (u64)wrb->frag_pa_hi << 32 | (u64)wrb->frag_pa_lo;
 	if (wrb->frag_len) {
 		if (unmap_single)
-			pci_unmap_single(pdev, dma, wrb->frag_len,
-				PCI_DMA_TODEVICE);
+			dma_unmap_single(dev, dma, wrb->frag_len,
+					 DMA_TO_DEVICE);
 		else
-			pci_unmap_page(pdev, dma, wrb->frag_len,
-				PCI_DMA_TODEVICE);
+			dma_unmap_page(dev, dma, wrb->frag_len, DMA_TO_DEVICE);
 	}
 }
 
@@ -507,7 +507,7 @@ static int make_tx_wrbs(struct be_adapter *adapter,
 {
 	dma_addr_t busaddr;
 	int i, copied = 0;
-	struct pci_dev *pdev = adapter->pdev;
+	struct device *dev = &adapter->pdev->dev;
 	struct sk_buff *first_skb = skb;
 	struct be_queue_info *txq = &adapter->tx_obj.q;
 	struct be_eth_wrb *wrb;
@@ -521,9 +521,8 @@ static int make_tx_wrbs(struct be_adapter *adapter,
 
 	if (skb->len > skb->data_len) {
 		int len = skb_headlen(skb);
-		busaddr = pci_map_single(pdev, skb->data, len,
-					 PCI_DMA_TODEVICE);
-		if (pci_dma_mapping_error(pdev, busaddr))
+		busaddr = dma_map_single(dev, skb->data, len, DMA_TO_DEVICE);
+		if (dma_mapping_error(dev, busaddr))
 			goto dma_err;
 		map_single = true;
 		wrb = queue_head_node(txq);
@@ -536,10 +535,9 @@ static int make_tx_wrbs(struct be_adapter *adapter,
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		struct skb_frag_struct *frag =
 			&skb_shinfo(skb)->frags[i];
-		busaddr = pci_map_page(pdev, frag->page,
-				       frag->page_offset,
-				       frag->size, PCI_DMA_TODEVICE);
-		if (pci_dma_mapping_error(pdev, busaddr))
+		busaddr = dma_map_page(dev, frag->page, frag->page_offset,
+				       frag->size, DMA_TO_DEVICE);
+		if (dma_mapping_error(dev, busaddr))
 			goto dma_err;
 		wrb = queue_head_node(txq);
 		wrb_fill(wrb, busaddr, frag->size);
@@ -563,7 +561,7 @@ dma_err:
 	txq->head = map_head;
 	while (copied) {
 		wrb = queue_head_node(txq);
-		unmap_tx_frag(pdev, wrb, map_single);
+		unmap_tx_frag(dev, wrb, map_single);
 		map_single = false;
 		copied -= wrb->frag_len;
 		queue_head_inc(txq);
@@ -888,8 +886,9 @@ get_rx_page_info(struct be_adapter *adapter,
 	BUG_ON(!rx_page_info->page);
 
 	if (rx_page_info->last_page_user) {
-		pci_unmap_page(adapter->pdev, dma_unmap_addr(rx_page_info, bus),
-			adapter->big_page_size, PCI_DMA_FROMDEVICE);
+		dma_unmap_page(&adapter->pdev->dev,
+			       dma_unmap_addr(rx_page_info, bus),
+			       adapter->big_page_size, DMA_FROM_DEVICE);
 		rx_page_info->last_page_user = false;
 	}
 
@@ -1195,9 +1194,9 @@ static void be_post_rx_frags(struct be_rx_obj *rxo)
 				rxo->stats.rx_post_fail++;
 				break;
 			}
-			page_dmaaddr = pci_map_page(adapter->pdev, pagep, 0,
-						adapter->big_page_size,
-						PCI_DMA_FROMDEVICE);
+			page_dmaaddr = dma_map_page(&adapter->pdev->dev, pagep,
+						    0, adapter->big_page_size,
+						    DMA_FROM_DEVICE);
 			page_info->page_offset = 0;
 		} else {
 			get_page(pagep);
@@ -1270,8 +1269,8 @@ static void be_tx_compl_process(struct be_adapter *adapter, u16 last_index)
 	do {
 		cur_index = txq->tail;
 		wrb = queue_tail_node(txq);
-		unmap_tx_frag(adapter->pdev, wrb, (unmap_skb_hdr &&
-					skb_headlen(sent_skb)));
+		unmap_tx_frag(&adapter->pdev->dev, wrb,
+			      (unmap_skb_hdr && skb_headlen(sent_skb)));
 		unmap_skb_hdr = false;
 
 		num_wrbs++;
@@ -2179,7 +2178,8 @@ static int be_setup_wol(struct be_adapter *adapter, bool enable)
 	memset(mac, 0, ETH_ALEN);
 
 	cmd.size = sizeof(struct be_cmd_req_acpi_wol_magic_config);
-	cmd.va = pci_alloc_consistent(adapter->pdev, cmd.size, &cmd.dma);
+	cmd.va = dma_alloc_coherent(&adapter->pdev->dev, cmd.size, &cmd.dma,
+				    GFP_KERNEL);
 	if (cmd.va == NULL)
 		return -1;
 	memset(cmd.va, 0, cmd.size);
@@ -2190,8 +2190,8 @@ static int be_setup_wol(struct be_adapter *adapter, bool enable)
 		if (status) {
 			dev_err(&adapter->pdev->dev,
 				"Could not enable Wake-on-lan\n");
-			pci_free_consistent(adapter->pdev, cmd.size, cmd.va,
-					cmd.dma);
+			dma_free_coherent(&adapter->pdev->dev, cmd.size, cmd.va,
+					  cmd.dma);
 			return status;
 		}
 		status = be_cmd_enable_magic_wol(adapter,
@@ -2204,7 +2204,7 @@ static int be_setup_wol(struct be_adapter *adapter, bool enable)
 		pci_enable_wake(adapter->pdev, PCI_D3cold, 0);
 	}
 
-	pci_free_consistent(adapter->pdev, cmd.size, cmd.va, cmd.dma);
+	dma_free_coherent(&adapter->pdev->dev, cmd.size, cmd.va, cmd.dma);
 	return status;
 }
 
@@ -2528,8 +2528,8 @@ int be_load_fw(struct be_adapter *adapter, u8 *func)
 	dev_info(&adapter->pdev->dev, "Flashing firmware file %s\n", fw_file);
 
 	flash_cmd.size = sizeof(struct be_cmd_write_flashrom) + 32*1024;
-	flash_cmd.va = pci_alloc_consistent(adapter->pdev, flash_cmd.size,
-					&flash_cmd.dma);
+	flash_cmd.va = dma_alloc_coherent(&adapter->pdev->dev, flash_cmd.size,
+					  &flash_cmd.dma, GFP_KERNEL);
 	if (!flash_cmd.va) {
 		status = -ENOMEM;
 		dev_err(&adapter->pdev->dev,
@@ -2558,8 +2558,8 @@ int be_load_fw(struct be_adapter *adapter, u8 *func)
 		status = -1;
 	}
 
-	pci_free_consistent(adapter->pdev, flash_cmd.size, flash_cmd.va,
-				flash_cmd.dma);
+	dma_free_coherent(&adapter->pdev->dev, flash_cmd.size, flash_cmd.va,
+			  flash_cmd.dma);
 	if (status) {
 		dev_err(&adapter->pdev->dev, "Firmware load error\n");
 		goto fw_exit;
@@ -2700,13 +2700,13 @@ static void be_ctrl_cleanup(struct be_adapter *adapter)
 	be_unmap_pci_bars(adapter);
 
 	if (mem->va)
-		pci_free_consistent(adapter->pdev, mem->size,
-			mem->va, mem->dma);
+		dma_free_coherent(&adapter->pdev->dev, mem->size, mem->va,
+				  mem->dma);
 
 	mem = &adapter->mc_cmd_mem;
 	if (mem->va)
-		pci_free_consistent(adapter->pdev, mem->size,
-			mem->va, mem->dma);
+		dma_free_coherent(&adapter->pdev->dev, mem->size, mem->va,
+				  mem->dma);
 }
 
 static int be_ctrl_init(struct be_adapter *adapter)
@@ -2721,8 +2721,10 @@ static int be_ctrl_init(struct be_adapter *adapter)
 		goto done;
 
 	mbox_mem_alloc->size = sizeof(struct be_mcc_mailbox) + 16;
-	mbox_mem_alloc->va = pci_alloc_consistent(adapter->pdev,
-				mbox_mem_alloc->size, &mbox_mem_alloc->dma);
+	mbox_mem_alloc->va = dma_alloc_coherent(&adapter->pdev->dev,
+						mbox_mem_alloc->size,
+						&mbox_mem_alloc->dma,
+						GFP_KERNEL);
 	if (!mbox_mem_alloc->va) {
 		status = -ENOMEM;
 		goto unmap_pci_bars;
@@ -2734,8 +2736,9 @@ static int be_ctrl_init(struct be_adapter *adapter)
 	memset(mbox_mem_align->va, 0, sizeof(struct be_mcc_mailbox));
 
 	mc_cmd_mem->size = sizeof(struct be_cmd_req_mcast_mac_config);
-	mc_cmd_mem->va = pci_alloc_consistent(adapter->pdev, mc_cmd_mem->size,
-			&mc_cmd_mem->dma);
+	mc_cmd_mem->va = dma_alloc_coherent(&adapter->pdev->dev,
+					    mc_cmd_mem->size, &mc_cmd_mem->dma,
+					    GFP_KERNEL);
 	if (mc_cmd_mem->va == NULL) {
 		status = -ENOMEM;
 		goto free_mbox;
@@ -2751,8 +2754,8 @@ static int be_ctrl_init(struct be_adapter *adapter)
 	return 0;
 
 free_mbox:
-	pci_free_consistent(adapter->pdev, mbox_mem_alloc->size,
-		mbox_mem_alloc->va, mbox_mem_alloc->dma);
+	dma_free_coherent(&adapter->pdev->dev, mbox_mem_alloc->size,
+			  mbox_mem_alloc->va, mbox_mem_alloc->dma);
 
 unmap_pci_bars:
 	be_unmap_pci_bars(adapter);
@@ -2766,8 +2769,8 @@ static void be_stats_cleanup(struct be_adapter *adapter)
 	struct be_dma_mem *cmd = &adapter->stats_cmd;
 
 	if (cmd->va)
-		pci_free_consistent(adapter->pdev, cmd->size,
-			cmd->va, cmd->dma);
+		dma_free_coherent(&adapter->pdev->dev, cmd->size,
+				  cmd->va, cmd->dma);
 }
 
 static int be_stats_init(struct be_adapter *adapter)
@@ -2775,7 +2778,8 @@ static int be_stats_init(struct be_adapter *adapter)
 	struct be_dma_mem *cmd = &adapter->stats_cmd;
 
 	cmd->size = sizeof(struct be_cmd_req_get_stats);
-	cmd->va = pci_alloc_consistent(adapter->pdev, cmd->size, &cmd->dma);
+	cmd->va = dma_alloc_coherent(&adapter->pdev->dev, cmd->size, &cmd->dma,
+				     GFP_KERNEL);
 	if (cmd->va == NULL)
 		return -1;
 	memset(cmd->va, 0, cmd->size);
@@ -2918,11 +2922,11 @@ static int __devinit be_probe(struct pci_dev *pdev,
 	adapter->netdev = netdev;
 	SET_NETDEV_DEV(netdev, &pdev->dev);
 
-	status = pci_set_dma_mask(pdev, DMA_BIT_MASK(64));
+	status = dma_set_mask(&pdev->dev, DMA_BIT_MASK(64));
 	if (!status) {
 		netdev->features |= NETIF_F_HIGHDMA;
 	} else {
-		status = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
+		status = dma_set_mask(&pdev->dev, DMA_BIT_MASK(32));
 		if (status) {
 			dev_err(&pdev->dev, "Could not set PCI DMA Mask\n");
 			goto free_netdev;
-- 
1.7.3.4


^ permalink raw reply related

* Re: Network performance with small packets
From: Shirley Ma @ 2011-02-02 18:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev, netdev-owner,
	Sridhar Samudrala, Steve Dobbelstein
In-Reply-To: <20110202173213.GA13907@redhat.com>

On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote:
> OK, but this should have no effect with a vhost patch
> which should ensure that we don't get an interrupt
> until the queue is at least half empty.
> Right?

There should be some coordination between guest and vhost. We shouldn't
count the TX packets when netif queue is enabled since next guest TX
xmit will free any used buffers in vhost. We need to be careful here in
case we miss the interrupts when netif queue has stopped.

However we can't change old guest so we can test the patches separately
for guest only, vhost only, and the combination.

> > > 
> > > Yes, it seems unrelated to tx interrupts. 
> > 
> > The issue is more likely related to latency.
> 
> Could be. Why do you think so?

Since I played with latency hack, I can see performance difference for
different latency.

> > Do you have anything in
> > mind on how to reduce vhost latency?
> > 
> > Thanks
> > Shirley
> 
> Hmm, bypassing the bridge might help a bit.
> Are you using tap+bridge or macvtap? 

I am using tap+bridge for TCP_RR test, I think Steven tested macvtap
before. He might have some data from his workload performance
measurement.

Shirley

^ permalink raw reply

* Re: Network performance with small packets
From: Michael S. Tsirkin @ 2011-02-02 18:20 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <1296661371.25430.13.camel@localhost.localdomain>

On Wed, Feb 02, 2011 at 07:42:51AM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 12:49 +0200, Michael S. Tsirkin wrote:
> > On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote:
> > > On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote:
> > > > w/i guest change, I played around the parameters,for example: I
> > could
> > > > get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message
> > > > size,
> > > > w/i dropping packet, I was able to get up to 6.2Gb/s with similar
> > CPU
> > > > usage. 
> > > 
> > > I meant w/o guest change, only vhost changes. Sorry about that.
> > > 
> > > Shirley
> > 
> > Ah, excellent. What were the parameters? 
> 
> I used half of the ring size 129 for packet counters,
> but the
> performance is still not as good as dropping packets on guest, 3.7 Gb/s
> vs. 6.2Gb/s.
> 
> Shirley

How many packets and bytes per interrupt are sent?
Also, what about other values for the counters and other counters?

What does your patch do? Just drop packets instead of
stopping the interface?

To have an understanding when should we drop packets
in the guest, we need to know *why* does it help.
Otherwise, how do we know it will work for others?
Note that qdisc will drop packets when it overruns -
so what is different? Also, are we over-running some other queue
somewhere?

-- 
MST

^ permalink raw reply

* [PATCH v4 2/2] iproute2: support device group semantics
From: Vlad Dogaru @ 2011-02-02 18:23 UTC (permalink / raw)
  To: netdev; +Cc: Vlad Dogaru, Stephen Hemminger, Patrick McHardy
In-Reply-To: <1296671021-24421-1-git-send-email-ddvlad@rosedu.org>

Add the group keyword to ip link set, which has the following meaning:
If both a group and a device name are pressent, we change the device's
group to the specified one. If only a group is present, then the
operation specified by the rest of the command should apply on an entire
group, not a single device.

So, to set eth0 to the default group, one would use
	ip link set dev eth0 group default

Conversely, to set all the devices in the default group down, use
	ip link set group default down

Signed-off-by: Vlad Dogaru <ddvlad@rosedu.org>
---
 include/utils.h |    3 ++-
 ip/iplink.c     |   40 +++++++++++++++++++++++++++++++++++++---
 ip/link_veth.c  |    3 ++-
 man/man8/ip.8   |   19 +++++++++++++++++--
 4 files changed, 58 insertions(+), 7 deletions(-)

diff --git a/include/utils.h b/include/utils.h
index 3da6998..595a7d6 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -151,5 +151,6 @@ extern int makeargs(char *line, char *argv[], int maxargs);
 
 struct iplink_req;
 int iplink_parse(int argc, char **argv, struct iplink_req *req,
-		char **name, char **type, char **link, char **dev);
+		char **name, char **type, char **link, char **dev,
+		int *group);
 #endif /* __UTILS_H__ */
diff --git a/ip/iplink.c b/ip/iplink.c
index 97a960b..8160855 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -51,7 +51,7 @@ void iplink_usage(void)
 		fprintf(stderr, "                   type TYPE [ ARGS ]\n");
 		fprintf(stderr, "       ip link delete DEV type TYPE [ ARGS ]\n");
 		fprintf(stderr, "\n");
-		fprintf(stderr, "       ip link set DEVICE [ { up | down } ]\n");
+		fprintf(stderr, "       ip link set { dev DEVICE | group DEVGROUP } [ { up | down } ]\n");
 	} else
 		fprintf(stderr, "Usage: ip link set DEVICE [ { up | down } ]\n");
 
@@ -244,7 +244,7 @@ int iplink_parse_vf(int vf, int *argcp, char ***argvp,
 
 
 int iplink_parse(int argc, char **argv, struct iplink_req *req,
-		char **name, char **type, char **link, char **dev)
+		char **name, char **type, char **link, char **dev, int *group)
 {
 	int ret, len;
 	char abuf[32];
@@ -253,6 +253,7 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
 	int netns = -1;
 	int vf = -1;
 
+	*group = -1;
 	ret = argc;
 
 	while (argc > 0) {
@@ -383,6 +384,12 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
 				  *argv, strlen(*argv));
 			argc--; argv++;
 			break;
+		} else if (strcmp(*argv, "group") == 0) {
+			NEXT_ARG();
+			if (*group != -1)
+				duparg("group", *argv);
+			if (rtnl_group_a2n(group, *argv))
+				invarg("Invalid \"group\" value\n", *argv);
 		} else {
 			if (strcmp(*argv, "dev") == 0) {
 				NEXT_ARG();
@@ -406,6 +413,7 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
 	char *name = NULL;
 	char *link = NULL;
 	char *type = NULL;
+	int group;
 	struct link_util *lu = NULL;
 	struct iplink_req req;
 	int ret;
@@ -417,12 +425,38 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
 	req.n.nlmsg_type = cmd;
 	req.i.ifi_family = preferred_family;
 
-	ret = iplink_parse(argc, argv, &req, &name, &type, &link, &dev);
+	ret = iplink_parse(argc, argv, &req, &name, &type, &link, &dev, &group);
 	if (ret < 0)
 		return ret;
 
 	argc -= ret;
 	argv += ret;
+
+	if (group != -1) {
+		if (dev)
+			addattr_l(&req.n, sizeof(req), IFLA_GROUP,
+					&group, sizeof(group));
+		else {
+			if (argc) {
+				fprintf(stderr, "Garbage instead of arguments "
+						"\"%s ...\". Try \"ip link "
+						"help\".\n", *argv);
+				return -1;
+			}
+			if (flags & NLM_F_CREATE) {
+				fprintf(stderr, "group cannot be used when "
+						"creating devices.\n");
+				return -1;
+			}
+
+			req.i.ifi_index = 0;
+			addattr32(&req.n, sizeof(req), IFLA_GROUP, group);
+			if (rtnl_talk(&rth, &req.n, 0, 0, NULL, NULL, NULL) < 0)
+				exit(2);
+			return 0;
+		}
+	}
+
 	ll_init_map(&rth);
 
 	if (type) {
diff --git a/ip/link_veth.c b/ip/link_veth.c
index 9f5e871..3d19b01 100644
--- a/ip/link_veth.c
+++ b/ip/link_veth.c
@@ -30,6 +30,7 @@ static int veth_parse_opt(struct link_util *lu, int argc, char **argv,
 	char *name, *type, *link, *dev;
 	int err, len;
 	struct rtattr * data;
+	int group;
 
 	if (strcmp(argv[0], "peer") != 0) {
 		usage();
@@ -42,7 +43,7 @@ static int veth_parse_opt(struct link_util *lu, int argc, char **argv,
 	hdr->nlmsg_len += sizeof(struct ifinfomsg);
 
 	err = iplink_parse(argc - 1, argv + 1, (struct iplink_req *)hdr,
-			   &name, &type, &link, &dev);
+			   &name, &type, &link, &dev, &group);
 	if (err < 0)
 		return err;
 
diff --git a/man/man8/ip.8 b/man/man8/ip.8
index 730788a..8f82842 100644
--- a/man/man8/ip.8
+++ b/man/man8/ip.8
@@ -55,8 +55,10 @@ ip \- show / manipulate routing, devices, policy routing and tunnels
 .RI "[ " ARGS " ]"
 
 .ti -8
-.BI "ip link set " DEVICE
-.RB "{ " up " | " down " | " arp " { " on " | " off " } |"
+.BR "ip link set " {
+.IR DEVICE " | "
+.BI "group " GROUP
+.RB "} { " up " | " down " | " arp " { " on " | " off " } |"
 .br
 .BR promisc " { " on " | " off " } |"
 .br
@@ -930,6 +932,13 @@ specifies network device to operate on. When configuring SR-IOV Virtual Fuction
 device.
 
 .TP
+.BI group " GROUP "
+.I GROUP
+has a dual role: If both group and dev are present, then move the device to the
+specified group.  If only a group is specified, then the command operates on
+all devices in that group.
+
+.TP
 .BR up " and " down
 change the state of the device to
 .B UP
@@ -996,6 +1005,12 @@ move the device to the network namespace associated with the process
 give the device a symbolic name for easy reference.
 
 .TP
+.BI group " GROUP"
+specify the group the device belongs to.
+The available groups are listed in file
+.BR "/etc/iproute2/group" .
+
+.TP
 .BI vf " NUM"
 specify a Virtual Function device to be configured. The associated PF device
 must be specified using the
-- 
1.7.1


^ permalink raw reply related

* [PATCH v4 0/2] iproute2: support for device groups
From: Vlad Dogaru @ 2011-02-02 18:23 UTC (permalink / raw)
  To: netdev; +Cc: Vlad Dogaru, Stephen Hemminger, Patrick McHardy

This patch series adds userspace support for network device groups.
There is support for setting device groups, listing only interfaces of a
specific group, and setting basic device parameters for all interfaces
in a group.

Changes since version 3:
 * drop devgroup keyword. There is a single keyword for specifying
   groups now.
 * store group names internally as a hash, similar to routing table
   names. This is more efficient for batch mode operations.

Changes since version 2:
 * groups now have names. The mapping between names and numbers (which
   are used by the kernel) is in /etc/iproute2/group.
 * the ip man page and usage display function have been updated to
   reflect the new functionality.

Changes since version 1:
 * a single attribute is used for both setting the group of a device and
   manipulating an entire group.

Vlad Dogaru (2):
  iproute2: support listing devices by group
  iproute2: support device group semantics

 etc/iproute2/group        |    2 +
 include/linux/if_link.h   |    1 +
 include/linux/netdevice.h |    2 +-
 include/rt_names.h        |    1 +
 include/utils.h           |    3 +-
 ip/ipaddress.c            |   14 ++++++++++++
 ip/iplink.c               |   42 ++++++++++++++++++++++++++++++++++---
 ip/link_veth.c            |    3 +-
 lib/rt_names.c            |   50 +++++++++++++++++++++++++++++++++++++++++++++
 man/man8/ip.8             |   30 +++++++++++++++++++++++---
 10 files changed, 137 insertions(+), 11 deletions(-)
 create mode 100644 etc/iproute2/group

^ permalink raw reply

* [PATCH v4 1/2] iproute2: support listing devices by group
From: Vlad Dogaru @ 2011-02-02 18:23 UTC (permalink / raw)
  To: netdev; +Cc: Vlad Dogaru, Stephen Hemminger, Patrick McHardy
In-Reply-To: <1296671021-24421-1-git-send-email-ddvlad@rosedu.org>

User can specify device group to list by using the group keyword:

	ip link show group test

If no group is specified, 0 (default) is implied.

Signed-off-by: Vlad Dogaru <ddvlad@rosedu.org>
---
 etc/iproute2/group        |    2 +
 include/linux/if_link.h   |    1 +
 include/linux/netdevice.h |    2 +-
 include/rt_names.h        |    1 +
 ip/ipaddress.c            |   14 ++++++++++++
 ip/iplink.c               |    2 +-
 lib/rt_names.c            |   50 +++++++++++++++++++++++++++++++++++++++++++++
 man/man8/ip.8             |   11 ++++++++-
 8 files changed, 79 insertions(+), 4 deletions(-)
 create mode 100644 etc/iproute2/group

diff --git a/etc/iproute2/group b/etc/iproute2/group
new file mode 100644
index 0000000..6f000b2
--- /dev/null
+++ b/etc/iproute2/group
@@ -0,0 +1,2 @@
+# device group names
+0	default
diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index e87456c..54d05f9 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -135,6 +135,7 @@ enum {
 	IFLA_VF_PORTS,
 	IFLA_PORT_SELF,
 	IFLA_AF_SPEC,
+	IFLA_GROUP,
 	__IFLA_MAX
 };
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bec4e23..ad2e34d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -33,7 +33,7 @@
 
 #define MAX_ADDR_LEN	32		/* Largest hardware address length */
 
-
+#define INIT_NETDEV_GROUP	0	/* Initial group net devices belong to */
 
 /* Media selection options. */
 enum {
diff --git a/include/rt_names.h b/include/rt_names.h
index 07a10e0..e5dbd45 100644
--- a/include/rt_names.h
+++ b/include/rt_names.h
@@ -13,6 +13,7 @@ int rtnl_rtscope_a2n(__u32 *id, char *arg);
 int rtnl_rttable_a2n(__u32 *id, char *arg);
 int rtnl_rtrealm_a2n(__u32 *id, char *arg);
 int rtnl_dsfield_a2n(__u32 *id, char *arg);
+int rtnl_group_a2n(int *id, char *arg);
 
 const char *inet_proto_n2a(int proto, char *buf, int len);
 int inet_proto_a2n(char *buf);
diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index a775ecd..e4748e3 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -49,6 +49,7 @@ static struct
 	char *flushb;
 	int flushp;
 	int flushe;
+	int group;
 } filter;
 
 static int do_link;
@@ -246,6 +247,12 @@ int print_linkinfo(const struct sockaddr_nl *who,
 	    fnmatch(filter.label, RTA_DATA(tb[IFLA_IFNAME]), 0))
 		return 0;
 
+	if (tb[IFLA_GROUP]) {
+		int group = *(int*)RTA_DATA(tb[IFLA_GROUP]);
+		if (group != filter.group)
+			return -1;
+	}
+
 	if (n->nlmsg_type == RTM_DELLINK)
 		fprintf(fp, "Deleted ");
 
@@ -718,9 +725,12 @@ static int ipaddr_list_or_flush(int argc, char **argv, int flush)
 	if (filter.family == AF_UNSPEC)
 		filter.family = preferred_family;
 
+	filter.group = INIT_NETDEV_GROUP;
+
 	if (flush) {
 		if (argc <= 0) {
 			fprintf(stderr, "Flush requires arguments.\n");
+
 			return -1;
 		}
 		if (filter.family == AF_PACKET) {
@@ -779,6 +789,10 @@ static int ipaddr_list_or_flush(int argc, char **argv, int flush)
 		} else if (strcmp(*argv, "label") == 0) {
 			NEXT_ARG();
 			filter.label = *argv;
+		} else if (strcmp(*argv, "group") == 0) {
+			NEXT_ARG();
+			if (rtnl_group_a2n(&filter.group, *argv))
+				invarg("Invalid \"group\" value\n", *argv);
 		} else {
 			if (strcmp(*argv, "dev") == 0) {
 				NEXT_ARG();
diff --git a/ip/iplink.c b/ip/iplink.c
index cb2c4f5..97a960b 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -71,7 +71,7 @@ void iplink_usage(void)
 	fprintf(stderr, "	                  [ vf NUM [ mac LLADDR ]\n");
 	fprintf(stderr, "				   [ vlan VLANID [ qos VLAN-QOS ] ]\n");
 	fprintf(stderr, "				   [ rate TXRATE ] ] \n");
-	fprintf(stderr, "       ip link show [ DEVICE ]\n");
+	fprintf(stderr, "       ip link show [ DEVICE | group GROUP ]\n");
 
 	if (iplink_have_newlink()) {
 		fprintf(stderr, "\n");
diff --git a/lib/rt_names.c b/lib/rt_names.c
index 52edfe3..30d43cd 100644
--- a/lib/rt_names.c
+++ b/lib/rt_names.c
@@ -447,3 +447,53 @@ int rtnl_dsfield_a2n(__u32 *id, char *arg)
 	return 0;
 }
 
+
+static struct rtnl_hash_entry dflt_group_entry  = { .id = 0, .name = "default" };
+
+static struct rtnl_hash_entry * rtnl_group_hash[256] = {
+	[0] = &dflt_group_entry,
+};
+
+static int rtnl_group_init;
+
+static void rtnl_group_initialize(void)
+{
+	rtnl_group_init = 1;
+	rtnl_hash_initialize("/etc/iproute2/group",
+			     rtnl_group_hash, 256);
+}
+
+int rtnl_group_a2n(int *id, char *arg)
+{
+	static char *cache = NULL;
+	static unsigned long res;
+	struct rtnl_hash_entry *entry;
+	char *end;
+	int i;
+
+	if (cache && strcmp(cache, arg) == 0) {
+		*id = res;
+		return 0;
+	}
+
+	if (!rtnl_group_init)
+		rtnl_group_initialize();
+
+	for (i=0; i<256; i++) {
+		entry = rtnl_group_hash[i];
+		while (entry && strcmp(entry->name, arg))
+			entry = entry->next;
+		if (entry) {
+			cache = entry->name;
+			res = entry->id;
+			*id = res;
+			return 0;
+		}
+	}
+
+	i = strtol(arg, &end, 0);
+	if (!end || end == arg || *end || i < 0)
+		return -1;
+	*id = i;
+	return 0;
+}
diff --git a/man/man8/ip.8 b/man/man8/ip.8
index 8d55fa9..730788a 100644
--- a/man/man8/ip.8
+++ b/man/man8/ip.8
@@ -99,7 +99,9 @@ ip \- show / manipulate routing, devices, policy routing and tunnels
 
 .ti -8
 .B ip link show
-.RI "[ " DEVICE " ]"
+.RI "[ " DEVICE " | "
+.B group
+.IR GROUP " ]"
 
 .ti -8
 .BR "ip addr" " { " add " | " del " } "
@@ -1056,7 +1058,12 @@ call.
 .BI dev " NAME " (default)
 .I NAME
 specifies the network device to show.
-If this argument is omitted all devices are listed.
+If this argument is omitted all devices in the default group are listed.
+
+.TP
+.BI group " GROUP "
+.I GROUP
+specifies what group of devices to show.
 
 .TP
 .B up
-- 
1.7.1


^ permalink raw reply related

* Re: Network performance with small packets
From: Shirley Ma @ 2011-02-02 18:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <20110202182024.GA14257@redhat.com>

On Wed, 2011-02-02 at 20:20 +0200, Michael S. Tsirkin wrote:
> How many packets and bytes per interrupt are sent?
> Also, what about other values for the counters and other counters?
> 
> What does your patch do? Just drop packets instead of
> stopping the interface?
> 
> To have an understanding when should we drop packets
> in the guest, we need to know *why* does it help.
> Otherwise, how do we know it will work for others?
> Note that qdisc will drop packets when it overruns -
> so what is different? Also, are we over-running some other queue
> somewhere? 

Agreed. I am trying to put more debugging output to look for all these
answers.

Shirley


^ permalink raw reply

* Re: Network performance with small packets
From: Michael S. Tsirkin @ 2011-02-02 18:27 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev, netdev-owner,
	Sridhar Samudrala, Steve Dobbelstein
In-Reply-To: <1296670311.25430.49.camel@localhost.localdomain>

On Wed, Feb 02, 2011 at 10:11:51AM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote:
> > OK, but this should have no effect with a vhost patch
> > which should ensure that we don't get an interrupt
> > until the queue is at least half empty.
> > Right?
> 
> There should be some coordination between guest and vhost.

What kind of coordination? With a patched vhost, and a full ring.
you should get an interrupt per 100 packets.
Is this what you see? And if yes, isn't the guest patch
doing nothing then?

> We shouldn't
> count the TX packets when netif queue is enabled since next guest TX
> xmit will free any used buffers in vhost. We need to be careful here in
> case we miss the interrupts when netif queue has stopped.
> 
> However we can't change old guest so we can test the patches separately
> for guest only, vhost only, and the combination.
> 
> > > > 
> > > > Yes, it seems unrelated to tx interrupts. 
> > > 
> > > The issue is more likely related to latency.
> > 
> > Could be. Why do you think so?
> 
> Since I played with latency hack, I can see performance difference for
> different latency.

Which hack was that?

> > > Do you have anything in
> > > mind on how to reduce vhost latency?
> > > 
> > > Thanks
> > > Shirley
> > 
> > Hmm, bypassing the bridge might help a bit.
> > Are you using tap+bridge or macvtap? 
> 
> I am using tap+bridge for TCP_RR test, I think Steven tested macvtap
> before. He might have some data from his workload performance
> measurement.
> 
> Shirley

^ permalink raw reply

* Re: af_unix unix_getname: return size for unnamed sockets too small?
From: Eric W. Biederman @ 2011-02-02 18:59 UTC (permalink / raw)
  To: Marcus Meissner; +Cc: davem, eric.dumazet, netdev, linux-kernel, gorcunov
In-Reply-To: <20110202174015.GB25515@suse.de>

Marcus Meissner <meissner@suse.de> writes:

> Hi,
>
> In net/unix/af_unix.c::unix_getname() there is a small problem:
>
>         if (!u->addr) {
>                 sunaddr->sun_family = AF_UNIX;
>                 sunaddr->sun_path[0] = 0;		// not copied out
>                 *uaddr_len = sizeof(short);
>         } else {
>                 struct unix_address *addr = u->addr;
>
>                 *uaddr_len = addr->len;
>                 memcpy(sunaddr, addr->name, *uaddr_len);
>         }
>
> The if (!u->addr) case will not copy out the \0 in the sun_path, as
> uaddr_len is just the size of sun_family.
>
> (Shown by socat crashing after decoding gethostname return and expected
> sun_path to be a valid string (and not seeing the \0)).

Perhaps my memory is scrambled but the sun_path has embedded '\0's so I
don't see how a correct application can expect the path to be '\0'
terminated.  An application should be looking at the length we give it.

> Should it perhaps be *uaddr_len = sizeof(short)+sizeof(char)?

I don't think so.

Eric

^ permalink raw reply

* Re: Network performance with small packets
From: Shirley Ma @ 2011-02-02 19:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev, netdev-owner,
	Sridhar Samudrala, Steve Dobbelstein
In-Reply-To: <20110202182720.GB14257@redhat.com>

On Wed, 2011-02-02 at 20:27 +0200, Michael S. Tsirkin wrote:
> On Wed, Feb 02, 2011 at 10:11:51AM -0800, Shirley Ma wrote:
> > On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote:
> > > OK, but this should have no effect with a vhost patch
> > > which should ensure that we don't get an interrupt
> > > until the queue is at least half empty.
> > > Right?
> > 
> > There should be some coordination between guest and vhost.
> 
> What kind of coordination? With a patched vhost, and a full ring.
> you should get an interrupt per 100 packets.
> Is this what you see? And if yes, isn't the guest patch
> doing nothing then?

vhost_signal won't be able send any TX interrupts to guest when guest TX
interrupt is disabled. Guest TX interrupt is only enabled when running
out of descriptors.

> > We shouldn't
> > count the TX packets when netif queue is enabled since next guest TX
> > xmit will free any used buffers in vhost. We need to be careful here
> in
> > case we miss the interrupts when netif queue has stopped.
> > 
> > However we can't change old guest so we can test the patches
> separately
> > for guest only, vhost only, and the combination.
> > 
> > > > > 
> > > > > Yes, it seems unrelated to tx interrupts. 
> > > > 
> > > > The issue is more likely related to latency.
> > > 
> > > Could be. Why do you think so?
> > 
> > Since I played with latency hack, I can see performance difference
> for
> > different latency.
> 
> Which hack was that? 

I tried to accumulate multiple guest to host notifications for TX xmits,
it did help multiple streams TCP_RR results; I also forced vhost
handle_tx to handle more packets; both hack seemed help.

Thanks
Shirley


^ permalink raw reply

* Re: Network performance with small packets
From: Michael S. Tsirkin @ 2011-02-02 20:17 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev, netdev-owner,
	Sridhar Samudrala, Steve Dobbelstein
In-Reply-To: <1296674975.25430.59.camel@localhost.localdomain>

On Wed, Feb 02, 2011 at 11:29:35AM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 20:27 +0200, Michael S. Tsirkin wrote:
> > On Wed, Feb 02, 2011 at 10:11:51AM -0800, Shirley Ma wrote:
> > > On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote:
> > > > OK, but this should have no effect with a vhost patch
> > > > which should ensure that we don't get an interrupt
> > > > until the queue is at least half empty.
> > > > Right?
> > > 
> > > There should be some coordination between guest and vhost.
> > 
> > What kind of coordination? With a patched vhost, and a full ring.
> > you should get an interrupt per 100 packets.
> > Is this what you see? And if yes, isn't the guest patch
> > doing nothing then?
> 
> vhost_signal won't be able send any TX interrupts to guest when guest TX
> interrupt is disabled. Guest TX interrupt is only enabled when running
> out of descriptors.

Well, this is also the only case where the queue is stopped, no?

> > > We shouldn't
> > > count the TX packets when netif queue is enabled since next guest TX
> > > xmit will free any used buffers in vhost. We need to be careful here
> > in
> > > case we miss the interrupts when netif queue has stopped.
> > > 
> > > However we can't change old guest so we can test the patches
> > separately
> > > for guest only, vhost only, and the combination.
> > > 
> > > > > > 
> > > > > > Yes, it seems unrelated to tx interrupts. 
> > > > > 
> > > > > The issue is more likely related to latency.
> > > > 
> > > > Could be. Why do you think so?
> > > 
> > > Since I played with latency hack, I can see performance difference
> > for
> > > different latency.
> > 
> > Which hack was that? 
> 
> I tried to accumulate multiple guest to host notifications for TX xmits,
> it did help multiple streams TCP_RR results;

I don't see a point to delay used idx update, do you?
So delaying just signal seems better, right?

> I also forced vhost
> handle_tx to handle more packets; both hack seemed help.
> 
> Thanks
> Shirley

Haven't noticed that part, how does your patch make it
handle more packets?

-- 
MST

^ permalink raw reply

* Re: Network performance with small packets
From: Shirley Ma @ 2011-02-02 21:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev, netdev-owner,
	Sridhar Samudrala, Steve Dobbelstein
In-Reply-To: <20110202201731.GB15150@redhat.com>

On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote:
> Well, this is also the only case where the queue is stopped, no?
Yes. I got some debugging data, I saw that sometimes there were so many
packets were waiting for free in guest between vhost_signal & guest xmit
callback. Looks like the time spent too long from vhost_signal to guest
xmit callback?

> > I tried to accumulate multiple guest to host notifications for TX
> xmits,
> > it did help multiple streams TCP_RR results;
> I don't see a point to delay used idx update, do you?

It might cause per vhost handle_tx processed more packets.

> So delaying just signal seems better, right?

I think I need to define the test matrix to collect data for TX xmit
from guest to host here for different tests.

Data to be collected:
---------------------
1. kvm_stat for VM, I/O exits
2. cpu utilization for both guest and host
3. cat /proc/interrupts on guest
4. packets rate from vhost handle_tx per loop
5. guest netif queue stop rate
6. how many packets are waiting for free between vhost signaling and
guest callback
7. performance results

Test
----
1. TCP_STREAM single stream test for 1K to 4K message size
2. TCP_RR (64 instance test): 128 - 1K request/response size

Different hacks
---------------
1. Base line data ( with the patch to fix capacity check first,
free_old_xmit_skbs returns number of skbs)

2. Drop packet data (will put some debugging in generic networking code)

3. Delay guest netif queue wake up until certain descriptors (1/2 ring
size, 1/4 ring size...) are available once the queue has stopped.

4. Accumulate more packets per vhost signal in handle_tx?

5. 3 & 4 combinations

6. Accumulate more packets per guest kick() (TCP_RR) by adding a timer? 

7. Accumulate more packets per vhost handle_tx() by adding some delay?

> Haven't noticed that part, how does your patch make it
handle more packets?

Added a delay in handle_tx().

What else?

It would take sometimes to do this.

Shirley


^ permalink raw reply

* [PATCH net-2.6] gro: reset skb_iif on reuse
From: andy @ 2011-02-02 21:27 UTC (permalink / raw)
  To: netdev

Like Herbert's change from a few days ago:

66c46d741e2e60f0e8b625b80edb0ab820c46d7a gro: Reset dev pointer on reuse

this may not be necessary at this point, but we should still clean up
the skb->skb_iif.  If not we may end up with an invalid valid for
skb->skb_iif when the skb is reused and the check is done in
__netif_receive_skb.

Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
---
 net/core/dev.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 4c90789..b6d0bf8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3426,6 +3426,7 @@ static void napi_reuse_skb(struct napi_struct *napi, struct sk_buff *skb)
 	skb_reserve(skb, NET_IP_ALIGN - skb_headroom(skb));
 	skb->vlan_tci = 0;
 	skb->dev = napi->dev;
+	skb->skb_iif = 0;
 
 	napi->skb = skb;
 }
-- 
1.7.3.2


^ permalink raw reply related

* Re: Network performance with small packets
From: Michael S. Tsirkin @ 2011-02-02 21:20 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev, netdev-owner,
	Sridhar Samudrala, Steve Dobbelstein
In-Reply-To: <1296680585.25430.98.camel@localhost.localdomain>

On Wed, Feb 02, 2011 at 01:03:05PM -0800, Shirley Ma wrote:
> On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote:
> > Well, this is also the only case where the queue is stopped, no?
> Yes. I got some debugging data, I saw that sometimes there were so many
> packets were waiting for free in guest between vhost_signal & guest xmit
> callback.

What does this mean?

> Looks like the time spent too long from vhost_signal to guest
> xmit callback?



> > > I tried to accumulate multiple guest to host notifications for TX
> > xmits,
> > > it did help multiple streams TCP_RR results;
> > I don't see a point to delay used idx update, do you?
> 
> It might cause per vhost handle_tx processed more packets.

I don't understand. It's a couple of writes - what is the issue?

> > So delaying just signal seems better, right?
> 
> I think I need to define the test matrix to collect data for TX xmit
> from guest to host here for different tests.
> 
> Data to be collected:
> ---------------------
> 1. kvm_stat for VM, I/O exits
> 2. cpu utilization for both guest and host
> 3. cat /proc/interrupts on guest
> 4. packets rate from vhost handle_tx per loop
> 5. guest netif queue stop rate
> 6. how many packets are waiting for free between vhost signaling and
> guest callback
> 7. performance results
> 
> Test
> ----
> 1. TCP_STREAM single stream test for 1K to 4K message size
> 2. TCP_RR (64 instance test): 128 - 1K request/response size
> 
> Different hacks
> ---------------
> 1. Base line data ( with the patch to fix capacity check first,
> free_old_xmit_skbs returns number of skbs)
> 
> 2. Drop packet data (will put some debugging in generic networking code)
> 
> 3. Delay guest netif queue wake up until certain descriptors (1/2 ring
> size, 1/4 ring size...) are available once the queue has stopped.
> 
> 4. Accumulate more packets per vhost signal in handle_tx?
> 
> 5. 3 & 4 combinations
> 
> 6. Accumulate more packets per guest kick() (TCP_RR) by adding a timer? 
> 
> 7. Accumulate more packets per vhost handle_tx() by adding some delay?
> 
> > Haven't noticed that part, how does your patch make it
> handle more packets?
> 
> Added a delay in handle_tx().
> 
> What else?
> 
> It would take sometimes to do this.
> 
> Shirley


Need to think about this.

^ permalink raw reply

* Re: [GIT PULL nf-next-2.6] IPVS build fixes and clean-ups
From: Simon Horman @ 2011-02-02 21:39 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Randy Dunlap, netdev, linux-next, linux-kernel, lvs-devel,
	Stephen Rothwell, Hans Schillstrom
In-Reply-To: <4D4840BC.4050303@trash.net>

On Tue, Feb 01, 2011 at 06:19:56PM +0100, Patrick McHardy wrote:
> Am 01.02.2011 18:05, schrieb Randy Dunlap:
> > On 02/01/11 02:04, Simon Horman wrote:
> >> On Tue, Feb 01, 2011 at 03:06:37PM +1100, Simon Horman wrote:
> >>> On Mon, Jan 31, 2011 at 04:50:09PM -0800, Randy Dunlap wrote:
> >>>> On Tue,  1 Feb 2011 11:14:11 +1100 Simon Horman wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> This short patch series addresses two linux-next build problems
> >>>>> raised by Randy Dunlap:
> >>>>>
> >>>>> * net/netfilter/ipvs/ip_vs_core.c:1891: warning: format '%lu' expects type 'long unsigned int', but argument 2 has type 'unsigned int'
> >>>>> * ERROR: "unregister_net_sysctl_table" [net/netfilter/ipvs/ip_vs.ko]
> >>>>>   ERROR: "register_net_sysctl_table" [net/netfilter/ipvs/ip_vs.ko] undefined!
> >>>>>
> >>>>> The remainder of the changset is cleanups that I noticed along the way.
> >>>>
> >>>> These 4 patches build successfully for me.
> >>>> However, I do see these warnings (sorry I missed them earlier):
> >>>>
> >>>> WARNING: net/netfilter/ipvs/ip_vs.o(.init.text+0x161): Section mismatch in reference from the function init_module() to the function .exit.text:ip_vs_sync_cleanup()
> >>>> WARNING: net/netfilter/ipvs/ip_vs.o(.init.text+0x161): Section mismatch in reference from the function init_module() to the function .exit.text:ip_vs_sync_cleanup()
> >>>
> >>> Thanks, I'll look into that. I will be travelling for a good portion of the
> >>> next day and a bit so I apologise in advance if that delays my next patch.
> >>
> >> Hi,
> >>
> >> I the following patch seems to be the right fix for this to me.
> >> I will send an amended pull request.
> >>
> >> IPVS: Remove ip_vs_sync_cleanup from section __exit
> >>
> >> ip_vs_sync_cleanup() may be called from ip_vs_init() on error
> >> and thus needs to be accesible from section __init
> >>
> >> Reporte-by: Randy Dunlap <randy.dunlap@oracle.com>
> >   Reported-by:
> > 
> >> Signed-off-by: Simon Horman <horms@verge.net.au>
> > 
> > Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
> 
> Thanks, since Simon is travelling, I'll apply the patches now
> with your and Hans's Acks and will push them to Dave later.

Thanks, I've finished travelling now :-)

^ permalink raw reply

* Re: Network performance with small packets
From: Shirley Ma @ 2011-02-02 21:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar2, David Miller, kvm, mashirle, netdev, netdev-owner,
	Sridhar Samudrala, Steve Dobbelstein
In-Reply-To: <20110202212047.GD15150@redhat.com>

On Wed, 2011-02-02 at 23:20 +0200, Michael S. Tsirkin wrote:
> > On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote:
> > > Well, this is also the only case where the queue is stopped, no?
> > Yes. I got some debugging data, I saw that sometimes there were so
> many
> > packets were waiting for free in guest between vhost_signal & guest
> xmit
> > callback.
> 
> What does this mean?

Let's look at the sequence here:

guest start_xmit()
	xmit_skb()
	if ring is full,
		enable_cb()

guest skb_xmit_done()
	disable_cb,
        printk free_old_xmit_skbs <-- it was between more than 1/2 to
full ring size
	printk vq->num_free 

vhost handle_tx()
	if (guest interrupt is enabled)
		signal guest to free xmit buffers

So between guest queue full/stopped queue/enable call back to guest
receives the callback from host to free_old_xmit_skbs, there were about
1/2 to full ring size descriptors available. I thought there were only a
few. (I disabled your vhost patch for this test.)
 

> > Looks like the time spent too long from vhost_signal to guest
> > xmit callback?
> 
> 
> 
> > > > I tried to accumulate multiple guest to host notifications for
> TX
> > > xmits,
> > > > it did help multiple streams TCP_RR results;
> > > I don't see a point to delay used idx update, do you?
> > 
> > It might cause per vhost handle_tx processed more packets.
> 
> I don't understand. It's a couple of writes - what is the issue?

Oh, handle_tx could process more packets per loop for multiple streams
TCP_RR case. I need to print out the data rate per loop to confirm this.

Shirley


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox