Netdev List
* Re: 3 packet TCP window limit?
From: Stephen Hemminger @ 2010-05-05 22:03 UTC (permalink / raw)
  To: Brian Bloniarz; +Cc: dormando, netdev, Rick Jones, shemminger
In-Reply-To: <4BE1DB82.2030403@athenacr.com>

On Wed, 05 May 2010 16:56:34 -0400
Brian Bloniarz <bmb@athenacr.com> wrote:

> dormando wrote:
> >> This sounds like TCP slow start.
> >>
> >> http://en.wikipedia.org/wiki/Slow-start
> >>
> >> As far as tunables you might want to play with the initcwnd route
> >> flag (see "ip route help")
> > 
> > Ah, yes, initcwnd was it. I'm well aware of TCP Congestion control / slow
> > start / etc. However I couldn't find the damn tunable for it :)
> 
> Documenting the flag in ip(8) might increase its visibility
> a little. I don't see it documented in the iproute2 git head,
> though it shows up on http://linux.die.net/man/8/ip somehow.
> 
> Stephen, do you know why that is?

No one sent me an official patch to change it?


* Re: [PATCH 2/6] netns: Teach network device kobjects which namespace they are in.
From: Serge E. Hallyn @ 2010-05-05 22:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Kay Sievers, linux-kernel, Tejun Heo,
	Cornelia Huck, Eric Dumazet, Benjamin LaHaise, netdev,
	David Miller
In-Reply-To: <m14oim2kcm.fsf@fess.ebiederm.org>

Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serue@us.ibm.com> writes:
> 
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> diff --git a/net/Kconfig b/net/Kconfig
> >> index 041c35e..265e33b 100644
> >> --- a/net/Kconfig
> >> +++ b/net/Kconfig
> >> @@ -45,6 +45,14 @@ config COMPAT_NETLINK_MESSAGES
> >> 
> >>  menu "Networking options"
> >> 
> >> +config NET_NS
> >> +	bool "Network namespace support"
> >> +	default n
> >> +	depends on EXPERIMENTAL && NAMESPACES
> >> +	help
> >> +	  Allow user space to create what appear to be multiple instances
> >> +	  of the network stack.
> >> +
> >
> > Hi Eric,
> >
> > I'm confused - NET_NS is defined in init/Kconfig right now.  Is the tree
> > you're working from very different from mine, or is this the unfortunate
> > result of the patches sitting so long?
> 
> Old patches, nothing that complains when you make a mistake like this,
> and apparently I have a blind spot in my personal code review.

haha, we all know about that.

> At one point it was not possible to enable the network namespace until
> the sysfs stuff was enabled, but things have been going on long enough
> that we worked around that restriction.

Yeah, I remember that, and leaving this wouldn't break anything.

> >>  int netdev_kobject_init(void)
> >>  {
> >> +	kobj_ns_type_register(&net_ns_type_operations);
> >> +#ifdef CONFIG_SYSFS
> >> +	register_pernet_subsys(&sysfs_net_ops);
> >> +#endif
> >>  	return class_register(&net_class);
> >
> > I think the kobj_ns_type_register() needs to be under
> > ifdef CONFIG_SYSFS as well, bc net_ns_type_operations is defined
> > under ifdef CONFIG_SYSFS.
> 
> kobj_ns_type_register should not be under CONFIG_SYSFS.  Which means
> that kobj_ns_type_operations needs not to be under CONFIG_SYSFS as
> well.  Thank you for spotting that bug.
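
To illustrate the point being agreed on here, a minimal standalone sketch (not the kernel code; every toy_* name is hypothetical) keeps the operations structure outside the CONFIG_SYSFS guard so that the unconditional registration always compiles, and guards only the sysfs-specific step:

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's namespace-type operations. */
struct toy_ns_type_operations {
	int type;
};

/* Defined unconditionally, so the unconditional registration below
 * always has an object to refer to, with or without CONFIG_SYSFS. */
static const struct toy_ns_type_operations toy_net_ns_type_operations = {
	.type = 1,
};

static const struct toy_ns_type_operations *toy_registered;
static int toy_pernet_registered;

static inline int toy_kobject_init(void)
{
	/* Always done, mirroring kobj_ns_type_register() in the patch. */
	toy_registered = &toy_net_ns_type_operations;
#ifdef CONFIG_SYSFS
	/* Only done when sysfs is built in. */
	toy_pernet_registered = 1;
#endif
	return 0;
}
```

If the structure definition were moved inside the #ifdef while the registration stayed outside it, a build without CONFIG_SYSFS would fail on the undefined symbol, which is the bug spotted above.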

np - outside of that,

Acked-by: Serge E. Hallyn <serue@us.ibm.com>

I saw no problems with the other patches, just don't feel qualified
to give an ack.

thanks,
-serge


* Re: linux kernel's IPV6_MULTICAST_HOPS default is 64; should be 1?
From: David Miller @ 2010-05-05 22:00 UTC (permalink / raw)
  To: brian.haley; +Cc: dlstevens, enh, netdev, netdev-owner
In-Reply-To: <4BE1907F.40903@hp.com>

From: Brian Haley <brian.haley@hp.com>
Date: Wed, 05 May 2010 11:36:31 -0400

> I now see that in Elliot's email, but I think it's incorrect.  The RFC
> says that setting it to -1 should get you the kernel default, which is
> now 1.  Without this change, setting it to -1 will get you 64, the
> old behavior.  If the user wants to, they can always just set it to
> 64 themselves, that's better than assuming when you set it to -1
> you're going to get 64.

It's not 64, it's whatever the per-route metric is.

> I'm just trying to make this follow the RFC and behave like other OSes
> for consistency.

I'm just trying to have a real compatibility story, and we don't with
any of the proposals you guys are giving me because they completely ignore
the fact that the old default comes from the route and is not some
constant.


* Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3
From: Arnd Bergmann @ 2010-05-05 21:53 UTC (permalink / raw)
  To: Dmitry Torokhov
  Cc: virtualization@lists.linux-foundation.org, Pankaj Thakkar,
	Christoph Hellwig, pv-drivers@vmware.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <201005051336.32712.dtor@vmware.com>

On Wednesday 05 May 2010 22:36:31 Dmitry Torokhov wrote:
> 
> On Wednesday 05 May 2010 01:09:48 pm Arnd Bergmann wrote:
> > > > If you have any interest in developing this further, do:
> > > > 
> > > >  (1) move the limited VF drivers directly into the kernel tree,
> > > >      talk to them through a normal ops vector
> > > 
> > > [PT] This assumes that all the VF drivers would always be available.
> > > Also we have to support windows and our current design supports it
> > > nicely in an OS agnostic manner.
> > 
> > Your approach assumes that the plugin is always available, which has
> > exactly the same implications.
> 
> Since plugin[s] are carried by the host they are indeed always
> available.

But what makes you think that you can build code that can be linked
into arbitrary future kernel versions? The kernel does not define any
calling conventions that are stable across multiple versions or
configurations. For example, you'd have to provide different binaries
for each combination of

- 32/64 bit code
- gcc -mregparm=?
- lockdep
- tracepoints
- stackcheck
- NOMMU
- highmem
- whatever new gets merged
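
The options in this list multiply rather than add. A rough sketch of the count, assuming for simplicity that each option is an independent two-way choice (an assumption; -mregparm, for instance, can take more than two values), with build_variants() being a hypothetical helper name:

```c
#include <stddef.h>

/* Number of distinct binaries needed when every option can vary
 * independently: the product of the number of choices per option. */
static inline unsigned long build_variants(const unsigned int *choices, size_t n)
{
	unsigned long total = 1;

	for (size_t i = 0; i < n; i++)
		total *= choices[i];
	return total;
}
```

With the seven named options treated as two-way choices, build_variants() returns 128, before counting kernel versions or anything merged later.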

If you build the plugins only for specific versions of "enterprise" Linux
kernels, the code becomes really hard to debug and maintain.
If you wrap everything in your own version of the existing interfaces, your
code gets bloated to the point of being unmaintainable.

So I have to correct myself: this is very different from assuming the
driver is available in the guest, it's actually much worse.

	Arnd


* Re: [PATCH net-next-2.6] net: __alloc_skb() speedup
From: David Miller @ 2010-05-05 21:52 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, hadi, therbert
In-Reply-To: <1273060809.2367.67.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 05 May 2010 14:00:09 +0200

> Sorry, I was thinking about the shinfo part :
> 
> memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> 
> offsetof(struct skb_shared_info, dataref) is small enough and we don't
> dirty a full cache line, so maybe I can keep prefetchw(data + size) ?

You do dirty a full line on sparc64, the prefetch invalidate goes an L1
cache line at a time, so 32 bytes.  And this memset() is 40 bytes.
The call to the memset symbol is still generated by gcc for this case.
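
The size comparison can be checked with a small sketch that counts how many cache lines a write touches. The 32-byte line and the 40-byte memset come from this message; the 64-byte line used for comparison is an assumption (a common x86 line size), and cache_lines_touched() is a hypothetical helper, not a kernel function:

```c
#include <stddef.h>
#include <stdint.h>

/* Number of cache lines of size line_size touched by a write of len
 * bytes starting at address addr. Requires line_size > 0 and len > 0. */
static inline size_t cache_lines_touched(uintptr_t addr, size_t len,
					 size_t line_size)
{
	uintptr_t first = addr / line_size;
	uintptr_t last = (addr + len - 1) / line_size;

	return (size_t)(last - first + 1);
}
```

A 40-byte write always spans at least two 32-byte lines, whatever its alignment, while a line-aligned 40-byte write fits inside a single 64-byte line.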

I think the cutoff for doing it inline is something like 16 bytes on
sparc64, four 64-bit loads and stores.

Unlike x86 these risc chips don't have string-op instructions, and for
sparc64 and powerpc the instructions are fixed in size (4 bytes) so
the inline cost is "(memset_size / word_size) * 4".  Whereas on x86 the
inlining cost is more-or-less fixed.
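
As a worked instance of the "(memset_size / word_size) * 4" estimate, here is a minimal sketch. The function name is hypothetical, and rounding up for sizes that are not a multiple of the word size is an added assumption, not something stated in this message:

```c
#include <stddef.h>

/* Rough code-size cost, in bytes, of open-coding a memset on a
 * fixed-width RISC: one 4-byte store instruction per word cleared. */
static inline size_t inline_memset_cost(size_t memset_size, size_t word_size)
{
	/* Round up so a partial trailing word still costs one store. */
	size_t stores = (memset_size + word_size - 1) / word_size;

	return stores * 4;
}
```

For the 40-byte memset discussed here and 8-byte words, this gives 5 stores and 20 bytes of instructions.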

> If not, in which cases can we use prefetchw() in kernel, if some arches
> don't handle it well?

It has to be looked at on a case-by-case basis.  There is no simple
answer here.


* Re: [PATCH/RFC] cxgb4: Add MAINTAINERS info
From: Divy Le Ray @ 2010-05-05 21:40 UTC (permalink / raw)
  To: Steve Wise
  Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, dm-ut6Up61K2wZBDgjK7y7TUQ
In-Reply-To: <4BE1CE39.1050707-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>

On 05/05/2010 12:59 PM, Steve Wise wrote:
> Looks good to me.

It looks right to me too.

Cheers,
Divy

>
> Steve.
>
> Roland Dreier wrote:
>> Hi guys, does this info for cxgb4/iw_cxgb4 (pretty much copied from
>> cxgb3, except with Dimitris instead of Divy) look right?  If so I'll add
>> it to my tree.
>>
>> Thanks,
>>   Roland
>> ---
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 7a9ccda..a00231b 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -1719,6 +1719,20 @@ W:    http://www.openfabrics.org
>>  S:    Supported
>>  F:    drivers/infiniband/hw/cxgb3/
>>
>> +CXGB4 ETHERNET DRIVER (CXGB4)
>> +M:    Dimitris Michailidis <dm-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
>> +L:    netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> +W:    http://www.chelsio.com
>> +S:    Supported
>> +F:    drivers/net/cxgb4/
>> +
>> +CXGB4 IWARP RNIC DRIVER (IW_CXGB4)
>> +M:    Steve Wise <swise-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
>> +L:    linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> +W:    http://www.openfabrics.org
>> +S:    Supported
>> +F:    drivers/infiniband/hw/cxgb4/
>> +
>>  CYBERPRO FB DRIVER
>>  M:    Russell King <linux-lFZ/pmaqli7XmaaqVzeoHQ@public.gmane.org>
>>  L:    linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org (moderated for non-subscribers)
>
> -- 
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: 3 packet TCP window limit?
From: dormando @ 2010-05-05 21:31 UTC (permalink / raw)
  To: Rick Jones; +Cc: Brian Bloniarz, netdev
In-Reply-To: <4BE1D3B0.9000207@hp.com>

> I don't believe linux as yet has a damn tunable for it :)

ip route initcwnd sure does it :)

> > Other OS's appear to have a larger initcwnd.
>
> Names?  Values?

OpenBSD 4.6 jumps between a ~5k fetch to a ~6k fetch

> > As do commercial load balancers.
>
> Names?  Values?

An older Big/IP appears to be between 5k and 6k as well. I remember a
sales meeting with netscaler (pre-NDA) back in 2004 or 2005 where they
claimed to have opened up slow start. There might be others but I can't
remember which side of the NDA I was informed of their TCP tuning.

Linux is consistently between 3k and 4k. Just the distinction from having
the RTT in the ~4k or the ~6k range makes our latency graphs go nutty.
I've been testing a subset of traffic at an initcwnd of 10 for the last
few hours and latency has dropped even more, though I see some bad
outliers.

> > The default of 3 seems to be tuned for 56k dialup modems. I'm a
> > little surprised that none of the pluggable TCP congestion control
> > algorithms changed this value. I went through all of them except for
> > tcp_yeah.
>
> The initcwnd comes from IETF RFCs and their "thou shalts" and "thou shalt
> nots."  As you note below, Google et al seek to alter/extend the RFCs.  That
> is an ongoing discussion in some of the ietf related mailing lists.

The RFC clearly states "around 4k", but these other OS's/products have an
extra kilobyte snuck in? Could this be on purpose via rfc
interpretation, or an off by one on the initcwnd estimator? :)
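
The "around 4k" figure can be checked against the upper bound given in RFC 3390, IW = min(4*MSS, max(2*MSS, 4380 bytes)). The following is a sketch only: the 1460-byte MSS used in the note below is an assumption (typical for Ethernet, not stated in this thread), and both function names are hypothetical:

```c
#include <stddef.h>

/* Upper bound on the initial window in bytes, per RFC 3390:
 * IW = min(4 * MSS, max(2 * MSS, 4380)). */
static inline size_t rfc3390_initial_window(size_t mss)
{
	size_t lower = 2 * mss > 4380 ? 2 * mss : 4380;
	size_t upper = 4 * mss;

	return upper < lower ? upper : lower;
}

/* Bytes allowed in the first flight for an initcwnd given in segments. */
static inline size_t first_flight_bytes(size_t initcwnd_segments, size_t mss)
{
	return initcwnd_segments * mss;
}
```

With a 1460-byte MSS, rfc3390_initial_window() gives 4380 bytes, which is exactly 3 segments, and first_flight_bytes(10, 1460) gives 14600 bytes for the initcwnd of 10 being tested above. A 5k to 6k first flight would correspond to 4 such segments (5840 bytes).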


* Re: 3 packet TCP window limit?
From: Brian Bloniarz @ 2010-05-05 20:56 UTC (permalink / raw)
  To: dormando; +Cc: netdev, Rick Jones, shemminger
In-Reply-To: <alpine.LNX.2.00.1005051239290.28957@d>

dormando wrote:
>> This sounds like TCP slow start.
>>
>> http://en.wikipedia.org/wiki/Slow-start
>>
>> As far as tunables you might want to play with the initcwnd route
>> flag (see "ip route help")
> 
> Ah, yes, initcwnd was it. I'm well aware of TCP Congestion control / slow
> start / etc. However I couldn't find the damn tunable for it :)

Documenting the flag in ip(8) might increase its visibility
a little. I don't see it documented in the iproute2 git head,
though it shows up on http://linux.die.net/man/8/ip somehow.

Stephen, do you know why that is?


* [PATCH linux-next] irq: Clear CPU mask in affinity_hint when none is provided
From: Peter P Waskiewicz Jr @ 2010-05-05 20:56 UTC (permalink / raw)
  To: tglx, davem, arjan, bhutchings; +Cc: netdev, linux-kernel

When an interrupt is disabled and torn down, the CPU mask
returned through affinity_hint right now is all CPUs.  Also, for
drivers that don't provide an affinity_hint mask, this can be
misleading.  There should be no hint at all, meaning an empty
CPU mask.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 kernel/irq/proc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index e1e7408..9ffa24d 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -45,7 +45,7 @@ static int irq_affinity_hint_proc_show(struct seq_file *m, void *v)
 	if (desc->affinity_hint)
 		cpumask_copy(mask, desc->affinity_hint);
 	else
-		cpumask_setall(mask);
+		cpumask_clear(mask);
 	raw_spin_unlock_irqrestore(&desc->lock, flags);
 
 	seq_cpumask(m, mask);
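
The behavior change in the hunk above can be sketched in user space. This is an illustration only: toy_cpumask_t and affinity_hint_to_show() are hypothetical stand-ins for struct cpumask and the proc show routine, with bit n standing for CPU n:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a CPU mask: bit n set means CPU n is in the mask.
 * The kernel uses struct cpumask and the cpumask_*() helpers instead. */
typedef uint64_t toy_cpumask_t;

/* Mask reported to user space: copy the driver's hint if one exists,
 * otherwise report an empty mask (the behavior after the patch). */
static inline toy_cpumask_t affinity_hint_to_show(const toy_cpumask_t *hint)
{
	return hint ? *hint : 0;
}
```

Before the patch, the no-hint case would have returned a mask with every bit set, which a reader cannot tell apart from a driver that really hinted all CPUs.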



* pull request: wireless-next-2.6 2010-05-05
From: John W. Linville @ 2010-05-05 20:35 UTC (permalink / raw)
  To: davem; +Cc: linux-wireless, netdev

Dave,

Another enormous batch intended for 2.6.35...  This is mostly the 'usual
suspects' -- driver updates from Intel, Atheros, Nokia, and the rt2x00
guys, mac80211 stuff from Johannes, and a variety of other bits.  One
noteworthy bit is the addition of the orinoco_usb driver.

Please let me know if there are problems!

¡Olé!

Thanks,

John

---

The following changes since commit 0a12761bcd5646691c5d16dd93df84d1b8849285:
  David S. Miller (1):
        forcedeth: Kill NAPI config options.

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6 for-davem

Abhijeet Kolekar (2):
      iwlwifi: reset pci retry timeout
      iwl3945: add ucode statistics

Benoit Papillault (1):
      ath9k: Added get_survey callback in order to get channel noise

Dan Carpenter (1):
      iwl: cleanup: remove unneeded error handling

Dan Williams (1):
      libertas: fix 8686 firmware loading regression

Daniel Halperin (1):
      iwlwifi: set AMPDU status variables correctly

David Kilroy (6):
      orinoco: add hermes_ops
      orinoco: allow driver to specify netdev_ops
      orinoco: encapsulate driver locking
      orinoco: add orinoco_usb driver
      orinoco_usb: avoid in_atomic
      orinoco_usb: implement fw download

Felix Fietkau (13):
      mac80211: fix handling of 4-address-mode in ieee80211_change_iface
      ath9k_hw: update initvals for AR9003
      ath9k_hw: fix pll clock setting for 5ghz on AR9003
      ath9k_hw: fix typo in the AR9003 EEPROM data structure definition
      ath9k_hw: update EEPROM data structure for AR9280
      ath9k_hw: fix fast clock handling for 5GHz channels
      ath9k: wake queue after processing edma rx frames
      ath9k_hw: use the configured power limit for AR9003
      ath9k_hw: Fix typos in tx rate power level parsing for AR9003
      ath9k_hw: Fix endian bug in an AR9003 EEPROM field
      ath9k_hw: fix noisefloor timeout handling on AR9003
      cfg80211: add ap isolation support
      mac80211: implement ap isolation support

Gertjan van Wingerde (4):
      rt2x00: Remove rt2x00pci.h include from rt2800lib.
      rt2x00: Enable RT30xx by default.
      rt2x00: Fix HT40+/HT40- setting in rt2800.
      rt2x00: Register frame length in TX entry descriptor instead of L2PAD.

Hans de Goede (1):
      p54pci: fix regression from prevent stuck rx-ring on slow system

Helmut Schaa (6):
      rt2x00: rt2800lib: disable HT40 for now as it causes reception problems
      rt2x00: rt2800: use tx_power2 in rt2800_config_channel_rf3xxx
      rt2x00: fix typo in rt2800.h
      rt2x00: rt2800lib: Fix rx path on SoC devices
      rt2x00: rt2800lib: Remove redundant check for RT2872
      rt2x00: rt2800lib: update rfcsr & bbp init code for SoC devices

Johannes Berg (9):
      mac80211: give virtual interface to hw_scan
      mac80211: notify driver about IBSS status
      mac80211: tell driver about IBSS merge
      iwlwifi: work around passive scan issue
      mac80211: fix ieee80211_find_sta[_by_hw]
      mac80211: allow controlling aggregation manually
      mac80211: improve IBSS scanning
      mac80211_hwsim: fix double-scan detection
      mac80211: use fixed channel in ibss join when appropriate

John W. Linville (25):
      ssb: do not read SPROM if it does not exist
      MAINTAINERS: add entry for include/linux/iw_handler.h
      mwl8k: remove usage of deprecated noise value
      ar9170: remove usage of deprecated noise value
      ath5k: remove usage of deprecated noise value
      ath9k: remove usage of deprecated noise value
      b43: remove usage of deprecated noise value
      b43legacy: remove usage of deprecated noise value
      libertas_tf: remove usage of deprecated noise value
      p54: remove usage of deprecated noise value
      rt2x00: remove usage of deprecated noise value
      wl1251: remove usage of deprecated noise value
      rtl8180: use cached queue mapping for skb in rtl8180_tx
      rtl8180: fix tx status reporting
      libertas_tf: avoid warning about pr_fmt redefinition
      mac80211: remove deprecated noise field from ieee80211_rx_status
      iwmc3200wifi: cleanup unneeded debugfs error handling
      rt2x00: remove now unused noise field from struct rxdone_entry_desc
      b43: Added get_survey callback in order to get channel noise
      b43legacy: Added get_survey callback in order to get channel noise
      Merge branch 'wireless-next-2.6' of git://git.kernel.org/.../iwlwifi/iwlwifi-2.6
      iwmc3200wifi: fix busted iwm_debugfs_init definition
      rtl8180: use SET_IEEE80211_PERM_ADDR
      rtl8187: use SET_IEEE80211_PERM_ADDR
      Merge branch 'master' of git://git.kernel.org/.../linville/wireless-next-2.6 into for-davem

Juuso Oikarinen (5):
      mac80211: Fix sta->last_tx_rate setting with no-op rate control devices
      mac80211: Determine dynamic PS timeout based on ps-qos network latency
      cfg80211: Remove default dynamic PS timeout value
      wl1271: Improve command polling
      wl1271: Rewrite hardware keep-alive handling

Larry Finger (1):
      ssb: Make bus registration failure not be silent

Luciano Coelho (1):
      wl1271: fix a bunch of sparse warnings

Luis R. Rodriguez (2):
      ath9k_hw: disable TX IQ calibration for AR9003
      ath9k_hw: Fix TX interrupt mitigation settings

Rafał Miłecki (3):
      ssb: Look for SPROM at different offset on higher rev CC
      ssb: Use relative offsets for SPROM
      ssb: Fix order of definitions and some text space indents

Reinette Chatre (2):
      Merge branch 'wireless-2.6' into wireless-next-2.6
      iwlwifi: recalculate average tpt if not current

Saravanan Dhanabal (1):
      wl1271: Configure QOS nullfunc template for U-APSD

Shanyu Zhao (2):
      mac80211: fix rts threshold check
      iwlwifi: set correct AC to swq_id for aggregation

Stanislaw Gruszka (2):
      mac80211: do not wip out old supported rates
      mac80211: fix supported rates IE if AP doesn't give us it's rates

Steve deRosier (1):
      libertastf: add configurable debug messages

Sujith (14):
      ath9k_htc: Simplify TX URB management
      ath9k_htc: Handle device unplug properly
      ath9k_htc: Use multiple register writes
      ath9k_htc: Cancel running timers before disabling HW
      ath9k_hw: Remove pointless ANI deinit
      ath9k_htc: Pass correct private pointer
      ath9k_htc: Use USB reboot
      ath9k_htc: Process command data properly
      ath9k_htc: Increase WMI timeout value
      ath9k_htc: Fix WMI command race
      ath9k_htc: Really fix device hotunplug
      ath9k_htc: Remove unnecessary powersave restore
      ath9k_htc: Validate TX Endpoint ID
      ath9k_htc: Simplify RX IRQ handler

Vasanthakumar Thiagarajan (1):
      ath9k_hw: Fix usec to hw clock conversion in 5Ghz for ar9003

Vivek Natarajan (2):
      ath9k_htc: Handle CONF_IDLE during unassociated state to save power.
      ath9k: Avoid corrupt frames being forwarded to mac80211.

Wey-Yi Guy (4):
      iwlwifi: remove get_stats callback function
      iwlwifi: remove outdated comments
      iwlwifi: set hw parameters based on device type
      iwlwifi: greenfield support only true for 11n devices

Xose Vazquez Perez (1):
      wireless: rt2x00: rt2800usb: be in sync with latest windows drivers.

 MAINTAINERS                                      |    1 +
 drivers/net/wireless/Kconfig                     |    6 +
 drivers/net/wireless/at76c50x-usb.c              |    1 +
 drivers/net/wireless/ath/ar9170/main.c           |    4 +-
 drivers/net/wireless/ath/ath5k/base.c            |    6 +-
 drivers/net/wireless/ath/ath9k/ani.c             |    9 -
 drivers/net/wireless/ath/ath9k/ani.h             |    1 -
 drivers/net/wireless/ath/ath9k/ar5008_initvals.h |    2 +-
 drivers/net/wireless/ath/ath9k/ar5008_phy.c      |    5 +-
 drivers/net/wireless/ath/ath9k/ar9002_hw.c       |    5 +
 drivers/net/wireless/ath/ath9k/ar9002_initvals.h |    8 +-
 drivers/net/wireless/ath/ath9k/ar9002_phy.c      |   16 +-
 drivers/net/wireless/ath/ath9k/ar9003_calib.c    |    3 +-
 drivers/net/wireless/ath/ath9k/ar9003_eeprom.c   |    8 +-
 drivers/net/wireless/ath/ath9k/ar9003_eeprom.h   |    2 +-
 drivers/net/wireless/ath/ath9k/ar9003_initvals.h |  265 ++--
 drivers/net/wireless/ath/ath9k/ar9003_mac.c      |    3 +
 drivers/net/wireless/ath/ath9k/ar9003_phy.c      |   16 +-
 drivers/net/wireless/ath/ath9k/common.c          |    1 -
 drivers/net/wireless/ath/ath9k/eeprom.h          |    3 +-
 drivers/net/wireless/ath/ath9k/eeprom_def.c      |    2 +
 drivers/net/wireless/ath/ath9k/hif_usb.c         |  149 +-
 drivers/net/wireless/ath/ath9k/hif_usb.h         |    1 -
 drivers/net/wireless/ath/ath9k/htc.h             |    2 +
 drivers/net/wireless/ath/ath9k/htc_drv_init.c    |    8 +
 drivers/net/wireless/ath/ath9k/htc_drv_main.c    |   84 +-
 drivers/net/wireless/ath/ath9k/htc_drv_txrx.c    |   59 +-
 drivers/net/wireless/ath/ath9k/htc_hst.c         |    5 +-
 drivers/net/wireless/ath/ath9k/hw.c              |   27 +-
 drivers/net/wireless/ath/ath9k/hw.h              |    8 +-
 drivers/net/wireless/ath/ath9k/mac.c             |   10 +-
 drivers/net/wireless/ath/ath9k/main.c            |   20 +
 drivers/net/wireless/ath/ath9k/wmi.c             |   11 +-
 drivers/net/wireless/ath/ath9k/wmi.h             |    4 +-
 drivers/net/wireless/ath/ath9k/xmit.c            |    2 +
 drivers/net/wireless/b43/main.c                  |   21 +-
 drivers/net/wireless/b43/xmit.c                  |    1 -
 drivers/net/wireless/b43legacy/main.c            |   21 +-
 drivers/net/wireless/b43legacy/xmit.c            |    1 -
 drivers/net/wireless/iwlwifi/Makefile            |    1 +
 drivers/net/wireless/iwlwifi/iwl-1000.c          |    1 -
 drivers/net/wireless/iwlwifi/iwl-3945-debugfs.c  |  500 ++++++
 drivers/net/wireless/iwlwifi/iwl-3945-debugfs.h  |   60 +
 drivers/net/wireless/iwlwifi/iwl-3945.c          |   72 +-
 drivers/net/wireless/iwlwifi/iwl-3945.h          |    2 +
 drivers/net/wireless/iwlwifi/iwl-5000.c          |   73 +-
 drivers/net/wireless/iwlwifi/iwl-6000.c          |   73 +-
 drivers/net/wireless/iwlwifi/iwl-agn-lib.c       |   23 +-
 drivers/net/wireless/iwlwifi/iwl-agn-rs.c        |   20 +-
 drivers/net/wireless/iwlwifi/iwl-agn-tx.c        |   16 +-
 drivers/net/wireless/iwlwifi/iwl-agn.c           |   14 -
 drivers/net/wireless/iwlwifi/iwl-commands.h      |    4 +-
 drivers/net/wireless/iwlwifi/iwl-core.c          |   10 +-
 drivers/net/wireless/iwlwifi/iwl-core.h          |    4 +-
 drivers/net/wireless/iwlwifi/iwl-debugfs.c       |   18 +-
 drivers/net/wireless/iwlwifi/iwl-dev.h           |    5 +
 drivers/net/wireless/iwlwifi/iwl-scan.c          |    3 +-
 drivers/net/wireless/iwlwifi/iwl3945-base.c      |    6 +-
 drivers/net/wireless/iwmc3200wifi/bus.h          |    2 +-
 drivers/net/wireless/iwmc3200wifi/debug.h        |    7 +-
 drivers/net/wireless/iwmc3200wifi/debugfs.c      |  110 +--
 drivers/net/wireless/iwmc3200wifi/sdio.c         |   17 +-
 drivers/net/wireless/libertas/if_sdio.c          |    4 +-
 drivers/net/wireless/libertas_tf/cmd.c           |  203 ++-
 drivers/net/wireless/libertas_tf/deb_defs.h      |  104 ++
 drivers/net/wireless/libertas_tf/if_usb.c        |  251 +++-
 drivers/net/wireless/libertas_tf/libertas_tf.h   |    2 +
 drivers/net/wireless/libertas_tf/main.c          |   91 +-
 drivers/net/wireless/mac80211_hwsim.c            |    6 +-
 drivers/net/wireless/mwl8k.c                     |    6 +-
 drivers/net/wireless/orinoco/Kconfig             |    7 +
 drivers/net/wireless/orinoco/Makefile            |    1 +
 drivers/net/wireless/orinoco/airport.c           |    8 +-
 drivers/net/wireless/orinoco/cfg.c               |    2 +-
 drivers/net/wireless/orinoco/fw.c                |   10 +-
 drivers/net/wireless/orinoco/hermes.c            |  286 ++++-
 drivers/net/wireless/orinoco/hermes.h            |   62 +-
 drivers/net/wireless/orinoco/hermes_dld.c        |  243 +---
 drivers/net/wireless/orinoco/hw.c                |   63 +-
 drivers/net/wireless/orinoco/main.c              |  137 +-
 drivers/net/wireless/orinoco/orinoco.h           |   30 +-
 drivers/net/wireless/orinoco/orinoco_cs.c        |    6 +-
 drivers/net/wireless/orinoco/orinoco_nortel.c    |    2 +-
 drivers/net/wireless/orinoco/orinoco_pci.c       |    2 +-
 drivers/net/wireless/orinoco/orinoco_plx.c       |    2 +-
 drivers/net/wireless/orinoco/orinoco_tmd.c       |    2 +-
 drivers/net/wireless/orinoco/orinoco_usb.c       | 1800 ++++++++++++++++++++++
 drivers/net/wireless/orinoco/spectrum_cs.c       |    7 +-
 drivers/net/wireless/orinoco/wext.c              |    6 +-
 drivers/net/wireless/p54/main.c                  |    3 +-
 drivers/net/wireless/p54/p54pci.c                |   16 +-
 drivers/net/wireless/p54/txrx.c                  |    1 -
 drivers/net/wireless/rt2x00/Kconfig              |    4 +-
 drivers/net/wireless/rt2x00/rt2400pci.c          |    4 +-
 drivers/net/wireless/rt2x00/rt2500pci.c          |    2 +-
 drivers/net/wireless/rt2x00/rt2500usb.c          |    2 +-
 drivers/net/wireless/rt2x00/rt2800.h             |   11 +-
 drivers/net/wireless/rt2x00/rt2800lib.c          |   91 +-
 drivers/net/wireless/rt2x00/rt2800pci.c          |    6 +-
 drivers/net/wireless/rt2x00/rt2800usb.c          |   13 +-
 drivers/net/wireless/rt2x00/rt2x00dev.c          |    1 -
 drivers/net/wireless/rt2x00/rt2x00queue.c        |    6 +-
 drivers/net/wireless/rt2x00/rt2x00queue.h        |    6 +-
 drivers/net/wireless/rt2x00/rt61pci.c            |    5 +-
 drivers/net/wireless/rt2x00/rt73usb.c            |    2 +-
 drivers/net/wireless/rtl818x/rtl8180_dev.c       |   13 +-
 drivers/net/wireless/rtl818x/rtl8187_dev.c       |   10 +-
 drivers/net/wireless/wl12xx/wl1251_main.c        |    2 +-
 drivers/net/wireless/wl12xx/wl1251_rx.c          |    6 -
 drivers/net/wireless/wl12xx/wl1271_acx.c         |    2 +-
 drivers/net/wireless/wl12xx/wl1271_boot.c        |    8 +-
 drivers/net/wireless/wl12xx/wl1271_cmd.c         |   14 +-
 drivers/net/wireless/wl12xx/wl1271_conf.h        |    2 +-
 drivers/net/wireless/wl12xx/wl1271_main.c        |   97 +-
 drivers/ssb/driver_chipcommon.c                  |    2 +
 drivers/ssb/main.c                               |    3 +
 drivers/ssb/pci.c                                |   14 +-
 drivers/ssb/sprom.c                              |   14 +
 include/linux/nl80211.h                          |    5 +
 include/linux/ssb/ssb.h                          |    4 +
 include/linux/ssb/ssb_driver_chipcommon.h        |   15 +
 include/linux/ssb/ssb_regs.h                     |  239 ++--
 include/net/cfg80211.h                           |    4 +
 include/net/mac80211.h                           |   21 +-
 net/mac80211/cfg.c                               |   18 +-
 net/mac80211/debugfs_sta.c                       |   65 +-
 net/mac80211/driver-ops.h                        |    5 +-
 net/mac80211/driver-trace.h                      |    9 +-
 net/mac80211/ibss.c                              |   27 +-
 net/mac80211/ieee80211_i.h                       |    3 +-
 net/mac80211/main.c                              |   19 +-
 net/mac80211/mlme.c                              |   21 +
 net/mac80211/rx.c                                |    2 -
 net/mac80211/scan.c                              |   53 +-
 net/mac80211/sta_info.c                          |   17 +-
 net/mac80211/status.c                            |    7 +
 net/mac80211/tx.c                                |    5 +-
 net/mac80211/work.c                              |   28 +-
 net/wireless/core.c                              |    3 +-
 net/wireless/nl80211.c                           |    4 +
 140 files changed, 4751 insertions(+), 1368 deletions(-)
 create mode 100644 drivers/net/wireless/iwlwifi/iwl-3945-debugfs.c
 create mode 100644 drivers/net/wireless/iwlwifi/iwl-3945-debugfs.h
 create mode 100644 drivers/net/wireless/libertas_tf/deb_defs.h
 create mode 100644 drivers/net/wireless/orinoco/orinoco_usb.c

Omnibus patch is available here:

	http://www.kernel.org/pub/linux/kernel/people/linville/wireless-next-2.6-2010-04-23.patch.bz2

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.


* Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3
From: Dmitry Torokhov @ 2010-05-05 20:36 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: virtualization@lists.linux-foundation.org, Pankaj Thakkar,
	Christoph Hellwig, pv-drivers@vmware.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <201005052209.48465.arnd@arndb.de>

On Wednesday 05 May 2010 01:09:48 pm Arnd Bergmann wrote:
> > > If you have any interest in developing this further, do:
> > > 
> > >  (1) move the limited VF drivers directly into the kernel tree,
> > >      talk to them through a normal ops vector
> > 
> > [PT] This assumes that all the VF drivers would always be available.
> > Also we have to support windows and our current design supports it
> > nicely in an OS agnostic manner.
> 
> Your approach assumes that the plugin is always available, which has
> exactly the same implications.

Since plugin[s] are carried by the host they are indeed always
available.

-- 
Dmitry


* Re: 3 packet TCP window limit?
From: Rick Jones @ 2010-05-05 20:23 UTC (permalink / raw)
  To: dormando; +Cc: Brian Bloniarz, netdev
In-Reply-To: <alpine.LNX.2.00.1005051239290.28957@d>

dormando wrote:
>>This sounds like TCP slow start.
>>
>>http://en.wikipedia.org/wiki/Slow-start
>>
>>As far as tunables you might want to play with the initcwnd route
>>flag (see "ip route help")
> 
> Ah, yes, initcwnd was it. I'm well aware of TCP Congestion control / slow
> start / etc. However I couldn't find the damn tunable for it :)

I don't believe linux as yet has a damn tunable for it :)

> ssthresh/tso/etc didn't seem to unwedge it. 

If they did, it would be a bug.  In fact there *was* a bug "way back when" where 
TSO being enabled caused the stack to ignore initcwnd, but that was fixed circa 
2.6.14.  Until it was fixed (it was difficult to notice unless one was speaking 
to a non-Linux receiver, since Linux receivers autotune the receive window) it 
did some very nice things for SPECweb benchmark results :)

> Felt like describing it in the most generic way possible would help :)
> 
> Other OS's appear to have a larger initcwnd.

Names?  Values?

> As do commercial load balancers.

Names?  Values?

> The default of 3 seems to be tuned for 56k dialup modems. I'm a
> little surprised that none of the pluggable TCP congestion control
> algorithms changed this value. I went through all of them except for
> tcp_yeah.

The initcwnd comes from IETF RFCs and their "thou shalts" and "thou shalt nots."
As you note below, Google et al. seek to alter/extend the RFCs.  That is an
ongoing discussion on some of the IETF-related mailing lists.

rick jones

> Anyway, thanks and sorry for the nearly off-topic post here. I see some
> Google papers on bumping initcwnd to 10... but I guess that's not Linux's
> deal yet.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3
From: Arnd Bergmann @ 2010-05-05 20:09 UTC (permalink / raw)
  To: virtualization
  Cc: Pankaj Thakkar, Christoph Hellwig, Dmitry Torokhov,
	pv-drivers@vmware.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <F1354E79A137A24CBA60059AA65CB1B802A235C535@EXCH-MBX-2.vmware.com>

On Wednesday 05 May 2010 19:47:10 Pankaj Thakkar wrote:
> > 
> > Forget about the licensing.  Loading binary blobs written to a shim
> > layer is a complete pain in the ass and totally unsupportable, and
> > also uninteresting because of the overhead.
> 
> [PT] Why do you think it is unsupportable? How different is it from any module
> written against a well maintained interface? What overhead are you talking about?

We have the right number of module loaders in the kernel: one. If you
add another one, you're doubling the amount of code that anyone
working on that code needs to know about.
 
> > If you have any interest in developing this further, do:
> > 
> >  (1) move the limited VF drivers directly into the kernel tree,
> >      talk to them through a normal ops vector
> [PT] This assumes that all the VF drivers would always be available.
> Also we have to support Windows and our current design supports it
> nicely in an OS-agnostic manner.

Your approach assumes that the plugin is always available, which has
exactly the same implications.

> >  (2) get rid of the whole shim crap and instead integrate the limited
> >      VF driver with the full VF driver we already have, instead of
> >      duplicating the code
> [PT] Having a full VF driver adds a lot of dependency on the guest VM
> and this is what NPA tries to avoid.

If you have the limited driver for some hardware that does not have
the real thing, we could still ship just that. I would however guess
that most vendors are interested in not just running in vmware but
also other hypervisors that still require the full driver, so that
case would be rare, especially in the long run.

	Arnd

^ permalink raw reply

* Re: 3 packet TCP window limit?
From: dormando @ 2010-05-05 20:01 UTC (permalink / raw)
  To: Brian Bloniarz; +Cc: netdev
In-Reply-To: <4BE171EC.20904@athenacr.com>

> This sounds like TCP slow start.
>
> http://en.wikipedia.org/wiki/Slow-start
>
> As far as tunables you might want to play with the initcwnd route
> flag (see "ip route help")

Ah, yes, initcwnd was it. I'm well aware of TCP Congestion control / slow
start / etc. However I couldn't find the damn tunable for it :)
ssthresh/tso/etc didn't seem to unwedge it. Felt like describing it in the
most generic way possible would help :)

Other OS's appear to have a larger initcwnd. As do commercial load
balancers. The default of 3 seems to be tuned for 56k dialup modems. I'm a
little surprised that none of the pluggable TCP congestion control
algorithms changed this value. I went through all of them except for
tcp_yeah.

Anyway, thanks and sorry for the nearly off-topic post here. I see some
Google papers on bumping initcwnd to 10... but I guess that's not Linux's
deal yet.
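
To put rough numbers on the 3 versus 10 comparison, here is a minimal
sketch of idealized slow start. It assumes the window doubles every round
trip, with no loss, no delayed ACKs and no receive-window limit; the
function name slow_start_rtts is made up for this illustration and is not
a kernel interface.

```c
#include <assert.h>

/*
 * Toy model of slow start: the congestion window starts at initcwnd
 * segments and doubles every round trip (assumed: no loss, no delayed
 * ACKs, no receive-window limit). Returns the number of round trips
 * needed to deliver total_segs segments. Meant for small inputs only;
 * the counters are not protected against overflow.
 */
static unsigned int slow_start_rtts(unsigned int initcwnd,
				    unsigned int total_segs)
{
	unsigned int cwnd = initcwnd;
	unsigned int sent = 0;
	unsigned int rtts = 0;

	assert(initcwnd > 0);

	while (sent < total_segs) {
		sent += cwnd;	/* one window's worth per round trip */
		cwnd *= 2;	/* exponential growth during slow start */
		rtts++;
	}
	return rtts;
}
```

Under these assumptions a 10-segment response needs three round trips
with an initial window of 3 (3, then 6, then the last segment) and a
single round trip with an initial window of 10.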

^ permalink raw reply

* Re: [PATCH/RFC] cxgb4: Add MAINTAINERS info
From: Steve Wise @ 2010-05-05 19:59 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	dm-ut6Up61K2wZBDgjK7y7TUQ
In-Reply-To: <adawrvijhpq.fsf_-_-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>

Looks good to me.

Steve.

Roland Dreier wrote:
> Hi guys, does this info for cxgb4/iw_cxgb4 (pretty much copied from
> cxgb3, except with Dimitris instead of Divy) look right?  If so I'll add
> it to my tree.
>
> Thanks,
>   Roland
> ---
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 7a9ccda..a00231b 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1719,6 +1719,20 @@ W:	http://www.openfabrics.org
>  S:	Supported
>  F:	drivers/infiniband/hw/cxgb3/
>  
> +CXGB4 ETHERNET DRIVER (CXGB4)
> +M:	Dimitris Michailidis <dm-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
> +L:	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> +W:	http://www.chelsio.com
> +S:	Supported
> +F:	drivers/net/cxgb4/
> +
> +CXGB4 IWARP RNIC DRIVER (IW_CXGB4)
> +M:	Steve Wise <swise-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
> +L:	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> +W:	http://www.openfabrics.org
> +S:	Supported
> +F:	drivers/infiniband/hw/cxgb4/
> +
>  CYBERPRO FB DRIVER
>  M:	Russell King <linux-lFZ/pmaqli7XmaaqVzeoHQ@public.gmane.org>
>  L:	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org (moderated for non-subscribers)
>   

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 2/6] netns: Teach network device kobjects which namespace they are in.
From: Eric W. Biederman @ 2010-05-05 19:56 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Greg Kroah-Hartman, Kay Sievers, linux-kernel, Tejun Heo,
	Cornelia Huck, Eric Dumazet, Benjamin LaHaise, netdev,
	David Miller
In-Reply-To: <20100505151746.GA15654@us.ibm.com>

"Serge E. Hallyn" <serue@us.ibm.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> diff --git a/net/Kconfig b/net/Kconfig
>> index 041c35e..265e33b 100644
>> --- a/net/Kconfig
>> +++ b/net/Kconfig
>> @@ -45,6 +45,14 @@ config COMPAT_NETLINK_MESSAGES
>> 
>>  menu "Networking options"
>> 
>> +config NET_NS
>> +	bool "Network namespace support"
>> +	default n
>> +	depends on EXPERIMENTAL && NAMESPACES
>> +	help
>> +	  Allow user space to create what appear to be multiple instances
>> +	  of the network stack.
>> +
>
> Hi Eric,
>
> I'm confused - NET_NS is defined in init/Kconfig right now.  Is the tree
> you're working from very different from mine, or is this the unfortunate
> result of the patches sitting so long?

Old patches, nothing that complains when you make a mistake like this,
and apparently I have a blind spot in my personal code review.

At one point it was not possible to enable the network namespace until
the sysfs stuff was enabled, but things have been going on long enough
that we worked around that restriction.

>>  source "net/packet/Kconfig"
>>  source "net/unix/Kconfig"
>>  source "net/xfrm/Kconfig"
>> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
>> index 099c753..1b98e36 100644
>> --- a/net/core/net-sysfs.c
>> +++ b/net/core/net-sysfs.c
>> @@ -13,7 +13,9 @@
>>  #include <linux/kernel.h>
>>  #include <linux/netdevice.h>
>>  #include <linux/if_arp.h>
>> +#include <linux/nsproxy.h>
>>  #include <net/sock.h>
>> +#include <net/net_namespace.h>
>>  #include <linux/rtnetlink.h>
>>  #include <linux/wireless.h>
>>  #include <net/wext.h>
>> @@ -466,6 +468,37 @@ static struct attribute_group wireless_group = {
>>  };
>>  #endif
>> 
>> +static const void *net_current_ns(void)
>> +{
>> +	return current->nsproxy->net_ns;
>> +}
>> +
>> +static const void *net_initial_ns(void)
>> +{
>> +	return &init_net;
>> +}
>> +
>> +static const void *net_netlink_ns(struct sock *sk)
>> +{
>> +	return sock_net(sk);
>> +}
>> +
>> +static struct kobj_ns_type_operations net_ns_type_operations = {
>> +	.type = KOBJ_NS_TYPE_NET,
>> +	.current_ns = net_current_ns,
>> +	.netlink_ns = net_netlink_ns,
>> +	.initial_ns = net_initial_ns,
>> +};
>> +
>> +static void net_kobj_ns_exit(struct net *net)
>> +{
>> +	kobj_ns_exit(KOBJ_NS_TYPE_NET, net);
>> +}
>> +
>> +static struct pernet_operations sysfs_net_ops = {
>> +	.exit = net_kobj_ns_exit,
>> +};
>> +
>>  #endif /* CONFIG_SYSFS */
>
> ...
>
>>  int netdev_kobject_init(void)
>>  {
>> +	kobj_ns_type_register(&net_ns_type_operations);
>> +#ifdef CONFIG_SYSFS
>> +	register_pernet_subsys(&sysfs_net_ops);
>> +#endif
>>  	return class_register(&net_class);
>
> I think the kobj_ns_type_register() needs to be under
> ifdef CONFIG_SYSFS as well, bc net_ns_type_operations is defined
> under ifdef CONFIG_SYSFS.

kobj_ns_type_register should not be under CONFIG_SYSFS.  Which means
that kobj_ns_type_operations must not be under CONFIG_SYSFS
either.  Thank you for spotting that bug.
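
The shape of that fix can be sketched in a self-contained way. Everything
below is a hypothetical userspace stand-in (the toy_ names and the
TOY_CONFIG_SYSFS switch are invented for illustration), not the kernel
code: the ops structure is defined outside the conditional block so that
the unconditional registration always has something to point at, while
the per-net hook stays conditional.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical switch standing in for CONFIG_SYSFS. */
#define TOY_CONFIG_SYSFS 1

/* Hypothetical stand-in for the namespace type operations. */
struct toy_ns_type_ops {
	int type;
};

static const struct toy_ns_type_ops *toy_registered_ops;
static int toy_pernet_registered;

/* Defined outside the #ifdef, so the unconditional caller always builds. */
static const struct toy_ns_type_ops toy_net_ns_type_ops = { .type = 1 };

static void toy_ns_type_register(const struct toy_ns_type_ops *ops)
{
	assert(ops != NULL);
	toy_registered_ops = ops;
}

#ifdef TOY_CONFIG_SYSFS
/* Only exists when the sysfs-like feature is configured in. */
static void toy_register_pernet_subsys(void)
{
	toy_pernet_registered = 1;
}
#endif

static int toy_netdev_kobject_init(void)
{
	/* Unconditional: must not reference anything defined under #ifdef. */
	toy_ns_type_register(&toy_net_ns_type_ops);
#ifdef TOY_CONFIG_SYSFS
	toy_register_pernet_subsys();
#endif
	return 0;
}
```

Removing the TOY_CONFIG_SYSFS define still compiles, which is the
property the original placement lacked.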

Grr.

Eric

^ permalink raw reply

* Re: RFC: Network Plugin Architecture (NPA) for vmxnet3
From: Pankaj Thakkar @ 2010-05-05 19:44 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	virtualization@lists.linux-foundation.org, pv-drivers@vmware.com,
	Shreyas Bhatewara
In-Reply-To: <4BE1B217.80600@redhat.com>

On Wed, May 05, 2010 at 10:59:51AM -0700, Avi Kivity wrote:
> Date: Wed, 5 May 2010 10:59:51 -0700
> From: Avi Kivity <avi@redhat.com>
> To: Pankaj Thakkar <pthakkar@vmware.com>
> CC: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
> 	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
> 	"virtualization@lists.linux-foundation.org"
>  <virtualization@lists.linux-foundation.org>,
> 	"pv-drivers@vmware.com" <pv-drivers@vmware.com>,
> 	Shreyas Bhatewara <sbhatewara@vmware.com>
> Subject: Re: RFC: Network Plugin Architecture (NPA) for vmxnet3
> 
> On 05/05/2010 02:02 AM, Pankaj Thakkar wrote:
> > 2. Hypervisor control: All control operations from the guest such as programming
> > MAC address go through the hypervisor layer and hence can be subjected to
> > hypervisor policies. The PF driver can be further used to put policy decisions
> > like which VLAN the guest should be on.
> >    
> 
> Is this enforced?  Since you pass the hardware through, you can't rely 
> on the guest actually doing this, yes?

We don't pass the whole VF to the guest. Only the BAR which is responsible for
TX/RX/intr is mapped into guest space. The interface between the shell and
plugin only allows operations related to TX and RX, such as sending a packet
to the VF, allocating RX buffers, and indicating a packet up to the shell. All
control operations are handled by the shell, and the shell does what the existing
vmxnet3 driver does (touch a specific register and let the device emulation do
the work). When a VF is mapped to the guest the hypervisor knows this and
programs the h/w accordingly on behalf of the shell. So, for example, if the VM
does a MAC address change inside the guest, the shell would write to the
VMXNET3_REG_MAC{L|H} registers, which would trigger the device emulation to read
the new MAC address and update its internal virtual port information for the
virtual switch; if the VF is mapped, it would also program the embedded
switch RX filters to reflect the new MAC address.
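
The split described above (data path in a swappable plugin, control path
kept in the shell) can be sketched roughly as follows. This is only an
illustration under assumptions: every toy_ name is hypothetical, the real
Shell API is not shown in this thread, and the two plugins here merely
count packets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical data-path-only interface that a plugin implements. */
struct toy_plugin_ops {
	int (*tx)(void *priv, const void *pkt, size_t len);
};

/* Hypothetical shell: owns control state, delegates only TX. */
struct toy_shell {
	const struct toy_plugin_ops *ops;
	void *priv;
	unsigned char mac[6];
};

/* Private state of the example plugins: a packet counter. */
struct toy_counter {
	unsigned long tx_packets;
};

static int toy_count_tx(void *priv, const void *pkt, size_t len)
{
	struct toy_counter *c = priv;

	(void)pkt;
	(void)len;
	c->tx_packets++;
	return 0;
}

/* Two interchangeable data paths, e.g. emulated device and VF. */
static const struct toy_plugin_ops toy_emulated_ops = { .tx = toy_count_tx };
static const struct toy_plugin_ops toy_vf_ops = { .tx = toy_count_tx };

/* Swap the data path; callers of toy_shell_xmit() are unaffected. */
static void toy_shell_load_plugin(struct toy_shell *sh,
				  const struct toy_plugin_ops *ops, void *priv)
{
	sh->ops = ops;
	sh->priv = priv;
}

/* Data path: forwarded to whichever plugin is currently loaded. */
static int toy_shell_xmit(struct toy_shell *sh, const void *pkt, size_t len)
{
	assert(sh->ops != NULL && sh->ops->tx != NULL);
	return sh->ops->tx(sh->priv, pkt, len);
}

/* Control path: handled by the shell alone, never reaches the plugin. */
static void toy_shell_set_mac(struct toy_shell *sh, const unsigned char *mac)
{
	memcpy(sh->mac, mac, sizeof(sh->mac));
}
```

In this sketch a MAC change touches only shell state, and switching from
toy_emulated_ops to toy_vf_ops changes where packets go without changing
the caller.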

> 
> > The plugin image is provided by the IHVs along with the PF driver and is
> > packaged in the hypervisor. The plugin image is OS agnostic and can be loaded
> > either into a Linux VM or a Windows VM. The plugin is written against the Shell
> > API interface which the shell is responsible for implementing. The API
> > interface allows the plugin to do TX and RX only by programming the hardware
> > rings (along with things like buffer allocation and basic initialization). The
> > virtual machine comes up in paravirtualized/emulated mode when it is booted.
> > The hypervisor allocates the VF and other resources and notifies the shell of
> > the availability of the VF. The hypervisor injects the plugin into memory
> > location specified by the shell. The shell initializes the plugin by calling
> > into a known entry point and the plugin initializes the data path. The control
> > path is already initialized by the PF driver when the VF is allocated. At this
> > point the shell switches to using the loaded plugin to do all further TX and RX
> > operations. The guest networking stack does not participate in these operations
> > and continues to function normally. All the control operations continue being
> > trapped by the hypervisor and are directed to the PF driver as needed. For
> > example, if the MAC address changes the hypervisor updates its internal state
> > and changes the state of the embedded switch as well through the PF control
> > API.
> >    
> 
> This is essentially a miniature network stack with its own mini
> bonding layer, mini hotplug, and mini API, except s/API/ABI/.  Is this a 
> correct view?

To some extent yes, but there is no complicated bonding, nor is there anything
like PCI hotplug. The shell interface is small and the OS always interacts
with the shell as the main driver. The plugin changes based on the underlying
VF, and the plugin is also really small. Our vmxnet3 s/w plugin is
about 1300 lines including whitespace and comments, and the Intel Kawela plugin is
about 1100 lines including whitespace and comments. The design principle is to put
more of the complexity related to initialization/control into the PF driver
rather than into the plugin.

> 
> If so, the Linuxy approach would be to use the ordinary drivers and the 
> Linux networking API, and hide the bond setup using namespaces.  The 
> bond driver, or perhaps a new, similar, driver can be enhanced to 
> propagate ethtool commands to its (hidden) components, and to have a 
> control channel with the hypervisor.
> 
> This would make the approach hypervisor agnostic, you're just pairing 
> two devices and presenting them to the rest of the stack as a single device.
> 
> > We have reworked our existing Linux vmxnet3 driver to accommodate NPA by
> > splitting the driver into two parts: Shell and Plugin. The new split driver is
> >    
> 
> So the Shell would be the reworked or new bond driver, and Plugins would 
> be ordinary Linux network drivers.

In NPA we do not rely on the guest OS to provide any of these services like
bonding or PCI hotplug. We don't rely on the guest OS to unmap a VF and switch
a VM out of passthrough. In a bonding approach that becomes an issue: you can't
just yank a device from underneath, you have to wait for the OS to process the
request and switch from using the VF to the emulated device, and this makes the
hypervisor dependent on the guest OS. Also, we don't rely on the presence of all
the drivers inside the guest OS (be it Linux or Windows); the ESX hypervisor
carries all the plugins and the PF drivers and injects the right one as needed.
These plugins are guest agnostic and the IHVs do not have to write plugins for
different OSes.


Thanks,

-pankaj

 

^ permalink raw reply

* Re: linear sk_buff
From: Phil Sutter @ 2010-05-05 19:32 UTC (permalink / raw)
  To: netdev
In-Reply-To: <o2kdac45061005051212na062f611m1c87240c367b5d38@mail.gmail.com>

Hi,

On Wed, May 05, 2010 at 10:12:06PM +0300, Mark Ryden wrote:
>  I would appreciate if someone in this mailing list  can say in a
> sentence or two what is a linear
> sk_buff and what is a non linear sk_buff; does it have to do with fragmentation?
> (I am sure that many know the answer, but I am confused and googling
> made me overconfused)

As far as I can tell, linear and non-linear sk_buffs differ in how the
data they contain is kept internally. Linear sk_buffs are trivial: their
data resides in one contiguous block. In non-linear sk_buffs, data is
spread across multiple chunks, organised in a data structure comparable
to e.g. scatterlists.
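
As a rough illustration of that distinction, here is a userspace toy
model. All toy_ names are hypothetical and the layout is heavily
simplified. The two helpers mirror what the kernel's skb_is_nonlinear()
and skb_headlen() compute (a non-zero data_len, and len minus data_len,
respectively); the rest is an assumption made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one paged fragment. */
struct toy_frag {
	const unsigned char *data;
	size_t size;
};

#define TOY_MAX_FRAGS 4

/*
 * Toy buffer: the first (len - data_len) bytes sit in one contiguous
 * head area, the remaining data_len bytes are spread over frags[].
 */
struct toy_skb {
	const unsigned char *head;
	size_t len;		/* total bytes: head area plus fragments */
	size_t data_len;	/* bytes held in fragments only */
	size_t nr_frags;
	struct toy_frag frags[TOY_MAX_FRAGS];
};

/* Non-linear means some payload lives outside the head area. */
static int toy_skb_is_nonlinear(const struct toy_skb *skb)
{
	return skb->data_len != 0;
}

/* Bytes reachable through the contiguous head area. */
static size_t toy_skb_headlen(const struct toy_skb *skb)
{
	assert(skb->data_len <= skb->len);
	return skb->len - skb->data_len;
}

/* Append a fragment, turning a linear buffer into a non-linear one. */
static int toy_skb_add_frag(struct toy_skb *skb,
			    const unsigned char *data, size_t size)
{
	if (skb->nr_frags >= TOY_MAX_FRAGS)
		return -1;
	skb->frags[skb->nr_frags].data = data;
	skb->frags[skb->nr_frags].size = size;
	skb->nr_frags++;
	skb->len += size;
	skb->data_len += size;
	return 0;
}
```

A buffer with only a head area reports as linear; after one call to
toy_skb_add_frag() the total length grows but the head length does not,
and the buffer reports as non-linear.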

For further information, I'd highly recommend David Miller's "how SKBs
work": http://vger.kernel.org/~davem/skb.html .

Greetings, Phil

^ permalink raw reply

* Re: linear sk_buff
From: Rémi Denis-Courmont @ 2010-05-05 19:35 UTC (permalink / raw)
  To: Mark Ryden; +Cc: netdev
In-Reply-To: <o2kdac45061005051212na062f611m1c87240c367b5d38@mail.gmail.com>

On Wednesday 5 May 2010 22:12:06 Mark Ryden, you wrote:
> Hello,
>  I would appreciate if someone in this mailing list  can say in a
> sentence or two what is a linear
> sk_buff and what is a non linear sk_buff; does it have to do with
> fragmentation? (I am sure that many know the answer, but I am confused and
> googling made me overconfused)

A linear sk_buff is one that has neither memory pages nor fragment sk_buffs attached.

Such a buffer is made of a contiguous portion of the (kernel) memory.


-- 
Rémi Denis-Courmont
http://www.remlab.net/
http://fi.linkedin.com/in/remidenis

^ permalink raw reply

* [PATCH v2] sctp: Fix a race between ICMP protocol unreachable and connect()
From: Vlad Yasevich @ 2010-05-05 19:36 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-sctp, Vlad Yasevich
In-Reply-To: <1273087783-18250-1-git-send-email-vladislav.yasevich@hp.com>

[.. removed leftover debugging printk. Should probably be queued for stable
 as well... ]

ICMP protocol unreachable handling completely disregarded
the fact that the user may have locked the socket.  It proceeded
to destroy the association, even though the user may have
held the lock and had a ref on the association.  This resulted
in the following:

Attempt to release alive inet socket f6afcc00

=========================
[ BUG: held lock freed! ]
-------------------------
somenu/2672 is freeing memory f6afcc00-f6afcfff, with a lock still held
there!
 (sk_lock-AF_INET){+.+.+.}, at: [<c122098a>] sctp_connect+0x13/0x4c
1 lock held by somenu/2672:
 #0:  (sk_lock-AF_INET){+.+.+.}, at: [<c122098a>] sctp_connect+0x13/0x4c

stack backtrace:
Pid: 2672, comm: somenu Not tainted 2.6.32-telco #55
Call Trace:
 [<c1232266>] ? printk+0xf/0x11
 [<c1038553>] debug_check_no_locks_freed+0xce/0xff
 [<c10620b4>] kmem_cache_free+0x21/0x66
 [<c1185f25>] __sk_free+0x9d/0xab
 [<c1185f9c>] sk_free+0x1c/0x1e
 [<c1216e38>] sctp_association_put+0x32/0x89
 [<c1220865>] __sctp_connect+0x36d/0x3f4
 [<c122098a>] ? sctp_connect+0x13/0x4c
 [<c102d073>] ? autoremove_wake_function+0x0/0x33
 [<c12209a8>] sctp_connect+0x31/0x4c
 [<c11d1e80>] inet_dgram_connect+0x4b/0x55
 [<c11834fa>] sys_connect+0x54/0x71
 [<c103a3a2>] ? lock_release_non_nested+0x88/0x239
 [<c1054026>] ? might_fault+0x42/0x7c
 [<c1054026>] ? might_fault+0x42/0x7c
 [<c11847ab>] sys_socketcall+0x6d/0x178
 [<c10da994>] ? trace_hardirqs_on_thunk+0xc/0x10
 [<c1002959>] syscall_call+0x7/0xb

This was because the sctp_wait_for_connect() would acquire the socket
lock and then proceed to release the last reference count on the
association, thus causing the full destruction path to finish freeing
the socket.

The simplest solution is to start a very short timer in case the socket
is owned by user.  When the timer expires, we can do some verification
and be able to do the release properly.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
---
 include/net/sctp/sm.h      |    1 +
 include/net/sctp/structs.h |    3 +++
 net/sctp/input.c           |   22 ++++++++++++++++++----
 net/sctp/sm_sideeffect.c   |   35 +++++++++++++++++++++++++++++++++++
 net/sctp/transport.c       |    2 ++
 5 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/include/net/sctp/sm.h b/include/net/sctp/sm.h
index 851c813..61d73e3 100644
--- a/include/net/sctp/sm.h
+++ b/include/net/sctp/sm.h
@@ -279,6 +279,7 @@ int sctp_do_sm(sctp_event_t event_type, sctp_subtype_t subtype,
 /* 2nd level prototypes */
 void sctp_generate_t3_rtx_event(unsigned long peer);
 void sctp_generate_heartbeat_event(unsigned long peer);
+void sctp_generate_proto_unreach_event(unsigned long peer);
 
 void sctp_ootb_pkt_free(struct sctp_packet *);
 
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 597f8e2..219043a 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1010,6 +1010,9 @@ struct sctp_transport {
 	/* Heartbeat timer is per destination. */
 	struct timer_list hb_timer;
 
+	/* Timer to handle ICMP proto unreachable events */
+	struct timer_list proto_unreach_timer;
+
 	/* Since we're using per-destination retransmission timers
 	 * (see above), we're also using per-destination "transmitted"
 	 * queues.  This probably ought to be a private struct
diff --git a/net/sctp/input.c b/net/sctp/input.c
index 2a57018..ea21924 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -440,11 +440,25 @@ void sctp_icmp_proto_unreachable(struct sock *sk,
 {
 	SCTP_DEBUG_PRINTK("%s\n",  __func__);
 
-	sctp_do_sm(SCTP_EVENT_T_OTHER,
-		   SCTP_ST_OTHER(SCTP_EVENT_ICMP_PROTO_UNREACH),
-		   asoc->state, asoc->ep, asoc, t,
-		   GFP_ATOMIC);
+	if (sock_owned_by_user(sk)) {
+		if (timer_pending(&t->proto_unreach_timer))
+			return;
+		else {
+			if (!mod_timer(&t->proto_unreach_timer,
+						jiffies + (HZ/20)))
+				sctp_association_hold(asoc);
+		}
+			
+	} else {
+		if (timer_pending(&t->proto_unreach_timer) &&
+		    del_timer(&t->proto_unreach_timer))
+			sctp_association_put(asoc);
 
+		sctp_do_sm(SCTP_EVENT_T_OTHER,
+			   SCTP_ST_OTHER(SCTP_EVENT_ICMP_PROTO_UNREACH),
+			   asoc->state, asoc->ep, asoc, t,
+			   GFP_ATOMIC);
+	}
 }
 
 /* Common lookup code for icmp/icmpv6 error handler. */
diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
index d5ae450..eb1f42f 100644
--- a/net/sctp/sm_sideeffect.c
+++ b/net/sctp/sm_sideeffect.c
@@ -397,6 +397,41 @@ out_unlock:
 	sctp_transport_put(transport);
 }
 
+/* Handle the timeout of the ICMP protocol unreachable timer.  Trigger
+ * the correct state machine transition that will close the association.
+ */
+void sctp_generate_proto_unreach_event(unsigned long data)
+{
+	struct sctp_transport *transport = (struct sctp_transport *) data;
+	struct sctp_association *asoc = transport->asoc;
+	
+	sctp_bh_lock_sock(asoc->base.sk);
+	if (sock_owned_by_user(asoc->base.sk)) {
+		SCTP_DEBUG_PRINTK("%s:Sock is busy.\n", __func__);
+
+		/* Try again later.  */
+		if (!mod_timer(&transport->proto_unreach_timer,
+				jiffies + (HZ/20)))
+			sctp_association_hold(asoc);
+		goto out_unlock;
+	}
+
+	/* Is this structure just waiting around for us to actually
+	 * get destroyed?
+	 */
+	if (asoc->base.dead)
+		goto out_unlock;
+
+	sctp_do_sm(SCTP_EVENT_T_OTHER,
+		   SCTP_ST_OTHER(SCTP_EVENT_ICMP_PROTO_UNREACH),
+		   asoc->state, asoc->ep, asoc, transport, GFP_ATOMIC);
+
+out_unlock:
+	sctp_bh_unlock_sock(asoc->base.sk);
+	sctp_association_put(asoc);
+}
+
+
 /* Inject a SACK Timeout event into the state machine.  */
 static void sctp_generate_sack_event(unsigned long data)
 {
diff --git a/net/sctp/transport.c b/net/sctp/transport.c
index be4d63d..4a36803 100644
--- a/net/sctp/transport.c
+++ b/net/sctp/transport.c
@@ -108,6 +108,8 @@ static struct sctp_transport *sctp_transport_init(struct sctp_transport *peer,
 			(unsigned long)peer);
 	setup_timer(&peer->hb_timer, sctp_generate_heartbeat_event,
 			(unsigned long)peer);
+	setup_timer(&peer->proto_unreach_timer,
+		    sctp_generate_proto_unreach_event, (unsigned long)peer);
 
 	/* Initialize the 64-bit random nonce sent with heartbeat. */
 	get_random_bytes(&peer->hb_nonce, sizeof(peer->hb_nonce));
-- 
1.6.0.4
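
The deferral logic of the patch above can be modelled in userspace to
make the reference counting easier to follow. This is a simplified
sketch, not the SCTP code: all toy_ names are hypothetical, plain
integers stand in for the socket lock owner, the timer and the
association refcount, and there is no real locking or timing.

```c
#include <assert.h>

/* Hypothetical stand-ins for the socket, association and timer state. */
struct toy_assoc {
	int owned_by_user;	/* models sock_owned_by_user(sk) */
	int timer_pending;	/* models timer_pending() on the new timer */
	int refcnt;		/* models the association reference count */
	int closed;		/* set once the unreachable event is handled */
};

/* Models the ICMP error path. */
static void toy_icmp_proto_unreachable(struct toy_assoc *a)
{
	if (a->owned_by_user) {
		if (a->timer_pending)
			return;		/* a retry is already queued */
		a->timer_pending = 1;
		a->refcnt++;		/* the timer holds a reference */
		return;
	}
	if (a->timer_pending) {
		a->timer_pending = 0;	/* cancel the queued retry */
		a->refcnt--;
	}
	a->closed = 1;			/* safe: the user does not own the socket */
}

/* Models the timer callback. */
static void toy_proto_unreach_timeout(struct toy_assoc *a)
{
	assert(a->timer_pending);
	a->timer_pending = 0;
	if (a->owned_by_user) {
		/* Still busy: re-arm and keep the timer's reference. */
		a->timer_pending = 1;
		return;
	}
	a->closed = 1;
	a->refcnt--;			/* drop the timer's reference */
}
```

The point of the model is that the association is only closed from a
context where the user does not own the socket, and that every armed
timer is balanced by exactly one reference.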


^ permalink raw reply related

* [PATCH] sctp: Fix a race between ICMP protocol unreachable and connect()
From: Vlad Yasevich @ 2010-05-05 19:29 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-sctp, Vlad Yasevich

ICMP protocol unreachable handling completely disregarded
the fact that the user may have locked the socket.  It proceeded
to destroy the association, even though the user may have
held the lock and had a ref on the association.  This resulted
in the following:

Attempt to release alive inet socket f6afcc00

=========================
[ BUG: held lock freed! ]
-------------------------
somenu/2672 is freeing memory f6afcc00-f6afcfff, with a lock still held
there!
 (sk_lock-AF_INET){+.+.+.}, at: [<c122098a>] sctp_connect+0x13/0x4c
1 lock held by somenu/2672:
 #0:  (sk_lock-AF_INET){+.+.+.}, at: [<c122098a>] sctp_connect+0x13/0x4c

stack backtrace:
Pid: 2672, comm: somenu Not tainted 2.6.32-telco #55
Call Trace:
 [<c1232266>] ? printk+0xf/0x11
 [<c1038553>] debug_check_no_locks_freed+0xce/0xff
 [<c10620b4>] kmem_cache_free+0x21/0x66
 [<c1185f25>] __sk_free+0x9d/0xab
 [<c1185f9c>] sk_free+0x1c/0x1e
 [<c1216e38>] sctp_association_put+0x32/0x89
 [<c1220865>] __sctp_connect+0x36d/0x3f4
 [<c122098a>] ? sctp_connect+0x13/0x4c
 [<c102d073>] ? autoremove_wake_function+0x0/0x33
 [<c12209a8>] sctp_connect+0x31/0x4c
 [<c11d1e80>] inet_dgram_connect+0x4b/0x55
 [<c11834fa>] sys_connect+0x54/0x71
 [<c103a3a2>] ? lock_release_non_nested+0x88/0x239
 [<c1054026>] ? might_fault+0x42/0x7c
 [<c1054026>] ? might_fault+0x42/0x7c
 [<c11847ab>] sys_socketcall+0x6d/0x178
 [<c10da994>] ? trace_hardirqs_on_thunk+0xc/0x10
 [<c1002959>] syscall_call+0x7/0xb

This was because the sctp_wait_for_connect() would acquire the socket
lock and then proceed to release the last reference count on the
association, thus causing the full destruction path to finish freeing
the socket.

The simplest solution is to start a very short timer in case the socket
is owned by user.  When the timer expires, we can do some verification
and be able to do the release properly.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
---
 include/net/sctp/sm.h      |    1 +
 include/net/sctp/structs.h |    3 +++
 net/sctp/input.c           |   23 +++++++++++++++++++----
 net/sctp/sm_sideeffect.c   |   35 +++++++++++++++++++++++++++++++++++
 net/sctp/transport.c       |    2 ++
 5 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/include/net/sctp/sm.h b/include/net/sctp/sm.h
index 851c813..61d73e3 100644
--- a/include/net/sctp/sm.h
+++ b/include/net/sctp/sm.h
@@ -279,6 +279,7 @@ int sctp_do_sm(sctp_event_t event_type, sctp_subtype_t subtype,
 /* 2nd level prototypes */
 void sctp_generate_t3_rtx_event(unsigned long peer);
 void sctp_generate_heartbeat_event(unsigned long peer);
+void sctp_generate_proto_unreach_event(unsigned long peer);
 
 void sctp_ootb_pkt_free(struct sctp_packet *);
 
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 597f8e2..219043a 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1010,6 +1010,9 @@ struct sctp_transport {
 	/* Heartbeat timer is per destination. */
 	struct timer_list hb_timer;
 
+	/* Timer to handle ICMP proto unreachable events */
+	struct timer_list proto_unreach_timer;
+
 	/* Since we're using per-destination retransmission timers
 	 * (see above), we're also using per-destination "transmitted"
 	 * queues.  This probably ought to be a private struct
diff --git a/net/sctp/input.c b/net/sctp/input.c
index 2a57018..94b2eb2 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -440,11 +440,25 @@ void sctp_icmp_proto_unreachable(struct sock *sk,
 {
 	SCTP_DEBUG_PRINTK("%s\n",  __func__);
 
-	sctp_do_sm(SCTP_EVENT_T_OTHER,
-		   SCTP_ST_OTHER(SCTP_EVENT_ICMP_PROTO_UNREACH),
-		   asoc->state, asoc->ep, asoc, t,
-		   GFP_ATOMIC);
+	if (sock_owned_by_user(sk)) {
+		if (timer_pending(&t->proto_unreach_timer))
+			return;
+		else {
+			if (!mod_timer(&t->proto_unreach_timer,
+						jiffies + (HZ/20)))
+				sctp_association_hold(asoc);
+		}
+			
+	} else {
+		if (timer_pending(&t->proto_unreach_timer) &&
+		    del_timer(&t->proto_unreach_timer))
+			sctp_association_put(asoc);
 
+		sctp_do_sm(SCTP_EVENT_T_OTHER,
+			   SCTP_ST_OTHER(SCTP_EVENT_ICMP_PROTO_UNREACH),
+			   asoc->state, asoc->ep, asoc, t,
+			   GFP_ATOMIC);
+	}
 }
 
 /* Common lookup code for icmp/icmpv6 error handler. */
@@ -599,6 +613,7 @@ void sctp_v4_err(struct sk_buff *skb, __u32 info)
 		}
 		else {
 			if (ICMP_PROT_UNREACH == code) {
+				printk("processing code unreach\n");
 				sctp_icmp_proto_unreachable(sk, asoc,
 							    transport);
 				goto out_unlock;
diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
index d5ae450..eb1f42f 100644
--- a/net/sctp/sm_sideeffect.c
+++ b/net/sctp/sm_sideeffect.c
@@ -397,6 +397,41 @@ out_unlock:
 	sctp_transport_put(transport);
 }
 
+/* Handle the timeout of the ICMP protocol unreachable timer.  Trigger
+ * the correct state machine transition that will close the association.
+ */
+void sctp_generate_proto_unreach_event(unsigned long data)
+{
+	struct sctp_transport *transport = (struct sctp_transport *) data;
+	struct sctp_association *asoc = transport->asoc;
+	
+	sctp_bh_lock_sock(asoc->base.sk);
+	if (sock_owned_by_user(asoc->base.sk)) {
+		SCTP_DEBUG_PRINTK("%s:Sock is busy.\n", __func__);
+
+		/* Try again later.  */
+		if (!mod_timer(&transport->proto_unreach_timer,
+				jiffies + (HZ/20)))
+			sctp_association_hold(asoc);
+		goto out_unlock;
+	}
+
+	/* Is this structure just waiting around for us to actually
+	 * get destroyed?
+	 */
+	if (asoc->base.dead)
+		goto out_unlock;
+
+	sctp_do_sm(SCTP_EVENT_T_OTHER,
+		   SCTP_ST_OTHER(SCTP_EVENT_ICMP_PROTO_UNREACH),
+		   asoc->state, asoc->ep, asoc, transport, GFP_ATOMIC);
+
+out_unlock:
+	sctp_bh_unlock_sock(asoc->base.sk);
+	sctp_association_put(asoc);
+}
+
+
 /* Inject a SACK Timeout event into the state machine.  */
 static void sctp_generate_sack_event(unsigned long data)
 {
diff --git a/net/sctp/transport.c b/net/sctp/transport.c
index be4d63d..4a36803 100644
--- a/net/sctp/transport.c
+++ b/net/sctp/transport.c
@@ -108,6 +108,8 @@ static struct sctp_transport *sctp_transport_init(struct sctp_transport *peer,
 			(unsigned long)peer);
 	setup_timer(&peer->hb_timer, sctp_generate_heartbeat_event,
 			(unsigned long)peer);
+	setup_timer(&peer->proto_unreach_timer,
+		    sctp_generate_proto_unreach_event, (unsigned long)peer);
 
 	/* Initialize the 64-bit random nonce sent with heartbeat. */
 	get_random_bytes(&peer->hb_nonce, sizeof(peer->hb_nonce));
-- 
1.6.0.4


^ permalink raw reply related

* linear sk_buff
From: Mark Ryden @ 2010-05-05 19:12 UTC (permalink / raw)
  To: netdev

Hello,
 I would appreciate it if someone on this mailing list could say in a
sentence or two what a linear sk_buff is and what a non-linear sk_buff
is; does it have to do with fragmentation?
(I am sure that many know the answer, but I am confused, and googling
only confused me further.)

Regards, and sorry for the noise,

Mark

^ permalink raw reply

* [PATCH/RFC] cxgb4: Add MAINTAINERS info
From: Roland Dreier @ 2010-05-05 19:01 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW, dm-ut6Up61K2wZBDgjK7y7TUQ
In-Reply-To: <20100428103356.931ba48c.randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

Hi guys, does this info for cxgb4/iw_cxgb4 (pretty much copied from
cxgb3, except with Dimitris instead of Divy) look right?  If so I'll add
it to my tree.

Thanks,
  Roland
---
diff --git a/MAINTAINERS b/MAINTAINERS
index 7a9ccda..a00231b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1719,6 +1719,20 @@ W:	http://www.openfabrics.org
 S:	Supported
 F:	drivers/infiniband/hw/cxgb3/
 
+CXGB4 ETHERNET DRIVER (CXGB4)
+M:	Dimitris Michailidis <dm-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
+L:	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
+W:	http://www.chelsio.com
+S:	Supported
+F:	drivers/net/cxgb4/
+
+CXGB4 IWARP RNIC DRIVER (IW_CXGB4)
+M:	Steve Wise <swise-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
+L:	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
+W:	http://www.openfabrics.org
+S:	Supported
+F:	drivers/infiniband/hw/cxgb4/
+
 CYBERPRO FB DRIVER
 M:	Russell King <linux-lFZ/pmaqli7XmaaqVzeoHQ@public.gmane.org>
 L:	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org (moderated for non-subscribers)

^ permalink raw reply related

* Re: RFC: Network Plugin Architecture (NPA) for vmxnet3
From: Pankaj Thakkar @ 2010-05-05 19:00 UTC (permalink / raw)
  To: Chris Wright
  Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	virtualization@lists.linux-foundation.org, pv-drivers@vmware.com,
	Shreyas Bhatewara, kvm@vger.kernel.org
In-Reply-To: <20100505005852.GA28034@sequoia.sous-sol.org>

On Tue, May 04, 2010 at 05:58:52PM -0700, Chris Wright wrote:
> Date: Tue, 4 May 2010 17:58:52 -0700
> From: Chris Wright <chrisw@sous-sol.org>
> To: Pankaj Thakkar <pthakkar@vmware.com>
> CC: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
> 	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
> 	"virtualization@lists.linux-foundation.org"
>  <virtualization@lists.linux-foundation.org>,
> 	"pv-drivers@vmware.com" <pv-drivers@vmware.com>,
> 	Shreyas Bhatewara <sbhatewara@vmware.com>,
> 	"kvm@vger.kernel.org" <kvm@vger.kernel.org>
> Subject: Re: RFC: Network Plugin Architecture (NPA) for vmxnet3
> 
> * Pankaj Thakkar (pthakkar@vmware.com) wrote:
> > We intend to upgrade the upstreamed vmxnet3 driver to implement NPA so that
> > Linux users can exploit the benefits provided by passthrough devices in a
> > seamless manner while retaining the benefits of virtualization. The document
> > below tries to answer most of the questions which we anticipated. Please let us
> > know your comments and queries.
> 
> How does the throughput, latency, and host CPU utilization for normal
> data path compare with say NetQueue?

NetQueue is really for scaling across multiple VMs. NPA allows similar scaling
and also helps improve the CPU efficiency of a single VM since the
hypervisor is bypassed. Throughput-wise, both emulation and passthrough (NPA)
can reach line rate on 10gig, but passthrough saves up to 40% CPU depending on
the workload. We did a demo at IDF 2009 where we compared 8 VMs running on
NetQueue vs. 8 VMs running on NPA (using Niantic) and we obtained similar CPU
efficiency gains.

> 
> And does this obsolete your UPT implementation?

NPA and UPT share a lot of code in the hypervisor. UPT was adopted by only a
few IHVs and hence NPA is our way forward to get all IHVs on board.

> How many cards actually support this NPA interface?  What does it look
> like, i.e. where is the NPA specification?  (AFAIK, we never got the UPT
> one).

We have it working internally with Intel Niantic (10G) and Kawela (1G) SR-IOV
NICs. We are also working with an upcoming Broadcom 10G card and plan to
support other IHVs. Unlike UPT, we don't dictate the register sets or rings.
Rather we have guidelines, for example that the card should have an
embedded switch for inter-VF switching or should support programming (rx
filters, VLAN, etc.) through the PF driver rather than the VF driver.

> How do you handle hardware which has a more symmetric view of the
> SR-IOV world (SR-IOV is only a PCI specification, not a network driver
> specification)?  Or hardware which has multiple functions per physical
> port (multiqueue, hw filtering, embedded switch, etc.)?

I am not sure what you mean by a symmetric view of the SR-IOV world.

NPA allows multi-queue VFs and requires an embedded switch currently. As far as
the PF driver is concerned we require IHVs to support all existing and upcoming
features like NetQueue, FCoE, etc. The PF driver is considered special and is
used to drive the traffic for the emulated/paravirtualized VMs and is also used
to program things on behalf of the VFs through the hypervisor. If the hardware
has multiple physical functions they are treated as separate adapters (with
their own set of VFs) and we require the embedded switch to maintain that
distinction as well.


> > NPA offers several benefits:
> > 1. Performance: Critical performance sensitive paths are not trapped and the
> > guest can directly drive the hardware without incurring virtualization
> > overheads.
> 
> Can you demonstrate with data?

The setup is a 2.667 GHz Nehalem server running a SLES11 VM talking to a
2.33 GHz Barcelona client box running RHEL 5.1. We ran netperf streams with a
16k msg size over a 64k socket size between the server VM and the client, both
using Intel Niantic 10G cards. In both cases (NPA and regular) the VM was CPU
saturated (used one full core).

TX: regular vmxnet3 = 3085.5 Mbps/GHz; NPA vmxnet3 = 4397.2 Mbps/GHz
RX: regular vmxnet3 = 1379.6 Mbps/GHz; NPA vmxnet3 = 2349.7 Mbps/GHz

We have similar results for other configurations, and in general we have seen
that NPA is better in terms of CPU cost and can save up to 40% of CPU cost.
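
As a rough cross-check of the "up to 40%" figure, the Mbps/GHz numbers quoted
above can be turned into a relative throughput gain and a relative CPU saving.
This is only a sketch of the arithmetic; the struct and function names are
made up for illustration and are not part of any vmxnet3 code.

```c
#include <stdio.h>

/* Throughput per GHz of guest CPU, copied from the figures quoted above. */
struct efficiency {
	const char *direction;
	double regular;	/* regular vmxnet3, Mbps/GHz */
	double npa;	/* NPA vmxnet3, Mbps/GHz */
};

static const struct efficiency results[] = {
	{ "TX", 3085.5, 4397.2 },
	{ "RX", 1379.6, 2349.7 },
};

/* Extra throughput obtained from the same CPU budget, in percent. */
double throughput_gain_pct(const struct efficiency *e)
{
	return (e->npa / e->regular - 1.0) * 100.0;
}

/* CPU saved when moving the same amount of traffic, in percent. */
double cpu_saving_pct(const struct efficiency *e)
{
	return (1.0 - e->regular / e->npa) * 100.0;
}

/* Print one line per direction; call this from main() to see the table. */
void print_report(void)
{
	size_t i;

	for (i = 0; i < sizeof(results) / sizeof(results[0]); i++)
		printf("%s: +%.1f%% throughput per GHz, %.1f%% less CPU per Mbps\n",
		       results[i].direction,
		       throughput_gain_pct(&results[i]),
		       cpu_saving_pct(&results[i]));
}
```

With these inputs the CPU saving per Mbps works out to roughly 30% for TX and
roughly 41% for RX, which is consistent with the "up to 40%" claim applying to
the receive path in this particular configuration.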

> 
> > 2. Hypervisor control: All control operations from the guest such as programming
> > MAC address go through the hypervisor layer and hence can be subjected to
> > hypervisor policies. The PF driver can be further used to put policy decisions
> > like which VLAN the guest should be on.
> 
> This can happen without NPA as well.  VF simply needs to request
> the change via the PF (in fact, hw does that right now).  Also, we
> already have a host side management interface via PF (see, for example,
> RTM_SETLINK IFLA_VF_MAC interface).
> 
> What is control plane interface?  Just something like a fixed register set?

All operations other than TX/RX go through the vmxnet3 shell to the vmxnet3
device emulation. So the control plane is really the vmxnet3 device emulation
as far as the guest is concerned.

> 
> > 3. Guest Management: No hardware specific drivers need to be installed in the
> > guest virtual machine and hence no overheads are incurred for guest management.
> > All software for the driver (including the PF driver and the plugin) is
> > installed in the hypervisor.
> 
> So we have a plugin per hardware VF implementation?  And the hypervisor
> injects this code into the guest?

One guest-agnostic plugin per VF implementation. Yes, the plugin is injected
into the guest by the hypervisor.

> > The plugin image is provided by the IHVs along with the PF driver and is
> > packaged in the hypervisor. The plugin image is OS agnostic and can be loaded
> > either into a Linux VM or a Windows VM. The plugin is written against the Shell
> 
> And it will need to be GPL AFAICT from what you've said thus far.  It
> does sound worrisome, although I suppose hw firmware isn't particularly
> different.

Yes it would be GPL and we are thinking of enforcing the license in the
hypervisor as well as in the shell.

> How does the shell switch back to emulated mode for live migration?

The hypervisor sends a notification to the shell to switch out of passthrough
and it quiesces the VF and tears down the mapping between VF and the guest. The
shell frees up the buffers and other resources on behalf of the plugin and
reinitializes the s/w vmxnet3 emulation plugin.
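
The switch-out sequence described above can be modelled as swapping the active
plugin behind a fixed shell. The shell API has not been posted, so everything
below (struct plugin_ops, struct shell, the function names) is a hypothetical
illustration of the idea rather than the actual interface.

```c
#include <stddef.h>

/* Hypothetical plugin interface: the real shell API was not published. */
struct plugin_ops {
	const char *name;
	int (*tx)(const void *buf, size_t len);
};

/* Both toy plugins just report how many bytes they "sent". */
static int emu_tx(const void *buf, size_t len) { (void)buf; return (int)len; }
static int vf_tx(const void *buf, size_t len)  { (void)buf; return (int)len; }

const struct plugin_ops emulation_plugin = { "emulation", emu_tx };
const struct plugin_ops passthrough_plugin = { "passthrough", vf_tx };

struct shell {
	const struct plugin_ops *active;
	int buffers_in_use;	/* resources held on behalf of the plugin */
	int vf_mapped;		/* VF currently mapped into the guest */
};

/* The steps described above, in order, on a "switch out" notification. */
void shell_switch_to_emulation(struct shell *sh)
{
	sh->vf_mapped = 0;		/* VF quiesced, mapping torn down */
	sh->buffers_in_use = 0;		/* shell frees plugin resources */
	sh->active = &emulation_plugin;	/* s/w emulation plugin takes over */
}

/* Data path: the guest always transmits through the active plugin. */
int shell_tx(struct shell *sh, const void *buf, size_t len)
{
	return sh->active->tx(buf, len);
}
```

The point of the indirection is that the guest-visible device stays vmxnet3
throughout, so the guest keeps transmitting through shell_tx() before and
after the switch.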

> Please make this shell API interface and the PF/VF requirements available.

We have an internal prototype working but we are not yet ready to post the
patch to LKML. We are still in the process of making changes to our Windows
driver and want to ensure that we take into account all changes that could
happen.

Thanks,

-pankaj


^ permalink raw reply

* Re: TCP-MD5 checksum failure on x86_64 SMP
From: Eric Dumazet @ 2010-05-05 18:53 UTC (permalink / raw)
  To: Bhaskar Dutta; +Cc: Stephen Hemminger, Ben Hutchings, netdev
In-Reply-To: <g2p571fb4001005051103w67e1b9ddn3e8f7feb84d0559@mail.gmail.com>

Le mercredi 05 mai 2010 à 23:33 +0530, Bhaskar Dutta a écrit :

> Hi,
> 
> TSO, GSO and SG are already turned off.
> rx/tx checksumming is on, but that shouldn't matter, right?
> 
> # ethtool -k eth0
> Offload parameters for eth0:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: off
> tcp segmentation offload: off
> udp fragmentation offload: off
> generic segmentation offload: off
> 
> The bad packets are very small in size, most have no data at all (<300 bytes).
> 
> After adding some logs to kernel 2.6.31-12, it seems that the data used by
> tcp_v4_md5_hash_skb (the function that calculates the md5 hash) is
> (might be?) getting corrupted.
> 
> The tcp4_pseudohdr (bp = &hp->md5_blk.ip4) structure's saddr, daddr
> and len fields get modified to different values towards the end of the
> tcp_v4_md5_hash_skb function whenever there is a checksum error.
> 
> The tcp4_pseudohdr (bp) is within the tcp_md5sig_pool (hp), which is
> filled up by tcp_get_md5sig_pool (which calls per_cpu_ptr).
> 
> Using a local copy of the tcp4_pseudohdr in the same function
> tcp_v4_md5_hash_skb (copied all fields from the original
> tcp4_pseudohdr within the tcp_md5sig_pool) and calculating the md5
> checksum with the local  tcp4_pseudohdr seems to solve the issue
> (I don't see bad packets for hours in load tests, whereas without the
> change I see them almost immediately).
> 
> I am still unable to figure out how this is happening. Please let me
> know if you have any pointers.

I am not familiar with this code, but I suspect the same per_cpu data can be
used at the same time by a sender (process context) and by a receiver
(softirq context).

To trigger this, you need at least two active md5 sockets.

tcp_get_md5sig_pool() should probably disable bh to make sure the current
cpu won't be preempted by softirq processing while the pool is in use.


Something like:

diff --git a/include/net/tcp.h b/include/net/tcp.h
index fb5c66b..e232123 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1221,12 +1221,15 @@ struct tcp_md5sig_pool          *tcp_get_md5sig_pool(void)
 	struct tcp_md5sig_pool *ret = __tcp_get_md5sig_pool(cpu);
 	if (!ret)
 		put_cpu();
+	else
+		local_bh_disable();
 	return ret;
 }
 
 static inline void             tcp_put_md5sig_pool(void)
 {
 	__tcp_put_md5sig_pool();
+	local_bh_enable();
 	put_cpu();
 }
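
The suspected race can be reproduced in a small, deterministic userspace
model. This is not kernel code: percpu_scratch, bh_disabled, toy_hash() and
the other names are hypothetical stand-ins for the per-CPU md5sig pool,
local_bh_disable() and the MD5 computation, and the "interrupt" is simply
called at the worst possible moment.

```c
/* Hypothetical stand-in for the pseudo header kept in the per-CPU pool. */
struct pseudo_hdr {
	unsigned int saddr;
	unsigned int daddr;
	unsigned short len;
};

static struct pseudo_hdr percpu_scratch;	/* scratch area of one CPU */
static int bh_disabled;				/* models local_bh_disable() */
static int softirq_pending;			/* softirq deferred while bh is off */

/* Toy digest standing in for MD5; it only needs to depend on every field. */
unsigned int toy_hash(const struct pseudo_hdr *h)
{
	return h->saddr * 31u + h->daddr * 17u + h->len;
}

/* Receiver in softirq context: reuses the same per-CPU scratch area. */
static void softirq_rx(void)
{
	percpu_scratch.saddr = 0x0a000002;
	percpu_scratch.daddr = 0x0a000001;
	percpu_scratch.len = 20;
	(void)toy_hash(&percpu_scratch);
}

/* Models an interrupt: the softirq runs now unless bh is disabled. */
static void interrupt_arrives(void)
{
	if (bh_disabled)
		softirq_pending = 1;
	else
		softirq_rx();
}

/* Sender in process context; protect != 0 models the proposed fix. */
unsigned int sender_tx(int protect)
{
	unsigned int digest;

	if (protect)
		bh_disabled = 1;

	percpu_scratch.saddr = 0x0a000001;
	percpu_scratch.daddr = 0x0a000002;
	percpu_scratch.len = 40;

	interrupt_arrives();	/* worst-case timing: between fill and hash */

	digest = toy_hash(&percpu_scratch);

	if (protect) {
		bh_disabled = 0;
		if (softirq_pending) {	/* deferred softirq runs once bh is on */
			softirq_pending = 0;
			softirq_rx();
		}
	}
	return digest;
}

/* Digest the sender should have produced for its own header. */
unsigned int expected_tx_digest(void)
{
	struct pseudo_hdr h = { 0x0a000001, 0x0a000002, 40 };

	return toy_hash(&h);
}
```

In this model sender_tx(0) returns a digest computed over the receiver's
header, while sender_tx(1) matches expected_tx_digest(). That also fits the
reported observation that hashing a local copy of the pseudo header hides the
problem.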



^ permalink raw reply related

