Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 1/3] drivers/net/ipg.c: Fix skbuff leak
From: linux @ 2008-01-09  0:38 UTC (permalink / raw)
  To: linux, romieu; +Cc: akpm, davem, netdev
In-Reply-To: <20080108213640.GC4450@electric-eye.fr.zoreil.com>

> Can you try the patch below ?

Testing now... (I presume you noticed the one-character typo in my
earlier patch.  That should be "mc = mc->next", not "mv = mc->next".)

That doesn't seem to do it.  Not entirely, at least.  After downloading
and partially re-uploading an 800M file, slabtop reports:

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
341576 341574  99%    0.50K  42697        8    170788K kmalloc-512
342006 341953  99%    0.19K  16286       21     65144K kmalloc-192
 30592  30575  99%    2.00K   7648        4     61184K kmalloc-2048
 30213  30193  99%    0.44K   3357        9     13428K skbuff_fclone_cache
  7650   7643  99%    0.08K    150       51       600K sysfs_dir_cache
  4000   3938  98%    0.12K    125       32       500K kmalloc-128
   258    258 100%    1.15K     43        6       344K raid5-md5
   232    221  95%    1.00K     58        4       232K kmalloc-1024
  3136   3110  99%    0.06K     49       64       196K kmalloc-64
   264     80  30%    0.68K     24       11       192K ext3_inode_cache

The "kmalloc-2048" was down in the noise before the upload started.
This is in single-user mode, after sync and echo 3 > /proc/sys/vm/drop_caches.


I'll have to try this after this evening's social plans, but I'm thinking
of implementing more rapid bug detection: explicitly zero the sp->TxBuff
slot when the skb is freed, and check that it is zero before putting
anything else in there.  (And likewise for RxBuff.)

That way, I don't have to use up a noticeable amount of memory to see
the bug and reboot to clear up the damage each test cycle.

^ permalink raw reply

* Re: 2.6.24-rc6-mm1
From: FUJITA Tomonori @ 2008-01-09  0:54 UTC (permalink / raw)
  To: akpm
  Cc: fujita.tomonori, mingo, just.for.lkml, tomof, jarkao2, herbert,
	linux-kernel, neilb, bfields, netdev, tom
In-Reply-To: <20080108162739.f2f577ce.akpm@linux-foundation.org>

On Tue, 8 Jan 2008 16:27:39 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 09 Jan 2008 08:57:53 +0900
> FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> wrote:
> 
> > Andrew, can you replace
> > 
> > iommu-sg-add-iommu-helper-functions-for-the-free-area-management.patch
> > 
> > with the updated patch:
> > 
> > http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048997.html
> > 
> > For your convenience I've attached the updated patch too.
> 
> <generates the incremental>

Thanks for putting the fix to -mm.


> > --- a/lib/iommu-helper.c~a
> > +++ a/lib/iommu-helper.c
> > @@ -8,15 +8,20 @@
> >  static unsigned long find_next_zero_area(unsigned long *map,
> >  					 unsigned long size,
> >  					 unsigned long start,
> > -					 unsigned int nr)
> > +					 unsigned int nr,
> > +					 unsigned long align_mask)
> >  {
> >  	unsigned long index, end, i;
> >  again:
> >  	index = find_next_zero_bit(map, size, start);
> > +
> > +	/* Align allocation */
> > +	index = (index + align_mask) & ~align_mask;
> 
> The ALIGN() macro is the approved way of doing this.
> 
> (I don't think ALIGN adds much value really, especially given that you've
> commented what's going on, but I guess it does make reviewing and reading a
> little easier).

Would be better to use __ALIGN_MASK? I can find only one user who
directly use __ALIGN_MASK. The POWER IOMMU calculates align_mask by
itself so it's easier to pass align_mask as an argument.

^ permalink raw reply

* Re: 2.6.24-rc6-mm1
From: Andrew Morton @ 2008-01-09  1:07 UTC (permalink / raw)
  To: FUJITA Tomonori
  Cc: fujita.tomonori, mingo, just.for.lkml, tomof, jarkao2, herbert,
	linux-kernel, neilb, bfields, netdev, tom
In-Reply-To: <20080109095445N.fujita.tomonori@lab.ntt.co.jp>

On Wed, 09 Jan 2008 09:54:45 +0900
FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> wrote:

> > > --- a/lib/iommu-helper.c~a
> > > +++ a/lib/iommu-helper.c
> > > @@ -8,15 +8,20 @@
> > >  static unsigned long find_next_zero_area(unsigned long *map,
> > >  					 unsigned long size,
> > >  					 unsigned long start,
> > > -					 unsigned int nr)
> > > +					 unsigned int nr,
> > > +					 unsigned long align_mask)
> > >  {
> > >  	unsigned long index, end, i;
> > >  again:
> > >  	index = find_next_zero_bit(map, size, start);
> > > +
> > > +	/* Align allocation */
> > > +	index = (index + align_mask) & ~align_mask;
> > 
> > The ALIGN() macro is the approved way of doing this.
> > 
> > (I don't think ALIGN adds much value really, especially given that you've
> > commented what's going on, but I guess it does make reviewing and reading a
> > little easier).
> 
> Would be better to use __ALIGN_MASK? I can find only one user who
> directly use __ALIGN_MASK. The POWER IOMMU calculates align_mask by
> itself so it's easier to pass align_mask as an argument.

ALIGN() should be OK - its aditional type coercion isn't useful in this
case but ALIGN() is the official interface.

I don't see any reason why vermilion.c had to reach for __ALIGN_MASK.  I'll
switch it to ALIGN().


^ permalink raw reply

* Re: SACK scoreboard
From: Lachlan Andrew @ 2008-01-09  1:34 UTC (permalink / raw)
  To: David Miller; +Cc: jheffner, ilpo.jarvinen, netdev, quetchen
In-Reply-To: <20080108.144456.173014334.davem@davemloft.net>

Greetings David,

On 08/01/2008, David Miller <davem@davemloft.net> wrote:
> From: John Heffner <jheffner@psc.edu>
>
> > I haven't thought about this too hard, but can we approximate this by
> > moving scaked data into a sacked queue, then if something bad happens
> > merge this back into the retransmit queue?
>
> That defeats the impetus for the change.
>
> We want to free up the data, say, 2 packets at a time as
> ACKs come in.  The key goal is smooth liberation of
> retransmit queue packets over time.

John also suggested freeing the packets as a lower priority task, just
doing it after they're acknowledged.

When the ACK finally comes, you could do something like moving John's
entire list of packets to a "to be freed" list, and free a few every
time (say) another ACK comes in.

$0.02,
Lachlan

-- 
Lachlan Andrew  Dept of Computer Science, Caltech
1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA
Ph: +1 (626) 395-8820    Fax: +1 (626) 568-3603
http://netlab.caltech.edu/~lachlan

^ permalink raw reply

* Belay that... -- Re: Please pull 'upstream-jgarzik' branch of wireless-2.6
From: John W. Linville @ 2008-01-09  1:49 UTC (permalink / raw)
  To: jeff-o2qLIJkoznsdnm+yROfE0A
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <20080108224202.GE3086-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>

Please don't pull yet -- I let a patch get in out of order.

I'll post a new pull request when I straighten this out...

John

On Tue, Jan 08, 2008 at 05:42:02PM -0500, John W. Linville wrote:
> On Tue, Jan 08, 2008 at 05:23:05PM -0500, John W. Linville wrote:
> > Jeff,
> > 
> > Another round of patches intended for 2.6.25...the biggest factions are
> > rt2x00 and b43 updates, as well as some Viro-isms... :-)
> > 
> > Please let me know if there are any problems!
> > 
> > John
> > 
> > P.S.  Copying Dave in case he is handling these requests...FWIW, it
> > will definitely depend on the patches already in netdev-2.6#upstream...
> 
> I left out a patch.  I have pushed it on top of the previous request.
> 
> Let me know if there are problems!
> 
> Thanks,
> 
> John
-- 
John W. Linville
linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org

^ permalink raw reply

* Re: SACK scoreboard
From: Andi Kleen @ 2008-01-09  2:25 UTC (permalink / raw)
  To: David Miller; +Cc: jheffner, ilpo.jarvinen, lachlan.andrew, netdev, quetchen
In-Reply-To: <20080108.144456.173014334.davem@davemloft.net>

David Miller <davem@davemloft.net> writes:
>
> The big problem is that recovery from even a single packet loss in a
> window makes us run kfree_skb() for a all the packets in a full
> window's worth of data when recovery completes.

Why exactly is it a problem to free them all at once? Are you worried
about kernel preemption latencies?

-Andi

^ permalink raw reply

* Re: [PATCH net-2.6.25 2/6][IPVS] Switch to using ctl_paths.
From: Simon Horman @ 2008-01-09  2:49 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: David Miller, Linux Netdev List, devel
In-Reply-To: <47839D93.8090406@openvz.org>

On Tue, Jan 08, 2008 at 06:58:11PM +0300, Pavel Emelyanov wrote:
> The feature of ipvs ctls is that the net/ipv4/vs path
> is common for core ipvs ctls and for two schedulers,
> so I make it exported and re-use it in modules.
> 
> Two other .c files required linux/sysctl.h to make the
> extern declaration of this path compile well.
> 
> Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

Thanks, this looks good to me and I've confirmed that
the same entires with the same permissions exist under
/proc/sys/net/ipv4/vs before and after the change.

Acked-by: Simon Horman <horms@verge.net.au>

-- 
Horms


^ permalink raw reply

* Please pull 'upstream-jgarzik' branch of wireless-2.6 (use this one)
From: John W. Linville @ 2008-01-09  3:20 UTC (permalink / raw)
  To: jeff; +Cc: netdev, linux-wireless, davem

[ 2nd try...more thoroughly checked... ]

Jeff,

Another round of patches intended for 2.6.25...the biggest factions are
rt2x00 and b43 updates, as well as some Viro-isms... :-)

Please let me know if there are any problems!

John

P.S.  Copying Dave in case he is handling these requests...FWIW, it
will definitely depend on the patches already in netdev-2.6#upstream...

---

Individual patches are available here:

	http://www.kernel.org/pub/linux/kernel/people/linville/wireless-2.6/upstream-jgarzik

---

The following changes since commit 65d0aa09c183ee45dc1786675209313fa75cf4ec:
  Jeff Garzik (1):
        wireless/iwl: fix namespace breakage

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git upstream-jgarzik

Al Viro (35):
      eliminate byteswapping in struct ieee80211_qos_parameters
      several missing cpu_to_le16() in ieee80211softmac_capabilities()
      ieee80211softmac_auth_resp() fix
      ieee80211: fix misannotations
      ieee80211: beacon->capability is little-endian
      airo: fix transmit_802_11_packet()
      airo: fix endianness bug in ->dBm handling
      airo: bug in airo_interrupt() handling on incoming 802.11
      airo endianness bug: cap_rid.extSoftCap
      airo: fix writerids() endianness
      hostap: fix endianness with txdesc->sw_support
      p54common annotations and fixes
      ipw2100 annotations and fixes
      ray_cs fixes
      ipw2200 fix: struct ieee80211_radiotap_header is little-endian
      ipw2200 fix: ->rt_chbitmask is le16
      ipw2200: ipw_tx_skb() endianness bug
      airo: trivial endianness annotations
      airo: sanitize handling of SSID_rid
      bap_read()/bap_write() work with fixed-endian buffers
      airo: sanitize BSSListRid handling
      airo: sanitize handling of WepKeyRid
      airo: sanitize handling of StatsRid
      airo: sanitize handling of CapabilityRid
      airo: sanitize APListRid handling
      airo: sanitize handling of StatusRid
      airo: last of endianness annotations
      hostap annotations
      hostap: don't mess with mixed-endian even for internal skb queues
      p54pci: endianness annotations and fixes
      bcm43xx annotations
      prism54 trivial annotations
      ipw2200 trivial annotations
      ipw2200: do not byteswap struct ipw_associate
      misc wireless annotations

Daniel Walker (1):
      prism54: remove questionable down_interruptible usage

Ivo van Doorn (12):
      rt2x00: Fix chipset debugfs file
      rt2x00: Always call ieee80211_stop_queue() when return NETDEV_TX_BUSY
      rt2x00: Only set the TBCN flag when the interface is configured to send beacons.
      rt2x00: Store queue idx and entry idx in data_ring and data_entry
      rt2x00: Move start() and stop() handlers into rt2x00lib.c
      rt2x00: Put 802.11 data on 4 byte boundary
      rt2x00: Move packet filter flags
      rt2x00: Cleanup write_tx_desc() arguments
      rt2x00: Determine MY_BSS from descriptor
      rt2x00: Move init_txring and init_rxring into rt2x00lib
      rt2x00: Correctly initialize data and desc pointer
      rt2x00: Release rt2x00 2.0.14

John W. Linville (1):
      Revert "rtl8187: fix tx power reading"

Michael Buesch (10):
      ssb: Fix extraction of values from SPROM
      b43: Only select allowed TX and RX antennas
      b43: Fix chip access validation for new devices
      ssb: Fix PCMCIA lowlevel register access
      b43: Remove PIO support
      b43: Add definitions for MAC Control register
      b43-ssb-bridge: Add PCI ID for BCM43XG
      b43: Add NPHY kconfig option
      b43: Fix any N-PHY related WARN_ON() in the attach stage.
      zd1211rw: fix alignment for QOS and WDS frames

Miguel Botón (3):
      ssb: add 'ssb_pcihost_set_power_state' function
      b44: power down PHY when interface down
      iwlwifi: fix compilation warning in 'iwl-4965.c'

Zhu Yi (1):
      iwlwifi: fix typo in 'drivers/net/wireless/iwlwifi/Kconfig'

 drivers/net/b44.c                             |   28 +-
 drivers/net/wireless/adm8211.c                |    8 +-
 drivers/net/wireless/airo.c                   | 1233 ++++++++++++-------------
 drivers/net/wireless/atmel.c                  |   30 +-
 drivers/net/wireless/b43/Kconfig              |   58 +-
 drivers/net/wireless/b43/Makefile             |    9 +-
 drivers/net/wireless/b43/b43.h                |   78 +-
 drivers/net/wireless/b43/debugfs.c            |    1 -
 drivers/net/wireless/b43/dma.c                |   19 +-
 drivers/net/wireless/b43/dma.h                |   50 -
 drivers/net/wireless/b43/main.c               |  357 +++-----
 drivers/net/wireless/b43/main.h               |    3 +
 drivers/net/wireless/b43/xmit.c               |   26 +-
 drivers/net/wireless/b43legacy/main.c         |    5 -
 drivers/net/wireless/b43legacy/phy.c          |    2 +-
 drivers/net/wireless/bcm43xx/bcm43xx.h        |    6 +-
 drivers/net/wireless/bcm43xx/bcm43xx_main.c   |   40 +-
 drivers/net/wireless/bcm43xx/bcm43xx_pio.c    |    6 +-
 drivers/net/wireless/bcm43xx/bcm43xx_xmit.c   |    6 +-
 drivers/net/wireless/hostap/hostap_80211.h    |   34 +-
 drivers/net/wireless/hostap/hostap_80211_rx.c |    2 +-
 drivers/net/wireless/hostap/hostap_ap.c       |   72 +-
 drivers/net/wireless/hostap/hostap_common.h   |   34 +-
 drivers/net/wireless/hostap/hostap_download.c |   22 +-
 drivers/net/wireless/hostap/hostap_hw.c       |   28 +-
 drivers/net/wireless/hostap/hostap_info.c     |    9 +-
 drivers/net/wireless/hostap/hostap_ioctl.c    |   66 +-
 drivers/net/wireless/hostap/hostap_main.c     |    6 +-
 drivers/net/wireless/hostap/hostap_pci.c      |   16 +-
 drivers/net/wireless/hostap/hostap_wlan.h     |  202 ++--
 drivers/net/wireless/ipw2100.c                |   10 +-
 drivers/net/wireless/ipw2200.c                |  175 ++--
 drivers/net/wireless/ipw2200.h                |  190 ++--
 drivers/net/wireless/iwlwifi/Kconfig          |    4 +-
 drivers/net/wireless/iwlwifi/iwl-4965.c       |    2 +-
 drivers/net/wireless/iwlwifi/iwl3945-base.c   |    4 +-
 drivers/net/wireless/iwlwifi/iwl4965-base.c   |    4 +-
 drivers/net/wireless/p54common.c              |    8 +-
 drivers/net/wireless/p54pci.c                 |   16 +-
 drivers/net/wireless/p54pci.h                 |    4 +-
 drivers/net/wireless/prism54/isl_38xx.h       |   10 +-
 drivers/net/wireless/prism54/isl_ioctl.c      |   12 +-
 drivers/net/wireless/prism54/islpci_eth.c     |    2 +-
 drivers/net/wireless/prism54/islpci_eth.h     |   38 +-
 drivers/net/wireless/prism54/islpci_mgt.h     |    2 +-
 drivers/net/wireless/ray_cs.c                 |   69 +-
 drivers/net/wireless/rt2x00/rt2400pci.c       |  102 +--
 drivers/net/wireless/rt2x00/rt2500pci.c       |   88 +--
 drivers/net/wireless/rt2x00/rt2500usb.c       |   36 +-
 drivers/net/wireless/rt2x00/rt2x00.h          |   32 +-
 drivers/net/wireless/rt2x00/rt2x00debug.c     |   13 +-
 drivers/net/wireless/rt2x00/rt2x00dev.c       |  142 +++-
 drivers/net/wireless/rt2x00/rt2x00lib.h       |    4 +-
 drivers/net/wireless/rt2x00/rt2x00mac.c       |   59 +-
 drivers/net/wireless/rt2x00/rt2x00pci.c       |   28 +-
 drivers/net/wireless/rt2x00/rt2x00ring.h      |   13 +
 drivers/net/wireless/rt2x00/rt2x00usb.c       |   87 +-
 drivers/net/wireless/rt2x00/rt2x00usb.h       |    5 +-
 drivers/net/wireless/rt2x00/rt61pci.c         |  114 +--
 drivers/net/wireless/rt2x00/rt73usb.c         |   35 +-
 drivers/net/wireless/rtl8187_rtl8225.c        |    8 +-
 drivers/net/wireless/wavelan_cs.p.h           |    2 +-
 drivers/net/wireless/zd1211rw/zd_mac.c        |   17 +-
 drivers/ssb/b43_pci_bridge.c                  |    1 +
 drivers/ssb/pci.c                             |   76 ++-
 drivers/ssb/pcmcia.c                          |   71 +-
 include/linux/ssb/ssb.h                       |   29 +-
 include/linux/ssb/ssb_regs.h                  |   38 +-
 include/net/ieee80211.h                       |    6 +-
 net/ieee80211/ieee80211_crypt_tkip.c          |   22 +-
 net/ieee80211/ieee80211_rx.c                  |   47 +-
 net/ieee80211/ieee80211_tx.c                  |   14 +-
 net/ieee80211/softmac/ieee80211softmac_auth.c |    6 +-
 net/ieee80211/softmac/ieee80211softmac_io.c   |   10 +-
 74 files changed, 1972 insertions(+), 2139 deletions(-)

Omnibus patch attached as 'upstream-jgarzik.patch.bz2'.
-- 
John W. Linville
linville@tuxdriver.com

^ permalink raw reply

* Re: SACK scoreboard
From: John Heffner @ 2008-01-09  4:27 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David Miller, ilpo.jarvinen, lachlan.andrew, netdev, quetchen
In-Reply-To: <p73sl17pwfy.fsf@bingen.suse.de>

Andi Kleen wrote:
> David Miller <davem@davemloft.net> writes:
>> The big problem is that recovery from even a single packet loss in a
>> window makes us run kfree_skb() for a all the packets in a full
>> window's worth of data when recovery completes.
> 
> Why exactly is it a problem to free them all at once? Are you worried
> about kernel preemption latencies?
> 
> -Andi


I also wonder how much of a problem this is (for now, with window sizes 
of order 10000 packets.  My understanding is that the biggest problems 
arise from O(N^2) time for recovery because every ack was expensive. 
Have current tests shown the final ack to be a major source of problems?

   -John

^ permalink raw reply

* Re: [IPV4] ROUTE: ip_rt_dump() is unecessary slow
From: Herbert Xu @ 2008-01-09  6:02 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, dipankar, netdev
In-Reply-To: <47831A37.5000607@cosmosbay.com>

Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
> Very good question, but honestly I really dont see why it was there at the 
> first place :

It was there because someone went through this file and robotically
replaced all conditional read barriers with rcu_dereference, even when
it made zero sense.

Basically you can add a conditional barrier either at the point where
the pointer gets read, or where it gets derferenced.  Previously we
did the latter (except that the show function didn't have a barrier
at all which is technically a bug though harmless in pratice).  This
patch moves it to the spot where it gets read which is also OK.

> static struct rtable *rt_cache_get_next(struct seq_file *seq, struct rtable *r)
> {
> -       struct rt_cache_iter_state *st = rcu_dereference(seq->private);
> +       struct rt_cache_iter_state *st = seq->private;
> 
> -       r = r->u.dst.rt_next;
> +       r = rcu_dereference(r->u.dst.rt_next);
>        while (!r) {
>                rcu_read_unlock_bh();
>                if (--st->bucket < 0)
>                        break;
>                rcu_read_lock_bh();
> -               r = rt_hash_table[st->bucket].chain;
> +               r = rcu_dereference(rt_hash_table[st->bucket].chain);
>        }
>        return r;

Slight optimisation: please move both barriers onto the return statement,
i.e.,

	return rcu_dereference(r);

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: SACK scoreboard
From: linux @ 2008-01-09  6:04 UTC (permalink / raw)
  To: netdev; +Cc: linux

Just some idle brainstorming on the subject...

It seems the only way to handle network pipes sigificantly larger (delay *
bandwidth product) than the processor cache is to make freeing retransmit
data o(n).

Now, there are some ways to reduce the constant factor.  The one that
comes to mind first is to not queue sk_buffs.  Throw away the struct
sk_buff after transmission and just queue skb_frag_structs, pages, or
maybe even higher-order pages of data.  Then freeing the data when it's
acked has a much smaller constant factor, particularly d-cache footprint,
and no slab operations.

The downside is more work to recreate the skb if you do have to
retransmit, but optimizing for retransmits is silly.

Some implementations could leave large chunks of memory locked until
all of the sk_buff->skb_shared_info->skb_frag_structs referencing them
have gone away, but you can look at the transmit window when deciding
how big a chunk size to use.


Then, to actually get below O(n), you want to keep the queued data in
a data structure known to the memory manager.  Basically, splice the
retransmit queue onto the free list.

It may require some kludgery in the memory manager.  In particular, doing
that in O(1) time obviously means that you can't coalesce adjacent free
regions to build higher-order pages.  So you'd have to have a threshold
for uncoalesced pages and a way to force coalescing under memory pressure.

You're just deferring work until the page is allocated, but the point
is that then it's okay to bring it into cache when it's about to be
used again.  It's the redundant round trip just because an ack arrived
that's annoying.

I've done thins sort of thing with specialized fixed-block-size allocators
before (an alpha-beta minimax search tree allocates nodes one at a
time, but frees whole subtrees at once), but might it be feasible for
kernel use?

^ permalink raw reply

* Re: [PATCH 0/3] bonding: 3 fixes for 2.6.24
From: Herbert Xu @ 2008-01-09  6:08 UTC (permalink / raw)
  To: Andy Gospodarek; +Cc: olel, fubar, netdev, jgarzik, davem, andy
In-Reply-To: <20080108191706.GD8728@gospo.usersys.redhat.com>

Andy Gospodarek <andy@greyhouse.net> wrote:
> 
> Jay's patches will not fix this issue.  I think something like this did
> it for me, but as I mentioned to Jay in the last thread, I'm not
> convinced it doesn't violate some of the locking expectations we have.
> 
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 423298c..3c6619a 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -3915,7 +3915,7 @@ static void bond_set_multicast_list(struct net_device *bond_dev)
>        struct bonding *bond = bond_dev->priv;
>        struct dev_mc_list *dmi;
> 
> -       write_lock_bh(&bond->lock);
> +       read_lock(&bond->lock);

This is wrong.  As I said before, the correct fix (until set_multicast
no longer gets called from BH context) is to change all you process
context read_lock's to read_lock_bh.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [XFRM]: xfrm_algo_clone() allocates too much memory
From: Herbert Xu @ 2008-01-09  6:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev
In-Reply-To: <4783C2E2.5060102@cosmosbay.com>

Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
> diff --git a/include/net/xfrm.h b/include/net/xfrm.h
> index 58dfa82..731f0a8 100644
> --- a/include/net/xfrm.h
> +++ b/include/net/xfrm.h
> @@ -1188,10 +1188,15 @@ static inline int xfrm_aevent_is_on(void)
>        return ret;
> }
> 
> +static inline int alg_len(struct xfrm_algo *alg)

Could you please add an xfrm prefix to this?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [SOCK]: Adds a rcu_dereference() in sk_filter
From: Herbert Xu @ 2008-01-09  6:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev, dim, kuznet
In-Reply-To: <4783DDBF.2040303@cosmosbay.com>

Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
> It seems commit fda9ef5d679b07c9d9097aaf6ef7f069d794a8f9 introduced a RCU 
> protection for sk_filter(), without a rcu_dereference()
> 
> Either we need a rcu_dereference(), either a comment should explain why we 
> dont need it. I vote for the former.
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 5/7]: [NET]: Fix drivers to handle napi_disable() disabling interrupts.
From: Don Fry @ 2008-01-09  6:14 UTC (permalink / raw)
  To: netdev; +Cc: davem

Tested pcnet32 on x86_64 box and see no problems with the change.
The code is only exercised if doing loopback testing, or changing
the ring size during a receive storm.

Acked-by:  Don Fry <pcnet32@verizon.net>
---

[NET]: Fix drivers to handle napi_disable() disabling interrupts.

When we add the generic napi_disable_pending() breakout
logic to net_rx_action() it means that napi_disable()
can cause NAPI poll interrupt events to be disabled.

And this is exactly what we want.  If a napi_disable()
is pending, and we are looping in the ->poll(), we want
->poll() event interrupts to stay disabled and we want
to complete the NAPI poll ASAP.

When ->poll() break out during device down was being handled on a
per-driver basis, often these drivers would turn interrupts back on
when '!netif_running()' was detected.

And this would just cause a reschedule of the NAPI ->poll() in the
interrupt handler before the napi_disable() could get in there and
grab the NAPI_STATE_SCHED bit.

The vast majority of drivers don't care if napi_disable() might have
the side effect of disabling NAPI ->poll() event interrupts.  In all
such cases, when a napi_disable() is performed, the driver just
disabled interrupts or is about to.

However there were three exceptions to this in PCNET32, R8169, and
SKY2.  To fix those cases, at the subsequent napi_enable() points, I
added code to ensure that the ->poll() interrupt events are enabled in
the hardware.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/pcnet32.c |    5 +++++
 drivers/net/r8169.c   |    2 ++
 drivers/net/sky2.c    |    3 +++
 3 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/drivers/net/pcnet32.c b/drivers/net/pcnet32.c
index ff92aca..90498ff 100644
--- a/drivers/net/pcnet32.c
+++ b/drivers/net/pcnet32.c
@@ -455,9 +455,14 @@ static void pcnet32_netif_start(struct net_device
*dev)
 {
 #ifdef CONFIG_PCNET32_NAPI
 	struct pcnet32_private *lp = netdev_priv(dev);
+	ulong ioaddr = dev->base_addr;
+	u16 val;
 #endif
 	netif_wake_queue(dev);
 #ifdef CONFIG_PCNET32_NAPI
+	val = lp->a.read_csr(ioaddr, CSR3);
+	val &= 0x00ff;
+	lp->a.write_csr(ioaddr, CSR3, val);
 	napi_enable(&lp->napi);
 #endif
 }
-- 
1.5.4.rc2.38.gd6da3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




^ permalink raw reply related

* Re: SACK scoreboard
From: David Miller @ 2008-01-09  6:35 UTC (permalink / raw)
  To: lachlan.andrew; +Cc: jheffner, ilpo.jarvinen, netdev, quetchen
In-Reply-To: <aa7d2c6d0801081734t363768ffi31adb3bd26e639c7@mail.gmail.com>

From: "Lachlan Andrew" <lachlan.andrew@gmail.com>
Date: Tue, 8 Jan 2008 17:34:03 -0800

> John also suggested freeing the packets as a lower priority task, just
> doing it after they're acknowledged.
> 
> When the ACK finally comes, you could do something like moving John's
> entire list of packets to a "to be freed" list, and free a few every
> time (say) another ACK comes in.

You could, but this requires extra state and the operations to
properly queue such items to the deferred task and wake it up.

And none of this would be necessary if we could make SACK
indications hard.

^ permalink raw reply

* Re: [PATCH 0/3] bonding: 3 fixes for 2.6.24
From: Krzysztof Oledzki @ 2008-01-09  6:35 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: netdev, Jeff Garzik, David Miller, Andy Gospodarek
In-Reply-To: <29560.1199820632@death>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2403 bytes --]



On Tue, 8 Jan 2008, Jay Vosburgh wrote:

> Krzysztof Oledzki <olel@ans.pl> wrote:
>
>> On Mon, 7 Jan 2008, Jay Vosburgh wrote:
>>
>>> 	Following are three fixes to fix locking problems and
>>> silence locking-related warnings in the current 2.6.24-rc.
>>>
>>> 	patch 1: fix locking in sysfs primary/active selection
>>>
>>> 	Call core network functions with expected locks to
>>> eliminate potential deadlock and silence warnings.
>>>
>>> 	patch 2: fix ASSERT_RTNL that produces spurious warnings
>>>
>>> 	Relocate ASSERT_RTNL to remove a false warning; after patch,
>>> ASSERT is located in code that holds only RTNL (additional locks were
>>> causing the ASSERT to trip)
>>>
>>> 	patch 3: fix locking during alb failover and slave removal
>>>
>>> 	Fix all call paths into alb_fasten_mac_swap to hold only RTNL.
>>> Eliminates deadlock and silences warnings.
>>>
>>> 	Patches are against the current netdev-2.6#upstream branch.
>>>
>>> 	Please apply for 2.6.24.
>>
>> 2.6.24-rc7 + patches #1, #2, #3:
>>
>> bonding: bond0: setting mode to active-backup (1).
>> bonding: bond0: Setting MII monitoring interval to 100.
>> ADDRCONF(NETDEV_UP): bond0: link is not ready
>> bonding: bond0: Adding slave eth0.
>> e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
>> bonding: bond0: making interface eth0 the new active one.
>> bonding: bond0: first active interface up!
>> bonding: bond0: enslaving eth0 as an active interface with an up link.
>> bonding: bond0: Adding slave eth1.
>> ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
>>
>> =========================================================
>> [ INFO: possible irq lock inversion dependency detected ]
>> 2.6.24-rc7 #1
>> ---------------------------------------------------------
>> events/0/9 just changed the state of lock:
>> (&mc->mca_lock){-+..}, at: [<c041258e>] mld_ifc_timer_expire+0x130/0x1fb
>> but this lock took another, soft-read-irq-unsafe lock in the past:
>> (&bond->lock){-.--}
>>
>> and interrupts could create inverse lock ordering between them.
>
> 	Just to be clear: the patch set I posted yesterday was not
> intended to resolve the lockdep problem; I haven't studied that one yet.

Fine. Just let you know that someone test your patches and everything 
works, except mentioned problem.

Best regards,

 				Krzysztof Olędzki

^ permalink raw reply

* Re: SACK scoreboard
From: David Miller @ 2008-01-09  6:39 UTC (permalink / raw)
  To: andi; +Cc: jheffner, ilpo.jarvinen, lachlan.andrew, netdev, quetchen
In-Reply-To: <p73sl17pwfy.fsf@bingen.suse.de>

From: Andi Kleen <andi@firstfloor.org>
Date: Wed, 09 Jan 2008 03:25:05 +0100

> David Miller <davem@davemloft.net> writes:
> >
> > The big problem is that recovery from even a single packet loss in a
> > window makes us run kfree_skb() for a all the packets in a full
> > window's worth of data when recovery completes.
> 
> Why exactly is it a problem to free them all at once? Are you worried
> about kernel preemption latencies?

If the cpu is there spinning freeing up 500,000 SKBs, it isn't
processing RX packets.

It adds severe spikes in CPU utilization that are even moderate
line rates begins to affect RTTs.

Or do you think it's OK to process 500,000 SKBs while locked
in a software interrupt.

Perhaps you have another broken awk script to prove this :-)

^ permalink raw reply

* Re: [ANNOUNCE] bridge-utils 1.4
From: Denys Fedoryshchenko @ 2008-01-09  6:40 UTC (permalink / raw)
  To: Stephen Hemminger, bridge; +Cc: netdev
In-Reply-To: <20080108085607.37ab168f@deepthought>


As mentioned in
http://marc.info/?l=linux-bridge&m=113105949718826&w=2

Released package doesn't contain ./configure script

For people who know what is make on, it is easy to run autoconf , but some 
know only how to use ./configure :-)

Other than this, it is works fine with me, but i didn't test it deeply yet.

On Tue, 8 Jan 2008 08:56:07 -0800, Stephen Hemminger wrote
> Minor update to bridge-utils. Mostly fixing bugs in usage of sysfs.
> 
> Release tarball:
>   http://downloads.sourceforge.net/bridge/bridge-utils-1.4.tar.gz
> 
> Alon Bar-Lev (1):
>       Allow bridge-utils to run when no TCP/IP is available
> 
> Denys Vlasenko (1):
>       fix use of sysfs (affects 32/64 bit compat)
> 
> Jeremy Jackson (1):
>       Fix parsing of port_id's (hex).
> 
> Stephen Hemminger (3):
>       Add ignore for generated files.
>       Update gitignore
>       Use linux/if.h rather than net/if.h
> 
> -- 
> Stephen Hemminger <stephen.hemminger@vyatta.com>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.


^ permalink raw reply

* Re: SACK scoreboard
From: David Miller @ 2008-01-09  6:41 UTC (permalink / raw)
  To: jheffner; +Cc: andi, ilpo.jarvinen, lachlan.andrew, netdev, quetchen
In-Reply-To: <47844D1C.1060706@psc.edu>

From: John Heffner <jheffner@psc.edu>
Date: Tue, 08 Jan 2008 23:27:08 -0500

> I also wonder how much of a problem this is (for now, with window sizes 
> of order 10000 packets.  My understanding is that the biggest problems 
> arise from O(N^2) time for recovery because every ack was expensive. 
> Have current tests shown the final ack to be a major source of problems?

Yes, several people have reported this.

It's a real issue, and it's only going to get worse.

^ permalink raw reply

* Re: [PATCH 5/7]: [NET]: Fix drivers to handle napi_disable() disabling interrupts.
From: David Miller @ 2008-01-09  6:44 UTC (permalink / raw)
  To: pcnet32; +Cc: netdev
In-Reply-To: <1199859269.9002.6.camel@Linux.home>

From: Don Fry <pcnet32@verizon.net>
Date: Tue, 08 Jan 2008 22:14:29 -0800

> Tested pcnet32 on x86_64 box and see no problems with the change.
> The code is only exercised if doing loopback testing, or changing
> the ring size during a receive storm.
> 
> Acked-by:  Don Fry <pcnet32@verizon.net>

Thanks for testing and the feedback.

^ permalink raw reply

* Re: SACK scoreboard
From: Andi Kleen @ 2008-01-09  7:03 UTC (permalink / raw)
  To: David Miller
  Cc: andi, jheffner, ilpo.jarvinen, lachlan.andrew, netdev, quetchen
In-Reply-To: <20080108.223925.105105455.davem@davemloft.net>

> It adds severe spikes in CPU utilization that are even moderate
> line rates begins to affect RTTs.
> 
> Or do you think it's OK to process 500,000 SKBs while locked
> in a software interrupt.

You can always push it into a work queue.  Even put it to
other cores if you want. 

In fact this is already done partly for the ->completion_queue.
Wouldn't be a big change to queue it another level down.

Also even freeing a lot of objects doesn't have to be
that expensive. I suspect the most cost is in taking
the slab locks, but that could be batched. Without
that the kmem_free fast path isn't particularly
expensive, as long as the headers are still in cache.

Long ago I had sk_buff optimized for the routing case so that freeing
can be all done with a single cache line. That is 
long gone, but could be probably restored.

But asking for protocol changes just to work around such a
internal implementation detail is ... <I miss words> 

> Perhaps you have another broken awk script to prove this :-)

Your hand waved numbers on inline sizes there were definitely worse 
than mine. 

-Andi


^ permalink raw reply

* Re: SACK scoreboard
From: David Miller @ 2008-01-09  7:16 UTC (permalink / raw)
  To: andi; +Cc: jheffner, ilpo.jarvinen, lachlan.andrew, netdev, quetchen
In-Reply-To: <20080109070318.GA8581@one.firstfloor.org>

From: Andi Kleen <andi@firstfloor.org>
Date: Wed, 9 Jan 2008 08:03:18 +0100

> Also even freeing a lot of objects doesn't have to be
> that expensive.  I suspect the most cost is in taking
> the slab locks, but that could be batched.

We're touching SKB struct members, doing atomics on them, etc. for
objects we haven't referenced for at least two RTTs so are guarenteed
to be cache cold.

Let's say best case we can get it down to 2 cache line misses, and as
a very aggressive goal we can get the cost down to 100 cycles.  For
500,000 packets this is 500 million cpu cycles to free them all up.

That's 1/4 of a second even on a 2 GHZ cpu.

And yes there are inherent costs in handling TCP windows that are
500,000 packets in size.  But, that freeing cost should be spread
throughout the handling of the RTT feedback, not handled all at once.

> Your hand waved numbers on inline sizes there were definitely worse 
> than mine. 

Your primary objective seems to be "being right", and that's fine but
realize that it makes discussing anything with you about as fun as
picking one's toe nails with an ice axe.

Eventually you will be ignored by most folks who get fed up by this
style of argument.

So, have fun being right rather than being pleasant to work with.

^ permalink raw reply

* Re: [XFRM]: xfrm_algo_clone() allocates too much memory
From: Eric Dumazet @ 2008-01-09  7:29 UTC (permalink / raw)
  To: Herbert Xu; +Cc: davem, netdev
In-Reply-To: <E1JCUDY-000791-00@gondolin.me.apana.org.au>

[-- Attachment #1: Type: text/plain, Size: 1020 bytes --]

Herbert Xu a écrit :
> Eric Dumazet <dada1@cosmosbay.com> wrote:
>> diff --git a/include/net/xfrm.h b/include/net/xfrm.h
>> index 58dfa82..731f0a8 100644
>> --- a/include/net/xfrm.h
>> +++ b/include/net/xfrm.h
>> @@ -1188,10 +1188,15 @@ static inline int xfrm_aevent_is_on(void)
>>        return ret;
>> }
>>
>> +static inline int alg_len(struct xfrm_algo *alg)
> 
> Could you please add an xfrm prefix to this?

Sure, thanks for the suggestion :)

[XFRM]: xfrm_algo_clone() allocates too much memory

alg_key_len is the length in bits of the key, not in bytes.

Best way to fix this is to move alg_len() function from net/xfrm/xfrm_user.c 
to include/net/xfrm.h, and to use it in xfrm_algo_clone()

alg_len() is renamed to xfrm_alg_len() because of its global exposition.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

  include/net/xfrm.h   |    7 ++++++-
  net/xfrm/xfrm_user.c |   17 ++++++-----------
  2 files changed, 12 insertions(+), 12 deletions(-)

[-- Attachment #2: xfrm_algo_clone.patch --]
[-- Type: text/plain, Size: 2471 bytes --]

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 58dfa82..1dd20cf 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1188,10 +1188,15 @@ static inline int xfrm_aevent_is_on(void)
 	return ret;
 }
 
+static inline int xfrm_alg_len(struct xfrm_algo *alg)
+{
+	return sizeof(*alg) + ((alg->alg_key_len + 7) / 8);
+}
+
 #ifdef CONFIG_XFRM_MIGRATE
 static inline struct xfrm_algo *xfrm_algo_clone(struct xfrm_algo *orig)
 {
-	return (struct xfrm_algo *)kmemdup(orig, sizeof(*orig) + orig->alg_key_len, GFP_KERNEL);
+	return kmemdup(orig, xfrm_alg_len(orig), GFP_KERNEL);
 }
 
 static inline void xfrm_states_put(struct xfrm_state **states, int n)
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index e75dbdc..c4f6419 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -31,11 +31,6 @@
 #include <linux/in6.h>
 #endif
 
-static inline int alg_len(struct xfrm_algo *alg)
-{
-	return sizeof(*alg) + ((alg->alg_key_len + 7) / 8);
-}
-
 static int verify_one_alg(struct nlattr **attrs, enum xfrm_attr_type_t type)
 {
 	struct nlattr *rt = attrs[type];
@@ -45,7 +40,7 @@ static int verify_one_alg(struct nlattr **attrs, enum xfrm_attr_type_t type)
 		return 0;
 
 	algp = nla_data(rt);
-	if (nla_len(rt) < alg_len(algp))
+	if (nla_len(rt) < xfrm_alg_len(algp))
 		return -EINVAL;
 
 	switch (type) {
@@ -204,7 +199,7 @@ static int attach_one_algo(struct xfrm_algo **algpp, u8 *props,
 		return -ENOSYS;
 	*props = algo->desc.sadb_alg_id;
 
-	p = kmemdup(ualg, alg_len(ualg), GFP_KERNEL);
+	p = kmemdup(ualg, xfrm_alg_len(ualg), GFP_KERNEL);
 	if (!p)
 		return -ENOMEM;
 
@@ -516,9 +511,9 @@ static int copy_to_user_state_extra(struct xfrm_state *x,
 		NLA_PUT_U64(skb, XFRMA_LASTUSED, x->lastused);
 
 	if (x->aalg)
-		NLA_PUT(skb, XFRMA_ALG_AUTH, alg_len(x->aalg), x->aalg);
+		NLA_PUT(skb, XFRMA_ALG_AUTH, xfrm_alg_len(x->aalg), x->aalg);
 	if (x->ealg)
-		NLA_PUT(skb, XFRMA_ALG_CRYPT, alg_len(x->ealg), x->ealg);
+		NLA_PUT(skb, XFRMA_ALG_CRYPT, xfrm_alg_len(x->ealg), x->ealg);
 	if (x->calg)
 		NLA_PUT(skb, XFRMA_ALG_COMP, sizeof(*(x->calg)), x->calg);
 
@@ -1978,9 +1973,9 @@ static inline size_t xfrm_sa_len(struct xfrm_state *x)
 {
 	size_t l = 0;
 	if (x->aalg)
-		l += nla_total_size(alg_len(x->aalg));
+		l += nla_total_size(xfrm_alg_len(x->aalg));
 	if (x->ealg)
-		l += nla_total_size(alg_len(x->ealg));
+		l += nla_total_size(xfrm_alg_len(x->ealg));
 	if (x->calg)
 		l += nla_total_size(sizeof(*x->calg));
 	if (x->encap)

^ permalink raw reply related

* Re: Please pull 'upstream-davem' branch of wireless-2.6
From: David Miller @ 2008-01-09  7:36 UTC (permalink / raw)
  To: linville-2XuSBdqkA4R54TAoqtyWWQ
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20080108192914.GA3086-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>

From: "John W. Linville" <linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>
Date: Tue, 8 Jan 2008 14:29:14 -0500

> Here are a few more for 2.6.25.  The are mostly clean-ups for the new
> PID rate control algorithm, and some A-MPDU bits related to supporting
> 802.11n.

Pulled and pushed back out to net-2.6.25, thanks John.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox