Netdev List
* Re: bridging + load balancing bonding
From: Eric Dumazet @ 2009-10-22 15:41 UTC (permalink / raw)
  To: Jasper Spaans; +Cc: netdev
In-Reply-To: <20091022122339.GA20148@spaans.fox.local>

Jasper Spaans wrote:
> Hi,
> 
> We're using the following setup for bonding and bridging, to be able to put
> large amounts of data through multiple IDS analyzers:
> 
>                              +---[br0]----+     +--- eth1 ---(IDS machine 1)
> (Span port from switch) -- eth0          bond0--+
>                                                 +--- eth2 ---(IDS machine 2)
> 
> eth0 receives network traffic, which should be passed to machines which are
> connected to eth1 and eth2. These machines run an IDS package, and there are
> two of those for performance reasons.
> 
> bond0 is configured to load balance the packets using "balance-xor", in this
> case combined with xmit_hash_policy layer2.
> 
> However, we're seeing problems: packets from one flow do not end up at the
> same IDS machine.  This is because this selection is not based on the source
> _and_ destination mac addresses of the original packet, but on the mac
> address of the bonding device and the destination mac address of the
> packet.
> 
> This is also clear in the code:
> For example, in bond_main.c, in bond_xmit_hash_policy_l2:
> 	return (data->h_dest[5] ^ bond_dev->dev_addr[5]) % count;
> 
> Changing this to
> 	return (data->h_dest[5] ^ data->h_source[5]) % count;
> fixes our problems, but is this harmful for packets originating locally (or
> being routed?)
> 
> If not, can this be applied? Or does anyone have other ideas?
> 

Hi Jasper

Very nice setup, and nice finding.

Don't locally generated (or routed) packets have h_source set to bond_dev->dev_addr anyway?

So your solution might be the right fix...
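
To make the difference concrete, here is a small userspace sketch (not the
kernel code itself). struct eth_addrs and both helper names are made up for
illustration; only the two XOR expressions come from the mail above.

```c
#include <stdint.h>

/* Hypothetical stand-in for the address part of the kernel's struct ethhdr. */
struct eth_addrs {
	uint8_t h_dest[6];
	uint8_t h_source[6];
};

/* Current layer2 policy: destination MAC XOR the bond device's own MAC. */
static int hash_l2_bond_addr(const struct eth_addrs *eth,
			     const uint8_t bond_addr[6], int count)
{
	return (eth->h_dest[5] ^ bond_addr[5]) % count;
}

/* Proposed policy: destination MAC XOR the source MAC of the frame itself. */
static int hash_l2_src_dst(const struct eth_addrs *eth, int count)
{
	return (eth->h_dest[5] ^ eth->h_source[5]) % count;
}
```

For a bridged flow between hosts A and B, the first helper can pick a
different slave for each direction, because only the destination varies. The
second is symmetric in source and destination, so both directions land on the
same slave. For locally generated frames, where h_source equals the bond
address, the two helpers return the same value.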

About other ideas... I was thinking of the TEE target (not in mainline, unfortunately):

iptables -t mangle -A PREROUTING -i eth0 <some hash on mac addr>  -j TEE --gateway 192.168.99.1  # IDS1
iptables -t mangle -A PREROUTING -i eth0 !<some hash on mac addr>  -j TEE --gateway 192.168.99.2  # IDS2




* Problem with MDI/MDI-X auto-switching in E100 driver
From: Eugene T. Bordenkircher @ 2009-10-22 15:38 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: linux.nics

Around line 1466 of e100.c (git master) is the following code to turn on 
MDI/MDI-X auto-switching if it is not already.

         } else if ((nic->mac >= mac_82550_D102) || ((nic->flags & ich) &&
             (mdio_read(netdev, nic->mii.phy_id, MII_TPISTATUS) & 0x8000) &&
                  !(nic->eeprom[eeprom_cnfg_mdix] & eeprom_mdix_enabled))) {
                  /* enable/disable MDI/MDI-X auto-switching. */
                  mdio_write(netdev, nic->mii.phy_id, MII_NCONFIG,
                                  nic->mii.force_media ? 0 : NCONFIG_AUTO_SWITCH);
         }

This code is broken in the case where an 8255x is used without magnetics.  Per 
Intel Application note 435, without the magnetics, auto switching is not 
possible.  The only way to turn this off without driver modifications is to set 
the force_media flag via ethtool, which has the side effect of turning off all 
auto-negotiation. This happens to be the case on a product I am currently 
working on.

It seems a better solution to this is to trust the eeprom's configuration 
rather than override it.  Am I missing something or does this sound reasonable?

Eugene T. Bordenkircher
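
A rough sketch of what "trust the EEPROM" could look like as a standalone
decision function. This is not driver code: the function name is hypothetical,
and both the bit value 0x0080 and its polarity (set means auto-switching is
allowed) are assumptions that would need checking against e100.c and the
8255x documentation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value and polarity of eeprom_mdix_enabled; verify before use. */
#define EEPROM_MDIX_ENABLED 0x0080

/*
 * Decide whether to write NCONFIG_AUTO_SWITCH, based only on the EEPROM
 * configuration word and the force_media setting.
 */
static bool want_auto_switch(uint16_t eeprom_cnfg_mdix_word, bool force_media)
{
	if (force_media)
		return false;	/* user forced the media type */
	return (eeprom_cnfg_mdix_word & EEPROM_MDIX_ENABLED) != 0;
}
```

With this shape, a board without magnetics could leave the bit clear in its
EEPROM and keep auto-negotiation, instead of having to set force_media.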


* Re: [PATCH 0/5] Candidate fix for increased number of GFP_ATOMIC failures V2
From: Pekka Enberg @ 2009-10-22 14:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
	Tobias Oetiker, Rafael J. Wysocki, David Miller, Reinette Chatre,
	Kalle Valo, David Rientjes, KOSAKI Motohiro, Mohamed Abbas,
	Jens Axboe, John W. Linville, Bartlomiej Zolnierkiewicz,
	Greg Kroah-Hartman, Stephan von Krawczynski, Kernel Testers List,
	netdev, linux-kernel, linux-mm@kvack.org, akpm, cl
In-Reply-To: <1256221356-26049-1-git-send-email-mel@csn.ul.ie>

On Thu, Oct 22, 2009 at 5:22 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> Test 1: Verify your problem occurs on 2.6.32-rc5 if you can
>
> Test 2: Apply the following two patches and test again
>
>  1/5 page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
>  2/5 page allocator: Do not allow interrupts to use ALLOC_HARDER

These are pretty obvious bug fixes and should go to linux-next ASAP IMHO.

> Test 5: If things are still screwed, apply the following
>  5/5 Revert 373c0a7e, 8aa7e847: Fix congestion_wait() sync/async vs read/write confusion
>
>        Frans Pop reports that the bulk of his problems go away when this
>        patch is reverted on 2.6.31. There has been some confusion on why
>        exactly this patch was wrong but apparently the conversion was not
>        complete and further work was required. It's unknown if all the
>        necessary work exists in 2.6.31-rc5 or not. If there are still
>        allocation failures and applying this patch fixes the problem,
>        there are still snags that need to be ironed out.

As explained by Jens Axboe, this changes timing but is not the source
of the OOMs so the revert is bogus even if it "helps" on some
workloads. IIRC the person who reported the revert to help things did
report that the OOMs did not go away, they were simply harder to
trigger with the revert.


* Re: [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
From: Pekka Enberg @ 2009-10-22 14:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
	Tobias Oetiker, Rafael J. Wysocki, David Miller, Reinette Chatre,
	Kalle Valo, David Rientjes, KOSAKI Motohiro, Mohamed Abbas,
	Jens Axboe, John W. Linville, Bartlomiej Zolnierkiewicz,
	Greg Kroah-Hartman, Stephan von Krawczynski, Kernel Testers List,
	netdev, linux-kernel, linux-mm@kvack.org
In-Reply-To: <1256221356-26049-2-git-send-email-mel@csn.ul.ie>

On Thu, Oct 22, 2009 at 5:22 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> If a direct reclaim makes no forward progress, it considers whether it
> should go OOM or not. Whether OOM is triggered or not, it may retry the
> allocation afterwards. In times past, this would always wake kswapd as well
> but currently, kswapd is not woken up after direct reclaim fails. For order-0
> allocations, this makes little difference but if there is a heavy mix of
> higher-order allocations that direct reclaim is failing for, it might mean
> that kswapd is not rewoken for higher orders as much as it did previously.
>
> This patch wakes up kswapd when an allocation is being retried after a direct
> reclaim failure. It would be expected that kswapd is already awake, but
> this has the effect of telling kswapd to reclaim at the higher order as well.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

You seem to have dropped the Reviewed-by tags from me and Christoph
for this patch.

>  mm/page_alloc.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bf72055..dfa4362 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1817,9 +1817,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>        if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
>                goto nopage;
>
> +restart:
>        wake_all_kswapd(order, zonelist, high_zoneidx);
>
> -restart:
>        /*
>         * OK, we're below the kswapd watermark and have kicked background
>         * reclaim. Now things get more complex, so set up alloc_flags according
> --
> 1.6.3.3
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>


* [PATCH net-next-2.6] rtnetlink: speedup rtnl_dump_ifinfo()
From: Eric Dumazet @ 2009-10-22 14:34 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List, Benjamin LaHaise

When handling a large number of netdevices, rtnl_dump_ifinfo()
is very slow because it has O(N^2) complexity.

Instead of scanning one single list, we can use the 256 sub-lists
of the dev_index hash table.

This considerably speeds up "ip link" operations.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/net_namespace.h |    4 +++
 net/core/dev.c              |    7 +-----
 net/core/rtnetlink.c        |   37 ++++++++++++++++++++++------------
 3 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 6994101..0addd45 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -28,6 +28,10 @@ struct ctl_table_header;
 struct net_generic;
 struct sock;
 
+
+#define NETDEV_HASHBITS    8
+#define NETDEV_HASHENTRIES (1 << NETDEV_HASHBITS)
+
 struct net {
 	atomic_t		count;		/* To decided when the network
 						 *  namespace should be freed.
diff --git a/net/core/dev.c b/net/core/dev.c
index fa88dcd..e7bada1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -193,18 +193,15 @@ static struct list_head ptype_all __read_mostly;	/* Taps */
 DEFINE_RWLOCK(dev_base_lock);
 EXPORT_SYMBOL(dev_base_lock);
 
-#define NETDEV_HASHBITS	8
-#define NETDEV_HASHENTRIES (1 << NETDEV_HASHBITS)
-
 static inline struct hlist_head *dev_name_hash(struct net *net, const char *name)
 {
 	unsigned hash = full_name_hash(name, strnlen(name, IFNAMSIZ));
-	return &net->dev_name_head[hash & ((1 << NETDEV_HASHBITS) - 1)];
+	return &net->dev_name_head[hash & (NETDEV_HASHENTRIES - 1)];
 }
 
 static inline struct hlist_head *dev_index_hash(struct net *net, int ifindex)
 {
-	return &net->dev_index_head[ifindex & ((1 << NETDEV_HASHBITS) - 1)];
+	return &net->dev_index_head[ifindex & (NETDEV_HASHENTRIES - 1)];
 }
 
 /* Device list insertion */
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index ba13b09..52ea418 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -682,22 +682,33 @@ nla_put_failure:
 static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	struct net *net = sock_net(skb->sk);
-	int idx;
-	int s_idx = cb->args[0];
+	int h, s_h;
+	int idx = 0, s_idx;
 	struct net_device *dev;
-
-	idx = 0;
-	for_each_netdev(net, dev) {
-		if (idx < s_idx)
-			goto cont;
-		if (rtnl_fill_ifinfo(skb, dev, RTM_NEWLINK,
-				     NETLINK_CB(cb->skb).pid,
-				     cb->nlh->nlmsg_seq, 0, NLM_F_MULTI) <= 0)
-			break;
+	struct hlist_head *head;
+	struct hlist_node *node;
+
+	s_h = cb->args[0];
+	s_idx = cb->args[1];
+
+	for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
+		idx = 0;
+		head = &net->dev_index_head[h];
+		hlist_for_each_entry(dev, node, head, index_hlist) {
+			if (idx < s_idx)
+				goto cont;
+			if (rtnl_fill_ifinfo(skb, dev, RTM_NEWLINK,
+					     NETLINK_CB(cb->skb).pid,
+					     cb->nlh->nlmsg_seq, 0,
+					     NLM_F_MULTI) <= 0)
+				goto out;
 cont:
-		idx++;
+			idx++;
+		}
 	}
-	cb->args[0] = idx;
+out:
+	cb->args[1] = idx;
+	cb->args[0] = h;
 
 	return skb->len;
 }
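
The resume logic in the patch above can be modelled in userspace as follows.
This is only a sketch: struct dev, struct cursor and dump_some() are
hypothetical names, and the "room" limit stands in for the skb filling up.
The cursor plays the role of cb->args[0] (bucket) and cb->args[1] (position
inside the bucket).

```c
#include <stddef.h>

#define HASHBITS    8
#define HASHENTRIES (1 << HASHBITS)

/* Hypothetical stand-in for a net_device chained into dev_index_head[]. */
struct dev {
	int ifindex;
	struct dev *next;
};

/* State kept between two dump calls. */
struct cursor {
	int h;		/* bucket to resume in */
	int idx;	/* entries of that bucket already emitted */
};

/*
 * Copy up to 'room' ifindex values into 'out', resuming at *cur.
 * Returns the number of values copied; 0 means the dump is complete.
 */
static int dump_some(struct dev *buckets[HASHENTRIES], struct cursor *cur,
		     int *out, int room)
{
	int h, idx = 0, s_idx = cur->idx, n = 0;
	struct dev *d;

	for (h = cur->h; h < HASHENTRIES; h++, s_idx = 0) {
		idx = 0;
		for (d = buckets[h]; d; d = d->next) {
			if (idx >= s_idx) {
				if (n == room)
					goto out;	/* buffer full, resume here */
				out[n++] = d->ifindex;
			}
			idx++;
		}
	}
out:
	cur->h = h;
	cur->idx = idx;
	return n;
}
```

Each call skips at most the already-emitted entries of one bucket, rather
than every entry emitted so far, which is where the quadratic cost of the
single-list walk came from.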


* Re: [PATCH net-next-2.6 1/4] net: introduce mc list helpers
From: Jiri Pirko @ 2009-10-22 14:28 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: netdev, davem, eric.dumazet, jeffrey.t.kirsher, jesse.brandeburg,
	bruce.w.allan, peter.p.waskiewicz.jr, john.ronciak, e1000-devel,
	mchehab, linux-media
In-Reply-To: <1256221112.2785.13.camel@achroite>

Thu, Oct 22, 2009 at 04:18:32PM CEST, bhutchings@solarflare.com wrote:
>On Thu, 2009-10-22 at 15:52 +0200, Jiri Pirko wrote:
>> These helpers should be used by network drivers to access netdev
>> multicast lists.
>[...]
>> +static inline void netdev_mc_walk(struct net_device *dev,
>> +				  void (*func)(void *, unsigned char *),
>> +				  void *data)
>> +{
>> +	struct dev_addr_list *mclist;
>> +	int i;
>> +
>> +	for (i = 0, mclist = dev->mc_list; mclist && i < dev->mc_count;
>> +	     i++, mclist = mclist->next)
>> +		func(data, mclist->dmi_addr);
>> +}
>[...]
>
>We usually implement iteration as macros so that any context doesn't
>have to be squeezed through a single untyped (void *) variable.  A macro
>for this would look something like:
>
>#define netdev_for_each_mc_addr(dev, addr)						\
>	for (addr = (dev)->mc_list ? (dev)->mc_list->dmi_addr : NULL;			\
>	     addr;									\
>	     addr = (container_of(addr, struct dev_addr_list, dmi_addr)->next ?		\
>		     container_of(addr, struct dev_addr_list, dmi_addr)->next->dmi_addr : \
>		     NULL))

I admit this would look better. Going to change this and then repost.

Thanks Ben

>
>Once you change the list type this can presumably be made less ugly.
>
>Ben.
>
>-- 
>Ben Hutchings, Senior Software Engineer, Solarflare Communications
>Not speaking for my employer; that's the marketing department's job.
>They asked us to note that Solarflare product names are trademarked.
>
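
For reference, a userspace sketch of how the macro suggested by Ben would be
used. The struct layouts are simplified stand-ins (only the fields the macro
touches), container_of is a plain offsetof-based version without the kernel's
type checking, and mc_addr_count() is a hypothetical caller.

```c
#include <stddef.h>

/* Simplified container_of; the kernel version adds a type check. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-ins holding only the fields used by the iteration. */
struct dev_addr_list {
	struct dev_addr_list *next;
	unsigned char dmi_addr[32];
};

struct net_device {
	struct dev_addr_list *mc_list;
	int mc_count;
};

#define netdev_for_each_mc_addr(dev, addr)					\
	for (addr = (dev)->mc_list ? (dev)->mc_list->dmi_addr : NULL;		\
	     addr;								\
	     addr = (container_of(addr, struct dev_addr_list, dmi_addr)->next ?	\
		     container_of(addr, struct dev_addr_list, dmi_addr)->next->dmi_addr : \
		     NULL))

/* Example caller: the loop body sees a typed address, no void * context. */
static int mc_addr_count(struct net_device *dev)
{
	unsigned char *addr;
	int n = 0;

	netdev_for_each_mc_addr(dev, addr)
		n++;
	return n;
}
```

Compared with netdev_mc_walk(), the caller keeps its own local variables in
scope inside the loop, so nothing has to be packed behind a callback argument.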


* Against 2.6.31.4 [PATCH 5/5] Revert 373c0a7e, 8aa7e847: Fix congestion_wait() sync/async vs read/write confusion
From: Mel Gorman @ 2009-10-22 14:25 UTC (permalink / raw)
  To: Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
	Tobias Oetiker
  Cc: Rafael J. Wysocki, David Miller, Reinette Chatre, Kalle Valo,
	David Rientjes, KOSAKI Motohiro, Mohamed Abbas, Jens Axboe,
	John W. Linville, Pekka Enberg, Bartlomiej Zolnierkiewicz,
	Greg Kroah-Hartman, Stephan von Krawczynski, Kernel Testers List,
	netdev, linux-kernel, linux-mm@kvack.org
In-Reply-To: <1256221356-26049-6-git-send-email-mel@csn.ul.ie>

This is a clean revert against 2.6.31.4
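
As background for testers, a small sketch of why swapping the constants is
not a pure rename. It assumes READ is 0 and WRITE is 1, as in the kernel
headers of this era; the BLK_RW_* values are the ones shown in the diff
below. congestion_wqh_index() is a hypothetical helper that models only the
array index congestion_wait() uses.

```c
/* Assumed values of the classic rw constants. */
enum { READ = 0, WRITE = 1 };

/* Values as defined in the diff below. */
enum { BLK_RW_ASYNC = 0, BLK_RW_SYNC = 1 };

/*
 * congestion_wait() sleeps on &congestion_wqh[arg], so the argument
 * doubles as the wait queue index.
 */
static int congestion_wqh_index(int arg)
{
	return arg;
}
```

Under those assumptions, congestion_wait(BLK_RW_ASYNC, ...) sleeps on queue 0
while congestion_wait(WRITE, ...) sleeps on queue 1, so the revert changes
which wakeups the callers see, not just the spelling of the argument.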

==== CUT HERE ====
Revert 373c0a7e, 8aa7e847: Fix congestion_wait() sync/async vs read/write confusion

Testing by Frans Pop indicates that, in the 2.6.30..2.6.31 window at least,
the commits 373c0a7e and 8aa7e847 dramatically increased the number of
GFP_ATOMIC failures occurring within a wireless driver. It was never
isolated which of the changes was the exact problem, and it's possible
it has been fixed since.

However, the fixes, if they exist in mainline, have not been back-ported to
-stable, so for the -stable series it might be best just to revert.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 arch/x86/lib/usercopy_32.c  |    2 +-
 drivers/block/pktcdvd.c     |   10 ++++------
 drivers/md/dm-crypt.c       |    2 +-
 fs/fat/file.c               |    2 +-
 fs/fuse/dev.c               |    8 ++++----
 fs/nfs/write.c              |    8 +++-----
 fs/reiserfs/journal.c       |    2 +-
 fs/xfs/linux-2.6/kmem.c     |    4 ++--
 fs/xfs/linux-2.6/xfs_buf.c  |    2 +-
 include/linux/backing-dev.h |   11 +++--------
 include/linux/blkdev.h      |   13 +++++++++----
 mm/backing-dev.c            |    7 ++++---
 mm/memcontrol.c             |    2 +-
 mm/page-writeback.c         |    8 ++++----
 mm/page_alloc.c             |    4 ++--
 mm/vmscan.c                 |    8 ++++----
 16 files changed, 45 insertions(+), 48 deletions(-)

diff --git a/arch/x86/lib/usercopy_32.c b/arch/x86/lib/usercopy_32.c
index 1f118d4..7c8ca91 100644
--- a/arch/x86/lib/usercopy_32.c
+++ b/arch/x86/lib/usercopy_32.c
@@ -751,7 +751,7 @@ survive:
 
 			if (retval == -ENOMEM && is_global_init(current)) {
 				up_read(&current->mm->mmap_sem);
-				congestion_wait(BLK_RW_ASYNC, HZ/50);
+				congestion_wait(WRITE, HZ/50);
 				goto survive;
 			}
 
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 99a506f..83650e0 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -1372,10 +1372,8 @@ try_next_bio:
 	wakeup = (pd->write_congestion_on > 0
 	 		&& pd->bio_queue_size <= pd->write_congestion_off);
 	spin_unlock(&pd->lock);
-	if (wakeup) {
-		clear_bdi_congested(&pd->disk->queue->backing_dev_info,
-					BLK_RW_ASYNC);
-	}
+	if (wakeup)
+		clear_bdi_congested(&pd->disk->queue->backing_dev_info, WRITE);
 
 	pkt->sleep_time = max(PACKET_WAIT_TIME, 1);
 	pkt_set_state(pkt, PACKET_WAITING_STATE);
@@ -2594,10 +2592,10 @@ static int pkt_make_request(struct request_queue *q, struct bio *bio)
 	spin_lock(&pd->lock);
 	if (pd->write_congestion_on > 0
 	    && pd->bio_queue_size >= pd->write_congestion_on) {
-		set_bdi_congested(&q->backing_dev_info, BLK_RW_ASYNC);
+		set_bdi_congested(&q->backing_dev_info, WRITE);
 		do {
 			spin_unlock(&pd->lock);
-			congestion_wait(BLK_RW_ASYNC, HZ);
+			congestion_wait(WRITE, HZ);
 			spin_lock(&pd->lock);
 		} while(pd->bio_queue_size > pd->write_congestion_off);
 	}
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index ed10381..c72a8dd 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -776,7 +776,7 @@ static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)
 		 * But don't wait if split was due to the io size restriction
 		 */
 		if (unlikely(out_of_pages))
-			congestion_wait(BLK_RW_ASYNC, HZ/100);
+			congestion_wait(WRITE, HZ/100);
 
 		/*
 		 * With async crypto it is unsafe to share the crypto context
diff --git a/fs/fat/file.c b/fs/fat/file.c
index f042b96..b28ea64 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -134,7 +134,7 @@ static int fat_file_release(struct inode *inode, struct file *filp)
 	if ((filp->f_mode & FMODE_WRITE) &&
 	     MSDOS_SB(inode->i_sb)->options.flush) {
 		fat_flush_inodes(inode->i_sb, inode, NULL);
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		congestion_wait(WRITE, HZ/10);
 	}
 	return 0;
 }
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 6484eb7..f58ecbc 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -286,8 +286,8 @@ __releases(&fc->lock)
 		}
 		if (fc->num_background == FUSE_CONGESTION_THRESHOLD &&
 		    fc->connected && fc->bdi_initialized) {
-			clear_bdi_congested(&fc->bdi, BLK_RW_SYNC);
-			clear_bdi_congested(&fc->bdi, BLK_RW_ASYNC);
+			clear_bdi_congested(&fc->bdi, READ);
+			clear_bdi_congested(&fc->bdi, WRITE);
 		}
 		fc->num_background--;
 		fc->active_background--;
@@ -414,8 +414,8 @@ static void fuse_request_send_nowait_locked(struct fuse_conn *fc,
 		fc->blocked = 1;
 	if (fc->num_background == FUSE_CONGESTION_THRESHOLD &&
 	    fc->bdi_initialized) {
-		set_bdi_congested(&fc->bdi, BLK_RW_SYNC);
-		set_bdi_congested(&fc->bdi, BLK_RW_ASYNC);
+		set_bdi_congested(&fc->bdi, READ);
+		set_bdi_congested(&fc->bdi, WRITE);
 	}
 	list_add_tail(&req->list, &fc->bg_queue);
 	flush_bg_queue(fc);
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index a34fae2..5693fcd 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -200,10 +200,8 @@ static int nfs_set_page_writeback(struct page *page)
 		struct nfs_server *nfss = NFS_SERVER(inode);
 
 		if (atomic_long_inc_return(&nfss->writeback) >
-				NFS_CONGESTION_ON_THRESH) {
-			set_bdi_congested(&nfss->backing_dev_info,
-						BLK_RW_ASYNC);
-		}
+				NFS_CONGESTION_ON_THRESH)
+			set_bdi_congested(&nfss->backing_dev_info, WRITE);
 	}
 	return ret;
 }
@@ -215,7 +213,7 @@ static void nfs_end_page_writeback(struct page *page)
 
 	end_page_writeback(page);
 	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
-		clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+		clear_bdi_congested(&nfss->backing_dev_info, WRITE);
 }
 
 /*
diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
index 9062220..77f5bb7 100644
--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -997,7 +997,7 @@ static int reiserfs_async_progress_wait(struct super_block *s)
 	DEFINE_WAIT(wait);
 	struct reiserfs_journal *j = SB_JOURNAL(s);
 	if (atomic_read(&j->j_async_throttle))
-		congestion_wait(BLK_RW_ASYNC, HZ / 10);
+		congestion_wait(WRITE, HZ / 10);
 	return 0;
 }
 
diff --git a/fs/xfs/linux-2.6/kmem.c b/fs/xfs/linux-2.6/kmem.c
index 2d3f90a..1cd3b55 100644
--- a/fs/xfs/linux-2.6/kmem.c
+++ b/fs/xfs/linux-2.6/kmem.c
@@ -53,7 +53,7 @@ kmem_alloc(size_t size, unsigned int __nocast flags)
 			printk(KERN_ERR "XFS: possible memory allocation "
 					"deadlock in %s (mode:0x%x)\n",
 					__func__, lflags);
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
+		congestion_wait(WRITE, HZ/50);
 	} while (1);
 }
 
@@ -130,7 +130,7 @@ kmem_zone_alloc(kmem_zone_t *zone, unsigned int __nocast flags)
 			printk(KERN_ERR "XFS: possible memory allocation "
 					"deadlock in %s (mode:0x%x)\n",
 					__func__, lflags);
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
+		congestion_wait(WRITE, HZ/50);
 	} while (1);
 }
 
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 965df12..178c20c 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -412,7 +412,7 @@ _xfs_buf_lookup_pages(
 
 			XFS_STATS_INC(xb_page_retries);
 			xfsbufd_wakeup(0, gfp_mask);
-			congestion_wait(BLK_RW_ASYNC, HZ/50);
+			congestion_wait(WRITE, HZ/50);
 			goto retry;
 		}
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 1d52425..0ec2c59 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -229,14 +229,9 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi)
 				  (1 << BDI_async_congested));
 }
 
-enum {
-	BLK_RW_ASYNC	= 0,
-	BLK_RW_SYNC	= 1,
-};
-
-void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
-void set_bdi_congested(struct backing_dev_info *bdi, int sync);
-long congestion_wait(int sync, long timeout);
+void clear_bdi_congested(struct backing_dev_info *bdi, int rw);
+void set_bdi_congested(struct backing_dev_info *bdi, int rw);
+long congestion_wait(int rw, long timeout);
 
 
 static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 69103e0..998c8e0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -70,6 +70,11 @@ enum rq_cmd_type_bits {
 	REQ_TYPE_ATA_PC,
 };
 
+enum {
+	BLK_RW_ASYNC	= 0,
+	BLK_RW_SYNC	= 1,
+};
+
 /*
  * For request of type REQ_TYPE_LINUX_BLOCK, rq->cmd[0] is the opcode being
  * sent down (similar to how REQ_TYPE_BLOCK_PC means that ->cmd[] holds a
@@ -775,18 +780,18 @@ extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
  * congested queues, and wake up anyone who was waiting for requests to be
  * put back.
  */
-static inline void blk_clear_queue_congested(struct request_queue *q, int sync)
+static inline void blk_clear_queue_congested(struct request_queue *q, int rw)
 {
-	clear_bdi_congested(&q->backing_dev_info, sync);
+	clear_bdi_congested(&q->backing_dev_info, rw);
 }
 
 /*
  * A queue has just entered congestion.  Flag that in the queue's VM-visible
  * state flags and increment the global gounter of congested queues.
  */
-static inline void blk_set_queue_congested(struct request_queue *q, int sync)
+static inline void blk_set_queue_congested(struct request_queue *q, int rw)
 {
-	set_bdi_congested(&q->backing_dev_info, sync);
+	set_bdi_congested(&q->backing_dev_info, rw);
 }
 
 extern void blk_start_queue(struct request_queue *q);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c86edd2..493b468 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -283,6 +283,7 @@ static wait_queue_head_t congestion_wqh[2] = {
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
 	};
 
+
 void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
 {
 	enum bdi_state bit;
@@ -307,18 +308,18 @@ EXPORT_SYMBOL(set_bdi_congested);
 
 /**
  * congestion_wait - wait for a backing_dev to become uncongested
- * @sync: SYNC or ASYNC IO
+ * @rw: READ or WRITE
  * @timeout: timeout in jiffies
  *
  * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
  * write congestion.  If no backing_devs are congested then just wait for the
  * next write to be completed.
  */
-long congestion_wait(int sync, long timeout)
+long congestion_wait(int rw, long timeout)
 {
 	long ret;
 	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
+	wait_queue_head_t *wqh = &congestion_wqh[rw];
 
 	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
 	ret = io_schedule_timeout(timeout);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fd4529d..834509f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1990,7 +1990,7 @@ try_to_free:
 		if (!progress) {
 			nr_retries--;
 			/* maybe some writeback is necessary */
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			congestion_wait(WRITE, HZ/10);
 		}
 
 	}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 81627eb..7687879 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -575,7 +575,7 @@ static void balance_dirty_pages(struct address_space *mapping)
 		if (pages_written >= write_chunk)
 			break;		/* We've done our duty */
 
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		congestion_wait(WRITE, HZ/10);
 	}
 
 	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
@@ -669,7 +669,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
                 if (global_page_state(NR_UNSTABLE_NFS) +
 			global_page_state(NR_WRITEBACK) <= dirty_thresh)
                         	break;
-                congestion_wait(BLK_RW_ASYNC, HZ/10);
+                congestion_wait(WRITE, HZ/10);
 
 		/*
 		 * The caller might hold locks which can prevent IO completion
@@ -715,7 +715,7 @@ static void background_writeout(unsigned long _min_pages)
 		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
 			/* Wrote less than expected */
 			if (wbc.encountered_congestion || wbc.more_io)
-				congestion_wait(BLK_RW_ASYNC, HZ/10);
+				congestion_wait(WRITE, HZ/10);
 			else
 				break;
 		}
@@ -787,7 +787,7 @@ static void wb_kupdate(unsigned long arg)
 		writeback_inodes(&wbc);
 		if (wbc.nr_to_write > 0) {
 			if (wbc.encountered_congestion || wbc.more_io)
-				congestion_wait(BLK_RW_ASYNC, HZ/10);
+				congestion_wait(WRITE, HZ/10);
 			else
 				break;	/* All the old data is written */
 		}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f7e52af..68f75da 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1684,7 +1684,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 			preferred_zone, migratetype);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
-			congestion_wait(BLK_RW_ASYNC, HZ/50);
+			congestion_wait(WRITE, HZ/50);
 	} while (!page && (gfp_mask & __GFP_NOFAIL));
 
 	/*
@@ -1855,7 +1855,7 @@ rebalance:
 	pages_reclaimed += did_some_progress;
 	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
 		/* Wait for some write requests to complete then retry */
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
+		congestion_wait(WRITE, HZ/50);
 		goto rebalance;
 	}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a7eafdb..0e66a6b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1109,7 +1109,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 		 */
 		if (nr_freed < nr_taken && !current_is_kswapd() &&
 		    lumpy_reclaim) {
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			congestion_wait(WRITE, HZ/10);
 
 			/*
 			 * The attempt at page out may have made some
@@ -1726,7 +1726,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 		/* Take a nap, wait for some writeback to complete */
 		if (sc->nr_scanned && priority < DEF_PRIORITY - 2)
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			congestion_wait(WRITE, HZ/10);
 	}
 	/* top priority shrink_zones still had more to do? don't OOM, then */
 	if (!sc->all_unreclaimable && scanning_global_lru(sc))
@@ -1974,7 +1974,7 @@ loop_again:
 		 * another pass across the zones.
 		 */
 		if (total_scanned && priority < DEF_PRIORITY - 2)
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			congestion_wait(WRITE, HZ/10);
 
 		/*
 		 * We do this so kswapd doesn't build up large priorities for
@@ -2247,7 +2247,7 @@ unsigned long shrink_all_memory(unsigned long nr_pages)
 				goto out;
 
 			if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
-				congestion_wait(BLK_RW_ASYNC, HZ / 10);
+				congestion_wait(WRITE, HZ / 10);
 		}
 	}
 
-- 
1.6.3.3



* [PATCH 1/5 Against 2.6.31.4] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
From: Mel Gorman @ 2009-10-22 14:24 UTC (permalink / raw)
  To: Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
	Tobias Oetiker
  Cc: Rafael J. Wysocki, David Miller, Reinette Chatre, Kalle Valo,
	David Rientjes, KOSAKI Motohiro, Mohamed Abbas, Jens Axboe,
	John W. Linville, Pekka Enberg, Bartlomiej Zolnierkiewicz,
	Greg Kroah-Hartman, Stephan von Krawczynski, Kernel Testers List,
	netdev, linux-kernel, linux-mm@kvack.org
In-Reply-To: <1256221356-26049-2-git-send-email-mel@csn.ul.ie>

This is a version that applies against 2.6.31.4
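
The effect of moving the label can be shown with a toy model. Nothing here is
kernel code: wake_all_kswapd_stub() just counts calls, and the "retries"
argument stands in for direct reclaim failing that many times.

```c
/* Number of times the stand-in for wake_all_kswapd() has run. */
static int wake_calls;

static void wake_all_kswapd_stub(void)
{
	wake_calls++;
}

/* Label after the wakeup: kswapd is woken once, however often we retry. */
static int slowpath_old(int retries)
{
	wake_calls = 0;
	wake_all_kswapd_stub();
restart:
	if (retries-- > 0)
		goto restart;	/* direct reclaim failed, try again */
	return wake_calls;
}

/* Label before the wakeup, as in the patch: every retry wakes kswapd. */
static int slowpath_new(int retries)
{
	wake_calls = 0;
restart:
	wake_all_kswapd_stub();
	if (retries-- > 0)
		goto restart;
	return wake_calls;
}
```

With three failed attempts, the old placement wakes kswapd once and the new
one wakes it four times, once per pass through the slow path.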

==== CUT HERE ====
page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed

If a direct reclaim makes no forward progress, it considers whether it
should go OOM or not. Whether OOM is triggered or not, it may retry the
allocation afterwards. In times past, this would always wake kswapd as well
but currently, kswapd is not woken up after direct reclaim fails. For order-0
allocations, this makes little difference but if there is a heavy mix of
higher-order allocations that direct reclaim is failing for, it might mean
that kswapd is not rewoken for higher orders as much as it did previously.

This patch wakes up kswapd when an allocation is being retried after a direct
reclaim failure. It would be expected that kswapd is already awake, but
this has the effect of telling kswapd to reclaim at the higher order as well.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0b3c6cb..239677a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1763,6 +1763,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
 		goto nopage;
 
+restart:
 	wake_all_kswapd(order, zonelist, high_zoneidx);
 
 	/*
@@ -1772,7 +1773,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
-restart:
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
-- 
1.6.3.3



* [PATCH 5/5] ONLY-APPLY-IF-STILL-FAILING Revert 373c0a7e, 8aa7e847: Fix congestion_wait() sync/async vs read/write confusion
From: Mel Gorman @ 2009-10-22 14:22 UTC (permalink / raw)
  To: Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
	Tobias Oetiker
  Cc: Rafael J. Wysocki, David Miller, Reinette Chatre, Kalle Valo,
	David Rientjes, KOSAKI Motohiro, Mohamed Abbas, Jens Axboe,
	John W. Linville, Pekka Enberg, Bartlomiej Zolnierkiewicz,
	Greg Kroah-Hartman, Stephan von Krawczynski, Kernel Testers List,
	netdev, linux-kernel, linux-mm@kvack.org, Mel Gorman
In-Reply-To: <1256221356-26049-1-git-send-email-mel@csn.ul.ie>

Testing by Frans Pop indicates that, in the 2.6.30..2.6.31 window at
least, the commits 373c0a7e and 8aa7e847 dramatically increased the
number of GFP_ATOMIC failures occurring within a wireless driver. It
was never isolated which of the changes was the exact problem, and
it's possible it has been fixed since. If problems are still
occurring with GFP_ATOMIC in 2.6.31-rc5, then this patch should be
applied to determine if the congestion_wait() callers are still broken.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 arch/x86/lib/usercopy_32.c  |    2 +-
 drivers/block/pktcdvd.c     |   10 ++++------
 drivers/md/dm-crypt.c       |    2 +-
 fs/fat/file.c               |    2 +-
 fs/fuse/dev.c               |    8 ++++----
 fs/nfs/write.c              |    8 +++-----
 fs/reiserfs/journal.c       |    2 +-
 fs/xfs/linux-2.6/kmem.c     |    4 ++--
 fs/xfs/linux-2.6/xfs_buf.c  |    2 +-
 include/linux/backing-dev.h |   11 +++--------
 include/linux/blkdev.h      |   13 +++++++++----
 mm/backing-dev.c            |    7 ++++---
 mm/memcontrol.c             |    2 +-
 mm/page-writeback.c         |    2 +-
 mm/page_alloc.c             |    4 ++--
 mm/vmscan.c                 |    8 ++++----
 16 files changed, 42 insertions(+), 45 deletions(-)

diff --git a/arch/x86/lib/usercopy_32.c b/arch/x86/lib/usercopy_32.c
index 1f118d4..7c8ca91 100644
--- a/arch/x86/lib/usercopy_32.c
+++ b/arch/x86/lib/usercopy_32.c
@@ -751,7 +751,7 @@ survive:
 
 			if (retval == -ENOMEM && is_global_init(current)) {
 				up_read(&current->mm->mmap_sem);
-				congestion_wait(BLK_RW_ASYNC, HZ/50);
+				congestion_wait(WRITE, HZ/50);
 				goto survive;
 			}
 
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 2ddf03a..d69bf9c 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -1372,10 +1372,8 @@ try_next_bio:
 	wakeup = (pd->write_congestion_on > 0
 	 		&& pd->bio_queue_size <= pd->write_congestion_off);
 	spin_unlock(&pd->lock);
-	if (wakeup) {
-		clear_bdi_congested(&pd->disk->queue->backing_dev_info,
-					BLK_RW_ASYNC);
-	}
+	if (wakeup)
+		clear_bdi_congested(&pd->disk->queue->backing_dev_info, WRITE);
 
 	pkt->sleep_time = max(PACKET_WAIT_TIME, 1);
 	pkt_set_state(pkt, PACKET_WAITING_STATE);
@@ -2594,10 +2592,10 @@ static int pkt_make_request(struct request_queue *q, struct bio *bio)
 	spin_lock(&pd->lock);
 	if (pd->write_congestion_on > 0
 	    && pd->bio_queue_size >= pd->write_congestion_on) {
-		set_bdi_congested(&q->backing_dev_info, BLK_RW_ASYNC);
+		set_bdi_congested(&q->backing_dev_info, WRITE);
 		do {
 			spin_unlock(&pd->lock);
-			congestion_wait(BLK_RW_ASYNC, HZ);
+			congestion_wait(WRITE, HZ);
 			spin_lock(&pd->lock);
 		} while(pd->bio_queue_size > pd->write_congestion_off);
 	}
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index ed10381..c72a8dd 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -776,7 +776,7 @@ static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)
 		 * But don't wait if split was due to the io size restriction
 		 */
 		if (unlikely(out_of_pages))
-			congestion_wait(BLK_RW_ASYNC, HZ/100);
+			congestion_wait(WRITE, HZ/100);
 
 		/*
 		 * With async crypto it is unsafe to share the crypto context
diff --git a/fs/fat/file.c b/fs/fat/file.c
index e8c159d..ef60a65 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -134,7 +134,7 @@ static int fat_file_release(struct inode *inode, struct file *filp)
 	if ((filp->f_mode & FMODE_WRITE) &&
 	     MSDOS_SB(inode->i_sb)->options.flush) {
 		fat_flush_inodes(inode->i_sb, inode, NULL);
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		congestion_wait(WRITE, HZ/10);
 	}
 	return 0;
 }
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 51d9e33..b152761 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -286,8 +286,8 @@ __releases(&fc->lock)
 		}
 		if (fc->num_background == fc->congestion_threshold &&
 		    fc->connected && fc->bdi_initialized) {
-			clear_bdi_congested(&fc->bdi, BLK_RW_SYNC);
-			clear_bdi_congested(&fc->bdi, BLK_RW_ASYNC);
+			clear_bdi_congested(&fc->bdi, READ);
+			clear_bdi_congested(&fc->bdi, WRITE);
 		}
 		fc->num_background--;
 		fc->active_background--;
@@ -414,8 +414,8 @@ static void fuse_request_send_nowait_locked(struct fuse_conn *fc,
 		fc->blocked = 1;
 	if (fc->num_background == fc->congestion_threshold &&
 	    fc->bdi_initialized) {
-		set_bdi_congested(&fc->bdi, BLK_RW_SYNC);
-		set_bdi_congested(&fc->bdi, BLK_RW_ASYNC);
+		set_bdi_congested(&fc->bdi, READ);
+		set_bdi_congested(&fc->bdi, WRITE);
 	}
 	list_add_tail(&req->list, &fc->bg_queue);
 	flush_bg_queue(fc);
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 53eb26c..bb9cc66 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -202,10 +202,8 @@ static int nfs_set_page_writeback(struct page *page)
 		struct nfs_server *nfss = NFS_SERVER(inode);
 
 		if (atomic_long_inc_return(&nfss->writeback) >
-				NFS_CONGESTION_ON_THRESH) {
-			set_bdi_congested(&nfss->backing_dev_info,
-						BLK_RW_ASYNC);
-		}
+				NFS_CONGESTION_ON_THRESH)
+			set_bdi_congested(&nfss->backing_dev_info, WRITE);
 	}
 	return ret;
 }
@@ -217,7 +215,7 @@ static void nfs_end_page_writeback(struct page *page)
 
 	end_page_writeback(page);
 	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
-		clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+		clear_bdi_congested(&nfss->backing_dev_info, WRITE);
 }
 
 static struct nfs_page *nfs_find_and_lock_request(struct page *page)
diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
index 9062220..77f5bb7 100644
--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -997,7 +997,7 @@ static int reiserfs_async_progress_wait(struct super_block *s)
 	DEFINE_WAIT(wait);
 	struct reiserfs_journal *j = SB_JOURNAL(s);
 	if (atomic_read(&j->j_async_throttle))
-		congestion_wait(BLK_RW_ASYNC, HZ / 10);
+		congestion_wait(WRITE, HZ / 10);
 	return 0;
 }
 
diff --git a/fs/xfs/linux-2.6/kmem.c b/fs/xfs/linux-2.6/kmem.c
index 2d3f90a..1cd3b55 100644
--- a/fs/xfs/linux-2.6/kmem.c
+++ b/fs/xfs/linux-2.6/kmem.c
@@ -53,7 +53,7 @@ kmem_alloc(size_t size, unsigned int __nocast flags)
 			printk(KERN_ERR "XFS: possible memory allocation "
 					"deadlock in %s (mode:0x%x)\n",
 					__func__, lflags);
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
+		congestion_wait(WRITE, HZ/50);
 	} while (1);
 }
 
@@ -130,7 +130,7 @@ kmem_zone_alloc(kmem_zone_t *zone, unsigned int __nocast flags)
 			printk(KERN_ERR "XFS: possible memory allocation "
 					"deadlock in %s (mode:0x%x)\n",
 					__func__, lflags);
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
+		congestion_wait(WRITE, HZ/50);
 	} while (1);
 }
 
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 965df12..178c20c 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -412,7 +412,7 @@ _xfs_buf_lookup_pages(
 
 			XFS_STATS_INC(xb_page_retries);
 			xfsbufd_wakeup(0, gfp_mask);
-			congestion_wait(BLK_RW_ASYNC, HZ/50);
+			congestion_wait(WRITE, HZ/50);
 			goto retry;
 		}
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index b449e73..58f5d0c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -273,14 +273,9 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi)
 				  (1 << BDI_async_congested));
 }
 
-enum {
-	BLK_RW_ASYNC	= 0,
-	BLK_RW_SYNC	= 1,
-};
-
-void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
-void set_bdi_congested(struct backing_dev_info *bdi, int sync);
-long congestion_wait(int sync, long timeout);
+void clear_bdi_congested(struct backing_dev_info *bdi, int rw);
+void set_bdi_congested(struct backing_dev_info *bdi, int rw);
+long congestion_wait(int rw, long timeout);
 
 
 static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 221cecd..51a6320 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -70,6 +70,11 @@ enum rq_cmd_type_bits {
 	REQ_TYPE_ATA_PC,
 };
 
+enum {
+	BLK_RW_ASYNC	= 0,
+	BLK_RW_SYNC	= 1,
+};
+
 /*
  * For request of type REQ_TYPE_LINUX_BLOCK, rq->cmd[0] is the opcode being
  * sent down (similar to how REQ_TYPE_BLOCK_PC means that ->cmd[] holds a
@@ -784,18 +789,18 @@ extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
  * congested queues, and wake up anyone who was waiting for requests to be
  * put back.
  */
-static inline void blk_clear_queue_congested(struct request_queue *q, int sync)
+static inline void blk_clear_queue_congested(struct request_queue *q, int rw)
 {
-	clear_bdi_congested(&q->backing_dev_info, sync);
+	clear_bdi_congested(&q->backing_dev_info, rw);
 }
 
 /*
  * A queue has just entered congestion.  Flag that in the queue's VM-visible
  * state flags and increment the global gounter of congested queues.
  */
-static inline void blk_set_queue_congested(struct request_queue *q, int sync)
+static inline void blk_set_queue_congested(struct request_queue *q, int rw)
 {
-	set_bdi_congested(&q->backing_dev_info, sync);
+	set_bdi_congested(&q->backing_dev_info, rw);
 }
 
 extern void blk_start_queue(struct request_queue *q);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 5a37e20..d68d6e4 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -696,6 +696,7 @@ static wait_queue_head_t congestion_wqh[2] = {
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
 	};
 
+
 void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
 {
 	enum bdi_state bit;
@@ -720,18 +721,18 @@ EXPORT_SYMBOL(set_bdi_congested);
 
 /**
  * congestion_wait - wait for a backing_dev to become uncongested
- * @sync: SYNC or ASYNC IO
+ * @rw: READ or WRITE
  * @timeout: timeout in jiffies
  *
  * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
  * write congestion.  If no backing_devs are congested then just wait for the
  * next write to be completed.
  */
-long congestion_wait(int sync, long timeout)
+long congestion_wait(int rw, long timeout)
 {
 	long ret;
 	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
+	wait_queue_head_t *wqh = &congestion_wqh[rw];
 
 	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
 	ret = io_schedule_timeout(timeout);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f99f599..f92ee06 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2430,7 +2430,7 @@ try_to_free:
 		if (!progress) {
 			nr_retries--;
 			/* maybe some writeback is necessary */
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			congestion_wait(WRITE, HZ/10);
 		}
 
 	}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2c5d792..b300954 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -671,7 +671,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
                 if (global_page_state(NR_UNSTABLE_NFS) +
 			global_page_state(NR_WRITEBACK) <= dirty_thresh)
                         	break;
-                congestion_wait(BLK_RW_ASYNC, HZ/10);
+                congestion_wait(WRITE, HZ/10);
 
 		/*
 		 * The caller might hold locks which can prevent IO completion
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 851df40..2cd0fbb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1738,7 +1738,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 			preferred_zone, migratetype);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
-			congestion_wait(BLK_RW_ASYNC, HZ/50);
+			congestion_wait(WRITE, HZ/50);
 	} while (!page && (gfp_mask & __GFP_NOFAIL));
 
 	/*
@@ -1909,7 +1909,7 @@ rebalance:
 	pages_reclaimed += did_some_progress;
 	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
 		/* Wait for some write requests to complete then retry */
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
+		congestion_wait(WRITE, HZ/50);
 		goto rebalance;
 	}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cd68109..3805e59 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1174,7 +1174,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 		 */
 		if (nr_freed < nr_taken && !current_is_kswapd() &&
 		    lumpy_reclaim) {
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			congestion_wait(WRITE, HZ/10);
 
 			/*
 			 * The attempt at page out may have made some
@@ -1783,7 +1783,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 		/* Take a nap, wait for some writeback to complete */
 		if (sc->nr_scanned && priority < DEF_PRIORITY - 2)
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			congestion_wait(WRITE, HZ/10);
 	}
 	/* top priority shrink_zones still had more to do? don't OOM, then */
 	if (!sc->all_unreclaimable && scanning_global_lru(sc))
@@ -2074,7 +2074,7 @@ loop_again:
 		 * another pass across the zones.
 		 */
 		if (total_scanned && priority < DEF_PRIORITY - 2)
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			congestion_wait(WRITE, HZ/10);
 
 		/*
 		 * We do this so kswapd doesn't build up large priorities for
@@ -2378,7 +2378,7 @@ unsigned long shrink_all_memory(unsigned long nr_pages)
 				goto out;
 
 			if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
-				congestion_wait(BLK_RW_ASYNC, HZ / 10);
+				congestion_wait(WRITE, HZ / 10);
 		}
 	}
 
-- 
1.6.3.3


^ permalink raw reply related

* [PATCH 4/5] page allocator: Pre-emptively wake kswapd when high-order watermarks are hit
From: Mel Gorman @ 2009-10-22 14:22 UTC (permalink / raw)
  To: Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
	Tobias Oetiker
  Cc: Rafael J. Wysocki, David Miller, Reinette Chatre, Kalle Valo,
	David Rientjes, KOSAKI Motohiro, Mohamed Abbas, Jens Axboe,
	John W. Linville, Pekka Enberg, Bartlomiej Zolnierkiewicz,
	Greg Kroah-Hartman, Stephan von Krawczynski, Kernel Testers List,
	netdev, linux-kernel, linux-mm@kvack.org, Mel Gorman
In-Reply-To: <1256221356-26049-1-git-send-email-mel@csn.ul.ie>

When a high-order allocation fails, kswapd is kicked so that it reclaims
at a higher order, to avoid direct reclaimers stalling and to help
GFP_ATOMIC allocations. Something has changed in recent kernels that
affects the timing, and high-order GFP_ATOMIC allocations are now failing
more frequently, particularly under pressure.

This patch pre-emptively checks if watermarks have been hit after a
high-order allocation completes successfully. If the watermarks have been
reached, kswapd is woken in the hope it fixes the watermarks before the
next GFP_ATOMIC allocation fails.

Warning: this patch is somewhat of a band-aid. If this makes a difference,
it still implies that something has changed that is either causing more
GFP_ATOMIC allocations to occur (such as the case with the iwlagn wireless
driver) or making them more likely to fail.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
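An illustrative userspace sketch, not part of the patch, of the idea
described above: after a successful high-order allocation, check the
low watermark for that order and kick kswapd if it is no longer met.
The model_* names and the simplified per-order check are assumptions
made for the example; the real zone_watermark_ok() takes more inputs.

```c
#include <assert.h>

#define MODEL_MAX_ORDER 11

/* Hypothetical, heavily simplified zone: free blocks per order. */
struct model_zone {
	unsigned long nr_free[MODEL_MAX_ORDER];
	unsigned long low_wmark;	/* low watermark, in pages */
	int kswapd_wakeups;		/* how often kswapd was kicked */
};

/*
 * Simplified watermark check: blocks smaller than the requested order
 * are discounted one order at a time while the required reserve is
 * halved. Returns 1 if the zone still looks healthy for this order.
 */
static int model_watermark_ok(const struct model_zone *z, unsigned int order,
			      unsigned long mark)
{
	long free_pages = 0;
	long min = (long)mark;
	unsigned int o;

	for (o = 0; o < MODEL_MAX_ORDER; o++)
		free_pages += (long)(z->nr_free[o] << o);
	if (free_pages <= min)
		return 0;
	for (o = 0; o < order; o++) {
		free_pages -= (long)(z->nr_free[o] << o);
		min >>= 1;
		if (free_pages <= min)
			return 0;
	}
	return 1;
}

/* Take one block of the given order (no splitting), then apply the check. */
static int model_alloc(struct model_zone *z, unsigned int order)
{
	assert(order < MODEL_MAX_ORDER);
	if (z->nr_free[order] == 0)
		return 0;
	z->nr_free[order]--;
	/* Pre-emptive wake: only for high-order requests that succeeded. */
	if (order && !model_watermark_ok(z, order, z->low_wmark))
		z->kswapd_wakeups++;
	return 1;
}
```

With 100 free order-0 pages, two order-2 blocks and a low watermark of
16 pages, taking one order-2 block leaves only 4 pages at order 1 or
above against a halved reserve of 8, so the model kicks kswapd even
though the allocation itself succeeded.
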
 mm/page_alloc.c |   33 ++++++++++++++++++++++-----------
 1 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7f2aa3e..851df40 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1596,6 +1596,17 @@ try_next_zone:
 	return page;
 }
 
+static inline
+void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
+						enum zone_type high_zoneidx)
+{
+	struct zoneref *z;
+	struct zone *zone;
+
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
+		wakeup_kswapd(zone, order);
+}
+
 static inline int
 should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 				unsigned long pages_reclaimed)
@@ -1730,18 +1741,18 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 			congestion_wait(BLK_RW_ASYNC, HZ/50);
 	} while (!page && (gfp_mask & __GFP_NOFAIL));
 
-	return page;
-}
-
-static inline
-void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
-						enum zone_type high_zoneidx)
-{
-	struct zoneref *z;
-	struct zone *zone;
+	/*
+	 * If after a high-order allocation we are now below watermarks,
+	 * pre-emptively kick kswapd rather than having the next allocation
+	 * fail and have to wake up kswapd, potentially failing GFP_ATOMIC
+	 * allocations or entering direct reclaim
+	 */
+	if (unlikely(order) && page && !zone_watermark_ok(preferred_zone, order,
+				preferred_zone->watermark[ALLOC_WMARK_LOW],
+				zone_idx(preferred_zone), ALLOC_WMARK_LOW))
+		wake_all_kswapd(order, zonelist, high_zoneidx);
 
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
-		wakeup_kswapd(zone, order);
+	return page;
 }
 
 static inline int
-- 
1.6.3.3

^ permalink raw reply related

* [PATCH 3/5] vmscan: Force kswapd to take notice faster when high-order watermarks are being hit
From: Mel Gorman @ 2009-10-22 14:22 UTC (permalink / raw)
  To: Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
	Tobias Oetiker
  Cc: Rafael J. Wysocki, David Miller, Reinette Chatre, Kalle Valo,
	David Rientjes, KOSAKI Motohiro, Mohamed Abbas, Jens Axboe,
	John W. Linville, Pekka Enberg, Bartlomiej Zolnierkiewicz,
	Greg Kroah-Hartman, Stephan von Krawczynski, Kernel Testers List,
	netdev, linux-kernel, linux-mm@kvack.org, Mel Gorman
In-Reply-To: <1256221356-26049-1-git-send-email-mel@csn.ul.ie>

When a high-order allocation fails, kswapd is kicked so that it reclaims
at a higher order, to avoid direct reclaimers stalling and to help
GFP_ATOMIC allocations. Something has changed in recent kernels that
affects the timing, and high-order GFP_ATOMIC allocations are now failing
more frequently, particularly under pressure. This patch forces kswapd to
notice sooner that high-order allocations are occurring.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 64e4388..cd68109 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2016,6 +2016,15 @@ loop_again:
 					priority != DEF_PRIORITY)
 				continue;
 
+			/*
+			 * Exit quickly to restart if it has been indicated
+			 * that higher orders are required
+			 */
+			if (pgdat->kswapd_max_order > order) {
+				all_zones_ok = 1;
+				goto out;
+			}
+
 			if (!zone_watermark_ok(zone, order,
 					high_wmark_pages(zone), end_zone, 0))
 				all_zones_ok = 0;
-- 
1.6.3.3

^ permalink raw reply related

* [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
From: Mel Gorman @ 2009-10-22 14:22 UTC (permalink / raw)
  To: Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
	Tobias Oetiker
  Cc: Rafael J. Wysocki, David Miller, Reinette Chatre, Kalle Valo,
	David Rientjes, KOSAKI Motohiro, Mohamed Abbas, Jens Axboe,
	John W. Linville, Pekka Enberg, Bartlomiej Zolnierkiewicz,
	Greg Kroah-Hartman, Stephan von Krawczynski, Kernel Testers List,
	netdev, linux-kernel, linux-mm@kvack.org, Mel Gorman
In-Reply-To: <1256221356-26049-1-git-send-email-mel@csn.ul.ie>

Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 altered watermark logic
slightly by allowing rt_tasks that are handling an interrupt to set
ALLOC_HARDER. This patch brings the watermark logic more in line with
2.6.30.

[rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org: Spotted the problem]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Pekka Enberg <penberg-bbCR+/B0CizivPeTLB3BmA@public.gmane.org>
---
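An illustrative userspace sketch, not part of the patch, of the one-line
condition change. It models only the "else if" branch touched by the
hunk; model_rt_flags and MODEL_ALLOC_HARDER are hypothetical names and
the flag value is illustrative only.

```c
#include <stdbool.h>

#define MODEL_ALLOC_HARDER 0x10	/* illustrative value only */

/*
 * patched == false: any rt task gets the "harder" flag.
 * patched == true:  an rt task gets it only outside interrupt context.
 */
static unsigned int model_rt_flags(unsigned int alloc_flags, bool is_rt_task,
				   bool in_irq, bool patched)
{
	if (is_rt_task && !(patched && in_irq))
		alloc_flags |= MODEL_ALLOC_HARDER;
	return alloc_flags;
}
```

The only case whose result differs between the two variants is an rt
task that happens to be interrupted while the allocation is made.
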
 mm/page_alloc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dfa4362..7f2aa3e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1769,7 +1769,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 		 */
 		alloc_flags &= ~ALLOC_CPUSET;
-	} else if (unlikely(rt_task(p)))
+	} else if (unlikely(rt_task(p)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;
 
 	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-- 
1.6.3.3

^ permalink raw reply related

* [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
From: Mel Gorman @ 2009-10-22 14:22 UTC (permalink / raw)
  To: Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
	Tobias Oetiker
  Cc: Rafael J. Wysocki, David Miller, Reinette Chatre, Kalle Valo,
	David Rientjes, KOSAKI Motohiro, Mohamed Abbas, Jens Axboe,
	John W. Linville, Pekka Enberg, Bartlomiej Zolnierkiewicz,
	Greg Kroah-Hartman, Stephan von Krawczynski, Kernel Testers List,
	netdev, linux-kernel, linux-mm@kvack.org, Mel Gorman
In-Reply-To: <1256221356-26049-1-git-send-email-mel@csn.ul.ie>

If a direct reclaim makes no forward progress, it considers whether it
should go OOM or not. Whether OOM is triggered or not, it may retry the
allocation afterwards. In times past, this would always wake kswapd as well
but currently, kswapd is not woken up after direct reclaim fails. For order-0
allocations, this makes little difference but if there is a heavy mix of
higher-order allocations that direct reclaim is failing for, it might mean
that kswapd is not rewoken for higher orders as much as it was previously.

This patch wakes up kswapd when an allocation is being retried after a direct
reclaim failure. It would be expected that kswapd is already awake, but
this has the effect of telling kswapd to reclaim at the higher order as well.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
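An illustrative userspace sketch, not part of the patch, of what moving
the restart label does. With the label after the wake call, kswapd is
woken once per slow-path entry; with the label before it, kswapd is
woken on every retry. model_slowpath and struct model_stats are
hypothetical names.

```c
#include <assert.h>

struct model_stats {
	int kswapd_wakeups;
	int attempts;
};

/*
 * failed_reclaims: how many times direct reclaim fails and the
 * allocation jumps back to "restart" before it finally gives up.
 */
static struct model_stats model_slowpath(int failed_reclaims,
					 int wake_on_restart)
{
	struct model_stats s = { 0, 0 };
	int restarts_left = failed_reclaims;

	assert(failed_reclaims >= 0);

	if (!wake_on_restart)
		s.kswapd_wakeups++;	/* old layout: wake, then label */
restart:
	if (wake_on_restart)
		s.kswapd_wakeups++;	/* new layout: label, then wake */
	s.attempts++;
	if (restarts_left-- > 0)
		goto restart;
	return s;
}
```

For three failed reclaims the model makes four attempts in both
layouts, but wakes kswapd once in the old layout and four times in the
new one.
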
 mm/page_alloc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bf72055..dfa4362 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1817,9 +1817,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
 		goto nopage;
 
+restart:
 	wake_all_kswapd(order, zonelist, high_zoneidx);
 
-restart:
 	/*
 	 * OK, we're below the kswapd watermark and have kicked background
 	 * reclaim. Now things get more complex, so set up alloc_flags according
-- 
1.6.3.3

^ permalink raw reply related

* [PATCH 0/5] Candidate fix for increased number of GFP_ATOMIC failures V2
From: Mel Gorman @ 2009-10-22 14:22 UTC (permalink / raw)
  To: Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
	Tobias Oetiker
  Cc: Rafael J. Wysocki, David Miller, Reinette Chatre, Kalle Valo,
	David Rientjes, KOSAKI Motohiro, Mohamed Abbas, Jens Axboe,
	John W. Linville, Pekka Enberg, Bartlomiej Zolnierkiewicz,
	Greg Kroah-Hartman, Stephan von Krawczynski, Kernel Testers List,
	netdev, linux-kernel, linux-mm@kvack.org, Mel Gorman

Sorry for the large cc list. Variations of this bug have cropped up in a
number of different places and so there are a fair few people that should
be vaguely aware of what's going on.

Since 2.6.31-rc1, there has been an increasing number of GFP_ATOMIC
failures. A significant number of these have been high-order GFP_ATOMIC
failures and while they are generally brushed away, there has been a large
increase in them recently and there are a number of possible areas the
problem could be in: core vm, page writeback and a specific driver. The
bugs affected by this that I am aware of are:

[Bug #14141] order 2 page allocation failures in iwlagn
	Commit 4752c93c30441f98f7ed723001b1a5e3e5619829 introduced GFP_ATOMIC
	allocations within the wireless driver. This has caused large numbers
	of failure reports to occur as reported by Frans Pop. Fixing this
	requires changes to the driver if it wants to use GFP_ATOMIC which
	is in the hands of Mohamed Abbas and Reinette Chatre. However,
	it is very likely that it has been compounded by core mm changes
	that this series is aimed at.

[Bug #14141] order 2 page allocation failures (generic)
	This problem is being tracked under bug #14141 but chances are it's
	unrelated to the wireless change. Tobi Oetiker has reported that a
	virtualised machine using a bridged interface is reporting a small
	number of order-5 GFP_ATOMIC failures. He has reported that the
	errors can be suppressed with kswapd patches in this series. However,
	I would like to confirm they are necessary.

[Bug #14265] ifconfig: page allocation failure. order:5, mode:0x8020 w/ e100
	Karol Lewandowski reported that e100 fails to allocate order-5
	GFP_ATOMIC when loading firmware during resume. This has started
	happening relatively recently.

[No BZ ID] Kernel crash on 2.6.31.x (kcryptd: page allocation failure..)
	This apparently is easily reproducible, particularly in comparison to
	the other reports. The point of greatest interest is that this is
	order-0 GFP_ATOMIC failures. Sven, I'm hoping that you in particular
	will be able to follow the tests below as you are the most likely
	person to have an easily reproducible situation.

[No BZ ID] page allocation failure message kernel 2.6.31.4 (tty-related)
	reported at: http://lkml.org/lkml/2009/10/20/139. Looks the same
	as the order-2 failures.

There are 5 patches in this series. For people affected by this bug,
I'm afraid there is a lot of legwork involved to help pin down which of
these patches are relevant. These patches are all against 2.6.32-rc5 and
have been tested on X86 and X86-64 by running the sysbench benchmark to
completion. I'll post against 2.6.31.4 where necessary.

Test 1: Verify your problem occurs on 2.6.32-rc5 if you can

Test 2: Apply the following two patches and test again

  1/5 page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
  2/5 page allocator: Do not allow interrupts to use ALLOC_HARDER


	These patches correct problems introduced by me during the 2.6.31-rc1
	merge window. The patches were not meant to introduce any functional
	changes but two were missed.

	If your problem goes away with just these two patches applied,
	please tell me.

Test 3: If you are getting allocation failures, try with the following patch

  3/5 vmscan: Force kswapd to take notice faster when high-order watermarks are being hit

	This is a functional change that causes kswapd to notice sooner
	when high-order watermarks have been hit. There have been a number
	of changes in page reclaim since 2.6.30 that might have delayed
	when kswapd kicks in for higher orders

	If your problem goes away with these three patches applied, please
	tell me

Test 4: If you are still getting failures, apply the following
  4/5 page allocator: Pre-emptively wake kswapd when high-order watermarks are hit

	This patch is very heavy-handed and pre-emptively kicks kswapd when
	watermarks are hit. It should only be necessary if there have been
	significant changes in the timing and density of page allocations
	from an unknown source. Tobias, this patch is largely aimed at you.
	You reported that with patches 3+4 applied your problems went
	away. I need to know if patch 3 on its own is enough or if both
	are required.

	If your problem goes away with these four patches applied, please
	tell me.

Test 5: If things are still screwed, apply the following
  5/5 Revert 373c0a7e, 8aa7e847: Fix congestion_wait() sync/async vs read/write confusion

	Frans Pop reports that the bulk of his problems go away when this
	patch is reverted on 2.6.31. There has been some confusion on why
	exactly this patch was wrong but apparently the conversion was not
	complete and further work was required. It's unknown if all the
	necessary work exists in 2.6.31-rc5 or not. If there are still
	allocation failures and applying this patch fixes the problem,
	there are still snags that need to be ironed out.

Test 6: If only testing 2.6.31.4, test with patches 1, 2 and 5 as posted for that kernel
	Even if patches 3, 4 or both are necessary against mainline, I'm
	hoping they are unnecessary against -stable.

Thanks to all that reported problems and are testing this. The major bulk of
the work was done by Frans Pop so a big thanks to him in particular. I/we owe
him beers.

 arch/x86/lib/usercopy_32.c  |    2 +-
 drivers/block/pktcdvd.c     |   10 ++++------
 drivers/md/dm-crypt.c       |    2 +-
 fs/fat/file.c               |    2 +-
 fs/fuse/dev.c               |    8 ++++----
 fs/nfs/write.c              |    8 +++-----
 fs/reiserfs/journal.c       |    2 +-
 fs/xfs/linux-2.6/kmem.c     |    4 ++--
 fs/xfs/linux-2.6/xfs_buf.c  |    2 +-
 include/linux/backing-dev.h |   11 +++--------
 include/linux/blkdev.h      |   13 +++++++++----
 mm/backing-dev.c            |    7 ++++---
 mm/memcontrol.c             |    2 +-
 mm/page-writeback.c         |    2 +-
 mm/page_alloc.c             |   41 ++++++++++++++++++++++++++---------------
 mm/vmscan.c                 |   17 +++++++++++++----
 16 files changed, 75 insertions(+), 58 deletions(-)

^ permalink raw reply

* Re: [PATCH net-next-2.6 1/4] net: introduce mc list helpers
From: Ben Hutchings @ 2009-10-22 14:18 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, eric.dumazet, jeffrey.t.kirsher, jesse.brandeburg,
	bruce.w.allan, peter.p.waskiewicz.jr, john.ronciak, e1000-devel,
	mchehab, linux-media
In-Reply-To: <20091022135220.GD2868@psychotron.lab.eng.brq.redhat.com>

On Thu, 2009-10-22 at 15:52 +0200, Jiri Pirko wrote:
> This helpers should be used by network drivers to access to netdev
> multicast lists.
[...]
> +static inline void netdev_mc_walk(struct net_device *dev,
> +				  void (*func)(void *, unsigned char *),
> +				  void *data)
> +{
> +	struct dev_addr_list *mclist;
> +	int i;
> +
> +	for (i = 0, mclist = dev->mc_list; mclist && i < dev->mc_count;
> +	     i++, mclist = mclist->next)
> +		func(data, mclist->dmi_addr);
> +}
[...]

We usually implement iteration as macros so that any context doesn't
have to be squeezed through a single untyped (void *) variable.  A macro
for this would look something like:

#define netdev_for_each_mc_addr(dev, addr)						\
	for (addr = (dev)->mc_list ? (dev)->mc_list->dmi_addr : NULL;			\
	     addr;									\
	     addr = (container_of(addr, struct dev_addr_list, dmi_addr)->next ?		\
		     container_of(addr, struct dev_addr_list, dmi_addr)->next->dmi_addr : \
		     NULL))

Once you change the list type this can presumably be made less ugly.
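A minimal userspace sketch of how the macro above could be exercised.
The two structures are trimmed-down, hypothetical stand-ins for the
kernel ones, container_of is a plain offsetof() version without the
kernel's type check, and model_count_mc is an invented caller:

```c
#include <stddef.h>

#define MODEL_ADDR_LEN 6

/* Hypothetical stand-ins carrying only the fields the macro touches. */
struct dev_addr_list {
	struct dev_addr_list *next;
	unsigned char dmi_addr[MODEL_ADDR_LEN];
};

struct net_device {
	struct dev_addr_list *mc_list;
};

/* Userspace container_of, without the kernel's type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define netdev_for_each_mc_addr(dev, addr)					\
	for (addr = (dev)->mc_list ? (dev)->mc_list->dmi_addr : NULL;		\
	     addr;								\
	     addr = (container_of(addr, struct dev_addr_list, dmi_addr)->next ?	\
		     container_of(addr, struct dev_addr_list, dmi_addr)->next->dmi_addr : \
		     NULL))

/* Example caller: state lives in ordinary typed locals, no void * needed. */
static int model_count_mc(struct net_device *dev)
{
	unsigned char *addr;
	int n = 0;

	netdev_for_each_mc_addr(dev, addr)
		n++;
	return n;
}
```

The loop body can read and update the caller's own variables (n here)
directly, which is the point made above about not squeezing context
through a single untyped pointer.
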

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* [PATCH iproute2] ip: Support IFLA_TXQLEN in ip link command
From: Eric Dumazet @ 2009-10-22 14:15 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Linux Netdev List, Benjamin LaHaise

We currently use an expensive ioctl() to get device txqueuelen, while
rtnetlink gave it to us for free. This patch speeds up ip link operation
when many devices are registered.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
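An illustrative sketch, not part of the patch, of reading the queue
length from an already-parsed attribute and reporting whether a
fallback is needed. It assumes the Linux UAPI header
<linux/rtnetlink.h>; model_txqlen_from_attr is a hypothetical helper,
and it copies the payload with memcpy() and checks its size rather than
dereferencing a cast pointer as the diff below does.

```c
#include <string.h>
#include <linux/rtnetlink.h>	/* struct rtattr, RTA_* macros, IFLA_TXQLEN */

/*
 * Returns 1 and stores the value if the attribute is present and big
 * enough; returns 0 so the caller can fall back to the ioctl() path.
 */
static int model_txqlen_from_attr(const struct rtattr *rta,
				  unsigned int *qlen)
{
	if (!rta || RTA_PAYLOAD(rta) < (int)sizeof(*qlen))
		return 0;
	memcpy(qlen, RTA_DATA(rta), sizeof(*qlen));
	return 1;
}
```

A caller would pass tb[IFLA_TXQLEN] and use print_queuelen() only when
the helper returns 0.
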

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 267ecb3..f06a3f7 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -252,9 +252,12 @@ int print_linkinfo(const struct sockaddr_nl *who,
 	if (tb[IFLA_OPERSTATE])
 		print_operstate(fp, *(__u8 *)RTA_DATA(tb[IFLA_OPERSTATE]));
 		
-	if (filter.showqueue)
-		print_queuelen(fp, (char*)RTA_DATA(tb[IFLA_IFNAME]));
-
+	if (filter.showqueue) {
+		if (tb[IFLA_TXQLEN])
+			fprintf(fp, "qlen %d ", *(int *)RTA_DATA(tb[IFLA_TXQLEN]));
+		else
+			print_queuelen(fp, (char *)RTA_DATA(tb[IFLA_IFNAME]));
+	}
 	if (!filter.family || filter.family == AF_PACKET) {
 		SPRINT_BUF(b1);
 		fprintf(fp, "%s", _SL_);

^ permalink raw reply related

* [PATCH net-next-2.6 4/4] dvb: dvb_net: use mc helpers to access multicast list
From: Jiri Pirko @ 2009-10-22 13:57 UTC (permalink / raw)
  To: netdev
  Cc: davem, eric.dumazet, jeffrey.t.kirsher, jesse.brandeburg,
	bruce.w.allan, peter.p.waskiewicz.jr, john.ronciak, e1000-devel,
	mchehab, linux-media
In-Reply-To: <20091022135120.GC2868@psychotron.lab.eng.brq.redhat.com>

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
 drivers/media/dvb/dvb-core/dvb_net.c |   22 +++++++---------------
 1 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/drivers/media/dvb/dvb-core/dvb_net.c b/drivers/media/dvb/dvb-core/dvb_net.c
index 8c9ae0a..eb50fb0 100644
--- a/drivers/media/dvb/dvb-core/dvb_net.c
+++ b/drivers/media/dvb/dvb-core/dvb_net.c
@@ -1110,17 +1110,16 @@ static int dvb_net_feed_stop(struct net_device *dev)
 }
 
 
-static int dvb_set_mc_filter (struct net_device *dev, struct dev_mc_list *mc)
+static void dvb_set_mc_filter(void *data, unsigned char *addr)
 {
-	struct dvb_net_priv *priv = netdev_priv(dev);
+	struct dvb_net_priv *priv = data;
 
 	if (priv->multi_num == DVB_NET_MULTICAST_MAX)
-		return -ENOMEM;
+		return;
 
-	memcpy(priv->multi_macs[priv->multi_num], mc->dmi_addr, 6);
+	memcpy(priv->multi_macs[priv->multi_num], addr, ETH_ALEN);
 
 	priv->multi_num++;
-	return 0;
 }
 
 
@@ -1140,21 +1139,14 @@ static void wq_set_multicast_list (struct work_struct *work)
 	} else if ((dev->flags & IFF_ALLMULTI)) {
 		dprintk("%s: allmulti mode\n", dev->name);
 		priv->rx_mode = RX_MODE_ALL_MULTI;
-	} else if (dev->mc_count) {
-		int mci;
-		struct dev_mc_list *mc;
-
+	} else if (netdev_mc_count(dev)) {
 		dprintk("%s: set_mc_list, %d entries\n",
-			dev->name, dev->mc_count);
+			dev->name, netdev_mc_count(dev));
 
 		priv->rx_mode = RX_MODE_MULTI;
 		priv->multi_num = 0;
 
-		for (mci = 0, mc=dev->mc_list;
-		     mci < dev->mc_count;
-		     mc = mc->next, mci++) {
-			dvb_set_mc_filter(dev, mc);
-		}
+		netdev_mc_walk(dev, dvb_set_mc_filter, priv);
 	}
 
 	netif_addr_unlock_bh(dev);
-- 
1.6.2.5


^ permalink raw reply related

* Re: [PATCH net-next-2.6 0/4] net: change the way mc_list is accessed
From: Jiri Pirko @ 2009-10-22 13:56 UTC (permalink / raw)
  To: netdev
  Cc: davem, eric.dumazet, jeffrey.t.kirsher, jesse.brandeburg,
	bruce.w.allan, peter.p.waskiewicz.jr, john.ronciak, e1000-devel,
	mchehab, linux-media
In-Reply-To: <20091022135446.GG2868@psychotron.lab.eng.brq.redhat.com>

wrong subject... reposting...

Thu, Oct 22, 2009 at 03:54:47PM CEST, jpirko@redhat.com wrote:
>Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>---
> drivers/media/dvb/dvb-core/dvb_net.c |   22 +++++++---------------
> 1 files changed, 7 insertions(+), 15 deletions(-)
>
>diff --git a/drivers/media/dvb/dvb-core/dvb_net.c b/drivers/media/dvb/dvb-core/dvb_net.c
>index 8c9ae0a..eb50fb0 100644
>--- a/drivers/media/dvb/dvb-core/dvb_net.c
>+++ b/drivers/media/dvb/dvb-core/dvb_net.c
>@@ -1110,17 +1110,16 @@ static int dvb_net_feed_stop(struct net_device *dev)
> }
> 
> 
>-static int dvb_set_mc_filter (struct net_device *dev, struct dev_mc_list *mc)
>+static void dvb_set_mc_filter(void *data, unsigned char *addr)
> {
>-	struct dvb_net_priv *priv = netdev_priv(dev);
>+	struct dvb_net_priv *priv = data;
> 
> 	if (priv->multi_num == DVB_NET_MULTICAST_MAX)
>-		return -ENOMEM;
>+		return;
> 
>-	memcpy(priv->multi_macs[priv->multi_num], mc->dmi_addr, 6);
>+	memcpy(priv->multi_macs[priv->multi_num], addr, ETH_ALEN);
> 
> 	priv->multi_num++;
>-	return 0;
> }
> 
> 
>@@ -1140,21 +1139,14 @@ static void wq_set_multicast_list (struct work_struct *work)
> 	} else if ((dev->flags & IFF_ALLMULTI)) {
> 		dprintk("%s: allmulti mode\n", dev->name);
> 		priv->rx_mode = RX_MODE_ALL_MULTI;
>-	} else if (dev->mc_count) {
>-		int mci;
>-		struct dev_mc_list *mc;
>-
>+	} else if (netdev_mc_count(dev)) {
> 		dprintk("%s: set_mc_list, %d entries\n",
>-			dev->name, dev->mc_count);
>+			dev->name, netdev_mc_count(dev));
> 
> 		priv->rx_mode = RX_MODE_MULTI;
> 		priv->multi_num = 0;
> 
>-		for (mci = 0, mc=dev->mc_list;
>-		     mci < dev->mc_count;
>-		     mc = mc->next, mci++) {
>-			dvb_set_mc_filter(dev, mc);
>-		}
>+		netdev_mc_walk(dev, dvb_set_mc_filter, priv);
> 	}
> 
> 	netif_addr_unlock_bh(dev);
>-- 
>1.6.2.5
>

^ permalink raw reply

* Re: [RFC] net,socket: introduce build_sockaddr_check helper to catch overflow at build time
From: Cyrill Gorcunov @ 2009-10-22 13:55 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20091022.044914.36401063.davem@davemloft.net>

[David Miller - Thu, Oct 22, 2009 at 04:49:14AM -0700]
| From: Cyrill Gorcunov <gorcunov@gmail.com>
| Date: Wed, 21 Oct 2009 21:07:32 +0400
| 
| > net,socket: introduce build_sockaddr_check helper to catch overflow at build time
| > 
| > proto_ops->getname implies copying protocol specific data
| > into a storage unit (particularly into __kernel_sockaddr_storage).
| > So when one implements a new protocol, one may keep this
| > in mind (or may not).
| > 
| > Let's introduce a build_sockaddr_check helper which checks that
| > the storage unit is not overflowed. Note that the check happens at
| > build time and introduces no slowdown at execution time.
| > 
| > Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
| 
| Nice idea, and I wonder if we can automate it even further.
| Perhaps some tag that gets put on the socket address type
| definition or similar?
| 

Thanks for the review, David! I'm not sure I understand you correctly.
Initially I was trying to keep the changes as minimal as possible.
I was also considering the following possibilities:

1) Since at least one .getname handler uses memcpy, we could
   introduce a helper which checks the size (at build time) and
   then does the memcpy (not optimal perhaps).


2) All handlers set *len to some size explicitly, so we could
   introduce a set_sockaddr_size() helper like

#define set_sockaddr_size(ptr, size)		\
	do {					\
		build_sockaddr_check(size);	\
		*(ptr) = (size);		\
	} while (0)

Or did you mean something completely different?

	-- Cyrill

^ permalink raw reply
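
To make the build-time check discussed above concrete, here is a userspace sketch. The RFC's actual build_sockaddr_check() is not quoted in this message, so the definition below is an assumption: it uses the negative-array-size trick and compares against struct sockaddr_storage rather than the kernel's __kernel_sockaddr_storage. BUILD_CHECK() and sample_getname() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>	/* struct sockaddr_storage */
#include <sys/un.h>	/* struct sockaddr_un, used as a sample address type */

/* Compile error (negative array size) when cond is false; no runtime cost. */
#define BUILD_CHECK(cond) ((void)sizeof(char[1 - 2 * !(cond)]))

/* Assumed stand-in for the RFC's helper: size must fit the storage unit. */
#define build_sockaddr_check(size) \
	BUILD_CHECK((size) <= sizeof(struct sockaddr_storage))

#define set_sockaddr_size(ptr, size)		\
	do {					\
		build_sockaddr_check(size);	\
		*(ptr) = (size);		\
	} while (0)

/* getname-style function: reports the length of the address it would fill. */
static int sample_getname(int *len)
{
	assert(len != NULL);
	set_sockaddr_size(len, sizeof(struct sockaddr_un));
	return 0;
}
```

Passing a size larger than the storage unit, for example sizeof(struct sockaddr_storage) + 1, should make the build fail rather than overflow at run time. The check only works when the size is a compile-time constant.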

* Re: [PATCH net-next-2.6 0/4] net: change the way mc_list is accessed
From: Jiri Pirko @ 2009-10-22 13:54 UTC (permalink / raw)
  To: netdev
  Cc: eric.dumazet, e1000-devel, bruce.w.allan, jesse.brandeburg,
	mchehab, john.ronciak, jeffrey.t.kirsher, davem, linux-media
In-Reply-To: <20091022135120.GC2868@psychotron.lab.eng.brq.redhat.com>

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
 drivers/media/dvb/dvb-core/dvb_net.c |   22 +++++++---------------
 1 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/drivers/media/dvb/dvb-core/dvb_net.c b/drivers/media/dvb/dvb-core/dvb_net.c
index 8c9ae0a..eb50fb0 100644
--- a/drivers/media/dvb/dvb-core/dvb_net.c
+++ b/drivers/media/dvb/dvb-core/dvb_net.c
@@ -1110,17 +1110,16 @@ static int dvb_net_feed_stop(struct net_device *dev)
 }
 
 
-static int dvb_set_mc_filter (struct net_device *dev, struct dev_mc_list *mc)
+static void dvb_set_mc_filter(void *data, unsigned char *addr)
 {
-	struct dvb_net_priv *priv = netdev_priv(dev);
+	struct dvb_net_priv *priv = data;
 
 	if (priv->multi_num == DVB_NET_MULTICAST_MAX)
-		return -ENOMEM;
+		return;
 
-	memcpy(priv->multi_macs[priv->multi_num], mc->dmi_addr, 6);
+	memcpy(priv->multi_macs[priv->multi_num], addr, ETH_ALEN);
 
 	priv->multi_num++;
-	return 0;
 }
 
 
@@ -1140,21 +1139,14 @@ static void wq_set_multicast_list (struct work_struct *work)
 	} else if ((dev->flags & IFF_ALLMULTI)) {
 		dprintk("%s: allmulti mode\n", dev->name);
 		priv->rx_mode = RX_MODE_ALL_MULTI;
-	} else if (dev->mc_count) {
-		int mci;
-		struct dev_mc_list *mc;
-
+	} else if (netdev_mc_count(dev)) {
 		dprintk("%s: set_mc_list, %d entries\n",
-			dev->name, dev->mc_count);
+			dev->name, netdev_mc_count(dev));
 
 		priv->rx_mode = RX_MODE_MULTI;
 		priv->multi_num = 0;
 
-		for (mci = 0, mc=dev->mc_list;
-		     mci < dev->mc_count;
-		     mc = mc->next, mci++) {
-			dvb_set_mc_filter(dev, mc);
-		}
+		netdev_mc_walk(dev, dvb_set_mc_filter, priv);
 	}
 
 	netif_addr_unlock_bh(dev);
-- 
1.6.2.5


------------------------------------------------------------------------------
Come build with us! The BlackBerry(R) Developer Conference in SF, CA
is the only developer event you need to attend this year. Jumpstart your
developing skills, take BlackBerry mobile applications to market and stay 
ahead of the curve. Join us from November 9 - 12, 2009. Register now!
http://p.sf.net/sfu/devconference

^ permalink raw reply related

* [PATCH net-next-2.6 3/4] e1000e: use mc helpers to access multicast list
From: Jiri Pirko @ 2009-10-22 13:54 UTC (permalink / raw)
  To: netdev
  Cc: eric.dumazet, e1000-devel, bruce.w.allan, jesse.brandeburg,
	mchehab, john.ronciak, jeffrey.t.kirsher, davem, linux-media
In-Reply-To: <20091022135120.GC2868@psychotron.lab.eng.brq.redhat.com>

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
 drivers/net/e1000e/netdev.c |   34 +++++++++++++++++++---------------
 1 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 3769248..97cd106 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -2529,6 +2529,17 @@ static void e1000_update_mc_addr_list(struct e1000_hw *hw, u8 *mc_addr_list,
 }
 
 /**
+ * e1000_mc_walker - helper function
+ **/
+static void e1000_mc_walker(void *data, unsigned char *addr)
+{
+	u8 **mta_list_i = data;
+
+	memcpy(*mta_list_i, addr, ETH_ALEN);
+	*mta_list_i += ETH_ALEN;
+}
+
+/**
  * e1000_set_multi - Multicast and Promiscuous mode set
  * @netdev: network interface device structure
  *
@@ -2542,10 +2553,9 @@ static void e1000_set_multi(struct net_device *netdev)
 	struct e1000_adapter *adapter = netdev_priv(netdev);
 	struct e1000_hw *hw = &adapter->hw;
 	struct e1000_mac_info *mac = &hw->mac;
-	struct dev_mc_list *mc_ptr;
-	u8  *mta_list;
+	u8  *mta_list, *mta_list_i;
 	u32 rctl;
-	int i;
+	int mc_count;
 
 	/* Check for Promiscuous and All Multicast modes */
 
@@ -2567,23 +2577,17 @@ static void e1000_set_multi(struct net_device *netdev)
 
 	ew32(RCTL, rctl);
 
-	if (netdev->mc_count) {
-		mta_list = kmalloc(netdev->mc_count * 6, GFP_ATOMIC);
+	mc_count = netdev_mc_count(netdev);
+	if (mc_count) {
+		mta_list = kmalloc(mc_count * ETH_ALEN, GFP_ATOMIC);
 		if (!mta_list)
 			return;
 
 		/* prepare a packed array of only addresses. */
-		mc_ptr = netdev->mc_list;
-
-		for (i = 0; i < netdev->mc_count; i++) {
-			if (!mc_ptr)
-				break;
-			memcpy(mta_list + (i*ETH_ALEN), mc_ptr->dmi_addr,
-			       ETH_ALEN);
-			mc_ptr = mc_ptr->next;
-		}
+		mta_list_i = mta_list;
+		netdev_mc_walk(netdev, e1000_mc_walker, &mta_list_i);
 
-		e1000_update_mc_addr_list(hw, mta_list, i, 1,
+		e1000_update_mc_addr_list(hw, mta_list, mc_count, 1,
 					  mac->rar_entry_count);
 		kfree(mta_list);
 	} else {
-- 
1.6.2.5


^ permalink raw reply related

* [PATCH net-next-2.6 2/4] 8139too: use mc helpers to access multicast list
From: Jiri Pirko @ 2009-10-22 13:53 UTC (permalink / raw)
  To: netdev
  Cc: eric.dumazet, e1000-devel, bruce.w.allan, jesse.brandeburg,
	mchehab, john.ronciak, jeffrey.t.kirsher, davem, linux-media
In-Reply-To: <20091022135120.GC2868@psychotron.lab.eng.brq.redhat.com>

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
 drivers/net/8139too.c |   24 ++++++++++++++----------
 1 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/net/8139too.c b/drivers/net/8139too.c
index 7e333f7..f0c3670 100644
--- a/drivers/net/8139too.c
+++ b/drivers/net/8139too.c
@@ -2501,6 +2501,15 @@ static struct net_device_stats *rtl8139_get_stats (struct net_device *dev)
 	return &dev->stats;
 }
 
+static void mc_walker(void *data, unsigned char *addr)
+{
+	u32 *mc_filter = data;
+	int bit_nr;
+
+	bit_nr = ether_crc(ETH_ALEN, addr) >> 26;
+	mc_filter[bit_nr >> 5] |= 1 << (bit_nr & 31);
+}
+
 /* Set or clear the multicast filter for this adaptor.
    This routine is not state sensitive and need not be SMP locked. */
 
@@ -2509,7 +2518,7 @@ static void __set_rx_mode (struct net_device *dev)
 	struct rtl8139_private *tp = netdev_priv(dev);
 	void __iomem *ioaddr = tp->mmio_addr;
 	u32 mc_filter[2];	/* Multicast hash filter */
-	int i, rx_mode;
+	int rx_mode;
 	u32 tmp;
 
 	pr_debug("%s:   rtl8139_set_rx_mode(%4.4x) done -- Rx config %8.8lx.\n",
@@ -2521,22 +2530,17 @@ static void __set_rx_mode (struct net_device *dev)
 		    AcceptBroadcast | AcceptMulticast | AcceptMyPhys |
 		    AcceptAllPhys;
 		mc_filter[1] = mc_filter[0] = 0xffffffff;
-	} else if ((dev->mc_count > multicast_filter_limit)
+	} else if ((netdev_mc_count(dev) > multicast_filter_limit)
 		   || (dev->flags & IFF_ALLMULTI)) {
 		/* Too many to filter perfectly -- accept all multicasts. */
 		rx_mode = AcceptBroadcast | AcceptMulticast | AcceptMyPhys;
 		mc_filter[1] = mc_filter[0] = 0xffffffff;
 	} else {
-		struct dev_mc_list *mclist;
 		rx_mode = AcceptBroadcast | AcceptMyPhys;
-		mc_filter[1] = mc_filter[0] = 0;
-		for (i = 0, mclist = dev->mc_list; mclist && i < dev->mc_count;
-		     i++, mclist = mclist->next) {
-			int bit_nr = ether_crc(ETH_ALEN, mclist->dmi_addr) >> 26;
-
-			mc_filter[bit_nr >> 5] |= 1 << (bit_nr & 31);
+		if (!netdev_mc_empty(dev))
 			rx_mode |= AcceptMulticast;
-		}
+		mc_filter[1] = mc_filter[0] = 0;
+		netdev_mc_walk(dev, mc_walker, mc_filter);
 	}
 
 	/* We can safely update without stopping the chip. */
-- 
1.6.2.5


^ permalink raw reply related
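
The hash-filter arithmetic in mc_walker() above can be tried outside the kernel with the sketch below. It assumes that ether_crc() is the bit-reversed little-endian CRC-32 seeded with ~0, which is how I understand the kernel helper to be defined. The *_sketch functions and mc_filter_set() are illustrative names, not kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise reflected CRC-32 (polynomial 0xedb88320), no final inversion. */
static uint32_t crc32_le_sketch(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len--) {
		int k;

		crc ^= *p++;
		for (k = 0; k < 8; k++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320u : 0);
	}
	return crc;
}

/* Reverse the bit order of a 32-bit word. */
static uint32_t bitrev32_sketch(uint32_t x)
{
	uint32_t r = 0;
	int i;

	for (i = 0; i < 32; i++, x >>= 1)
		r = (r << 1) | (x & 1);
	return r;
}

/* Set the filter bit for one 6-byte address; returns the bit number. */
static int mc_filter_set(uint32_t filter[2], const unsigned char *addr)
{
	uint32_t crc = bitrev32_sketch(crc32_le_sketch(~0u, addr, 6));
	int bit_nr = (int)(crc >> 26);		/* top 6 bits: 0..63 */

	assert(bit_nr >= 0 && bit_nr < 64);
	filter[bit_nr >> 5] |= 1u << (bit_nr & 31);
	return bit_nr;
}
```

The top six bits of the CRC select one of 64 bits spread over the two 32-bit filter words, so distinct addresses can share a bit and the filter is only approximate.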

* [PATCH net-next-2.6 1/4] net: introduce mc list helpers
From: Jiri Pirko @ 2009-10-22 13:52 UTC (permalink / raw)
  To: netdev
  Cc: davem, eric.dumazet, jeffrey.t.kirsher, jesse.brandeburg,
	bruce.w.allan, peter.p.waskiewicz.jr, john.ronciak, e1000-devel,
	mchehab, linux-media
In-Reply-To: <20091022135120.GC2868@psychotron.lab.eng.brq.redhat.com>

These helpers should be used by network drivers to access netdev
multicast lists.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
 include/linux/netdevice.h |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8380009..7edc4a6 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -921,6 +921,28 @@ struct net_device
 
 #define	NETDEV_ALIGN		32
 
+static inline int netdev_mc_count(struct net_device *dev)
+{
+	return dev->mc_count;
+}
+
+static inline bool netdev_mc_empty(struct net_device *dev)
+{
+	return netdev_mc_count(dev) == 0;
+}
+
+static inline void netdev_mc_walk(struct net_device *dev,
+				  void (*func)(void *, unsigned char *),
+				  void *data)
+{
+	struct dev_addr_list *mclist;
+	int i;
+
+	for (i = 0, mclist = dev->mc_list; mclist && i < dev->mc_count;
+	     i++, mclist = mclist->next)
+		func(data, mclist->dmi_addr);
+}
+
 static inline
 struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
 					 unsigned int index)
-- 
1.6.2.5


^ permalink raw reply related
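
The callback-walk pattern introduced by the helpers above can be exercised in userspace with the sketch below. struct mc_entry, struct fake_dev, mc_count() and mc_walk() are hypothetical stand-ins for the kernel's list entry, struct net_device, netdev_mc_count() and netdev_mc_walk(). The walker packs addresses into a flat array through a cursor, similar to what the e1000e conversion in this series does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ADDR_LEN 6	/* same value as the kernel's ETH_ALEN */

/* Hypothetical stand-ins for the kernel's list entry and net_device. */
struct mc_entry {
	struct mc_entry *next;
	unsigned char addr[ADDR_LEN];
};

struct fake_dev {
	struct mc_entry *mc_list;
	int mc_count;
};

static int mc_count(const struct fake_dev *dev)
{
	return dev->mc_count;
}

/* Call func(data, addr) for every entry; callers never touch the list. */
static void mc_walk(struct fake_dev *dev,
		    void (*func)(void *, unsigned char *), void *data)
{
	struct mc_entry *e;
	int i;

	assert(dev != NULL && func != NULL);
	for (i = 0, e = dev->mc_list; e && i < dev->mc_count; i++, e = e->next)
		func(data, e->addr);
}

/* Walker: copy each address into a flat array and advance the cursor. */
static void pack_walker(void *data, unsigned char *addr)
{
	unsigned char **cursor = data;

	memcpy(*cursor, addr, ADDR_LEN);
	*cursor += ADDR_LEN;
}
```

Because drivers only see the callback and the count, the list representation behind mc_walk() could later change without touching them, which is the point the cover letter makes about list_head.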

* [PATCH net-next-2.6 0/4] net: change the way mc_list is accessed
From: Jiri Pirko @ 2009-10-22 13:51 UTC (permalink / raw)
  To: netdev
  Cc: eric.dumazet, e1000-devel, bruce.w.allan, jesse.brandeburg,
	mchehab, john.ronciak, jeffrey.t.kirsher, davem, linux-media

In struct net_device, multicast addresses are stored in a hand-rolled linked
list. Converting this to a list_head list would require changing all
(literally all) network device drivers at once.

To avoid that, and also to make the device drivers' code prettier, I'm
introducing several multicast list helpers which can (and in the future
should) be used to access the mc list. Once all drivers use these helpers,
we can easily convert to list_head.

This patchset also includes 3 examples of how the helpers are used.

Kindly asking for review.

Thanks,

Jirka

^ permalink raw reply

* Re: xfrm transport mode policy and forward packets
From: Timo Teräs @ 2009-10-22 13:31 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev, Alexey Kuznetsov
In-Reply-To: <20091022132126.GB28893@gondor.apana.org.au>

Herbert Xu wrote:
> On Thu, Oct 22, 2009 at 03:07:28PM +0300, Timo Teräs wrote:
>> I'm using on my dmvpn environment security policies like:
>>
>> src 0.0.0.0/0 dst 0.0.0.0/0 proto gre 	dir in priority 2147483648 ptype 
>> main 	tmpl src 0.0.0.0 dst 0.0.0.0
>> 		proto esp reqid 0 mode transport
>>
>> src 0.0.0.0/0 dst 0.0.0.0/0 proto gre 	dir out priority 2147483648 ptype 
>> main 	tmpl src 0.0.0.0 dst 0.0.0.0
>> 		proto esp reqid 0 mode transport
>>
>> To make sure the locally generated/received GRE traffic is IPsec protected.
>> Now when some other non-local gre traffic is being forwarded by this router,
>> that seems to match these SPs too. Basically no one behind this router box
>> can use GRE (or PPTP).
> 
> This is expected since forwarded GRE packets match the selector
> given.

Yes. I forgot to mention explicitly that I thought just removing the
'fwd' policy would fix this. It's slightly confusing that the input path
is split into two separate policy databases, while the output path is not.

>> My ideas so far have been:
>> a) rename 'fwd' to 'infwd' and split 'out' to 'out' and 'outfwd' ?
>>   (sounds kinda intrusive)
>> b) iptables target that would be able to disable xfrm
>>
>> Any other ideas?
>> What would be the proper fix for this problem?
> 
> We could add the fwmark as a key.

Ah, sounds even better.

> Alexey and others may have better ideas on this.

Thanks!
 Timo

^ permalink raw reply

