Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] Don't allow sharing of tx skbs on xen-netfront
From: Neil Horman @ 2011-11-17 19:25 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev, David S. Miller, Jeremy Fitzhardinge,
	Konrad Rzeszutek Wilk, xen-devel
In-Reply-To: <1321543238.3664.278.camel@zakaz.uk.xensource.com>

On Thu, Nov 17, 2011 at 03:20:38PM +0000, Ian Campbell wrote:
> On Mon, 2011-11-14 at 14:22 -0500, Neil Horman wrote:
> > It was pointed out to me recently that the xen-netfront driver can't safely
> > support shared skbs on transmit, since, while it doesn't maintain skb state
> > directly, it does pass a pointer to the skb to the hypervisor via a list, and
> > the hypervisor may expect the contents of the skb to remain stable.  Clearing
> > the IFF_TX_SKB_SHARING flag after the call to alloc_etherdev to make it safe.
> 
> What are the actual constraints here? The skb is used as a handle to the
> skb->data and shinfo (frags) and to complete at the end. It's actually
> those which are passed to the hypervisor (effectively the same as
> passing those addresses to the h/w for DMA).
> 
> Which parts of the skb are expected/allowed to not remain stable?
> 
> (Appologies if the above seems naive, I seem to have missed the
> introduction of shared tx skbs and IFF_TX_SKB_SHARING)
> 
Its ok, this is the most accurate description from the previous threads on the
subject:
http://lists.openwall.net/netdev/2011/08/22/63

The basic problem boils down the notion that some drivers, when they receive an
skb in their xmit paths, presume to have sole ownership of the skb, and as a
result may do things like add the skb to a list, or otherwise store stateful
data in the skb.  If the skb is shared, thats unsafe to do, as the stack still
holds a reference to the skb, and make make changes without serializing them
against the driver.  So we have to flag those drivers which preform these kinds
of actions.  xen-netfront doesn't strictly speaking modify any state directly ni
an skb, but it does place a pointer to the skb in a data structure here:

np->tx_skbs[id].skb = skb;

Which then gets handed off to the hypervisior.  Since the hypervisor now has
access to that skb pointer, and we can't be sure (from the guest perspective),
what it does with that information, it would be better to be safe by disallowing
shared skbs in this path.

Neil

> Ian.
> 
> > 
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > CC: "David S. Miller" <davem@davemloft.net>
> > CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> > CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > CC: xen-devel@lists.xensource.com
> > ---
> >  drivers/net/xen-netfront.c |    6 ++++++
> >  1 files changed, 6 insertions(+), 0 deletions(-)
> > 
> > diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> > index 226faab..fb1077b 100644
> > --- a/drivers/net/xen-netfront.c
> > +++ b/drivers/net/xen-netfront.c
> > @@ -1252,6 +1252,12 @@ static struct net_device * __devinit xennet_create_dev(struct xenbus_device *dev
> >  		return ERR_PTR(-ENOMEM);
> >  	}
> >  
> > +	/*
> > +	 * Since frames remain on a queue after a return from xennet_start_xmit,
> > +	 * we can't support tx shared skbs
> > +	 */
> > +	netdev->priv_flags &= ~IFF_TX_SKB_SHARING;
> > +
> >  	np                   = netdev_priv(netdev);
> >  	np->xbdev            = dev;
> >  
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: pull request: wireless-next 2011-11-17
From: David Miller @ 2011-11-17 19:34 UTC (permalink / raw)
  To: linville-2XuSBdqkA4R54TAoqtyWWQ
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20111117182939.GD32622-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>

From: "John W. Linville" <linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>
Date: Thu, 17 Nov 2011 13:29:39 -0500

> Here is a huge pull request for wireless bits intended for 3.3.  I
> reckon that the extended release cycle for 3.1 led to a backlog of new
> work that appeared during the merge window for 3.2?
> 
> At any rate...there is a big Bluetooth pull with a lot of code
> refactoring; a lot of brcm80211 bits including cleanups, refactoring,
> and some new hardware support; the normal set of iwlwifi updates plus
> some refactoring; an unusual large flurry of mac80211 and cfg80211
> updates from Johannes (including the new wireless TX status socket
> option); some updates to the new NFC subsystem; and the usual collection
> of other driver and mac80211 updates.  This also includes a pull of the
> wireless.git tree to resolve some build issues.

Pulled, thanks John.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* pull request: wireless 2011-11-17
From: John W. Linville @ 2011-11-17 19:36 UTC (permalink / raw)
  To: davem; +Cc: linux-wireless, netdev, linux-kernel

Dave,

Here is a little batch of fixes intended for 3.2.  This includes a new
USB ID for rt2x00, a fix for a sleep-while-atomic bug in rt2x00, and a
fix for a potential memory leak in libertas.

Please let me know if there are problems!

Thanks,

John

---

The following changes since commit 102463b18d922dd55c29fbfa222e0355ecf3e42f:

  stmmac: fix pm functions avoiding sleep on spinlock (2011-11-17 03:13:42 -0500)

are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless.git for-davem

Gertjan van Wingerde (2):
      rt2x00: Add USB device ID of Buffalo WLI-UC-GNHP.
      rt2x00: Fix sleep-while-atomic bug in powersaving code.

Jesper Juhl (1):
      Net, libertas: Resolve memory leak in if_spi_host_to_card()

John W. Linville (1):
      Merge branch 'master' of git://git.kernel.org/.../linville/wireless into for-davem

 drivers/net/wireless/libertas/if_spi.c  |    1 +
 drivers/net/wireless/rt2x00/rt2800usb.c |    1 +
 drivers/net/wireless/rt2x00/rt2x00.h    |    1 +
 drivers/net/wireless/rt2x00/rt2x00dev.c |   22 ++++++++++++++++++++--
 4 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/libertas/if_spi.c b/drivers/net/wireless/libertas/if_spi.c
index 11b69b3..728baa4 100644
--- a/drivers/net/wireless/libertas/if_spi.c
+++ b/drivers/net/wireless/libertas/if_spi.c
@@ -995,6 +995,7 @@ static int if_spi_host_to_card(struct lbs_private *priv,
 		spin_unlock_irqrestore(&card->buffer_lock, flags);
 		break;
 	default:
+		kfree(packet);
 		netdev_err(priv->dev, "can't transfer buffer of type %d\n",
 			   type);
 		err = -EINVAL;
diff --git a/drivers/net/wireless/rt2x00/rt2800usb.c b/drivers/net/wireless/rt2x00/rt2800usb.c
index f156579..3778763 100644
--- a/drivers/net/wireless/rt2x00/rt2800usb.c
+++ b/drivers/net/wireless/rt2x00/rt2800usb.c
@@ -919,6 +919,7 @@ static struct usb_device_id rt2800usb_device_table[] = {
 	{ USB_DEVICE(0x050d, 0x935b) },
 	/* Buffalo */
 	{ USB_DEVICE(0x0411, 0x00e8) },
+	{ USB_DEVICE(0x0411, 0x0158) },
 	{ USB_DEVICE(0x0411, 0x016f) },
 	{ USB_DEVICE(0x0411, 0x01a2) },
 	/* Corega */
diff --git a/drivers/net/wireless/rt2x00/rt2x00.h b/drivers/net/wireless/rt2x00/rt2x00.h
index 2ec5c00..99ff12d 100644
--- a/drivers/net/wireless/rt2x00/rt2x00.h
+++ b/drivers/net/wireless/rt2x00/rt2x00.h
@@ -943,6 +943,7 @@ struct rt2x00_dev {
 	 * Powersaving work
 	 */
 	struct delayed_work autowakeup_work;
+	struct work_struct sleep_work;
 
 	/*
 	 * Data queue arrays for RX, TX, Beacon and ATIM.
diff --git a/drivers/net/wireless/rt2x00/rt2x00dev.c b/drivers/net/wireless/rt2x00/rt2x00dev.c
index e1fb2a8..edd317f 100644
--- a/drivers/net/wireless/rt2x00/rt2x00dev.c
+++ b/drivers/net/wireless/rt2x00/rt2x00dev.c
@@ -465,6 +465,23 @@ static u8 *rt2x00lib_find_ie(u8 *data, unsigned int len, u8 ie)
 	return NULL;
 }
 
+static void rt2x00lib_sleep(struct work_struct *work)
+{
+	struct rt2x00_dev *rt2x00dev =
+	    container_of(work, struct rt2x00_dev, sleep_work);
+
+	if (!test_bit(DEVICE_STATE_PRESENT, &rt2x00dev->flags))
+		return;
+
+	/*
+	 * Check again is powersaving is enabled, to prevent races from delayed
+	 * work execution.
+	 */
+	if (!test_bit(CONFIG_POWERSAVING, &rt2x00dev->flags))
+		rt2x00lib_config(rt2x00dev, &rt2x00dev->hw->conf,
+				 IEEE80211_CONF_CHANGE_PS);
+}
+
 static void rt2x00lib_rxdone_check_ps(struct rt2x00_dev *rt2x00dev,
 				      struct sk_buff *skb,
 				      struct rxdone_entry_desc *rxdesc)
@@ -512,8 +529,7 @@ static void rt2x00lib_rxdone_check_ps(struct rt2x00_dev *rt2x00dev,
 	cam |= (tim_ie->bitmap_ctrl & 0x01);
 
 	if (!cam && !test_bit(CONFIG_POWERSAVING, &rt2x00dev->flags))
-		rt2x00lib_config(rt2x00dev, &rt2x00dev->hw->conf,
-				 IEEE80211_CONF_CHANGE_PS);
+		queue_work(rt2x00dev->workqueue, &rt2x00dev->sleep_work);
 }
 
 static int rt2x00lib_rxdone_read_signal(struct rt2x00_dev *rt2x00dev,
@@ -1141,6 +1157,7 @@ int rt2x00lib_probe_dev(struct rt2x00_dev *rt2x00dev)
 
 	INIT_WORK(&rt2x00dev->intf_work, rt2x00lib_intf_scheduled);
 	INIT_DELAYED_WORK(&rt2x00dev->autowakeup_work, rt2x00lib_autowakeup);
+	INIT_WORK(&rt2x00dev->sleep_work, rt2x00lib_sleep);
 
 	/*
 	 * Let the driver probe the device to detect the capabilities.
@@ -1197,6 +1214,7 @@ void rt2x00lib_remove_dev(struct rt2x00_dev *rt2x00dev)
 	 */
 	cancel_work_sync(&rt2x00dev->intf_work);
 	cancel_delayed_work_sync(&rt2x00dev->autowakeup_work);
+	cancel_work_sync(&rt2x00dev->sleep_work);
 	if (rt2x00_is_usb(rt2x00dev)) {
 		del_timer_sync(&rt2x00dev->txstatus_timer);
 		cancel_work_sync(&rt2x00dev->rxdone_work);
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply related

* [PATCH FIX net-next v1] net-forcedeth: fix possible stats inaccuracies on 32b hosts
From: David Decotigny @ 2011-11-17 19:38 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: David S. Miller, Eric Dumazet, Ian Campbell, Rick Jones,
	David Decotigny
In-Reply-To: <cover.1321558587.git.david.decotigny@google.com>

The software stats are updated from BH, this change ensures that 32b
UP hosts use appropriate protection.

Tested:
  - HW/SW stats consistent with pktgen (in particular tx=rx)
  - HW/SW stats consistent when tx/rx offloads disabled
  - no problem with+without lockdep (SMP 16-way)



Signed-off-by: David Decotigny <david.decotigny@google.com>
---
 drivers/net/ethernet/nvidia/forcedeth.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c
index 0d8d5c0..4990534 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -1756,19 +1756,19 @@ nv_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *storage)
 
 	/* software stats */
 	do {
-		syncp_start = u64_stats_fetch_begin(&np->swstats_rx_syncp);
+		syncp_start = u64_stats_fetch_begin_bh(&np->swstats_rx_syncp);
 		storage->rx_packets       = np->stat_rx_packets;
 		storage->rx_bytes         = np->stat_rx_bytes;
 		storage->rx_dropped       = np->stat_rx_dropped;
 		storage->rx_missed_errors = np->stat_rx_missed_errors;
-	} while (u64_stats_fetch_retry(&np->swstats_rx_syncp, syncp_start));
+	} while (u64_stats_fetch_retry_bh(&np->swstats_rx_syncp, syncp_start));
 
 	do {
-		syncp_start = u64_stats_fetch_begin(&np->swstats_tx_syncp);
+		syncp_start = u64_stats_fetch_begin_bh(&np->swstats_tx_syncp);
 		storage->tx_packets = np->stat_tx_packets;
 		storage->tx_bytes   = np->stat_tx_bytes;
 		storage->tx_dropped = np->stat_tx_dropped;
-	} while (u64_stats_fetch_retry(&np->swstats_tx_syncp, syncp_start));
+	} while (u64_stats_fetch_retry_bh(&np->swstats_tx_syncp, syncp_start));
 
 	/* If the nic supports hw counters then retrieve latest values */
 	if (np->driver_data & DEV_HAS_STATISTICS_V123) {
-- 
1.7.3.1

^ permalink raw reply related

* Re: [PATCH FIX net-next v1] net-forcedeth: fix possible stats inaccuracies on 32b hosts
From: Eric Dumazet @ 2011-11-17 20:08 UTC (permalink / raw)
  To: David Decotigny
  Cc: netdev, linux-kernel, David S. Miller, Ian Campbell, Rick Jones
In-Reply-To: <34add86253ddd91d5d4d4c7fee9c90f63a8d8edf.1321558587.git.david.decotigny@google.com>

Le jeudi 17 novembre 2011 à 11:38 -0800, David Decotigny a écrit :
> The software stats are updated from BH, this change ensures that 32b
> UP hosts use appropriate protection.
> 
> Tested:
>   - HW/SW stats consistent with pktgen (in particular tx=rx)
>   - HW/SW stats consistent when tx/rx offloads disabled
>   - no problem with+without lockdep (SMP 16-way)
> 
> 
> 
> Signed-off-by: David Decotigny <david.decotigny@google.com>

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

^ permalink raw reply

* Re: [PATCH] Don't allow sharing of tx skbs on xen-netfront
From: Ian Campbell @ 2011-11-17 20:17 UTC (permalink / raw)
  To: Neil Horman
  Cc: netdev@vger.kernel.org, David S. Miller, Jeremy Fitzhardinge,
	Konrad Rzeszutek Wilk, xen-devel@lists.xensource.com
In-Reply-To: <20111117192535.GB26847@hmsreliant.think-freely.org>

On Thu, 2011-11-17 at 19:25 +0000, Neil Horman wrote:
> On Thu, Nov 17, 2011 at 03:20:38PM +0000, Ian Campbell wrote:
> > On Mon, 2011-11-14 at 14:22 -0500, Neil Horman wrote:
> > > It was pointed out to me recently that the xen-netfront driver can't safely
> > > support shared skbs on transmit, since, while it doesn't maintain skb state
> > > directly, it does pass a pointer to the skb to the hypervisor via a list, and
> > > the hypervisor may expect the contents of the skb to remain stable.  Clearing
> > > the IFF_TX_SKB_SHARING flag after the call to alloc_etherdev to make it safe.
> > 
> > What are the actual constraints here? The skb is used as a handle to the
> > skb->data and shinfo (frags) and to complete at the end. It's actually
> > those which are passed to the hypervisor (effectively the same as
> > passing those addresses to the h/w for DMA).
> > 
> > Which parts of the skb are expected/allowed to not remain stable?
> > 
> > (Appologies if the above seems naive, I seem to have missed the
> > introduction of shared tx skbs and IFF_TX_SKB_SHARING)
> > 
> Its ok, this is the most accurate description from the previous threads on the
> subject:
> http://lists.openwall.net/netdev/2011/08/22/63
> 
> The basic problem boils down the notion that some drivers, when they receive an
> skb in their xmit paths, presume to have sole ownership of the skb, and as a
> result may do things like add the skb to a list, or otherwise store stateful
> data in the skb.  If the skb is shared, thats unsafe to do, as the stack still
> holds a reference to the skb, and make make changes without serializing them
> against the driver.  So we have to flag those drivers which preform these kinds
> of actions.  xen-netfront doesn't strictly speaking modify any state directly ni
> an skb, but it does place a pointer to the skb in a data structure here:
> 
> np->tx_skbs[id].skb = skb;
> 
> Which then gets handed off to the hypervisior.  Since the hypervisor now has
> access to that skb pointer, and we can't be sure (from the guest perspective),
> what it does with that information, it would be better to be safe by disallowing
> shared skbs in this path.

The skb pointer itself doesn't get given to the backend/hypervisor. The
page which skb->data refers to is granted to the backend domain, as are
the pages in the frags.

I think we only need to be sure that the frontend doesn't rely on
anything in the skb itself, right? Does skb->data or shinfo count from
that perspective?

Ian.

> 
> Neil
> 
> > Ian.
> > 
> > > 
> > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > CC: "David S. Miller" <davem@davemloft.net>
> > > CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> > > CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > > CC: xen-devel@lists.xensource.com
> > > ---
> > >  drivers/net/xen-netfront.c |    6 ++++++
> > >  1 files changed, 6 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> > > index 226faab..fb1077b 100644
> > > --- a/drivers/net/xen-netfront.c
> > > +++ b/drivers/net/xen-netfront.c
> > > @@ -1252,6 +1252,12 @@ static struct net_device * __devinit xennet_create_dev(struct xenbus_device *dev
> > >  		return ERR_PTR(-ENOMEM);
> > >  	}
> > >  
> > > +	/*
> > > +	 * Since frames remain on a queue after a return from xennet_start_xmit,
> > > +	 * we can't support tx shared skbs
> > > +	 */
> > > +	netdev->priv_flags &= ~IFF_TX_SKB_SHARING;
> > > +
> > >  	np                   = netdev_priv(netdev);
> > >  	np->xbdev            = dev;
> > >  
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 

^ permalink raw reply

* Re: [PATCH 0/4] skb paged fragment destructors
From: Michał Mirosław @ 2011-11-17 20:22 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Eric Dumazet, David Miller, Jesse Brandeburg,
	netdev@vger.kernel.org
In-Reply-To: <1321541121.3664.270.camel@zakaz.uk.xensource.com>

2011/11/17 Ian Campbell <Ian.Campbell@citrix.com>:
> On Wed, 2011-11-09 at 17:49 +0000, Eric Dumazet wrote:
>> struct skb_frag_page_desc {
>>       struct page *p;
>>       atomic_t ref;
>>       int (*destroy)(void *data);
>> /*    void *data; */ /* no need, see container_of() */
>
> It turns out that container_of is not so useful here as the users
> typically has a list of pages not a single page and hence has a list of
> destructors too. What you actually need is the container of the pointer
> to that list, IYSWIM, which you can't get at given only a pointer to an
> element of the list. So you end up doing
>
>        struct subsys_page_desc {
>                struct subsys_container *container;
>                struct sbk_frag_page_desc;
>        }
>
> *container here is basically the same as the void * so you might as well
> include it in the base datastructure.

If you reverse the order fields in subsys_page_desc then
container_of() compiles to no-op, and you don't have to impose this
extra field for all users, even if they don't need it.

If you see it useful, then there could be base skb_frag_page_desc and
then, skb_frag_page_desc_with_data (or othre) extending it. Core code
only ever needs the base structure.

BTW, destroy() prototype should be:

void destroy(struct skb_frag_page_desc *desc);

Because: 1. you know the parameter's type, 2. whatever would be
returned is not going to be useful.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: [patch net-next 1/2] team: add fix_features
From: Michał Mirosław @ 2011-11-17 20:28 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, eric.dumazet, bhutchings, shemminger, andy, fbl,
	jzupka, ivecera
In-Reply-To: <1321539365-1125-1-git-send-email-jpirko@redhat.com>

2011/11/17 Jiri Pirko <jpirko@redhat.com>:
> do fix features in similar way as bonding code does
[...]

Looks the same like in bonding. ;)

Best Regards,
Michał Mirosław

^ permalink raw reply

* [PATCH] Fix wrong parameter given to sizeof call
From: Thomas Jarosch @ 2011-11-17 20:33 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

"remote_list" is of type

struct dma_chunk remote_list[VETH_MAX_FRAMES_PER_MSG];

Probably a copy'n'paste error.

Signed-off-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
---
 drivers/net/ethernet/ibm/iseries_veth.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/ibm/iseries_veth.c b/drivers/net/ethernet/ibm/iseries_veth.c
index 4326681..acc31af 100644
--- a/drivers/net/ethernet/ibm/iseries_veth.c
+++ b/drivers/net/ethernet/ibm/iseries_veth.c
@@ -1421,7 +1421,7 @@ static void veth_receive(struct veth_lpar_connection *cnx,
 
 		/* FIXME: do we need this? */
 		memset(local_list, 0, sizeof(local_list));
-		memset(remote_list, 0, sizeof(VETH_MAX_FRAMES_PER_MSG));
+		memset(remote_list, 0, sizeof(remote_list));
 
 		/* a 0 address marks the end of the valid entries */
 		if (senddata->addr[startchunk] == 0)
-- 
1.7.6.4

^ permalink raw reply related

* Re: [PATCH] Don't allow sharing of tx skbs on xen-netfront
From: Neil Horman @ 2011-11-17 20:45 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev@vger.kernel.org, David S. Miller, Jeremy Fitzhardinge,
	Konrad Rzeszutek Wilk, xen-devel@lists.xensource.com
In-Reply-To: <1321561021.8866.12.camel@dagon.hellion.org.uk>

On Thu, Nov 17, 2011 at 08:17:01PM +0000, Ian Campbell wrote:
> On Thu, 2011-11-17 at 19:25 +0000, Neil Horman wrote:
> > On Thu, Nov 17, 2011 at 03:20:38PM +0000, Ian Campbell wrote:
> > > On Mon, 2011-11-14 at 14:22 -0500, Neil Horman wrote:
> > > > It was pointed out to me recently that the xen-netfront driver can't safely
> > > > support shared skbs on transmit, since, while it doesn't maintain skb state
> > > > directly, it does pass a pointer to the skb to the hypervisor via a list, and
> > > > the hypervisor may expect the contents of the skb to remain stable.  Clearing
> > > > the IFF_TX_SKB_SHARING flag after the call to alloc_etherdev to make it safe.
> > > 
> > > What are the actual constraints here? The skb is used as a handle to the
> > > skb->data and shinfo (frags) and to complete at the end. It's actually
> > > those which are passed to the hypervisor (effectively the same as
> > > passing those addresses to the h/w for DMA).
> > > 
> > > Which parts of the skb are expected/allowed to not remain stable?
> > > 
> > > (Appologies if the above seems naive, I seem to have missed the
> > > introduction of shared tx skbs and IFF_TX_SKB_SHARING)
> > > 
> > Its ok, this is the most accurate description from the previous threads on the
> > subject:
> > http://lists.openwall.net/netdev/2011/08/22/63
> > 
> > The basic problem boils down the notion that some drivers, when they receive an
> > skb in their xmit paths, presume to have sole ownership of the skb, and as a
> > result may do things like add the skb to a list, or otherwise store stateful
> > data in the skb.  If the skb is shared, thats unsafe to do, as the stack still
> > holds a reference to the skb, and make make changes without serializing them
> > against the driver.  So we have to flag those drivers which preform these kinds
> > of actions.  xen-netfront doesn't strictly speaking modify any state directly ni
> > an skb, but it does place a pointer to the skb in a data structure here:
> > 
> > np->tx_skbs[id].skb = skb;
> > 
> > Which then gets handed off to the hypervisior.  Since the hypervisor now has
> > access to that skb pointer, and we can't be sure (from the guest perspective),
> > what it does with that information, it would be better to be safe by disallowing
> > shared skbs in this path.
> 
> The skb pointer itself doesn't get given to the backend/hypervisor. The
> page which skb->data refers to is granted to the backend domain, as are
> the pages in the frags.
> 
> I think we only need to be sure that the frontend doesn't rely on
> anything in the skb itself, right? Does skb->data or shinfo count from
> that perspective?
shinfo is definately a problem, as other devices may make modifications to it.
skb->data is probably safer, but is also potentially suspect (for instance if
another device appends an additional header to the data for instance)
Neil

^ permalink raw reply

* Re: [PATCH 4/5] tcp: allow undo from reordered DSACKs
From: Neal Cardwell @ 2011-11-17 20:49 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: David Miller, Netdev, Nandita Dukkipati, Yuchung Cheng,
	Tom Herbert
In-Reply-To: <alpine.DEB.2.00.1111170715060.23028@melkinpaasi.cs.helsinki.fi>

On Thu, Nov 17, 2011 at 12:18 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
> On Wed, 16 Nov 2011, Neal Cardwell wrote:
>
>> Previously, SACK-enabled connections hung around in TCP_CA_Disorder
>> state while snd_una==high_seq, just waiting to accumulate DSACKs and
>> hopefully undo a cwnd reduction. This could and did lead to the
>> following unfortunate scenario: if some incoming ACKs advance snd_una
>> beyond high_seq then we were setting undo_marker to 0 and moving to
>> TCP_CA_Open, so if (due to reordering in the ACK return path) we
>> shortly thereafter received a DSACK then we were no longer able to
>> undo the cwnd reduction.
>>
>> The change: Simplify the congestion avoidance state machine by
>> removing the behavior where SACK-enabled connections hung around in
>> the TCP_CA_Disorder state just waiting for DSACKs. Instead, when
>> snd_una advances to high_seq or beyond we typically move to
>> TCP_CA_Open immediately and allow an undo in either TCP_CA_Open or
>> TCP_CA_Disorder if we later receive enough DSACKs.
>>
>> Other patches in this series will provide other changes that are
>> necessary to fully fix this problem.
>>
>> Signed-off-by: Neal Cardwell <ncardwell@google.com>
>> ---
>>  net/ipv4/tcp_input.c |   15 ++-------------
>>  1 files changed, 2 insertions(+), 13 deletions(-)
>>
>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index 751d390..a4efdd7 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -2858,7 +2858,7 @@ static void tcp_try_keep_open(struct sock *sk)
>>       struct tcp_sock *tp = tcp_sk(sk);
>>       int state = TCP_CA_Open;
>>
>> -     if (tcp_left_out(tp) || tcp_any_retrans_done(sk) || tp->undo_marker)
>> +     if (tcp_left_out(tp) || tcp_any_retrans_done(sk))
>>               state = TCP_CA_Disorder;
>>
>>       if (inet_csk(sk)->icsk_ca_state != state) {
>> @@ -3066,17 +3066,6 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
>>                       }
>>                       break;
>>
>> -             case TCP_CA_Disorder:
>> -                     tcp_try_undo_dsack(sk);
>> -                     if (!tp->undo_marker ||
>> -                         /* For SACK case do not Open to allow to undo
>> -                          * catching for all duplicate ACKs. */
>> -                         tcp_is_reno(tp) || tp->snd_una != tp->high_seq) {
>> -                             tp->undo_marker = 0;
>> -                             tcp_set_ca_state(sk, TCP_CA_Open);
>> -                     }
>> -                     break;
>> -
>>               case TCP_CA_Recovery:
>>                       if (tcp_is_reno(tp))
>>                               tcp_reset_reno_sack(tp);
>> @@ -3117,7 +3106,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
>>                               tcp_add_reno_sack(sk);
>>               }
>>
>> -             if (icsk->icsk_ca_state == TCP_CA_Disorder)
>> +             if (icsk->icsk_ca_state <= TCP_CA_Disorder)
>>                       tcp_try_undo_dsack(sk);
>>
>>               if (!tcp_time_to_recover(sk)) {
>
> How about extending Disorder state until second cumulative ACK that is
> acking >= high_seq?

That would seem to add complexity but only provide a partial solution.

This proposed patch has the virtue of providing a general solution
while simplifying the code a little.

What are your concerns with this patch?

Thanks, Ilpo, for taking a look at this patch sequence,
neal

^ permalink raw reply

* Re: root_lock vs. device's TX lock
From: Andy Fleming @ 2011-11-17 20:55 UTC (permalink / raw)
  To: Dave Taht; +Cc: Eric Dumazet, Tom Herbert, Linux Netdev List
In-Reply-To: <CAA93jw6VfGKAk_gBZOjyCCcR0tCSc8PMjOoPv6Qi8m+P1idp4Q@mail.gmail.com>

On Thu, Nov 17, 2011 at 12:19 PM, Dave Taht <dave.taht@gmail.com> wrote:
> On Thu, Nov 17, 2011 at 6:26 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Le jeudi 17 novembre 2011 à 17:34 +0100, Eric Dumazet a écrit :
>>> Le jeudi 17 novembre 2011 à 08:10 -0800, Tom Herbert a écrit :
>>> > From sch_direct_xmit:
>>> >
>>> >         /* And release qdisc */
>>> >         spin_unlock(root_lock);
>>> >
>>> >         HARD_TX_LOCK(dev, txq, smp_processor_id());
>>> >         if (!netif_tx_queue_frozen_or_stopped(txq))
>>> >                 ret = dev_hard_start_xmit(skb, dev, txq);
>>> >
>>> >         HARD_TX_UNLOCK(dev, txq);
>>> >
>>> >         spin_lock(root_lock);
>>> >
>>> > This is a lot of lock manipulation to basically switch from one lock
>>> > to another and possibly thrashing just to send a packet.  I am
>>> > thinking that if the there is a 1-1 correspondence between qdisc and
>>> > device queue then we could actually use the device's lock as the root
>>> > lock for the qdisc.  So in that case, we would need to touch any locks
>>> > from sch_direct_xmit (just hold root lock which is already device lock
>>> > for the duration).
>>> >
>>> > Is there any reason why this couldn't work?
>>>
>>> But we have to dirty part of Qdisc anyway ?
>>> (state, bstats, q, ...)
>>>
>>
>> Also we want to permit other cpus to enqueue packets to Qdisc while our
>> cpu is busy in network driver ndo_start_xmit()
>>
>> For complex Qdisc / tc setups (potentially touching a lot of cache
>> lines), we could eventually add a small ring buffer so that the cpu
>> doing the ndo_start_xmit() also queues the packets into Qdisc.
>>
>> This ringbuffer could use a lockless algo. (we currently use the
>> secondary 'busylock' to serialize other cpus, but each cpu calls qdisc
>> enqueue itself.)
>
> I was thinking ringbuffering might also help in adding a 'grouper'
> abstraction to the dequeuing side.

Actually, I'm interested in circumventing *both* locks. Our SoC has
some quite-versatile queueing infrastructure, such that (for many
queueing setups) we can do all of the queueing in hardware, using
per-cpu access portals. By hacking around the qdisc lock, and using a
tx queue per core, we were able to achieve a significant speedup.

Right now, it's a big hack, but I'd like to find a way that we could
provide support for this. My initial thought was to craft a qdisc
implementation for our hardware, but the root_lock means doing so
would not yield any performance gain.

Andy

^ permalink raw reply

* Re: [PATCH 1/6] sky2: fix hang on shutdown (and other irq issues)
From: Sven Joachim @ 2011-11-17 20:57 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: davem, netdev
In-Reply-To: <20111117111803.11d152b4@nehalam.linuxnetplumber.net>

On 2011-11-17 20:18 +0100, Stephen Hemminger wrote:

> Does this fix it?

Yes, thank you.  However, I had a more serious problem unrelated to that
patch; more on that in another mail.

> --- a/drivers/net/ethernet/marvell/sky2.c	2011-11-17 11:10:35.938076778 -0800
> +++ b/drivers/net/ethernet/marvell/sky2.c	2011-11-17 11:15:49.527882932 -0800
> @@ -1723,6 +1723,8 @@ static int sky2_setup_irq(struct sky2_hw
>  	if (err)
>  		dev_err(&pdev->dev, "cannot assign irq %d\n", pdev->irq);
>  	else {
> +		hw->flags |= SKY2_HW_IRQ_SETUP;
> +
>  		napi_enable(&hw->napi);
>  		sky2_write32(hw, B0_IMSK, Y2_IS_BASE);
>  		sky2_read32(hw, B0_IMSK);
> @@ -2120,6 +2122,7 @@ static int sky2_close(struct net_device
>  
>  		napi_disable(&hw->napi);
>  		free_irq(hw->pdev->irq, hw);
> +		hw->flags &= ~SKY2_HW_IRQ_SETUP;
>  	} else {
>  		u32 imask;
>  
> @@ -3423,12 +3426,13 @@ static void sky2_all_down(struct sky2_hw
>  {
>  	int i;
>  
> -	sky2_read32(hw, B0_IMSK);
> -	sky2_write32(hw, B0_IMSK, 0);
> +	if (hw->flags & SKY2_HW_IRQ_SETUP) {
> +		sky2_read32(hw, B0_IMSK);
> +		sky2_write32(hw, B0_IMSK, 0);
>  
> -	if (hw->ports > 1 || netif_running(hw->dev[0]))
>  		synchronize_irq(hw->pdev->irq);
> -	napi_disable(&hw->napi);
> +		napi_disable(&hw->napi);
> +	}
>  
>  	for (i = 0; i < hw->ports; i++) {
>  		struct net_device *dev = hw->dev[i];
> @@ -3445,7 +3449,7 @@ static void sky2_all_down(struct sky2_hw
>  
>  static void sky2_all_up(struct sky2_hw *hw)
>  {
> -	u32 imask = 0;
> +	u32 imask = Y2_IS_BASE;
>  	int i;
>  
>  	for (i = 0; i < hw->ports; i++) {
> @@ -3461,8 +3465,7 @@ static void sky2_all_up(struct sky2_hw *
>  		netif_wake_queue(dev);
>  	}
>  
> -	if (imask || hw->ports > 1) {
> -		imask |= Y2_IS_BASE;
> +	if (hw->flags & SKY2_HW_IRQ_SETUP) {
>  		sky2_write32(hw, B0_IMSK, imask);
>  		sky2_read32(hw, B0_IMSK);
>  		sky2_read32(hw, B0_Y2_SP_LISR);
> --- a/drivers/net/ethernet/marvell/sky2.h	2011-11-16 15:15:40.964280527 -0800
> +++ b/drivers/net/ethernet/marvell/sky2.h	2011-11-17 11:13:00.154858718 -0800
> @@ -2287,6 +2287,7 @@ struct sky2_hw {
>  #define SKY2_HW_RSS_BROKEN	0x00000100
>  #define SKY2_HW_VLAN_BROKEN     0x00000200
>  #define SKY2_HW_RSS_CHKSUM	0x00000400	/* RSS requires chksum */
> +#define SKY2_HW_IRQ_SETUP	0x00000800
>  
>  	u8	     	     chip_id;
>  	u8		     chip_rev;

^ permalink raw reply

* Re: [PATCH] bonding: Don't allow mode change via sysfs with slaves present
From: David Miller @ 2011-11-17 21:04 UTC (permalink / raw)
  To: vfalico; +Cc: netdev, andy, fubar
In-Reply-To: <1321375482-8637-1-git-send-email-vfalico@redhat.com>

From: Veaceslav Falico <vfalico@redhat.com>
Date: Tue, 15 Nov 2011 17:44:42 +0100

> When changing mode via bonding's sysfs, the slaves are not initialized
> correctly. Forbid to change modes with slaves present to ensure that every
> slave is initialized correctly via bond_enslave().
> 
> Signed-off-by: Veaceslav Falico <vfalico@redhat.com>

Lots of discussion... what is the final verdict on this patch bonding folks?

^ permalink raw reply

* Re: [PATCH net-next] bnx2: switch to build_skb() infrastructure
From: David Miller @ 2011-11-17 21:04 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, mchan, eilong
In-Reply-To: <1321378205.2856.24.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 15 Nov 2011 18:30:05 +0100

> This is very similar to bnx2x conversion, but bnx2 only requires 16bytes
> alignement at start of the received frame to store its l2_fhdr, so goal
> was not to reduce skb truesize (in fact it should not change after this
> patch)
> 
> Using build_skb() reduces cache line misses in the driver, since we
> use cache hot skb instead of cold ones. Number of in-flight sk_buff
> structures is lower, they are more likely recycled in SLUB caches
> while still hot.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Michael Chan <mchan@broadcom.com>
> CC: Eilon Greenstein <eilong@broadcom.com>

Broadcom folks, please review.

^ permalink raw reply

* Re: [PATCH 4/6] sky2: reduce default Tx ring size
From: Sven Joachim @ 2011-11-17 21:07 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: davem, netdev
In-Reply-To: <20111116234344.813201788@vyatta.com>

On 2011-11-17 00:42 +0100, Stephen Hemminger wrote:

> The default Tx ring size for the sky2 driver is quite large and could
> cause excess buffer bloat for many users. The minimum ring size 
> possible and still allow handling  the worst case packet on 64bit platforms
> is 38 which gets rounded up to a power of 2. But most packets only require
> a couple of ring elements.
>
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>
> --- a/drivers/net/ethernet/marvell/sky2.c	2011-11-16 15:19:39.518659262 -0800
> +++ b/drivers/net/ethernet/marvell/sky2.c	2011-11-16 15:19:40.990692544 -0800
> @@ -68,7 +68,7 @@
>  #define MAX_SKB_TX_LE	(2 + (sizeof(dma_addr_t)/sizeof(u32))*(MAX_SKB_FRAGS+1))
>  #define TX_MIN_PENDING		(MAX_SKB_TX_LE+1)
>  #define TX_MAX_PENDING		1024
> -#define TX_DEF_PENDING		127
> +#define TX_DEF_PENDING		63
>  
>  #define TX_WATCHDOG		(5 * HZ)
>  #define NAPI_WEIGHT		64

It's hard to believe, but this innocuous-looking patch caused my system
to crash and burn as soon as I actually used the network.
Unfortunately, I have neither a digital camera nor a serial console at
hand, but from my recollections and having read sky2.c I conclude that

	BUG_ON(done >= sky2->tx_ring_size);

in sky2_tx_complete() seems to have been triggered.  Does this make any
sense?

Cheers,
       Sven

^ permalink raw reply

* Re: [PATCH] r8169: add module param for control of ASPM disable
From: Matthew Garrett @ 2011-11-17 21:26 UTC (permalink / raw)
  To: Todd Broch
  Cc: Francois Romieu, Realtek linux nic maintainers, netdev,
	Hayes Wang
In-Reply-To: <CA+iF6RqKScip+6w-tzo3sF+hEsYDUefED29KEBrx5TDaZaTEbA@mail.gmail.com>

On Thu, Nov 17, 2011 at 01:20:46PM -0800, Todd Broch wrote:

> Looking a bit more at the driver a lot of configuration / feature choice is
> tied to mac_version.  Would this be a better solution (partial patch ...
> will submit properly later)

I'd be mildly surprised if the mac's the relevant constraint here, 
although it's possible. This patch is almost certainly safe, but if 
you're able to check with the hardware vendor you're working with that 
would be even better.

-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply

* Re: [PATCH] bonding: Don't allow mode change via sysfs with slaves present
From: Nicolas de Pesloüan @ 2011-11-17 21:28 UTC (permalink / raw)
  To: Andy Gospodarek; +Cc: Veaceslav Falico, netdev, Jay Vosburgh, Guus Sliepen
In-Reply-To: <20111116220200.GF25132@gospo.rdu.redhat.com>

Le 16/11/2011 23:02, Andy Gospodarek a écrit :

<snip>

> I was looking at ifenslave 1.1.0-20.  If you look at Debian bug #641250
> you will see a very similar report to what prompted Veaceslav to come up
> with this patch and post it here.

I completely missed this version. I recently reinstalled my system and forgot to add unstable to 
source.list.

> ifenslave-2.6 (1.1.0-20) unstable; urgency=low
>
>    * Use dashes consistently for bonding options in README.Debian.
>      Closes: #639244
>    * Enslave slaves only after fully setting up the master. Closes: #641250
>    * Add build-arch and build-indep targets to debian/rules.
>
>   -- Guus Sliepen<guus@debian.org>   Mon, 14 Nov 2011 11:36:21 +0100

Having a look at the change made to fix the bug described in #641250, I anticipate it will cause 
some regressions because some of the actions taken in setup_master must be done after enslavement 
and are now done before:

- primary must be set after mode (because only supported in some modes) and after enslavement.
- primary_reselect should be set after mode (because only supported in some modes), after 
enslavement and after primary.
- queue_id must be set after enslavement.
- active_slave must be set after mode and after enslavement.

I will prepare a -21 version.

<snip>

> Since this problem seems like a pretty major problem and now Debian,
> Fedora, RHEL, and Ubuntu all seem to have proper initialization scripts
> to handle it, I stand behind my original ACK.

You are right.

Acked-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>

	Nicolas.

^ permalink raw reply

* Re: [PATCH v2 net-next] net: use jump_label to shortcut RPS if not setup
From: Eric Dumazet @ 2011-11-17 21:34 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Tom Herbert
In-Reply-To: <1321535606.2751.35.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

Le jeudi 17 novembre 2011 à 14:13 +0100, Eric Dumazet a écrit :
> Most machines dont use RPS/RFS, and pay a fair amount of instructions in
> netif_receive_skb() / netif_rx() / get_rps_cpu() just to discover
> RPS/RFS is not setup.
> 
> Add a jump_label named rps_needed.
> 
> If no device rps_map or global rps_sock_flow_table is setup,
> netif_receive_skb() / netif_rx() do a single instruction instead of many
> ones, including conditional jumps.
> 
> jmp +0    (if CONFIG_JUMP_LABEL=y)
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Tom Herbert <therbert@google.com>

I wonder if we could use this jump_label infrastructure to shortcut
expensive netfilter nf_hook_slow() / nf_iterate on empty
tables/chains...

^ permalink raw reply

* Re: [Devel] Re: [PATCH v5 00/10] per-cgroup tcp memory pressure
From: David Miller @ 2011-11-17 21:35 UTC (permalink / raw)
  To: jbottomley
  Cc: eric.dumazet, glommer, linux-kernel, netdev, paul, lizf, linux-mm,
	devel, kirill, gthelen, kamezawa.hiroyu
In-Reply-To: <1321381632.3021.57.camel@dabdike.int.hansenpartnership.com>

From: James Bottomley <jbottomley@parallels.com>
Date: Tue, 15 Nov 2011 18:27:12 +0000

> Ping on this, please.  We're blocked on this patch set until we can get
> an ack that the approach is acceptable to network people.

__sk_mem_schedule is now more expensive, because instead of short-circuiting
the majority of the function's logic when "allocated <= prot->sysctl_mem[0]"
and immediately returning 1, the whole rest of the function is run.

The static branch protecting all of the cgroup code seems to be
enabled if any memory based cgroup'ing is enabled.  What if people use
the memory cgroup facility but not for sockets?  I am to understand
that, of the very few people who are going to use this stuff in any
capacity, this would be a common usage.

TCP specific stuff in mm/memcontrol.c, at best that's not nice at all.

Otherwise looks mostly good.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH 0/2] net: Add network priority cgroup (v2)
From: Neil Horman @ 2011-11-17 21:47 UTC (permalink / raw)
  To: netdev; +Cc: Neil Horman, John Fastabend, Robert Love, David S. Miller
In-Reply-To: <1321476666-8225-1-git-send-email-nhorman@tuxdriver.com>

Data Center Bridging environments are currently somewhat limited in their
ability to provide a general mechanism for controlling traffic priority.
Specifically they are unable to administratively control the priority at which
various types of network traffic are sent.

Currently, the only ways to set the priority of a network buffer are:

1) Through the use of the SO_PRIORITY socket option
2) By using low level hooks, like a tc action

(1) is difficult from an administrative perspective because it requires that the
application to be coded to not just assume the default priority is sufficient,
and must expose an administrative interface to allow priority adjustment.  Such
a solution is not scalable in a DCB environment

(2) is also difficult, as it requires constant administrative oversight of
applications so as to build appropriate rules to match traffic belonging to
various classes, so that priority can be appropriately set. It is further
limiting when DCB enabled hardware is in use, due to the fact that tc rules are
only run after a root qdisc has been selected (DCB enabled hardware may reserve
hw queues for various traffic classes and needs the priority to be set prior to
selecting the root qdisc)

I've discussed various solutions with John Fastabend, and we saw a cgroup as
being a good general solution to this problem.  The network priority cgroup
allows for a per-interface priority map to be built per cgroup.  Any traffic
originating from an application in a cgroup, that does not explicitly set its
priority with SO_PRIORITY will have its priority assigned to the value
designated for that group on that interface.  This allows a user space daemon,
when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
based on the APP_TLV value received and administratively assign applications to
that priority using the existing cgroup utility infrastructure.

Tested by John and myself, with good results

(v2)
Based on reviews from John F., Amerigo Wang and Neerav Parikh, I've cleaned up
the rcu locking, fixed a memory leak in an error path, and corrected some typos.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
CC: Robert Love <robert.w.love@intel.com>
CC: "David S. Miller" <davem@davemloft.net>

^ permalink raw reply

* [PATCH 1/2] net: add network priority cgroup infrastructure (v2)
From: Neil Horman @ 2011-11-17 21:47 UTC (permalink / raw)
  To: netdev; +Cc: Neil Horman, John Fastabend, Robert Love, David S. Miller
In-Reply-To: <1321566472-28969-1-git-send-email-nhorman@tuxdriver.com>

This patch adds in the infrastructure code to create the network priority
cgroup.  The cgroup, in addition to the standard processes file creates two
control files:

1) prioidx - This is a read-only file that exports the index of this cgroup.
This is a value that is both arbitrary and unique to a cgroup in this subsystem,
and is used to index the per-device priority map

2) priomap - This is a writeable file.  On read it reports a table of 2-tuples
<name:priority> where name is the name of a network interface and priority is
indicates the priority assigned to frames egresessing on the named interface and
originating from a pid in this cgroup

This cgroup allows for skb priority to be set prior to a root qdisc getting
selected. This is benenficial for DCB enabled systems, in that it allows for any
application to use dcb configured priorities so without application modification

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
CC: Robert Love <robert.w.love@intel.com>
CC: "David S. Miller" <davem@davemloft.net>
---
 include/linux/cgroup_subsys.h |    8 +
 include/linux/netdevice.h     |    4 +
 include/net/netprio_cgroup.h  |   66 ++++++++
 include/net/sock.h            |    3 +
 net/Kconfig                   |    7 +
 net/core/Makefile             |    1 +
 net/core/dev.c                |   13 ++
 net/core/netprio_cgroup.c     |  347 +++++++++++++++++++++++++++++++++++++++++
 net/core/sock.c               |   22 +++-
 net/socket.c                  |    2 +
 10 files changed, 472 insertions(+), 1 deletions(-)
 create mode 100644 include/net/netprio_cgroup.h
 create mode 100644 net/core/netprio_cgroup.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index ac663c1..0bd390c 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -59,8 +59,16 @@ SUBSYS(net_cls)
 SUBSYS(blkio)
 #endif
 
+/* */
+
 #ifdef CONFIG_CGROUP_PERF
 SUBSYS(perf)
 #endif
 
 /* */
+
+#ifdef CONFIG_NETPRIO_CGROUP
+SUBSYS(net_prio)
+#endif
+
+/* */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0db1f5f..750ea8e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -50,6 +50,7 @@
 #ifdef CONFIG_DCB
 #include <net/dcbnl.h>
 #endif
+#include <net/netprio_cgroup.h>
 
 struct vlan_group;
 struct netpoll_info;
@@ -1312,6 +1313,9 @@ struct net_device {
 	/* max exchange id for FCoE LRO by ddp */
 	unsigned int		fcoe_ddp_xid;
 #endif
+#if IS_ENABLED(CONFIG_NETPRIO_CGROUP)
+	struct netprio_map __rcu *priomap;
+#endif
 	/* phy device may attach itself for hardware timestamping */
 	struct phy_device *phydev;
 
diff --git a/include/net/netprio_cgroup.h b/include/net/netprio_cgroup.h
new file mode 100644
index 0000000..6b65936
--- /dev/null
+++ b/include/net/netprio_cgroup.h
@@ -0,0 +1,66 @@
+/*
+ * netprio_cgroup.h			Control Group Priority set 
+ *
+ *
+ * Authors:	Neil Horman <nhorman@tuxdriver.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+
+#ifndef _NETPRIO_CGROUP_H
+#define _NETPRIO_CGROUP_H
+#include <linux/module.h>
+#include <linux/cgroup.h>
+#include <linux/hardirq.h>
+#include <linux/rcupdate.h>
+
+struct cgroup_netprio_state
+{
+	struct cgroup_subsys_state css;
+	u32 prioidx;
+};
+
+struct netprio_map {
+	struct rcu_head rcu;
+	u32 priomap_len;
+	u32 priomap[];
+};
+
+#ifdef CONFIG_CGROUPS
+
+#ifndef CONFIG_NETPRIO_CGROUP
+extern int net_prio_subsys_id;
+#endif
+
+extern void sock_update_netprioidx(struct sock *sk);
+extern void skb_update_prio(struct sk_buff *skb);
+
+static inline struct cgroup_netprio_state
+		*task_netprio_state(struct task_struct *p)
+{
+#if IS_ENABLED(CONFIG_NETPRIO_CGROUP)
+	return container_of(task_subsys_state(p, net_prio_subsys_id),
+			    struct cgroup_netprio_state, css);
+#else
+	return NULL;
+#endif
+}
+
+#else
+
+#define sock_update_netprioidx(sk)
+#define skb_update_prio(skb)
+
+static inline struct cgroup_netprio_state
+		*task_netprio_state(struct task_struct *p)
+{
+	return NULL;
+}
+
+#endif
+
+#endif  /* _NET_CLS_CGROUP_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index 5ac682f..87b24aa 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -321,6 +321,9 @@ struct sock {
 	unsigned short		sk_ack_backlog;
 	unsigned short		sk_max_ack_backlog;
 	__u32			sk_priority;
+#ifdef CONFIG_CGROUPS
+	__u32			sk_cgrp_prioidx;
+#endif
 	struct pid		*sk_peer_pid;
 	const struct cred	*sk_peer_cred;
 	long			sk_rcvtimeo;
diff --git a/net/Kconfig b/net/Kconfig
index a073148..63d2c5d 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -232,6 +232,13 @@ config XPS
 	depends on SMP && SYSFS && USE_GENERIC_SMP_HELPERS
 	default y
 
+config NETPRIO_CGROUP
+	tristate "Network priority cgroup"
+	depends on CGROUPS
+	---help---
+	  Cgroup subsystem for use in assigning processes to network priorities on
+	  a per-interface basis
+
 config HAVE_BPF_JIT
 	bool
 
diff --git a/net/core/Makefile b/net/core/Makefile
index 0d357b1..3606d40 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_FIB_RULES) += fib_rules.o
 obj-$(CONFIG_TRACEPOINTS) += net-traces.o
 obj-$(CONFIG_NET_DROP_MONITOR) += drop_monitor.o
 obj-$(CONFIG_NETWORK_PHY_TIMESTAMPING) += timestamping.o
+obj-$(CONFIG_NETPRIO_CGROUP) += netprio_cgroup.o
diff --git a/net/core/dev.c b/net/core/dev.c
index b7ba81a..a1dca83 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2456,6 +2456,17 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	return rc;
 }
 
+#ifdef CONFIG_CGROUPS
+void skb_update_prio(struct sk_buff *skb)
+{
+	struct netprio_map *map = rcu_dereference(skb->dev->priomap);
+
+	if ((!skb->priority) && (skb->sk) && map)
+		skb->priority = map->priomap[skb->sk->sk_cgrp_prioidx];
+}
+EXPORT_SYMBOL_GPL(skb_update_prio);
+#endif
+
 static DEFINE_PER_CPU(int, xmit_recursion);
 #define RECURSION_LIMIT 10
 
@@ -2496,6 +2507,8 @@ int dev_queue_xmit(struct sk_buff *skb)
 	 */
 	rcu_read_lock_bh();
 
+	skb_update_prio(skb);
+
 	txq = dev_pick_tx(dev, skb);
 	q = rcu_dereference_bh(txq->qdisc);
 
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
new file mode 100644
index 0000000..e5a26fd
--- /dev/null
+++ b/net/core/netprio_cgroup.c
@@ -0,0 +1,347 @@
+/*
+ * net/core/netprio_cgroup.c	Priority Control Group
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Neil Horman <nhorman@tuxdriver.com>
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <linux/cgroup.h>
+#include <linux/rcupdate.h>
+#include <linux/atomic.h>
+#include <net/rtnetlink.h>
+#include <net/pkt_cls.h>
+#include <net/sock.h>
+#include <net/netprio_cgroup.h>
+
+static struct cgroup_subsys_state *cgrp_create(struct cgroup_subsys *ss,
+					       struct cgroup *cgrp);
+static void cgrp_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);
+static int cgrp_populate(struct cgroup_subsys *ss, struct cgroup *cgrp);
+
+struct cgroup_subsys net_prio_subsys = {
+	.name		= "net_prio",
+	.create		= cgrp_create,
+	.destroy	= cgrp_destroy,
+	.populate	= cgrp_populate,
+#ifdef CONFIG_NETPRIO_CGROUP
+	.subsys_id	= net_prio_subsys_id,
+#endif
+	.module		= THIS_MODULE
+};
+
+#define PRIOIDX_SZ 128
+
+static unsigned long prioidx_map[PRIOIDX_SZ];
+static DEFINE_SPINLOCK(prioidx_map_lock);
+static atomic_t max_prioidx = ATOMIC_INIT(0);
+
+static inline struct cgroup_netprio_state *cgrp_netprio_state(struct cgroup *cgrp)
+{
+	return container_of(cgroup_subsys_state(cgrp, net_prio_subsys_id),
+			    struct cgroup_netprio_state, css);
+}
+
+static int get_prioidx(u32 *prio)
+{
+	unsigned long flags;
+	u32 prioidx;
+
+	spin_lock_irqsave(&prioidx_map_lock, flags);
+	prioidx = find_first_zero_bit(prioidx_map, sizeof(unsigned long) * PRIOIDX_SZ);
+	set_bit(prioidx, prioidx_map);
+	spin_unlock_irqrestore(&prioidx_map_lock, flags);
+	if (prioidx == sizeof(unsigned long) * PRIOIDX_SZ)
+		return -ENOSPC;
+
+	atomic_set(&max_prioidx, prioidx);
+	*prio = prioidx;
+	return 0;
+}
+
+static void put_prioidx(u32 idx)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&prioidx_map_lock, flags);
+	clear_bit(idx, prioidx_map);
+	spin_unlock_irqrestore(&prioidx_map_lock, flags);
+}
+
+static void extend_netdev_table(struct net_device *dev, u32 new_len)
+{
+	size_t new_size = sizeof(struct netprio_map) +
+			   ((sizeof(u32) * new_len));
+	struct netprio_map *new_priomap = kzalloc(new_size, GFP_KERNEL);
+	struct netprio_map *old_priomap;
+	int i;
+
+	old_priomap  = rcu_dereference(dev->priomap);
+
+	if (!new_priomap) {
+		printk(KERN_WARNING "Unable to alloc new priomap!\n");
+		return;
+	}
+
+	for (i = 0;
+	     old_priomap && (i < old_priomap->priomap_len);
+	     i++)
+		new_priomap->priomap[i] = old_priomap->priomap[i];
+
+	new_priomap->priomap_len = new_len;
+
+	rcu_assign_pointer(dev->priomap, new_priomap);
+	if (old_priomap)
+		kfree_rcu(old_priomap, rcu);
+}
+
+static void update_netdev_tables(void)
+{
+	struct net_device *dev;
+	u32 max_len = atomic_read(&max_prioidx);
+	struct netprio_map *map;
+
+	rtnl_lock();
+
+
+	for_each_netdev(&init_net, dev) {
+		map = rcu_dereference(dev->priomap);
+		if ((!map) ||
+		    (map->priomap_len < max_len))
+			extend_netdev_table(dev, max_len);
+	}
+
+	rtnl_unlock();
+}
+
+static struct cgroup_subsys_state *cgrp_create(struct cgroup_subsys *ss,
+						 struct cgroup *cgrp)
+{
+	struct cgroup_netprio_state *cs;
+	int ret;
+
+	cs = kzalloc(sizeof(*cs), GFP_KERNEL);
+	if (!cs)
+		return ERR_PTR(-ENOMEM);
+
+	if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx) {
+		kfree(cs);
+		return ERR_PTR(-EINVAL);
+	}
+
+	ret = get_prioidx(&cs->prioidx);
+	if (ret != 0) {
+		printk(KERN_WARNING "No space in priority index array\n");
+		kfree(cs);
+		return ERR_PTR(ret);
+	}
+
+	return &cs->css;
+}
+
+static void cgrp_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct cgroup_netprio_state *cs;
+	struct net_device *dev;
+	struct netprio_map *map;
+
+	cs = cgrp_netprio_state(cgrp);
+	rtnl_lock();
+	for_each_netdev(&init_net, dev) {
+		map = rcu_dereference(dev->priomap);
+		if (map)
+			map->priomap[cs->prioidx] = 0;
+	}
+	rtnl_unlock();
+	put_prioidx(cs->prioidx);
+	kfree(cs);
+}
+
+static u64 read_prioidx(struct cgroup *cgrp, struct cftype *cft)
+{
+	return (u64)cgrp_netprio_state(cgrp)->prioidx;
+}
+
+static int read_priomap(struct cgroup *cont, struct cftype *cft,
+			struct cgroup_map_cb *cb)
+{
+	struct net_device *dev;
+	u32 prioidx = cgrp_netprio_state(cont)->prioidx;
+	u32 priority;
+	struct netprio_map *map;
+
+	rcu_read_lock();
+
+	for_each_netdev_rcu(&init_net, dev) {
+		map = rcu_dereference(dev->priomap);
+		priority = map ? map->priomap[prioidx] : 0;
+		cb->fill(cb, dev->name, priority);
+	}
+	rcu_read_unlock();
+	return 0;
+}
+
+static int write_priomap(struct cgroup *cgrp, struct cftype *cft,
+			 const char *buffer)
+{
+	char *devname = kstrdup(buffer, GFP_KERNEL);
+	int ret = -EINVAL;
+	u32 prioidx = cgrp_netprio_state(cgrp)->prioidx;
+	unsigned long priority;
+	char *priostr;
+	struct net_device *dev;
+	struct netprio_map *map;
+
+	if (!devname)
+		return -ENOMEM;
+
+	/*
+	 * Minimally sized valid priomap string
+	 */
+	if (strlen(devname) < 3)
+		goto out_free_devname;
+
+	priostr = strstr(devname, " ");
+	if (!priostr)
+		goto out_free_devname;
+
+	/*
+	 *Separate the devname from the associated priority
+	 *and advance the priostr poitner to the priority value
+	 */
+	*priostr = '\0';
+	priostr++;
+
+	/*
+	 * If the priostr points to NULL, we're at the end of the passed
+	 * in string, and its not a valid write
+	 */
+	if (*priostr == '\0')
+		goto out_free_devname;
+
+	ret = kstrtoul(priostr, 10, &priority);
+	if (ret < 0)
+		goto out_free_devname;
+
+	ret = -ENODEV;
+
+	dev = dev_get_by_name(&init_net, devname);
+	if (!dev)
+		goto out_free_devname;
+
+	update_netdev_tables();
+	ret = 0;
+	rcu_read_lock();
+	map = rcu_dereference(dev->priomap);
+	if (map)
+		map->priomap[prioidx] = priority;
+	rcu_read_unlock();
+	dev_put(dev);
+
+out_free_devname:
+	kfree(devname);
+	return ret;
+}
+
+static struct cftype ss_files[] = {
+	{
+		.name = "prioidx",
+		.read_u64 = read_prioidx,
+	},
+	{
+		.name = "ifpriomap",
+		.read_map = read_priomap,
+		.write_string = write_priomap,
+	},
+};
+
+static int cgrp_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	return cgroup_add_files(cgrp, ss, ss_files, ARRAY_SIZE(ss_files));
+}
+
+static int netprio_device_event(struct notifier_block *unused,
+				unsigned long event, void *ptr)
+{
+	struct net_device *dev = ptr;
+	struct netprio_map *old;
+	u32 max_len = atomic_read(&max_prioidx);
+
+	old = rcu_dereference_protected(dev->priomap, 1);
+	/*
+	 * Note this is called with rtnl_lock held so we have update side
+	 * protection on our rcu assignments
+	 */
+
+	switch (event) {
+
+	case NETDEV_REGISTER:
+		if (max_len)
+			extend_netdev_table(dev, max_len);
+		break;
+	case NETDEV_UNREGISTER:
+		rcu_assign_pointer(dev->priomap, NULL);
+		if (old)
+			kfree_rcu(old, rcu);
+		break;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block netprio_device_notifier = {
+	.notifier_call = netprio_device_event
+};
+
+static int __init init_cgroup_netprio(void)
+{
+	int ret;
+
+	ret = cgroup_load_subsys(&net_prio_subsys);
+	if (ret)
+		goto out;
+#ifndef CONFIG_NETPRIO_CGROUP
+	smp_wmb();
+	net_prio_subsys_id = net_prio_subsys.subsys_id;
+#endif
+
+	register_netdevice_notifier(&netprio_device_notifier);
+
+out:
+	return ret;
+}
+
+static void __exit exit_cgroup_netprio(void)
+{
+	struct netprio_map *old;
+	struct net_device *dev;
+
+	unregister_netdevice_notifier(&netprio_device_notifier);
+
+	cgroup_unload_subsys(&net_prio_subsys);
+
+#ifndef CONFIG_NETPRIO_CGROUP
+	net_prio_subsys_id = -1;
+	synchronize_rcu();
+#endif
+
+	rtnl_lock();
+	for_each_netdev(&init_net, dev) {
+		old = rcu_dereference(dev->priomap);
+		rcu_assign_pointer(dev->priomap, NULL);
+		if (old)
+			kfree_rcu(old, rcu);
+	}
+	rtnl_unlock();
+}
+
+module_init(init_cgroup_netprio);
+module_exit(exit_cgroup_netprio);
+MODULE_LICENSE("GPL v2");
diff --git a/net/core/sock.c b/net/core/sock.c
index 5a08762..77a4888 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -125,6 +125,7 @@
 #include <net/xfrm.h>
 #include <linux/ipsec.h>
 #include <net/cls_cgroup.h>
+#include <net/netprio_cgroup.h>
 
 #include <linux/filter.h>
 
@@ -221,10 +222,16 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 EXPORT_SYMBOL(sysctl_optmem_max);
 
-#if defined(CONFIG_CGROUPS) && !defined(CONFIG_NET_CLS_CGROUP)
+#if defined(CONFIG_CGROUPS)
+#if !defined(CONFIG_NET_CLS_CGROUP)
 int net_cls_subsys_id = -1;
 EXPORT_SYMBOL_GPL(net_cls_subsys_id);
 #endif
+#if !defined(CONFIG_NETPRIO_CGROUP)
+int net_prio_subsys_id = -1;
+EXPORT_SYMBOL_GPL(net_prio_subsys_id);
+#endif
+#endif
 
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
@@ -1111,6 +1118,18 @@ void sock_update_classid(struct sock *sk)
 		sk->sk_classid = classid;
 }
 EXPORT_SYMBOL(sock_update_classid);
+
+void sock_update_netprioidx(struct sock *sk)
+{
+	struct cgroup_netprio_state *state;
+	if (in_interrupt())
+		return;
+	rcu_read_lock();
+	state = task_netprio_state(current);
+	sk->sk_cgrp_prioidx = state ? state->prioidx : 0;
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(sock_update_netprioidx);
 #endif
 
 /**
@@ -1138,6 +1157,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		atomic_set(&sk->sk_wmem_alloc, 1);
 
 		sock_update_classid(sk);
+		sock_update_netprioidx(sk);
 	}
 
 	return sk;
diff --git a/net/socket.c b/net/socket.c
index 2877647..108716f 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -549,6 +549,8 @@ static inline int __sock_sendmsg_nosec(struct kiocb *iocb, struct socket *sock,
 
 	sock_update_classid(sock->sk);
 
+	sock_update_netprioidx(sock->sk);
+
 	si->sock = sock;
 	si->scm = NULL;
 	si->msg = msg;
-- 
1.7.6.4

^ permalink raw reply related

* [PATCH 2/2] net: add documentation for net_prio cgroups (v2)
From: Neil Horman @ 2011-11-17 21:47 UTC (permalink / raw)
  To: netdev; +Cc: Neil Horman, John Fastabend, Robert Love, David S. Miller
In-Reply-To: <1321566472-28969-1-git-send-email-nhorman@tuxdriver.com>

Add the requisite documentation to explain to new users how net_prio cgroups work

Signed-off-by:Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
CC: Robert Love <robert.w.love@intel.com>
CC: "David S. Miller" <davem@davemloft.net>
---
 Documentation/cgroups/net_prio.txt |   53 ++++++++++++++++++++++++++++++++++++
 1 files changed, 53 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/cgroups/net_prio.txt

diff --git a/Documentation/cgroups/net_prio.txt b/Documentation/cgroups/net_prio.txt
new file mode 100644
index 0000000..01b3226
--- /dev/null
+++ b/Documentation/cgroups/net_prio.txt
@@ -0,0 +1,53 @@
+Network priority cgroup
+-------------------------
+
+The Network priority cgroup provides an interface to allow an administrator to
+dynamically set the priority of network traffic generated by various
+applications
+
+Nominally, an application would set the priority of its traffic via the
+SO_PRIORITY socket option.  This however, is not always possible because:
+
+1) The application may not have been coded to set this value
+2) The priority of application traffic is often a site-specific administrative
+   decision rather than an application defined one.
+
+This cgroup allows an administrator to assign a process to a group which defines
+the priority of egress traffic on a given interface. Network priority groups can
+be created by first mounting the cgroup filesystem.
+
+# mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio
+
+With the above step, the initial group acting as the parent accounting group
+becomes visible at '/sys/fs/cgroup/net_prio'.  This group includes all tasks in
+the system. '/sys/fs/cgroup/net_prio/tasks' lists the tasks in this cgroup.
+
+Each net_prio cgroup contains two files that are subsystem specific
+
+net_prio.prioidx
+This file is read-only, and is simply informative.  It contains a unique integer
+value that the kernel uses as an internal representation of this cgroup.
+
+net_prio.ifpriomap
+This file contains a map of the priorities assigned to traffic originating from
+processes in this group and egressing the system on various interfaces. It
+contains a list of tuples in the form <ifname priority>.  Contents of this file
+can be modified by echoing a string into the file using the same tuple format.
+for example:
+
+echo "eth0 5" > /sys/fs/cgroups/net_prio/iscsi/net_prio.ifpriomap
+
+This command would force any traffic originating from processes belonging to the
+iscsi net_prio cgroup and egressing on interface eth0 to have the priority of
+said traffic set to the value 5. The parent accounting group also has a
+writeable 'net_prio.ifpriomap' file that can be used to set a system default
+priority.
+
+Priorities are set immediately prior to queueing a frame to the device
+queueing discipline (qdisc) so priorities will be assigned prior to the hardware
+queue selection being made.
+
+One usage for the net_prio cgroup is with mqprio qdisc allowing application
+traffic to be steered to hardware/driver based traffic classes. These mappings
+can then be managed by administrators or other networking protocols such as
+DCBX.
-- 
1.7.6.4

^ permalink raw reply related

* Re: [PATCH] route: add more relaxed option for secure_redirects
From: David Miller @ 2011-11-17 21:53 UTC (permalink / raw)
  To: fbl; +Cc: netdev
In-Reply-To: <20111116234042.6ad8d723@asterix.rh>

From: Flavio Leitner <fbl@redhat.com>
Date: Wed, 16 Nov 2011 23:40:42 -0200

> To make sure we are in the same page, this simple setup reproduces
> the issue.
> 
> IP: 10.0.0.1
> gw: 10.0.0.100
> +--------+          +-----+ primary: 10.0.0.2
> | client |----+-----| GW1 |   alias: 10.0.0.100
> +--------+    |     +-----+      gw: 10.0.0.254
>            +--+--+
>            | GW2 |---> internet
>            +-----+
>          10.0.0.254
> 
> 1. Client sends TCP SYN to an internet host using
>    GW1 alias address as default gw address
> 
> 2. Then GW1 sends the ICMP redirect back to client
>    using the primary address as source address.
> 
> 3. GW1 forwards the original packet to GW2
> 
> 4. client ignores the ICMP redirect because
>    client.gw != gw1.primary.

GW1 must respond using a source address matching 'alias', ie.
10.0.0.100 and I would accept a mechinsm to make sure that happens,
if not by default then via a sysctl or similar control.

^ permalink raw reply

* Re: [PATCH] ipv4: avoid to double release dst in tcp_v4_connect
From: David Miller @ 2011-11-17 21:56 UTC (permalink / raw)
  To: roy.qing.li; +Cc: netdev
In-Reply-To: <1321518828-4288-1-git-send-email-roy.qing.li@gmail.com>

From: roy.qing.li@gmail.com
Date: Thu, 17 Nov 2011 16:33:48 +0800

> From: RongQing.Li <roy.qing.li@gmail.com> 
> 
> When tcp_connect failed in tcp_v4_connect, the dst will be
> released in error handler of tcp_v4_connect, but dst has been
> set to sk->sk_dst_cache which will be released again when
> destroy this sk.
> 
> Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>

We set 'rt' to NULL at these statements specifically to handle this
situation, ip_rt_put(NULL) will do nothing.

I don't think this change is necessary.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox