Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] tcp: handle tcp_net_metrics_init() order-5 memory allocation failures
From: David Miller @ 2012-11-16 18:37 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, jln
In-Reply-To: <1353079913.10798.31.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 16 Nov 2012 07:31:53 -0800

> Well, we dont really know what the size needs to be, and your proposal
> reduces the size by a 4 factor, even for the initial namespace.
> 
> Julien report was about Chrome browser own netns, on a suspend/resume
> cycle (or something like that)
> 
> If size can influence behavior, we could try a vmalloc() if kmalloc()
> fails...

Agreed.

> [PATCH v3] tcp: handle tcp_net_metrics_init() order-5 memory allocation failures
> 
> order-5 allocations can fail with current kernels, we should
> try vmalloc() as well.
> 
> Reported-by: Julien Tinnes <jln@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

This looks great, applied, thanks.

^ permalink raw reply

* pull request: wireless 2012-11-16
From: John W. Linville @ 2012-11-16 18:30 UTC (permalink / raw)
  To: davem; +Cc: linux-wireless, netdev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 17964 bytes --]

commit 26c6e80892d8c160dffaba85889bd4e65b1dacf6

Dave,

This batch of fixes is intended for the 3.7 stream...

This includes a pull of the Bluetooth tree.  Gustavo says:

"A few important fixes to go into 3.7. There is a new hw support by Marcos
Chaparro. Johan added a memory leak fix and hci device index list fix.
Also Marcel fixed a race condition in the device set up that was prevent the
bt monitor to work properly. Last, Paulo Sérgio added a fix to the error
status when pairing for LE fails. This was prevent userspace to work to handle
the failure properly."

Regarding the mac80211 pull, Johannes says:

"I have a locking fix for some SKB queues, a variable initialization to
avoid crashes in a certain failure case, another free_txskb fix from
Felix and another fix from him to avoid calling a stopped driver, a fix
for a (very unlikely) memory leak and a fix to not send null data
packets when resuming while not associated."

Regarding the iwlwifi pull, Johannes says:

"Two more fixes for iwlwifi ... one to use ieee80211_free_txskb(), and
one to check DMA mapping errors, please pull."

On top of that, Johannes also included a wireless regulatory fix
to allow 40 MHz on channels 12 and 13 in world roaming mode.  Also,
Hauke Mehrtens fixes a #ifdef typo in brcmfmac.

Please let me know if there are problems!

Thanks,

John

---

The following changes since commit 6fc4adca6ce3e1d57a42707019dddcb883578a91:

  tilegx: request_irq with a non-null device name (2012-11-16 01:40:41 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless.git for-davem

for you to fetch changes up to 26c6e80892d8c160dffaba85889bd4e65b1dacf6:

  Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem (2012-11-16 12:59:13 -0500)

----------------------------------------------------------------

Arik Nemtsov (1):
      mac80211: sync acccess to tx_filtered/ps_tx_buf queues

David Spinadel (1):
      mac80211: init sched_scan_ies

Felix Fietkau (2):
      mac80211: do not call ieee80211_configure_filter if no interfaces are up
      mac80211: call skb_dequeue/ieee80211_free_txskb instead of __skb_queue_purge

Hauke Mehrtens (1):
      brcmfmac: fix typo in CONFIG_BRCMISCAN

Johan Hedberg (2):
      Bluetooth: Fix having bogus entries in mgmt_read_index_list reply
      Bluetooth: Fix memory leak when removing a UUID

Johannes Berg (5):
      iwlwifi: handle DMA mapping failures
      iwlwifi: use ieee80211_free_txskb
      mac80211: fix memory leak in device registration error path
      mac80211: don't send null data packet when not associated
      wireless: allow 40 MHz on world roaming channels 12/13

John W. Linville (4):
      Merge branch 'for-john' of git://git.kernel.org/.../jberg/mac80211
      Merge branch 'for-john' of git://git.kernel.org/.../iwlwifi/iwlwifi-fixes
      Merge branch 'master' of git://git.kernel.org/.../bluetooth/bluetooth
      Merge branch 'master' of git://git.kernel.org/.../linville/wireless into for-davem

Marcel Holtmann (1):
      Bluetooth: Notify about device registration before power on

Marcos Chaparro (1):
      Bluetooth: ath3k: Add support for VAIO VPCEH [0489:e027]

Paulo Sérgio (1):
      Bluetooth: Fix error status when pairing fails

 drivers/bluetooth/ath3k.c                          |  1 +
 drivers/bluetooth/btusb.c                          |  1 +
 .../net/wireless/brcm80211/brcmfmac/wl_cfg80211.c  |  2 +-
 drivers/net/wireless/iwlwifi/dvm/mac80211.c        |  2 +-
 drivers/net/wireless/iwlwifi/dvm/main.c            |  2 +-
 drivers/net/wireless/iwlwifi/pcie/rx.c             | 23 ++++++++++++++++++++--
 net/bluetooth/hci_core.c                           |  4 ++--
 net/bluetooth/mgmt.c                               | 12 ++++++-----
 net/bluetooth/smp.c                                |  2 +-
 net/mac80211/cfg.c                                 |  3 +++
 net/mac80211/ieee80211_i.h                         |  2 ++
 net/mac80211/main.c                                |  6 ++++--
 net/mac80211/scan.c                                |  2 +-
 net/mac80211/sta_info.c                            | 11 ++++++++---
 net/mac80211/status.c                              |  9 +++++++++
 net/mac80211/tx.c                                  |  9 ++++++---
 net/mac80211/util.c                                |  2 ++
 net/wireless/reg.c                                 |  5 ++---
 18 files changed, 73 insertions(+), 25 deletions(-)

diff --git a/drivers/bluetooth/ath3k.c b/drivers/bluetooth/ath3k.c
index fc2de55..b00000e 100644
--- a/drivers/bluetooth/ath3k.c
+++ b/drivers/bluetooth/ath3k.c
@@ -67,6 +67,7 @@ static struct usb_device_id ath3k_table[] = {
 	{ USB_DEVICE(0x13d3, 0x3304) },
 	{ USB_DEVICE(0x0930, 0x0215) },
 	{ USB_DEVICE(0x0489, 0xE03D) },
+	{ USB_DEVICE(0x0489, 0xE027) },
 
 	/* Atheros AR9285 Malbec with sflash firmware */
 	{ USB_DEVICE(0x03F0, 0x311D) },
diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
index debda27..ee82f2f 100644
--- a/drivers/bluetooth/btusb.c
+++ b/drivers/bluetooth/btusb.c
@@ -124,6 +124,7 @@ static struct usb_device_id blacklist_table[] = {
 	{ USB_DEVICE(0x13d3, 0x3304), .driver_info = BTUSB_IGNORE },
 	{ USB_DEVICE(0x0930, 0x0215), .driver_info = BTUSB_IGNORE },
 	{ USB_DEVICE(0x0489, 0xe03d), .driver_info = BTUSB_IGNORE },
+	{ USB_DEVICE(0x0489, 0xe027), .driver_info = BTUSB_IGNORE },
 
 	/* Atheros AR9285 Malbec with sflash firmware */
 	{ USB_DEVICE(0x03f0, 0x311d), .driver_info = BTUSB_IGNORE },
diff --git a/drivers/net/wireless/brcm80211/brcmfmac/wl_cfg80211.c b/drivers/net/wireless/brcm80211/brcmfmac/wl_cfg80211.c
index a6f1e81..481345c 100644
--- a/drivers/net/wireless/brcm80211/brcmfmac/wl_cfg80211.c
+++ b/drivers/net/wireless/brcm80211/brcmfmac/wl_cfg80211.c
@@ -4401,7 +4401,7 @@ static s32 brcmf_mode_to_nl80211_iftype(s32 mode)
 
 static void brcmf_wiphy_pno_params(struct wiphy *wiphy)
 {
-#ifndef CONFIG_BRCMFISCAN
+#ifndef CONFIG_BRCMISCAN
 	/* scheduled scan settings */
 	wiphy->max_sched_scan_ssids = BRCMF_PNO_MAX_PFN_COUNT;
 	wiphy->max_match_sets = BRCMF_PNO_MAX_PFN_COUNT;
diff --git a/drivers/net/wireless/iwlwifi/dvm/mac80211.c b/drivers/net/wireless/iwlwifi/dvm/mac80211.c
index ff8162d..fa4d1b8 100644
--- a/drivers/net/wireless/iwlwifi/dvm/mac80211.c
+++ b/drivers/net/wireless/iwlwifi/dvm/mac80211.c
@@ -521,7 +521,7 @@ static void iwlagn_mac_tx(struct ieee80211_hw *hw,
 		     ieee80211_get_tx_rate(hw, IEEE80211_SKB_CB(skb))->bitrate);
 
 	if (iwlagn_tx_skb(priv, control->sta, skb))
-		dev_kfree_skb_any(skb);
+		ieee80211_free_txskb(hw, skb);
 }
 
 static void iwlagn_mac_update_tkip_key(struct ieee80211_hw *hw,
diff --git a/drivers/net/wireless/iwlwifi/dvm/main.c b/drivers/net/wireless/iwlwifi/dvm/main.c
index 7ff3f14..408132c 100644
--- a/drivers/net/wireless/iwlwifi/dvm/main.c
+++ b/drivers/net/wireless/iwlwifi/dvm/main.c
@@ -2114,7 +2114,7 @@ static void iwl_free_skb(struct iwl_op_mode *op_mode, struct sk_buff *skb)
 
 	info = IEEE80211_SKB_CB(skb);
 	iwl_trans_free_tx_cmd(priv->trans, info->driver_data[1]);
-	dev_kfree_skb_any(skb);
+	ieee80211_free_txskb(priv->hw, skb);
 }
 
 static void iwl_set_hw_rfkill_state(struct iwl_op_mode *op_mode, bool state)
diff --git a/drivers/net/wireless/iwlwifi/pcie/rx.c b/drivers/net/wireless/iwlwifi/pcie/rx.c
index 17c8e5d..bb69f8f 100644
--- a/drivers/net/wireless/iwlwifi/pcie/rx.c
+++ b/drivers/net/wireless/iwlwifi/pcie/rx.c
@@ -321,6 +321,14 @@ static void iwl_rx_allocate(struct iwl_trans *trans, gfp_t priority)
 			dma_map_page(trans->dev, page, 0,
 				     PAGE_SIZE << trans_pcie->rx_page_order,
 				     DMA_FROM_DEVICE);
+		if (dma_mapping_error(trans->dev, rxb->page_dma)) {
+			rxb->page = NULL;
+			spin_lock_irqsave(&rxq->lock, flags);
+			list_add(&rxb->list, &rxq->rx_used);
+			spin_unlock_irqrestore(&rxq->lock, flags);
+			__free_pages(page, trans_pcie->rx_page_order);
+			return;
+		}
 		/* dma address must be no more than 36 bits */
 		BUG_ON(rxb->page_dma & ~DMA_BIT_MASK(36));
 		/* and also 256 byte aligned! */
@@ -488,8 +496,19 @@ static void iwl_rx_handle_rxbuf(struct iwl_trans *trans,
 			dma_map_page(trans->dev, rxb->page, 0,
 				     PAGE_SIZE << trans_pcie->rx_page_order,
 				     DMA_FROM_DEVICE);
-		list_add_tail(&rxb->list, &rxq->rx_free);
-		rxq->free_count++;
+		if (dma_mapping_error(trans->dev, rxb->page_dma)) {
+			/*
+			 * free the page(s) as well to not break
+			 * the invariant that the items on the used
+			 * list have no page(s)
+			 */
+			__free_pages(rxb->page, trans_pcie->rx_page_order);
+			rxb->page = NULL;
+			list_add_tail(&rxb->list, &rxq->rx_used);
+		} else {
+			list_add_tail(&rxb->list, &rxq->rx_free);
+			rxq->free_count++;
+		}
 	} else
 		list_add_tail(&rxb->list, &rxq->rx_used);
 	spin_unlock_irqrestore(&rxq->lock, flags);
diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
index 8a0ce70..a0a2f97 100644
--- a/net/bluetooth/hci_core.c
+++ b/net/bluetooth/hci_core.c
@@ -1754,11 +1754,11 @@ int hci_register_dev(struct hci_dev *hdev)
 	if (hdev->dev_type != HCI_AMP)
 		set_bit(HCI_AUTO_OFF, &hdev->dev_flags);
 
-	schedule_work(&hdev->power_on);
-
 	hci_notify(hdev, HCI_DEV_REG);
 	hci_dev_hold(hdev);
 
+	schedule_work(&hdev->power_on);
+
 	return id;
 
 err_wqueue:
diff --git a/net/bluetooth/mgmt.c b/net/bluetooth/mgmt.c
index aa2ea0a..91de423 100644
--- a/net/bluetooth/mgmt.c
+++ b/net/bluetooth/mgmt.c
@@ -326,7 +326,7 @@ static int read_index_list(struct sock *sk, struct hci_dev *hdev, void *data,
 	struct hci_dev *d;
 	size_t rp_len;
 	u16 count;
-	int i, err;
+	int err;
 
 	BT_DBG("sock %p", sk);
 
@@ -347,9 +347,7 @@ static int read_index_list(struct sock *sk, struct hci_dev *hdev, void *data,
 		return -ENOMEM;
 	}
 
-	rp->num_controllers = cpu_to_le16(count);
-
-	i = 0;
+	count = 0;
 	list_for_each_entry(d, &hci_dev_list, list) {
 		if (test_bit(HCI_SETUP, &d->dev_flags))
 			continue;
@@ -357,10 +355,13 @@ static int read_index_list(struct sock *sk, struct hci_dev *hdev, void *data,
 		if (!mgmt_valid_hdev(d))
 			continue;
 
-		rp->index[i++] = cpu_to_le16(d->id);
+		rp->index[count++] = cpu_to_le16(d->id);
 		BT_DBG("Added hci%u", d->id);
 	}
 
+	rp->num_controllers = cpu_to_le16(count);
+	rp_len = sizeof(*rp) + (2 * count);
+
 	read_unlock(&hci_dev_list_lock);
 
 	err = cmd_complete(sk, MGMT_INDEX_NONE, MGMT_OP_READ_INDEX_LIST, 0, rp,
@@ -1366,6 +1367,7 @@ static int remove_uuid(struct sock *sk, struct hci_dev *hdev, void *data,
 			continue;
 
 		list_del(&match->list);
+		kfree(match);
 		found++;
 	}
 
diff --git a/net/bluetooth/smp.c b/net/bluetooth/smp.c
index 2ac8d50..a592337 100644
--- a/net/bluetooth/smp.c
+++ b/net/bluetooth/smp.c
@@ -267,7 +267,7 @@ static void smp_failure(struct l2cap_conn *conn, u8 reason, u8 send)
 
 	clear_bit(HCI_CONN_ENCRYPT_PEND, &conn->hcon->flags);
 	mgmt_auth_failed(conn->hcon->hdev, conn->dst, hcon->type,
-			 hcon->dst_type, reason);
+			 hcon->dst_type, HCI_ERROR_AUTH_FAILURE);
 
 	cancel_delayed_work_sync(&conn->security_timer);
 
diff --git a/net/mac80211/cfg.c b/net/mac80211/cfg.c
index 05f3a31..7371f67 100644
--- a/net/mac80211/cfg.c
+++ b/net/mac80211/cfg.c
@@ -2594,6 +2594,9 @@ static void ieee80211_mgmt_frame_register(struct wiphy *wiphy,
 		else
 			local->probe_req_reg--;
 
+		if (!local->open_count)
+			break;
+
 		ieee80211_queue_work(&local->hw, &local->reconfig_filter);
 		break;
 	default:
diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
index 8c80455..156e583 100644
--- a/net/mac80211/ieee80211_i.h
+++ b/net/mac80211/ieee80211_i.h
@@ -1314,6 +1314,8 @@ netdev_tx_t ieee80211_monitor_start_xmit(struct sk_buff *skb,
 					 struct net_device *dev);
 netdev_tx_t ieee80211_subif_start_xmit(struct sk_buff *skb,
 				       struct net_device *dev);
+void ieee80211_purge_tx_queue(struct ieee80211_hw *hw,
+			      struct sk_buff_head *skbs);
 
 /* HT */
 void ieee80211_apply_htcap_overrides(struct ieee80211_sub_if_data *sdata,
diff --git a/net/mac80211/main.c b/net/mac80211/main.c
index c80c449..f57f597 100644
--- a/net/mac80211/main.c
+++ b/net/mac80211/main.c
@@ -871,8 +871,10 @@ int ieee80211_register_hw(struct ieee80211_hw *hw)
 				local->hw.wiphy->cipher_suites,
 				sizeof(u32) * local->hw.wiphy->n_cipher_suites,
 				GFP_KERNEL);
-			if (!suites)
-				return -ENOMEM;
+			if (!suites) {
+				result = -ENOMEM;
+				goto fail_wiphy_register;
+			}
 			for (r = 0; r < local->hw.wiphy->n_cipher_suites; r++) {
 				u32 suite = local->hw.wiphy->cipher_suites[r];
 				if (suite == WLAN_CIPHER_SUITE_WEP40 ||
diff --git a/net/mac80211/scan.c b/net/mac80211/scan.c
index c4cdbde..43e60b5 100644
--- a/net/mac80211/scan.c
+++ b/net/mac80211/scan.c
@@ -917,7 +917,7 @@ int ieee80211_request_sched_scan_start(struct ieee80211_sub_if_data *sdata,
 				       struct cfg80211_sched_scan_request *req)
 {
 	struct ieee80211_local *local = sdata->local;
-	struct ieee80211_sched_scan_ies sched_scan_ies;
+	struct ieee80211_sched_scan_ies sched_scan_ies = {};
 	int ret, i;
 
 	mutex_lock(&local->mtx);
diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c
index 0a4e4c0..d2eb64e 100644
--- a/net/mac80211/sta_info.c
+++ b/net/mac80211/sta_info.c
@@ -117,8 +117,8 @@ static void free_sta_work(struct work_struct *wk)
 
 	for (ac = 0; ac < IEEE80211_NUM_ACS; ac++) {
 		local->total_ps_buffered -= skb_queue_len(&sta->ps_tx_buf[ac]);
-		__skb_queue_purge(&sta->ps_tx_buf[ac]);
-		__skb_queue_purge(&sta->tx_filtered[ac]);
+		ieee80211_purge_tx_queue(&local->hw, &sta->ps_tx_buf[ac]);
+		ieee80211_purge_tx_queue(&local->hw, &sta->tx_filtered[ac]);
 	}
 
 #ifdef CONFIG_MAC80211_MESH
@@ -141,7 +141,7 @@ static void free_sta_work(struct work_struct *wk)
 		tid_tx = rcu_dereference_raw(sta->ampdu_mlme.tid_tx[i]);
 		if (!tid_tx)
 			continue;
-		__skb_queue_purge(&tid_tx->pending);
+		ieee80211_purge_tx_queue(&local->hw, &tid_tx->pending);
 		kfree(tid_tx);
 	}
 
@@ -961,6 +961,7 @@ void ieee80211_sta_ps_deliver_wakeup(struct sta_info *sta)
 	struct ieee80211_local *local = sdata->local;
 	struct sk_buff_head pending;
 	int filtered = 0, buffered = 0, ac;
+	unsigned long flags;
 
 	clear_sta_flag(sta, WLAN_STA_SP);
 
@@ -976,12 +977,16 @@ void ieee80211_sta_ps_deliver_wakeup(struct sta_info *sta)
 	for (ac = 0; ac < IEEE80211_NUM_ACS; ac++) {
 		int count = skb_queue_len(&pending), tmp;
 
+		spin_lock_irqsave(&sta->tx_filtered[ac].lock, flags);
 		skb_queue_splice_tail_init(&sta->tx_filtered[ac], &pending);
+		spin_unlock_irqrestore(&sta->tx_filtered[ac].lock, flags);
 		tmp = skb_queue_len(&pending);
 		filtered += tmp - count;
 		count = tmp;
 
+		spin_lock_irqsave(&sta->ps_tx_buf[ac].lock, flags);
 		skb_queue_splice_tail_init(&sta->ps_tx_buf[ac], &pending);
+		spin_unlock_irqrestore(&sta->ps_tx_buf[ac].lock, flags);
 		tmp = skb_queue_len(&pending);
 		buffered += tmp - count;
 	}
diff --git a/net/mac80211/status.c b/net/mac80211/status.c
index 3af0cc4..101eb88 100644
--- a/net/mac80211/status.c
+++ b/net/mac80211/status.c
@@ -668,3 +668,12 @@ void ieee80211_free_txskb(struct ieee80211_hw *hw, struct sk_buff *skb)
 	dev_kfree_skb_any(skb);
 }
 EXPORT_SYMBOL(ieee80211_free_txskb);
+
+void ieee80211_purge_tx_queue(struct ieee80211_hw *hw,
+			      struct sk_buff_head *skbs)
+{
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue(skbs)))
+		ieee80211_free_txskb(hw, skb);
+}
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index c9bf83f..b858ebe 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -1358,7 +1358,7 @@ static int invoke_tx_handlers(struct ieee80211_tx_data *tx)
 		if (tx->skb)
 			ieee80211_free_txskb(&tx->local->hw, tx->skb);
 		else
-			__skb_queue_purge(&tx->skbs);
+			ieee80211_purge_tx_queue(&tx->local->hw, &tx->skbs);
 		return -1;
 	} else if (unlikely(res == TX_QUEUED)) {
 		I802_DEBUG_INC(tx->local->tx_handlers_queued);
@@ -2120,10 +2120,13 @@ netdev_tx_t ieee80211_subif_start_xmit(struct sk_buff *skb,
  */
 void ieee80211_clear_tx_pending(struct ieee80211_local *local)
 {
+	struct sk_buff *skb;
 	int i;
 
-	for (i = 0; i < local->hw.queues; i++)
-		skb_queue_purge(&local->pending[i]);
+	for (i = 0; i < local->hw.queues; i++) {
+		while ((skb = skb_dequeue(&local->pending[i])) != NULL)
+			ieee80211_free_txskb(&local->hw, skb);
+	}
 }
 
 /*
diff --git a/net/mac80211/util.c b/net/mac80211/util.c
index 2393918..0151ae3 100644
--- a/net/mac80211/util.c
+++ b/net/mac80211/util.c
@@ -1491,6 +1491,8 @@ int ieee80211_reconfig(struct ieee80211_local *local)
 		list_for_each_entry(sdata, &local->interfaces, list) {
 			if (sdata->vif.type != NL80211_IFTYPE_STATION)
 				continue;
+			if (!sdata->u.mgd.associated)
+				continue;
 
 			ieee80211_send_nullfunc(local, sdata, 0);
 		}
diff --git a/net/wireless/reg.c b/net/wireless/reg.c
index bcc7d7e..b75756b 100644
--- a/net/wireless/reg.c
+++ b/net/wireless/reg.c
@@ -141,9 +141,8 @@ static const struct ieee80211_regdomain world_regdom = {
 	.reg_rules = {
 		/* IEEE 802.11b/g, channels 1..11 */
 		REG_RULE(2412-10, 2462+10, 40, 6, 20, 0),
-		/* IEEE 802.11b/g, channels 12..13. No HT40
-		 * channel fits here. */
-		REG_RULE(2467-10, 2472+10, 20, 6, 20,
+		/* IEEE 802.11b/g, channels 12..13. */
+		REG_RULE(2467-10, 2472+10, 40, 6, 20,
 			NL80211_RRF_PASSIVE_SCAN |
 			NL80211_RRF_NO_IBSS),
 		/* IEEE 802.11 channel 14 - Only JP enables
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply related

* Re: [PATCH 14/14] wlcore: Remove redundant check on unsigned variable
From: Luciano Coelho @ 2012-11-16 18:24 UTC (permalink / raw)
  To: Tushar Behera; +Cc: linux-kernel, patches, linux-wireless, netdev
In-Reply-To: <1353048646-10935-15-git-send-email-tushar.behera@linaro.org>

On Fri, 2012-11-16 at 12:20 +0530, Tushar Behera wrote:
> No need to check whether unsigned variable is less than 0.
> 
> CC: Luciano Coelho <coelho@ti.com>
> CC: linux-wireless@vger.kernel.org
> CC: netdev@vger.kernel.org
> Signed-off-by: Tushar Behera <tushar.behera@linaro.org>
> ---

Applied in the wl12xx.git tree.  Thanks!

--
Luca.

^ permalink raw reply

* Re: [PATCH 2/5] drivers/net/wireless/ti/wlcore/main.c: eliminate possible double power off
From: Luciano Coelho @ 2012-11-16 18:18 UTC (permalink / raw)
  To: Julia Lawall
  Cc: kernel-janitors, John W. Linville, linux-wireless, netdev,
	linux-kernel
In-Reply-To: <1350816727-1381-3-git-send-email-Julia.Lawall@lip6.fr>

On Sun, 2012-10-21 at 12:52 +0200, Julia Lawall wrote:
> From: Julia Lawall <Julia.Lawall@lip6.fr>
> 
> The function wl12xx_set_power_on is only called twice, once in
> wl12xx_chip_wakeup and once in wl12xx_get_hw_info.  On the failure of the
> call in wl12xx_chip_wakeup, the containing function just returns, but on
> the failure of the call in wl12xx_get_hw_info, the containing function
> calls wl1271_power_off.  This does not seem necessary, because if
> wl12xx_set_power_on has set the power on and then fails, it has already
> turned the power off.

[...]

Applied and pushed, thanks!

--
Luca.

^ permalink raw reply

* Re: [PATCH net-next] net: use right lock in __dev_remove_offload
From: Vlad Yasevich @ 2012-11-16 18:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1353089303.10798.38.camel@edumazet-glaptop>

On 11/16/2012 01:08 PM, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> offload_base is protected by offload_lock, not ptype_lock
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Vlad Yasevich <vyasevic@redhat.com>
> ---
>   net/core/dev.c |    4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index cf105e8..2705a2a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -513,7 +513,7 @@ void __dev_remove_offload(struct packet_offload *po)
>   	struct list_head *head = &offload_base;
>   	struct packet_offload *po1;
>
> -	spin_lock(&ptype_lock);
> +	spin_lock(&offload_lock);
>
>   	list_for_each_entry(po1, head, list) {
>   		if (po == po1) {
> @@ -524,7 +524,7 @@ void __dev_remove_offload(struct packet_offload *po)
>
>   	pr_warn("dev_remove_offload: %p not found\n", po);
>   out:
> -	spin_unlock(&ptype_lock);
> +	spin_unlock(&offload_lock);
>   }
>   EXPORT_SYMBOL(__dev_remove_offload);
>

Acked-by: Vlad Yasevich <vyasevic@redhat.com>

I distinctly remember changing that, but just look at my patches and 
it's there.... Sorry...

-vlad

^ permalink raw reply

* [PATCH net-next] net: use right lock in __dev_remove_offload
From: Eric Dumazet @ 2012-11-16 18:08 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Vlad Yasevich

From: Eric Dumazet <edumazet@google.com>

offload_base is protected by offload_lock, not ptype_lock

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Yasevich <vyasevic@redhat.com>
---
 net/core/dev.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index cf105e8..2705a2a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -513,7 +513,7 @@ void __dev_remove_offload(struct packet_offload *po)
 	struct list_head *head = &offload_base;
 	struct packet_offload *po1;
 
-	spin_lock(&ptype_lock);
+	spin_lock(&offload_lock);
 
 	list_for_each_entry(po1, head, list) {
 		if (po == po1) {
@@ -524,7 +524,7 @@ void __dev_remove_offload(struct packet_offload *po)
 
 	pr_warn("dev_remove_offload: %p not found\n", po);
 out:
-	spin_unlock(&ptype_lock);
+	spin_unlock(&offload_lock);
 }
 EXPORT_SYMBOL(__dev_remove_offload);
 

^ permalink raw reply related

* Re: [PATCH 0/4] netfilter updates for nf-next (try 2)
From: David Miller @ 2012-11-16 18:00 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <1353069653-3231-1-git-send-email-pablo@netfilter.org>

From: pablo@netfilter.org
Date: Fri, 16 Nov 2012 13:40:49 +0100

> This is the second try to include the following four patches that contain
> updates for your net-next tree, they are:
> 
> * Little cleanup for IPVS the use of a strange notation to assign the
>   conntrack object, from Alan Cox.
> 
> * Another little cleanup for nf_nat to save a couple of lines by using
>   PTR_RET, from Wu Fengguan
> 
> * getsockopt support to obtain the original IPv6 address after NAT,
>   similar to the one that IPv4 provides, from Florian Westphal.
> 
> * Follow-up patch pointed out by YOSHIFUJI Hideaki to only provide
>   the scope_id in case that link is local, again from Florian Westphal.
> 
> You can pull these changes from:
> 
> git://1984.lsi.us.es/nf-next master

Pulled, thanks.

There was a conflict I had to resolve because the net-next tree
had a series of IS_ENABLED conversions that overlapped with some
of the changes in your tree.

^ permalink raw reply

* Re: [PATCH] openvswitch: Make IPv6 packet parsing dependent on IPv6 config
From: Vlad Yasevich @ 2012-11-16 17:43 UTC (permalink / raw)
  To: Jesse Gross
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <CAEP_g=9ge1dq8ahinM075hjdfdmXaou4a9fqPyLPJQinths6pQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 11/16/2012 12:26 PM, Jesse Gross wrote:
> On Fri, Nov 16, 2012 at 7:40 AM, Vlad Yasevich <vyasevic-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
>> Openvswitch attempts to use IPv6 packet parsing functions without
>> any dependency on IPv6 (unlike every other place in kernel).  Pull
>> the IPv6 code in openvswitch togeter and put a conditional that's
>> dependent on CONFIG_IPV6.
>>
>> Resolves:
>> net/built-in.o: In function `ovs_flow_extract':
>> (.text+0xbf5d5): undefined reference to `ipv6_skip_exthdr'
>>
>> Signed-off-by: Vlad Yasevich <vyasevic-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>
>
> Doesn't this move in the opposite direction of your patches to make IPv6
> GSO/GRO always available?  The packets being processed here
> are generally created by the guest but with Open vSwitch running on the
> host.  Also, ipv6_skip_exthdr() is in exthdrs_core.c, so it actually is
> always available.  I suspect that the real problem is that the dependency
> on the ipv6 directory changed to CONFIG_INET and Open vSwitch should now
> depend on this.
>

Yes and no... :)  IPv6 uses a bunch of IPv4 code all over.  IPv4 is 
enabled with CONFIG_INET and IPv6 with CONFIG_NET.  So creates a strange 
imbalance.  By shifting IPv6 to CONFIG_INET (which is where it
lives and what enables its selection during config process), we now have 
a dependency with openvswitch.

All other users of ipv6_skip_exthdr have it either under the IS_ENABLED 
conditional or through some other means that don't build it when INET is
completely turned off.  This patch does the same for openvswitch.

I see 2 alternatives to this:
  1) Make openvswitch depend on CONFIG_INET.
  2) Pull a ton of code out of CONFIG_INET (v4 and v6) and into 
CONFIG_NET.  This could start with IPv6 header parsing and maybe even
include GSO/TSO (but not sure how much sense that would be).

What's your take?

-vlad

^ permalink raw reply

* Re: [PATCH] openvswitch: Make IPv6 packet parsing dependent on IPv6 config
From: Jesse Gross @ 2012-11-16 17:36 UTC (permalink / raw)
  To: Vlad Yasevich
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <1353080434-14165-1-git-send-email-vyasevic-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Fri, Nov 16, 2012 at 7:40 AM, Vlad Yasevich <vyasevic-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Openvswitch attempts to use IPv6 packet parsing functions without
> any dependency on IPv6 (unlike every other place in kernel).  Pull
> the IPv6 code in openvswitch togeter and put a conditional that's
> dependent on CONFIG_IPV6.
>
> Resolves:
> net/built-in.o: In function `ovs_flow_extract':
> (.text+0xbf5d5): undefined reference to `ipv6_skip_exthdr'
>
> Signed-off-by: Vlad Yasevich <vyasevic-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

(Sorry for duplicates received, the original message had a typo in the
mailing list address.)

Doesn't this move in the opposite direction of your patches to make
IPv6 GSO/GRO always available?  The packets being processed here are
generally created by the guest but with Open vSwitch running on the
host.  Also, ipv6_skip_exthdr() is in exthdrs_core.c, so it actually
is always available.  I suspect that the real problem is that the
dependency on the ipv6 directory changed to CONFIG_INET and Open
vSwitch should now depend on this.

^ permalink raw reply

* Re: [PATCH] openvswitch: Make IPv6 packet parsing dependent on IPv6 config
From: Jesse Gross @ 2012-11-16 17:33 UTC (permalink / raw)
  To: Vlad Yasevich
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <1353080434-14165-1-git-send-email-vyasevic-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>


[-- Attachment #1.1: Type: text/plain, Size: 990 bytes --]

On Fri, Nov 16, 2012 at 7:40 AM, Vlad Yasevich <vyasevic-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> Openvswitch attempts to use IPv6 packet parsing functions without
> any dependency on IPv6 (unlike every other place in kernel).  Pull
> the IPv6 code in openvswitch togeter and put a conditional that's
> dependent on CONFIG_IPV6.
>
> Resolves:
> net/built-in.o: In function `ovs_flow_extract':
> (.text+0xbf5d5): undefined reference to `ipv6_skip_exthdr'
>
> Signed-off-by: Vlad Yasevich <vyasevic-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>


Doesn't this move in the opposite direction of your patches to make IPv6
GSO/GRO always available?  The packets being processed here
are generally created by the guest but with Open vSwitch running on the
host.  Also, ipv6_skip_exthdr() is in exthdrs_core.c, so it actually is
always available.  I suspect that the real problem is that the dependency
on the ipv6 directory changed to CONFIG_INET and Open vSwitch should now
depend on this.

[-- Attachment #1.2: Type: text/html, Size: 1531 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply

* [RFC PATCH] tcp: introduce raw access to experimental options
From: elelueck @ 2012-11-16 16:54 UTC (permalink / raw)
  To: netdev; +Cc: frankbla, raspl, ubacher, samudrala, Einar Lueck, davem

From: Einar Lueck <elelueck@linux.vnet.ibm.com>

This patch adds means for raw acces to TCP expirimental options
253 and 254. The intention of this is to enable user space
applications to implement communication behaviour that depends
on experimental options. For that, new (set|get)sockopts are
introduced:

TCP_EXPOPTS (get & set): TCP experimental options to be added to
			 packets
TCP_RECV_EXPOPTS (get):  experimental options received with last
			 packet
TCP_RECV_SYN_EXPOPTS (get): experimental options received with
			 SYN packet

TCP experimental options 253 and 254 configured via TCP_EXPOPTS on
any TCP socket are appended to every packet that is sent as long
as there is enough room left. If there is not enough room left they
are silently dropped.

Listening sockets reply to SYN packets with SYN ACK packets containing
TCP experimental options 253 and 254 as configured via TCP_EXPOPTS, too.
If a TCP connection gets established the configured experimental options
are the defaults for the new socket, too. Thus, a getsockopt on the
resulting accept socket for TCP_EXPOPTS returns the same stuff configured
on the listening socket.

As mentioned above, even after the 3whs is complete, experimental options
are sent with every packet. To enable user space applications to distinguish
between what has been advertized via SYN and what has been received with the
last packet the aforementioned TCP_RECV_SYN_EXPOPTS and TCP_RECV_EXPOPTS are
introduced.

Today, experimental option 253 (COOKIE) and 254 (FASTOPEN) are already
exploited. For co-existence the following approach has been taken:

General remarks:
* Interface to COOKIE and FASTOPEN stays the same
Sender side:
1. COOKIE and FASTPATH code adds own options first (if applicable)
2. Finally, if enough room is left, TCP_EXPOPTS experimental options are
   appended
Receiver side:
1. ALL 253 and 254 experimental options are made available via
   TCP_RECV(_SYN)_EXPOPTS
2. COOKIE and FASTOPEN code check if there is any option relevant for them

References:
http://tools.ietf.org/html/draft-ietf-tcpm-experimental-options-02

Signed-off-by: Einar Lueck <elelueck@linux.vnet.ibm.com>
---
 include/linux/tcp.h      |  25 ++++++++++
 include/net/tcp.h        |   3 ++
 net/ipv4/tcp.c           | 110 +++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp_input.c     | 119 +++++++++++++++++++++++++++++++----------------
 net/ipv4/tcp_ipv4.c      |  14 ++++++
 net/ipv4/tcp_minisocks.c |  17 +++++++
 net/ipv4/tcp_output.c    |  37 ++++++++++++---
 7 files changed, 279 insertions(+), 46 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index eb125a4..b2a6451 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -110,6 +110,10 @@ enum {
 #define TCP_REPAIR_QUEUE	20
 #define TCP_QUEUE_SEQ		21
 #define TCP_REPAIR_OPTIONS	22
+#define TCP_EXPOPTS		23	/* TCP exp. options (configured) */
+#define TCP_RECV_EXPOPTS	24	/* TCP exp. options (received) */
+#define TCP_RECV_SYN_EXPOPTS	25	/* TCP exp. options
+					   (received with syn)) */
 
 struct tcp_repair_opt {
 	__u32	opt_code;
@@ -269,6 +273,8 @@ struct tcp_sack_block {
 #define TCP_FACK_ENABLED  (1 << 1)   /*1 = FACK is enabled locally*/
 #define TCP_DSACK_SEEN    (1 << 2)   /*1 = DSACK was received from peer*/
 
+#define TCP_EXPOP_MAXLEN	40
+
 struct tcp_options_received {
 /*	PAWS/RTTM data	*/
 	long	ts_recent_stamp;/* Time we stored ts_recent (for aging) */
@@ -288,6 +294,9 @@ struct tcp_options_received {
 	u8	num_sacks;	/* Number of SACK blocks		*/
 	u16	user_mss;	/* mss requested by user in ioctl	*/
 	u16	mss_clamp;	/* Maximal mss, negotiated at connection setup */
+	u8	exp_opts_len;	/* length of buffer containing all exp
+				   options in format: kind length data */
+	u8	exp_opts[TCP_EXPOP_MAXLEN];	/* experimental options */
 };
 
 static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
@@ -295,6 +304,7 @@ static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
 	rx_opt->tstamp_ok = rx_opt->sack_ok = 0;
 	rx_opt->wscale_ok = rx_opt->snd_wscale = 0;
 	rx_opt->cookie_plus = 0;
+	rx_opt->exp_opts_len = 0;
 }
 
 /* This is the max number of SACKS that we'll generate and process. It's safe
@@ -315,6 +325,10 @@ struct tcp_request_sock {
 	u32				rcv_isn;
 	u32				snt_isn;
 	u32				snt_synack; /* synack sent time */
+
+	u8 syn_expopts[TCP_EXPOP_MAXLEN];	/* experimental options
+						   received with SYNACK */
+	u8 syn_expopts_len;
 };
 
 static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
@@ -406,6 +420,17 @@ struct tcp_sock {
 	u32	snd_up;		/* Urgent pointer		*/
 
 	u8	keepalive_probes; /* num of allowed keep alive probes	*/
+
+	/* for raw acces to experimental options */
+	struct {
+		u8 *conf;	/* lazy allocation of TCP_EXPOP_MAXLEN bytes
+				   for raw access to experimental options */
+		u8 conf_len;	/* bytes actually used for experimental opts */
+		u8 *syn;	/* experimental options received with SYN,
+				   allocated only if received */
+		u8 syn_len;	/* bytes of experimental options actually
+				   received with SYN */
+	} exp_opts;
 /*
  *      Options received (usually on last packet, some only on SYN packets).
  */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 1f000ff..b63d5c9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -170,6 +170,8 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOPT_TIMESTAMP	8	/* Better RTT estimations/PAWS */
 #define TCPOPT_MD5SIG		19	/* MD5 Signature (RFC2385) */
 #define TCPOPT_COOKIE		253	/* Cookie extension (experimental) */
+#define TCPOPT_EXP253		253	/* TCP experimental option 253 */
+#define TCPOPT_EXP254		254	/* TCP experimental option 254 */
 #define TCPOPT_EXP		254	/* Experimental */
 /* Magic number to be after the option value for sharing TCP
  * experimental options. See draft-ietf-tcpm-experimental-options-00.txt
@@ -180,6 +182,7 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
  *     TCP option lengths
  */
 
+#define TCPOLEN_MAX_ANYEXP     40
 #define TCPOLEN_MSS            4
 #define TCPOLEN_WINDOW         3
 #define TCPOLEN_SACK_PERM      2
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5f64193..e7e4947 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -423,6 +423,12 @@ void tcp_init_sock(struct sock *sk)
 	sk->sk_sndbuf = sysctl_tcp_wmem[1];
 	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
 
+	/* memory for raw access to experimental options is allocated lazy */
+	tp->exp_opts.conf = NULL;
+	tp->exp_opts.conf_len = 0;
+	tp->exp_opts.syn = NULL;
+	tp->exp_opts.syn_len = 0;
+
 	local_bh_disable();
 	sock_update_memcg(sk);
 	sk_sockets_allocated_inc(sk);
@@ -2376,6 +2382,53 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 
 	/* These are data/string values, all the others are ints */
 	switch (optname) {
+	case TCP_EXPOPTS: {
+		u8 conf[TCP_EXPOP_MAXLEN];
+
+		if (optlen > TCP_EXPOP_MAXLEN || (optlen < 4 && optlen > 0) ||
+		    (optlen % 4 > 0))
+			return -EINVAL;
+		if (optlen > 0 && !optval)
+			return -EINVAL;
+
+		/* filter for raw access to supported options */
+		if (optlen) {
+			u8 i;
+
+			if (copy_from_user(conf, optval, optlen))
+				return -EFAULT;
+
+			i = 0;
+			while (i < optlen) {
+				if (conf[i] != TCPOPT_EXP253 &&
+				    conf[i] != TCPOPT_EXP254)
+					return -EINVAL;
+
+				if (i + 1 < optlen) {
+					i += conf[i+1];
+					if (i > optlen)
+						return -EINVAL;
+				} else {
+					return -EINVAL;
+				}
+			}
+		}
+
+		lock_sock(sk);
+		if (!optlen) {
+			tp->exp_opts.conf_len = 0;
+			release_sock(sk);
+			return 0;
+		}
+		if (!tp->exp_opts.conf) {
+			tp->exp_opts.conf = kzalloc(TCP_EXPOP_MAXLEN,
+						    sk->sk_allocation);
+		}
+		memcpy(tp->exp_opts.conf, conf, optlen);
+		tp->exp_opts.conf_len = optlen;
+		release_sock(sk);
+		return err;
+	}
 	case TCP_CONGESTION: {
 		char name[TCP_CA_NAME_MAX];
 
@@ -2947,6 +3000,63 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_USER_TIMEOUT:
 		val = jiffies_to_msecs(icsk->icsk_user_timeout);
 		break;
+	case TCP_EXPOPTS: {
+		u8 exp_opts_len;
+
+		if (get_user(len, optlen))
+			return -EFAULT;
+		if (len < 0)
+			return -EINVAL;
+
+		exp_opts_len = tp->exp_opts.conf_len;
+
+		if (exp_opts_len > len)
+			return -EINVAL;
+		if (put_user(exp_opts_len, optlen))
+			return -EFAULT;
+		if (exp_opts_len && copy_to_user(optval, tp->exp_opts.conf,
+						 exp_opts_len))
+			return -EFAULT;
+		return 0;
+	}
+	case TCP_RECV_EXPOPTS:
+		if (get_user(len, optlen))
+			return -EFAULT;
+		if (len < 0)
+			return -EINVAL;
+
+		if (len < tp->rx_opt.exp_opts_len)
+			return -EINVAL;
+
+		if (put_user(tp->rx_opt.exp_opts_len, optlen))
+			return -EFAULT;
+		if (copy_to_user(optval, tp->rx_opt.exp_opts,
+				 tp->rx_opt.exp_opts_len))
+			return -EFAULT;
+		return 0;
+	case TCP_RECV_SYN_EXPOPTS: {
+		u8 exp_opts_len;
+
+		if (get_user(len, optlen))
+			return -EFAULT;
+		if (len < 0)
+			return -EINVAL;
+
+		if (!tp->exp_opts.syn)
+			exp_opts_len = 0;
+		else
+			exp_opts_len = tp->exp_opts.syn_len;
+
+		if (exp_opts_len > len)
+			return -EINVAL;
+		if (put_user(exp_opts_len, optlen))
+			return -EFAULT;
+		if (exp_opts_len && copy_to_user(optval, tp->exp_opts.syn,
+						 exp_opts_len)) {
+			return -EFAULT;
+		}
+		return 0;
+	}
 	default:
 		return -ENOPROTOOPT;
 	}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d377f48..130d4f4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3726,11 +3726,32 @@ old_ack:
 	return 0;
 }
 
+static inline void tcp_parse_fastopen_cookie(int opcode,
+		int opsize,
+		const unsigned char *ptr,
+		struct tcp_fastopen_cookie *foc,
+		const struct tcphdr *th) {
+	/* Fast Open option shares code 254 using a 16 bits magic number. It's
+	 * valid only in SYN or SYN-ACK with an even size.
+	 */
+	if (opsize < TCPOLEN_EXP_FASTOPEN_BASE ||
+	    get_unaligned_be16(ptr) != TCPOPT_FASTOPEN_MAGIC || foc == NULL ||
+	    !th->syn || (opsize & 1))
+		return;
+	foc->len = opsize - TCPOLEN_EXP_FASTOPEN_BASE;
+	if (foc->len >= TCP_FASTOPEN_COOKIE_MIN &&
+	    foc->len <= TCP_FASTOPEN_COOKIE_MAX)
+		memcpy(foc->val, ptr + 2, foc->len);
+	else if (foc->len != 0)
+		foc->len = -1;
+}
+
 /* Look for tcp options. Normally only called on SYN and SYNACK packets.
  * But, this can also be called on packets in the established flow when
  * the fast version below fails.
  */
-void tcp_parse_options(const struct sk_buff *skb, struct tcp_options_received *opt_rx,
+void tcp_parse_options(const struct sk_buff *skb,
+		       struct tcp_options_received *opt_rx,
 		       const u8 **hvpp, int estab,
 		       struct tcp_fastopen_cookie *foc)
 {
@@ -3740,6 +3761,7 @@ void tcp_parse_options(const struct sk_buff *skb, struct tcp_options_received *o
 
 	ptr = (const unsigned char *)(th + 1);
 	opt_rx->saw_tstamp = 0;
+	opt_rx->exp_opts_len = 0;
 
 	while (length > 0) {
 		int opcode = *ptr++;
@@ -3815,48 +3837,56 @@ void tcp_parse_options(const struct sk_buff *skb, struct tcp_options_received *o
 				 */
 				break;
 #endif
-			case TCPOPT_COOKIE:
-				/* This option is variable length.
+			case TCPOPT_EXP253:
+			case TCPOPT_EXP254:
+				/* First parse options into raw access area for
+				 * experimental options. Then handle
+				 * potential exploitations
 				 */
-				switch (opsize) {
-				case TCPOLEN_COOKIE_BASE:
-					/* not yet implemented */
-					break;
-				case TCPOLEN_COOKIE_PAIR:
-					/* not yet implemented */
-					break;
-				case TCPOLEN_COOKIE_MIN+0:
-				case TCPOLEN_COOKIE_MIN+2:
-				case TCPOLEN_COOKIE_MIN+4:
-				case TCPOLEN_COOKIE_MIN+6:
-				case TCPOLEN_COOKIE_MAX:
-					/* 16-bit multiple */
-					opt_rx->cookie_plus = opsize;
-					*hvpp = ptr;
-					break;
-				default:
-					/* ignore option */
-					break;
+				if (opsize <= TCPOLEN_MAX_ANYEXP &&
+				    opsize >= 2 &&
+				    (opt_rx->exp_opts_len + opsize <=
+				     TCPOLEN_MAX_ANYEXP)) {
+					opt_rx->exp_opts[
+						opt_rx->exp_opts_len] = opcode;
+					opt_rx->exp_opts[
+						opt_rx->exp_opts_len + 1] =
+						opsize;
+					memcpy(opt_rx->exp_opts +
+						opt_rx->exp_opts_len + 2, ptr,
+						opsize - 2);
+					opt_rx->exp_opts_len += opsize;
 				}
-				break;
 
-			case TCPOPT_EXP:
-				/* Fast Open option shares code 254 using a
-				 * 16 bits magic number. It's valid only in
-				 * SYN or SYN-ACK with an even size.
-				 */
-				if (opsize < TCPOLEN_EXP_FASTOPEN_BASE ||
-				    get_unaligned_be16(ptr) != TCPOPT_FASTOPEN_MAGIC ||
-				    foc == NULL || !th->syn || (opsize & 1))
-					break;
-				foc->len = opsize - TCPOLEN_EXP_FASTOPEN_BASE;
-				if (foc->len >= TCP_FASTOPEN_COOKIE_MIN &&
-				    foc->len <= TCP_FASTOPEN_COOKIE_MAX)
-					memcpy(foc->val, ptr + 2, foc->len);
-				else if (foc->len != 0)
-					foc->len = -1;
+				/* handle potential exploitations */
+				if (opcode == TCPOPT_COOKIE) {
+					/* This option is variable length. */
+					switch (opsize) {
+					case TCPOLEN_COOKIE_BASE:
+						/* not yet implemented */
+						break;
+					case TCPOLEN_COOKIE_PAIR:
+						/* not yet implemented */
+						break;
+					case TCPOLEN_COOKIE_MIN+0:
+					case TCPOLEN_COOKIE_MIN+2:
+					case TCPOLEN_COOKIE_MIN+4:
+					case TCPOLEN_COOKIE_MIN+6:
+					case TCPOLEN_COOKIE_MAX:
+						/* 16-bit multiple */
+						opt_rx->cookie_plus = opsize;
+						*hvpp = ptr;
+						break;
+					default:
+						/* ignore option */
+						break;
+					}
+				} else {
+					tcp_parse_fastopen_cookie(opcode,
+								  opsize, ptr,
+								  foc, th);
+				}
 				break;
-
 			}
 			ptr += opsize-2;
 			length -= opsize;
@@ -3888,6 +3918,9 @@ static bool tcp_fast_parse_options(const struct sk_buff *skb,
 				   const struct tcphdr *th,
 				   struct tcp_sock *tp, const u8 **hvpp)
 {
+	/* required if exp options are not used anymore by the counter part */
+	tp->rx_opt.exp_opts_len = 0;
+
 	/* In the spirit of fast parsing, compare doff directly to constant
 	 * values.  Because equality is used, short doff can be ignored here.
 	 */
@@ -5806,6 +5839,14 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 			}
 		}
 
+		if (unlikely(tp->rx_opt.exp_opts_len > 0)) {
+			tp->exp_opts.syn = kzalloc(tp->rx_opt.exp_opts_len,
+						   sk->sk_allocation);
+			tp->exp_opts.syn_len = tp->rx_opt.exp_opts_len;
+			memcpy(tp->exp_opts.syn, &tp->rx_opt.exp_opts,
+			       tp->rx_opt.exp_opts_len);
+		}
+
 		smp_mb();
 
 		tcp_finish_connect(sk, skb);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 00a748d..2f66bd5 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1321,6 +1321,16 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	tmp_opt.user_mss  = tp->rx_opt.user_mss;
 	tcp_parse_options(skb, &tmp_opt, &hash_location, 0, NULL);
 
+	/* for raw access to experimental options in SYN packet */
+	tcp_rsk(req)->syn_expopts_len = tmp_opt.exp_opts_len;
+	if (tcp_rsk(req)->syn_expopts_len) {
+		/* transport experimental options via request socket to big
+		 * socket
+		 */
+		memcpy(tcp_rsk(req)->syn_expopts, tmp_opt.exp_opts,
+		       tcp_rsk(req)->syn_expopts_len);
+	}
+
 	if (tmp_opt.cookie_plus > 0 &&
 	    tmp_opt.saw_tstamp &&
 	    !tp->rx_opt.cookie_out_never &&
@@ -1978,6 +1988,10 @@ void tcp_v4_destroy_sock(struct sock *sk)
 		tp->cookie_values = NULL;
 	}
 
+	/* buffers for raw access to experimental options */
+	kfree(tp->exp_opts.conf);
+	kfree(tp->exp_opts.syn);
+
 	/* If socket is aborted during connect operation */
 	tcp_free_fastopen_req(tp);
 
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 6ff7f10..dc25875 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -466,6 +466,23 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 
 		newtp->urg_data = 0;
 
+		if (tcp_rsk(req)->syn_expopts_len) {
+			newtp->exp_opts.syn_len =
+					tcp_rsk(req)->syn_expopts_len;
+			newtp->exp_opts.syn = kzalloc(newtp->exp_opts.syn_len,
+						      GFP_ATOMIC);
+			memcpy(newtp->exp_opts.syn, tcp_rsk(req)->syn_expopts,
+			       newtp->exp_opts.syn_len);
+		}
+
+		if (oldtp->exp_opts.conf_len > 0) {
+			newtp->exp_opts.conf_len = oldtp->exp_opts.conf_len;
+			newtp->exp_opts.conf = kzalloc(TCP_EXPOP_MAXLEN,
+						       GFP_ATOMIC);
+			memcpy(newtp->exp_opts.conf, oldtp->exp_opts.conf,
+			       oldtp->exp_opts.conf_len);
+		}
+
 		if (sock_flag(newsk, SOCK_KEEPOPEN))
 			inet_csk_reset_keepalive_timer(newsk,
 						       keepalive_time_when(newtp));
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d046326..8d7cf51 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -385,6 +385,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp)
 #define OPTION_MD5		(1 << 2)
 #define OPTION_WSCALE		(1 << 3)
 #define OPTION_COOKIE_EXTENSION	(1 << 4)
+#define OPTION_EXP		(1 << 5)
 #define OPTION_FAST_OPEN_COOKIE	(1 << 8)
 
 struct tcp_out_options {
@@ -581,6 +582,12 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 		}
 		ptr += (foc->len + 3) >> 2;
 	}
+	if (unlikely(OPTION_EXP & options && tp->exp_opts.conf_len > 0)) {
+		__u8 *p = (__u8 *) ptr;
+		memcpy(ptr, tp->exp_opts.conf, tp->exp_opts.conf_len);
+		p += tp->exp_opts.conf_len;
+		ptr = (__be32 *) p;
+	}
 }
 
 /* Compute TCP options for SYN packets. This is not the final
@@ -693,6 +700,11 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 			remaining -= need;
 		}
 	}
+	if (unlikely(tp->exp_opts.conf_len > 0 &&
+		     tp->exp_opts.conf_len <= remaining)) {
+		opts->options |= OPTION_EXP;
+		remaining -= tp->exp_opts.conf_len;
+	}
 	return MAX_TCP_OPTION_SPACE - remaining;
 }
 
@@ -747,6 +759,11 @@ static unsigned int tcp_synack_options(struct sock *sk,
 		if (unlikely(!ireq->tstamp_ok))
 			remaining -= TCPOLEN_SACKPERM_ALIGNED;
 	}
+	if (unlikely(tcp_sk(sk)->exp_opts.conf_len > 0 &&
+		     tcp_sk(sk)->exp_opts.conf_len <= remaining)) {
+		opts->options |= OPTION_EXP;
+		remaining -= tcp_sk(sk)->exp_opts.conf_len;
+	}
 
 	/* Similar rationale to tcp_syn_options() applies here, too.
 	 * If the <SYN> options fit, the same options should fit now!
@@ -782,38 +799,44 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 {
 	struct tcp_skb_cb *tcb = skb ? TCP_SKB_CB(skb) : NULL;
 	struct tcp_sock *tp = tcp_sk(sk);
-	unsigned int size = 0;
+	unsigned remaining = MAX_TCP_OPTION_SPACE;
 	unsigned int eff_sacks;
 
 #ifdef CONFIG_TCP_MD5SIG
 	*md5 = tp->af_specific->md5_lookup(sk, sk);
 	if (unlikely(*md5)) {
 		opts->options |= OPTION_MD5;
-		size += TCPOLEN_MD5SIG_ALIGNED;
+		remaining -= TCPOLEN_MD5SIG_ALIGNED;
 	}
 #else
 	*md5 = NULL;
 #endif
 
-	if (likely(tp->rx_opt.tstamp_ok)) {
+	if (likely(tp->rx_opt.tstamp_ok &&
+		   remaining >= TCPOLEN_TSTAMP_ALIGNED)) {
 		opts->options |= OPTION_TS;
 		opts->tsval = tcb ? tcb->when : 0;
 		opts->tsecr = tp->rx_opt.ts_recent;
-		size += TCPOLEN_TSTAMP_ALIGNED;
+		remaining -= TCPOLEN_TSTAMP_ALIGNED;
 	}
 
 	eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack;
 	if (unlikely(eff_sacks)) {
-		const unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
 		opts->num_sack_blocks =
 			min_t(unsigned int, eff_sacks,
 			      (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
 			      TCPOLEN_SACK_PERBLOCK);
-		size += TCPOLEN_SACK_BASE_ALIGNED +
+		remaining -= TCPOLEN_SACK_BASE_ALIGNED +
 			opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
 	}
 
-	return size;
+	if (unlikely(tp->exp_opts.conf_len > 0 &&
+		     tp->exp_opts.conf_len <= remaining)) {
+		opts->options |= OPTION_EXP;
+		remaining -= tp->exp_opts.conf_len;
+	}
+
+	return MAX_TCP_OPTION_SPACE - remaining;
 }
 
 
-- 
1.7.12.4

^ permalink raw reply related

* Re: [PATCH net-next 1/4] ipip: allow to deactivate the creation of fb dev
From: Nicolas Dichtel @ 2012-11-16 16:46 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, davem
In-Reply-To: <20121116082926.1c6cccd2@nehalam.linuxnetplumber.net>

Le 16/11/2012 17:29, Stephen Hemminger a écrit :
> On Fri, 16 Nov 2012 17:14:13 +0100
> Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:
>
>> Now that tunnels can be configured via rtnetlink, this device is not mandatory.
>> The default is conservative.
>>
>> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>
>
> Although I am in favor of reducing clutter, and we even have to put in special case
> code to ignore these stub devices in the Vyatta scripts. Module parameters are bit of a nuisance to deal with, but maybe
> the only way for this kind of thing and keep the required ABI.
>
> Not sure if I can fully endorse this. The device may still have uses.
> It is still useful for capturing "none of the above" packets
If you need to capture these packets, you can still create a tunnel with local 
any and remote any, even if the fb_device has not been created.

> and is used to auto-load module via module aliases.
Right, but if user uses netlink, the problem exists without these patches too.

By default, the fb device is created, so there is no change if you don't set 
explicitly setup_fb to 0.

^ permalink raw reply

* Re: [PATCH v2 net-next] sctp: Add support to per-association statistics via a new SCTP_GET_ASSOC_STATS call
From: Neil Horman @ 2012-11-16 16:39 UTC (permalink / raw)
  To: Michele Baldessari
  Cc: linux-sctp, Thomas Graf, Vlad Yasevich, netdev, David S. Miller
In-Reply-To: <1352991680-12289-1-git-send-email-michele@acksyn.org>

On Thu, Nov 15, 2012 at 04:01:20PM +0100, Michele Baldessari wrote:
> The current SCTP stack is lacking a mechanism to have per association
> statistics. This is an implementation modeled after OpenSolaris'
> SCTP_GET_ASSOC_STATS.
> 
> Userspace part will follow on lksctp if/when there is a general ACK on
> this.
> 
> V2)
>   - Implement partial retrieval of stat struct to cope for future expansion
>   - Kill the rtxpackets counter as it cannot be precise anyway
>   - Rename outseqtsns to outofseqtsns to make it clearer that these are out
>     of sequence unexpected TSNs
>   - Move asoc->ipackets++ under a lock to avoid potential miscounts
>   - Fold asoc->opackets++ into the already existing asoc check
>   - Kill unneeded (q->asoc) test when increasing rtxchunks
>   - Do not count octrlchunks if sending failed (SCTP_XMIT_OK != 0)
>   - Don't count SHUTDOWNs as SACKs
>   - Move SCTP_GET_ASSOC_STATS to the private space API
>   - Adjust the len check in sctp_getsockopt_assoc_stats() to allow for
>     future struct growth
>   - Move association statistics in their own struct
>   - Update idupchunks when we send a SACK with dup TSNs
>   - return min_rto in max_rto when RTO has not changed. Also return the
>     transport when max_rto last changed.
> 
> Signed-off: Michele Baldessari <michele@acksyn.org>
> Acked-by: Thomas Graf <tgraf@suug.ch>

Yes, I think this is good, I still don't like the idea of having to do these via
an ioctl, but I suppose it fits well enough.
Neil

Acked-by: Neil Horman <nhorman@tuxdriver.com>

^ permalink raw reply

* Re: [PATCH net-next 1/4] ipip: allow to deactivate the creation of fb dev
From: Stephen Hemminger @ 2012-11-16 16:29 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev, davem
In-Reply-To: <1353082456-21234-2-git-send-email-nicolas.dichtel@6wind.com>

On Fri, 16 Nov 2012 17:14:13 +0100
Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:

> Now that tunnels can be configured via rtnetlink, this device is not mandatory.
> The default is conservative.
> 
> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>


Although I am in favor of reducing clutter, and we even have to put in special case
code to ignore these stub devices in the Vyatta scripts. Module parameters are bit of a nuisance to deal with, but maybe
the only way for this kind of thing and keep the required ABI.

Not sure if I can fully endorse this. The device may still have uses.
It is still useful for capturing "none of the above" packets
and is used to auto-load module via module aliases.

^ permalink raw reply

* Re: [PATCH iproute2 3/3] ip/ip6tunnel: fix update of tclass and flowlabel
From: Stephen Hemminger @ 2012-11-16 16:16 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev
In-Reply-To: <1352906966-12932-3-git-send-email-nicolas.dichtel@6wind.com>

On Wed, 14 Nov 2012 16:29:26 +0100
Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:

> When tclass or flowlabel field were updated, we only performed an OR with the
> new value. For example, it was not possible to reset tclass:
>   ip -6 tunnel change ip6tnl2 tclass 0
> 
> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> ---
>  ip/ip6tunnel.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/ip/ip6tunnel.c b/ip/ip6tunnel.c
> index 7aaac61..fcc9f33 100644
> --- a/ip/ip6tunnel.c
> +++ b/ip/ip6tunnel.c
> @@ -173,6 +173,7 @@ static int parse_args(int argc, char **argv, int cmd, struct ip6_tnl_parm *p)
>  			   matches(*argv, "dsfield") == 0) {
>  			__u8 uval;
>  			NEXT_ARG();
> +			p->flowinfo &= ~IP6_FLOWINFO_TCLASS;
>  			if (strcmp(*argv, "inherit") == 0)
>  				p->flags |= IP6_TNL_F_USE_ORIG_TCLASS;
>  			else {
> @@ -185,6 +186,7 @@ static int parse_args(int argc, char **argv, int cmd, struct ip6_tnl_parm *p)
>  			   strcmp(*argv, "fl") == 0) {
>  			__u32 uval;
>  			NEXT_ARG();
> +			p->flowinfo &= ~IP6_FLOWINFO_FLOWLABEL;
>  			if (strcmp(*argv, "inherit") == 0)
>  				p->flags |= IP6_TNL_F_USE_ORIG_FLOWLABEL;
>  			else {

All applied thanks.

^ permalink raw reply

* [PATCH net-next 0/4] Allow to deactivate fb tunnels device
From: Nicolas Dichtel @ 2012-11-16 16:14 UTC (permalink / raw)
  To: netdev; +Cc: davem

This serie proposes a module option to avoid the creation of the fb tunnels
device for sit/ipip/ip6tnl.
It also add netlink management for 6RD tunnels. The last info that still
requires ioctl is the management of the potential routers list (prl) for isatap
(note that this ioctl is not performed on the fb device).

As usual, the patch against iproute2 will be sent once the patches are included and
net-next merged. I can send it on demand.

 include/uapi/linux/if_tunnel.h |   4 +
 net/ipv4/ipip.c                |  42 ++++----
 net/ipv6/ip6_tunnel.c          |  43 +++++----
 net/ipv6/sit.c                 | 211 ++++++++++++++++++++++++++++++++---------
 4 files changed, 221 insertions(+), 79 deletions(-)

Comments are welcome.

Regards,
Nicolas

^ permalink raw reply

* [PATCH net-next 1/4] ipip: allow to deactivate the creation of fb dev
From: Nicolas Dichtel @ 2012-11-16 16:14 UTC (permalink / raw)
  To: netdev; +Cc: davem, Nicolas Dichtel
In-Reply-To: <1353082456-21234-1-git-send-email-nicolas.dichtel@6wind.com>

Now that tunnels can be configured via rtnetlink, this device is not mandatory.
The default is conservative.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/ipv4/ipip.c | 42 ++++++++++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index c26c171..9e11633 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -124,6 +124,11 @@ static bool log_ecn_error = true;
 module_param(log_ecn_error, bool, 0644);
 MODULE_PARM_DESC(log_ecn_error, "Log packets received with corrupted ECN");
 
+static bool setup_fb = true;
+module_param(setup_fb, bool, 0644);
+MODULE_PARM_DESC(setup_fb,
+		 "Setup the fb device to configure tunnel via IOCTL");
+
 static int ipip_net_id __read_mostly;
 struct ipip_net {
 	struct ip_tunnel __rcu *tunnels_r_l[HASH_SIZE];
@@ -1022,25 +1027,29 @@ static int __net_init ipip_init_net(struct net *net)
 	ipn->tunnels[2] = ipn->tunnels_r;
 	ipn->tunnels[3] = ipn->tunnels_r_l;
 
-	ipn->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel),
-					   "tunl0",
-					   ipip_tunnel_setup);
-	if (!ipn->fb_tunnel_dev) {
-		err = -ENOMEM;
-		goto err_alloc_dev;
-	}
-	dev_net_set(ipn->fb_tunnel_dev, net);
+	if (setup_fb) {
+		ipn->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel),
+						  "tunl0",
+						  ipip_tunnel_setup);
+		if (!ipn->fb_tunnel_dev) {
+			err = -ENOMEM;
+			goto err_alloc_dev;
+		}
+		dev_net_set(ipn->fb_tunnel_dev, net);
 
-	err = ipip_fb_tunnel_init(ipn->fb_tunnel_dev);
-	if (err)
-		goto err_reg_dev;
+		err = ipip_fb_tunnel_init(ipn->fb_tunnel_dev);
+		if (err)
+			goto err_reg_dev;
 
-	if ((err = register_netdev(ipn->fb_tunnel_dev)))
-		goto err_reg_dev;
+		err = register_netdev(ipn->fb_tunnel_dev);
+		if (err)
+			goto err_reg_dev;
 
-	t = netdev_priv(ipn->fb_tunnel_dev);
+		t = netdev_priv(ipn->fb_tunnel_dev);
 
-	strcpy(t->parms.name, ipn->fb_tunnel_dev->name);
+		strcpy(t->parms.name, ipn->fb_tunnel_dev->name);
+	} else
+		ipn->fb_tunnel_dev = NULL;
 	return 0;
 
 err_reg_dev:
@@ -1057,7 +1066,8 @@ static void __net_exit ipip_exit_net(struct net *net)
 
 	rtnl_lock();
 	ipip_destroy_tunnels(ipn, &list);
-	unregister_netdevice_queue(ipn->fb_tunnel_dev, &list);
+	if (setup_fb)
+		unregister_netdevice_queue(ipn->fb_tunnel_dev, &list);
 	unregister_netdevice_many(&list);
 	rtnl_unlock();
 }
-- 
1.7.12

^ permalink raw reply related

* [PATCH net-next 4/4] ip6tnl: allow to deactivate the creation of fb dev
From: Nicolas Dichtel @ 2012-11-16 16:14 UTC (permalink / raw)
  To: netdev; +Cc: davem, Nicolas Dichtel
In-Reply-To: <1353082456-21234-1-git-send-email-nicolas.dichtel@6wind.com>

Now that tunnels can be configured via rtnetlink, this device is not mandatory.
The default is conservative.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/ipv6/ip6_tunnel.c | 43 +++++++++++++++++++++++++++----------------
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index bf3a549..fe2028e 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -62,6 +62,11 @@ MODULE_DESCRIPTION("IPv6 tunneling device");
 MODULE_LICENSE("GPL");
 MODULE_ALIAS_NETDEV("ip6tnl0");
 
+static bool setup_fb = true;
+module_param(setup_fb, bool, 0644);
+MODULE_PARM_DESC(setup_fb,
+		 "Setup the fb device to configure tunnel via IOCTL");
+
 #ifdef IP6_TNL_DEBUG
 #define IP6_TNL_TRACE(x...) pr_debug("%s:" x "\n", __func__)
 #else
@@ -1711,8 +1716,10 @@ static void __net_exit ip6_tnl_destroy_tunnels(struct ip6_tnl_net *ip6n)
 	}
 
 	t = rtnl_dereference(ip6n->tnls_wc[0]);
-	unregister_netdevice_queue(t->dev, &list);
-	unregister_netdevice_many(&list);
+	if (t) {
+		unregister_netdevice_queue(t->dev, &list);
+		unregister_netdevice_many(&list);
+	}
 }
 
 static int __net_init ip6_tnl_init_net(struct net *net)
@@ -1724,25 +1731,29 @@ static int __net_init ip6_tnl_init_net(struct net *net)
 	ip6n->tnls[0] = ip6n->tnls_wc;
 	ip6n->tnls[1] = ip6n->tnls_r_l;
 
-	err = -ENOMEM;
-	ip6n->fb_tnl_dev = alloc_netdev(sizeof(struct ip6_tnl), "ip6tnl0",
-				      ip6_tnl_dev_setup);
+	if (setup_fb) {
+		err = -ENOMEM;
+		ip6n->fb_tnl_dev = alloc_netdev(sizeof(struct ip6_tnl),
+						"ip6tnl0", ip6_tnl_dev_setup);
 
-	if (!ip6n->fb_tnl_dev)
-		goto err_alloc_dev;
-	dev_net_set(ip6n->fb_tnl_dev, net);
+		if (!ip6n->fb_tnl_dev)
+			goto err_alloc_dev;
+		dev_net_set(ip6n->fb_tnl_dev, net);
 
-	err = ip6_fb_tnl_dev_init(ip6n->fb_tnl_dev);
-	if (err < 0)
-		goto err_register;
+		err = ip6_fb_tnl_dev_init(ip6n->fb_tnl_dev);
+		if (err < 0)
+			goto err_register;
 
-	err = register_netdev(ip6n->fb_tnl_dev);
-	if (err < 0)
-		goto err_register;
+		err = register_netdev(ip6n->fb_tnl_dev);
+		if (err < 0)
+			goto err_register;
+
+		t = netdev_priv(ip6n->fb_tnl_dev);
 
-	t = netdev_priv(ip6n->fb_tnl_dev);
+		strcpy(t->parms.name, ip6n->fb_tnl_dev->name);
+	} else
+		ip6n->fb_tnl_dev = NULL;
 
-	strcpy(t->parms.name, ip6n->fb_tnl_dev->name);
 	return 0;
 
 err_register:
-- 
1.7.12

^ permalink raw reply related

* [PATCH net-next 3/4] sit: allow to deactivate the creation of fb device
From: Nicolas Dichtel @ 2012-11-16 16:14 UTC (permalink / raw)
  To: netdev; +Cc: davem, Nicolas Dichtel
In-Reply-To: <1353082456-21234-1-git-send-email-nicolas.dichtel@6wind.com>

Now that tunnels can be configured via rtnetlink, this device is not mandatory.
The default is conservative.

The fb device was also used by 6RD as a template for default 6RD parameters.
User was able to update this template for each netns, hence when the fb device
is not created, this option does not exist. However, user can now set 6RD
parameters in the same netlink message that create the tunnel (before two
ioctl were needed) thus user can directly set the right value.

Last point is about ISATAP. The potential routers list (prl) management has not
been converted to netlink, but the ioctl is anyway performed on the related
device directly and not on the fb device.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/ipv6/sit.c | 62 +++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 40 insertions(+), 22 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 504422d..d789a55 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -65,6 +65,15 @@
 #define HASH_SIZE  16
 #define HASH(addr) (((__force u32)addr^((__force u32)addr>>4))&0xF)
 
+static bool setup_fb = true;
+module_param(setup_fb, bool, 0644);
+MODULE_PARM_DESC(setup_fb,
+		 "Setup the fb device to configure tunnel via IOCTL");
+
+#ifdef CONFIG_IPV6_SIT_6RD
+static struct ip_tunnel_6rd_parm ip6rd_template;
+#endif
+
 static int ipip6_tunnel_init(struct net_device *dev);
 static void ipip6_tunnel_setup(struct net_device *dev);
 static void ipip6_dev_free(struct net_device *dev);
@@ -204,12 +213,9 @@ static void ipip6_tunnel_clone_6rd(struct net_device *dev, struct sit_net *sitn)
 #ifdef CONFIG_IPV6_SIT_6RD
 	struct ip_tunnel *t = netdev_priv(dev);
 
-	if (t->dev == sitn->fb_tunnel_dev) {
-		ipv6_addr_set(&t->ip6rd.prefix, htonl(0x20020000), 0, 0, 0);
-		t->ip6rd.relay_prefix = 0;
-		t->ip6rd.prefixlen = 16;
-		t->ip6rd.relay_prefixlen = 0;
-	} else {
+	if (t->dev == sitn->fb_tunnel_dev || setup_fb == false)
+		memcpy(&t->ip6rd, &ip6rd_template, sizeof(t->ip6rd));
+	else {
 		struct ip_tunnel *t0 = netdev_priv(sitn->fb_tunnel_dev);
 		memcpy(&t->ip6rd, &t0->ip6rd, sizeof(t->ip6rd));
 	}
@@ -1501,26 +1507,30 @@ static int __net_init sit_init_net(struct net *net)
 	sitn->tunnels[2] = sitn->tunnels_r;
 	sitn->tunnels[3] = sitn->tunnels_r_l;
 
-	sitn->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel), "sit0",
-					   ipip6_tunnel_setup);
-	if (!sitn->fb_tunnel_dev) {
-		err = -ENOMEM;
-		goto err_alloc_dev;
-	}
-	dev_net_set(sitn->fb_tunnel_dev, net);
+	if (setup_fb) {
+		sitn->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel),
+						   "sit0", ipip6_tunnel_setup);
+		if (!sitn->fb_tunnel_dev) {
+			err = -ENOMEM;
+			goto err_alloc_dev;
+		}
+		dev_net_set(sitn->fb_tunnel_dev, net);
 
-	err = ipip6_fb_tunnel_init(sitn->fb_tunnel_dev);
-	if (err)
-		goto err_dev_free;
+		err = ipip6_fb_tunnel_init(sitn->fb_tunnel_dev);
+		if (err)
+			goto err_dev_free;
 
-	ipip6_tunnel_clone_6rd(sitn->fb_tunnel_dev, sitn);
+		ipip6_tunnel_clone_6rd(sitn->fb_tunnel_dev, sitn);
 
-	if ((err = register_netdev(sitn->fb_tunnel_dev)))
-		goto err_reg_dev;
+		err = register_netdev(sitn->fb_tunnel_dev);
+		if (err)
+			goto err_reg_dev;
 
-	t = netdev_priv(sitn->fb_tunnel_dev);
+		t = netdev_priv(sitn->fb_tunnel_dev);
 
-	strcpy(t->parms.name, sitn->fb_tunnel_dev->name);
+		strcpy(t->parms.name, sitn->fb_tunnel_dev->name);
+	} else
+		sitn->fb_tunnel_dev = NULL;
 	return 0;
 
 err_reg_dev:
@@ -1538,7 +1548,8 @@ static void __net_exit sit_exit_net(struct net *net)
 
 	rtnl_lock();
 	sit_destroy_tunnels(sitn, &list);
-	unregister_netdevice_queue(sitn->fb_tunnel_dev, &list);
+	if (setup_fb)
+		unregister_netdevice_queue(sitn->fb_tunnel_dev, &list);
 	unregister_netdevice_many(&list);
 	rtnl_unlock();
 }
@@ -1565,6 +1576,13 @@ static int __init sit_init(void)
 
 	pr_info("IPv6 over IPv4 tunneling driver\n");
 
+#ifdef CONFIG_IPV6_SIT_6RD
+	ipv6_addr_set(&ip6rd_template.prefix, htonl(0x20020000), 0, 0, 0);
+	ip6rd_template.relay_prefix = 0;
+	ip6rd_template.prefixlen = 16;
+	ip6rd_template.relay_prefixlen = 0;
+#endif
+
 	err = register_pernet_device(&sit_net_ops);
 	if (err < 0)
 		return err;
-- 
1.7.12

^ permalink raw reply related

* [PATCH net-next 2/4] sit: allow to configure 6rd tunnels via netlink
From: Nicolas Dichtel @ 2012-11-16 16:14 UTC (permalink / raw)
  To: netdev; +Cc: davem, Nicolas Dichtel
In-Reply-To: <1353082456-21234-1-git-send-email-nicolas.dichtel@6wind.com>

This patch add the support of 6RD tunnels management via netlink.
Note that netdev_state_change() is now called when 6RD parameters are updated.

6RD parameters are updated only if there is at least one 6RD attribute.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 include/uapi/linux/if_tunnel.h |   4 ++
 net/ipv6/sit.c                 | 149 ++++++++++++++++++++++++++++++++++-------
 2 files changed, 128 insertions(+), 25 deletions(-)

diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
index 5ab0c8d..aee73d0 100644
--- a/include/uapi/linux/if_tunnel.h
+++ b/include/uapi/linux/if_tunnel.h
@@ -49,6 +49,10 @@ enum {
 	IFLA_IPTUN_FLAGS,
 	IFLA_IPTUN_PROTO,
 	IFLA_IPTUN_PMTUDISC,
+	IFLA_IPTUN_6RD_PREFIX,
+	IFLA_IPTUN_6RD_RELAY_PREFIX,
+	IFLA_IPTUN_6RD_PREFIXLEN,
+	IFLA_IPTUN_6RD_RELAY_PREFIXLEN,
 	__IFLA_IPTUN_MAX,
 };
 #define IFLA_IPTUN_MAX	(__IFLA_IPTUN_MAX - 1)
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index ca6c2c8..504422d 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -936,6 +936,38 @@ static void ipip6_tunnel_update(struct ip_tunnel *t, struct ip_tunnel_parm *p)
 	netdev_state_change(t->dev);
 }
 
+#ifdef CONFIG_IPV6_SIT_6RD
+static int ipip6_tunnel_update_6rd(struct ip_tunnel *t,
+				   struct ip_tunnel_6rd *ip6rd)
+{
+	struct in6_addr prefix;
+	__be32 relay_prefix;
+
+	if (ip6rd->relay_prefixlen > 32 ||
+	    ip6rd->prefixlen + (32 - ip6rd->relay_prefixlen) > 64)
+		return -EINVAL;
+
+	ipv6_addr_prefix(&prefix, &ip6rd->prefix, ip6rd->prefixlen);
+	if (!ipv6_addr_equal(&prefix, &ip6rd->prefix))
+		return -EINVAL;
+	if (ip6rd->relay_prefixlen)
+		relay_prefix = ip6rd->relay_prefix &
+			       htonl(0xffffffffUL <<
+				     (32 - ip6rd->relay_prefixlen));
+	else
+		relay_prefix = 0;
+	if (relay_prefix != ip6rd->relay_prefix)
+		return -EINVAL;
+
+	t->ip6rd.prefix = prefix;
+	t->ip6rd.relay_prefix = relay_prefix;
+	t->ip6rd.prefixlen = ip6rd->prefixlen;
+	t->ip6rd.relay_prefixlen = ip6rd->relay_prefixlen;
+	netdev_state_change(t->dev);
+	return 0;
+}
+#endif
+
 static int
 ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 {
@@ -1105,31 +1137,9 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 		t = netdev_priv(dev);
 
 		if (cmd != SIOCDEL6RD) {
-			struct in6_addr prefix;
-			__be32 relay_prefix;
-
-			err = -EINVAL;
-			if (ip6rd.relay_prefixlen > 32 ||
-			    ip6rd.prefixlen + (32 - ip6rd.relay_prefixlen) > 64)
-				goto done;
-
-			ipv6_addr_prefix(&prefix, &ip6rd.prefix,
-					 ip6rd.prefixlen);
-			if (!ipv6_addr_equal(&prefix, &ip6rd.prefix))
+			err = ipip6_tunnel_update_6rd(t, &ip6rd);
+			if (err < 0)
 				goto done;
-			if (ip6rd.relay_prefixlen)
-				relay_prefix = ip6rd.relay_prefix &
-					       htonl(0xffffffffUL <<
-						     (32 - ip6rd.relay_prefixlen));
-			else
-				relay_prefix = 0;
-			if (relay_prefix != ip6rd.relay_prefix)
-				goto done;
-
-			t->ip6rd.prefix = prefix;
-			t->ip6rd.relay_prefix = relay_prefix;
-			t->ip6rd.prefixlen = ip6rd.prefixlen;
-			t->ip6rd.relay_prefixlen = ip6rd.relay_prefixlen;
 		} else
 			ipip6_tunnel_clone_6rd(dev, sitn);
 
@@ -1261,11 +1271,53 @@ static void ipip6_netlink_parms(struct nlattr *data[],
 		parms->i_flags = nla_get_be16(data[IFLA_IPTUN_FLAGS]);
 }
 
+#ifdef CONFIG_IPV6_SIT_6RD
+/* This function returns true when 6RD attributes are present in the nl msg */
+static bool ipip6_netlink_6rd_parms(struct nlattr *data[],
+				    struct ip_tunnel_6rd *ip6rd)
+{
+	bool ret = false;
+	memset(ip6rd, 0, sizeof(*ip6rd));
+
+	if (!data)
+		return ret;
+
+	if (data[IFLA_IPTUN_6RD_PREFIX]) {
+		ret = true;
+		nla_memcpy(&ip6rd->prefix, data[IFLA_IPTUN_6RD_PREFIX],
+			   sizeof(struct in6_addr));
+	}
+
+	if (data[IFLA_IPTUN_6RD_RELAY_PREFIX]) {
+		ret = true;
+		ip6rd->relay_prefix =
+			nla_get_be32(data[IFLA_IPTUN_6RD_RELAY_PREFIX]);
+	}
+
+	if (data[IFLA_IPTUN_6RD_PREFIXLEN]) {
+		ret = true;
+		ip6rd->prefixlen = nla_get_u16(data[IFLA_IPTUN_6RD_PREFIXLEN]);
+	}
+
+	if (data[IFLA_IPTUN_6RD_RELAY_PREFIXLEN]) {
+		ret = true;
+		ip6rd->relay_prefixlen =
+			nla_get_u16(data[IFLA_IPTUN_6RD_RELAY_PREFIXLEN]);
+	}
+
+	return ret;
+}
+#endif
+
 static int ipip6_newlink(struct net *src_net, struct net_device *dev,
 			 struct nlattr *tb[], struct nlattr *data[])
 {
 	struct net *net = dev_net(dev);
 	struct ip_tunnel *nt;
+#ifdef CONFIG_IPV6_SIT_6RD
+	struct ip_tunnel_6rd ip6rd;
+#endif
+	int err;
 
 	nt = netdev_priv(dev);
 	ipip6_netlink_parms(data, &nt->parms);
@@ -1273,7 +1325,16 @@ static int ipip6_newlink(struct net *src_net, struct net_device *dev,
 	if (ipip6_tunnel_locate(net, &nt->parms, 0))
 		return -EEXIST;
 
-	return ipip6_tunnel_create(dev);
+	err = ipip6_tunnel_create(dev);
+	if (err < 0)
+		return err;
+
+#ifdef CONFIG_IPV6_SIT_6RD
+	if (ipip6_netlink_6rd_parms(data, &ip6rd))
+		err = ipip6_tunnel_update_6rd(nt, &ip6rd);
+#endif
+
+	return err;
 }
 
 static int ipip6_changelink(struct net_device *dev, struct nlattr *tb[],
@@ -1283,6 +1344,9 @@ static int ipip6_changelink(struct net_device *dev, struct nlattr *tb[],
 	struct ip_tunnel_parm p;
 	struct net *net = dev_net(dev);
 	struct sit_net *sitn = net_generic(net, sit_net_id);
+#ifdef CONFIG_IPV6_SIT_6RD
+	struct ip_tunnel_6rd ip6rd;
+#endif
 
 	if (dev == sitn->fb_tunnel_dev)
 		return -EINVAL;
@@ -1302,6 +1366,12 @@ static int ipip6_changelink(struct net_device *dev, struct nlattr *tb[],
 		t = netdev_priv(dev);
 
 	ipip6_tunnel_update(t, &p);
+
+#ifdef CONFIG_IPV6_SIT_6RD
+	if (ipip6_netlink_6rd_parms(data, &ip6rd))
+		return ipip6_tunnel_update_6rd(t, &ip6rd);
+#endif
+
 	return 0;
 }
 
@@ -1322,6 +1392,16 @@ static size_t ipip6_get_size(const struct net_device *dev)
 		nla_total_size(1) +
 		/* IFLA_IPTUN_FLAGS */
 		nla_total_size(2) +
+#ifdef CONFIG_IPV6_SIT_6RD
+		/* IFLA_IPTUN_6RD_PREFIX */
+		nla_total_size(sizeof(struct in6_addr)) +
+		/* IFLA_IPTUN_6RD_RELAY_PREFIX */
+		nla_total_size(4) +
+		/* IFLA_IPTUN_6RD_PREFIXLEN */
+		nla_total_size(2) +
+		/* IFLA_IPTUN_6RD_RELAY_PREFIXLEN */
+		nla_total_size(2) +
+#endif
 		0;
 }
 
@@ -1339,6 +1419,19 @@ static int ipip6_fill_info(struct sk_buff *skb, const struct net_device *dev)
 		       !!(parm->iph.frag_off & htons(IP_DF))) ||
 	    nla_put_be16(skb, IFLA_IPTUN_FLAGS, parm->i_flags))
 		goto nla_put_failure;
+
+#ifdef CONFIG_IPV6_SIT_6RD
+	if (nla_put(skb, IFLA_IPTUN_6RD_PREFIX, sizeof(struct in6_addr),
+		    &tunnel->ip6rd.prefix) ||
+	    nla_put_be32(skb, IFLA_IPTUN_6RD_RELAY_PREFIX,
+			 tunnel->ip6rd.relay_prefix) ||
+	    nla_put_u16(skb, IFLA_IPTUN_6RD_PREFIXLEN,
+			tunnel->ip6rd.prefixlen) ||
+	    nla_put_u16(skb, IFLA_IPTUN_6RD_RELAY_PREFIXLEN,
+			tunnel->ip6rd.relay_prefixlen))
+		goto nla_put_failure;
+#endif
+
 	return 0;
 
 nla_put_failure:
@@ -1353,6 +1446,12 @@ static const struct nla_policy ipip6_policy[IFLA_IPTUN_MAX + 1] = {
 	[IFLA_IPTUN_TOS]		= { .type = NLA_U8 },
 	[IFLA_IPTUN_PMTUDISC]		= { .type = NLA_U8 },
 	[IFLA_IPTUN_FLAGS]		= { .type = NLA_U16 },
+#ifdef CONFIG_IPV6_SIT_6RD
+	[IFLA_IPTUN_6RD_PREFIX]		= { .len = sizeof(struct in6_addr) },
+	[IFLA_IPTUN_6RD_RELAY_PREFIX]	= { .type = NLA_U32 },
+	[IFLA_IPTUN_6RD_PREFIXLEN]	= { .type = NLA_U16 },
+	[IFLA_IPTUN_6RD_RELAY_PREFIXLEN] = { .type = NLA_U16 },
+#endif
 };
 
 static struct rtnl_link_ops sit_link_ops __read_mostly = {
-- 
1.7.12

^ permalink raw reply related

* [PATCH] openvswitch: Make IPv6 packet parsing dependent on IPv6 config
From: Vlad Yasevich @ 2012-11-16 15:43 UTC (permalink / raw)
  To: netdev
In-Reply-To: <50a5c2e5.lgIvZwesNLp78CVD%fengguang.wu@intel.com>

Openvswitch attempts to use IPv6 packet parsing functions without
any dependency on IPv6 (unlike every other place in kernel).  Pull
the IPv6 code in openvswitch togeter and put a conditional that's
dependent on CONFIG_IPV6.

Resolves:
net/built-in.o: In function `ovs_flow_extract':
(.text+0xbf5d5): undefined reference to `ipv6_skip_exthdr'

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
---
 net/openvswitch/flow.c |  168 ++++++++++++++++++++++++-----------------------
 1 files changed, 86 insertions(+), 82 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 98c7063..6dfaf60 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -124,6 +124,7 @@ u64 ovs_flow_used_time(unsigned long flow_jiffies)
 	(offsetof(struct sw_flow_key, field) +	\
 	 FIELD_SIZEOF(struct sw_flow_key, field))
 
+#if IS_ENABLED(CONFIG_IPV6)
 static int parse_ipv6hdr(struct sk_buff *skb, struct sw_flow_key *key,
 			 int *key_lenp)
 {
@@ -175,6 +176,89 @@ static bool icmp6hdr_ok(struct sk_buff *skb)
 				  sizeof(struct icmp6hdr));
 }
 
+static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key,
+			int *key_lenp, int nh_len)
+{
+	struct icmp6hdr *icmp = icmp6_hdr(skb);
+	int error = 0;
+	int key_len;
+
+	/* The ICMPv6 type and code fields use the 16-bit transport port
+	 * fields, so we need to store them in 16-bit network byte order.
+	 */
+	key->ipv6.tp.src = htons(icmp->icmp6_type);
+	key->ipv6.tp.dst = htons(icmp->icmp6_code);
+	key_len = SW_FLOW_KEY_OFFSET(ipv6.tp);
+
+	if (icmp->icmp6_code == 0 &&
+	    (icmp->icmp6_type == NDISC_NEIGHBOUR_SOLICITATION ||
+	     icmp->icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT)) {
+		int icmp_len = skb->len - skb_transport_offset(skb);
+		struct nd_msg *nd;
+		int offset;
+
+		key_len = SW_FLOW_KEY_OFFSET(ipv6.nd);
+
+		/* In order to process neighbor discovery options, we need the
+		 * entire packet.
+		 */
+		if (unlikely(icmp_len < sizeof(*nd)))
+			goto out;
+		if (unlikely(skb_linearize(skb))) {
+			error = -ENOMEM;
+			goto out;
+		}
+
+		nd = (struct nd_msg *)skb_transport_header(skb);
+		key->ipv6.nd.target = nd->target;
+		key_len = SW_FLOW_KEY_OFFSET(ipv6.nd);
+
+		icmp_len -= sizeof(*nd);
+		offset = 0;
+		while (icmp_len >= 8) {
+			struct nd_opt_hdr *nd_opt =
+				 (struct nd_opt_hdr *)(nd->opt + offset);
+			int opt_len = nd_opt->nd_opt_len * 8;
+
+			if (unlikely(!opt_len || opt_len > icmp_len))
+				goto invalid;
+
+			/* Store the link layer address if the appropriate
+			 * option is provided.  It is considered an error if
+			 * the same link layer option is specified twice.
+			 */
+			if (nd_opt->nd_opt_type == ND_OPT_SOURCE_LL_ADDR
+			    && opt_len == 8) {
+				if (unlikely(!is_zero_ether_addr(key->ipv6.nd.sll)))
+					goto invalid;
+				memcpy(key->ipv6.nd.sll,
+				    &nd->opt[offset+sizeof(*nd_opt)], ETH_ALEN);
+			} else if (nd_opt->nd_opt_type == ND_OPT_TARGET_LL_ADDR
+				   && opt_len == 8) {
+				if (unlikely(!is_zero_ether_addr(key->ipv6.nd.tll)))
+					goto invalid;
+				memcpy(key->ipv6.nd.tll,
+				    &nd->opt[offset+sizeof(*nd_opt)], ETH_ALEN);
+			}
+
+			icmp_len -= opt_len;
+			offset += opt_len;
+		}
+	}
+
+	goto out;
+
+invalid:
+	memset(&key->ipv6.nd.target, 0, sizeof(key->ipv6.nd.target));
+	memset(key->ipv6.nd.sll, 0, sizeof(key->ipv6.nd.sll));
+	memset(key->ipv6.nd.tll, 0, sizeof(key->ipv6.nd.tll));
+
+out:
+	*key_lenp = key_len;
+	return error;
+}
+#endif
+
 #define TCP_FLAGS_OFFSET 13
 #define TCP_FLAG_MASK 0x3f
 
@@ -487,88 +571,6 @@ static __be16 parse_ethertype(struct sk_buff *skb)
 	return llc->ethertype;
 }
 
-static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key,
-			int *key_lenp, int nh_len)
-{
-	struct icmp6hdr *icmp = icmp6_hdr(skb);
-	int error = 0;
-	int key_len;
-
-	/* The ICMPv6 type and code fields use the 16-bit transport port
-	 * fields, so we need to store them in 16-bit network byte order.
-	 */
-	key->ipv6.tp.src = htons(icmp->icmp6_type);
-	key->ipv6.tp.dst = htons(icmp->icmp6_code);
-	key_len = SW_FLOW_KEY_OFFSET(ipv6.tp);
-
-	if (icmp->icmp6_code == 0 &&
-	    (icmp->icmp6_type == NDISC_NEIGHBOUR_SOLICITATION ||
-	     icmp->icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT)) {
-		int icmp_len = skb->len - skb_transport_offset(skb);
-		struct nd_msg *nd;
-		int offset;
-
-		key_len = SW_FLOW_KEY_OFFSET(ipv6.nd);
-
-		/* In order to process neighbor discovery options, we need the
-		 * entire packet.
-		 */
-		if (unlikely(icmp_len < sizeof(*nd)))
-			goto out;
-		if (unlikely(skb_linearize(skb))) {
-			error = -ENOMEM;
-			goto out;
-		}
-
-		nd = (struct nd_msg *)skb_transport_header(skb);
-		key->ipv6.nd.target = nd->target;
-		key_len = SW_FLOW_KEY_OFFSET(ipv6.nd);
-
-		icmp_len -= sizeof(*nd);
-		offset = 0;
-		while (icmp_len >= 8) {
-			struct nd_opt_hdr *nd_opt =
-				 (struct nd_opt_hdr *)(nd->opt + offset);
-			int opt_len = nd_opt->nd_opt_len * 8;
-
-			if (unlikely(!opt_len || opt_len > icmp_len))
-				goto invalid;
-
-			/* Store the link layer address if the appropriate
-			 * option is provided.  It is considered an error if
-			 * the same link layer option is specified twice.
-			 */
-			if (nd_opt->nd_opt_type == ND_OPT_SOURCE_LL_ADDR
-			    && opt_len == 8) {
-				if (unlikely(!is_zero_ether_addr(key->ipv6.nd.sll)))
-					goto invalid;
-				memcpy(key->ipv6.nd.sll,
-				    &nd->opt[offset+sizeof(*nd_opt)], ETH_ALEN);
-			} else if (nd_opt->nd_opt_type == ND_OPT_TARGET_LL_ADDR
-				   && opt_len == 8) {
-				if (unlikely(!is_zero_ether_addr(key->ipv6.nd.tll)))
-					goto invalid;
-				memcpy(key->ipv6.nd.tll,
-				    &nd->opt[offset+sizeof(*nd_opt)], ETH_ALEN);
-			}
-
-			icmp_len -= opt_len;
-			offset += opt_len;
-		}
-	}
-
-	goto out;
-
-invalid:
-	memset(&key->ipv6.nd.target, 0, sizeof(key->ipv6.nd.target));
-	memset(key->ipv6.nd.sll, 0, sizeof(key->ipv6.nd.sll));
-	memset(key->ipv6.nd.tll, 0, sizeof(key->ipv6.nd.tll));
-
-out:
-	*key_lenp = key_len;
-	return error;
-}
-
 /**
  * ovs_flow_extract - extracts a flow key from an Ethernet frame.
  * @skb: sk_buff that contains the frame, with skb->data pointing to the
@@ -712,6 +714,7 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key,
 				key_len = SW_FLOW_KEY_OFFSET(ipv4.arp);
 			}
 		}
+#if IS_ENABLED(CONFIG_IPV6)
 	} else if (key->eth.type == htons(ETH_P_IPV6)) {
 		int nh_len;             /* IPv6 Header + Extensions */
 
@@ -752,6 +755,7 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key,
 					goto out;
 			}
 		}
+#endif
 	}
 
 out:
-- 
1.7.7.6

^ permalink raw reply related

* Re: [Xen-devel] [PATCH 0/4] Implement persistent grant in xen-netfront/netback
From: Konrad Rzeszutek Wilk @ 2012-11-16 15:34 UTC (permalink / raw)
  To: Wei Liu
  Cc: ANNIE LI, Pasi Kärkkäinen, netdev,
	xen-devel@lists.xensource.com, Ian Campbell
In-Reply-To: <CAOsiSVU+9fkGQhVhnrx=xxUD8hej55XXJGpKGWAxe1USPEhiEQ@mail.gmail.com>

On Thu, Nov 15, 2012 at 05:35:13PM +0800, Wei Liu wrote:
> On Thu, Nov 15, 2012 at 4:38 PM, ANNIE LI <annie.li@oracle.com> wrote:
> 
> >
> >
> > On 2012-11-15 15:40, Pasi Kärkkäinen wrote:
> >
> >> Hello,
> >>
> >> On Thu, Nov 15, 2012 at 03:03:07PM +0800, Annie Li wrote:
> >>
> >>> This patch implements persistent grants for xen-netfront/netback. This
> >>> mechanism maintains page pools in netback/netfront, these page pools is
> >>> used to
> >>> save grant pages which are mapped. This way improve performance which is
> >>> wasted
> >>> when doing grant operations.
> >>>
> >>> Current netback/netfront does map/unmap grant operations frequently when
> >>> transmitting/receiving packets, and grant operations costs much cpu
> >>> clock. In
> >>> this patch, netfront/netback maps grant pages when needed and then saves
> >>> them
> >>> into a page pool for future use. All these pages will be unmapped when
> >>> removing/releasing the net device.
> >>>
> >>>  Do you have performance numbers available already? with/without
> >> persistent grants?
> >>
> > I have some simple netperf/netserver test result with/without persistent
> > grants,
> >
> > Following is result of with persistent grant patch,
> >
> > Guests, Sum,      Avg,     Min,     Max
> >  1,  15106.4,  15106.4, 15106.36, 15106.36
> >  2,  13052.7,  6526.34,  6261.81,  6790.86
> >  3,  12675.1,  6337.53,  6220.24,  6454.83
> >  4,  13194,  6596.98,  6274.70,  6919.25
> >
> >
> > Following are result of without persistent patch
> >
> > Guests, Sum,     Avg,    Min,        Max
> >  1,  10864.1,  10864.1, 10864.10, 10864.10
> >  2,  10898.5,  5449.24,  4862.08,  6036.40
> >  3,  10734.5,  5367.26,  5261.43,  5473.08
> >  4,  10924,    5461.99,  5314.84,  5609.14
> >
> >
> >
> Interesting results. Have you tested how good it is on a 10G nic, i.e.
> guest sending packets
> through physical network to another host.

Not yet. This was done with two guests pounding each other. I am
setting two machines up for Annie so she can do that type of testing
and also with more guests.

^ permalink raw reply

* Re: [PATCH] tcp: handle tcp_net_metrics_init() order-5 memory allocation failures
From: Eric Dumazet @ 2012-11-16 15:31 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, jln
In-Reply-To: <20121116.013940.813652515905883288.davem@davemloft.net>

On Fri, 2012-11-16 at 01:39 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 15 Nov 2012 15:41:04 -0800
> 
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > order-5 allocations can fail with current kernels, we should
> > try to reduce allocation sizes to allow network namespace
> > creation.
> > 
> > Reported-by: Julien Tinnes <jln@google.com>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> 
> Indeed, this has to be done better.
> 
> But this kind of retry solution results in non-deterministic behavior.
> Yes the tcp metrics cache is best effort, but it's size can influence
> behavior in a substantial way depending upon the workload.
> 
> I would suggest that we instead use different limits, ones which the
> page allocator will satisfy for us always with GFP_KERNEL.
> 
> 1) include linux/mmzone.h
> 
> 2) Make the two limits based upon PAGE_ALLOC_COSTLY_ORDER.
> 
> That is, make the larger table size PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER
> and the smaller one PAGE_SIZE << (PAGE_ALLOC_COSTLY_ORDER - 1).

Well, we dont really know what the size needs to be, and your proposal
reduces the size by a 4 factor, even for the initial namespace.

Julien report was about Chrome browser own netns, on a suspend/resume
cycle (or something like that)

If size can influence behavior, we could try a vmalloc() if kmalloc()
fails...

Thanks

[PATCH v3] tcp: handle tcp_net_metrics_init() order-5 memory allocation failures

order-5 allocations can fail with current kernels, we should
try vmalloc() as well.

Reported-by: Julien Tinnes <jln@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_metrics.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 53bc584..f696d7c 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -1,7 +1,6 @@
 #include <linux/rcupdate.h>
 #include <linux/spinlock.h>
 #include <linux/jiffies.h>
-#include <linux/bootmem.h>
 #include <linux/module.h>
 #include <linux/cache.h>
 #include <linux/slab.h>
@@ -9,6 +8,7 @@
 #include <linux/tcp.h>
 #include <linux/hash.h>
 #include <linux/tcp_metrics.h>
+#include <linux/vmalloc.h>
 
 #include <net/inet_connection_sock.h>
 #include <net/net_namespace.h>
@@ -1034,7 +1034,10 @@ static int __net_init tcp_net_metrics_init(struct net *net)
 	net->ipv4.tcp_metrics_hash_log = order_base_2(slots);
 	size = sizeof(struct tcpm_hash_bucket) << net->ipv4.tcp_metrics_hash_log;
 
-	net->ipv4.tcp_metrics_hash = kzalloc(size, GFP_KERNEL);
+	net->ipv4.tcp_metrics_hash = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+	if (!net->ipv4.tcp_metrics_hash)
+		net->ipv4.tcp_metrics_hash = vzalloc(size);
+
 	if (!net->ipv4.tcp_metrics_hash)
 		return -ENOMEM;
 
@@ -1055,7 +1058,10 @@ static void __net_exit tcp_net_metrics_exit(struct net *net)
 			tm = next;
 		}
 	}
-	kfree(net->ipv4.tcp_metrics_hash);
+	if (is_vmalloc_addr(net->ipv4.tcp_metrics_hash))
+		vfree(net->ipv4.tcp_metrics_hash);
+	else
+		kfree(net->ipv4.tcp_metrics_hash);
 }
 
 static __net_initdata struct pernet_operations tcp_net_metrics_ops = {

^ permalink raw reply related

* Re: [Xen-devel] [PATCH 0/4] Implement persistent grant in xen-netfront/netback
From: Konrad Rzeszutek Wilk @ 2012-11-16 15:23 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Roger Pau Monne, ANNIE LI, Pasi Kärkkäinen,
	netdev@vger.kernel.org, xen-devel@lists.xensource.com
In-Reply-To: <1353006667.26243.6.camel@dagon.hellion.org.uk>

On Thu, Nov 15, 2012 at 07:11:07PM +0000, Ian Campbell wrote:
> On Thu, 2012-11-15 at 18:29 +0000, Konrad Rzeszutek Wilk wrote:
> > On Thu, Nov 15, 2012 at 11:15:06AM +0000, Ian Campbell wrote:
> > > On Thu, 2012-11-15 at 10:56 +0000, Roger Pau Monne wrote:
> > > > On 15/11/12 09:38, ANNIE LI wrote:
> > > > > 
> > > > > 
> > > > > On 2012-11-15 15:40, Pasi Kärkkäinen wrote:
> > > > >> Hello,
> > > > >>
> > > > >> On Thu, Nov 15, 2012 at 03:03:07PM +0800, Annie Li wrote:
> > > > >>> This patch implements persistent grants for xen-netfront/netback. This
> > > > >>> mechanism maintains page pools in netback/netfront, these page pools is used to
> > > > >>> save grant pages which are mapped. This way improve performance which is wasted
> > > > >>> when doing grant operations.
> > > > >>>
> > > > >>> Current netback/netfront does map/unmap grant operations frequently when
> > > > >>> transmitting/receiving packets, and grant operations costs much cpu clock. In
> > > > >>> this patch, netfront/netback maps grant pages when needed and then saves them
> > > > >>> into a page pool for future use. All these pages will be unmapped when
> > > > >>> removing/releasing the net device.
> > > > >>>
> > > > >> Do you have performance numbers available already? with/without persistent grants?
> > > > > I have some simple netperf/netserver test result with/without persistent 
> > > > > grants,
> > > > > 
> > > > > Following is result of with persistent grant patch,
> > > > > 
> > > > > Guests, Sum,      Avg,     Min,     Max
> > > > >   1,  15106.4,  15106.4, 15106.36, 15106.36
> > > > >   2,  13052.7,  6526.34,  6261.81,  6790.86
> > > > >   3,  12675.1,  6337.53,  6220.24,  6454.83
> > > > >   4,  13194,  6596.98,  6274.70,  6919.25
> > > > > 
> > > > > 
> > > > > Following are result of without persistent patch
> > > > > 
> > > > > Guests, Sum,     Avg,    Min,        Max
> > > > >   1,  10864.1,  10864.1, 10864.10, 10864.10
> > > > >   2,  10898.5,  5449.24,  4862.08,  6036.40
> > > > >   3,  10734.5,  5367.26,  5261.43,  5473.08
> > > > >   4,  10924,    5461.99,  5314.84,  5609.14
> > > > 
> > > > In the block case, performance improvement is seen when using a large
> > > > number of guests, could you perform the same benchmark increasing the
> > > > number of guests to 15?
> > > 
> > > It would also be nice to see some analysis of the numbers which justify
> > > why this change is a good one without every reviewer having to evaluate
> > > the raw data themselves. In fact this should really be part of the
> > > commit message.
> > 
> > You mean like a nice graph, eh?
> 
> Together with an analysis of what it means and why it is a good thing,
> yes.

OK, lets put that on the TODO list for the next posting. In the meantime -
it sounds like you (the maintainer) are happy with the direction this is going.

The other things we want to do _after_ these patches is to look at the Wei
Liu patches and try to address the different reviewers comments.

The neat thing about them is that they have a concept of a page pool system.
And with persistent pages in both blkback and netback this gets more exciting.


> 
> Ian.
> 
> > 
> > I will run these patches on my 32GB box and see if I can give you
> > a nice PDF/jpg.
> > 
> > > 
> > > Ian.
> > > 
> 

^ permalink raw reply

* Re: [Xen-devel] [PATCH 0/4] Implement persistent grant in xen-netfront/netback
From: Konrad Rzeszutek Wilk @ 2012-11-16 15:21 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: ANNIE LI, Pasi Kärkkäinen, netdev@vger.kernel.org,
	xen-devel@lists.xensource.com, Ian Campbell
In-Reply-To: <50A4CA51.8080208@citrix.com>

On Thu, Nov 15, 2012 at 11:56:17AM +0100, Roger Pau Monné wrote:
> On 15/11/12 09:38, ANNIE LI wrote:
> > 
> > 
> > On 2012-11-15 15:40, Pasi Kärkkäinen wrote:
> >> Hello,
> >>
> >> On Thu, Nov 15, 2012 at 03:03:07PM +0800, Annie Li wrote:
> >>> This patch implements persistent grants for xen-netfront/netback. This
> >>> mechanism maintains page pools in netback/netfront, these page pools is used to
> >>> save grant pages which are mapped. This way improve performance which is wasted
> >>> when doing grant operations.
> >>>
> >>> Current netback/netfront does map/unmap grant operations frequently when
> >>> transmitting/receiving packets, and grant operations costs much cpu clock. In
> >>> this patch, netfront/netback maps grant pages when needed and then saves them
> >>> into a page pool for future use. All these pages will be unmapped when
> >>> removing/releasing the net device.
> >>>
> >> Do you have performance numbers available already? with/without persistent grants?
> > I have some simple netperf/netserver test result with/without persistent 
> > grants,
> > 
> > Following is result of with persistent grant patch,
> > 
> > Guests, Sum,      Avg,     Min,     Max
> >   1,  15106.4,  15106.4, 15106.36, 15106.36
> >   2,  13052.7,  6526.34,  6261.81,  6790.86
> >   3,  12675.1,  6337.53,  6220.24,  6454.83
> >   4,  13194,  6596.98,  6274.70,  6919.25
> > 
> > 
> > Following are result of without persistent patch
> > 
> > Guests, Sum,     Avg,    Min,        Max
> >   1,  10864.1,  10864.1, 10864.10, 10864.10
> >   2,  10898.5,  5449.24,  4862.08,  6036.40
> >   3,  10734.5,  5367.26,  5261.43,  5473.08
> >   4,  10924,    5461.99,  5314.84,  5609.14
> 
> In the block case, performance improvement is seen when using a large
> number of guests, could you perform the same benchmark increasing the
> number of guests to 15?

Keep in mind that one of the things that is limiting these numbers is
that netback is very CPU intensive. So I think it could get much much
faster - but netback pegs at 100%. With Wei Liu's patches the CPU usage
did drop by 40% (this is when I tested the old netback with Wei's
netback patches)- so we should see even a further speed increase.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox