Netdev List


* Re: [PATCH 2/2] ss: implement -M option to get all memory information
From: Stephen Hemminger @ 2012-05-03 15:25 UTC (permalink / raw)
  To: Shan Wei; +Cc: xemul, NetDev
In-Reply-To: <4FA24458.6020105@gmail.com>

On Thu, 03 May 2012 16:39:52 +0800
Shan Wei <shanwei88@gmail.com> wrote:

> Stephen Hemminger said, at 2012/5/3 3:00:
> 
> > 
> > This looks good, is the skmeminfo a superset of the old meminfo?
> 
> 
> Yes, skmeminfo is a superset of the old meminfo.
> Using it, ss can get more socket memory information.
> 
> > But your code is broken on 64 bit. skmeminfo in kernel is an array of __u32!
> 
> 
> OK. here is a new version.
> 
> ----
> [PATCH] ss: use new INET_DIAG_SKMEMINFO option to get more memory information for tcp socket
> 
> 
> Signed-off-by: Shan Wei <davidshan@tencent.com>
> ---
>  misc/ss.c |   16 ++++++++++++++--
>  1 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/misc/ss.c b/misc/ss.c
> index 5f70a26..bd60548 100644
> --- a/misc/ss.c
> +++ b/misc/ss.c
> @@ -1336,7 +1336,17 @@ static void tcp_show_info(const struct nlmsghdr *nlh, struct inet_diag_msg *r)
>  	parse_rtattr(tb, INET_DIAG_MAX, (struct rtattr*)(r+1),
>  		     nlh->nlmsg_len - NLMSG_LENGTH(sizeof(*r)));
>  
> -	if (tb[INET_DIAG_MEMINFO]) {
> +	if (tb[INET_DIAG_SKMEMINFO]) {
> +		const __u32 *skmeminfo =  RTA_DATA(tb[INET_DIAG_SKMEMINFO]);
> +		printf(" skmem:(r%u,rb%u,t%u,tb%u,f%u,w%u,o%u)",
> +			skmeminfo[SK_MEMINFO_RMEM_ALLOC],
> +			skmeminfo[SK_MEMINFO_RCVBUF],
> +			skmeminfo[SK_MEMINFO_WMEM_ALLOC],
> +			skmeminfo[SK_MEMINFO_SNDBUF],
> +			skmeminfo[SK_MEMINFO_FWD_ALLOC],
> +			skmeminfo[SK_MEMINFO_WMEM_QUEUED],
> +			skmeminfo[SK_MEMINFO_OPTMEM]);
> +	}else if (tb[INET_DIAG_MEMINFO]) {
>  		const struct inet_diag_meminfo *minfo
>  			= RTA_DATA(tb[INET_DIAG_MEMINFO]);
>  		printf(" mem:(r%u,w%u,f%u,t%u)",
> @@ -1505,8 +1515,10 @@ static int tcp_show_netlink(struct filter *f, FILE *dump_fp, int socktype)
>  	memset(&req.r, 0, sizeof(req.r));
>  	req.r.idiag_family = AF_INET;
>  	req.r.idiag_states = f->states;
> -	if (show_mem)
> +	if (show_mem) {
>  		req.r.idiag_ext |= (1<<(INET_DIAG_MEMINFO-1));
> +		req.r.idiag_ext |= (1<<(INET_DIAG_SKMEMINFO-1));
> +	}
>  
>  	if (show_tcpinfo) {
>  		req.r.idiag_ext |= (1<<(INET_DIAG_INFO-1));

This looks good, I will apply it
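
To illustrate the 64-bit concern raised above, here is a minimal userspace sketch, not code from ss itself. It assumes the attribute payload is a packed array of 32-bit values, as stated in the thread. The `demo_*` names and the local index enum are hypothetical stand-ins for the kernel's `SK_MEMINFO_*` constants, and `demo_ext_bit()` only mirrors the `1<<(attr-1)` expression used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical local indices, in the order the patch prints them.
 * Real code would use the SK_MEMINFO_* constants from the kernel headers. */
enum demo_skmem {
	DEMO_RMEM_ALLOC,
	DEMO_RCVBUF,
	DEMO_WMEM_ALLOC,
	DEMO_SNDBUF,
	DEMO_FWD_ALLOC,
	DEMO_WMEM_QUEUED,
	DEMO_OPTMEM,
	DEMO_VARS		/* number of entries */
};

/* Read entry idx from a payload of packed 32-bit values. The stride is
 * always 4 bytes, on 32-bit and 64-bit hosts alike, which is not true of
 * an array of unsigned long. memcpy avoids alignment assumptions. */
static uint32_t demo_skmem_get(const void *payload, unsigned int idx)
{
	uint32_t v;

	assert(payload != NULL && idx < DEMO_VARS);
	memcpy(&v, (const unsigned char *)payload + idx * sizeof(v), sizeof(v));
	return v;
}

/* Format the payload in the same layout as the printf in the patch. */
static int demo_skmem_format(const void *payload, char *buf, size_t len)
{
	return snprintf(buf, len, "skmem:(r%u,rb%u,t%u,tb%u,f%u,w%u,o%u)",
			(unsigned int)demo_skmem_get(payload, DEMO_RMEM_ALLOC),
			(unsigned int)demo_skmem_get(payload, DEMO_RCVBUF),
			(unsigned int)demo_skmem_get(payload, DEMO_WMEM_ALLOC),
			(unsigned int)demo_skmem_get(payload, DEMO_SNDBUF),
			(unsigned int)demo_skmem_get(payload, DEMO_FWD_ALLOC),
			(unsigned int)demo_skmem_get(payload, DEMO_WMEM_QUEUED),
			(unsigned int)demo_skmem_get(payload, DEMO_OPTMEM));
}

/* Extension request bit for attribute number attr (attr >= 1),
 * mirroring the 1<<(attr-1) expression in the patch. */
static unsigned int demo_ext_bit(unsigned int attr)
{
	assert(attr >= 1);
	return 1u << (attr - 1);
}
```

A caller would pass the attribute payload pointer to `demo_skmem_format()` together with a buffer of its own; the fixed-width type is what keeps the field offsets identical across architectures.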


* Re: sky2 still badly broken
From: Stephen Hemminger @ 2012-05-03 15:23 UTC (permalink / raw)
  To: Niccolò Belli; +Cc: netdev
In-Reply-To: <4FA2527A.6020808@linuxsystems.it>

On Thu, 03 May 2012 11:40:10 +0200
Niccolò Belli <darkbasic@linuxsystems.it> wrote:

> On 02/05/2012 20:56, Stephen Hemminger wrote:
> > It could be that your switch doesn't do autonegotiation or flow
> > control. You are getting receive fifo overflow errors.
> 
> I don't have this problem with other NICs. Also transfer rate is very 
> low (even 2 MB/s sometimes) while I get ~110MB/s with other NICs (and 
> the same switch of course).
> 
> Niccolò

The receiver on some versions of the chip can't keep up with the full speed
of 1 Gbit/sec. The receive FIFO has hardware issues, and since I don't
work for Marvell, working around the problem is guesswork. Without exact
information, all that can be done is to have a timeout and brute-force reset
logic. The vendor driver sk98lin has the same brute-force logic, but may
just not print the message.


* Re: [PATCH v3 2/2] macvtap: restore vlan header on user read
From: Basil Gor @ 2012-05-03 15:22 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Eric W. Biederman, David S. Miller, netdev
In-Reply-To: <20120503143108.GA20969@redhat.com>

On Thu, May 03, 2012 at 05:31:10PM +0300, Michael S. Tsirkin wrote:
> On Thu, May 03, 2012 at 06:37:46AM -0700, Eric W. Biederman wrote:
> > "Michael S. Tsirkin" <mst@redhat.com> writes:
> > 
> > > On Wed, Apr 25, 2012 at 10:31:25PM -0700, Eric W. Biederman wrote:
> > >> Basil Gor <basil.gor@gmail.com> writes:
> > >> 
> > >> > Vlan tag is restored during buffer transmit to a network device (bridge
> > >> > port) in bridging code in case of tun/tap driver. In case of macvtap it
> > >> > has to be done explicitly. Otherwise vlan_tci is ignored and user always
> > >> > gets untagged packets.
> > >> 
> > >> We could quibble about efficiencies but this looks good except for
> > >> macvtap_recvmsg which isn't setting the auxdata for the vlan header.
> > >> 
> > >> Eric
> > >
> > > Right. I'm guessing we need to support old userspace
> > > so if there's auxdata, put vlan there but if not,
> > > put the vlan in the packet like this patch does.
> > 
> > This patch isn't horrible.
> > 
> > Still why copy the skb when you can just split the copy to userspace
> > into a couple of pieces?
> > 
> > We don't need to change the skb and changing the skb looks like
> > it is likely to confuse things and cause bugs because we are
> > not working with a consistent model of how vlan information
> > is encoded.
> > 
> > Still something needs to happen and this works in more cases even if it
> > isn't perfect.
> > 
> > Eric
> 
> Absolutely. And it's easier than I thought.
> So we can do something like the below (warning: compiled only).
> Basil - want to take a look?

Sure, I'll give it a try.
Thanks

Basil Gor

> My only concern: if we put this logic in an out-of-the-way
> driver like macvtap, will people remember to update it?
> Maybe it is better to update skb_copy_datagram_const_iovec, which is in core?
> 
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index 0427c65..5a1724c 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -1,5 +1,6 @@
>  #include <linux/etherdevice.h>
>  #include <linux/if_macvlan.h>
> +#include <linux/if_vlan.h>
>  #include <linux/interrupt.h>
>  #include <linux/nsproxy.h>
>  #include <linux/compat.h>
> @@ -759,6 +760,7 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
>  	struct macvlan_dev *vlan;
>  	int ret;
>  	int vnet_hdr_len = 0;
> +	int vlan_offset = 0;
>  
>  	if (q->flags & IFF_VNET_HDR) {
>  		struct virtio_net_hdr vnet_hdr;
> @@ -776,8 +778,29 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
>  
>  	len = min_t(int, skb->len, len);
>  
> -	ret = skb_copy_datagram_const_iovec(skb, 0, iv, vnet_hdr_len, len);
> +	if (vlan_tx_tag_present(skb)) {
> +		struct {
> +			__be16 h_vlan_proto;
> +			__be16 h_vlan_TCI;
> +		} veth;
> + 		veth.h_vlan_proto = htons(ETH_P_8021Q);
> + 		veth.h_vlan_TCI = vlan_tx_tag_get(skb);
> +
> +		vlan_offset = offsetof(struct vlan_ethhdr, h_vlan_proto);
> +		ret = skb_copy_datagram_const_iovec(skb, 0, iv, vnet_hdr_len,
> +						    vlan_offset);
> +		if (ret)
> +			goto done;
> +		ret = memcpy_toiovecend(iv, (unsigned char *)&veth, vlan_offset,
> +					sizeof veth);
> +		if (ret)
> +			goto done;
> +		vlan_offset += sizeof veth;
> +	}
> +	ret = skb_copy_datagram_const_iovec(skb, vlan_offset, iv, vnet_hdr_len,
> +					    len);
>  
> +done:
>  	rcu_read_lock_bh();
>  	vlan = rcu_dereference_bh(q->vlan);
>  	if (vlan)
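
The idea discussed in this thread, leaving the packet buffer untouched and splitting the copy to userspace into pieces so that the VLAN header lands between the MAC addresses and the EtherType, can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the kernel code: `demo_copy_with_vlan()` and the `DEMO_*` constants are hypothetical names, and the sketch writes the TCI in network byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ETH_ALEN	6	/* bytes in one MAC address */
#define DEMO_VLAN_HLEN	4	/* TPID (2 bytes) + TCI (2 bytes) */
#define DEMO_TPID_8021Q	0x8100	/* 802.1Q tag protocol identifier */

/* Copy an untagged Ethernet frame into dst in three pieces, inserting an
 * 802.1Q tag after the two MAC addresses. The source frame is not modified.
 * Returns the number of bytes written, or 0 if the frame is too short or
 * dst is too small. */
static size_t demo_copy_with_vlan(uint8_t *dst, size_t dst_len,
				  const uint8_t *frame, size_t frame_len,
				  uint16_t tci)
{
	const size_t mac_len = 2 * DEMO_ETH_ALEN;	/* offset of the EtherType */
	uint8_t tag[DEMO_VLAN_HLEN];

	assert(dst != NULL && frame != NULL);
	if (frame_len < mac_len + 2 || dst_len < frame_len + DEMO_VLAN_HLEN)
		return 0;

	/* Build the tag in network byte order. */
	tag[0] = (uint8_t)(DEMO_TPID_8021Q >> 8);
	tag[1] = (uint8_t)(DEMO_TPID_8021Q & 0xff);
	tag[2] = (uint8_t)(tci >> 8);
	tag[3] = (uint8_t)(tci & 0xff);

	/* Piece 1: destination and source MAC addresses. */
	memcpy(dst, frame, mac_len);
	/* Piece 2: the VLAN tag. */
	memcpy(dst + mac_len, tag, sizeof(tag));
	/* Piece 3: original EtherType and payload, shifted by the tag size. */
	memcpy(dst + mac_len + sizeof(tag), frame + mac_len,
	       frame_len - mac_len);

	return frame_len + DEMO_VLAN_HLEN;
}
```

In this sketch the third copy has to account for the tag in its destination offset and for the already-copied MAC addresses in its length; both are easy to get wrong when a single copy is split up.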


* pull request: wireless-next 2012-05-03
From: John W. Linville @ 2012-05-03 15:22 UTC (permalink / raw)
  To: davem-fT/PcQaiUtIeIZ0/mPfg9Q
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA


commit aeace1b7293095fd45240646343251b1da8713da

Dave,

This is a batch of updates intended for 3.5.  It also includes a pull
from the wireless tree which resolved some build dependencies.

Highlights of this pull request include some refactoring in the
bluetooth directories, some HT enhancements for mac80211, an expansion
of the ethtool support for cfg80211- and mac80211-based drivers,
and some more iwlwifi refactoring.

It looks like some of the bluetooth device ID patches got committed
on both the bluetooth and the bluetooth-next trees.  I'll ask them to
be more careful about that, but I didn't think it was worth asking
for rebases since that would be disruptive to the downstream trees
and since git handles the situation reasonably well already.

Please let me know if there are problems!

Thanks,

John

---

The following changes since commit af94bf6db1d58d26f1cdab145b6312ad363254a6:

  ixgbe: Fix use after free on module remove (2012-05-03 04:21:34 -0400)

are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next.git for-davem

AceLan Kao (5):
      Bluetooth: Add support for Atheros [04ca:3005]
      Bluetooth: Add support for Atheros [13d3:3362]
      Bluetooth: Add support for Atheros [13d3:3362]
      Bluetooth: Add support for AR3012 [0cf3:e004]
      Bluetooth: Add support for AR3012 [0cf3:e004]

Amitkumar Karwar (1):
      mwifiex: fix static checker warnings

Andre Guedes (10):
      Bluetooth: Check FINDING state in interleaved discovery
      Bluetooth: Add hci_cancel_le_scan() to hci_core
      Bluetooth: LE support for MGMT stop discovery
      Bluetooth: Replace EPERM by EALREADY in hci_cancel_inquiry
      Bluetooth: Refactor stop_discovery
      Bluetooth: Add Periodic Inquiry command complete handler
      Bluetooth: Add HCI_PERIODIC_INQ to dev_flags
      Bluetooth: Check HCI_PERIODIC_INQ in start_discovery
      Bluetooth: Ignore inquiry results from periodic inquiry
      Bluetooth: Remove MGMT_ADDR_INVALID macro

Andrei Emeltchenko (27):
      Bluetooth: Correct type for hdev lmp_subver
      Bluetooth: trivial: Correct endian conversion
      Bluetooth: Correct type for ediv to __le16
      Bluetooth: Fix extra conversion to __le32
      Bluetooth: Correct chan->psm endian conversions
      Bluetooth: Correct ediv in SMP
      Bluetooth: Correct length calc in L2CAP conf rsp
      Bluetooth: Correct CID endian notation
      Bluetooth: Convert error codes to le16
      Bluetooth: trivial: Fix endian conversion mode
      Bluetooth: mgmt: Add missing endian conversion
      Bluetooth: trivial: Correct types
      Bluetooth: Fix type in cpu_to_le conversion
      Bluetooth: Fix opcode access in hci_complete
      Bluetooth: trivial: Remove sparse warnings
      Bluetooth: Silence sparse warning
      Bluetooth: mgmt: Fix timeout type
      Bluetooth: Remove unneeded timer clear
      Bluetooth: Fix memory leaks due to chan refcnt
      Bluetooth: Make L2CAP chan_add functions static
      Bluetooth: Comments and style fixes
      Bluetooth: Remove unneeded zero initialization
      Bluetooth: Add Read Local AMP Info to init
      Bluetooth: Adds set_default function in L2CAP setup
      Bluetooth: trivial: Remove empty line
      Bluetooth: Fix debug printing unallocated name
      cfg80211: Remove compile warnings

Anisse Astier (2):
      rt2x00: debugfs support - allow a register to be empty
      rt2x00: Add debugfs access for rfcsr register

Ashok Nagarajan (4):
      mac80211: Advertise HT protection mode in IEs
      mac80211: Implement HT mixed protection mode
      mac80211: Allow nonHT/HT peering in mesh
      {nl,cfg,mac}80211: Allow user to see/configure HT protection mode

Ben Greear (4):
      cfg80211: Add framework to support ethtool stats.
      mac80211: Support getting sta_info stats via ethtool.
      mac80211: Framework to get wifi-driver stats via ethtool.
      mac80211: Add more ethtools stats: survey, rates, etc

Ben Hutchings (2):
      ipw2200: Fix order of device registration
      ipw2100: Fix order of device registration

Brian Gix (1):
      Bluetooth: mgmt: Fix corruption of device_connected pkt

Cho, Yu-Chen (1):
      Bluetooth: Add Atheros maryann PIDVID support

Dan Carpenter (1):
      wireless: at76c50x: allocating too much data

David Herrmann (5):
      Bluetooth: Remove redundant hdev->parent field
      Bluetooth: vhci: Ignore return code of nonseekable_open()
      Bluetooth: Move hci_alloc/free_dev close to hci_register/unregister_dev
      Bluetooth: Move device initialization to hci_alloc_dev()
      Bluetooth: Remove unneeded initialization in hci_alloc_dev()

Don Zickus (1):
      Bluetooth: btusb: typo in Broadcom SoftSailing id

Eldad Zack (1):
      brcmsmac: "INTERMEDIATE but not AMPDU" only when tracing

Eliad Peller (1):
      mac80211: call ieee80211_mgd_stop() on interface stop

Emmanuel Grumbach (3):
      iwlwifi: use IWL_* instead of dev_printk when possible
      iwlwifi: don't init trans->reg_lock from the op_mode
      cfg80211: fix BSS comparison

Felix Fietkau (1):
      mac80211: fix AP mode EAP tx for VLAN stations

Franky Lin (6):
      brcm80211: fmac: fix SDIO function 0 register r/w issue
      brcm80211: fmac: fix missing completion events issue
      brcmfmac: stop releasing sdio host in irq handler
      brcmfmac: check bus state for status
      brcmfmac: postpone interrupt register function
      brcmfmac: add out of band interrupt support

Gabor Juhos (2):
      ath9k: add an extra boolean parameter to ath9k_hw_apply_txpower
      ath9k: fix tx power settings for AR9287

Grazvydas Ignotas (2):
      wl1251: fix crash on remove due to premature kfree
      wl1251: fix crash on remove due to leftover work item

Gustavo Padovan (6):
      Bluetooth: Remove sk parameter from l2cap_chan_create()
      Bluetooth: Fix userspace compatibility issue with mgmt interface
      Merge git://git.kernel.org/.../bluetooth/bluetooth
      Bluetooth: Remove err parameter from alloc_skb()
      Bluetooth: remove unneeded declaration of sco_conn_del()
      Bluetooth: Fix coding style issues

Hemant Gupta (6):
      Bluetooth: Use correct flags for checking HCI_SSP_ENABLED bit
      Bluetooth: Send correct address type for LTK
      Bluetooth: Fix clearing discovery type when stopping discovery
      Bluetooth: mgmt: Fix missing connect failed event for LE
      Bluetooth: mgmt: Fix address type while loading Long Term Key
      Bluetooth: Don't distribute keys in case of Encryption Failure

Ido Yariv (1):
      Bluetooth: Search global l2cap channels by src/dst addresses

Jesper Juhl (1):
      Bluetooth: btmrvl_sdio: remove pointless conditional before release_firmware()

Johan Hedberg (2):
      Bluetooth: Don't increment twice in eir_has_data_type()
      Bluetooth: Check for minimum data length in eir_has_data_type()

Johan Hovold (2):
      Bluetooth: hci_ldisc: fix NULL-pointer dereference on tty_close
      Bluetooth: hci_core: fix NULL-pointer dereference at unregister

Johannes Berg (1):
      iwlwifi: fix hardware queue programming

John W. Linville (5):
      Merge branch 'for-upstream' of git://git.kernel.org/.../bluetooth/bluetooth
      Merge branch 'for-upstream' of git://git.kernel.org/.../bluetooth/bluetooth-next
      Merge branch 'wireless-next' of git://git.kernel.org/.../iwlwifi/iwlwifi
      Merge branch 'master' of git://git.kernel.org/.../linville/wireless
      Merge branch 'master' of git://git.kernel.org/.../linville/wireless-next into for-davem

Jonathan Bither (1):
      ath5k: add missing iounmap to AHB probe removal

João Paulo Rechi Vita (1):
      Bluetooth: btusb: Add USB device ID "0a5c 21e8"

Larry Finger (1):
      rtlwifi: Fix oops on unload

Luis R. Rodriguez (2):
      Bluetooth: properly use pr_fmt() on lib.c
      libertas: include sched.h on firmware.c

Lukasz Rymanowski (1):
      Bluetooth: Remove not needed status parameter

Manoj Iyer (2):
      Bluetooth: btusb: Add vendor specific ID (0489 e042) for BCM20702A0
      Bluetooth: btusb: Add vendor specific ID (0489 e042) for BCM20702A0

Marcel Holtmann (10):
      Bluetooth: Add TX power tag to EIR data
      Bluetooth: Handle EIR tags for Device ID
      Bluetooth: Add management command for setting Device ID
      Bluetooth: Fix broken usage of put_unaligned_le16
      Bluetooth: Fix broken usage of get_unaligned_le16
      Bluetooth: Update management interface revision
      Bluetooth: Split error handling for L2CAP listen sockets
      Bluetooth: Split error handling for SCO listen sockets
      Bluetooth: Don't check source address in SCO bind function
      Bluetooth: Restrict to one SCO listening socket

Mat Martineau (4):
      Bluetooth: Add definitions and struct members for new ERTM state machine
      Bluetooth: Add a structure to carry ERTM data in skb control blocks
      Bluetooth: Add the l2cap_seq_list structure for tracking frames
      Bluetooth: Functions for handling ERTM control fields

Meenakshi Venkataraman (1):
      iwlwifi: use correct released ucode version

Mikel Astiz (3):
      Bluetooth: Use unsigned int instead of signed int
      Bluetooth: Remove unnecessary check
      Bluetooth: btusb: Dynamic alternate setting

Rajkumar Manoharan (1):
      mac80211: fix rate control update on 2040 bss change

Santosh Nayak (1):
      Bluetooth: Fix Endian Bug.

Seth Forshee (1):
      b43: only reload config after successful initialization

Stanislav Yakovlev (2):
      ipw2200: Fix race condition in the command completion acknowledge
      net/wireless: ipw2200: Fix WARN_ON occurring in wiphy_register called by ipw_pci_probe

Stanislaw Gruszka (2):
      iwlwifi: do not nulify ctx->vif on reset
      iwlwifi: add option to disable 5GHz band

Steven Harms (2):
      Add Foxconn / Hon Hai IDs for btusb module
      Add Foxconn / Hon Hai IDs for btusb module

Syam Sidhardhan (3):
      Bluetooth: remove header declared but not defined
      Bluetooth: Remove strtoba header declared but not defined
      Bluetooth: mgmt: Remove unwanted goto statements

Szymon Janc (4):
      Bluetooth: mgmt: Fix some code style and indentation issues
      Bluetooth: mgmt: Don't allow to set invalid value to DeviceID source
      Bluetooth: Fix missing break in hci_cmd_complete_evt
      Bluetooth: Fix missing break in hci_cmd_complete_evt

Thomas Pedersen (2):
      mac80211: insert mesh peer after init
      mac80211: don't transmit 40MHz frames to 20MHz peer

Ulisses Furquim (1):
      Bluetooth: Fix registering hci with duplicate name

Vinicius Costa Gomes (1):
      Bluetooth: Add support for reusing the same hci_conn for LE links

Vishal Agarwal (4):
      Bluetooth: hci_persistent_key should return bool
      Bluetooth: Temporary keys should be retained during connection
      Bluetooth: hci_persistent_key should return bool
      Bluetooth: Temporary keys should be retained during connection

WarheadsSE (1):
      mwifiex: add support for SD8786 sdio

Wey-Yi Guy (11):
      iwlwifi: remove unused macros
      iwlwifi: add BT reduced tx power flag
      iwlwifi: add checking for the condition to reduce tx power
      iwlwifi: add reduced tx power threshold define
      iwlwifi: small define change
      iwlwifi: send reduce tx power info in command
      iwlwifi: change kill mask based on reduce power state
      iwlwifi: add loose coex lut
      iwlwifi: use 6000G2B for 6030 device series
      iwlwifi: modify #ifdef to avoid sparse complain
      iwlwifi: remove the iwl_shared reference

 drivers/bluetooth/ath3k.c                          |    4 +
 drivers/bluetooth/btmrvl_sdio.c                    |    9 +-
 drivers/bluetooth/btusb.c                          |   19 +-
 drivers/bluetooth/hci_ldisc.c                      |    2 +-
 drivers/bluetooth/hci_vhci.c                       |    3 +-
 drivers/net/wireless/at76c50x-usb.c                |    4 +-
 drivers/net/wireless/ath/ath5k/ahb.c               |    1 +
 drivers/net/wireless/ath/ath9k/ar5008_phy.c        |    2 +-
 drivers/net/wireless/ath/ath9k/ar9003_paprd.c      |    2 +-
 drivers/net/wireless/ath/ath9k/ar9003_phy.c        |    2 +-
 drivers/net/wireless/ath/ath9k/eeprom_9287.c       |    2 +
 drivers/net/wireless/ath/ath9k/hw.c                |    9 +-
 drivers/net/wireless/ath/ath9k/hw.h                |    3 +-
 drivers/net/wireless/b43/main.c                    |   10 +-
 drivers/net/wireless/brcm80211/Kconfig             |    9 +
 drivers/net/wireless/brcm80211/brcmfmac/bcmsdh.c   |   97 ++++-
 .../net/wireless/brcm80211/brcmfmac/bcmsdh_sdmmc.c |  113 +++++-
 drivers/net/wireless/brcm80211/brcmfmac/dhd_sdio.c |  102 ++++-
 .../net/wireless/brcm80211/brcmfmac/sdio_host.h    |   22 +-
 drivers/net/wireless/brcm80211/brcmsmac/main.c     |    3 +-
 drivers/net/wireless/ipw2x00/ipw2100.c             |   24 +-
 drivers/net/wireless/ipw2x00/ipw2200.c             |   57 ++--
 drivers/net/wireless/iwlwifi/iwl-1000.c            |    8 +-
 drivers/net/wireless/iwlwifi/iwl-2000.c            |   16 +-
 drivers/net/wireless/iwlwifi/iwl-5000.c            |   11 +-
 drivers/net/wireless/iwlwifi/iwl-6000.c            |   10 +-
 drivers/net/wireless/iwlwifi/iwl-agn-lib.c         |  153 +++----
 drivers/net/wireless/iwlwifi/iwl-agn.c             |   41 +-
 drivers/net/wireless/iwlwifi/iwl-agn.h             |    2 +-
 drivers/net/wireless/iwlwifi/iwl-commands.h        |   21 +-
 drivers/net/wireless/iwlwifi/iwl-dev.h             |    1 +
 drivers/net/wireless/iwlwifi/iwl-drv.c             |   12 +-
 drivers/net/wireless/iwlwifi/iwl-fh.h              |   22 +-
 drivers/net/wireless/iwlwifi/iwl-mac80211.c        |   10 +-
 drivers/net/wireless/iwlwifi/iwl-modparams.h       |    8 +-
 drivers/net/wireless/iwlwifi/iwl-prph.h            |   27 +-
 drivers/net/wireless/iwlwifi/iwl-trans-pcie.c      |    1 +
 drivers/net/wireless/libertas/firmware.c           |    1 +
 drivers/net/wireless/mwifiex/Kconfig               |    4 +-
 drivers/net/wireless/mwifiex/fw.h                  |    3 +-
 drivers/net/wireless/mwifiex/sdio.c                |    7 +
 drivers/net/wireless/mwifiex/sdio.h                |    1 +
 drivers/net/wireless/rt2x00/rt2800.h               |    2 +
 drivers/net/wireless/rt2x00/rt2800lib.c            |    7 +
 drivers/net/wireless/rt2x00/rt2x00debug.c          |   82 ++--
 drivers/net/wireless/rt2x00/rt2x00debug.h          |    1 +
 drivers/net/wireless/rtlwifi/pci.c                 |    1 +
 drivers/net/wireless/ti/wl1251/main.c              |    1 +
 drivers/net/wireless/ti/wl1251/sdio.c              |    2 +-
 include/linux/nl80211.h                            |    3 +
 include/net/bluetooth/bluetooth.h                  |   14 +-
 include/net/bluetooth/hci.h                        |    7 +
 include/net/bluetooth/hci_core.h                   |   21 +-
 include/net/bluetooth/l2cap.h                      |   78 +++-
 include/net/bluetooth/mgmt.h                       |    9 +
 include/net/bluetooth/smp.h                        |    2 +-
 include/net/cfg80211.h                             |   18 +
 include/net/mac80211.h                             |   17 +
 net/bluetooth/hci_conn.c                           |   32 +-
 net/bluetooth/hci_core.c                           |  206 +++++-----
 net/bluetooth/hci_event.c                          |   61 +++-
 net/bluetooth/hci_sysfs.c                          |    5 +-
 net/bluetooth/l2cap_core.c                         |  454 ++++++++++++++++----
 net/bluetooth/l2cap_sock.c                         |   33 +-
 net/bluetooth/lib.c                                |    2 +
 net/bluetooth/mgmt.c                               |  225 +++++++----
 net/bluetooth/sco.c                                |   72 ++--
 net/bluetooth/smp.c                                |    2 +-
 net/mac80211/cfg.c                                 |  182 ++++++++
 net/mac80211/driver-ops.h                          |   37 ++
 net/mac80211/driver-trace.h                        |   15 +
 net/mac80211/ibss.c                                |    2 +-
 net/mac80211/ieee80211_i.h                         |    5 +-
 net/mac80211/iface.c                               |    4 +-
 net/mac80211/mesh.c                                |   18 +-
 net/mac80211/mesh_plink.c                          |   96 ++++-
 net/mac80211/mlme.c                                |    4 +-
 net/mac80211/sta_info.h                            |    1 +
 net/mac80211/tx.c                                  |    3 +-
 net/mac80211/util.c                                |    9 +-
 net/wireless/ethtool.c                             |   29 ++
 net/wireless/mesh.c                                |    1 +
 net/wireless/nl80211.c                             |    7 +-
 net/wireless/scan.c                                |    6 +-
 net/wireless/util.c                                |    3 +-
 85 files changed, 1972 insertions(+), 665 deletions(-)
-- 
John W. Linville		Someday the world will need a hero, and you
linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org			might be all we have.  Be ready.


* Re: [PATCH 2/2] tcp: cleanup tcp_try_coalesce
From: John W. Linville @ 2012-05-03 15:14 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w,
	alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w,
	alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA, edumazet-hpIqsD4AKlfQT0dZR+AlfA,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	wey-yi.w.guy-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <20120503.012502.44731688706812861.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

On Thu, May 03, 2012 at 01:25:02AM -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Date: Thu, 03 May 2012 07:19:33 +0200
> 
> > My last patch against iwlwifi is still waiting to make its way into
> > official tree.
> > 
> > http://www.spinics.net/lists/netdev/msg192629.html
> 
> John, please rectify this situation.
> 
> The Intel Wireless folks said they would test it, but that was more
> than a month ago.
> 
> It's not acceptable to let bug fixes rot for that long, I don't care
> what their special internal testing procedure is.
> 
> If they give you further pushback, please just ignore them and apply
> Eric's fix directly.
> 
> Thank you.

I imagine that this somehow got lost in the shuffle during the
merge window.  That doesn't excuse it, of course.

It has waited long enough already, so I'll just go ahead and take it.

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org			might be all we have.  Be ready.

* Re: [PATCH 00/16] Swap-over-NBD without deadlocking V9
From: Mel Gorman @ 2012-05-03 15:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
	Peter Zijlstra, Mike Christie, Eric B Munson
In-Reply-To: <20120501152826.b970a098.akpm@linux-foundation.org>

On Tue, May 01, 2012 at 03:28:26PM -0700, Andrew Morton wrote:
> 
> This patchset is far less ghastly than I feared/remembered/dreamed ;)
> 

That might be the best comment the series ever received :)

> The mm parts, anyway.  Are the net guys on board with it all?

They are cc'd but have not given any feedback in a while. That could be
because they are happy with it, or because they felt that while the MM parts
were blocking the series it was unnecessary to review the network parts.

Any of the networking people care to comment?

-- 
Mel Gorman
SUSE Labs


* [PATCH 9/9] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack.
From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell,
	Neil Brown, J. Bruce Fields, linux-nfs
In-Reply-To: <1336056915.20716.96.camel@zakaz.uk.xensource.com>

This prevents an issue where an ACK is delayed, a retransmit is queued (either
at the RPC or TCP level) and the ACK arrives before the retransmission hits the
wire. If this happens to an NFS WRITE RPC then the write() system call
completes and the userspace process can continue, potentially modifying data
referenced by the retransmission before the retransmission occurs.
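
The mechanism can be pictured as a reference count with a completion callback: the request only completes once every holder of the pages has dropped its reference. The following single-threaded sketch uses hypothetical `demo_*` names and a plain integer counter; it is not the skb_frag_destructor implementation, which would need atomic operations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a fragment destructor: a reference count
 * plus a callback that runs when the last reference is dropped. */
struct demo_frag_destructor {
	int refs;
	void (*destroy)(struct demo_frag_destructor *d);
};

/* Hypothetical request that embeds its destructor. */
struct demo_request {
	int completed;
	struct demo_frag_destructor destructor;
};

#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void demo_ref(struct demo_frag_destructor *d)
{
	d->refs++;
}

static void demo_unref(struct demo_frag_destructor *d)
{
	assert(d->refs > 0);
	if (--d->refs == 0)
		d->destroy(d);
}

/* Runs once nobody references the pages any more. */
static void demo_request_done(struct demo_frag_destructor *d)
{
	struct demo_request *req =
		demo_container_of(d, struct demo_request, destructor);

	req->completed = 1;
}

static void demo_request_init(struct demo_request *req)
{
	req->completed = 0;
	req->destructor.refs = 1;	/* reference held by the request owner */
	req->destructor.destroy = demo_request_done;
}
```

With this shape, a queued retransmission that still holds a reference keeps the request from completing even after the reply has arrived and the owner has dropped its own reference.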

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: linux-nfs@vger.kernel.org
Cc: netdev@vger.kernel.org
---
 include/linux/sunrpc/xdr.h  |    2 ++
 include/linux/sunrpc/xprt.h |    5 ++++-
 net/sunrpc/clnt.c           |   27 ++++++++++++++++++++++-----
 net/sunrpc/svcsock.c        |    3 ++-
 net/sunrpc/xprt.c           |   12 ++++++++++++
 net/sunrpc/xprtsock.c       |    3 ++-
 6 files changed, 44 insertions(+), 8 deletions(-)

diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h
index af70af3..ff1b121 100644
--- a/include/linux/sunrpc/xdr.h
+++ b/include/linux/sunrpc/xdr.h
@@ -16,6 +16,7 @@
 #include <asm/byteorder.h>
 #include <asm/unaligned.h>
 #include <linux/scatterlist.h>
+#include <linux/skbuff.h>
 
 /*
  * Buffer adjustment
@@ -57,6 +58,7 @@ struct xdr_buf {
 			tail[1];	/* Appended after page data */
 
 	struct page **	pages;		/* Array of contiguous pages */
+	struct skb_frag_destructor *destructor;
 	unsigned int	page_base,	/* Start of page data */
 			page_len,	/* Length of page data */
 			flags;		/* Flags for data disposition */
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 77d278d..e8d3f18 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -92,7 +92,10 @@ struct rpc_rqst {
 						/* A cookie used to track the
 						   state of the transport
 						   connection */
-	
+	struct skb_frag_destructor destructor;	/* SKB paged fragment
+						 * destructor for
+						 * transmitted pages*/
+
 	/*
 	 * Partial send handling
 	 */
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 6797246..351bf3d 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -61,6 +61,7 @@ static void	call_reserve(struct rpc_task *task);
 static void	call_reserveresult(struct rpc_task *task);
 static void	call_allocate(struct rpc_task *task);
 static void	call_decode(struct rpc_task *task);
+static void	call_complete(struct rpc_task *task);
 static void	call_bind(struct rpc_task *task);
 static void	call_bind_status(struct rpc_task *task);
 static void	call_transmit(struct rpc_task *task);
@@ -1416,6 +1417,8 @@ rpc_xdr_encode(struct rpc_task *task)
 			 (char *)req->rq_buffer + req->rq_callsize,
 			 req->rq_rcvsize);
 
+	req->rq_snd_buf.destructor = &req->destructor;
+
 	p = rpc_encode_header(task);
 	if (p == NULL) {
 		printk(KERN_INFO "RPC: couldn't encode RPC header, exit EIO\n");
@@ -1581,6 +1584,7 @@ call_connect_status(struct rpc_task *task)
 static void
 call_transmit(struct rpc_task *task)
 {
+	struct rpc_rqst *req = task->tk_rqstp;
 	dprint_status(task);
 
 	task->tk_action = call_status;
@@ -1614,8 +1618,8 @@ call_transmit(struct rpc_task *task)
 	call_transmit_status(task);
 	if (rpc_reply_expected(task))
 		return;
-	task->tk_action = rpc_exit_task;
-	rpc_wake_up_queued_task(&task->tk_xprt->pending, task);
+	task->tk_action = call_complete;
+	skb_frag_destructor_unref(&req->destructor);
 }
 
 /*
@@ -1688,7 +1692,8 @@ call_bc_transmit(struct rpc_task *task)
 		return;
 	}
 
-	task->tk_action = rpc_exit_task;
+	task->tk_action = call_complete;
+	skb_frag_destructor_unref(&req->destructor);
 	if (task->tk_status < 0) {
 		printk(KERN_NOTICE "RPC: Could not send backchannel reply "
 			"error: %d\n", task->tk_status);
@@ -1728,7 +1733,6 @@ call_bc_transmit(struct rpc_task *task)
 			"error: %d\n", task->tk_status);
 		break;
 	}
-	rpc_wake_up_queued_task(&req->rq_xprt->pending, task);
 }
 #endif /* CONFIG_SUNRPC_BACKCHANNEL */
 
@@ -1906,12 +1910,14 @@ call_decode(struct rpc_task *task)
 		return;
 	}
 
-	task->tk_action = rpc_exit_task;
+	task->tk_action = call_complete;
 
 	if (decode) {
 		task->tk_status = rpcauth_unwrap_resp(task, decode, req, p,
 						      task->tk_msg.rpc_resp);
 	}
+	rpc_sleep_on(&req->rq_xprt->pending, task, NULL);
+	skb_frag_destructor_unref(&req->destructor);
 	dprintk("RPC: %5u call_decode result %d\n", task->tk_pid,
 			task->tk_status);
 	return;
@@ -1926,6 +1932,17 @@ out_retry:
 	}
 }
 
+/*
+ * 8.	Wait for pages to be released by the network stack.
+ */
+static void
+call_complete(struct rpc_task *task)
+{
+	dprintk("RPC: %5u call_complete result %d\n",
+		task->tk_pid, task->tk_status);
+	task->tk_action = rpc_exit_task;
+}
+
 static __be32 *
 rpc_encode_header(struct rpc_task *task)
 {
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index f6d8c73..1145929 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -198,7 +198,8 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	while (pglen > 0) {
 		if (slen == size)
 			flags = 0;
-		result = kernel_sendpage(sock, *ppage, NULL, base, size, flags);
+		result = kernel_sendpage(sock, *ppage, xdr->destructor,
+					 base, size, flags);
 		if (result > 0)
 			len += result;
 		if (result != size)
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 6fe2dce..f8418a0 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1108,6 +1108,16 @@ static inline void xprt_init_xid(struct rpc_xprt *xprt)
 	xprt->xid = net_random();
 }
 
+static int xprt_complete_skb_pages(struct skb_frag_destructor *destroy)
+{
+	struct rpc_rqst	*req =
+		container_of(destroy, struct rpc_rqst, destructor);
+
+	dprintk("RPC: %5u completing skb pages\n", req->rq_task->tk_pid);
+	rpc_wake_up_queued_task(&req->rq_xprt->pending, req->rq_task);
+	return 0;
+}
+
 static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
 {
 	struct rpc_rqst	*req = task->tk_rqstp;
@@ -1120,6 +1130,8 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
 	req->rq_xid     = xprt_alloc_xid(xprt);
 	req->rq_release_snd_buf = NULL;
 	xprt_reset_majortimeo(req);
+	atomic_set(&req->destructor.ref, 1);
+	req->destructor.destroy = &xprt_complete_skb_pages;
 	dprintk("RPC: %5u reserved req %p xid %08x\n", task->tk_pid,
 			req, ntohl(req->rq_xid));
 }
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index f1995dc..44e07f3 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -408,7 +408,8 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
 		remainder -= len;
 		if (remainder != 0 || more)
 			flags |= MSG_MORE;
-		err = sock->ops->sendpage(sock, *ppage, NULL, base, len, flags);
+		err = sock->ops->sendpage(sock, *ppage, xdr->destructor,
+					  base, len, flags);
 		if (remainder == 0 || err != len)
 			break;
 		sent += err;
-- 
1.7.2.5

^ permalink raw reply related

* [PATCH 2/9] net: Use SKB_WITH_OVERHEAD in build_skb
From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell
In-Reply-To: <1336056915.20716.96.camel@zakaz.uk.xensource.com>

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/core/skbuff.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a056d7c..c60b603 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -263,7 +263,7 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
 	if (!skb)
 		return NULL;
 
-	size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	size = SKB_WITH_OVERHEAD(size);
 
 	memset(skb, 0, offsetof(struct sk_buff, tail));
 	skb->truesize = SKB_TRUESIZE(size);
-- 
1.7.2.5

^ permalink raw reply related

* [PATCH 4/9] skb: add skb_shinfo_init and use for both alloc_skb, build_skb and skb_recycle
From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell
In-Reply-To: <1336056915.20716.96.camel@zakaz.uk.xensource.com>

There is only one semantic change here, which is that skb_recycle now does:
	kmemcheck_annotate_variable(shinfo->destructor_arg)
I don't think it was erroneously missing before (since in the skb_recycle case
it will have happened previously) but I believe it is harmless to do it again
and this saves having a different copy of the same code for the recycle case.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 net/core/skbuff.c |   30 +++++++++++++-----------------
 1 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c60b603..e96f68b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -145,6 +145,16 @@ static void skb_under_panic(struct sk_buff *skb, int sz, void *here)
 	BUG();
 }
 
+static void skb_shinfo_init(struct sk_buff *skb)
+{
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+	/* make sure we initialize shinfo sequentially */
+	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
+	atomic_set(&shinfo->dataref, 1);
+	kmemcheck_annotate_variable(shinfo->destructor_arg);
+}
+
 /* 	Allocate a new skbuff. We do this ourselves so we can fill in a few
  *	'private' fields and also do memory statistics to find all the
  *	[BEEP] leaks.
@@ -170,7 +180,6 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 			    int fclone, int node)
 {
 	struct kmem_cache *cache;
-	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	u8 *data;
 
@@ -210,11 +219,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	skb->mac_header = ~0U;
 #endif
 
-	/* make sure we initialize shinfo sequentially */
-	shinfo = skb_shinfo(skb);
-	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
-	atomic_set(&shinfo->dataref, 1);
-	kmemcheck_annotate_variable(shinfo->destructor_arg);
+	skb_shinfo_init(skb);
 
 	if (fclone) {
 		struct sk_buff *child = skb + 1;
@@ -255,7 +260,6 @@ EXPORT_SYMBOL(__alloc_skb);
  */
 struct sk_buff *build_skb(void *data, unsigned int frag_size)
 {
-	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	unsigned int size = frag_size ? : ksize(data);
 
@@ -277,11 +281,7 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
 	skb->mac_header = ~0U;
 #endif
 
-	/* make sure we initialize shinfo sequentially */
-	shinfo = skb_shinfo(skb);
-	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
-	atomic_set(&shinfo->dataref, 1);
-	kmemcheck_annotate_variable(shinfo->destructor_arg);
+	skb_shinfo_init(skb);
 
 	return skb;
 }
@@ -546,13 +546,9 @@ EXPORT_SYMBOL(consume_skb);
  */
 void skb_recycle(struct sk_buff *skb)
 {
-	struct skb_shared_info *shinfo;
-
 	skb_release_head_state(skb);
 
-	shinfo = skb_shinfo(skb);
-	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
-	atomic_set(&shinfo->dataref, 1);
+	skb_shinfo_init(skb);
 
 	memset(skb, 0, offsetof(struct sk_buff, tail));
 	skb->data = skb->head + NET_SKB_PAD;
-- 
1.7.2.5

^ permalink raw reply related

* [PATCH 6/9] net: add support for per-paged-fragment destructors
From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell,
	Michał Mirosław
In-Reply-To: <1336056915.20716.96.camel@zakaz.uk.xensource.com>

Entities which care about the complete lifecycle of pages which they inject
into the network stack via an skb paged fragment can choose to set this
destructor in order to receive a callback when the stack is really finished
with a page (including all clones, retransmits, pull-ups, etc.).

This destructor will always be propagated alongside the struct page when
copying skb_frag_t->page. This is the reason I chose to embed the destructor in
a "struct { } page" within the skb_frag_t, rather than as a separate field,
since it allows existing code which propagates ->frags[N].page to Just
Work(tm).

When the destructor is present the page reference counting is done slightly
differently. No references are held by the network stack on the struct page (it
is up to the caller to manage this as necessary); instead the network stack will
track references via the count embedded in the destructor structure. When this
reference count reaches zero the destructor will be called and the caller
can take the necessary steps to release the page (i.e. release the struct page
reference itself).

The intention is that callers can use this callback to delay completion to
_their_ callers until the network stack has completely released the page, in
order to prevent use-after-free or modification of data pages which are still
in use by the stack.

It is allowable (indeed expected) for a caller to share a single destructor
instance between multiple pages injected into the stack e.g. a group of pages
included in a single higher level operation might share a destructor which is
used to complete that higher level operation.

Previous changes have ensured that, even with the increase in frag size, the
hot fields (nr_frags through to at least frags[0]) fit within and are aligned
to a 64 byte cache line.
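
The reference counting described above can be illustrated with a small
userspace sketch. This is not the kernel code from this patch: it uses C11
atomics in place of atomic_t, and the names frag_destructor, request,
request_pages_released and demo are hypothetical stand-ins chosen for the
example. It assumes the caller starts with one reference of its own and that
each fragment sharing the destructor takes one more.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Userspace stand-in for the patch's struct skb_frag_destructor. */
struct frag_destructor {
	atomic_int ref;
	int (*destroy)(struct frag_destructor *d);
};

/* Hypothetical higher level operation which owns several pages. */
struct request {
	int completed;
	struct frag_destructor destructor;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void frag_destructor_ref(struct frag_destructor *d)
{
	atomic_fetch_add(&d->ref, 1);
}

static void frag_destructor_unref(struct frag_destructor *d)
{
	if (d == NULL)
		return;
	/* fetch_sub returns the old value: 1 means the last ref is gone */
	if (atomic_fetch_sub(&d->ref, 1) == 1)
		d->destroy(d);
}

/* Called once every user of the shared destructor has dropped its ref. */
static int request_pages_released(struct frag_destructor *d)
{
	struct request *req = container_of(d, struct request, destructor);

	req->completed = 1;
	return 0;
}

/* Returns 1 if the request completed only after the final unref. */
static int demo(void)
{
	struct request req = { .completed = 0 };
	int i;

	atomic_init(&req.destructor.ref, 1);	/* the caller's own reference */
	req.destructor.destroy = request_pages_released;

	/* Three fragments share one destructor, one reference each. */
	for (i = 0; i < 3; i++)
		frag_destructor_ref(&req.destructor);

	/* The caller has queued everything and drops its own reference. */
	frag_destructor_unref(&req.destructor);

	/* The "stack" releases the fragments one at a time. */
	for (i = 0; i < 3; i++) {
		assert(!req.completed);
		frag_destructor_unref(&req.destructor);
	}
	return req.completed;
}
```

Calling demo() from a main() exercises the whole sequence: completion is
deferred until the last fragment reference is dropped, which is the property
a caller relies on to delay completion to its own callers.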

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: netdev@vger.kernel.org
---
 include/linux/skbuff.h |   50 ++++++++++++++++++++++++++++++++++++++++++++++-
 net/core/skbuff.c      |   18 +++++++++++++++++
 net/ipv4/ip_output.c   |    2 +-
 net/ipv4/tcp.c         |    4 +-
 4 files changed, 69 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 3698625..ccc7d93 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -168,9 +168,15 @@ struct sk_buff;
 
 typedef struct skb_frag_struct skb_frag_t;
 
+struct skb_frag_destructor {
+	atomic_t ref;
+	int (*destroy)(struct skb_frag_destructor *destructor);
+};
+
 struct skb_frag_struct {
 	struct {
 		struct page *p;
+		struct skb_frag_destructor *destructor;
 	} page;
 #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
 	__u32 page_offset;
@@ -1232,6 +1238,31 @@ static inline int skb_pagelen(const struct sk_buff *skb)
 }
 
 /**
+ * skb_frag_set_destructor - set destructor for a paged fragment
+ * @skb: buffer containing fragment to be initialised
+ * @i: paged fragment index to initialise
+ * @destroy: the destructor to use for this fragment
+ *
+ * Sets @destroy as the destructor to be called when all references to
+ * the frag @i in @skb (tracked over skb_clone, retransmit, pull-ups,
+ * etc) are released.
+ *
+ * When a destructor is set then reference counting is performed on
+ * @destroy->ref. When the ref reaches zero then @destroy->destroy
+ * will be called. The caller is responsible for holding and managing
+ * any other references (such as the struct page reference count).
+ *
+ * This function must be called before any use of skb_frag_ref() or
+ * skb_frag_unref().
+ */
+static inline void skb_frag_set_destructor(struct sk_buff *skb, int i,
+					   struct skb_frag_destructor *destroy)
+{
+	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+	frag->page.destructor = destroy;
+}
+
+/**
  * __skb_fill_page_desc - initialise a paged fragment in an skb
  * @skb: buffer containing fragment to be initialised
  * @i: paged fragment index to initialise
@@ -1250,6 +1281,7 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
 	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
 	frag->page.p		  = page;
+	frag->page.destructor     = NULL;
 	frag->page_offset	  = off;
 	skb_frag_size_set(frag, size);
 }
@@ -1766,6 +1798,9 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
 	return frag->page.p;
 }
 
+extern void skb_frag_destructor_ref(struct skb_frag_destructor *destroy);
+extern void skb_frag_destructor_unref(struct skb_frag_destructor *destroy);
+
 /**
  * __skb_frag_ref - take an addition reference on a paged fragment.
  * @frag: the paged fragment
@@ -1774,6 +1809,10 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
  */
 static inline void __skb_frag_ref(skb_frag_t *frag)
 {
+	if (unlikely(frag->page.destructor)) {
+		skb_frag_destructor_ref(frag->page.destructor);
+		return;
+	}
 	get_page(skb_frag_page(frag));
 }
 
@@ -1797,6 +1836,10 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
  */
 static inline void __skb_frag_unref(skb_frag_t *frag)
 {
+	if (unlikely(frag->page.destructor)) {
+		skb_frag_destructor_unref(frag->page.destructor);
+		return;
+	}
 	put_page(skb_frag_page(frag));
 }
 
@@ -1994,13 +2037,16 @@ static inline int skb_add_data(struct sk_buff *skb,
 }
 
 static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
-				    const struct page *page, int off)
+				    const struct page *page,
+				    const struct skb_frag_destructor *destroy,
+				    int off)
 {
 	if (i) {
 		const struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i - 1];
 
 		return page == skb_frag_page(frag) &&
-		       off == frag->page_offset + skb_frag_size(frag);
+		       off == frag->page_offset + skb_frag_size(frag) &&
+		       frag->page.destructor == destroy;
 	}
 	return false;
 }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fab6de0..945b807 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -353,6 +353,23 @@ struct sk_buff *dev_alloc_skb(unsigned int length)
 }
 EXPORT_SYMBOL(dev_alloc_skb);
 
+void skb_frag_destructor_ref(struct skb_frag_destructor *destroy)
+{
+	BUG_ON(destroy == NULL);
+	atomic_inc(&destroy->ref);
+}
+EXPORT_SYMBOL(skb_frag_destructor_ref);
+
+void skb_frag_destructor_unref(struct skb_frag_destructor *destroy)
+{
+	if (destroy == NULL)
+		return;
+
+	if (atomic_dec_and_test(&destroy->ref))
+		destroy->destroy(destroy);
+}
+EXPORT_SYMBOL(skb_frag_destructor_unref);
+
 static void skb_drop_list(struct sk_buff **listp)
 {
 	struct sk_buff *list = *listp;
@@ -2334,6 +2351,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
 	 */
 	if (!to ||
 	    !skb_can_coalesce(tgt, to, skb_frag_page(fragfrom),
+			      fragfrom->page.destructor,
 			      fragfrom->page_offset)) {
 		merge = -1;
 	} else {
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 4910176..7652751 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1242,7 +1242,7 @@ ssize_t	ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
 		i = skb_shinfo(skb)->nr_frags;
 		if (len > size)
 			len = size;
-		if (skb_can_coalesce(skb, i, page, offset)) {
+		if (skb_can_coalesce(skb, i, page, NULL, offset)) {
 			skb_frag_size_add(&skb_shinfo(skb)->frags[i-1], len);
 		} else if (i < MAX_SKB_FRAGS) {
 			get_page(page);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9670af3..2d590ca 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -870,7 +870,7 @@ new_segment:
 			copy = size;
 
 		i = skb_shinfo(skb)->nr_frags;
-		can_coalesce = skb_can_coalesce(skb, i, page, offset);
+		can_coalesce = skb_can_coalesce(skb, i, page, NULL, offset);
 		if (!can_coalesce && i >= MAX_SKB_FRAGS) {
 			tcp_mark_push(tp, skb);
 			goto new_segment;
@@ -1124,7 +1124,7 @@ new_segment:
 
 				off = sk->sk_sndmsg_off;
 
-				if (skb_can_coalesce(skb, i, page, off) &&
+				if (skb_can_coalesce(skb, i, page, NULL, off) &&
 				    off != PAGE_SIZE) {
 					/* We can extend the last page
 					 * fragment. */
-- 
1.7.2.5

^ permalink raw reply related

* [PATCH 3/9] chelsio: use SKB_WITH_OVERHEAD
From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell,
	Divy Le Ray
In-Reply-To: <1336056915.20716.96.camel@zakaz.uk.xensource.com>

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Divy Le Ray <divy@chelsio.com>
---
 drivers/net/ethernet/chelsio/cxgb/sge.c  |    3 +--
 drivers/net/ethernet/chelsio/cxgb3/sge.c |    6 +++---
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb/sge.c b/drivers/net/ethernet/chelsio/cxgb/sge.c
index 47a8435..52373db 100644
--- a/drivers/net/ethernet/chelsio/cxgb/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb/sge.c
@@ -599,8 +599,7 @@ static int alloc_rx_resources(struct sge *sge, struct sge_params *p)
 		sizeof(struct cpl_rx_data) +
 		sge->freelQ[!sge->jumbo_fl].dma_offset;
 
-		size = (16 * 1024) -
-		    SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	size = SKB_WITH_OVERHEAD(16 * 1024);
 
 	sge->freelQ[sge->jumbo_fl].rx_buffer_size = size;
 
diff --git a/drivers/net/ethernet/chelsio/cxgb3/sge.c b/drivers/net/ethernet/chelsio/cxgb3/sge.c
index cfb60e1..b804470 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb3/sge.c
@@ -3043,7 +3043,7 @@ int t3_sge_alloc_qset(struct adapter *adapter, unsigned int id, int nports,
 	q->fl[1].buf_size = FL1_PG_CHUNK_SIZE;
 #else
 	q->fl[1].buf_size = is_offload(adapter) ?
-		(16 * 1024) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) :
+		SKB_WITH_OVERHEAD(16 * 1024) :
 		MAX_FRAME_SIZE + 2 + sizeof(struct cpl_rx_pkt);
 #endif
 
@@ -3282,8 +3282,8 @@ void t3_sge_prep(struct adapter *adap, struct sge_params *p)
 {
 	int i;
 
-	p->max_pkt_size = (16 * 1024) - sizeof(struct cpl_rx_data) -
-	    SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	p->max_pkt_size =
+		SKB_WITH_OVERHEAD((16*1024) - sizeof(struct cpl_rx_data));
 
 	for (i = 0; i < SGE_QSETS; ++i) {
 		struct qset_params *q = p->qset + i;
-- 
1.7.2.5

^ permalink raw reply related

* Re: [PATCH 05/11] mm: swap: Implement generic handler for swap_activate
From: Mel Gorman @ 2012-05-03 14:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, Linux-NFS, LKML, David Miller,
	Trond Myklebust, Neil Brown, Christoph Hellwig, Peter Zijlstra,
	Mike Christie, Eric B Munson
In-Reply-To: <20120501155747.368a1d36.akpm@linux-foundation.org>

On Tue, May 01, 2012 at 03:57:47PM -0700, Andrew Morton wrote:
> On Mon, 16 Apr 2012 13:17:49 +0100
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > The version of swap_activate introduced is sufficient for swap-over-NFS
> > but would not provide enough information to implement a generic handler.
> > This patch shuffles things slightly to ensure the same information is
> > available for aops->swap_activate() as is available to the core.
> > 
> > No functionality change.
> > 
> > ...
> >
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -587,6 +587,8 @@ typedef struct {
> >  typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
> >  		unsigned long, unsigned long);
> >  
> > +struct swap_info_struct;
> 
> Please put forward declarations at top-of-file.  To prevent accidental
> duplication later on.
> 

Done.

> >  struct address_space_operations {
> >  	int (*writepage)(struct page *page, struct writeback_control *wbc);
> >  	int (*readpage)(struct file *, struct page *);
> >
> > ...
> >
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> 
> Have you tested all this code with CONFIG_SWAP=n?
> 

Emm, it builds. That counts, right?

> Have you sought to minimise additional new code when CONFIG_SWAP=n?
> 

Not specifically, but generic_swapfile_activate() is defined in page_io.c
and that is built only if CONFIG_SWAP=y. Similarly swapon is in
swapfile.c which is only built when swap is enabled.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply

* [PATCH 8/9] net: add paged frag destructor support to kernel_sendpage.
From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell
In-Reply-To: <1336056915.20716.96.camel@zakaz.uk.xensource.com>

This requires adding a new argument to various sendpage hooks up and down the
stack. At the moment this parameter is always NULL.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
---
 drivers/block/drbd/drbd_main.c           |    1 +
 drivers/scsi/iscsi_tcp.c                 |    4 ++--
 drivers/scsi/iscsi_tcp.h                 |    3 ++-
 drivers/target/iscsi/iscsi_target_util.c |    3 ++-
 fs/dlm/lowcomms.c                        |    4 ++--
 fs/ocfs2/cluster/tcp.c                   |    1 +
 include/linux/net.h                      |    6 +++++-
 include/net/inet_common.h                |    4 +++-
 include/net/ip.h                         |    4 +++-
 include/net/sock.h                       |    8 +++++---
 include/net/tcp.h                        |    4 +++-
 net/ceph/messenger.c                     |    2 +-
 net/core/sock.c                          |    6 +++++-
 net/ipv4/af_inet.c                       |    9 ++++++---
 net/ipv4/ip_output.c                     |    6 ++++--
 net/ipv4/tcp.c                           |   24 +++++++++++++++---------
 net/ipv4/udp.c                           |   11 ++++++-----
 net/ipv4/udp_impl.h                      |    5 +++--
 net/rds/tcp_send.c                       |    1 +
 net/socket.c                             |   11 +++++++----
 net/sunrpc/svcsock.c                     |    6 +++---
 net/sunrpc/xprtsock.c                    |    2 +-
 22 files changed, 81 insertions(+), 44 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 211fc44..e70ba0c 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2584,6 +2584,7 @@ static int _drbd_send_page(struct drbd_conf *mdev, struct page *page,
 	set_fs(KERNEL_DS);
 	do {
 		sent = mdev->data.socket->ops->sendpage(mdev->data.socket, page,
+							NULL,
 							offset, len,
 							msg_flags);
 		if (sent == -EAGAIN) {
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 9220861..724d32538 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -284,8 +284,8 @@ static int iscsi_sw_tcp_xmit_segment(struct iscsi_tcp_conn *tcp_conn,
 		if (!segment->data) {
 			sg = segment->sg;
 			offset += segment->sg_offset + sg->offset;
-			r = tcp_sw_conn->sendpage(sk, sg_page(sg), offset,
-						  copy, flags);
+			r = tcp_sw_conn->sendpage(sk, sg_page(sg), NULL,
+						  offset, copy, flags);
 		} else {
 			struct msghdr msg = { .msg_flags = flags };
 			struct kvec iov = {
diff --git a/drivers/scsi/iscsi_tcp.h b/drivers/scsi/iscsi_tcp.h
index 666fe09..1e23265 100644
--- a/drivers/scsi/iscsi_tcp.h
+++ b/drivers/scsi/iscsi_tcp.h
@@ -52,7 +52,8 @@ struct iscsi_sw_tcp_conn {
 	uint32_t		sendpage_failures_cnt;
 	uint32_t		discontiguous_hdr_cnt;
 
-	ssize_t (*sendpage)(struct socket *, struct page *, int, size_t, int);
+	ssize_t (*sendpage)(struct socket *, struct page *,
+			    struct skb_frag_destructor *, int, size_t, int);
 };
 
 struct iscsi_sw_tcp_host {
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 4eba86d..d876dae 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -1323,7 +1323,8 @@ send_hdr:
 		u32 sub_len = min_t(u32, data_len, space);
 send_pg:
 		tx_sent = conn->sock->ops->sendpage(conn->sock,
-					sg_page(sg), sg->offset + offset, sub_len, 0);
+					sg_page(sg), NULL,
+					sg->offset + offset, sub_len, 0);
 		if (tx_sent != sub_len) {
 			if (tx_sent == -EAGAIN) {
 				pr_err("tcp_sendpage() returned"
diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 133ef6d..0673cea 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1336,8 +1336,8 @@ static void send_to_sock(struct connection *con)
 
 		ret = 0;
 		if (len) {
-			ret = kernel_sendpage(con->sock, e->page, offset, len,
-					      msg_flags);
+			ret = kernel_sendpage(con->sock, e->page, NULL,
+					      offset, len, msg_flags);
 			if (ret == -EAGAIN || ret == 0) {
 				if (ret == -EAGAIN &&
 				    test_bit(SOCK_ASYNC_NOSPACE, &con->sock->flags) &&
diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index 1bfe880..c82a711 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -983,6 +983,7 @@ static void o2net_sendpage(struct o2net_sock_container *sc,
 		mutex_lock(&sc->sc_send_lock);
 		ret = sc->sc_sock->ops->sendpage(sc->sc_sock,
 						 virt_to_page(kmalloced_virt),
+						 NULL,
 						 (long)kmalloced_virt & ~PAGE_MASK,
 						 size, MSG_DONTWAIT);
 		mutex_unlock(&sc->sc_send_lock);
diff --git a/include/linux/net.h b/include/linux/net.h
index be60c7f..d9b0d648 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -157,6 +157,7 @@ struct kiocb;
 struct sockaddr;
 struct msghdr;
 struct module;
+struct skb_frag_destructor;
 
 struct proto_ops {
 	int		family;
@@ -203,6 +204,7 @@ struct proto_ops {
 	int		(*mmap)	     (struct file *file, struct socket *sock,
 				      struct vm_area_struct * vma);
 	ssize_t		(*sendpage)  (struct socket *sock, struct page *page,
+				      struct skb_frag_destructor *destroy,
 				      int offset, size_t size, int flags);
 	ssize_t 	(*splice_read)(struct socket *sock,  loff_t *ppos,
 				       struct pipe_inode_info *pipe, size_t len, unsigned int flags);
@@ -274,7 +276,9 @@ extern int kernel_getsockopt(struct socket *sock, int level, int optname,
 			     char *optval, int *optlen);
 extern int kernel_setsockopt(struct socket *sock, int level, int optname,
 			     char *optval, unsigned int optlen);
-extern int kernel_sendpage(struct socket *sock, struct page *page, int offset,
+extern int kernel_sendpage(struct socket *sock, struct page *page,
+			   struct skb_frag_destructor *destroy,
+			   int offset,
 			   size_t size, int flags);
 extern int kernel_sock_ioctl(struct socket *sock, int cmd, unsigned long arg);
 extern int kernel_sock_shutdown(struct socket *sock,
diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 22fac98..91cd8d0 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -21,7 +21,9 @@ extern int inet_dgram_connect(struct socket *sock, struct sockaddr * uaddr,
 extern int inet_accept(struct socket *sock, struct socket *newsock, int flags);
 extern int inet_sendmsg(struct kiocb *iocb, struct socket *sock,
 			struct msghdr *msg, size_t size);
-extern ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
+extern ssize_t inet_sendpage(struct socket *sock, struct page *page,
+			     struct skb_frag_destructor *frag,
+			     int offset,
 			     size_t size, int flags);
 extern int inet_recvmsg(struct kiocb *iocb, struct socket *sock,
 			struct msghdr *msg, size_t size, int flags);
diff --git a/include/net/ip.h b/include/net/ip.h
index 94ddb69c..dbd7ecb 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -114,7 +114,9 @@ extern int		ip_append_data(struct sock *sk, struct flowi4 *fl4,
 				struct rtable **rt,
 				unsigned int flags);
 extern int		ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb);
-extern ssize_t		ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
+extern ssize_t		ip_append_page(struct sock *sk, struct flowi4 *fl4,
+				struct page *page,
+				struct skb_frag_destructor *destroy,
 				int offset, size_t size, int flags);
 extern struct sk_buff  *__ip_make_skb(struct sock *sk,
 				      struct flowi4 *fl4,
diff --git a/include/net/sock.h b/include/net/sock.h
index 68a2834..c999f48 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -848,6 +848,7 @@ struct proto {
 					size_t len, int noblock, int flags, 
 					int *addr_len);
 	int			(*sendpage)(struct sock *sk, struct page *page,
+					struct skb_frag_destructor *destroy,
 					int offset, size_t size, int flags);
 	int			(*bind)(struct sock *sk, 
 					struct sockaddr *uaddr, int addr_len);
@@ -1466,9 +1467,10 @@ extern int			sock_no_mmap(struct file *file,
 					     struct socket *sock,
 					     struct vm_area_struct *vma);
 extern ssize_t			sock_no_sendpage(struct socket *sock,
-						struct page *page,
-						int offset, size_t size, 
-						int flags);
+					struct page *page,
+					struct skb_frag_destructor *destroy,
+					int offset, size_t size,
+					int flags);
 
 /*
  * Functions to fill in entries in struct proto_ops when a protocol
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0fb84de..81dbfde8 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -331,7 +331,9 @@ extern void *tcp_v4_tw_get_peer(struct sock *sk);
 extern int tcp_v4_tw_remember_stamp(struct inet_timewait_sock *tw);
 extern int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		       size_t size);
-extern int tcp_sendpage(struct sock *sk, struct page *page, int offset,
+extern int tcp_sendpage(struct sock *sk, struct page *page,
+			struct skb_frag_destructor *destroy,
+			int offset,
 			size_t size, int flags);
 extern int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg);
 extern int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 36fa6bf..b355be1 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -320,7 +320,7 @@ static int ceph_tcp_sendpage(struct socket *sock, struct page *page,
 	int flags = MSG_DONTWAIT | MSG_NOSIGNAL | (more ? MSG_MORE : MSG_EOR);
 	int ret;
 
-	ret = kernel_sendpage(sock, page, offset, size, flags);
+	ret = kernel_sendpage(sock, page, NULL, offset, size, flags);
 	if (ret == -EAGAIN)
 		ret = 0;
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 1a88351..cffff5f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1953,7 +1953,9 @@ int sock_no_mmap(struct file *file, struct socket *sock, struct vm_area_struct *
 }
 EXPORT_SYMBOL(sock_no_mmap);
 
-ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset, size_t size, int flags)
+ssize_t sock_no_sendpage(struct socket *sock, struct page *page,
+			 struct skb_frag_destructor *destroy,
+			 int offset, size_t size, int flags)
 {
 	ssize_t res;
 	struct msghdr msg = {.msg_flags = flags};
@@ -1963,6 +1965,8 @@ ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset, siz
 	iov.iov_len = size;
 	res = kernel_sendmsg(sock, &msg, &iov, 1, size);
 	kunmap(page);
+	/* kernel_sendmsg copies so we can destroy immediately */
+	skb_frag_destructor_unref(destroy);
 	return res;
 }
 EXPORT_SYMBOL(sock_no_sendpage);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index c8f7aee..b1caf89 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -747,7 +747,9 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 }
 EXPORT_SYMBOL(inet_sendmsg);
 
-ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
+ssize_t inet_sendpage(struct socket *sock, struct page *page,
+		      struct skb_frag_destructor *destroy,
+		      int offset,
 		      size_t size, int flags)
 {
 	struct sock *sk = sock->sk;
@@ -760,8 +762,9 @@ ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
 		return -EAGAIN;
 
 	if (sk->sk_prot->sendpage)
-		return sk->sk_prot->sendpage(sk, page, offset, size, flags);
-	return sock_no_sendpage(sock, page, offset, size, flags);
+		return sk->sk_prot->sendpage(sk, page, destroy,
+					     offset, size, flags);
+	return sock_no_sendpage(sock, page, destroy, offset, size, flags);
 }
 EXPORT_SYMBOL(inet_sendpage);
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 7652751..877ff62 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1129,6 +1129,7 @@ int ip_append_data(struct sock *sk, struct flowi4 *fl4,
 }
 
 ssize_t	ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
+		       struct skb_frag_destructor *destroy,
 		       int offset, size_t size, int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
@@ -1242,11 +1243,12 @@ ssize_t	ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
 		i = skb_shinfo(skb)->nr_frags;
 		if (len > size)
 			len = size;
-		if (skb_can_coalesce(skb, i, page, NULL, offset)) {
+		if (skb_can_coalesce(skb, i, page, destroy, offset)) {
 			skb_frag_size_add(&skb_shinfo(skb)->frags[i-1], len);
 		} else if (i < MAX_SKB_FRAGS) {
-			get_page(page);
 			skb_fill_page_desc(skb, i, page, offset, len);
+			skb_frag_set_destructor(skb, i, destroy);
+			skb_frag_ref(skb, i);
 		} else {
 			err = -EMSGSIZE;
 			goto error;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 2d590ca..bee7864 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -822,8 +822,11 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 	return mss_now;
 }
 
-static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
-			 size_t psize, int flags)
+static ssize_t do_tcp_sendpages(struct sock *sk,
+				struct page **pages,
+				struct skb_frag_destructor *destroy,
+				int poffset,
+				size_t psize, int flags)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int mss_now, size_goal;
@@ -870,7 +873,7 @@ new_segment:
 			copy = size;
 
 		i = skb_shinfo(skb)->nr_frags;
-		can_coalesce = skb_can_coalesce(skb, i, page, NULL, offset);
+		can_coalesce = skb_can_coalesce(skb, i, page, destroy, offset);
 		if (!can_coalesce && i >= MAX_SKB_FRAGS) {
 			tcp_mark_push(tp, skb);
 			goto new_segment;
@@ -881,8 +884,9 @@ new_segment:
 		if (can_coalesce) {
 			skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
 		} else {
-			get_page(page);
 			skb_fill_page_desc(skb, i, page, offset, copy);
+			skb_frag_set_destructor(skb, i, destroy);
+			skb_frag_ref(skb, i);
 		}
 
 		skb->len += copy;
@@ -937,18 +941,20 @@ out_err:
 	return sk_stream_error(sk, flags, err);
 }
 
-int tcp_sendpage(struct sock *sk, struct page *page, int offset,
-		 size_t size, int flags)
+int tcp_sendpage(struct sock *sk, struct page *page,
+		 struct skb_frag_destructor *destroy,
+		 int offset, size_t size, int flags)
 {
 	ssize_t res;
 
 	if (!(sk->sk_route_caps & NETIF_F_SG) ||
 	    !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
-		return sock_no_sendpage(sk->sk_socket, page, offset, size,
-					flags);
+		return sock_no_sendpage(sk->sk_socket, page, destroy,
+					offset, size, flags);
 
 	lock_sock(sk);
-	res = do_tcp_sendpages(sk, &page, offset, size, flags);
+	res = do_tcp_sendpages(sk, &page, destroy,
+			       offset, size, flags);
 	release_sock(sk);
 	return res;
 }
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 279fd08..c69aa65 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1032,8 +1032,9 @@ do_confirm:
 }
 EXPORT_SYMBOL(udp_sendmsg);
 
-int udp_sendpage(struct sock *sk, struct page *page, int offset,
-		 size_t size, int flags)
+int udp_sendpage(struct sock *sk, struct page *page,
+		 struct skb_frag_destructor *destroy,
+		 int offset, size_t size, int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct udp_sock *up = udp_sk(sk);
@@ -1061,11 +1062,11 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
 	}
 
 	ret = ip_append_page(sk, &inet->cork.fl.u.ip4,
-			     page, offset, size, flags);
+			     page, destroy, offset, size, flags);
 	if (ret == -EOPNOTSUPP) {
 		release_sock(sk);
-		return sock_no_sendpage(sk->sk_socket, page, offset,
-					size, flags);
+		return sock_no_sendpage(sk->sk_socket, page, destroy,
+					offset, size, flags);
 	}
 	if (ret < 0) {
 		udp_flush_pending_frames(sk);
diff --git a/net/ipv4/udp_impl.h b/net/ipv4/udp_impl.h
index 5a681e2..aa8eca2 100644
--- a/net/ipv4/udp_impl.h
+++ b/net/ipv4/udp_impl.h
@@ -23,8 +23,9 @@ extern int	compat_udp_getsockopt(struct sock *sk, int level, int optname,
 #endif
 extern int	udp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			    size_t len, int noblock, int flags, int *addr_len);
-extern int	udp_sendpage(struct sock *sk, struct page *page, int offset,
-			     size_t size, int flags);
+extern int	udp_sendpage(struct sock *sk, struct page *page,
+			     struct skb_frag_destructor *destroy,
+			     int offset, size_t size, int flags);
 extern int	udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
 extern void	udp_destroy_sock(struct sock *sk);
 
diff --git a/net/rds/tcp_send.c b/net/rds/tcp_send.c
index 1b4fd68..71503ad 100644
--- a/net/rds/tcp_send.c
+++ b/net/rds/tcp_send.c
@@ -119,6 +119,7 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
 	while (sg < rm->data.op_nents) {
 		ret = tc->t_sock->ops->sendpage(tc->t_sock,
 						sg_page(&rm->data.op_sg[sg]),
+						NULL,
 						rm->data.op_sg[sg].offset + off,
 						rm->data.op_sg[sg].length - off,
 						MSG_DONTWAIT|MSG_NOSIGNAL);
diff --git a/net/socket.c b/net/socket.c
index d3aaa4f..f92c9c2 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -815,7 +815,7 @@ static ssize_t sock_sendpage(struct file *file, struct page *page,
 	/* more is a combination of MSG_MORE and MSG_SENDPAGE_NOTLAST */
 	flags |= more;
 
-	return kernel_sendpage(sock, page, offset, size, flags);
+	return kernel_sendpage(sock, page, NULL, offset, size, flags);
 }
 
 static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
@@ -3349,15 +3349,18 @@ int kernel_setsockopt(struct socket *sock, int level, int optname,
 }
 EXPORT_SYMBOL(kernel_setsockopt);
 
-int kernel_sendpage(struct socket *sock, struct page *page, int offset,
+int kernel_sendpage(struct socket *sock, struct page *page,
+		    struct skb_frag_destructor *destroy,
+		    int offset,
 		    size_t size, int flags)
 {
 	sock_update_classid(sock->sk);
 
 	if (sock->ops->sendpage)
-		return sock->ops->sendpage(sock, page, offset, size, flags);
+		return sock->ops->sendpage(sock, page, destroy,
+					   offset, size, flags);
 
-	return sock_no_sendpage(sock, page, offset, size, flags);
+	return sock_no_sendpage(sock, page, destroy, offset, size, flags);
 }
 EXPORT_SYMBOL(kernel_sendpage);
 
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index f0132b2..f6d8c73 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -185,7 +185,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	/* send head */
 	if (slen == xdr->head[0].iov_len)
 		flags = 0;
-	len = kernel_sendpage(sock, headpage, headoffset,
+	len = kernel_sendpage(sock, headpage, NULL, headoffset,
 				  xdr->head[0].iov_len, flags);
 	if (len != xdr->head[0].iov_len)
 		goto out;
@@ -198,7 +198,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 	while (pglen > 0) {
 		if (slen == size)
 			flags = 0;
-		result = kernel_sendpage(sock, *ppage, base, size, flags);
+		result = kernel_sendpage(sock, *ppage, NULL, base, size, flags);
 		if (result > 0)
 			len += result;
 		if (result != size)
@@ -212,7 +212,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
 
 	/* send tail */
 	if (xdr->tail[0].iov_len) {
-		result = kernel_sendpage(sock, tailpage, tailoffset,
+		result = kernel_sendpage(sock, tailpage, NULL, tailoffset,
 				   xdr->tail[0].iov_len, 0);
 		if (result > 0)
 			len += result;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 890b03f..f1995dc 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -408,7 +408,7 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
 		remainder -= len;
 		if (remainder != 0 || more)
 			flags |= MSG_MORE;
-		err = sock->ops->sendpage(sock, *ppage, base, len, flags);
+		err = sock->ops->sendpage(sock, *ppage, NULL, base, len, flags);
 		if (remainder == 0 || err != len)
 			break;
 		sent += err;
-- 
1.7.2.5

^ permalink raw reply related

* [PATCH 5/9] net: pad skb data and shinfo as a whole rather than individually
From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell
In-Reply-To: <1336056915.20716.96.camel@zakaz.uk.xensource.com>

This reduces the minimum overhead required for this allocation such that the
shinfo can be grown in the following patch without overflowing 2048 bytes for a
1500 byte frame.

Reducing this overhead while also growing the shinfo means that sometimes the
tail end of the data can end up in the same cache line as the beginning of the
shinfo. Specifically in the case of the 64 byte cache lines on a 64 bit system
the first 8 bytes of shinfo can overlap the tail cacheline of the data. In many
cases the allocation slop means that there is no overlap.

In order to ensure that the hot struct members remain on the same 64 byte cache
line, move the "destructor_arg" member to the front. This member is not used on
any hot path, so it is a good choice to potentially be on a separate cache line
(one which additionally may be shared with skb->data).

Also, rather than relying on knowledge about the size and layout of the rest of
the shinfo to ensure that the right parts of the shinfo are aligned, decree that
nr_frags will be cache aligned and therefore that the 64 bytes starting at
nr_frags should contain the hot struct members.

All this avoids hitting an extra cache line on hot operations such as
kfree_skb.

On 4k pages this motion and alignment strategy (along with the following frag
size increase) results in the shinfo abutting the very end of the allocation.
On larger pages (where SKB_MAX_FRAGS can be smaller) it means that we still
correctly align the hot data without needing to make assumptions about the data
layout outside of the hot 64-bytes of the shinfo.
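
As a rough illustration of the size arithmetic described above, here is a
userspace sketch (not kernel code). The helper names, the 64 byte cache line
and the idea of passing the shinfo size and the nr_frags offset as plain
numbers are all assumptions made for the example; it also assumes the shinfo
is placed at the very end of a cache-aligned allocation.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_BYTES 64u	/* assumed cache line size */

/* Round x up to a multiple of the cache line size (like SKB_DATA_ALIGN). */
static size_t data_align(size_t x)
{
	return (x + (CACHE_BYTES - 1)) & ~(size_t)(CACHE_BYTES - 1);
}

/* Old scheme: pad the data and the whole shinfo individually. */
size_t allocsize_old(size_t data, size_t shinfo)
{
	return data_align(data) + data_align(shinfo);
}

/*
 * New scheme: only the part of the shinfo from nr_frags onwards is
 * padded. hot_off stands in for offsetof(shinfo, nr_frags).
 */
size_t shinfo_size(size_t shinfo, size_t hot_off)
{
	assert(hot_off <= shinfo);
	return data_align(shinfo - hot_off) + hot_off;
}

/* New scheme: pad data plus shinfo as a whole. */
size_t allocsize_new(size_t data, size_t shinfo, size_t hot_off)
{
	return data_align(data + shinfo_size(shinfo, hot_off));
}

/*
 * Offset of nr_frags inside the allocation, assuming the shinfo
 * occupies the last shinfo_size() bytes of it.
 */
size_t nr_frags_offset(size_t data, size_t shinfo, size_t hot_off)
{
	return allocsize_new(data, shinfo, hot_off)
	       - shinfo_size(shinfo, hot_off) + hot_off;
}
```

With made-up figures of a 328 byte shinfo whose nr_frags sits at offset 8 and
1564 bytes of data, the old scheme asks for 1984 bytes and the new one for
1920, and nr_frags lands at offset 1600, which is a multiple of 64.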

Explicitly aligning nr_frags, rather than relying on analysis of the shinfo
layout, was suggested by Alexander Duyck <alexander.h.duyck@intel.com>.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/skbuff.h |   50 +++++++++++++++++++++++++++++------------------
 net/core/skbuff.c      |    9 +++++++-
 2 files changed, 39 insertions(+), 20 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 19e348f..3698625 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -41,19 +41,24 @@
 
 #define SKB_DATA_ALIGN(X)	(((X) + (SMP_CACHE_BYTES - 1)) & \
 				 ~(SMP_CACHE_BYTES - 1))
-/* maximum data size which can fit into an allocation of X bytes */
-#define SKB_WITH_OVERHEAD(X)	\
-	((X) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+
 /*
- * minimum allocation size required for an skb containing X bytes of data
- *
- * We do our best to align skb_shared_info on a separate cache
- * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
- * aligned memory blocks, unless SLUB/SLAB debug is enabled.  Both
- * skb->head and skb_shared_info are cache line aligned.
+ * We do our best to align the hot members of skb_shared_info on a
+ * separate cache line.  We explicitly align the nr_frags field and
+ * arrange that the order of the fields in skb_shared_info is such
+ * that the interesting fields are nr_frags onwards and are therefore
+ * cache line aligned.
  */
-#define SKB_ALLOCSIZE(X)	\
-	(SKB_DATA_ALIGN((X)) + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+#define SKB_SHINFO_SIZE							\
+	(SKB_DATA_ALIGN(sizeof(struct skb_shared_info)			\
+			- offsetof(struct skb_shared_info, nr_frags))	\
+	 + offsetof(struct skb_shared_info, nr_frags))
+
+/* maximum data size which can fit into an allocation of X bytes */
+#define SKB_WITH_OVERHEAD(X)	((X) - SKB_SHINFO_SIZE)
+
+/* minimum allocation size required for an skb containing X bytes of data */
+#define SKB_ALLOCSIZE(X)	(SKB_DATA_ALIGN((X) + SKB_SHINFO_SIZE))
 
 #define SKB_MAX_ORDER(X, ORDER) \
 	SKB_WITH_OVERHEAD((PAGE_SIZE << (ORDER)) - (X))
@@ -63,7 +68,7 @@
 /* return minimum truesize of one skb containing X bytes of data */
 #define SKB_TRUESIZE(X) ((X) +						\
 			 SKB_DATA_ALIGN(sizeof(struct sk_buff)) +	\
-			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+			 SKB_SHINFO_SIZE)
 
 /* A. Checksumming of received packets by device.
  *
@@ -263,6 +268,19 @@ struct ubuf_info {
  * the end of the header data, ie. at skb->end.
  */
 struct skb_shared_info {
+	/* Intermediate layers must ensure that destructor_arg
+	 * remains valid until skb destructor */
+	void		*destructor_arg;
+
+	/* Warning: all fields from here until dataref are cleared in
+	 * skb_shinfo_init() (called from __alloc_skb, build_skb,
+	 * skb_recycle, etc).
+	 *
+	 * nr_frags will always be aligned to the start of a cache
+	 * line. It is intended that everything from nr_frags until at
+	 * least frags[0] (inclusive) should fit into the same 64-byte
+	 * cache line.
+	 */
 	unsigned char	nr_frags;
 	__u8		tx_flags;
 	unsigned short	gso_size;
@@ -273,15 +291,9 @@ struct skb_shared_info {
 	struct skb_shared_hwtstamps hwtstamps;
 	__be32          ip6_frag_id;
 
-	/*
-	 * Warning : all fields before dataref are cleared in __alloc_skb()
-	 */
+	/* fields from nr_frags until dataref are cleared in skb_shinfo_init */
 	atomic_t	dataref;
 
-	/* Intermediate layers must ensure that destructor_arg
-	 * remains valid until skb destructor */
-	void *		destructor_arg;
-
 	/* must be last field, see pskb_expand_head() */
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e96f68b..fab6de0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -149,8 +149,15 @@ static void skb_shinfo_init(struct sk_buff *skb)
 {
 	struct skb_shared_info *shinfo = skb_shinfo(skb);
 
+	/* Ensure that nr_frags->frags[0] (at least) fits into a
+	 * single cache line. */
+	BUILD_BUG_ON((offsetof(struct skb_shared_info, frags[1])
+		      - offsetof(struct skb_shared_info, nr_frags)) > 64);
+
 	/* make sure we initialize shinfo sequentially */
-	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
+	memset(&shinfo->nr_frags, 0,
+	       offsetof(struct skb_shared_info, dataref)
+	       - offsetof(struct skb_shared_info, nr_frags));
 	atomic_set(&shinfo->dataref, 1);
 	kmemcheck_annotate_variable(shinfo->destructor_arg);
 }
-- 
1.7.2.5

^ permalink raw reply related

* [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors
From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell
In-Reply-To: <1336056915.20716.96.camel@zakaz.uk.xensource.com>

This should be used by drivers which need to hold on to an skb for an extended
(perhaps unbounded) period of time, e.g. the tun driver, which relies on
userspace consuming the skb.
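
The idea of copying aside owned frags can be sketched in userspace as follows.
This is only a model under stated assumptions: "struct frag", "orphan_frags"
and "count_release" are hypothetical names, malloc stands in for page
allocation, and the destructor is a plain callback rather than the kernel's
refcounted structure. Like the patch, it makes all the copies first so that an
allocation failure leaves the original frags untouched.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a paged fragment. */
struct frag {
	unsigned char *buf;
	size_t len;
	void (*destructor)(void *ctx);	/* NULL when the frag has no owner */
	void *ctx;
};

/* Example destructor: counts how many frags were handed back. */
void count_release(void *ctx)
{
	++*(int *)ctx;
}

/*
 * Replace every frag that has a destructor with a private copy and
 * notify the owner. Frags without a destructor are left alone.
 * Returns 0 on success, -1 if memory could not be allocated.
 * After success the caller owns (and must free) the copied buffers.
 */
int orphan_frags(struct frag *frags, size_t n)
{
	unsigned char **copies = calloc(n ? n : 1, sizeof(*copies));
	size_t i;

	if (!copies)
		return -1;

	/* Pass 1: build every copy before touching the originals. */
	for (i = 0; i < n; i++) {
		if (!frags[i].destructor)
			continue;
		copies[i] = malloc(frags[i].len ? frags[i].len : 1);
		if (!copies[i]) {
			while (i--)
				free(copies[i]);	/* free(NULL) is a no-op */
			free(copies);
			return -1;
		}
		memcpy(copies[i], frags[i].buf, frags[i].len);
	}

	/* Pass 2: release the originals and swap in the copies. */
	for (i = 0; i < n; i++) {
		if (!frags[i].destructor)
			continue;
		frags[i].destructor(frags[i].ctx);
		frags[i].buf = copies[i];
		frags[i].destructor = NULL;
		frags[i].ctx = NULL;
	}

	free(copies);
	return 0;
}
```

A caller would set up an array of frags, point the owned ones at
count_release with an int counter as ctx, call orphan_frags() and then find
that the counter equals the number of owned frags while unowned frags still
point at their original buffers.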

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: mst@redhat.com
---
 drivers/net/tun.c      |    1 +
 include/linux/skbuff.h |   11 ++++++++
 net/core/skbuff.c      |   68 ++++++++++++++++++++++++++++++++++-------------
 3 files changed, 61 insertions(+), 19 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index bb8c72c..b53e04e 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -415,6 +415,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	/* Orphan the skb - required as we might hang on to it
 	 * for indefinite time. */
 	skb_orphan(skb);
+	skb_orphan_frags(skb, GFP_KERNEL);
 
 	/* Enqueue packet */
 	skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ccc7d93..9145f83 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1711,6 +1711,17 @@ static inline void skb_orphan(struct sk_buff *skb)
 }
 
 /**
+ *	skb_orphan_frags - orphan the frags contained in a buffer
+ *	@skb: buffer to orphan frags from
+ *	@gfp_mask: allocation mask for replacement pages
+ *
+ *	For each frag in the SKB which has a destructor (i.e. has an
+ *	owner) create a copy of that frag and release the original
+ *	page by calling the destructor.
+ */
+extern int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask);
+
+/**
  *	__skb_queue_purge - empty a list
  *	@list: list to empty
  *
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 945b807..f009abb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -697,31 +697,25 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
 }
 EXPORT_SYMBOL_GPL(skb_morph);
 
-/*	skb_copy_ubufs	-	copy userspace skb frags buffers to kernel
- *	@skb: the skb to modify
- *	@gfp_mask: allocation priority
- *
- *	This must be called on SKBTX_DEV_ZEROCOPY skb.
- *	It will copy all frags into kernel and drop the reference
- *	to userspace pages.
- *
- *	If this function is called from an interrupt gfp_mask() must be
- *	%GFP_ATOMIC.
- *
- *	Returns 0 on success or a negative error code on failure
- *	to allocate kernel memory to copy to.
+/*
+ * If uarg != NULL copy and replace all frags.
+ * If uarg == NULL then only copy and replace those which have a destructor
+ * pointer.
  */
-int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
+static int skb_copy_frags(struct sk_buff *skb, gfp_t gfp_mask,
+			  struct ubuf_info *uarg)
 {
 	int i;
 	int num_frags = skb_shinfo(skb)->nr_frags;
 	struct page *page, *head = NULL;
-	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
 
 	for (i = 0; i < num_frags; i++) {
 		u8 *vaddr;
 		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
 
+		if (!uarg && !f->page.destructor)
+			continue;
+
 		page = alloc_page(GFP_ATOMIC);
 		if (!page) {
 			while (head) {
@@ -739,11 +733,16 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 		head = page;
 	}
 
-	/* skb frags release userspace buffers */
-	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
+	/* skb frags release buffers */
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+		if (!uarg && !f->page.destructor)
+			continue;
 		skb_frag_unref(skb, i);
+	}
 
-	uarg->callback(uarg);
+	if (uarg)
+		uarg->callback(uarg);
 
 	/* skb frags point to kernel buffers */
 	for (i = skb_shinfo(skb)->nr_frags; i > 0; i--) {
@@ -752,10 +751,41 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 		head = (struct page *)head->private;
 	}
 
-	skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
 	return 0;
 }
 
+/*	skb_copy_ubufs	-	copy userspace skb frags buffers to kernel
+ *	@skb: the skb to modify
+ *	@gfp_mask: allocation priority
+ *
+ *	This must be called on SKBTX_DEV_ZEROCOPY skb.
+ *	It will copy all frags into kernel and drop the reference
+ *	to userspace pages.
+ *
+ *	If this function is called from an interrupt gfp_mask() must be
+ *	%GFP_ATOMIC.
+ *
+ *	Returns 0 on success or a negative error code on failure
+ *	to allocate kernel memory to copy to.
+ */
+int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
+{
+	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
+	int rc;
+
+	rc = skb_copy_frags(skb, gfp_mask, uarg);
+
+	if (rc == 0)
+		skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+
+	return rc;
+}
+
+int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
+{
+	return skb_copy_frags(skb, gfp_mask, NULL);
+}
+EXPORT_SYMBOL(skb_orphan_frags);
 
 /**
  *	skb_clone	-	duplicate an sk_buff
-- 
1.7.2.5

^ permalink raw reply related

* [PATCH 1/9] net: add and use SKB_ALLOCSIZE
From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell
In-Reply-To: <1336056915.20716.96.camel@zakaz.uk.xensource.com>

This gives the allocation size required for an skb containing X bytes of data

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 drivers/net/ethernet/broadcom/bnx2.c        |    7 +++----
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h |    3 +--
 drivers/net/ethernet/broadcom/tg3.c         |    3 +--
 include/linux/skbuff.h                      |   12 ++++++++++++
 net/core/skbuff.c                           |    8 +-------
 5 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
index ac7b744..62eb000 100644
--- a/drivers/net/ethernet/broadcom/bnx2.c
+++ b/drivers/net/ethernet/broadcom/bnx2.c
@@ -5321,8 +5321,7 @@ bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size)
 	/* 8 for CRC and VLAN */
 	rx_size = bp->dev->mtu + ETH_HLEN + BNX2_RX_OFFSET + 8;
 
-	rx_space = SKB_DATA_ALIGN(rx_size + BNX2_RX_ALIGN) + NET_SKB_PAD +
-		SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	rx_space = SKB_ALLOCSIZE(rx_size + BNX2_RX_ALIGN) + NET_SKB_PAD;
 
 	bp->rx_copy_thresh = BNX2_RX_COPY_THRESH;
 	bp->rx_pg_ring_size = 0;
@@ -5345,8 +5344,8 @@ bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size)
 
 	bp->rx_buf_use_size = rx_size;
 	/* hw alignment + build_skb() overhead*/
-	bp->rx_buf_size = SKB_DATA_ALIGN(bp->rx_buf_use_size + BNX2_RX_ALIGN) +
-		NET_SKB_PAD + SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	bp->rx_buf_size = SKB_ALLOCSIZE(bp->rx_buf_use_size + BNX2_RX_ALIGN) +
+		NET_SKB_PAD;
 	bp->rx_jumbo_thresh = rx_size - BNX2_RX_OFFSET;
 	bp->rx_ring_size = size;
 	bp->rx_max_ring = bnx2_find_max_ring(size, MAX_RX_RINGS);
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index e30e2a2..3586879 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -1252,8 +1252,7 @@ struct bnx2x {
 #define BNX2X_FW_RX_ALIGN_START	(1UL << BNX2X_RX_ALIGN_SHIFT)
 
 #define BNX2X_FW_RX_ALIGN_END					\
-	max(1UL << BNX2X_RX_ALIGN_SHIFT, 			\
-	    SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+	max(1UL << BNX2X_RX_ALIGN_SHIFT, SKB_ALLOCSIZE(0))
 
 #define BNX2X_PXP_DRAM_ALIGN		(BNX2X_RX_ALIGN_SHIFT - 5)
 
diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 482138e..6869f17 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -5714,8 +5714,7 @@ static int tg3_alloc_rx_data(struct tg3 *tp, struct tg3_rx_prodring_set *tpr,
 	 * Callers depend upon this behavior and assume that
 	 * we leave everything unchanged if we fail.
 	 */
-	skb_size = SKB_DATA_ALIGN(data_size + TG3_RX_OFFSET(tp)) +
-		   SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	skb_size = SKB_ALLOCSIZE(data_size + TG3_RX_OFFSET(tp));
 	if (skb_size <= TG3_FRAGSIZE) {
 		data = tg3_frag_alloc(tpr);
 		*frag_size = TG3_FRAGSIZE;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 988fc49..19e348f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -41,8 +41,20 @@
 
 #define SKB_DATA_ALIGN(X)	(((X) + (SMP_CACHE_BYTES - 1)) & \
 				 ~(SMP_CACHE_BYTES - 1))
+/* maximum data size which can fit into an allocation of X bytes */
 #define SKB_WITH_OVERHEAD(X)	\
 	((X) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+/*
+ * minimum allocation size required for an skb containing X bytes of data
+ *
+ * We do our best to align skb_shared_info on a separate cache
+ * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
+ * aligned memory blocks, unless SLUB/SLAB debug is enabled.  Both
+ * skb->head and skb_shared_info are cache line aligned.
+ */
+#define SKB_ALLOCSIZE(X)	\
+	(SKB_DATA_ALIGN((X)) + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+
 #define SKB_MAX_ORDER(X, ORDER) \
 	SKB_WITH_OVERHEAD((PAGE_SIZE << (ORDER)) - (X))
 #define SKB_MAX_HEAD(X)		(SKB_MAX_ORDER((X), 0))
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 52ba2b5..a056d7c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -182,13 +182,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 		goto out;
 	prefetchw(skb);
 
-	/* We do our best to align skb_shared_info on a separate cache
-	 * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
-	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
-	 * Both skb->head and skb_shared_info are cache line aligned.
-	 */
-	size = SKB_DATA_ALIGN(size);
-	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	size = SKB_ALLOCSIZE(size);
 	data = kmalloc_node_track_caller(size, gfp_mask, node);
 	if (!data)
 		goto nodata;
-- 
1.7.2.5

^ permalink raw reply related

* [PATCH v5 0/9] skb paged fragment destructors
From: Ian Campbell @ 2012-05-03 14:55 UTC (permalink / raw)
  To: netdev@vger.kernel.org
  Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, David VomLehn,
	Bart Van Assche, xen-devel, Ian Campbell, Alexander Duyck

The following series makes use of the skb fragment API (which is in 3.2+)
to add a per-paged-fragment destructor callback. This can be used by
creators of skbs who are interested in the lifecycle of the pages
included in that skb after they have handed it off to the network stack.

The mail at [0] contains some more background and rationale but
basically the completed series will allow entities which inject pages
into the networking stack to receive a notification when the stack has
really finished with those pages (i.e. including retransmissions,
clones, pull-ups etc) and not just when the original skb is finished
with, which is beneficial to many subsystems which wish to inject pages
into the network stack without giving up full ownership of those pages'
lifecycle. It implements something broadly along the lines of what was
described in [1].
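
To make the "really finished" notification concrete, here is a small
userspace model of a reference-counted destructor. It is a sketch only: the
type and function names are hypothetical, the counter is a plain int rather
than an atomic, and treating a NULL destructor as "no owner" is an assumption
of the model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a per-fragment destructor. */
struct frag_destructor {
	int ref;				/* outstanding references */
	void (*destroy)(struct frag_destructor *d);
	int fired;				/* times destroy() has run */
};

/* Example callback: records that the owner was notified. */
void mark_fired(struct frag_destructor *d)
{
	d->fired++;
}

/* Take a reference, e.g. when a frag is added to an skb or cloned. */
void destructor_ref(struct frag_destructor *d)
{
	if (d)
		d->ref++;
}

/* Drop a reference; the owner is notified only when the last one goes. */
void destructor_unref(struct frag_destructor *d)
{
	if (!d)
		return;
	assert(d->ref > 0);
	if (--d->ref == 0)
		d->destroy(d);
}
```

In this model the creator starts with one reference, the original skb and
each clone take one more, and the callback runs only after every holder,
including the creator, has dropped its reference.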

I have also included a patch to the RPC subsystem which uses this API to
fix the bug which I describe at [2].

I've also had some interest from David VomLehn and Bart Van Assche
regarding using this functionality in the context of vmsplice and iSCSI
targets respectively (I think).

Changes since last time:

      * The big change is that the patches now explicitly align the
        "nr_frags" member of the shinfo, as suggested by Alexander
        Duyck. This ensures that the placement is optimal irrespective
        of page size (in particular the variation of MAX_SKB_FRAGS). For
        4k pages it is still the case that a maximum MTU frame +
        SKB_PAD + shinfo fits within 2048 bytes.
              * As part of the preceding I squashed the patches
                manipulating the shinfo layout and alignment into a
                single patch (which is far more coherent than the
                piecemeal approach used previously)
      * I crushed "net: only allow paged fragments with the same
        destructor to be coalesced." into the baseline patch (Ben
        Hutchings)
      * Added and used skb_shinfo_init to centralise several copies of
        that code.
      * Reduced CC list on "net: add paged frag destructor support to
        kernel_sendpage", it was rather long and seemed a bit overly
        spammy on the non-netdev recipients.

Changes since time before:

      * Added skb_orphan_frags API for the use of recipients of SKBs who
        may hold onto the SKB for a long time (this is analogous to
        skb_orphan). This was pointed out by Michael. The TUN driver is
        currently the only user.
              * I can't for the life of me get anything to actually hit
                this code path. I've been trying with an NFS server
                running in a Xen HVM domain with emulated (e.g. tap)
                networking and a client in domain 0, using the NFS fix
                in this series which generates SKBs with destructors
                set, so far -- nothing. I suspect that lack of TSO/GSO
                etc on the TAP interface is causing the frags to be
                copied to normal pages during skb_segment().
      * Various fixups related to the change of alignment/padding in
        shinfo, in particular to build_skb as pointed out by Eric.
      * Tweaked ordering of shinfo members to ensure that all hotpath
        variables up to and including the first frag fit within (and are
        aligned to) a single 64 byte cache line. (Eric again)

I ran a single-threaded UDP benchmark (similar to that described by Eric in
e52fcb2462ac) and don't see any difference in pps throughput; it was
~810,000 pps both before and after.

Cheers,
Ian.

[0] http://marc.info/?l=linux-netdev&m=131072801125521&w=2
[1] http://marc.info/?l=linux-netdev&m=130925719513084&w=2
[2] http://marc.info/?l=linux-nfs&m=122424132729720&w=2

^ permalink raw reply

* Re: [PATCH v3 1/2] vhost-net: fix handle_rx buffer size
From: Basil Gor @ 2012-05-03 14:43 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Eric W. Biederman, David S. Miller, netdev
In-Reply-To: <20120503131623.GA26705@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 1949 bytes --]

On Thu, May 03, 2012 at 04:16:24PM +0300, Michael S. Tsirkin wrote:
> On Wed, Apr 25, 2012 at 09:01:15PM +0400, Basil Gor wrote:
> > Take vlan header length into account, when vlan id is stored as
> > vlan_tci. Otherwise tagged packets comming from macvtap will be
> > truncated.
> > 
> > Signed-off-by: Basil Gor <basil.gor@gmail.com>
> 
> So I'm inclined to apply these two patches, we
> this doesn't fix packet socket backend
> but could be fixed by a follow-up patch.
> 

That's what I'm going to do.

While testing packet socket I noticed that tcpdump doesn't work
on macvtap0, I think because there is no dev_hard_start_xmit as in
the tun/tap0 case (lines 120-144 in the attached trace). I don't yet
have a clear picture of how to fix this gracefully.

Also I think there are issues with macvtap on top of bonding, that
I'm also going to verify and debug.

> > ---
> >  drivers/vhost/net.c |    7 ++++++-
> >  1 files changed, 6 insertions(+), 1 deletions(-)
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 1f21d2a..5c17010 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -24,6 +24,7 @@
> >  #include <linux/if_arp.h>
> >  #include <linux/if_tun.h>
> >  #include <linux/if_macvlan.h>
> > +#include <linux/if_vlan.h>
> >  
> >  #include <net/sock.h>
> >  
> > @@ -283,8 +284,12 @@ static int peek_head_len(struct sock *sk)
> >  
> >  	spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
> >  	head = skb_peek(&sk->sk_receive_queue);
> > -	if (likely(head))
> > +	if (likely(head)) {
> >  		len = head->len;
> > +		if (vlan_tx_tag_present(head))
> > +			len += VLAN_HLEN;
> > +	}
> > +
> >  	spin_unlock_irqrestore(&sk->sk_receive_queue.lock, flags);
> >  	return len;
> >  }
> > -- 
> > 1.7.6.5
> > 

[-- Attachment #2: tap_trace.log --]
[-- Type: text/plain, Size: 21099 bytes --]

single arp packet receive

		br0
		^ \
		|  +->tap0 <-- tapread + tcpdump
->wlan0-+->macvlan0 
         \->macvtap0 <-- macvtapread + tcpdump

001       0 irq/28-b43(25823): -> netpoll_trap()
002    0xffffffff815209c0 : netpoll_trap+0x0/0x20 [kernel]
003    0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
004    0xffffffffa0260a8a [mac80211]
005    0xffffffffa0260ac0 [mac80211]
006    0xffffffffa0318b02 [b43]
007    0xffffffff810e4b60 : irq_thread_fn+0x0/0x50 [kernel]
008    0xffffffffa0313330 [b43]
009    0xffffffffa02f71d6 [b43]
010    0xffffffffa02f7486 [b43]
011    0xffffffff810e4b89 : irq_thread_fn+0x29/0x50 [kernel]
012    0xffffffff810e4ae0 : irq_thread+0x1a0/0x220 [kernel]
013    0xffffffff810e4940 : irq_thread+0x0/0x220 [kernel]
014    0xffffffff81079da3 : kthread+0x93/0xa0 [kernel]
015    0xffffffff81620f24 : kernel_thread_helper+0x4/0x10 [kernel]
016    0xffffffff81079d10 : kthread+0x0/0xa0 [kernel]
017    0xffffffff81620f20 : kernel_thread_helper+0x0/0x10 [kernel]
018      480 irq/28-b43(25823): <- netpoll_trap(): return=0x0
019        0 irq/28-b43(25823): -> netif_receive_skb(skb=0xffff8800a8183300)
020   	skb_dump:dev:wlan0 proto:8100 len:32 vlan_tci:{prio:0 cfi:0 vid:0}
021    0xffffffff8150b360 : netif_receive_skb+0x0/0x90 [kernel]
022    0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
023    0xffffffffa02584d6 [mac80211]
024    0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
025    0xffffffffa02598f6 [mac80211]
026    0xffffffffa025a56e [mac80211]
027    0xffffffffa0312fd4 [b43]
028    0xffffffffa0318d8a [b43]
029    0xffffffff810e4b60 : irq_thread_fn+0x0/0x50 [kernel]
030    0xffffffffa02f71fd [b43]
031    0xffffffffa02f7486 [b43]
032    0xffffffff810e4b89 : irq_thread_fn+0x29/0x50 [kernel]
033    0xffffffff810e4ae0 : irq_thread+0x1a0/0x220 [kernel]
034    0xffffffff810e4940 : irq_thread+0x0/0x220 [kernel]
035    0xffffffff81079da3 : kthread+0x93/0xa0 [kernel]
036    0xffffffff81620f24 : kernel_thread_helper+0x4/0x10 [kernel]
037    0xffffffff81079d10 : kthread+0x0/0xa0 [kernel]
038    0xffffffff81620f20 : kernel_thread_helper+0x0/0x10 [kernel]
039      551 irq/28-b43(25823):  -> __netif_receive_skb(skb=0xffff8800a8183300 ptype=? pt_prev=? rx_handler=? orig_dev=? null_or_dev=? deliver_exact=? ret=? type=?)
040   	skb_dump:dev:wlan0 proto:8100 len:32 vlan_tci:{prio:0 cfi:0 vid:0}
041      587 irq/28-b43(25823):   -> vlan_untag(skb=0xffff8800a8183300 vhdr=? vlan_tci=?)
042   	skb_dump:dev:wlan0 proto:8100 len:32 vlan_tci:{prio:0 cfi:0 vid:0}
043      616 irq/28-b43(25823):   <- vlan_untag(): return=0xffff8800a8183300
044      639 irq/28-b43(25823):   -> packet_rcv(skb=0xffff8800a8183300 dev=0xffff8800aa24b000 pt=0xffff8800276f5cc0 orig_dev=0xffff8800aa24b000 sk=? sll=? po=? skb_head=? skb_len=? snaplen=? res=?)
045   	skb_dump:dev:wlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
046   	dev:name:wlan0
047   	orig_dev:name:wlan0
048      690 irq/28-b43(25823):   <- packet_rcv(): return=0x0
049      705 irq/28-b43(25823):   -> vlan_do_receive(skbp=0xffff880045c1fa08 last_handler=0x0 skb=? vlan_id=? vlan_dev=0xffff8800aa24b000 rx_stats=?)
050   	skbp_dump:dev:wlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
051      738 irq/28-b43(25823):   <- vlan_do_receive(): return=0x0
052      755 irq/28-b43(25823):   -> macvlan_handle_frame(pskb=0xffff880045c1fa08 port=? skb=? eth=? vlan=? src=0xffff8800aa24b000 dev=? len=? ret=?)
053   	pskb_dump:dev:wlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
054      794 irq/28-b43(25823):    -> macvlan_broadcast(skb=0xffff8800a8183300 port=0xffff8800a9466000 src=0x0 mode=0xf eth=0xffff880045c1f9f0 vlan=? n=? nskb=? i=? err=0x0)
055   	skb_dump:dev:wlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
056      843 irq/28-b43(25823):     -> netif_rx(skb=0xffff8800a7012e00 ret=0xffffffffa7012e00)
057   	skb_dump:dev:macvlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
058      877 irq/28-b43(25823):      -> enqueue_to_backlog(skb=0xffff8800a7012e00 cpu=0x0 qtail=0xffff880045c1f930 sd=? flags=?)
059   	skb_dump:dev:macvlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
060      930 irq/28-b43(25823):      <- enqueue_to_backlog(): return=0x0
061      942 irq/28-b43(25823):     <- netif_rx(): return=0x0
062      962 irq/28-b43(25823):     -> macvtap_receive(skb=0xffff8800136a7200)
063   	skb_dump:dev:macvtap0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
064      992 irq/28-b43(25823):      -> macvtap_forward(dev=0xffff8800a9467000 skb=0xffff8800136a7200 q=0x0)
065   	skb_dump:dev:macvtap0 proto:0806 len:42 vlan_tci:{prio:0 cfi:4096 vid:50}
066   	dev:name:macvtap0
067     1030 irq/28-b43(25823):       -> __skb_get_rxhash(skb=0xffff8800136a7200 keys={...} hash=?)
068   	skb_dump:dev:macvtap0 proto:0806 len:42 vlan_tci:{prio:0 cfi:4096 vid:50}
069     1058 irq/28-b43(25823):       <- __skb_get_rxhash(): 
070     1080 irq/28-b43(25823):      <- macvtap_forward(): return=0x0
071     1092 irq/28-b43(25823):     <- macvtap_receive(): return=0x0
072     1105 irq/28-b43(25823):    <- macvlan_broadcast(): 
073     1117 irq/28-b43(25823):   <- macvlan_handle_frame(): return=0x3
074     1149 irq/28-b43(25823):  <- __netif_receive_skb(): return=0x0
075     1161 irq/28-b43(25823): <- netif_receive_skb(): return=0x0
076        0 irq/28-b43(25823): -> net_rx_action(h=0xffffffff81c04098 sd=? time_limit=? budget=? have=0x3)
077    0xffffffff8150bc00 : net_rx_action+0x0/0x270 [kernel]
078    0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
079    0xffffffff8162101c : call_softirq+0x1c/0x30 [kernel]
080    0xffffffff81016455 : do_softirq+0x65/0xa0 [kernel]
081    0xffffffff8105e654 : local_bh_enable+0x94/0xa0 [kernel]
082    0xffffffffa0312fd9 [b43]
083    0xffffffffa0318d8a [b43]
084    0xffffffff810e4b60 : irq_thread_fn+0x0/0x50 [kernel]
085    0xffffffffa02f71fd [b43]
086    0xffffffffa02f7486 [b43]
087    0xffffffff810e4b89 : irq_thread_fn+0x29/0x50 [kernel]
088    0xffffffff810e4ae0 : irq_thread+0x1a0/0x220 [kernel]
089    0xffffffff810e4940 : irq_thread+0x0/0x220 [kernel]
090    0xffffffff81079da3 : kthread+0x93/0xa0 [kernel]
091    0xffffffff81620f24 : kernel_thread_helper+0x4/0x10 [kernel]
092    0xffffffff81079d10 : kthread+0x0/0xa0 [kernel]
093    0xffffffff81620f20 : kernel_thread_helper+0x0/0x10 [kernel]
094      506 irq/28-b43(25823):  -> process_backlog(napi=0xffff8800afc14118 quota=0x40 work=? sd=?)
095      530 irq/28-b43(25823):   -> __netif_receive_skb(skb=0xffff8800a7012e00 ptype=? pt_prev=? rx_handler=? orig_dev=? null_or_dev=? deliver_exact=? ret=? type=?)
096   	skb_dump:dev:macvlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
097      561 irq/28-b43(25823):    -> vlan_do_receive(skbp=0xffff8800afc03e30 last_handler=0x0 skb=? vlan_id=? vlan_dev=0xffff880088528000 rx_stats=?)
098   	skbp_dump:dev:macvlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
099      591 irq/28-b43(25823):    <- vlan_do_receive(): return=0x0
100      626 irq/28-b43(25823):    -> br_flood_forward(br=0xffff8800a995e780 skb=0xffff8800a7012e00 skb2=0xffff8800a7012e00)
101   	skb_dump:dev:macvlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
102      667 irq/28-b43(25823):     -> br_flood(br=0xffff8800a995e780 skb=0xffff8800a7012e00 skb0=0xffff8800a7012e00 __packet_hook=0xffffffffa040f380 p=? prev=?)
103   	skb_dump:dev:macvlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
104      711 irq/28-b43(25823):      -> maybe_deliver(prev=0x0 p=0xffff8800a85bec00 skb=0xffff8800a7012e00 __packet_hook=0xffffffffa040f380 err=?)
105   	skb_dump:dev:macvlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
106      766 irq/28-b43(25823):      <- maybe_deliver(): return=0xffff8800a85bec00
107      784 irq/28-b43(25823):      -> maybe_deliver(prev=0xffff8800a85bec00 p=0xffff8800a7972800 skb=0xffff8800a7012e00 __packet_hook=0xffffffffa040f380 err=?)
108   	skb_dump:dev:macvlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
109      820 irq/28-b43(25823):      <- maybe_deliver(): return=0xffff8800a85bec00
110      839 irq/28-b43(25823):      -> deliver_clone(prev=0xffff8800a85bec00 skb=0xffff8800a7012e00 __packet_hook=0xffffffffa040f380 dev=?)
111   	skb_dump:dev:macvlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
112      879 irq/28-b43(25823):       -> __br_forward(to=0xffff8800a85bec00 skb=0xffff8800a8183300 indev=?)
113   	skb_dump:dev:macvlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
114      914 irq/28-b43(25823):        -> br_forward_finish(skb=0xffff8800a8183300)
115   	skb_dump:dev:tap0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
116      943 irq/28-b43(25823):         -> br_dev_queue_push_xmit(skb=0xffff8800a8183300)
117   	skb_dump:dev:tap0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
118      973 irq/28-b43(25823):          -> dev_queue_xmit(skb=0xffff8800a8183300 dev=? txq=0xffffffffa040f380 q=? rc=?)
119   	skb_dump:dev:tap0 proto:0806 len:42 vlan_tci:{prio:0 cfi:4096 vid:50}
120     1010 irq/28-b43(25823):           -> dev_hard_start_xmit(skb=0xffff8800a8183300 dev=0xffff8800a9866000 txq=0xffff88008864fa00 ops=? rc=? skb_len=?)
121   	skb_dump:dev:tap0 proto:0806 len:42 vlan_tci:{prio:0 cfi:4096 vid:50}
122   	dev:name:tap0
123     1056 irq/28-b43(25823):            -> tpacket_rcv(skb=0xffff8800a9849600 dev=0xffff8800a9866000 pt=0xffff8800662994c0 orig_dev=0xffff8800a9866000 sk=? po=? sll=? h={...} skb_head=? skb_len=? snaplen=? res=? status=? macoff=? netoff=? hdrlen=? copy_skb=? tv={...} ts={...} shhwtstamps=?)
124   	skb_dump:dev:tap0 proto:0806 len:42 vlan_tci:{prio:0 cfi:4096 vid:50}
125   	dev:name:tap0
126   	orig_dev:name:tap0
127     1109 irq/28-b43(25823):             -> packet_lookup_frame(po=0xffff880066299000 rb=0xffff8800662992a0 position=0x6 status=0x0 pg_vec_pos=? frame_offset=? h={...})
128     1137 irq/28-b43(25823):              -> __packet_get_status(po=0xffff880066299000 frame=0xffff880019b60000 h={...})
129     1159 irq/28-b43(25823):              <- __packet_get_status(): return=0x0
130     1171 irq/28-b43(25823):             <- packet_lookup_frame(): return=0xffff880019b60000
131     1192 irq/28-b43(25823):             -> __packet_set_status(po=0xffff880066299000 frame=0xffff880019b60000 status=0x11 h={...})
132     1217 irq/28-b43(25823):             <- __packet_set_status(): 
133     1240 irq/28-b43(25823):            <- tpacket_rcv(): return=0x0
134     1255 irq/28-b43(25823):            -> netif_skb_features(skb=0xffff8800a8183300 protocol=? features=?)
135   	skb_dump:dev:tap0 proto:0806 len:42 vlan_tci:{prio:0 cfi:4096 vid:50}
136     1286 irq/28-b43(25823):             -> harmonize_features(skb=0xffff8800a8183300 protocol=0x608 features=0x0)
137   	skb_dump:dev:tap0 proto:0806 len:42 vlan_tci:{prio:0 cfi:4096 vid:50}
138     1314 irq/28-b43(25823):             <- harmonize_features(): return=0x0
139     1326 irq/28-b43(25823):            <- netif_skb_features(): return=0x0
140     1348 irq/28-b43(25823):            -> tun_net_xmit(skb=0xffff8800a8183300 dev=0xffff8800a9866000 tun=?)
141   	skb_dump:dev:tap0 proto:8100 len:46 vlan_tci:{prio:0 cfi:0 vid:0}
142   	dev:name:tap0
143     1388 irq/28-b43(25823):            <- tun_net_xmit(): return=0x0
144     1401 irq/28-b43(25823):           <- dev_hard_start_xmit(): return=0x0
145     1414 irq/28-b43(25823):          <- dev_queue_xmit(): return=0x0
146     1426 irq/28-b43(25823):         <- br_dev_queue_push_xmit(): return=0x0
147     1438 irq/28-b43(25823):        <- br_forward_finish(): return=0x0
148     1450 irq/28-b43(25823):       <- __br_forward(): 
149     1461 irq/28-b43(25823):      <- deliver_clone(): return=0x0
150     1473 irq/28-b43(25823):     <- br_flood(): 
151     1484 irq/28-b43(25823):    <- br_flood_forward(): 
152     1499 irq/28-b43(25823):    -> netif_receive_skb(skb=0xffff8800a7012e00)
153   	skb_dump:dev:br0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
154     1524 irq/28-b43(25823):     -> __netif_receive_skb(skb=0xffff8800a7012e00 ptype=? pt_prev=? rx_handler=? orig_dev=? null_or_dev=? deliver_exact=? ret=? type=?)
155   	skb_dump:dev:br0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
156     1553 irq/28-b43(25823):      -> vlan_do_receive(skbp=0xffff8800afc03d20 last_handler=0x1 skb=? vlan_id=? vlan_dev=0xffff8800a995e000 rx_stats=?)
157   	skbp_dump:dev:br0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
158     1583 irq/28-b43(25823):      <- vlan_do_receive(): return=0x0
159     1596 irq/28-b43(25823):     <- __netif_receive_skb(): return=0x0
160     1608 irq/28-b43(25823):    <- netif_receive_skb(): return=0x0
161     1619 irq/28-b43(25823):   <- __netif_receive_skb(): return=0x1
162     1631 irq/28-b43(25823):  <- process_backlog(): return=0x1
163     1646 irq/28-b43(25823):  -> net_rps_action_and_irq_enable(sd=0xffff8800afc14040 remsd=?)
164     1666 irq/28-b43(25823):  <- net_rps_action_and_irq_enable(): 
165     1678 irq/28-b43(25823): <- net_rx_action(): 
166        0 macvtapread(13215): -> macvtap_poll(file=0xffff88006bb35c00 wait=0xffff8800a86a9ab8 q=? mask=?)
167    0xffffffffa03f6000 : macvtap_poll+0x0/0xa0 [macvtap]
168    0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
169    0xffffffff8119561c : core_sys_select+0x1ec/0x370 [kernel]
170    0xffffffff81195860 : sys_select+0xc0/0x100 [kernel]
171    0xffffffff8161fb69 : system_call_fastpath+0x16/0x1b [kernel]
172      206 macvtapread(13215): <- macvtap_poll(): return=0x145
173        0 macvtapread(13215): -> macvtap_aio_read(iocb=0xffff8800a86a9e00 iv=0xffff8800a86a9ed8 count=0x1 pos=0x0 file=? q=? len=? ret=?)
174    0xffffffffa03f65e0 : macvtap_aio_read+0x0/0x80 [macvtap]
175    0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
176    0xffffffff81182305 : vfs_read+0x165/0x180 [kernel]
177    0xffffffff8118236a : sys_read+0x4a/0x90 [kernel]
178    0xffffffff8161fb69 : system_call_fastpath+0x16/0x1b [kernel]
179      193 macvtapread(13215):  -> macvtap_do_read(q=0xffff88001a62f800 iocb=0xffff8800a86a9e00 iv=0xffff8800a86a9ed8 len=0x5de noblock=0x0 wait={...} skb=? ret=?)
180      233 macvtapread(13215):  <- macvtap_do_read(): return=0x34
181      246 macvtapread(13215): <- macvtap_aio_read(): return=0x34
182        0 macvtapread(13215): -> macvtap_poll(file=0xffff88006bb35c00 wait=0xffff8800a86a9ab8 q=? mask=?)
183    0xffffffffa03f6000 : macvtap_poll+0x0/0xa0 [macvtap]
184    0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
185    0xffffffff8119561c : core_sys_select+0x1ec/0x370 [kernel]
186    0xffffffff81195860 : sys_select+0xc0/0x100 [kernel]
187    0xffffffff8161fb69 : system_call_fastpath+0x16/0x1b [kernel]
188      190 macvtapread(13215): <- macvtap_poll(): return=0x104
189        0 tcpdump(tap0): -> packet_poll(file=0xffff8800885f0e00 sock=0xffff88001904a000 wait=0xffff880017601b88 sk=? po=? mask=?)
190    0xffffffff815e51f0 : packet_poll+0x0/0x130 [kernel]
191    0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
192    0xffffffff81195ceb : do_sys_poll+0x25b/0x4c0 [kernel]
193    0xffffffff8119602b : sys_poll+0x6b/0x100 [kernel]
194    0xffffffff8161fb69 : system_call_fastpath+0x16/0x1b [kernel]
195      223 tcpdump(tap0):  -> packet_lookup_frame(po=0xffff880066299000 rb=0xffff8800662992a0 position=0x6 status=0x0 pg_vec_pos=? frame_offset=? h={...})
196      249 tcpdump(tap0):   -> __packet_get_status(po=0xffff880066299000 frame=0xffff880019b60000 h={...})
197      269 tcpdump(tap0):   <- __packet_get_status(): return=0x11
198      282 tcpdump(tap0):  <- packet_lookup_frame(): return=0x0
199      294 tcpdump(tap0): <- packet_poll(): return=0x345
200        0 tcpdump(tap0): -> packet_poll(file=0xffff8800885f0e00 sock=0xffff88001904a000 wait=0xffff880017601b88 sk=? po=? mask=?)
201    0xffffffff815e51f0 : packet_poll+0x0/0x130 [kernel]
202    0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
203    0xffffffff81195ceb : do_sys_poll+0x25b/0x4c0 [kernel]
204    0xffffffff8119602b : sys_poll+0x6b/0x100 [kernel]
205    0xffffffff8161fb69 : system_call_fastpath+0x16/0x1b [kernel]
206      204 tcpdump(tap0):  -> packet_lookup_frame(po=0xffff880066299000 rb=0xffff8800662992a0 position=0x6 status=0x0 pg_vec_pos=? frame_offset=? h={...})
207      229 tcpdump(tap0):   -> __packet_get_status(po=0xffff880066299000 frame=0xffff880019b60000 h={...})
208      250 tcpdump(tap0):   <- __packet_get_status(): return=0x0
209      262 tcpdump(tap0):  <- packet_lookup_frame(): return=0xffff880019b60000
210      277 tcpdump(tap0): <- packet_poll(): return=0x304
211        0 tapread(13310): -> tun_chr_poll(file=0xffff880002451f00 wait=0xffff8800a7435ab8 tfile=? tun=? sk=? mask=?)
212    0xffffffffa042f4a0 : tun_chr_poll+0x0/0x110 [tun]
213    0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
214    0xffffffff8119561c : core_sys_select+0x1ec/0x370 [kernel]
215    0xffffffff81195860 : sys_select+0xc0/0x100 [kernel]
216    0xffffffff8161fb69 : system_call_fastpath+0x16/0x1b [kernel]
217      196 tapread(13310):  -> __tun_get(tfile=0xffff88000456afc0 tun=?)
218      215 tapread(13310):  <- __tun_get(): return=0xffff8800a9866780
219      234 tapread(13310):  -> tun_put(tun=0xffff8800a9866780 tfile=?)
220      252 tapread(13310):  <- tun_put(): 
221      262 tapread(13310): <- tun_chr_poll(): return=0x145
222        0 tapread(13310): -> tun_chr_aio_read(iocb=0xffff8800a7435e00 iv=0xffff8800a7435ed8 count=0x1 pos=0x0 file=? tfile=? tun=? len=? ret=?)
223    0xffffffffa042f6a0 : tun_chr_aio_read+0x0/0xe0 [tun]
224    0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
225    0xffffffff81182250 : vfs_read+0xb0/0x180 [kernel]
226    0xffffffff8118236a : sys_read+0x4a/0x90 [kernel]
227    0xffffffff8161fb69 : system_call_fastpath+0x16/0x1b [kernel]
228      213 tapread(13310):  -> __tun_get(tfile=0xffff88000456afc0 tun=?)
229      230 tapread(13310):  <- __tun_get(): return=0xffff8800a9866780
230      248 tapread(13310):  -> tun_do_read(tun=0xffff8800a9866780 iocb=0xffff8800a7435e00 iv=0xffff8800a7435ed8 len=0x5de noblock=0x0 wait={...} skb=? ret=?)
231      284 tapread(13310):   -> netpoll_trap()
232      295 tapread(13310):   <- netpoll_trap(): return=0x0
233      312 tapread(13310):  <- tun_do_read(): return=0x2e
234      325 tapread(13310):  -> tun_put(tun=0xffff8800a9866780 tfile=?)
235      340 tapread(13310):  <- tun_put(): 
236      350 tapread(13310): <- tun_chr_aio_read(): return=0x2e
237        0 tapread(13310): -> tun_chr_poll(file=0xffff880002451f00 wait=0xffff8800a7435ab8 tfile=? tun=? sk=? mask=?)
238    0xffffffffa042f4a0 : tun_chr_poll+0x0/0x110 [tun]
239    0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
240    0xffffffff8119561c : core_sys_select+0x1ec/0x370 [kernel]
241    0xffffffff81195860 : sys_select+0xc0/0x100 [kernel]
242    0xffffffff8161fb69 : system_call_fastpath+0x16/0x1b [kernel]
243      192 tapread(13310):  -> __tun_get(tfile=0xffff88000456afc0 tun=?)
244      209 tapread(13310):  <- __tun_get(): return=0xffff8800a9866780
245      225 tapread(13310):  -> tun_put(tun=0xffff8800a9866780 tfile=?)
246      241 tapread(13310):  <- tun_put(): 
247      251 tapread(13310): <- tun_chr_poll(): return=0x104
248        0 tcpdump(macvtap0): -> packet_poll(file=0xffff880027793a00 sock=0xffff8800a925c500 wait=0xffff8800a9a99b88 sk=? po=? mask=?)
249    0xffffffff815e51f0 : packet_poll+0x0/0x130 [kernel]
250    0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
251    0xffffffff81195ceb : do_sys_poll+0x25b/0x4c0 [kernel]
252    0xffffffff8119602b : sys_poll+0x6b/0x100 [kernel]
253    0xffffffff8161fb69 : system_call_fastpath+0x16/0x1b [kernel]
254      242 tcpdump(macvtap0):  -> packet_lookup_frame(po=0xffff88003dd2b000 rb=0xffff88003dd2b2a0 position=0x1e status=0x0 pg_vec_pos=? frame_offset=? h={...})
255      271 tcpdump(macvtap0):   -> __packet_get_status(po=0xffff88003dd2b000 frame=0xffff88003c180000 h={...})
256      294 tcpdump(macvtap0):   <- __packet_get_status(): return=0x0
257      308 tcpdump(macvtap0):  <- packet_lookup_frame(): return=0xffff88003c180000
258      324 tcpdump(macvtap0): <- packet_poll(): return=0x304
259        0 tcpdump(macvtap0): -> packet_poll(file=0xffff880027793a00 sock=0xffff8800a925c500 wait=0xffff8800a9a99b88 sk=? po=? mask=?)
260    0xffffffff815e51f0 : packet_poll+0x0/0x130 [kernel]
261    0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
262    0xffffffff81195ceb : do_sys_poll+0x25b/0x4c0 [kernel]
263    0xffffffff8119602b : sys_poll+0x6b/0x100 [kernel]
264    0xffffffff8161fb69 : system_call_fastpath+0x16/0x1b [kernel]
265      193 tcpdump(macvtap0):  -> packet_lookup_frame(po=0xffff88003dd2b000 rb=0xffff88003dd2b2a0 position=0x1e status=0x0 pg_vec_pos=? frame_offset=? h={...})
266      217 tcpdump(macvtap0):   -> __packet_get_status(po=0xffff88003dd2b000 frame=0xffff88003c180000 h={...})
267      237 tcpdump(macvtap0):   <- __packet_get_status(): return=0x0
268      248 tcpdump(macvtap0):  <- packet_lookup_frame(): return=0xffff88003c180000
269      264 tcpdump(macvtap0): <- packet_poll(): return=0x304

[-- Attachment #3: tun-trace.stp --]
[-- Type: text/plain, Size: 2747 bytes --]

%{
#include <linux/skbuff.h>
#include <linux/if_vlan.h>
%}

function skb_dump:string(skb_ptr:long) %{
	struct sk_buff *skb = (void*)THIS->skb_ptr;
	if (!skb) {
		snprintf(THIS->__retvalue, MAXSTRINGLEN, "skb=NULL");
		return;
	}

	snprintf(THIS->__retvalue, MAXSTRINGLEN, "dev:%s proto:%04x len:%u vlan_tci:{prio:%d cfi:%d vid:%d}",
		skb->dev ? skb->dev->name : "NULL",
		htons(skb->protocol),
		skb->len,
		skb->vlan_tci & VLAN_PRIO_MASK,
		skb->vlan_tci & VLAN_CFI_MASK,
		skb->vlan_tci & VLAN_VID_MASK);
%}

function dev_dump:string(dev_ptr:long) %{
	struct net_device *dev = (void*)THIS->dev_ptr;
	if (!dev) {
		snprintf(THIS->__retvalue, MAXSTRINGLEN, "dev=NULL");
		return;
	}

	snprintf(THIS->__retvalue, MAXSTRINGLEN, "name:%s", dev->name);
%}

function deref_unsafe:long(pskb_ptr:long) %{
	struct sk_buff **pskb = (void*)THIS->pskb_ptr;
	if (!pskb || !(*pskb))
		THIS->__retvalue = 0;
	else
		THIS->__retvalue = (long)*pskb;
%}

probe begin {
	printf ("started\n")
}

global nesting = 0

probe probepoints.call = module("tun").function("*@drivers/net/tun.c").call,
			module("macvlan").function("*@drivers/net/macvlan.c").call,
			module("macvtap").function("*@drivers/net/macvtap.c").call,
			module("bridge").function("*@net/bridge/br_device.c").call,
			module("bridge").function("*@net/bridge/br_forward.c").call,
			kernel.function("vlan_*").call,
			kernel.function("__vlan_*").call,
			kernel.function("*@net/packet/af_packet.c").call,
			kernel.function("*@net/core/netpoll.c").call,
			kernel.function("*@net/core/dev.c").call {
	nesting++
}

probe probepoints.return = module("tun").function("*@drivers/net/tun.c").return,
			module("macvlan").function("*@drivers/net/macvlan.c").return,
			module("macvtap").function("*@drivers/net/macvtap.c").return,
			module("bridge").function("*@net/bridge/br_device.c").return,
			module("bridge").function("*@net/bridge/br_forward.c").return,
			kernel.function("vlan_*").return,
			kernel.function("__vlan_*").return,
			kernel.function("*@net/packet/af_packet.c").return,
			kernel.function("*@net/core/netpoll.c").return,
			kernel.function("*@net/core/dev.c").return {
	nesting--
}

probe probepoints.call {
	printf ("%s -> %s(%s)\n", thread_indent(1), probefunc(), $$vars)
	if (@defined($skb))
		printf("\tskb_dump:%s\n", skb_dump($skb))
	if (@defined($pskb))
		printf("\tpskb_dump:%s\n", skb_dump(deref_unsafe($pskb)))
	if (@defined($skbp))
		printf("\tskbp_dump:%s\n", skb_dump(deref_unsafe($skbp)))
	if (@defined($dev))
		printf("\tdev:%s\n", dev_dump($dev))
	if (@defined($orig_dev))
		printf("\torig_dev:%s\n", dev_dump($orig_dev))
	if (nesting == 1)
		print_stack(backtrace())
}

probe probepoints.return {
	printf ("%s <- %s(): %s\n", thread_indent(-1), probefunc(), $$return)
}
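
A note on reading the vlan_tci fields in the log above: skb_dump() prints
the masked but unshifted bits, which is why the trace shows cfi:4096
(0x1000) rather than cfi:1. Below is a minimal user-space sketch of
decoding an 802.1Q TCI into its three fields. The names tci_decode,
struct tci_fields and the TCI_* macros are made up for this illustration
and are not kernel identifiers.

```c
#include <stdint.h>

/* 802.1Q TCI layout: priority (3 bits) | CFI/DEI (1 bit) | VLAN ID (12 bits).
 * The macro and function names here are hypothetical, chosen for this sketch. */
#define TCI_PRIO_MASK  0xe000u
#define TCI_PRIO_SHIFT 13
#define TCI_CFI_MASK   0x1000u
#define TCI_VID_MASK   0x0fffu

struct tci_fields {
	unsigned int prio;	/* 0..7 */
	unsigned int cfi;	/* 0 or 1 */
	unsigned int vid;	/* 0..4095 */
};

/* Split a host-order TCI value into shifted, human-readable fields. */
static struct tci_fields tci_decode(uint16_t tci)
{
	struct tci_fields f;

	f.prio = (tci & TCI_PRIO_MASK) >> TCI_PRIO_SHIFT;
	f.cfi  = (tci & TCI_CFI_MASK) ? 1 : 0;
	f.vid  = tci & TCI_VID_MASK;
	return f;
}
```

With this decoding, the value behind "cfi:4096 vid:50" in the log (0x1032)
reads as prio 0, cfi 1, vid 50.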


* Re: [PATCH v3 2/2] macvtap: restore vlan header on user read
From: Michael S. Tsirkin @ 2012-05-03 14:31 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Basil Gor, David S. Miller, netdev
In-Reply-To: <m1ehr1l711.fsf@fess.ebiederm.org>

On Thu, May 03, 2012 at 06:37:46AM -0700, Eric W. Biederman wrote:
> "Michael S. Tsirkin" <mst@redhat.com> writes:
> 
> > On Wed, Apr 25, 2012 at 10:31:25PM -0700, Eric W. Biederman wrote:
> >> Basil Gor <basil.gor@gmail.com> writes:
> >> 
> >> > Vlan tag is restored during buffer transmit to a network device (bridge
> >> > port) in bridging code in case of tun/tap driver. In case of macvtap it
> >> > has to be done explicitly. Otherwise vlan_tci is ignored and user always
> >> > gets untagged packets.
> >> 
> >> We could quibble about efficiencies but this looks good except for
> >> macvtap_recvmsg which isn't setting the auxdata for the vlan header.
> >> 
> >> Eric
> >
> > Right. I'm guessing we need to support old userspace
> > so if there's auxdata, put vlan there but if not,
> > put the vlan in the packet like this patch does.
> 
> This patch isn't horrible.
> 
> Still why copy the skb when you can just split the copy to userspace
> into a couple of pieces?
> 
> We don't need to change the skb and changing the skb looks like
> it is likely to confuse things and cause bugs because we are
> not working with a consistent model of how vlan information
> is encoded.
> 
> Still something needs to happen and this works in more cases even if it
> isn't perfect.
> 
> Eric

Absolutely. And it's easier than I thought.
So we can do something like the below (warning: compiled only).
Basil - want to take a look?
My only concern is that if we put this logic in an out-of-the-way
driver like macvtap, will people remember to update it?
Maybe it is better to update skb_copy_datagram_const_iovec, which is in core?


Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 0427c65..5a1724c 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -1,5 +1,6 @@
 #include <linux/etherdevice.h>
 #include <linux/if_macvlan.h>
+#include <linux/if_vlan.h>
 #include <linux/interrupt.h>
 #include <linux/nsproxy.h>
 #include <linux/compat.h>
@@ -759,6 +760,7 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
 	struct macvlan_dev *vlan;
 	int ret;
 	int vnet_hdr_len = 0;
+	int vlan_offset = 0;
 
 	if (q->flags & IFF_VNET_HDR) {
 		struct virtio_net_hdr vnet_hdr;
@@ -776,8 +778,29 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
 
 	len = min_t(int, skb->len, len);
 
-	ret = skb_copy_datagram_const_iovec(skb, 0, iv, vnet_hdr_len, len);
+	if (vlan_tx_tag_present(skb)) {
+		struct {
+			__be16 h_vlan_proto;
+			__be16 h_vlan_TCI;
+		} veth;
+		veth.h_vlan_proto = htons(ETH_P_8021Q);
+		veth.h_vlan_TCI = htons(vlan_tx_tag_get(skb));
+
+		vlan_offset = offsetof(struct vlan_ethhdr, h_vlan_proto);
+		ret = skb_copy_datagram_const_iovec(skb, 0, iv, vnet_hdr_len,
+						    vlan_offset);
+		if (ret)
+			goto done;
+		ret = memcpy_toiovecend(iv, (unsigned char *)&veth, vlan_offset,
+					sizeof veth);
+		if (ret)
+			goto done;
+		vlan_offset += sizeof veth;
+	}
+	ret = skb_copy_datagram_const_iovec(skb, vlan_offset, iv, vnet_hdr_len,
+					    len);
 
+done:
 	rcu_read_lock_bh();
 	vlan = rcu_dereference_bh(q->vlan);
 	if (vlan)
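
For reference, the split copy can be modelled in plain user-space C. This
is only a sketch of the idea (copy the two MAC addresses, splice in the
4-byte 802.1Q tag, then copy the rest of the frame), not the kernel code:
frame_copy_tagged and the constants below are hypothetical names, and flat
buffers stand in for the skb and the iovec. The point it illustrates is
that the destination offset has to advance past the inserted tag while
the source offset does not.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_ADDRS_LEN 12	/* destination MAC + source MAC */
#define VLAN_TAG_LEN  4		/* TPID (2 bytes) + TCI (2 bytes) */
#define TPID_8021Q    0x8100u

/*
 * Copy an untagged Ethernet frame from src to dst, inserting an 802.1Q tag
 * carrying the host-order value tci right after the MAC addresses.
 * Returns the number of bytes written, or 0 if src is shorter than the MAC
 * addresses or dst cannot hold the tagged frame.
 */
static size_t frame_copy_tagged(uint8_t *dst, size_t dst_len,
				const uint8_t *src, size_t src_len,
				uint16_t tci)
{
	uint8_t tag[VLAN_TAG_LEN];

	if (src_len < MAC_ADDRS_LEN || dst_len < src_len + VLAN_TAG_LEN)
		return 0;

	/* Both tag fields go on the wire in network byte order. */
	tag[0] = TPID_8021Q >> 8;
	tag[1] = TPID_8021Q & 0xff;
	tag[2] = tci >> 8;
	tag[3] = tci & 0xff;

	/* Piece 1: MAC addresses, same offset in source and destination. */
	memcpy(dst, src, MAC_ADDRS_LEN);
	/* Piece 2: the tag, which exists only in the destination. */
	memcpy(dst + MAC_ADDRS_LEN, tag, VLAN_TAG_LEN);
	/* Piece 3: the rest; the destination is now VLAN_TAG_LEN ahead. */
	memcpy(dst + MAC_ADDRS_LEN + VLAN_TAG_LEN, src + MAC_ADDRS_LEN,
	       src_len - MAC_ADDRS_LEN);

	return src_len + VLAN_TAG_LEN;
}
```

A 14-byte header with ethertype 0x0806 and tci 50 comes out as 18 bytes,
with 0x8100 0x0032 sitting between the source MAC and the original
ethertype.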


* Re: [PATCH] net: davinci_emac: Add pre_open, post_stop platform callbacks
From: Kevin Hilman @ 2012-05-03 14:22 UTC (permalink / raw)
  To: Bedia, Vaibhav
  Cc: Mark A. Greer, netdev@vger.kernel.org, linux-omap@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org
In-Reply-To: <B5906170F1614E41A8A28DE3B8D121433EA72820@DBDE01.ent.ti.com>

"Bedia, Vaibhav" <vaibhav.bedia@ti.com> writes:

> On Thu, May 03, 2012 at 05:17:18, Mark A. Greer wrote:
>> From: "Mark A. Greer" <mgreer@animalcreek.com>
>> 
>> The davinci EMAC driver has been incorporated into the am35x
>> family of SoCs which is OMAP-based.  The incorporation is
>> incomplete in that the EMAC cannot unblock the [ARM] core if
>> it's blocked on a 'wfi' instruction.  This is an issue with
>> the cpu_idle code because it has the core execute a 'wfi'
>> instruction.
>> 
>> To work around this issue, add platform data callbacks which
>> are called at the beginning of the open routine and at the
>> end of the stop routine of the davinci_emac driver.  The
>> callbacks allow the platform code to issue disable_hlt() and
>> enable_hlt() calls appropriately.  Calling disable_hlt()
>> prevents cpu_idle from issuing the 'wfi' instruction.
>> 
>> It is not sufficient to simply call disable_hlt() when
>> there is an EMAC present because it could be present but
>> not actually used in which case, we do want the 'wfi' to
>> be executed.
>> 
>
> Are you trying to say that if ARM executes _just_ wfi and _absolutely
> nothing else_ is done in the OMAP PM code, EMAC stops working?
>
> However, if this is indeed the case, then probably a better solution would be
> to invoke disable_hlt() from the board file when EMAC support is compiled in.

No.  As Mark stated in the changelog, doing that will prevent any
low-power states even if the EMAC is not in use.  IMO, it is best
to only prevent WFI when absolutely needed.

Kevin



* Re: [PATCH 04/11] mm: Add support for a filesystem to activate swap files and use direct_IO for writing swap pages
From: Mel Gorman @ 2012-05-03 14:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, Linux-NFS, LKML, David Miller,
	Trond Myklebust, Neil Brown, Christoph Hellwig, Peter Zijlstra,
	Mike Christie, Eric B Munson
In-Reply-To: <20120501155308.5679a09b.akpm@linux-foundation.org>

On Tue, May 01, 2012 at 03:53:08PM -0700, Andrew Morton wrote:
> > It is perfectly possible that direct_IO be used to read the swap
> > pages but it is an unnecessary complication. Similarly, it is possible
> > that ->writepage be used instead of direct_io to write the pages but
> > filesystem developers have stated that calling writepage from the VM
> > is undesirable for a variety of reasons and using direct_IO opens up
> > the possibility of writing back batches of swap pages in the future.
> 
> This all seems a bit odd.  And abusive.
> 
> Yes, it would be more pleasing if direct-io was used for reading as
> well.  How much more complication would it add?
> 

Quite a bit.

Superficially it's easy because swap_readpage() just sets up a kiocb,
fills in the necessary details and calls ->direct_IO. The complexity is
around page locking and writing back pending writes in NFS.

read_swap_cache_async() calls swap_readpage with the page locked and
is expected to return with the page unlocked on successful completion of
the IO.

For swap-over-nfs, the readpage handler behaves exactly as
read_swap_cache_async() expects. For everything else, submit_bio() is used
with end_swap_bio_read() unlocking the page. Both of these handlers behave
the same with respect to locking. The direct_IO handler does not expect the
page to be locked and does not unlock it itself. Even if it works for NFS,
there might be other complications in the future around page locking in
direct_IO handlers.

The second complexity may be specific to NFS. The NFS readpage handler
flushes any pending writes with nfs_wb_page() before doing the read which it
can do because it holds the page lock. It was completely unclear how the same
could be achieved from swap_readpage() in a filesystem-independent manner.

As ->readpage() already knew how to do the right thing in all cases, I
used it.

> If I understand correctly, on the read path we're taking a fresh page
> which is destined for swapcache and then pretending that it is a
> pagecache page for the purpose of the I/O? 
>
> If there already existed a
> pagecache page for that file offset then we let it just sit there and
> bypass it?
> 

On the read path read_swap_cache_async() checks if a page is already in
swapcache and, if not, allocates a new page, adds it to the swapcache
and calls swap_readpage(). Hence I do not think we are tripping the
problem you are thinking of.

> I'm surprised that this works at all - I guess nothing under
> ->readpage() goes poking around in the address_space.  For NFS, at
> least!
> 
> >
> > ...
> >
> > @@ -93,11 +94,38 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> >  {
> >  	struct bio *bio;
> >  	int ret = 0, rw = WRITE;
> > +	struct swap_info_struct *sis = page_swap_info(page);
> >  
> >  	if (try_to_free_swap(page)) {
> >  		unlock_page(page);
> >  		goto out;
> >  	}
> > +
> > +	if (sis->flags & SWP_FILE) {
> > +		struct kiocb kiocb;
> > +		struct file *swap_file = sis->swap_file;
> > +		struct address_space *mapping = swap_file->f_mapping;
> > +		struct iovec iov = {
> > +			.iov_base = page_address(page),
> 
> Didn't we need to kmap the page?
> 

.... Yep, that would be important all right. I'll look at this closely
and do a round of testing on x86-32.

> > +			.iov_len  = PAGE_SIZE,
> > +		};
> > +
> > +		init_sync_kiocb(&kiocb, swap_file);
> > +		kiocb.ki_pos = page_file_offset(page);
> > +		kiocb.ki_left = PAGE_SIZE;
> > +		kiocb.ki_nbytes = PAGE_SIZE;
> > +
> > +		unlock_page(page);
> > +		ret = mapping->a_ops->direct_IO(KERNEL_WRITE,
> > +						&kiocb, &iov,
> > +						kiocb.ki_pos, 1);
> 
> I wonder if there's any point in setting PG_writeback around the IO.  I
> can't think of a reason.
> 

One does not spring to mind.

> > +		if (ret == PAGE_SIZE) {
> > +			count_vm_event(PSWPOUT);
> > +			ret = 0;
> > +		}
> > +		return ret;
> > +	}
> > +
> >  	bio = get_swap_bio(GFP_NOIO, page, end_swap_bio_write);
> >  	if (bio == NULL) {
> >  		set_page_dirty(page);
> > @@ -119,9 +147,21 @@ int swap_readpage(struct page *page)
> >  {
> >  	struct bio *bio;
> >  	int ret = 0;
> > +	struct swap_info_struct *sis = page_swap_info(page);
> >  
> >  	VM_BUG_ON(!PageLocked(page));
> >  	VM_BUG_ON(PageUptodate(page));
> > +
> > +	if (sis->flags & SWP_FILE) {
> > +		struct file *swap_file = sis->swap_file;
> > +		struct address_space *mapping = swap_file->f_mapping;
> > +
> > +		ret = mapping->a_ops->readpage(swap_file, page);
> > +		if (!ret)
> > +			count_vm_event(PSWPIN);
> > +		return ret;
> > +	}
> 
> Confused.  Where did we set up page->index with the file offset?
> 

We don't use page->index in this case.

__add_to_swap_cache() records the swap entry in page->private.
nfs_readpage() looks up the page index with page_index() which for
SwapCache pages calls __page_file_index(). It in turn gets the swap
entry and looks up the index with swp_offset().
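
As a rough illustration of that lookup, here is a minimal userspace model.
It is a sketch only: struct model_page, the model_* helpers and the
5-bit/59-bit split of the swap entry are assumptions made for the example
and are not the kernel's definitions. It only shows the shape of the logic,
where a swapcache page derives its index from the swap entry stored in the
private field rather than from the index field.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout for this sketch: top bits hold the swap type,
 * the remaining low bits hold the offset within the swap area. */
#define MODEL_SWP_TYPE_BITS   5
#define MODEL_SWP_OFFSET_BITS (64 - MODEL_SWP_TYPE_BITS)

/* Stand-in for the few struct page fields discussed above. */
struct model_page {
	bool     swapcache;	/* models PageSwapCache(page) */
	uint64_t index;		/* models page->index */
	uint64_t private;	/* models page->private (the swap entry) */
};

/* Pack a swap type and offset into one entry value. */
static uint64_t model_swp_entry(unsigned int type, uint64_t offset)
{
	return ((uint64_t)type << MODEL_SWP_OFFSET_BITS) | offset;
}

/* Extract the offset part of an entry; models swp_offset(). */
static uint64_t model_swp_offset(uint64_t entry)
{
	return entry & ((UINT64_C(1) << MODEL_SWP_OFFSET_BITS) - 1);
}

/* Models page_index(): swapcache pages take the index from the
 * swap entry, everything else uses the index field directly. */
static uint64_t model_page_index(const struct model_page *page)
{
	if (page->swapcache)
		return model_swp_offset(page->private);
	return page->index;
}
```

With a page whose private field was built from type 1 and offset 42, the
model returns 42 regardless of what the index field holds.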

> >  	bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read);
> >  	if (bio == NULL) {
> >  		unlock_page(page);
> > @@ -133,3 +173,15 @@ int swap_readpage(struct page *page)
> >  out:
> >  	return ret;
> >  }
> > +
> > +int swap_set_page_dirty(struct page *page)
> > +{
> > +	struct swap_info_struct *sis = page_swap_info(page);
> > +
> > +	if (sis->flags & SWP_FILE) {
> > +		struct address_space *mapping = sis->swap_file->f_mapping;
> > +		return mapping->a_ops->set_page_dirty(page);
> > +	} else {
> > +		return __set_page_dirty_nobuffers(page);
> > +	}
> > +}
> 
> More confused.  This is a swapcache page, not a pagecache page?  Why
> are we running set_page_dirty() against it?
> 

I don't really get the question. swap-over-NFS is not doing anything
different here than what we do today. PageSwapCache pages still have to
be marked dirty so they get written to disk before being discarded.

> And what are we doing on the !SWP_FILE path? 

Maintaining existing behaviour. This is what the swap ops look like
without the patchset:

static const struct address_space_operations swap_aops = {
        .writepage      = swap_writepage,
        .set_page_dirty = __set_page_dirty_nobuffers,
        .migratepage    = migrate_page,
};

> Newly setting PG_dirty
> against block-device-backed swapcache pages?  Why?  Where does it get
> cleared again?
> 

clear_page_dirty_for_io() in vmscan.c#pageout() ? I might be missing
something in your question again :(

> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 9d3dd37..c25b9cf 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -26,7 +26,7 @@
> >   */
> >  static const struct address_space_operations swap_aops = {
> >  	.writepage	= swap_writepage,
> > -	.set_page_dirty	= __set_page_dirty_nobuffers,
> > +	.set_page_dirty	= swap_set_page_dirty,
> >  	.migratepage	= migrate_page,
> >  };
> >
> > ...
> >

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v3 2/2] macvtap: restore vlan header on user read
From: Eric W. Biederman @ 2012-05-03 13:37 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Basil Gor, David S. Miller, netdev
In-Reply-To: <20120503130751.GB26366@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> writes:

> On Wed, Apr 25, 2012 at 10:31:25PM -0700, Eric W. Biederman wrote:
>> Basil Gor <basil.gor@gmail.com> writes:
>> 
>> > Vlan tag is restored during buffer transmit to a network device (bridge
>> > port) in bridging code in case of tun/tap driver. In case of macvtap it
>> > has to be done explicitly. Otherwise vlan_tci is ignored and user always
>> > gets untagged packets.
>> 
>> We could quibble about efficiencies but this looks good except for
>> macvtap_recvmsg which isn't setting the auxdata for the vlan header.
>> 
>> Eric
>
> Right. I'm guessing we need to support old userspace
> so if there's auxdata, put vlan there but if not,
> put the vlan in the packet like this patch does.

This patch isn't horrible.

Still why copy the skb when you can just split the copy to userspace
into a couple of pieces?

We don't need to change the skb and changing the skb looks like
it is likely to confuse things and cause bugs because we are
not working with a consistent model of how vlan information
is encoded.

Still something needs to happen and this works in more cases even if it
isn't perfect.
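
To make the "split the copy" idea concrete, here is a rough userspace
sketch, not macvtap code. The function name, the MODEL_* constants and the
flat byte buffers are assumptions for illustration; in the driver the
pieces would be copied to the user iovec instead. The source frame is left
untouched and the 4-byte 802.1Q tag is emitted between the MAC addresses
and the rest of the frame.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_ETH_ALEN    6		/* bytes per MAC address */
#define MODEL_VLAN_HLEN   4		/* TPID (2 bytes) + TCI (2 bytes) */
#define MODEL_ETH_P_8021Q 0x8100	/* 802.1Q tag protocol identifier */

/*
 * Copy an untagged Ethernet frame to out in three pieces, inserting the
 * VLAN tag after the destination and source MAC addresses.
 * Returns the number of bytes written, or 0 if a buffer is too small.
 */
static size_t model_copy_with_vlan(uint8_t *out, size_t out_len,
				   const uint8_t *frame, size_t frame_len,
				   uint16_t vlan_tci)
{
	const size_t mac_len = 2 * MODEL_ETH_ALEN;
	uint8_t tag[MODEL_VLAN_HLEN];

	if (frame_len < mac_len || out_len < frame_len + MODEL_VLAN_HLEN)
		return 0;

	/* Build the tag in network byte order. */
	tag[0] = MODEL_ETH_P_8021Q >> 8;
	tag[1] = MODEL_ETH_P_8021Q & 0xff;
	tag[2] = vlan_tci >> 8;
	tag[3] = vlan_tci & 0xff;

	memcpy(out, frame, mac_len);			/* piece 1: MACs */
	memcpy(out + mac_len, tag, sizeof(tag));	/* piece 2: tag */
	memcpy(out + mac_len + sizeof(tag),		/* piece 3: rest */
	       frame + mac_len, frame_len - mac_len);

	return frame_len + MODEL_VLAN_HLEN;
}
```

The point of the sketch is only that the tagged layout can be produced at
copy time, so the in-memory packet keeps a single representation of the
vlan information.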

Eric

>> > Signed-off-by: Basil Gor <basil.gor@gmail.com>
>> > ---
>> >  drivers/net/macvtap.c |   11 ++++++++++-
>> >  1 files changed, 10 insertions(+), 1 deletions(-)
>> >
>> > diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
>> > index 0427c65..28d2678 100644
>> > --- a/drivers/net/macvtap.c
>> > +++ b/drivers/net/macvtap.c
>> > @@ -1,6 +1,7 @@
>> >  #include <linux/etherdevice.h>
>> >  #include <linux/if_macvlan.h>
>> >  #include <linux/interrupt.h>
>> > +#include <linux/if_vlan.h>
>> >  #include <linux/nsproxy.h>
>> >  #include <linux/compat.h>
>> >  #include <linux/if_tun.h>
>> > @@ -753,13 +754,21 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
>> >  
>> >  /* Put packet to the user space buffer */
>> >  static ssize_t macvtap_put_user(struct macvtap_queue *q,
>> > -				const struct sk_buff *skb,
>> > +				struct sk_buff *skb,
>> >  				const struct iovec *iv, int len)
>> >  {
>> >  	struct macvlan_dev *vlan;
>> >  	int ret;
>> >  	int vnet_hdr_len = 0;
>> >  
>> > +	if (vlan_tx_tag_present(skb)) {
>> > +		skb = __vlan_put_tag(skb, vlan_tx_tag_get(skb));
>> > +		if (unlikely(!skb))
>> > +			return -ENOMEM;
>> > +
>> > +		skb->vlan_tci = 0;
>> > +	}
>> > +
>> >  	if (q->flags & IFF_VNET_HDR) {
>> >  		struct virtio_net_hdr vnet_hdr;
>> >  		vnet_hdr_len = q->vnet_hdr_sz;
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v3 1/2] vhost-net: fix handle_rx buffer size
From: Michael S. Tsirkin @ 2012-05-03 13:16 UTC (permalink / raw)
  To: Basil Gor; +Cc: Eric W. Biederman, David S. Miller, netdev
In-Reply-To: <1335373275-336-1-git-send-email-basil.gor@gmail.com>

On Wed, Apr 25, 2012 at 09:01:15PM +0400, Basil Gor wrote:
> Take vlan header length into account, when vlan id is stored as
> vlan_tci. Otherwise tagged packets coming from macvtap will be
> truncated.
> 
> Signed-off-by: Basil Gor <basil.gor@gmail.com>

So I'm inclined to apply these two patches. This doesn't fix the
packet socket backend, but that could be fixed by a follow-up patch.
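
For reference, the length calculation the patch is after can be modelled
in userspace roughly as below. This is a sketch under assumptions:
struct model_skb and model_peek_head_len() are made-up names, and the
locking around the receive queue is left out. It only shows that a packet
whose tag is held out of band needs VLAN_HLEN (4 bytes) of extra room once
the tag is put back into the data.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_VLAN_HLEN 4	/* size of an 802.1Q tag in bytes */

/* Stand-in for the two sk_buff properties the patch looks at. */
struct model_skb {
	size_t len;			/* models skb->len */
	bool   vlan_tag_present;	/* models vlan_tx_tag_present(skb) */
};

/* Buffer size needed for the head of the queue; 0 if the queue is empty. */
static size_t model_peek_head_len(const struct model_skb *head)
{
	size_t len = 0;

	if (head) {
		len = head->len;
		/* The tag is not counted in len, so reserve room for it. */
		if (head->vlan_tag_present)
			len += MODEL_VLAN_HLEN;
	}
	return len;
}
```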

> ---
>  drivers/vhost/net.c |    7 ++++++-
>  1 files changed, 6 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 1f21d2a..5c17010 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -24,6 +24,7 @@
>  #include <linux/if_arp.h>
>  #include <linux/if_tun.h>
>  #include <linux/if_macvlan.h>
> +#include <linux/if_vlan.h>
>  
>  #include <net/sock.h>
>  
> @@ -283,8 +284,12 @@ static int peek_head_len(struct sock *sk)
>  
>  	spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
>  	head = skb_peek(&sk->sk_receive_queue);
> -	if (likely(head))
> +	if (likely(head)) {
>  		len = head->len;
> +		if (vlan_tx_tag_present(head))
> +			len += VLAN_HLEN;
> +	}
> +
>  	spin_unlock_irqrestore(&sk->sk_receive_queue.lock, flags);
>  	return len;
>  }
> -- 
> 1.7.6.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v3 2/2] macvtap: restore vlan header on user read
From: Michael S. Tsirkin @ 2012-05-03 13:07 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Basil Gor, David S. Miller, netdev
In-Reply-To: <m162cngitu.fsf@fess.ebiederm.org>

On Wed, Apr 25, 2012 at 10:31:25PM -0700, Eric W. Biederman wrote:
> Basil Gor <basil.gor@gmail.com> writes:
> 
> > Vlan tag is restored during buffer transmit to a network device (bridge
> > port) in bridging code in case of tun/tap driver. In case of macvtap it
> > has to be done explicitly. Otherwise vlan_tci is ignored and user always
> > gets untagged packets.
> 
> We could quibble about efficiencies but this looks good except for
> macvtap_recvmsg which isn't setting the auxdata for the vlan header.
> 
> Eric

Right. I'm guessing we need to support old userspace
so if there's auxdata, put vlan there but if not,
put the vlan in the packet like this patch does.

> > Signed-off-by: Basil Gor <basil.gor@gmail.com>
> > ---
> >  drivers/net/macvtap.c |   11 ++++++++++-
> >  1 files changed, 10 insertions(+), 1 deletions(-)
> >
> > diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> > index 0427c65..28d2678 100644
> > --- a/drivers/net/macvtap.c
> > +++ b/drivers/net/macvtap.c
> > @@ -1,6 +1,7 @@
> >  #include <linux/etherdevice.h>
> >  #include <linux/if_macvlan.h>
> >  #include <linux/interrupt.h>
> > +#include <linux/if_vlan.h>
> >  #include <linux/nsproxy.h>
> >  #include <linux/compat.h>
> >  #include <linux/if_tun.h>
> > @@ -753,13 +754,21 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
> >  
> >  /* Put packet to the user space buffer */
> >  static ssize_t macvtap_put_user(struct macvtap_queue *q,
> > -				const struct sk_buff *skb,
> > +				struct sk_buff *skb,
> >  				const struct iovec *iv, int len)
> >  {
> >  	struct macvlan_dev *vlan;
> >  	int ret;
> >  	int vnet_hdr_len = 0;
> >  
> > +	if (vlan_tx_tag_present(skb)) {
> > +		skb = __vlan_put_tag(skb, vlan_tx_tag_get(skb));
> > +		if (unlikely(!skb))
> > +			return -ENOMEM;
> > +
> > +		skb->vlan_tci = 0;
> > +	}
> > +
> >  	if (q->flags & IFF_VNET_HDR) {
> >  		struct virtio_net_hdr vnet_hdr;
> >  		vnet_hdr_len = q->vnet_hdr_sz;
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] wl12xx: fix size of two memset's in wl1271_cmd_build_arp_rsp()
From: Luciano Coelho @ 2012-05-03 12:56 UTC (permalink / raw)
  To: Jesper Juhl; +Cc: linux-wireless, John W. Linville, netdev, linux-kernel
In-Reply-To: <alpine.LNX.2.00.1204222255460.27455@swampdragon.chaosbits.net>

On Sun, 2012-04-22 at 22:57 +0200, Jesper Juhl wrote:
> On Sun, 22 Apr 2012, Jesper Juhl wrote:
> 
> > We currently do this:
> > 
> > int wl1271_cmd_build_arp_rsp(struct wl1271 *wl, struct wl12xx_vif *wlvif)
> > ...
> > 	struct wl12xx_arp_rsp_template *tmpl;
> > 	struct ieee80211_hdr_3addr *hdr;
> > ...
> > 	tmpl = (struct wl12xx_arp_rsp_template *)skb_put(skb, sizeof(*tmpl));
> > 	memset(tmpl, 0, sizeof(tmpl));
> > ...
> > 	hdr = (struct ieee80211_hdr_3addr *)skb_push(skb, sizeof(*hdr));
> > 	memset(hdr, 0, sizeof(*hdr));
> > ...
> > 
> > I believe we want to set the entire structures to 0 with those
> > memset() calls, not just zero the initial part of them (size of the
> > pointer bytes).
> > 
> 
> Sorry, I accidentally copied that code from the fixed version. The above 
> should read:
> 
> 
> We currently do this:
> 
> int wl1271_cmd_build_arp_rsp(struct wl1271 *wl, struct wl12xx_vif *wlvif)
> ...
>       struct wl12xx_arp_rsp_template *tmpl;
>       struct ieee80211_hdr_3addr *hdr;
> ...
>       tmpl = (struct wl12xx_arp_rsp_template *)skb_put(skb, sizeof(*tmpl));
>       memset(tmpl, 0, sizeof(tmpl));
> ...
>       hdr = (struct ieee80211_hdr_3addr *)skb_push(skb, sizeof(*hdr));
>       memset(hdr, 0, sizeof(hdr));
> ...
> 
> I believe we want to set the entire structures to 0 with those
> memset() calls, not just zero the initial part of them (size of the
> pointer bytes).
> 
> 
> 
> 
> 
> > Signed-off-by: Jesper Juhl <jj@chaosbits.net>
> > ---

Applied with the fixed commit log and merged into the new directory
structure.  Thanks Jesper!
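
For anyone skimming the thread, the difference between the two sizeof
forms can be shown with a small standalone sketch. struct model_tmpl and
the clear_* helpers are hypothetical and unrelated to the wl12xx
structures; the size is computed into a variable first only to keep the
example quiet under compiler warnings about sizeof on a pointer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical 64-byte template, standing in for the real structs. */
struct model_tmpl {
	unsigned char bytes[64];
};

/* Buggy form: sizeof(tmpl) is the size of the pointer (typically 4 or
 * 8 bytes), so only the start of the structure gets cleared. */
static size_t clear_buggy(struct model_tmpl *tmpl)
{
	size_t n = sizeof(tmpl);

	memset(tmpl, 0, n);
	return n;
}

/* Fixed form: sizeof(*tmpl) is the size of the structure itself. */
static size_t clear_fixed(struct model_tmpl *tmpl)
{
	size_t n = sizeof(*tmpl);

	memset(tmpl, 0, n);
	return n;
}
```

After clear_buggy() on a buffer filled with 0xff, the last byte is still
0xff; after clear_fixed() the whole structure is zero.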

^ permalink raw reply
