* Re: [PATCH 2/2] ss: implement -M option to get all memory information
From: Stephen Hemminger @ 2012-05-03 15:25 UTC (permalink / raw)
To: Shan Wei; +Cc: xemul, NetDev
In-Reply-To: <4FA24458.6020105@gmail.com>
On Thu, 03 May 2012 16:39:52 +0800
Shan Wei <shanwei88@gmail.com> wrote:
> Stephen Hemminger said, at 2012/5/3 3:00:
>
> >
> > This looks good, is the skmeminfo a superset of the old meminfo?
>
>
> Yes, skmeminfo is a superset of the old meminfo.
> Using it, ss can get more socket memory information.
>
> > But your code is broken on 64 bit. skmeminfo in kernel is an array of __u32!
>
>
> OK. here is a new version.
>
> ----
> [PATCH] ss: use new INET_DIAG_SKMEMINFO option to get more memory information for tcp socket
>
>
> Signed-off-by: Shan Wei <davidshan@tencent.com>
> ---
> misc/ss.c | 16 ++++++++++++++--
> 1 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/misc/ss.c b/misc/ss.c
> index 5f70a26..bd60548 100644
> --- a/misc/ss.c
> +++ b/misc/ss.c
> @@ -1336,7 +1336,17 @@ static void tcp_show_info(const struct nlmsghdr *nlh, struct inet_diag_msg *r)
> parse_rtattr(tb, INET_DIAG_MAX, (struct rtattr*)(r+1),
> nlh->nlmsg_len - NLMSG_LENGTH(sizeof(*r)));
>
> - if (tb[INET_DIAG_MEMINFO]) {
> + if (tb[INET_DIAG_SKMEMINFO]) {
> + const __u32 *skmeminfo = RTA_DATA(tb[INET_DIAG_SKMEMINFO]);
> + printf(" skmem:(r%u,rb%u,t%u,tb%u,f%u,w%u,o%u)",
> + skmeminfo[SK_MEMINFO_RMEM_ALLOC],
> + skmeminfo[SK_MEMINFO_RCVBUF],
> + skmeminfo[SK_MEMINFO_WMEM_ALLOC],
> + skmeminfo[SK_MEMINFO_SNDBUF],
> + skmeminfo[SK_MEMINFO_FWD_ALLOC],
> + skmeminfo[SK_MEMINFO_WMEM_QUEUED],
> + skmeminfo[SK_MEMINFO_OPTMEM]);
> + }else if (tb[INET_DIAG_MEMINFO]) {
> const struct inet_diag_meminfo *minfo
> = RTA_DATA(tb[INET_DIAG_MEMINFO]);
> printf(" mem:(r%u,w%u,f%u,t%u)",
> @@ -1505,8 +1515,10 @@ static int tcp_show_netlink(struct filter *f, FILE *dump_fp, int socktype)
> memset(&req.r, 0, sizeof(req.r));
> req.r.idiag_family = AF_INET;
> req.r.idiag_states = f->states;
> - if (show_mem)
> + if (show_mem) {
> req.r.idiag_ext |= (1<<(INET_DIAG_MEMINFO-1));
> + req.r.idiag_ext |= (1<<(INET_DIAG_SKMEMINFO-1));
> + }
>
> if (show_tcpinfo) {
> req.r.idiag_ext |= (1<<(INET_DIAG_INFO-1));
This looks good, I will apply it
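
A minimal userspace sketch of the two points settled above: the attribute payload is read as an array of 32-bit counters (not as unsigned long, whose width changes on 64-bit), and an extension of attribute type N is requested by setting bit N-1 in idiag_ext. The MEMINFO_* enum is a local stand-in whose order is assumed to mirror the SK_MEMINFO_* indices used in the patch; format_skmem() and ext_bit() are hypothetical helpers, not part of ss.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Local stand-ins for the SK_MEMINFO_* indices (order is an assumption). */
enum {
	MEMINFO_RMEM_ALLOC,
	MEMINFO_RCVBUF,
	MEMINFO_WMEM_ALLOC,
	MEMINFO_SNDBUF,
	MEMINFO_FWD_ALLOC,
	MEMINFO_WMEM_QUEUED,
	MEMINFO_OPTMEM,
	MEMINFO_VARS
};

/*
 * Hypothetical helper: format an attribute payload made of 32-bit counters.
 * Returns the snprintf() result, or -1 if the payload is too short.
 */
static int format_skmem(char *out, size_t outlen,
			const void *payload, size_t paylen)
{
	uint32_t v[MEMINFO_VARS];

	if (paylen < sizeof(v))
		return -1;		/* never read past a short attribute */
	memcpy(v, payload, sizeof(v));	/* no alignment assumptions */

	return snprintf(out, outlen,
			"skmem:(r%u,rb%u,t%u,tb%u,f%u,w%u,o%u)",
			(unsigned)v[MEMINFO_RMEM_ALLOC],
			(unsigned)v[MEMINFO_RCVBUF],
			(unsigned)v[MEMINFO_WMEM_ALLOC],
			(unsigned)v[MEMINFO_SNDBUF],
			(unsigned)v[MEMINFO_FWD_ALLOC],
			(unsigned)v[MEMINFO_WMEM_QUEUED],
			(unsigned)v[MEMINFO_OPTMEM]);
}

/* Attribute type N is requested through bit N-1 of the extension mask. */
static unsigned int ext_bit(int attr_type)
{
	return 1u << (attr_type - 1);
}
```

Copying into a fixed-width uint32_t array keeps the element stride at four bytes on both 32-bit and 64-bit builds, which is the property the review comment asked for.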
^ permalink raw reply
* Re: sky2 still badly broken
From: Stephen Hemminger @ 2012-05-03 15:23 UTC (permalink / raw)
To: Niccolò Belli; +Cc: netdev
In-Reply-To: <4FA2527A.6020808@linuxsystems.it>
On Thu, 03 May 2012 11:40:10 +0200
Niccolò Belli <darkbasic@linuxsystems.it> wrote:
> On 02/05/2012 20:56, Stephen Hemminger wrote:
> > It could be that your switch doesn't do autonegotiation or flow
> > control. You are getting receive fifo overflow errors.
>
> I don't have this problem with other NICs. Also transfer rate is very
> low (even 2 MB/s sometimes) while I get ~110MB/s with other NICs (and
> the same switch of course).
>
> Niccolò
The receiver on some versions of the chip can't keep up with the full
1 Gbit/s line rate. The receive FIFO has hardware issues, and since I don't
work for Marvell, working around the problem is guesswork. Without exact
information, all that can be done is a timeout and brute-force reset
logic. The vendor driver sk98lin has the same brute-force logic, but may
just not print the message.
^ permalink raw reply
* Re: [PATCH v3 2/2] macvtap: restore vlan header on user read
From: Basil Gor @ 2012-05-03 15:22 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Eric W. Biederman, David S. Miller, netdev
In-Reply-To: <20120503143108.GA20969@redhat.com>
On Thu, May 03, 2012 at 05:31:10PM +0300, Michael S. Tsirkin wrote:
> On Thu, May 03, 2012 at 06:37:46AM -0700, Eric W. Biederman wrote:
> > "Michael S. Tsirkin" <mst@redhat.com> writes:
> >
> > > On Wed, Apr 25, 2012 at 10:31:25PM -0700, Eric W. Biederman wrote:
> > >> Basil Gor <basil.gor@gmail.com> writes:
> > >>
> > >> > Vlan tag is restored during buffer transmit to a network device (bridge
> > >> > port) in bridging code in case of tun/tap driver. In case of macvtap it
> > >> > has to be done explicitly. Otherwise vlan_tci is ignored and user always
> > >> > gets untagged packets.
> > >>
> > >> We could quibble about efficiencies but this looks good except for
> > >> macvtap_recvmsg which isn't setting the auxdata for the vlan header.
> > >>
> > >> Eric
> > >
> > > Right. I'm guessing we need to support old userspace
> > > so if there's auxdata, put vlan there but if not,
> > > put the vlan in the packet like this patch does.
> >
> > This patch isn't horrible.
> >
> > Still why copy the skb when you can just split the copy to userspace
> > into a couple of pieces?
> >
> > We don't need to change the skb and changing the skb looks like
> > it is likely to confuse things and cause bugs because we are
> > not working with a consistent model of how vlan information
> > is encoded.
> >
> > Still something needs to happen and this works in more cases even if it
> > isn't perfect.
> >
> > Eric
>
> Absolutely. And it's easier than I thought.
> So we can do something like the below (warning: compiled only).
> Basil - want to take a look?
Sure, I'll give it a try.
Thanks
Basil Gor
> My only concern: if we put this logic in an out-of-the-way
> driver like macvtap, will people remember to update it?
> Maybe better to update skb_copy_datagram_const_iovec which is in core?
>
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index 0427c65..5a1724c 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -1,5 +1,6 @@
> #include <linux/etherdevice.h>
> #include <linux/if_macvlan.h>
> +#include <linux/if_vlan.h>
> #include <linux/interrupt.h>
> #include <linux/nsproxy.h>
> #include <linux/compat.h>
> @@ -759,6 +760,7 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
> struct macvlan_dev *vlan;
> int ret;
> int vnet_hdr_len = 0;
> + int vlan_offset = 0;
>
> if (q->flags & IFF_VNET_HDR) {
> struct virtio_net_hdr vnet_hdr;
> @@ -776,8 +778,29 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
>
> len = min_t(int, skb->len, len);
>
> - ret = skb_copy_datagram_const_iovec(skb, 0, iv, vnet_hdr_len, len);
> + if (vlan_tx_tag_present(skb)) {
> + struct {
> + __be16 h_vlan_proto;
> + __be16 h_vlan_TCI;
> + } veth;
> + veth.h_vlan_proto = htons(ETH_P_8021Q);
> + veth.h_vlan_TCI = vlan_tx_tag_get(skb);
> +
> + vlan_offset = offsetof(struct vlan_ethhdr, h_vlan_proto);
> + ret = skb_copy_datagram_const_iovec(skb, 0, iv, vnet_hdr_len,
> + vlan_offset);
> + if (ret)
> + goto done;
> + ret = memcpy_toiovecend(iv, (unsigned char *)&veth, vlan_offset,
> + sizeof veth);
> + if (ret)
> + goto done;
> + vlan_offset += sizeof veth;
> + }
> + ret = skb_copy_datagram_const_iovec(skb, vlan_offset, iv, vnet_hdr_len,
> + len);
>
> +done:
> rcu_read_lock_bh();
> vlan = rcu_dereference_bh(q->vlan);
> if (vlan)
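
The "split the copy into a couple of pieces" idea discussed above can be sketched in plain userspace C. This is only an illustration on flat byte buffers, not the macvtap code: copy_with_vlan_tag() and the two length macros are hypothetical names, and the kernel version works on an skb and an iovec instead. The point it shows is that after the tag is written the destination offset runs four bytes ahead of the source offset, so the two have to be tracked separately.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ADDRS_LEN 12	/* destination MAC + source MAC */
#define VLAN_TAG_LEN   4	/* TPID (0x8100) + TCI */

/*
 * Hypothetical helper: copy an untagged Ethernet frame into dst with an
 * 802.1Q tag inserted after the MAC addresses. tci is in host byte order.
 * Returns the number of bytes written, or 0 if a buffer is too small.
 */
static size_t copy_with_vlan_tag(uint8_t *dst, size_t dst_len,
				 const uint8_t *frame, size_t frame_len,
				 uint16_t tci)
{
	const uint8_t tag[VLAN_TAG_LEN] = {
		0x81, 0x00,			/* 802.1Q TPID, network order */
		(uint8_t)(tci >> 8), (uint8_t)(tci & 0xff)
	};

	if (frame_len < ETH_ADDRS_LEN)
		return 0;
	if (dst_len < VLAN_TAG_LEN || dst_len - VLAN_TAG_LEN < frame_len)
		return 0;

	/* Piece 1: MAC addresses, source offset 0 to destination offset 0. */
	memcpy(dst, frame, ETH_ADDRS_LEN);
	/* Piece 2: the tag, which exists only in the destination. */
	memcpy(dst + ETH_ADDRS_LEN, tag, VLAN_TAG_LEN);
	/* Piece 3: the rest; source resumes at 12, destination at 16. */
	memcpy(dst + ETH_ADDRS_LEN + VLAN_TAG_LEN, frame + ETH_ADDRS_LEN,
	       frame_len - ETH_ADDRS_LEN);

	return frame_len + VLAN_TAG_LEN;
}
```

The source frame is never modified, which matches the concern raised in the thread about keeping a consistent model of how the VLAN information is encoded in the skb.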
^ permalink raw reply
* pull request: wireless-next 2012-05-03
From: John W. Linville @ 2012-05-03 15:22 UTC (permalink / raw)
To: davem-fT/PcQaiUtIeIZ0/mPfg9Q
Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
netdev-u79uwXL29TY76Z2rM5mHXA
commit aeace1b7293095fd45240646343251b1da8713da
Dave,
This is a batch of updates intended for 3.5. It also includes a pull
from the wireless tree which resolved some build dependencies.
Highlights of this pull request include some refactoring in the
bluetooth directories, some HT enhancements for mac80211, an expansion
of the ethtool support for cfg80211- and mac80211-based drivers,
and some more iwlwifi refactoring.
It looks like some of the bluetooth device ID patches got committed
on both the bluetooth and the bluetooth-next trees. I'll ask them to
be more careful about that, but I didn't think it was worth asking
for rebases since that would be disruptive to the downstream trees
and since git handles the situation reasonably well already.
Please let me know if there are problems!
Thanks,
John
---
The following changes since commit af94bf6db1d58d26f1cdab145b6312ad363254a6:
ixgbe: Fix use after free on module remove (2012-05-03 04:21:34 -0400)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next.git for-davem
AceLan Kao (5):
Bluetooth: Add support for Atheros [04ca:3005]
Bluetooth: Add support for Atheros [13d3:3362]
Bluetooth: Add support for Atheros [13d3:3362]
Bluetooth: Add support for AR3012 [0cf3:e004]
Bluetooth: Add support for AR3012 [0cf3:e004]
Amitkumar Karwar (1):
mwifiex: fix static checker warnings
Andre Guedes (10):
Bluetooth: Check FINDING state in interleaved discovery
Bluetooth: Add hci_cancel_le_scan() to hci_core
Bluetooth: LE support for MGMT stop discovery
Bluetooth: Replace EPERM by EALREADY in hci_cancel_inquiry
Bluetooth: Refactor stop_discovery
Bluetooth: Add Periodic Inquiry command complete handler
Bluetooth: Add HCI_PERIODIC_INQ to dev_flags
Bluetooth: Check HCI_PERIODIC_INQ in start_discovery
Bluetooth: Ignore inquiry results from periodic inquiry
Bluetooth: Remove MGMT_ADDR_INVALID macro
Andrei Emeltchenko (27):
Bluetooth: Correct type for hdev lmp_subver
Bluetooth: trivial: Correct endian conversion
Bluetooth: Correct type for ediv to __le16
Bluetooth: Fix extra conversion to __le32
Bluetooth: Correct chan->psm endian conversions
Bluetooth: Correct ediv in SMP
Bluetooth: Correct length calc in L2CAP conf rsp
Bluetooth: Correct CID endian notation
Bluetooth: Convert error codes to le16
Bluetooth: trivial: Fix endian conversion mode
Bluetooth: mgmt: Add missing endian conversion
Bluetooth: trivial: Correct types
Bluetooth: Fix type in cpu_to_le conversion
Bluetooth: Fix opcode access in hci_complete
Bluetooth: trivial: Remove sparse warnings
Bluetooth: Silence sparse warning
Bluetooth: mgmt: Fix timeout type
Bluetooth: Remove unneeded timer clear
Bluetooth: Fix memory leaks due to chan refcnt
Bluetooth: Make L2CAP chan_add functions static
Bluetooth: Comments and style fixes
Bluetooth: Remove unneeded zero initialization
Bluetooth: Add Read Local AMP Info to init
Bluetooth: Adds set_default function in L2CAP setup
Bluetooth: trivial: Remove empty line
Bluetooth: Fix debug printing unallocated name
cfg80211: Remove compile warnings
Anisse Astier (2):
rt2x00: debugfs support - allow a register to be empty
rt2x00: Add debugfs access for rfcsr register
Ashok Nagarajan (4):
mac80211: Advertise HT protection mode in IEs
mac80211: Implement HT mixed protection mode
mac80211: Allow nonHT/HT peering in mesh
{nl,cfg,mac}80211: Allow user to see/configure HT protection mode
Ben Greear (4):
cfg80211: Add framework to support ethtool stats.
mac80211: Support getting sta_info stats via ethtool.
mac80211: Framework to get wifi-driver stats via ethtool.
mac80211: Add more ethtools stats: survey, rates, etc
Ben Hutchings (2):
ipw2200: Fix order of device registration
ipw2100: Fix order of device registration
Brian Gix (1):
Bluetooth: mgmt: Fix corruption of device_connected pkt
Cho, Yu-Chen (1):
Bluetooth: Add Atheros maryann PIDVID support
Dan Carpenter (1):
wireless: at76c50x: allocating too much data
David Herrmann (5):
Bluetooth: Remove redundant hdev->parent field
Bluetooth: vhci: Ignore return code of nonseekable_open()
Bluetooth: Move hci_alloc/free_dev close to hci_register/unregister_dev
Bluetooth: Move device initialization to hci_alloc_dev()
Bluetooth: Remove unneeded initialization in hci_alloc_dev()
Don Zickus (1):
Bluetooth: btusb: typo in Broadcom SoftSailing id
Eldad Zack (1):
brcmsmac: "INTERMEDIATE but not AMPDU" only when tracing
Eliad Peller (1):
mac80211: call ieee80211_mgd_stop() on interface stop
Emmanuel Grumbach (3):
iwlwifi: use IWL_* instead of dev_printk when possible
iwlwifi: don't init trans->reg_lock from the op_mode
cfg80211: fix BSS comparison
Felix Fietkau (1):
mac80211: fix AP mode EAP tx for VLAN stations
Franky Lin (6):
brcm80211: fmac: fix SDIO function 0 register r/w issue
brcm80211: fmac: fix missing completion events issue
brcmfmac: stop releasing sdio host in irq handler
brcmfmac: check bus state for status
brcmfmac: postpone interrupt register function
brcmfmac: add out of band interrupt support
Gabor Juhos (2):
ath9k: add an extra boolean parameter to ath9k_hw_apply_txpower
ath9k: fix tx power settings for AR9287
Grazvydas Ignotas (2):
wl1251: fix crash on remove due to premature kfree
wl1251: fix crash on remove due to leftover work item
Gustavo Padovan (6):
Bluetooth: Remove sk parameter from l2cap_chan_create()
Bluetooth: Fix userspace compatibility issue with mgmt interface
Merge git://git.kernel.org/.../bluetooth/bluetooth
Bluetooth: Remove err parameter from alloc_skb()
Bluetooth: remove unneeded declaration of sco_conn_del()
Bluetooth: Fix coding style issues
Hemant Gupta (6):
Bluetooth: Use correct flags for checking HCI_SSP_ENABLED bit
Bluetooth: Send correct address type for LTK
Bluetooth: Fix clearing discovery type when stopping discovery
Bluetooth: mgmt: Fix missing connect failed event for LE
Bluetooth: mgmt: Fix address type while loading Long Term Key
Bluetooth: Don't distribute keys in case of Encryption Failure
Ido Yariv (1):
Bluetooth: Search global l2cap channels by src/dst addresses
Jesper Juhl (1):
Bluetooth: btmrvl_sdio: remove pointless conditional before release_firmware()
Johan Hedberg (2):
Bluetooth: Don't increment twice in eir_has_data_type()
Bluetooth: Check for minimum data length in eir_has_data_type()
Johan Hovold (2):
Bluetooth: hci_ldisc: fix NULL-pointer dereference on tty_close
Bluetooth: hci_core: fix NULL-pointer dereference at unregister
Johannes Berg (1):
iwlwifi: fix hardware queue programming
John W. Linville (5):
Merge branch 'for-upstream' of git://git.kernel.org/.../bluetooth/bluetooth
Merge branch 'for-upstream' of git://git.kernel.org/.../bluetooth/bluetooth-next
Merge branch 'wireless-next' of git://git.kernel.org/.../iwlwifi/iwlwifi
Merge branch 'master' of git://git.kernel.org/.../linville/wireless
Merge branch 'master' of git://git.kernel.org/.../linville/wireless-next into for-davem
Jonathan Bither (1):
ath5k: add missing iounmap to AHB probe removal
João Paulo Rechi Vita (1):
Bluetooth: btusb: Add USB device ID "0a5c 21e8"
Larry Finger (1):
rtlwifi: Fix oops on unload
Luis R. Rodriguez (2):
Bluetooth: properly use pr_fmt() on lib.c
libertas: include sched.h on firmware.c
Lukasz Rymanowski (1):
Bluetooth: Remove not needed status parameter
Manoj Iyer (2):
Bluetooth: btusb: Add vendor specific ID (0489 e042) for BCM20702A0
Bluetooth: btusb: Add vendor specific ID (0489 e042) for BCM20702A0
Marcel Holtmann (10):
Bluetooth: Add TX power tag to EIR data
Bluetooth: Handle EIR tags for Device ID
Bluetooth: Add management command for setting Device ID
Bluetooth: Fix broken usage of put_unaligned_le16
Bluetooth: Fix broken usage of get_unaligned_le16
Bluetooth: Update management interface revision
Bluetooth: Split error handling for L2CAP listen sockets
Bluetooth: Split error handling for SCO listen sockets
Bluetooth: Don't check source address in SCO bind function
Bluetooth: Restrict to one SCO listening socket
Mat Martineau (4):
Bluetooth: Add definitions and struct members for new ERTM state machine
Bluetooth: Add a structure to carry ERTM data in skb control blocks
Bluetooth: Add the l2cap_seq_list structure for tracking frames
Bluetooth: Functions for handling ERTM control fields
Meenakshi Venkataraman (1):
iwlwifi: use correct released ucode version
Mikel Astiz (3):
Bluetooth: Use unsigned int instead of signed int
Bluetooth: Remove unnecessary check
Bluetooth: btusb: Dynamic alternate setting
Rajkumar Manoharan (1):
mac80211: fix rate control update on 2040 bss change
Santosh Nayak (1):
Bluetooth: Fix Endian Bug.
Seth Forshee (1):
b43: only reload config after successful initialization
Stanislav Yakovlev (2):
ipw2200: Fix race condition in the command completion acknowledge
net/wireless: ipw2200: Fix WARN_ON occurring in wiphy_register called by ipw_pci_probe
Stanislaw Gruszka (2):
iwlwifi: do not nulify ctx->vif on reset
iwlwifi: add option to disable 5GHz band
Steven Harms (2):
Add Foxconn / Hon Hai IDs for btusb module
Add Foxconn / Hon Hai IDs for btusb module
Syam Sidhardhan (3):
Bluetooth: remove header declared but not defined
Bluetooth: Remove strtoba header declared but not defined
Bluetooth: mgmt: Remove unwanted goto statements
Szymon Janc (4):
Bluetooth: mgmt: Fix some code style and indentation issues
Bluetooth: mgmt: Don't allow to set invalid value to DeviceID source
Bluetooth: Fix missing break in hci_cmd_complete_evt
Bluetooth: Fix missing break in hci_cmd_complete_evt
Thomas Pedersen (2):
mac80211: insert mesh peer after init
mac80211: don't transmit 40MHz frames to 20MHz peer
Ulisses Furquim (1):
Bluetooth: Fix registering hci with duplicate name
Vinicius Costa Gomes (1):
Bluetooth: Add support for reusing the same hci_conn for LE links
Vishal Agarwal (4):
Bluetooth: hci_persistent_key should return bool
Bluetooth: Temporary keys should be retained during connection
Bluetooth: hci_persistent_key should return bool
Bluetooth: Temporary keys should be retained during connection
WarheadsSE (1):
mwifiex: add support for SD8786 sdio
Wey-Yi Guy (11):
iwlwifi: remove unused macros
iwlwifi: add BT reduced tx power flag
iwlwifi: add checking for the condition to reduce tx power
iwlwifi: add reduced tx power threshold define
iwlwifi: small define change
iwlwifi: send reduce tx power info in command
iwlwifi: change kill mask based on reduce power state
iwlwifi: add loose coex lut
iwlwifi: use 6000G2B for 6030 device series
iwlwifi: modify #ifdef to avoid sparse complain
iwlwifi: remove the iwl_shared reference
drivers/bluetooth/ath3k.c | 4 +
drivers/bluetooth/btmrvl_sdio.c | 9 +-
drivers/bluetooth/btusb.c | 19 +-
drivers/bluetooth/hci_ldisc.c | 2 +-
drivers/bluetooth/hci_vhci.c | 3 +-
drivers/net/wireless/at76c50x-usb.c | 4 +-
drivers/net/wireless/ath/ath5k/ahb.c | 1 +
drivers/net/wireless/ath/ath9k/ar5008_phy.c | 2 +-
drivers/net/wireless/ath/ath9k/ar9003_paprd.c | 2 +-
drivers/net/wireless/ath/ath9k/ar9003_phy.c | 2 +-
drivers/net/wireless/ath/ath9k/eeprom_9287.c | 2 +
drivers/net/wireless/ath/ath9k/hw.c | 9 +-
drivers/net/wireless/ath/ath9k/hw.h | 3 +-
drivers/net/wireless/b43/main.c | 10 +-
drivers/net/wireless/brcm80211/Kconfig | 9 +
drivers/net/wireless/brcm80211/brcmfmac/bcmsdh.c | 97 ++++-
.../net/wireless/brcm80211/brcmfmac/bcmsdh_sdmmc.c | 113 +++++-
drivers/net/wireless/brcm80211/brcmfmac/dhd_sdio.c | 102 ++++-
.../net/wireless/brcm80211/brcmfmac/sdio_host.h | 22 +-
drivers/net/wireless/brcm80211/brcmsmac/main.c | 3 +-
drivers/net/wireless/ipw2x00/ipw2100.c | 24 +-
drivers/net/wireless/ipw2x00/ipw2200.c | 57 ++--
drivers/net/wireless/iwlwifi/iwl-1000.c | 8 +-
drivers/net/wireless/iwlwifi/iwl-2000.c | 16 +-
drivers/net/wireless/iwlwifi/iwl-5000.c | 11 +-
drivers/net/wireless/iwlwifi/iwl-6000.c | 10 +-
drivers/net/wireless/iwlwifi/iwl-agn-lib.c | 153 +++----
drivers/net/wireless/iwlwifi/iwl-agn.c | 41 +-
drivers/net/wireless/iwlwifi/iwl-agn.h | 2 +-
drivers/net/wireless/iwlwifi/iwl-commands.h | 21 +-
drivers/net/wireless/iwlwifi/iwl-dev.h | 1 +
drivers/net/wireless/iwlwifi/iwl-drv.c | 12 +-
drivers/net/wireless/iwlwifi/iwl-fh.h | 22 +-
drivers/net/wireless/iwlwifi/iwl-mac80211.c | 10 +-
drivers/net/wireless/iwlwifi/iwl-modparams.h | 8 +-
drivers/net/wireless/iwlwifi/iwl-prph.h | 27 +-
drivers/net/wireless/iwlwifi/iwl-trans-pcie.c | 1 +
drivers/net/wireless/libertas/firmware.c | 1 +
drivers/net/wireless/mwifiex/Kconfig | 4 +-
drivers/net/wireless/mwifiex/fw.h | 3 +-
drivers/net/wireless/mwifiex/sdio.c | 7 +
drivers/net/wireless/mwifiex/sdio.h | 1 +
drivers/net/wireless/rt2x00/rt2800.h | 2 +
drivers/net/wireless/rt2x00/rt2800lib.c | 7 +
drivers/net/wireless/rt2x00/rt2x00debug.c | 82 ++--
drivers/net/wireless/rt2x00/rt2x00debug.h | 1 +
drivers/net/wireless/rtlwifi/pci.c | 1 +
drivers/net/wireless/ti/wl1251/main.c | 1 +
drivers/net/wireless/ti/wl1251/sdio.c | 2 +-
include/linux/nl80211.h | 3 +
include/net/bluetooth/bluetooth.h | 14 +-
include/net/bluetooth/hci.h | 7 +
include/net/bluetooth/hci_core.h | 21 +-
include/net/bluetooth/l2cap.h | 78 +++-
include/net/bluetooth/mgmt.h | 9 +
include/net/bluetooth/smp.h | 2 +-
include/net/cfg80211.h | 18 +
include/net/mac80211.h | 17 +
net/bluetooth/hci_conn.c | 32 +-
net/bluetooth/hci_core.c | 206 +++++-----
net/bluetooth/hci_event.c | 61 +++-
net/bluetooth/hci_sysfs.c | 5 +-
net/bluetooth/l2cap_core.c | 454 ++++++++++++++++----
net/bluetooth/l2cap_sock.c | 33 +-
net/bluetooth/lib.c | 2 +
net/bluetooth/mgmt.c | 225 +++++++----
net/bluetooth/sco.c | 72 ++--
net/bluetooth/smp.c | 2 +-
net/mac80211/cfg.c | 182 ++++++++
net/mac80211/driver-ops.h | 37 ++
net/mac80211/driver-trace.h | 15 +
net/mac80211/ibss.c | 2 +-
net/mac80211/ieee80211_i.h | 5 +-
net/mac80211/iface.c | 4 +-
net/mac80211/mesh.c | 18 +-
net/mac80211/mesh_plink.c | 96 ++++-
net/mac80211/mlme.c | 4 +-
net/mac80211/sta_info.h | 1 +
net/mac80211/tx.c | 3 +-
net/mac80211/util.c | 9 +-
net/wireless/ethtool.c | 29 ++
net/wireless/mesh.c | 1 +
net/wireless/nl80211.c | 7 +-
net/wireless/scan.c | 6 +-
net/wireless/util.c | 3 +-
85 files changed, 1972 insertions(+), 665 deletions(-)
--
John W. Linville Someday the world will need a hero, and you
linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org might be all we have. Be ready.
^ permalink raw reply
* Re: [PATCH 2/2] tcp: cleanup tcp_try_coalesce
From: John W. Linville @ 2012-05-03 15:14 UTC (permalink / raw)
To: David Miller
Cc: eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w,
alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w,
alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w,
netdev-u79uwXL29TY76Z2rM5mHXA, edumazet-hpIqsD4AKlfQT0dZR+AlfA,
jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
linux-wireless-u79uwXL29TY76Z2rM5mHXA,
wey-yi.w.guy-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <20120503.012502.44731688706812861.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
On Thu, May 03, 2012 at 01:25:02AM -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Date: Thu, 03 May 2012 07:19:33 +0200
>
> > My last patch against iwlwifi is still waiting to make its way into
> > official tree.
> >
> > http://www.spinics.net/lists/netdev/msg192629.html
>
> John, please rectify this situation.
>
> The Intel Wireless folks said they would test it, but that was more
> than a month ago.
>
> It's not acceptable to let bug fixes rot for that long, I don't care
> what their special internal testing procedure is.
>
> If they give you further pushback, please just ignore them and apply
> Eric's fix directly.
>
> Thank you.
I imagine that this somehow got lost in the shuffle during the
merge window. That doesn't excuse it, of course.
It has waited long enough already, so I'll just go ahead and take it.
John
--
John W. Linville Someday the world will need a hero, and you
linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org might be all we have. Be ready.
^ permalink raw reply
* Re: [PATCH 00/16] Swap-over-NBD without deadlocking V9
From: Mel Gorman @ 2012-05-03 15:00 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
Peter Zijlstra, Mike Christie, Eric B Munson
In-Reply-To: <20120501152826.b970a098.akpm@linux-foundation.org>
On Tue, May 01, 2012 at 03:28:26PM -0700, Andrew Morton wrote:
>
> This patchset is far less ghastly than I feared/remembered/dreamed ;)
>
That might be the best comment the series ever received :)
> The mm parts, anyway. Are the net guys on board with it all?
They are cc'd but have not given any feedback in a while. That could be
because they are happy with it or because if they felt the MM parts were
blocking the series then it was unnecessary to review the network parts.
Any of the networking people care to comment?
--
Mel Gorman
SUSE Labs
^ permalink raw reply
* [PATCH 9/9] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack.
From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw)
To: netdev
Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell,
Neil Brown, J. Bruce Fields, linux-nfs
In-Reply-To: <1336056915.20716.96.camel@zakaz.uk.xensource.com>
This prevents an issue where an ACK is delayed, a retransmit is queued (either
at the RPC or TCP level) and the ACK arrives before the retransmission hits the
wire. If this happens to an NFS WRITE RPC then the write() system call
completes and the userspace process can continue, potentially modifying data
referenced by the retransmission before the retransmission occurs.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: linux-nfs@vger.kernel.org
Cc: netdev@vger.kernel.org
---
include/linux/sunrpc/xdr.h | 2 ++
include/linux/sunrpc/xprt.h | 5 ++++-
net/sunrpc/clnt.c | 27 ++++++++++++++++++++++-----
net/sunrpc/svcsock.c | 3 ++-
net/sunrpc/xprt.c | 12 ++++++++++++
net/sunrpc/xprtsock.c | 3 ++-
6 files changed, 44 insertions(+), 8 deletions(-)
diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h
index af70af3..ff1b121 100644
--- a/include/linux/sunrpc/xdr.h
+++ b/include/linux/sunrpc/xdr.h
@@ -16,6 +16,7 @@
#include <asm/byteorder.h>
#include <asm/unaligned.h>
#include <linux/scatterlist.h>
+#include <linux/skbuff.h>
/*
* Buffer adjustment
@@ -57,6 +58,7 @@ struct xdr_buf {
tail[1]; /* Appended after page data */
struct page ** pages; /* Array of contiguous pages */
+ struct skb_frag_destructor *destructor;
unsigned int page_base, /* Start of page data */
page_len, /* Length of page data */
flags; /* Flags for data disposition */
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 77d278d..e8d3f18 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -92,7 +92,10 @@ struct rpc_rqst {
/* A cookie used to track the
state of the transport
connection */
-
+ struct skb_frag_destructor destructor; /* SKB paged fragment
+ * destructor for
+ * transmitted pages*/
+
/*
* Partial send handling
*/
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 6797246..351bf3d 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -61,6 +61,7 @@ static void call_reserve(struct rpc_task *task);
static void call_reserveresult(struct rpc_task *task);
static void call_allocate(struct rpc_task *task);
static void call_decode(struct rpc_task *task);
+static void call_complete(struct rpc_task *task);
static void call_bind(struct rpc_task *task);
static void call_bind_status(struct rpc_task *task);
static void call_transmit(struct rpc_task *task);
@@ -1416,6 +1417,8 @@ rpc_xdr_encode(struct rpc_task *task)
(char *)req->rq_buffer + req->rq_callsize,
req->rq_rcvsize);
+ req->rq_snd_buf.destructor = &req->destructor;
+
p = rpc_encode_header(task);
if (p == NULL) {
printk(KERN_INFO "RPC: couldn't encode RPC header, exit EIO\n");
@@ -1581,6 +1584,7 @@ call_connect_status(struct rpc_task *task)
static void
call_transmit(struct rpc_task *task)
{
+ struct rpc_rqst *req = task->tk_rqstp;
dprint_status(task);
task->tk_action = call_status;
@@ -1614,8 +1618,8 @@ call_transmit(struct rpc_task *task)
call_transmit_status(task);
if (rpc_reply_expected(task))
return;
- task->tk_action = rpc_exit_task;
- rpc_wake_up_queued_task(&task->tk_xprt->pending, task);
+ task->tk_action = call_complete;
+ skb_frag_destructor_unref(&req->destructor);
}
/*
@@ -1688,7 +1692,8 @@ call_bc_transmit(struct rpc_task *task)
return;
}
- task->tk_action = rpc_exit_task;
+ task->tk_action = call_complete;
+ skb_frag_destructor_unref(&req->destructor);
if (task->tk_status < 0) {
printk(KERN_NOTICE "RPC: Could not send backchannel reply "
"error: %d\n", task->tk_status);
@@ -1728,7 +1733,6 @@ call_bc_transmit(struct rpc_task *task)
"error: %d\n", task->tk_status);
break;
}
- rpc_wake_up_queued_task(&req->rq_xprt->pending, task);
}
#endif /* CONFIG_SUNRPC_BACKCHANNEL */
@@ -1906,12 +1910,14 @@ call_decode(struct rpc_task *task)
return;
}
- task->tk_action = rpc_exit_task;
+ task->tk_action = call_complete;
if (decode) {
task->tk_status = rpcauth_unwrap_resp(task, decode, req, p,
task->tk_msg.rpc_resp);
}
+ rpc_sleep_on(&req->rq_xprt->pending, task, NULL);
+ skb_frag_destructor_unref(&req->destructor);
dprintk("RPC: %5u call_decode result %d\n", task->tk_pid,
task->tk_status);
return;
@@ -1926,6 +1932,17 @@ out_retry:
}
}
+/*
+ * 8. Wait for pages to be released by the network stack.
+ */
+static void
+call_complete(struct rpc_task *task)
+{
+ dprintk("RPC: %5u call_complete result %d\n",
+ task->tk_pid, task->tk_status);
+ task->tk_action = rpc_exit_task;
+}
+
static __be32 *
rpc_encode_header(struct rpc_task *task)
{
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index f6d8c73..1145929 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -198,7 +198,8 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
while (pglen > 0) {
if (slen == size)
flags = 0;
- result = kernel_sendpage(sock, *ppage, NULL, base, size, flags);
+ result = kernel_sendpage(sock, *ppage, xdr->destructor,
+ base, size, flags);
if (result > 0)
len += result;
if (result != size)
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 6fe2dce..f8418a0 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1108,6 +1108,16 @@ static inline void xprt_init_xid(struct rpc_xprt *xprt)
xprt->xid = net_random();
}
+static int xprt_complete_skb_pages(struct skb_frag_destructor *destroy)
+{
+ struct rpc_rqst *req =
+ container_of(destroy, struct rpc_rqst, destructor);
+
+ dprintk("RPC: %5u completing skb pages\n", req->rq_task->tk_pid);
+ rpc_wake_up_queued_task(&req->rq_xprt->pending, req->rq_task);
+ return 0;
+}
+
static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
{
struct rpc_rqst *req = task->tk_rqstp;
@@ -1120,6 +1130,8 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
req->rq_xid = xprt_alloc_xid(xprt);
req->rq_release_snd_buf = NULL;
xprt_reset_majortimeo(req);
+ atomic_set(&req->destructor.ref, 1);
+ req->destructor.destroy = &xprt_complete_skb_pages;
dprintk("RPC: %5u reserved req %p xid %08x\n", task->tk_pid,
req, ntohl(req->rq_xid));
}
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index f1995dc..44e07f3 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -408,7 +408,8 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
remainder -= len;
if (remainder != 0 || more)
flags |= MSG_MORE;
- err = sock->ops->sendpage(sock, *ppage, NULL, base, len, flags);
+ err = sock->ops->sendpage(sock, *ppage, xdr->destructor,
+ base, len, flags);
if (remainder == 0 || err != len)
break;
sent += err;
--
1.7.2.5
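
The completion scheme in this patch can be illustrated with a small userspace model. It is a sketch under stated assumptions, not the kernel implementation: struct frag_destructor mirrors the ref and destroy members the patch initialises, but uses a plain int rather than atomic_t (so it is not thread-safe), and struct request, request_init() and request_pages_done() are hypothetical stand-ins for rpc_rqst, xprt_request_init() and xprt_complete_skb_pages().

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Modelled on the destructor in the patch; int instead of atomic_t. */
struct frag_destructor {
	int ref;
	int (*destroy)(struct frag_destructor *d);
};

/* Hypothetical request that embeds its destructor, as rpc_rqst does. */
struct request {
	int id;
	int completed;		/* set once every reference is dropped */
	struct frag_destructor destructor;
};

/* Taken for each page handed to the (simulated) network stack. */
static void destructor_ref(struct frag_destructor *d)
{
	d->ref++;
}

/* Dropping the last reference runs the destroy callback. */
static void destructor_unref(struct frag_destructor *d)
{
	if (--d->ref == 0)
		d->destroy(d);
}

/* Recover the owning request from the embedded member and complete it. */
static int request_pages_done(struct frag_destructor *d)
{
	struct request *req = container_of(d, struct request, destructor);

	req->completed = 1;
	return 0;
}

static void request_init(struct request *req, int id)
{
	req->id = id;
	req->completed = 0;
	req->destructor.ref = 1;	/* the owner's own reference */
	req->destructor.destroy = request_pages_done;
}
```

Because the owner holds the initial reference, completion cannot fire while the request is still being transmitted; it fires only after both the owner and every outstanding page user have dropped their references, whichever happens last.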
^ permalink raw reply related
* [PATCH 2/9] net: Use SKB_WITH_OVERHEAD in build_skb
From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw)
To: netdev; +Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell
In-Reply-To: <1336056915.20716.96.camel@zakaz.uk.xensource.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
---
net/core/skbuff.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a056d7c..c60b603 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -263,7 +263,7 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
if (!skb)
return NULL;
- size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+ size = SKB_WITH_OVERHEAD(size);
memset(skb, 0, offsetof(struct sk_buff, tail));
skb->truesize = SKB_TRUESIZE(size);
--
1.7.2.5
^ permalink raw reply related
* [PATCH 4/9] skb: add skb_shinfo_init and use for both alloc_skb, build_skb and skb_recycle
From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw)
To: netdev; +Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell
In-Reply-To: <1336056915.20716.96.camel@zakaz.uk.xensource.com>
There is only one semantic change here which is that skb_recycle now does:
kmemcheck_annotate_variable(shinfo->destructor_arg)
I don't think it was erroneously missing before (since in the skb_recycle case
it will have happened previously) but I believe it is harmless to do it again
and this saves having a different copy of the same code for the recycle case.
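For illustration only, the shared initialisation pattern described above can be
sketched in userspace C. This is a minimal sketch, not the kernel code: struct
shared_info and shared_info_init() are hypothetical stand-ins for
struct skb_shared_info and skb_shinfo_init(), a plain int replaces atomic_t,
and the kmemcheck annotation is omitted.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct skb_shared_info. Fields placed before
 * 'dataref' are zeroed on init; fields after it are left untouched. */
struct shared_info {
	unsigned char nr_frags;
	unsigned short gso_size;
	void *frag_list;
	int dataref;          /* plain int here; the kernel uses atomic_t */
	void *destructor_arg; /* deliberately not cleared by init */
};

/* One helper that every allocation/recycle path can call, instead of each
 * path carrying its own copy of the memset + refcount setup. */
static void shared_info_init(struct shared_info *si)
{
	/* Clear only the leading fields, up to (not including) dataref. */
	memset(si, 0, offsetof(struct shared_info, dataref));
	si->dataref = 1;
}
```

Because the memset stops at offsetof(..., dataref), the field order in the
struct decides what gets cleared, which is why the kernel comment insists on
initialising shinfo "sequentially".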
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
net/core/skbuff.c | 30 +++++++++++++-----------------
1 files changed, 13 insertions(+), 17 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c60b603..e96f68b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -145,6 +145,16 @@ static void skb_under_panic(struct sk_buff *skb, int sz, void *here)
BUG();
}
+static void skb_shinfo_init(struct sk_buff *skb)
+{
+ struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+ /* make sure we initialize shinfo sequentially */
+ memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
+ atomic_set(&shinfo->dataref, 1);
+ kmemcheck_annotate_variable(shinfo->destructor_arg);
+}
+
/* Allocate a new skbuff. We do this ourselves so we can fill in a few
* 'private' fields and also do memory statistics to find all the
* [BEEP] leaks.
@@ -170,7 +180,6 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
int fclone, int node)
{
struct kmem_cache *cache;
- struct skb_shared_info *shinfo;
struct sk_buff *skb;
u8 *data;
@@ -210,11 +219,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
skb->mac_header = ~0U;
#endif
- /* make sure we initialize shinfo sequentially */
- shinfo = skb_shinfo(skb);
- memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
- atomic_set(&shinfo->dataref, 1);
- kmemcheck_annotate_variable(shinfo->destructor_arg);
+ skb_shinfo_init(skb);
if (fclone) {
struct sk_buff *child = skb + 1;
@@ -255,7 +260,6 @@ EXPORT_SYMBOL(__alloc_skb);
*/
struct sk_buff *build_skb(void *data, unsigned int frag_size)
{
- struct skb_shared_info *shinfo;
struct sk_buff *skb;
unsigned int size = frag_size ? : ksize(data);
@@ -277,11 +281,7 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
skb->mac_header = ~0U;
#endif
- /* make sure we initialize shinfo sequentially */
- shinfo = skb_shinfo(skb);
- memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
- atomic_set(&shinfo->dataref, 1);
- kmemcheck_annotate_variable(shinfo->destructor_arg);
+ skb_shinfo_init(skb);
return skb;
}
@@ -546,13 +546,9 @@ EXPORT_SYMBOL(consume_skb);
*/
void skb_recycle(struct sk_buff *skb)
{
- struct skb_shared_info *shinfo;
-
skb_release_head_state(skb);
- shinfo = skb_shinfo(skb);
- memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
- atomic_set(&shinfo->dataref, 1);
+ skb_shinfo_init(skb);
memset(skb, 0, offsetof(struct sk_buff, tail));
skb->data = skb->head + NET_SKB_PAD;
--
1.7.2.5
^ permalink raw reply related
* [PATCH 6/9] net: add support for per-paged-fragment destructors
From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw)
To: netdev
Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell,
Michał Mirosław
In-Reply-To: <1336056915.20716.96.camel@zakaz.uk.xensource.com>
Entities which care about the complete lifecycle of pages which they inject
into the network stack via an skb paged fragment can choose to set this
destructor in order to receive a callback when the stack is really finished
with a page (including all clones, retransmits, pull-ups, etc.).
This destructor will always be propagated alongside the struct page when
copying skb_frag_t->page. This is the reason I chose to embed the destructor in
a "struct { } page" within the skb_frag_t, rather than as a separate field,
since it allows existing code which propagates ->frags[N].page to Just
Work(tm).
When the destructor is present the page reference counting is done slightly
differently. No references are held by the network stack on the struct page (it
is up to the caller to manage this as necessary); instead the network stack
tracks references via the count embedded in the destructor structure. When this
reference count reaches zero the destructor is called and the caller can take
the necessary steps to release the page (i.e. release the struct page reference
itself).
The intention is that callers can use this callback to delay completion to
_their_ callers until the network stack has completely released the page, in
order to prevent use-after-free or modification of data pages which are still
in use by the stack.
It is allowable (indeed expected) for a caller to share a single destructor
instance between multiple pages injected into the stack e.g. a group of pages
included in a single higher level operation might share a destructor which is
used to complete that higher level operation.
Previous changes have ensured that, even with the increase in frag size, the
hot fields (nr_frags through to at least frags[0]) fit within and are aligned
to a 64 byte cache line.
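As a rough userspace sketch of the scheme described above (not the kernel
implementation): one destructor is shared by several pages belonging to a
single higher level operation, and the completion callback fires only when the
last reference is dropped. struct request, request_init() and
request_complete() are hypothetical names, a plain int replaces atomic_t, and
the initial count of 1 is an assumption modelled on how the sunrpc patch in
this series initialises its destructor.

```c
#include <stddef.h>

/* Userspace stand-in for struct skb_frag_destructor. */
struct frag_destructor {
	int ref; /* the kernel uses atomic_t */
	int (*destroy)(struct frag_destructor *d);
};

/* Hypothetical higher level operation that owns the destructor. */
struct request {
	struct frag_destructor destructor;
	int completed;
};

static void frag_destructor_ref(struct frag_destructor *d)
{
	d->ref++;
}

static void frag_destructor_unref(struct frag_destructor *d)
{
	if (d == NULL)
		return;
	if (--d->ref == 0)
		d->destroy(d);
}

/* Called once the stack has dropped every reference to every page. */
static int request_complete(struct frag_destructor *d)
{
	/* Open-coded container_of(): recover the enclosing request. */
	struct request *req = (struct request *)
		((char *)d - offsetof(struct request, destructor));

	req->completed = 1;
	return 0;
}

static void request_init(struct request *req)
{
	req->destructor.ref = 1; /* the caller's own reference */
	req->destructor.destroy = request_complete;
	req->completed = 0;
}
```

A caller would take one reference per page handed to the stack and drop its
own reference once submission is finished; completion is then delayed until
the stack releases the last page.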
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: netdev@vger.kernel.org
---
include/linux/skbuff.h | 50 ++++++++++++++++++++++++++++++++++++++++++++++-
net/core/skbuff.c | 18 +++++++++++++++++
net/ipv4/ip_output.c | 2 +-
net/ipv4/tcp.c | 4 +-
4 files changed, 69 insertions(+), 5 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 3698625..ccc7d93 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -168,9 +168,15 @@ struct sk_buff;
typedef struct skb_frag_struct skb_frag_t;
+struct skb_frag_destructor {
+ atomic_t ref;
+ int (*destroy)(struct skb_frag_destructor *destructor);
+};
+
struct skb_frag_struct {
struct {
struct page *p;
+ struct skb_frag_destructor *destructor;
} page;
#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
__u32 page_offset;
@@ -1232,6 +1238,31 @@ static inline int skb_pagelen(const struct sk_buff *skb)
}
/**
+ * skb_frag_set_destructor - set destructor for a paged fragment
+ * @skb: buffer containing fragment to be initialised
+ * @i: paged fragment index to initialise
+ * @destroy: the destructor to use for this fragment
+ *
+ * Sets @destroy as the destructor to be called when all references to
+ * the frag @i in @skb (tracked over skb_clone, retransmit, pull-ups,
+ * etc) are released.
+ *
+ * When a destructor is set then reference counting is performed on
+ * @destroy->ref. When the ref reaches zero then @destroy->destroy
+ * will be called. The caller is responsible for holding and managing
+ * any other references (such as the struct page reference count).
+ *
+ * This function must be called before any use of skb_frag_ref() or
+ * skb_frag_unref().
+ */
+static inline void skb_frag_set_destructor(struct sk_buff *skb, int i,
+ struct skb_frag_destructor *destroy)
+{
+ skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+ frag->page.destructor = destroy;
+}
+
+/**
* __skb_fill_page_desc - initialise a paged fragment in an skb
* @skb: buffer containing fragment to be initialised
* @i: paged fragment index to initialise
@@ -1250,6 +1281,7 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
frag->page.p = page;
+ frag->page.destructor = NULL;
frag->page_offset = off;
skb_frag_size_set(frag, size);
}
@@ -1766,6 +1798,9 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
return frag->page.p;
}
+extern void skb_frag_destructor_ref(struct skb_frag_destructor *destroy);
+extern void skb_frag_destructor_unref(struct skb_frag_destructor *destroy);
+
/**
* __skb_frag_ref - take an addition reference on a paged fragment.
* @frag: the paged fragment
@@ -1774,6 +1809,10 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
*/
static inline void __skb_frag_ref(skb_frag_t *frag)
{
+ if (unlikely(frag->page.destructor)) {
+ skb_frag_destructor_ref(frag->page.destructor);
+ return;
+ }
get_page(skb_frag_page(frag));
}
@@ -1797,6 +1836,10 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
*/
static inline void __skb_frag_unref(skb_frag_t *frag)
{
+ if (unlikely(frag->page.destructor)) {
+ skb_frag_destructor_unref(frag->page.destructor);
+ return;
+ }
put_page(skb_frag_page(frag));
}
@@ -1994,13 +2037,16 @@ static inline int skb_add_data(struct sk_buff *skb,
}
static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
- const struct page *page, int off)
+ const struct page *page,
+ const struct skb_frag_destructor *destroy,
+ int off)
{
if (i) {
const struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i - 1];
return page == skb_frag_page(frag) &&
- off == frag->page_offset + skb_frag_size(frag);
+ off == frag->page_offset + skb_frag_size(frag) &&
+ frag->page.destructor == destroy;
}
return false;
}
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fab6de0..945b807 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -353,6 +353,23 @@ struct sk_buff *dev_alloc_skb(unsigned int length)
}
EXPORT_SYMBOL(dev_alloc_skb);
+void skb_frag_destructor_ref(struct skb_frag_destructor *destroy)
+{
+ BUG_ON(destroy == NULL);
+ atomic_inc(&destroy->ref);
+}
+EXPORT_SYMBOL(skb_frag_destructor_ref);
+
+void skb_frag_destructor_unref(struct skb_frag_destructor *destroy)
+{
+ if (destroy == NULL)
+ return;
+
+ if (atomic_dec_and_test(&destroy->ref))
+ destroy->destroy(destroy);
+}
+EXPORT_SYMBOL(skb_frag_destructor_unref);
+
static void skb_drop_list(struct sk_buff **listp)
{
struct sk_buff *list = *listp;
@@ -2334,6 +2351,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
*/
if (!to ||
!skb_can_coalesce(tgt, to, skb_frag_page(fragfrom),
+ fragfrom->page.destructor,
fragfrom->page_offset)) {
merge = -1;
} else {
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 4910176..7652751 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1242,7 +1242,7 @@ ssize_t ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
i = skb_shinfo(skb)->nr_frags;
if (len > size)
len = size;
- if (skb_can_coalesce(skb, i, page, offset)) {
+ if (skb_can_coalesce(skb, i, page, NULL, offset)) {
skb_frag_size_add(&skb_shinfo(skb)->frags[i-1], len);
} else if (i < MAX_SKB_FRAGS) {
get_page(page);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9670af3..2d590ca 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -870,7 +870,7 @@ new_segment:
copy = size;
i = skb_shinfo(skb)->nr_frags;
- can_coalesce = skb_can_coalesce(skb, i, page, offset);
+ can_coalesce = skb_can_coalesce(skb, i, page, NULL, offset);
if (!can_coalesce && i >= MAX_SKB_FRAGS) {
tcp_mark_push(tp, skb);
goto new_segment;
@@ -1124,7 +1124,7 @@ new_segment:
off = sk->sk_sndmsg_off;
- if (skb_can_coalesce(skb, i, page, off) &&
+ if (skb_can_coalesce(skb, i, page, NULL, off) &&
off != PAGE_SIZE) {
/* We can extend the last page
* fragment. */
--
1.7.2.5
^ permalink raw reply related
* [PATCH 3/9] chelsio: use SKB_WITH_OVERHEAD
From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw)
To: netdev
Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell,
Divy Le Ray
In-Reply-To: <1336056915.20716.96.camel@zakaz.uk.xensource.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Divy Le Ray <divy@chelsio.com>
---
drivers/net/ethernet/chelsio/cxgb/sge.c | 3 +--
drivers/net/ethernet/chelsio/cxgb3/sge.c | 6 +++---
2 files changed, 4 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/chelsio/cxgb/sge.c b/drivers/net/ethernet/chelsio/cxgb/sge.c
index 47a8435..52373db 100644
--- a/drivers/net/ethernet/chelsio/cxgb/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb/sge.c
@@ -599,8 +599,7 @@ static int alloc_rx_resources(struct sge *sge, struct sge_params *p)
sizeof(struct cpl_rx_data) +
sge->freelQ[!sge->jumbo_fl].dma_offset;
- size = (16 * 1024) -
- SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+ size = SKB_WITH_OVERHEAD(16 * 1024);
sge->freelQ[sge->jumbo_fl].rx_buffer_size = size;
diff --git a/drivers/net/ethernet/chelsio/cxgb3/sge.c b/drivers/net/ethernet/chelsio/cxgb3/sge.c
index cfb60e1..b804470 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb3/sge.c
@@ -3043,7 +3043,7 @@ int t3_sge_alloc_qset(struct adapter *adapter, unsigned int id, int nports,
q->fl[1].buf_size = FL1_PG_CHUNK_SIZE;
#else
q->fl[1].buf_size = is_offload(adapter) ?
- (16 * 1024) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) :
+ SKB_WITH_OVERHEAD(16 * 1024) :
MAX_FRAME_SIZE + 2 + sizeof(struct cpl_rx_pkt);
#endif
@@ -3282,8 +3282,8 @@ void t3_sge_prep(struct adapter *adap, struct sge_params *p)
{
int i;
- p->max_pkt_size = (16 * 1024) - sizeof(struct cpl_rx_data) -
- SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+ p->max_pkt_size =
+ SKB_WITH_OVERHEAD((16*1024) - sizeof(struct cpl_rx_data));
for (i = 0; i < SGE_QSETS; ++i) {
struct qset_params *q = p->qset + i;
--
1.7.2.5
^ permalink raw reply related
* Re: [PATCH 05/11] mm: swap: Implement generic handler for swap_activate
From: Mel Gorman @ 2012-05-03 14:57 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux-MM, Linux-Netdev, Linux-NFS, LKML, David Miller,
Trond Myklebust, Neil Brown, Christoph Hellwig, Peter Zijlstra,
Mike Christie, Eric B Munson
In-Reply-To: <20120501155747.368a1d36.akpm@linux-foundation.org>
On Tue, May 01, 2012 at 03:57:47PM -0700, Andrew Morton wrote:
> On Mon, 16 Apr 2012 13:17:49 +0100
> Mel Gorman <mgorman@suse.de> wrote:
>
> > The version of swap_activate introduced is sufficient for swap-over-NFS
> > but would not provide enough information to implement a generic handler.
> > This patch shuffles things slightly to ensure the same information is
> > available for aops->swap_activate() as is available to the core.
> >
> > No functionality change.
> >
> > ...
> >
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -587,6 +587,8 @@ typedef struct {
> > typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
> > unsigned long, unsigned long);
> >
> > +struct swap_info_struct;
>
> Please put forward declarations at top-of-file. To prevent accidental
> duplication later on.
>
Done.
> > struct address_space_operations {
> > int (*writepage)(struct page *page, struct writeback_control *wbc);
> > int (*readpage)(struct file *, struct page *);
> >
> > ...
> >
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
>
> Have you tested all this code with CONFIG_SWAP=n?
>
Emm, it builds. That counts, right?
> Have you sought to minimise additional new code when CONFIG_SWAP=n?
>
Not specifically, but generic_swapfile_activate() is defined in page_io.c
and that is built only if CONFIG_SWAP=y. Similarly swapon is in
swapfile.c which is only built when swap is enabled.
--
Mel Gorman
SUSE Labs
^ permalink raw reply
* [PATCH 8/9] net: add paged frag destructor support to kernel_sendpage.
From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw)
To: netdev; +Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell
In-Reply-To: <1336056915.20716.96.camel@zakaz.uk.xensource.com>
This requires adding a new argument to various sendpage hooks up and down the
stack. At the moment this parameter is always NULL.
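A simplified userspace sketch of what threading the extra argument looks like,
under stated assumptions: struct sock_ops, kernel_sendpage_sketch(),
no_sendpage() and count_completion() are hypothetical stand-ins, the "page" is
a plain byte buffer, and the fallback copies into a fixed buffer. The one
detail taken from the patch itself is that the copying fallback drops the
destructor reference straight away, as sock_no_sendpage() does after
kernel_sendmsg().

```c
#include <stddef.h>
#include <string.h>

struct frag_destructor {
	int ref; /* the kernel uses atomic_t */
	int (*destroy)(struct frag_destructor *d);
};

static void frag_destructor_unref(struct frag_destructor *d)
{
	if (d == NULL) /* existing callers simply pass NULL */
		return;
	if (--d->ref == 0)
		d->destroy(d);
}

/* Hypothetical ops table: the hook gains a destructor argument. */
struct sock_ops {
	long (*sendpage)(const char *page, struct frag_destructor *destroy,
			 int offset, size_t size);
};

static char copy_buf[64];
static int completions;

/* Hypothetical callback used only to observe when the destructor fires. */
static int count_completion(struct frag_destructor *d)
{
	(void)d;
	completions++;
	return 0;
}

/* Fallback path: the data is copied, so the page is no longer needed and
 * the reference can be dropped immediately. */
static long no_sendpage(const char *page, struct frag_destructor *destroy,
			int offset, size_t size)
{
	if (size > sizeof(copy_buf))
		size = sizeof(copy_buf);
	memcpy(copy_buf, page + offset, size);
	frag_destructor_unref(destroy);
	return (long)size;
}

/* Entry point: pass the destructor through to whichever path is taken. */
static long kernel_sendpage_sketch(const struct sock_ops *ops,
				   const char *page,
				   struct frag_destructor *destroy,
				   int offset, size_t size)
{
	if (ops->sendpage)
		return ops->sendpage(page, destroy, offset, size);
	return no_sendpage(page, destroy, offset, size);
}
```

Passing NULL keeps the old behaviour, which is why the bulk of this patch is
mechanical: each existing call site just gains a NULL argument.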
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
---
drivers/block/drbd/drbd_main.c | 1 +
drivers/scsi/iscsi_tcp.c | 4 ++--
drivers/scsi/iscsi_tcp.h | 3 ++-
drivers/target/iscsi/iscsi_target_util.c | 3 ++-
fs/dlm/lowcomms.c | 4 ++--
fs/ocfs2/cluster/tcp.c | 1 +
include/linux/net.h | 6 +++++-
include/net/inet_common.h | 4 +++-
include/net/ip.h | 4 +++-
include/net/sock.h | 8 +++++---
include/net/tcp.h | 4 +++-
net/ceph/messenger.c | 2 +-
net/core/sock.c | 6 +++++-
net/ipv4/af_inet.c | 9 ++++++---
net/ipv4/ip_output.c | 6 ++++--
net/ipv4/tcp.c | 24 +++++++++++++++---------
net/ipv4/udp.c | 11 ++++++-----
net/ipv4/udp_impl.h | 5 +++--
net/rds/tcp_send.c | 1 +
net/socket.c | 11 +++++++----
net/sunrpc/svcsock.c | 6 +++---
net/sunrpc/xprtsock.c | 2 +-
22 files changed, 81 insertions(+), 44 deletions(-)
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 211fc44..e70ba0c 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2584,6 +2584,7 @@ static int _drbd_send_page(struct drbd_conf *mdev, struct page *page,
set_fs(KERNEL_DS);
do {
sent = mdev->data.socket->ops->sendpage(mdev->data.socket, page,
+ NULL,
offset, len,
msg_flags);
if (sent == -EAGAIN) {
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 9220861..724d32538 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -284,8 +284,8 @@ static int iscsi_sw_tcp_xmit_segment(struct iscsi_tcp_conn *tcp_conn,
if (!segment->data) {
sg = segment->sg;
offset += segment->sg_offset + sg->offset;
- r = tcp_sw_conn->sendpage(sk, sg_page(sg), offset,
- copy, flags);
+ r = tcp_sw_conn->sendpage(sk, sg_page(sg), NULL,
+ offset, copy, flags);
} else {
struct msghdr msg = { .msg_flags = flags };
struct kvec iov = {
diff --git a/drivers/scsi/iscsi_tcp.h b/drivers/scsi/iscsi_tcp.h
index 666fe09..1e23265 100644
--- a/drivers/scsi/iscsi_tcp.h
+++ b/drivers/scsi/iscsi_tcp.h
@@ -52,7 +52,8 @@ struct iscsi_sw_tcp_conn {
uint32_t sendpage_failures_cnt;
uint32_t discontiguous_hdr_cnt;
- ssize_t (*sendpage)(struct socket *, struct page *, int, size_t, int);
+ ssize_t (*sendpage)(struct socket *, struct page *,
+ struct skb_frag_destructor *, int, size_t, int);
};
struct iscsi_sw_tcp_host {
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 4eba86d..d876dae 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -1323,7 +1323,8 @@ send_hdr:
u32 sub_len = min_t(u32, data_len, space);
send_pg:
tx_sent = conn->sock->ops->sendpage(conn->sock,
- sg_page(sg), sg->offset + offset, sub_len, 0);
+ sg_page(sg), NULL,
+ sg->offset + offset, sub_len, 0);
if (tx_sent != sub_len) {
if (tx_sent == -EAGAIN) {
pr_err("tcp_sendpage() returned"
diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 133ef6d..0673cea 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1336,8 +1336,8 @@ static void send_to_sock(struct connection *con)
ret = 0;
if (len) {
- ret = kernel_sendpage(con->sock, e->page, offset, len,
- msg_flags);
+ ret = kernel_sendpage(con->sock, e->page, NULL,
+ offset, len, msg_flags);
if (ret == -EAGAIN || ret == 0) {
if (ret == -EAGAIN &&
test_bit(SOCK_ASYNC_NOSPACE, &con->sock->flags) &&
diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index 1bfe880..c82a711 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -983,6 +983,7 @@ static void o2net_sendpage(struct o2net_sock_container *sc,
mutex_lock(&sc->sc_send_lock);
ret = sc->sc_sock->ops->sendpage(sc->sc_sock,
virt_to_page(kmalloced_virt),
+ NULL,
(long)kmalloced_virt & ~PAGE_MASK,
size, MSG_DONTWAIT);
mutex_unlock(&sc->sc_send_lock);
diff --git a/include/linux/net.h b/include/linux/net.h
index be60c7f..d9b0d648 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -157,6 +157,7 @@ struct kiocb;
struct sockaddr;
struct msghdr;
struct module;
+struct skb_frag_destructor;
struct proto_ops {
int family;
@@ -203,6 +204,7 @@ struct proto_ops {
int (*mmap) (struct file *file, struct socket *sock,
struct vm_area_struct * vma);
ssize_t (*sendpage) (struct socket *sock, struct page *page,
+ struct skb_frag_destructor *destroy,
int offset, size_t size, int flags);
ssize_t (*splice_read)(struct socket *sock, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len, unsigned int flags);
@@ -274,7 +276,9 @@ extern int kernel_getsockopt(struct socket *sock, int level, int optname,
char *optval, int *optlen);
extern int kernel_setsockopt(struct socket *sock, int level, int optname,
char *optval, unsigned int optlen);
-extern int kernel_sendpage(struct socket *sock, struct page *page, int offset,
+extern int kernel_sendpage(struct socket *sock, struct page *page,
+ struct skb_frag_destructor *destroy,
+ int offset,
size_t size, int flags);
extern int kernel_sock_ioctl(struct socket *sock, int cmd, unsigned long arg);
extern int kernel_sock_shutdown(struct socket *sock,
diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 22fac98..91cd8d0 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -21,7 +21,9 @@ extern int inet_dgram_connect(struct socket *sock, struct sockaddr * uaddr,
extern int inet_accept(struct socket *sock, struct socket *newsock, int flags);
extern int inet_sendmsg(struct kiocb *iocb, struct socket *sock,
struct msghdr *msg, size_t size);
-extern ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
+extern ssize_t inet_sendpage(struct socket *sock, struct page *page,
+ struct skb_frag_destructor *frag,
+ int offset,
size_t size, int flags);
extern int inet_recvmsg(struct kiocb *iocb, struct socket *sock,
struct msghdr *msg, size_t size, int flags);
diff --git a/include/net/ip.h b/include/net/ip.h
index 94ddb69c..dbd7ecb 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -114,7 +114,9 @@ extern int ip_append_data(struct sock *sk, struct flowi4 *fl4,
struct rtable **rt,
unsigned int flags);
extern int ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb);
-extern ssize_t ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
+extern ssize_t ip_append_page(struct sock *sk, struct flowi4 *fl4,
+ struct page *page,
+ struct skb_frag_destructor *destroy,
int offset, size_t size, int flags);
extern struct sk_buff *__ip_make_skb(struct sock *sk,
struct flowi4 *fl4,
diff --git a/include/net/sock.h b/include/net/sock.h
index 68a2834..c999f48 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -848,6 +848,7 @@ struct proto {
size_t len, int noblock, int flags,
int *addr_len);
int (*sendpage)(struct sock *sk, struct page *page,
+ struct skb_frag_destructor *destroy,
int offset, size_t size, int flags);
int (*bind)(struct sock *sk,
struct sockaddr *uaddr, int addr_len);
@@ -1466,9 +1467,10 @@ extern int sock_no_mmap(struct file *file,
struct socket *sock,
struct vm_area_struct *vma);
extern ssize_t sock_no_sendpage(struct socket *sock,
- struct page *page,
- int offset, size_t size,
- int flags);
+ struct page *page,
+ struct skb_frag_destructor *destroy,
+ int offset, size_t size,
+ int flags);
/*
* Functions to fill in entries in struct proto_ops when a protocol
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0fb84de..81dbfde8 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -331,7 +331,9 @@ extern void *tcp_v4_tw_get_peer(struct sock *sk);
extern int tcp_v4_tw_remember_stamp(struct inet_timewait_sock *tw);
extern int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
size_t size);
-extern int tcp_sendpage(struct sock *sk, struct page *page, int offset,
+extern int tcp_sendpage(struct sock *sk, struct page *page,
+ struct skb_frag_destructor *destroy,
+ int offset,
size_t size, int flags);
extern int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg);
extern int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 36fa6bf..b355be1 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -320,7 +320,7 @@ static int ceph_tcp_sendpage(struct socket *sock, struct page *page,
int flags = MSG_DONTWAIT | MSG_NOSIGNAL | (more ? MSG_MORE : MSG_EOR);
int ret;
- ret = kernel_sendpage(sock, page, offset, size, flags);
+ ret = kernel_sendpage(sock, page, NULL, offset, size, flags);
if (ret == -EAGAIN)
ret = 0;
diff --git a/net/core/sock.c b/net/core/sock.c
index 1a88351..cffff5f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1953,7 +1953,9 @@ int sock_no_mmap(struct file *file, struct socket *sock, struct vm_area_struct *
}
EXPORT_SYMBOL(sock_no_mmap);
-ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset, size_t size, int flags)
+ssize_t sock_no_sendpage(struct socket *sock, struct page *page,
+ struct skb_frag_destructor *destroy,
+ int offset, size_t size, int flags)
{
ssize_t res;
struct msghdr msg = {.msg_flags = flags};
@@ -1963,6 +1965,8 @@ ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset, siz
iov.iov_len = size;
res = kernel_sendmsg(sock, &msg, &iov, 1, size);
kunmap(page);
+ /* kernel_sendmsg copies so we can destroy immediately */
+ skb_frag_destructor_unref(destroy);
return res;
}
EXPORT_SYMBOL(sock_no_sendpage);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index c8f7aee..b1caf89 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -747,7 +747,9 @@ int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
}
EXPORT_SYMBOL(inet_sendmsg);
-ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
+ssize_t inet_sendpage(struct socket *sock, struct page *page,
+ struct skb_frag_destructor *destroy,
+ int offset,
size_t size, int flags)
{
struct sock *sk = sock->sk;
@@ -760,8 +762,9 @@ ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
return -EAGAIN;
if (sk->sk_prot->sendpage)
- return sk->sk_prot->sendpage(sk, page, offset, size, flags);
- return sock_no_sendpage(sock, page, offset, size, flags);
+ return sk->sk_prot->sendpage(sk, page, destroy,
+ offset, size, flags);
+ return sock_no_sendpage(sock, page, destroy, offset, size, flags);
}
EXPORT_SYMBOL(inet_sendpage);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 7652751..877ff62 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1129,6 +1129,7 @@ int ip_append_data(struct sock *sk, struct flowi4 *fl4,
}
ssize_t ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
+ struct skb_frag_destructor *destroy,
int offset, size_t size, int flags)
{
struct inet_sock *inet = inet_sk(sk);
@@ -1242,11 +1243,12 @@ ssize_t ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
i = skb_shinfo(skb)->nr_frags;
if (len > size)
len = size;
- if (skb_can_coalesce(skb, i, page, NULL, offset)) {
+ if (skb_can_coalesce(skb, i, page, destroy, offset)) {
skb_frag_size_add(&skb_shinfo(skb)->frags[i-1], len);
} else if (i < MAX_SKB_FRAGS) {
- get_page(page);
skb_fill_page_desc(skb, i, page, offset, len);
+ skb_frag_set_destructor(skb, i, destroy);
+ skb_frag_ref(skb, i);
} else {
err = -EMSGSIZE;
goto error;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 2d590ca..bee7864 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -822,8 +822,11 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
return mss_now;
}
-static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
- size_t psize, int flags)
+static ssize_t do_tcp_sendpages(struct sock *sk,
+ struct page **pages,
+ struct skb_frag_destructor *destroy,
+ int poffset,
+ size_t psize, int flags)
{
struct tcp_sock *tp = tcp_sk(sk);
int mss_now, size_goal;
@@ -870,7 +873,7 @@ new_segment:
copy = size;
i = skb_shinfo(skb)->nr_frags;
- can_coalesce = skb_can_coalesce(skb, i, page, NULL, offset);
+ can_coalesce = skb_can_coalesce(skb, i, page, destroy, offset);
if (!can_coalesce && i >= MAX_SKB_FRAGS) {
tcp_mark_push(tp, skb);
goto new_segment;
@@ -881,8 +884,9 @@ new_segment:
if (can_coalesce) {
skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
} else {
- get_page(page);
skb_fill_page_desc(skb, i, page, offset, copy);
+ skb_frag_set_destructor(skb, i, destroy);
+ skb_frag_ref(skb, i);
}
skb->len += copy;
@@ -937,18 +941,20 @@ out_err:
return sk_stream_error(sk, flags, err);
}
-int tcp_sendpage(struct sock *sk, struct page *page, int offset,
- size_t size, int flags)
+int tcp_sendpage(struct sock *sk, struct page *page,
+ struct skb_frag_destructor *destroy,
+ int offset, size_t size, int flags)
{
ssize_t res;
if (!(sk->sk_route_caps & NETIF_F_SG) ||
!(sk->sk_route_caps & NETIF_F_ALL_CSUM))
- return sock_no_sendpage(sk->sk_socket, page, offset, size,
- flags);
+ return sock_no_sendpage(sk->sk_socket, page, destroy,
+ offset, size, flags);
lock_sock(sk);
- res = do_tcp_sendpages(sk, &page, offset, size, flags);
+ res = do_tcp_sendpages(sk, &page, destroy,
+ offset, size, flags);
release_sock(sk);
return res;
}
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 279fd08..c69aa65 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1032,8 +1032,9 @@ do_confirm:
}
EXPORT_SYMBOL(udp_sendmsg);
-int udp_sendpage(struct sock *sk, struct page *page, int offset,
- size_t size, int flags)
+int udp_sendpage(struct sock *sk, struct page *page,
+ struct skb_frag_destructor *destroy,
+ int offset, size_t size, int flags)
{
struct inet_sock *inet = inet_sk(sk);
struct udp_sock *up = udp_sk(sk);
@@ -1061,11 +1062,11 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
}
ret = ip_append_page(sk, &inet->cork.fl.u.ip4,
- page, offset, size, flags);
+ page, destroy, offset, size, flags);
if (ret == -EOPNOTSUPP) {
release_sock(sk);
- return sock_no_sendpage(sk->sk_socket, page, offset,
- size, flags);
+ return sock_no_sendpage(sk->sk_socket, page, destroy,
+ offset, size, flags);
}
if (ret < 0) {
udp_flush_pending_frames(sk);
diff --git a/net/ipv4/udp_impl.h b/net/ipv4/udp_impl.h
index 5a681e2..aa8eca2 100644
--- a/net/ipv4/udp_impl.h
+++ b/net/ipv4/udp_impl.h
@@ -23,8 +23,9 @@ extern int compat_udp_getsockopt(struct sock *sk, int level, int optname,
#endif
extern int udp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
size_t len, int noblock, int flags, int *addr_len);
-extern int udp_sendpage(struct sock *sk, struct page *page, int offset,
- size_t size, int flags);
+extern int udp_sendpage(struct sock *sk, struct page *page,
+ struct skb_frag_destructor *destroy,
+ int offset, size_t size, int flags);
extern int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
extern void udp_destroy_sock(struct sock *sk);
diff --git a/net/rds/tcp_send.c b/net/rds/tcp_send.c
index 1b4fd68..71503ad 100644
--- a/net/rds/tcp_send.c
+++ b/net/rds/tcp_send.c
@@ -119,6 +119,7 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
while (sg < rm->data.op_nents) {
ret = tc->t_sock->ops->sendpage(tc->t_sock,
sg_page(&rm->data.op_sg[sg]),
+ NULL,
rm->data.op_sg[sg].offset + off,
rm->data.op_sg[sg].length - off,
MSG_DONTWAIT|MSG_NOSIGNAL);
diff --git a/net/socket.c b/net/socket.c
index d3aaa4f..f92c9c2 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -815,7 +815,7 @@ static ssize_t sock_sendpage(struct file *file, struct page *page,
/* more is a combination of MSG_MORE and MSG_SENDPAGE_NOTLAST */
flags |= more;
- return kernel_sendpage(sock, page, offset, size, flags);
+ return kernel_sendpage(sock, page, NULL, offset, size, flags);
}
static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
@@ -3349,15 +3349,18 @@ int kernel_setsockopt(struct socket *sock, int level, int optname,
}
EXPORT_SYMBOL(kernel_setsockopt);
-int kernel_sendpage(struct socket *sock, struct page *page, int offset,
+int kernel_sendpage(struct socket *sock, struct page *page,
+ struct skb_frag_destructor *destroy,
+ int offset,
size_t size, int flags)
{
sock_update_classid(sock->sk);
if (sock->ops->sendpage)
- return sock->ops->sendpage(sock, page, offset, size, flags);
+ return sock->ops->sendpage(sock, page, destroy,
+ offset, size, flags);
- return sock_no_sendpage(sock, page, offset, size, flags);
+ return sock_no_sendpage(sock, page, destroy, offset, size, flags);
}
EXPORT_SYMBOL(kernel_sendpage);
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index f0132b2..f6d8c73 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -185,7 +185,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
/* send head */
if (slen == xdr->head[0].iov_len)
flags = 0;
- len = kernel_sendpage(sock, headpage, headoffset,
+ len = kernel_sendpage(sock, headpage, NULL, headoffset,
xdr->head[0].iov_len, flags);
if (len != xdr->head[0].iov_len)
goto out;
@@ -198,7 +198,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
while (pglen > 0) {
if (slen == size)
flags = 0;
- result = kernel_sendpage(sock, *ppage, base, size, flags);
+ result = kernel_sendpage(sock, *ppage, NULL, base, size, flags);
if (result > 0)
len += result;
if (result != size)
@@ -212,7 +212,7 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
/* send tail */
if (xdr->tail[0].iov_len) {
- result = kernel_sendpage(sock, tailpage, tailoffset,
+ result = kernel_sendpage(sock, tailpage, NULL, tailoffset,
xdr->tail[0].iov_len, 0);
if (result > 0)
len += result;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 890b03f..f1995dc 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -408,7 +408,7 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
remainder -= len;
if (remainder != 0 || more)
flags |= MSG_MORE;
- err = sock->ops->sendpage(sock, *ppage, base, len, flags);
+ err = sock->ops->sendpage(sock, *ppage, NULL, base, len, flags);
if (remainder == 0 || err != len)
break;
sent += err;
--
1.7.2.5
* [PATCH 5/9] net: pad skb data and shinfo as a whole rather than individually
From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw)
To: netdev; +Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell
In-Reply-To: <1336056915.20716.96.camel@zakaz.uk.xensource.com>
This reduces the minimum overhead required for this allocation such that the
shinfo can be grown in the following patch without overflowing 2048 bytes for a
1500 byte frame.
Reducing this overhead while also growing the shinfo means that sometimes the
tail end of the data can end up in the same cache line as the beginning of the
shinfo. Specifically in the case of the 64 byte cache lines on a 64 bit system
the first 8 bytes of shinfo can overlap the tail cacheline of the data. In many
cases the allocation slop means that there is no overlap.
In order to ensure that the hot struct members remain on the same 64 byte cache
line, move the "destructor_arg" member to the front. This member is not used on
any hot path, so it is a good choice to potentially be on a separate cache line
(which may additionally be shared with skb->data).
Also, rather than relying on knowledge about the size and layout of the rest of
the shinfo to ensure that the right parts of the shinfo are aligned, decree that
nr_frags will be cache aligned and therefore that the 64 bytes starting at
nr_frags should contain the hot struct members.
All this avoids hitting an extra cache line on hot operations such as
kfree_skb.
On 4k pages this motion and alignment strategy (along with the following frag
size increase) results in the shinfo abutting the very end of the allocation.
On larger pages (where SKB_MAX_FRAGS can be smaller) it means that we still
correctly align the hot data without needing to make assumptions about the data
layout outside of the hot 64-bytes of the shinfo.
Explicitly aligning nr_frags, rather than relying on analysis of the shinfo
layout, was suggested by Alexander Duyck <alexander.h.duyck@intel.com>.
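The size arithmetic described above can be tried outside the kernel. What follows is only a userspace sketch under stated assumptions: struct demo_shinfo is a made-up stand-in for skb_shared_info (one cold pointer first, then the hot members from nr_frags onwards), every demo_* name is hypothetical, a 64 byte cache line is assumed, and the buffer is assumed to start on a cache line boundary with the shinfo abutting its end.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_CACHE_BYTES 64u	/* assumed cache line size */
#define DEMO_ALIGN(x) \
	(((size_t)(x) + (DEMO_CACHE_BYTES - 1)) & ~((size_t)DEMO_CACHE_BYTES - 1))

/* Hypothetical stand-ins, not the kernel structures. */
struct demo_frag {
	void *page;
	unsigned int offset;
	unsigned int size;
};

struct demo_shinfo {
	void *destructor_arg;	/* cold member, placed before the aligned part */
	unsigned char nr_frags;	/* first hot member, to be cache line aligned */
	unsigned char tx_flags;
	unsigned short gso_size;
	int dataref;
	struct demo_frag frags[4];
};

#define DEMO_HOT_OFFSET offsetof(struct demo_shinfo, nr_frags)

/* Same idea as the BUILD_BUG_ON in the patch: nr_frags up to and
 * including frags[0] must fit into one cache line. */
_Static_assert(offsetof(struct demo_shinfo, frags[1]) - DEMO_HOT_OFFSET
	       <= DEMO_CACHE_BYTES, "hot members exceed one cache line");

/* Modelled on SKB_SHINFO_SIZE: pad only the part from nr_frags onwards. */
size_t demo_shinfo_size(void)
{
	return DEMO_ALIGN(sizeof(struct demo_shinfo) - DEMO_HOT_OFFSET)
		+ DEMO_HOT_OFFSET;
}

/* Modelled on the new SKB_ALLOCSIZE: pad data and shinfo as a whole. */
size_t demo_allocsize(size_t data_len)
{
	return DEMO_ALIGN(data_len + demo_shinfo_size());
}

/* Offset of nr_frags from the start of the buffer when the shinfo
 * abuts the very end of the allocation. */
size_t demo_nr_frags_offset(size_t data_len)
{
	size_t off = demo_allocsize(data_len) - demo_shinfo_size()
		+ DEMO_HOT_OFFSET;

	assert(off % DEMO_CACHE_BYTES == 0);
	return off;
}
```

Because both the total allocation and the part of the shinfo from nr_frags onwards are multiples of the cache line size, nr_frags lands on a cache line boundary for any data length, while the cold pointer in front of it may share a line with the tail of the data.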
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
---
include/linux/skbuff.h | 50 +++++++++++++++++++++++++++++------------------
net/core/skbuff.c | 9 +++++++-
2 files changed, 39 insertions(+), 20 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 19e348f..3698625 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -41,19 +41,24 @@
#define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \
~(SMP_CACHE_BYTES - 1))
-/* maximum data size which can fit into an allocation of X bytes */
-#define SKB_WITH_OVERHEAD(X) \
- ((X) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+
/*
- * minimum allocation size required for an skb containing X bytes of data
- *
- * We do our best to align skb_shared_info on a separate cache
- * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
- * aligned memory blocks, unless SLUB/SLAB debug is enabled. Both
- * skb->head and skb_shared_info are cache line aligned.
+ * We do our best to align the hot members of skb_shared_info on a
+ * separate cache line. We explicitly align the nr_frags field and
+ * arrange that the order of the fields in skb_shared_info is such
+ * that the interesting fields are nr_frags onwards and are therefore
+ * cache line aligned.
*/
-#define SKB_ALLOCSIZE(X) \
- (SKB_DATA_ALIGN((X)) + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+#define SKB_SHINFO_SIZE \
+ (SKB_DATA_ALIGN(sizeof(struct skb_shared_info) \
+ - offsetof(struct skb_shared_info, nr_frags)) \
+ + offsetof(struct skb_shared_info, nr_frags))
+
+/* maximum data size which can fit into an allocation of X bytes */
+#define SKB_WITH_OVERHEAD(X) ((X) - SKB_SHINFO_SIZE)
+
+/* minimum allocation size required for an skb containing X bytes of data */
+#define SKB_ALLOCSIZE(X) (SKB_DATA_ALIGN((X) + SKB_SHINFO_SIZE))
#define SKB_MAX_ORDER(X, ORDER) \
SKB_WITH_OVERHEAD((PAGE_SIZE << (ORDER)) - (X))
@@ -63,7 +68,7 @@
/* return minimum truesize of one skb containing X bytes of data */
#define SKB_TRUESIZE(X) ((X) + \
SKB_DATA_ALIGN(sizeof(struct sk_buff)) + \
- SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+ SKB_SHINFO_SIZE)
/* A. Checksumming of received packets by device.
*
@@ -263,6 +268,19 @@ struct ubuf_info {
* the end of the header data, ie. at skb->end.
*/
struct skb_shared_info {
+ /* Intermediate layers must ensure that destructor_arg
+ * remains valid until skb destructor */
+ void *destructor_arg;
+
+ /* Warning: all fields from here until dataref are cleared in
+ * skb_shinfo_init() (called from __alloc_skb, build_skb,
+ * skb_recycle, etc).
+ *
+ * nr_frags will always be aligned to the start of a cache
+ * line. It is intended that everything from nr_frags until at
+ * least frags[0] (inclusive) should fit into the same 64-byte
+ * cache line.
+ */
unsigned char nr_frags;
__u8 tx_flags;
unsigned short gso_size;
@@ -273,15 +291,9 @@ struct skb_shared_info {
struct skb_shared_hwtstamps hwtstamps;
__be32 ip6_frag_id;
- /*
- * Warning : all fields before dataref are cleared in __alloc_skb()
- */
+ /* fields from nr_frags until dataref are cleared in skb_shinfo_init */
atomic_t dataref;
- /* Intermediate layers must ensure that destructor_arg
- * remains valid until skb destructor */
- void * destructor_arg;
-
/* must be last field, see pskb_expand_head() */
skb_frag_t frags[MAX_SKB_FRAGS];
};
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e96f68b..fab6de0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -149,8 +149,15 @@ static void skb_shinfo_init(struct sk_buff *skb)
{
struct skb_shared_info *shinfo = skb_shinfo(skb);
+ /* Ensure that nr_frags->frags[0] (at least) fits into a
+ * single cache line. */
+ BUILD_BUG_ON((offsetof(struct skb_shared_info, frags[1])
+ - offsetof(struct skb_shared_info, nr_frags)) > 64);
+
/* make sure we initialize shinfo sequentially */
- memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
+ memset(&shinfo->nr_frags, 0,
+ offsetof(struct skb_shared_info, dataref)
+ - offsetof(struct skb_shared_info, nr_frags));
atomic_set(&shinfo->dataref, 1);
kmemcheck_annotate_variable(shinfo->destructor_arg);
}
--
1.7.2.5
* [PATCH 7/9] net: add skb_orphan_frags to copy aside frags with destructors
From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw)
To: netdev; +Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell
In-Reply-To: <1336056915.20716.96.camel@zakaz.uk.xensource.com>
This should be used by drivers which need to hold on to an skb for an extended
(perhaps unbounded) period of time. e.g. the tun driver which relies on
userspace consuming the skb.
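As a rough illustration of the "copy aside" idea, here is a userspace model. It is a sketch only: struct demo_frag, demo_orphan_frags() and demo_count_release() are hypothetical names, plain malloc() stands in for page allocation, and none of this is the kernel implementation in the diff below.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of a paged fragment with an optional owner. */
struct demo_frag {
	unsigned char *data;
	size_t len;
	void (*destructor)(struct demo_frag *frag);	/* NULL: no owner */
};

/* Example owner callback which just counts how often it was notified. */
int demo_released;

void demo_count_release(struct demo_frag *frag)
{
	(void)frag;
	demo_released++;
}

/* Replace every frag that has a destructor with a private copy and
 * release the original. Returns 0 on success, or -1 if a copy could
 * not be allocated, in which case no frag has been modified. */
int demo_orphan_frags(struct demo_frag *frags, size_t n)
{
	unsigned char **copies;
	size_t i;

	assert(frags != NULL || n == 0);

	copies = calloc(n ? n : 1, sizeof(*copies));
	if (!copies)
		return -1;

	/* Pass 1: make all copies before touching any frag. */
	for (i = 0; i < n; i++) {
		if (!frags[i].destructor)
			continue;
		copies[i] = malloc(frags[i].len ? frags[i].len : 1);
		if (!copies[i])
			goto fail;
		if (frags[i].len)
			memcpy(copies[i], frags[i].data, frags[i].len);
	}

	/* Pass 2: notify the owners and switch to the copies. */
	for (i = 0; i < n; i++) {
		if (!frags[i].destructor)
			continue;
		frags[i].destructor(&frags[i]);
		frags[i].data = copies[i];
		frags[i].destructor = NULL;
	}

	free(copies);
	return 0;

fail:
	for (i = 0; i < n; i++)
		free(copies[i]);	/* free(NULL) is a no-op */
	free(copies);
	return -1;
}
```

Doing all allocations before any frag is modified means a failed allocation leaves the buffer exactly as it was, so the caller can still treat it as an ordinary error.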
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: mst@redhat.com
---
drivers/net/tun.c | 1 +
include/linux/skbuff.h | 11 ++++++++
net/core/skbuff.c | 68 ++++++++++++++++++++++++++++++++++-------------
3 files changed, 61 insertions(+), 19 deletions(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index bb8c72c..b53e04e 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -415,6 +415,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
/* Orphan the skb - required as we might hang on to it
* for indefinite time. */
skb_orphan(skb);
+ skb_orphan_frags(skb, GFP_KERNEL);
/* Enqueue packet */
skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ccc7d93..9145f83 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1711,6 +1711,17 @@ static inline void skb_orphan(struct sk_buff *skb)
}
/**
+ * skb_orphan_frags - orphan the frags contained in a buffer
+ * @skb: buffer to orphan frags from
+ * @gfp_mask: allocation mask for replacement pages
+ *
+ * For each frag in the SKB which has a destructor (i.e. has an
+ * owner) create a copy of that frag and release the original
+ * page by calling the destructor.
+ */
+extern int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask);
+
+/**
* __skb_queue_purge - empty a list
* @list: list to empty
*
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 945b807..f009abb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -697,31 +697,25 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
}
EXPORT_SYMBOL_GPL(skb_morph);
-/* skb_copy_ubufs - copy userspace skb frags buffers to kernel
- * @skb: the skb to modify
- * @gfp_mask: allocation priority
- *
- * This must be called on SKBTX_DEV_ZEROCOPY skb.
- * It will copy all frags into kernel and drop the reference
- * to userspace pages.
- *
- * If this function is called from an interrupt gfp_mask() must be
- * %GFP_ATOMIC.
- *
- * Returns 0 on success or a negative error code on failure
- * to allocate kernel memory to copy to.
+/*
+ * If uarg != NULL copy and replace all frags.
+ * If uarg == NULL then only copy and replace those which have a destructor
+ * pointer.
*/
-int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
+static int skb_copy_frags(struct sk_buff *skb, gfp_t gfp_mask,
+ struct ubuf_info *uarg)
{
int i;
int num_frags = skb_shinfo(skb)->nr_frags;
struct page *page, *head = NULL;
- struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
for (i = 0; i < num_frags; i++) {
u8 *vaddr;
skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+ if (!uarg && !f->page.destructor)
+ continue;
+
page = alloc_page(GFP_ATOMIC);
if (!page) {
while (head) {
@@ -739,11 +733,16 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
head = page;
}
- /* skb frags release userspace buffers */
- for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
+ /* skb frags release buffers */
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+ skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+ if (!uarg && !f->page.destructor)
+ continue;
skb_frag_unref(skb, i);
+ }
- uarg->callback(uarg);
+ if (uarg)
+ uarg->callback(uarg);
/* skb frags point to kernel buffers */
for (i = skb_shinfo(skb)->nr_frags; i > 0; i--) {
@@ -752,10 +751,41 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
head = (struct page *)head->private;
}
- skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
return 0;
}
+/* skb_copy_ubufs - copy userspace skb frags buffers to kernel
+ * @skb: the skb to modify
+ * @gfp_mask: allocation priority
+ *
+ * This must be called on SKBTX_DEV_ZEROCOPY skb.
+ * It will copy all frags into kernel and drop the reference
+ * to userspace pages.
+ *
+ * If this function is called from an interrupt gfp_mask() must be
+ * %GFP_ATOMIC.
+ *
+ * Returns 0 on success or a negative error code on failure
+ * to allocate kernel memory to copy to.
+ */
+int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
+{
+ struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
+ int rc;
+
+ rc = skb_copy_frags(skb, gfp_mask, uarg);
+
+ if (rc == 0)
+ skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+
+ return rc;
+}
+
+int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
+{
+ return skb_copy_frags(skb, gfp_mask, NULL);
+}
+EXPORT_SYMBOL(skb_orphan_frags);
/**
* skb_clone - duplicate an sk_buff
--
1.7.2.5
* [PATCH 1/9] net: add and use SKB_ALLOCSIZE
From: Ian Campbell @ 2012-05-03 14:56 UTC (permalink / raw)
To: netdev; +Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, Ian Campbell
In-Reply-To: <1336056915.20716.96.camel@zakaz.uk.xensource.com>
This gives the allocation size required for an skb containing X bytes of data.
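For a quick feel of the numbers, the computation can be modelled in userspace as below. This is a sketch under assumptions: a 64 byte cache line, and a shinfo_size parameter standing in for sizeof(struct skb_shared_info), whose real value depends on the kernel configuration. The demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_CACHE_BYTES 64u	/* assumed value of SMP_CACHE_BYTES */
#define DEMO_DATA_ALIGN(x) \
	(((size_t)(x) + (DEMO_CACHE_BYTES - 1)) & ~((size_t)DEMO_CACHE_BYTES - 1))

/* Modelled on SKB_ALLOCSIZE as introduced here: data and shinfo are
 * each padded to a cache line multiple, then added together. */
size_t demo_allocsize_v1(size_t data_len, size_t shinfo_size)
{
	return DEMO_DATA_ALIGN(data_len) + DEMO_DATA_ALIGN(shinfo_size);
}

/* Modelled on SKB_WITH_OVERHEAD: the data space left in an allocation
 * of alloc_len bytes once the padded shinfo is taken out. */
size_t demo_with_overhead_v1(size_t alloc_len, size_t shinfo_size)
{
	assert(alloc_len >= DEMO_DATA_ALIGN(shinfo_size));
	return alloc_len - DEMO_DATA_ALIGN(shinfo_size);
}
```

With a made-up shinfo size of 320 bytes, 1500 bytes of data round up to 1536 and the total comes to 1856 bytes. Passing a data length of 0 yields just the padded shinfo overhead, which is how the bnx2x hunk uses the macro.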
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
---
drivers/net/ethernet/broadcom/bnx2.c | 7 +++----
drivers/net/ethernet/broadcom/bnx2x/bnx2x.h | 3 +--
drivers/net/ethernet/broadcom/tg3.c | 3 +--
include/linux/skbuff.h | 12 ++++++++++++
net/core/skbuff.c | 8 +-------
5 files changed, 18 insertions(+), 15 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
index ac7b744..62eb000 100644
--- a/drivers/net/ethernet/broadcom/bnx2.c
+++ b/drivers/net/ethernet/broadcom/bnx2.c
@@ -5321,8 +5321,7 @@ bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size)
/* 8 for CRC and VLAN */
rx_size = bp->dev->mtu + ETH_HLEN + BNX2_RX_OFFSET + 8;
- rx_space = SKB_DATA_ALIGN(rx_size + BNX2_RX_ALIGN) + NET_SKB_PAD +
- SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+ rx_space = SKB_ALLOCSIZE(rx_size + BNX2_RX_ALIGN) + NET_SKB_PAD;
bp->rx_copy_thresh = BNX2_RX_COPY_THRESH;
bp->rx_pg_ring_size = 0;
@@ -5345,8 +5344,8 @@ bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size)
bp->rx_buf_use_size = rx_size;
/* hw alignment + build_skb() overhead*/
- bp->rx_buf_size = SKB_DATA_ALIGN(bp->rx_buf_use_size + BNX2_RX_ALIGN) +
- NET_SKB_PAD + SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+ bp->rx_buf_size = SKB_ALLOCSIZE(bp->rx_buf_use_size + BNX2_RX_ALIGN) +
+ NET_SKB_PAD;
bp->rx_jumbo_thresh = rx_size - BNX2_RX_OFFSET;
bp->rx_ring_size = size;
bp->rx_max_ring = bnx2_find_max_ring(size, MAX_RX_RINGS);
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index e30e2a2..3586879 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -1252,8 +1252,7 @@ struct bnx2x {
#define BNX2X_FW_RX_ALIGN_START (1UL << BNX2X_RX_ALIGN_SHIFT)
#define BNX2X_FW_RX_ALIGN_END \
- max(1UL << BNX2X_RX_ALIGN_SHIFT, \
- SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+ max(1UL << BNX2X_RX_ALIGN_SHIFT, SKB_ALLOCSIZE(0))
#define BNX2X_PXP_DRAM_ALIGN (BNX2X_RX_ALIGN_SHIFT - 5)
diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 482138e..6869f17 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -5714,8 +5714,7 @@ static int tg3_alloc_rx_data(struct tg3 *tp, struct tg3_rx_prodring_set *tpr,
* Callers depend upon this behavior and assume that
* we leave everything unchanged if we fail.
*/
- skb_size = SKB_DATA_ALIGN(data_size + TG3_RX_OFFSET(tp)) +
- SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+ skb_size = SKB_ALLOCSIZE(data_size + TG3_RX_OFFSET(tp));
if (skb_size <= TG3_FRAGSIZE) {
data = tg3_frag_alloc(tpr);
*frag_size = TG3_FRAGSIZE;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 988fc49..19e348f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -41,8 +41,20 @@
#define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \
~(SMP_CACHE_BYTES - 1))
+/* maximum data size which can fit into an allocation of X bytes */
#define SKB_WITH_OVERHEAD(X) \
((X) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+/*
+ * minimum allocation size required for an skb containing X bytes of data
+ *
+ * We do our best to align skb_shared_info on a separate cache
+ * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
+ * aligned memory blocks, unless SLUB/SLAB debug is enabled. Both
+ * skb->head and skb_shared_info are cache line aligned.
+ */
+#define SKB_ALLOCSIZE(X) \
+ (SKB_DATA_ALIGN((X)) + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+
#define SKB_MAX_ORDER(X, ORDER) \
SKB_WITH_OVERHEAD((PAGE_SIZE << (ORDER)) - (X))
#define SKB_MAX_HEAD(X) (SKB_MAX_ORDER((X), 0))
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 52ba2b5..a056d7c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -182,13 +182,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
goto out;
prefetchw(skb);
- /* We do our best to align skb_shared_info on a separate cache
- * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
- * aligned memory blocks, unless SLUB/SLAB debug is enabled.
- * Both skb->head and skb_shared_info are cache line aligned.
- */
- size = SKB_DATA_ALIGN(size);
- size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+ size = SKB_ALLOCSIZE(size);
data = kmalloc_node_track_caller(size, gfp_mask, node);
if (!data)
goto nodata;
--
1.7.2.5
* [PATCH v5 0/9] skb paged fragment destructors
From: Ian Campbell @ 2012-05-03 14:55 UTC (permalink / raw)
To: netdev@vger.kernel.org
Cc: David Miller, Eric Dumazet, Michael S. Tsirkin, David VomLehn,
Bart Van Assche, xen-devel, Ian Campbell, Alexander Duyck
The following series makes use of the skb fragment API (which is in 3.2+)
to add a per-paged-fragment destructor callback. This can be used by
creators of skbs who are interested in the lifecycle of the pages
included in that skb after they have handed it off to the network stack.
The mail at [0] contains some more background and rationale, but
basically the completed series will allow entities which inject pages
into the networking stack to receive a notification when the stack has
really finished with those pages (i.e. including retransmissions,
clones, pull-ups etc.) and not just when the original skb is finished
with. This is beneficial to many subsystems which wish to inject pages
into the network stack without giving up full ownership of those pages'
lifecycle. It implements something broadly along the lines of what was
described in [1].
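The notification semantics described above ("really finished", not just "original skb finished") boil down to reference counting. Here is a minimal userspace sketch of that idea, assuming a plain int in place of an atomic counter. The struct and function names are hypothetical and are not claimed to match the structures used in the patches.

```c
#include <assert.h>

/* Hypothetical model of a per-fragment destructor shared by every
 * skb (original, clones, retransmit copies) that refers to the page. */
struct demo_frag_destructor {
	int ref;					/* users still holding the page */
	void (*destroy)(struct demo_frag_destructor *d);	/* owner callback */
	int notified;					/* for the example only */
};

/* Called whenever another skb starts referring to the page. */
void demo_destructor_ref(struct demo_frag_destructor *d)
{
	d->ref++;
}

/* Called whenever an skb referring to the page is freed. The owner is
 * only notified when the last user goes away. */
void demo_destructor_unref(struct demo_frag_destructor *d)
{
	assert(d->ref > 0);
	if (--d->ref == 0 && d->destroy)
		d->destroy(d);
}

/* Example owner callback. */
void demo_notify_owner(struct demo_frag_destructor *d)
{
	d->notified++;
}
```

In this model, freeing the original skb while a clone is still queued for retransmission drops the count from 2 to 1 and nothing is reported; the owner only hears about it once the clone is freed as well.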
I have also included a patch to the RPC subsystem which uses this API to
fix the bug which I describe at [2].
I've also had some interest from David VomLehn and Bart Van Assche
regarding using this functionality in the context of vmsplice and iSCSI
targets respectively (I think).
Changes since last time:
* The big change is that the patches now explicitly align the
"nr_frags" member of the shinfo, as suggested by Alexander
Duyck. This ensures that the placement is optimal irrespective
of page size (in particular the variation of MAX_SKB_FRAGS). It
is still the case that, for 4k pages, a maximum MTU frame +
SKB_PAD + shinfo still fits within 2048 bytes.
* As part of the preceding I squashed the patches
manipulating the shinfo layout and alignment into a
single patch (which is far more coherent than the
piecemeal approach used previously)
* I crushed "net: only allow paged fragments with the same
destructor to be coalesced." into the baseline patch (Ben
Hutchings)
* Added and used skb_shinfo_init to centralise several copies of
that code.
* Reduced CC list on "net: add paged frag destructor support to
kernel_sendpage", it was rather long and seemed a bit overly
spammy on the non-netdev recipients.
Changes since time before:
* Added skb_orphan_frags API for the use of recipients of SKBs who
may hold onto the SKB for a long time (this is analogous to
skb_orphan). This was pointed out by Michael. The TUN driver is
currently the only user.
* I can't for the life of me get anything to actually hit
this code path. I've been trying with an NFS server
running in a Xen HVM domain with emulated (e.g. tap)
networking and a client in domain 0, using the NFS fix
in this series which generates SKBs with destructors
set, so far -- nothing. I suspect that lack of TSO/GSO
etc on the TAP interface is causing the frags to be
copied to normal pages during skb_segment().
* Various fixups related to the change of alignment/padding in
shinfo, in particular to build_skb as pointed out by Eric.
* Tweaked ordering of shinfo members to ensure that all hotpath
variables up to and including the first frag fit within (and are
aligned to) a single 64 byte cache line. (Eric again)
I ran a monothread UDP benchmark (similar to that described by Eric in
e52fcb2462ac) and don't see any difference in pps throughput; it was
~810,000 pps both before and after.
Cheers,
Ian.
[0] http://marc.info/?l=linux-netdev&m=131072801125521&w=2
[1] http://marc.info/?l=linux-netdev&m=130925719513084&w=2
[2] http://marc.info/?l=linux-nfs&m=122424132729720&w=2
* Re: [PATCH v3 1/2] vhost-net: fix handle_rx buffer size
From: Basil Gor @ 2012-05-03 14:43 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Eric W. Biederman, David S. Miller, netdev
In-Reply-To: <20120503131623.GA26705@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 1949 bytes --]
On Thu, May 03, 2012 at 04:16:24PM +0300, Michael S. Tsirkin wrote:
> On Wed, Apr 25, 2012 at 09:01:15PM +0400, Basil Gor wrote:
> > Take vlan header length into account, when vlan id is stored as
> > vlan_tci. Otherwise tagged packets coming from macvtap will be
> > truncated.
> >
> > Signed-off-by: Basil Gor <basil.gor@gmail.com>
>
> So I'm inclined to apply these two patches, we
> this doesn't fix packet socket backend
> but could be fixed by a follow-up patch.
>
That's what I'm going to do.
While testing the packet socket I noticed that tcpdump doesn't work
on macvtap0. I think this is because there is no dev_hard_start_xmit
call as in the tun/tap0 case (lines 120-144 in the attached trace), and
I have no clear picture of how to fix this gracefully.
I also think there are issues with macvtap on top of bonding, which
I'm going to verify and debug.
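For reference, the length adjustment made by the patch under discussion can be modelled in userspace roughly as follows. This is a sketch with hypothetical names: struct demo_skb only carries the two fields that matter here, and DEMO_VLAN_HLEN mirrors the 4 byte 802.1Q header length that the kernel calls VLAN_HLEN.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_VLAN_HLEN 4u	/* length of an 802.1Q tag in bytes */

/* Hypothetical, heavily reduced model of an skb at the queue head. */
struct demo_skb {
	unsigned int len;	/* bytes currently in the buffer */
	int vlan_tag_present;	/* tag held out of band (in vlan_tci) */
};

/* Size of the buffer needed to receive the head packet, or 0 if the
 * queue is empty. A tag kept out of band is not counted in len, but
 * it is put back into the frame on the way to the reader, so it has
 * to be added to the reported length. */
unsigned int demo_peek_head_len(const struct demo_skb *head)
{
	unsigned int len = 0;

	if (head) {
		len = head->len;
		if (head->vlan_tag_present) {
			assert(len <= UINT_MAX - DEMO_VLAN_HLEN);
			len += DEMO_VLAN_HLEN;
		}
	}
	return len;
}
```

Without the extra 4 bytes the receive buffer is sized for the untagged frame, which matches the truncation of tagged packets described in the patch.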
> > ---
> > drivers/vhost/net.c | 7 ++++++-
> > 1 files changed, 6 insertions(+), 1 deletions(-)
> >
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 1f21d2a..5c17010 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -24,6 +24,7 @@
> > #include <linux/if_arp.h>
> > #include <linux/if_tun.h>
> > #include <linux/if_macvlan.h>
> > +#include <linux/if_vlan.h>
> >
> > #include <net/sock.h>
> >
> > @@ -283,8 +284,12 @@ static int peek_head_len(struct sock *sk)
> >
> > spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
> > head = skb_peek(&sk->sk_receive_queue);
> > - if (likely(head))
> > + if (likely(head)) {
> > len = head->len;
> > + if (vlan_tx_tag_present(head))
> > + len += VLAN_HLEN;
> > + }
> > +
> > spin_unlock_irqrestore(&sk->sk_receive_queue.lock, flags);
> > return len;
> > }
> > --
> > 1.7.6.5
[-- Attachment #2: tap_trace.log --]
[-- Type: text/plain, Size: 21099 bytes --]
single arp packet receive
br0
^ \
| +->tap0 <-- tapread + tcpdump
->wlan0-+->macvlan0
\->macvtap0 <-- macvtapread + tcpdump
001 0 irq/28-b43(25823): -> netpoll_trap()
002 0xffffffff815209c0 : netpoll_trap+0x0/0x20 [kernel]
003 0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
004 0xffffffffa0260a8a [mac80211]
005 0xffffffffa0260ac0 [mac80211]
006 0xffffffffa0318b02 [b43]
007 0xffffffff810e4b60 : irq_thread_fn+0x0/0x50 [kernel]
008 0xffffffffa0313330 [b43]
009 0xffffffffa02f71d6 [b43]
010 0xffffffffa02f7486 [b43]
011 0xffffffff810e4b89 : irq_thread_fn+0x29/0x50 [kernel]
012 0xffffffff810e4ae0 : irq_thread+0x1a0/0x220 [kernel]
013 0xffffffff810e4940 : irq_thread+0x0/0x220 [kernel]
014 0xffffffff81079da3 : kthread+0x93/0xa0 [kernel]
015 0xffffffff81620f24 : kernel_thread_helper+0x4/0x10 [kernel]
016 0xffffffff81079d10 : kthread+0x0/0xa0 [kernel]
017 0xffffffff81620f20 : kernel_thread_helper+0x0/0x10 [kernel]
018 480 irq/28-b43(25823): <- netpoll_trap(): return=0x0
019 0 irq/28-b43(25823): -> netif_receive_skb(skb=0xffff8800a8183300)
020 skb_dump:dev:wlan0 proto:8100 len:32 vlan_tci:{prio:0 cfi:0 vid:0}
021 0xffffffff8150b360 : netif_receive_skb+0x0/0x90 [kernel]
022 0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
023 0xffffffffa02584d6 [mac80211]
024 0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
025 0xffffffffa02598f6 [mac80211]
026 0xffffffffa025a56e [mac80211]
027 0xffffffffa0312fd4 [b43]
028 0xffffffffa0318d8a [b43]
029 0xffffffff810e4b60 : irq_thread_fn+0x0/0x50 [kernel]
030 0xffffffffa02f71fd [b43]
031 0xffffffffa02f7486 [b43]
032 0xffffffff810e4b89 : irq_thread_fn+0x29/0x50 [kernel]
033 0xffffffff810e4ae0 : irq_thread+0x1a0/0x220 [kernel]
034 0xffffffff810e4940 : irq_thread+0x0/0x220 [kernel]
035 0xffffffff81079da3 : kthread+0x93/0xa0 [kernel]
036 0xffffffff81620f24 : kernel_thread_helper+0x4/0x10 [kernel]
037 0xffffffff81079d10 : kthread+0x0/0xa0 [kernel]
038 0xffffffff81620f20 : kernel_thread_helper+0x0/0x10 [kernel]
039 551 irq/28-b43(25823): -> __netif_receive_skb(skb=0xffff8800a8183300 ptype=? pt_prev=? rx_handler=? orig_dev=? null_or_dev=? deliver_exact=? ret=? type=?)
040 skb_dump:dev:wlan0 proto:8100 len:32 vlan_tci:{prio:0 cfi:0 vid:0}
041 587 irq/28-b43(25823): -> vlan_untag(skb=0xffff8800a8183300 vhdr=? vlan_tci=?)
042 skb_dump:dev:wlan0 proto:8100 len:32 vlan_tci:{prio:0 cfi:0 vid:0}
043 616 irq/28-b43(25823): <- vlan_untag(): return=0xffff8800a8183300
044 639 irq/28-b43(25823): -> packet_rcv(skb=0xffff8800a8183300 dev=0xffff8800aa24b000 pt=0xffff8800276f5cc0 orig_dev=0xffff8800aa24b000 sk=? sll=? po=? skb_head=? skb_len=? snaplen=? res=?)
045 skb_dump:dev:wlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
046 dev:name:wlan0
047 orig_dev:name:wlan0
048 690 irq/28-b43(25823): <- packet_rcv(): return=0x0
049 705 irq/28-b43(25823): -> vlan_do_receive(skbp=0xffff880045c1fa08 last_handler=0x0 skb=? vlan_id=? vlan_dev=0xffff8800aa24b000 rx_stats=?)
050 skbp_dump:dev:wlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
051 738 irq/28-b43(25823): <- vlan_do_receive(): return=0x0
052 755 irq/28-b43(25823): -> macvlan_handle_frame(pskb=0xffff880045c1fa08 port=? skb=? eth=? vlan=? src=0xffff8800aa24b000 dev=? len=? ret=?)
053 pskb_dump:dev:wlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
054 794 irq/28-b43(25823): -> macvlan_broadcast(skb=0xffff8800a8183300 port=0xffff8800a9466000 src=0x0 mode=0xf eth=0xffff880045c1f9f0 vlan=? n=? nskb=? i=? err=0x0)
055 skb_dump:dev:wlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
056 843 irq/28-b43(25823): -> netif_rx(skb=0xffff8800a7012e00 ret=0xffffffffa7012e00)
057 skb_dump:dev:macvlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
058 877 irq/28-b43(25823): -> enqueue_to_backlog(skb=0xffff8800a7012e00 cpu=0x0 qtail=0xffff880045c1f930 sd=? flags=?)
059 skb_dump:dev:macvlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
060 930 irq/28-b43(25823): <- enqueue_to_backlog(): return=0x0
061 942 irq/28-b43(25823): <- netif_rx(): return=0x0
062 962 irq/28-b43(25823): -> macvtap_receive(skb=0xffff8800136a7200)
063 skb_dump:dev:macvtap0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
064 992 irq/28-b43(25823): -> macvtap_forward(dev=0xffff8800a9467000 skb=0xffff8800136a7200 q=0x0)
065 skb_dump:dev:macvtap0 proto:0806 len:42 vlan_tci:{prio:0 cfi:4096 vid:50}
066 dev:name:macvtap0
067 1030 irq/28-b43(25823): -> __skb_get_rxhash(skb=0xffff8800136a7200 keys={...} hash=?)
068 skb_dump:dev:macvtap0 proto:0806 len:42 vlan_tci:{prio:0 cfi:4096 vid:50}
069 1058 irq/28-b43(25823): <- __skb_get_rxhash():
070 1080 irq/28-b43(25823): <- macvtap_forward(): return=0x0
071 1092 irq/28-b43(25823): <- macvtap_receive(): return=0x0
072 1105 irq/28-b43(25823): <- macvlan_broadcast():
073 1117 irq/28-b43(25823): <- macvlan_handle_frame(): return=0x3
074 1149 irq/28-b43(25823): <- __netif_receive_skb(): return=0x0
075 1161 irq/28-b43(25823): <- netif_receive_skb(): return=0x0
076 0 irq/28-b43(25823): -> net_rx_action(h=0xffffffff81c04098 sd=? time_limit=? budget=? have=0x3)
077 0xffffffff8150bc00 : net_rx_action+0x0/0x270 [kernel]
078 0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
079 0xffffffff8162101c : call_softirq+0x1c/0x30 [kernel]
080 0xffffffff81016455 : do_softirq+0x65/0xa0 [kernel]
081 0xffffffff8105e654 : local_bh_enable+0x94/0xa0 [kernel]
082 0xffffffffa0312fd9 [b43]
083 0xffffffffa0318d8a [b43]
084 0xffffffff810e4b60 : irq_thread_fn+0x0/0x50 [kernel]
085 0xffffffffa02f71fd [b43]
086 0xffffffffa02f7486 [b43]
087 0xffffffff810e4b89 : irq_thread_fn+0x29/0x50 [kernel]
088 0xffffffff810e4ae0 : irq_thread+0x1a0/0x220 [kernel]
089 0xffffffff810e4940 : irq_thread+0x0/0x220 [kernel]
090 0xffffffff81079da3 : kthread+0x93/0xa0 [kernel]
091 0xffffffff81620f24 : kernel_thread_helper+0x4/0x10 [kernel]
092 0xffffffff81079d10 : kthread+0x0/0xa0 [kernel]
093 0xffffffff81620f20 : kernel_thread_helper+0x0/0x10 [kernel]
094 506 irq/28-b43(25823): -> process_backlog(napi=0xffff8800afc14118 quota=0x40 work=? sd=?)
095 530 irq/28-b43(25823): -> __netif_receive_skb(skb=0xffff8800a7012e00 ptype=? pt_prev=? rx_handler=? orig_dev=? null_or_dev=? deliver_exact=? ret=? type=?)
096 skb_dump:dev:macvlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
097 561 irq/28-b43(25823): -> vlan_do_receive(skbp=0xffff8800afc03e30 last_handler=0x0 skb=? vlan_id=? vlan_dev=0xffff880088528000 rx_stats=?)
098 skbp_dump:dev:macvlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
099 591 irq/28-b43(25823): <- vlan_do_receive(): return=0x0
100 626 irq/28-b43(25823): -> br_flood_forward(br=0xffff8800a995e780 skb=0xffff8800a7012e00 skb2=0xffff8800a7012e00)
101 skb_dump:dev:macvlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
102 667 irq/28-b43(25823): -> br_flood(br=0xffff8800a995e780 skb=0xffff8800a7012e00 skb0=0xffff8800a7012e00 __packet_hook=0xffffffffa040f380 p=? prev=?)
103 skb_dump:dev:macvlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
104 711 irq/28-b43(25823): -> maybe_deliver(prev=0x0 p=0xffff8800a85bec00 skb=0xffff8800a7012e00 __packet_hook=0xffffffffa040f380 err=?)
105 skb_dump:dev:macvlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
106 766 irq/28-b43(25823): <- maybe_deliver(): return=0xffff8800a85bec00
107 784 irq/28-b43(25823): -> maybe_deliver(prev=0xffff8800a85bec00 p=0xffff8800a7972800 skb=0xffff8800a7012e00 __packet_hook=0xffffffffa040f380 err=?)
108 skb_dump:dev:macvlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
109 820 irq/28-b43(25823): <- maybe_deliver(): return=0xffff8800a85bec00
110 839 irq/28-b43(25823): -> deliver_clone(prev=0xffff8800a85bec00 skb=0xffff8800a7012e00 __packet_hook=0xffffffffa040f380 dev=?)
111 skb_dump:dev:macvlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
112 879 irq/28-b43(25823): -> __br_forward(to=0xffff8800a85bec00 skb=0xffff8800a8183300 indev=?)
113 skb_dump:dev:macvlan0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
114 914 irq/28-b43(25823): -> br_forward_finish(skb=0xffff8800a8183300)
115 skb_dump:dev:tap0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
116 943 irq/28-b43(25823): -> br_dev_queue_push_xmit(skb=0xffff8800a8183300)
117 skb_dump:dev:tap0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
118 973 irq/28-b43(25823): -> dev_queue_xmit(skb=0xffff8800a8183300 dev=? txq=0xffffffffa040f380 q=? rc=?)
119 skb_dump:dev:tap0 proto:0806 len:42 vlan_tci:{prio:0 cfi:4096 vid:50}
120 1010 irq/28-b43(25823): -> dev_hard_start_xmit(skb=0xffff8800a8183300 dev=0xffff8800a9866000 txq=0xffff88008864fa00 ops=? rc=? skb_len=?)
121 skb_dump:dev:tap0 proto:0806 len:42 vlan_tci:{prio:0 cfi:4096 vid:50}
122 dev:name:tap0
123 1056 irq/28-b43(25823): -> tpacket_rcv(skb=0xffff8800a9849600 dev=0xffff8800a9866000 pt=0xffff8800662994c0 orig_dev=0xffff8800a9866000 sk=? po=? sll=? h={...} skb_head=? skb_len=? snaplen=? res=? status=? macoff=? netoff=? hdrlen=? copy_skb=? tv={...} ts={...} shhwtstamps=?)
124 skb_dump:dev:tap0 proto:0806 len:42 vlan_tci:{prio:0 cfi:4096 vid:50}
125 dev:name:tap0
126 orig_dev:name:tap0
127 1109 irq/28-b43(25823): -> packet_lookup_frame(po=0xffff880066299000 rb=0xffff8800662992a0 position=0x6 status=0x0 pg_vec_pos=? frame_offset=? h={...})
128 1137 irq/28-b43(25823): -> __packet_get_status(po=0xffff880066299000 frame=0xffff880019b60000 h={...})
129 1159 irq/28-b43(25823): <- __packet_get_status(): return=0x0
130 1171 irq/28-b43(25823): <- packet_lookup_frame(): return=0xffff880019b60000
131 1192 irq/28-b43(25823): -> __packet_set_status(po=0xffff880066299000 frame=0xffff880019b60000 status=0x11 h={...})
132 1217 irq/28-b43(25823): <- __packet_set_status():
133 1240 irq/28-b43(25823): <- tpacket_rcv(): return=0x0
134 1255 irq/28-b43(25823): -> netif_skb_features(skb=0xffff8800a8183300 protocol=? features=?)
135 skb_dump:dev:tap0 proto:0806 len:42 vlan_tci:{prio:0 cfi:4096 vid:50}
136 1286 irq/28-b43(25823): -> harmonize_features(skb=0xffff8800a8183300 protocol=0x608 features=0x0)
137 skb_dump:dev:tap0 proto:0806 len:42 vlan_tci:{prio:0 cfi:4096 vid:50}
138 1314 irq/28-b43(25823): <- harmonize_features(): return=0x0
139 1326 irq/28-b43(25823): <- netif_skb_features(): return=0x0
140 1348 irq/28-b43(25823): -> tun_net_xmit(skb=0xffff8800a8183300 dev=0xffff8800a9866000 tun=?)
141 skb_dump:dev:tap0 proto:8100 len:46 vlan_tci:{prio:0 cfi:0 vid:0}
142 dev:name:tap0
143 1388 irq/28-b43(25823): <- tun_net_xmit(): return=0x0
144 1401 irq/28-b43(25823): <- dev_hard_start_xmit(): return=0x0
145 1414 irq/28-b43(25823): <- dev_queue_xmit(): return=0x0
146 1426 irq/28-b43(25823): <- br_dev_queue_push_xmit(): return=0x0
147 1438 irq/28-b43(25823): <- br_forward_finish(): return=0x0
148 1450 irq/28-b43(25823): <- __br_forward():
149 1461 irq/28-b43(25823): <- deliver_clone(): return=0x0
150 1473 irq/28-b43(25823): <- br_flood():
151 1484 irq/28-b43(25823): <- br_flood_forward():
152 1499 irq/28-b43(25823): -> netif_receive_skb(skb=0xffff8800a7012e00)
153 skb_dump:dev:br0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
154 1524 irq/28-b43(25823): -> __netif_receive_skb(skb=0xffff8800a7012e00 ptype=? pt_prev=? rx_handler=? orig_dev=? null_or_dev=? deliver_exact=? ret=? type=?)
155 skb_dump:dev:br0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
156 1553 irq/28-b43(25823): -> vlan_do_receive(skbp=0xffff8800afc03d20 last_handler=0x1 skb=? vlan_id=? vlan_dev=0xffff8800a995e000 rx_stats=?)
157 skbp_dump:dev:br0 proto:0806 len:28 vlan_tci:{prio:0 cfi:4096 vid:50}
158 1583 irq/28-b43(25823): <- vlan_do_receive(): return=0x0
159 1596 irq/28-b43(25823): <- __netif_receive_skb(): return=0x0
160 1608 irq/28-b43(25823): <- netif_receive_skb(): return=0x0
161 1619 irq/28-b43(25823): <- __netif_receive_skb(): return=0x1
162 1631 irq/28-b43(25823): <- process_backlog(): return=0x1
163 1646 irq/28-b43(25823): -> net_rps_action_and_irq_enable(sd=0xffff8800afc14040 remsd=?)
164 1666 irq/28-b43(25823): <- net_rps_action_and_irq_enable():
165 1678 irq/28-b43(25823): <- net_rx_action():
166 0 macvtapread(13215): -> macvtap_poll(file=0xffff88006bb35c00 wait=0xffff8800a86a9ab8 q=? mask=?)
167 0xffffffffa03f6000 : macvtap_poll+0x0/0xa0 [macvtap]
168 0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
169 0xffffffff8119561c : core_sys_select+0x1ec/0x370 [kernel]
170 0xffffffff81195860 : sys_select+0xc0/0x100 [kernel]
171 0xffffffff8161fb69 : system_call_fastpath+0x16/0x1b [kernel]
172 206 macvtapread(13215): <- macvtap_poll(): return=0x145
173 0 macvtapread(13215): -> macvtap_aio_read(iocb=0xffff8800a86a9e00 iv=0xffff8800a86a9ed8 count=0x1 pos=0x0 file=? q=? len=? ret=?)
174 0xffffffffa03f65e0 : macvtap_aio_read+0x0/0x80 [macvtap]
175 0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
176 0xffffffff81182305 : vfs_read+0x165/0x180 [kernel]
177 0xffffffff8118236a : sys_read+0x4a/0x90 [kernel]
178 0xffffffff8161fb69 : system_call_fastpath+0x16/0x1b [kernel]
179 193 macvtapread(13215): -> macvtap_do_read(q=0xffff88001a62f800 iocb=0xffff8800a86a9e00 iv=0xffff8800a86a9ed8 len=0x5de noblock=0x0 wait={...} skb=? ret=?)
180 233 macvtapread(13215): <- macvtap_do_read(): return=0x34
181 246 macvtapread(13215): <- macvtap_aio_read(): return=0x34
182 0 macvtapread(13215): -> macvtap_poll(file=0xffff88006bb35c00 wait=0xffff8800a86a9ab8 q=? mask=?)
183 0xffffffffa03f6000 : macvtap_poll+0x0/0xa0 [macvtap]
184 0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
185 0xffffffff8119561c : core_sys_select+0x1ec/0x370 [kernel]
186 0xffffffff81195860 : sys_select+0xc0/0x100 [kernel]
187 0xffffffff8161fb69 : system_call_fastpath+0x16/0x1b [kernel]
188 190 macvtapread(13215): <- macvtap_poll(): return=0x104
189 0 tcpdump(tap0): -> packet_poll(file=0xffff8800885f0e00 sock=0xffff88001904a000 wait=0xffff880017601b88 sk=? po=? mask=?)
190 0xffffffff815e51f0 : packet_poll+0x0/0x130 [kernel]
191 0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
192 0xffffffff81195ceb : do_sys_poll+0x25b/0x4c0 [kernel]
193 0xffffffff8119602b : sys_poll+0x6b/0x100 [kernel]
194 0xffffffff8161fb69 : system_call_fastpath+0x16/0x1b [kernel]
195 223 tcpdump(tap0): -> packet_lookup_frame(po=0xffff880066299000 rb=0xffff8800662992a0 position=0x6 status=0x0 pg_vec_pos=? frame_offset=? h={...})
196 249 tcpdump(tap0): -> __packet_get_status(po=0xffff880066299000 frame=0xffff880019b60000 h={...})
197 269 tcpdump(tap0): <- __packet_get_status(): return=0x11
198 282 tcpdump(tap0): <- packet_lookup_frame(): return=0x0
199 294 tcpdump(tap0): <- packet_poll(): return=0x345
200 0 tcpdump(tap0): -> packet_poll(file=0xffff8800885f0e00 sock=0xffff88001904a000 wait=0xffff880017601b88 sk=? po=? mask=?)
201 0xffffffff815e51f0 : packet_poll+0x0/0x130 [kernel]
202 0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
203 0xffffffff81195ceb : do_sys_poll+0x25b/0x4c0 [kernel]
204 0xffffffff8119602b : sys_poll+0x6b/0x100 [kernel]
205 0xffffffff8161fb69 : system_call_fastpath+0x16/0x1b [kernel]
206 204 tcpdump(tap0): -> packet_lookup_frame(po=0xffff880066299000 rb=0xffff8800662992a0 position=0x6 status=0x0 pg_vec_pos=? frame_offset=? h={...})
207 229 tcpdump(tap0): -> __packet_get_status(po=0xffff880066299000 frame=0xffff880019b60000 h={...})
208 250 tcpdump(tap0): <- __packet_get_status(): return=0x0
209 262 tcpdump(tap0): <- packet_lookup_frame(): return=0xffff880019b60000
210 277 tcpdump(tap0): <- packet_poll(): return=0x304
211 0 tapread(13310): -> tun_chr_poll(file=0xffff880002451f00 wait=0xffff8800a7435ab8 tfile=? tun=? sk=? mask=?)
212 0xffffffffa042f4a0 : tun_chr_poll+0x0/0x110 [tun]
213 0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
214 0xffffffff8119561c : core_sys_select+0x1ec/0x370 [kernel]
215 0xffffffff81195860 : sys_select+0xc0/0x100 [kernel]
216 0xffffffff8161fb69 : system_call_fastpath+0x16/0x1b [kernel]
217 196 tapread(13310): -> __tun_get(tfile=0xffff88000456afc0 tun=?)
218 215 tapread(13310): <- __tun_get(): return=0xffff8800a9866780
219 234 tapread(13310): -> tun_put(tun=0xffff8800a9866780 tfile=?)
220 252 tapread(13310): <- tun_put():
221 262 tapread(13310): <- tun_chr_poll(): return=0x145
222 0 tapread(13310): -> tun_chr_aio_read(iocb=0xffff8800a7435e00 iv=0xffff8800a7435ed8 count=0x1 pos=0x0 file=? tfile=? tun=? len=? ret=?)
223 0xffffffffa042f6a0 : tun_chr_aio_read+0x0/0xe0 [tun]
224 0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
225 0xffffffff81182250 : vfs_read+0xb0/0x180 [kernel]
226 0xffffffff8118236a : sys_read+0x4a/0x90 [kernel]
227 0xffffffff8161fb69 : system_call_fastpath+0x16/0x1b [kernel]
228 213 tapread(13310): -> __tun_get(tfile=0xffff88000456afc0 tun=?)
229 230 tapread(13310): <- __tun_get(): return=0xffff8800a9866780
230 248 tapread(13310): -> tun_do_read(tun=0xffff8800a9866780 iocb=0xffff8800a7435e00 iv=0xffff8800a7435ed8 len=0x5de noblock=0x0 wait={...} skb=? ret=?)
231 284 tapread(13310): -> netpoll_trap()
232 295 tapread(13310): <- netpoll_trap(): return=0x0
233 312 tapread(13310): <- tun_do_read(): return=0x2e
234 325 tapread(13310): -> tun_put(tun=0xffff8800a9866780 tfile=?)
235 340 tapread(13310): <- tun_put():
236 350 tapread(13310): <- tun_chr_aio_read(): return=0x2e
237 0 tapread(13310): -> tun_chr_poll(file=0xffff880002451f00 wait=0xffff8800a7435ab8 tfile=? tun=? sk=? mask=?)
238 0xffffffffa042f4a0 : tun_chr_poll+0x0/0x110 [tun]
239 0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
240 0xffffffff8119561c : core_sys_select+0x1ec/0x370 [kernel]
241 0xffffffff81195860 : sys_select+0xc0/0x100 [kernel]
242 0xffffffff8161fb69 : system_call_fastpath+0x16/0x1b [kernel]
243 192 tapread(13310): -> __tun_get(tfile=0xffff88000456afc0 tun=?)
244 209 tapread(13310): <- __tun_get(): return=0xffff8800a9866780
245 225 tapread(13310): -> tun_put(tun=0xffff8800a9866780 tfile=?)
246 241 tapread(13310): <- tun_put():
247 251 tapread(13310): <- tun_chr_poll(): return=0x104
248 0 tcpdump(macvtap0): -> packet_poll(file=0xffff880027793a00 sock=0xffff8800a925c500 wait=0xffff8800a9a99b88 sk=? po=? mask=?)
249 0xffffffff815e51f0 : packet_poll+0x0/0x130 [kernel]
250 0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
251 0xffffffff81195ceb : do_sys_poll+0x25b/0x4c0 [kernel]
252 0xffffffff8119602b : sys_poll+0x6b/0x100 [kernel]
253 0xffffffff8161fb69 : system_call_fastpath+0x16/0x1b [kernel]
254 242 tcpdump(macvtap0): -> packet_lookup_frame(po=0xffff88003dd2b000 rb=0xffff88003dd2b2a0 position=0x1e status=0x0 pg_vec_pos=? frame_offset=? h={...})
255 271 tcpdump(macvtap0): -> __packet_get_status(po=0xffff88003dd2b000 frame=0xffff88003c180000 h={...})
256 294 tcpdump(macvtap0): <- __packet_get_status(): return=0x0
257 308 tcpdump(macvtap0): <- packet_lookup_frame(): return=0xffff88003c180000
258 324 tcpdump(macvtap0): <- packet_poll(): return=0x304
259 0 tcpdump(macvtap0): -> packet_poll(file=0xffff880027793a00 sock=0xffff8800a925c500 wait=0xffff8800a9a99b88 sk=? po=? mask=?)
260 0xffffffff815e51f0 : packet_poll+0x0/0x130 [kernel]
261 0xffffffff8161a189 : kretprobe_trampoline+0x0/0x57 [kernel]
262 0xffffffff81195ceb : do_sys_poll+0x25b/0x4c0 [kernel]
263 0xffffffff8119602b : sys_poll+0x6b/0x100 [kernel]
264 0xffffffff8161fb69 : system_call_fastpath+0x16/0x1b [kernel]
265 193 tcpdump(macvtap0): -> packet_lookup_frame(po=0xffff88003dd2b000 rb=0xffff88003dd2b2a0 position=0x1e status=0x0 pg_vec_pos=? frame_offset=? h={...})
266 217 tcpdump(macvtap0): -> __packet_get_status(po=0xffff88003dd2b000 frame=0xffff88003c180000 h={...})
267 237 tcpdump(macvtap0): <- __packet_get_status(): return=0x0
268 248 tcpdump(macvtap0): <- packet_lookup_frame(): return=0xffff88003c180000
269 264 tcpdump(macvtap0): <- packet_poll(): return=0x304
[-- Attachment #3: tun-trace.stp --]
[-- Type: text/plain, Size: 2747 bytes --]
%{
#include <linux/skbuff.h>
#include <linux/if_vlan.h>
%}
function skb_dump:string(skb_ptr:long) %{
struct sk_buff *skb = (void*)THIS->skb_ptr;
if (!skb) {
snprintf(THIS->__retvalue, MAXSTRINGLEN, "skb=NULL");
return;
}
snprintf(THIS->__retvalue, MAXSTRINGLEN, "dev:%s proto:%04x len:%u vlan_tci:{prio:%d cfi:%d vid:%d}",
skb->dev ? skb->dev->name : "NULL",
htons(skb->protocol),
skb->len,
skb->vlan_tci & VLAN_PRIO_MASK,
skb->vlan_tci & VLAN_CFI_MASK,
skb->vlan_tci & VLAN_VID_MASK);
%}
function dev_dump:string(dev_ptr:long) %{
struct net_device *dev = (void*)THIS->dev_ptr;
if (!dev) {
snprintf(THIS->__retvalue, MAXSTRINGLEN, "dev=NULL");
return;
}
snprintf(THIS->__retvalue, MAXSTRINGLEN, "name:%s", dev->name);
%}
function deref_unsafe:long(pskb_ptr:long) %{
struct sk_buff **pskb = (void*)THIS->pskb_ptr;
if (!pskb || !(*pskb))
THIS->__retvalue = 0;
else
THIS->__retvalue = (long)*pskb;
%}
probe begin {
printf ("started\n")
}
global nesting = 0
probe probepoints.call = module("tun").function("*@drivers/net/tun.c").call,
module("macvlan").function("*@drivers/net/macvlan.c").call,
module("macvtap").function("*@drivers/net/macvtap.c").call,
module("bridge").function("*@net/bridge/br_device.c").call,
module("bridge").function("*@net/bridge/br_forward.c").call,
kernel.function("vlan_*").call,
kernel.function("__vlan_*").call,
kernel.function("*@net/packet/af_packet.c").call,
kernel.function("*@net/core/netpoll.c").call,
kernel.function("*@net/core/dev.c").call {
nesting++
}
probe probepoints.return = module("tun").function("*@drivers/net/tun.c").return,
module("macvlan").function("*@drivers/net/macvlan.c").return,
module("macvtap").function("*@drivers/net/macvtap.c").return,
module("bridge").function("*@net/bridge/br_device.c").return,
module("bridge").function("*@net/bridge/br_forward.c").return,
kernel.function("vlan_*").return,
kernel.function("__vlan_*").return,
kernel.function("*@net/packet/af_packet.c").return,
kernel.function("*@net/core/netpoll.c").return,
kernel.function("*@net/core/dev.c").return {
nesting--
}
probe probepoints.call {
printf ("%s -> %s(%s)\n", thread_indent(1), probefunc(), $$vars)
if (@defined($skb))
printf("\tskb_dump:%s\n", skb_dump($skb))
if (@defined($pskb))
printf("\tpskb_dump:%s\n", skb_dump(deref_unsafe($pskb)))
if (@defined($skbp))
printf("\tskbp_dump:%s\n", skb_dump(deref_unsafe($skbp)))
if (@defined($dev))
printf("\tdev:%s\n", dev_dump($dev))
if (@defined($orig_dev))
printf("\torig_dev:%s\n", dev_dump($orig_dev))
if (nesting == 1)
print_stack(backtrace())
}
probe probepoints.return {
printf ("%s <- %s(): %s\n", thread_indent(-1), probefunc(), $$return)
}
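A note on reading the skb_dump lines: the script masks vlan_tci but does not shift the result, so "cfi:4096" is simply bit 0x1000 being set, and a non-zero priority would also print unshifted. Below is a minimal userspace sketch of a shifted decode, assuming the mask and shift values from <linux/if_vlan.h>; struct vlan_fields and vlan_tci_decode() are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Values as defined in <linux/if_vlan.h>. */
#define VLAN_PRIO_MASK  0xe000
#define VLAN_PRIO_SHIFT 13
#define VLAN_CFI_MASK   0x1000
#define VLAN_VID_MASK   0x0fff

/* Hypothetical helper type: the three fields packed into vlan_tci. */
struct vlan_fields {
	unsigned int prio;	/* 3-bit priority, shifted down to 0..7 */
	unsigned int cfi;	/* CFI bit as 0 or 1 */
	unsigned int vid;	/* 12-bit VLAN id */
};

/* Split a host-order TCI value into its fields. */
static struct vlan_fields vlan_tci_decode(uint16_t tci)
{
	struct vlan_fields f;

	f.prio = (tci & VLAN_PRIO_MASK) >> VLAN_PRIO_SHIFT;
	f.cfi = (tci & VLAN_CFI_MASK) ? 1 : 0;
	f.vid = tci & VLAN_VID_MASK;
	return f;
}
```

With this decode, a traced value of "prio:0 cfi:4096 vid:50" corresponds to a TCI of 0x1032, that is prio 0, cfi 1, vid 50.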
^ permalink raw reply
* Re: [PATCH v3 2/2] macvtap: restore vlan header on user read
From: Michael S. Tsirkin @ 2012-05-03 14:31 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Basil Gor, David S. Miller, netdev
In-Reply-To: <m1ehr1l711.fsf@fess.ebiederm.org>
On Thu, May 03, 2012 at 06:37:46AM -0700, Eric W. Biederman wrote:
> "Michael S. Tsirkin" <mst@redhat.com> writes:
>
> > On Wed, Apr 25, 2012 at 10:31:25PM -0700, Eric W. Biederman wrote:
> >> Basil Gor <basil.gor@gmail.com> writes:
> >>
> >> > Vlan tag is restored during buffer transmit to a network device (bridge
> >> > port) in bridging code in case of tun/tap driver. In case of macvtap it
> >> > has to be done explicitly. Otherwise vlan_tci is ignored and user always
> >> > gets untagged packets.
> >>
> >> We could quibble about efficiencies but this looks good except for
> >> macvtap_recvmsg which isn't setting the auxdata for the vlan header.
> >>
> >> Eric
> >
> > Right. I'm guessing we need to support old userspace
> > so if there's auxdata, put vlan there but if not,
> > put the vlan in the packet like this patch does.
>
> This patch isn't horrible.
>
> Still why copy the skb when you can just split the copy to userspace
> into a couple of pieces?
>
> We don't need to change the skb and changing the skb looks like
> it is likely to confuse things and cause bugs because we are
> not working with a consistent model of how vlan information
> is encoded.
>
> Still something needs to happen and this works in more cases even if it
> isn't perfect.
>
> Eric
Absolutely. And it's easier than I thought.
So we can do something like the below (warning: compiled only).
Basil - want to take a look?
My only concern: if we put this logic in an out-of-the-way
driver like macvtap, will people remember to update it?
Maybe it's better to update skb_copy_datagram_const_iovec, which is in core?
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 0427c65..5a1724c 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -1,5 +1,6 @@
#include <linux/etherdevice.h>
#include <linux/if_macvlan.h>
+#include <linux/if_vlan.h>
#include <linux/interrupt.h>
#include <linux/nsproxy.h>
#include <linux/compat.h>
@@ -759,6 +760,7 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
struct macvlan_dev *vlan;
int ret;
int vnet_hdr_len = 0;
+ int vlan_offset = 0;
if (q->flags & IFF_VNET_HDR) {
struct virtio_net_hdr vnet_hdr;
@@ -776,8 +778,29 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
len = min_t(int, skb->len, len);
- ret = skb_copy_datagram_const_iovec(skb, 0, iv, vnet_hdr_len, len);
+ if (vlan_tx_tag_present(skb)) {
+ struct {
+ __be16 h_vlan_proto;
+ __be16 h_vlan_TCI;
+ } veth;
+ veth.h_vlan_proto = htons(ETH_P_8021Q);
+ veth.h_vlan_TCI = htons(vlan_tx_tag_get(skb));
+
+ vlan_offset = offsetof(struct vlan_ethhdr, h_vlan_proto);
+ ret = skb_copy_datagram_const_iovec(skb, 0, iv, vnet_hdr_len,
+ vlan_offset);
+ if (ret)
+ goto done;
+ ret = memcpy_toiovecend(iv, (unsigned char *)&veth, vlan_offset,
+ sizeof veth);
+ if (ret)
+ goto done;
+ vlan_offset += sizeof veth;
+ }
+ ret = skb_copy_datagram_const_iovec(skb, vlan_offset, iv, vnet_hdr_len,
+ len);
+done:
rcu_read_lock_bh();
vlan = rcu_dereference_bh(q->vlan);
if (vlan)
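For readers following the idea rather than the kernel API, the split copy can be illustrated in plain userspace C. This is only a sketch under stated assumptions: copy_with_vlan_tag() is a hypothetical helper, the source is a flat buffer instead of an skb, and the destination is a flat buffer instead of an iovec. It copies the two MAC addresses, writes the 4-byte 802.1Q tag in network byte order, then copies the rest of the frame.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN    6
#define VLAN_HLEN   4
#define ETH_P_8021Q 0x8100

/*
 * Copy an untagged Ethernet frame into dst, inserting an 802.1Q tag
 * right after the destination and source MAC addresses (offset 12,
 * which matches offsetof(struct vlan_ethhdr, h_vlan_proto)).
 * Returns the number of bytes written, or 0 if the frame is too short
 * or dst is too small.
 */
static size_t copy_with_vlan_tag(uint8_t *dst, size_t dst_len,
				 const uint8_t *frame, size_t frame_len,
				 uint16_t tci)
{
	const size_t vlan_offset = 2 * ETH_ALEN;

	if (frame_len < vlan_offset || dst_len < frame_len + VLAN_HLEN)
		return 0;

	/* Piece 1: the MAC addresses, unchanged. */
	memcpy(dst, frame, vlan_offset);

	/* Piece 2: the tag, big-endian on the wire. */
	dst[vlan_offset + 0] = (uint8_t)(ETH_P_8021Q >> 8);
	dst[vlan_offset + 1] = (uint8_t)(ETH_P_8021Q & 0xff);
	dst[vlan_offset + 2] = (uint8_t)(tci >> 8);
	dst[vlan_offset + 3] = (uint8_t)(tci & 0xff);

	/* Piece 3: the original ethertype and payload. */
	memcpy(dst + vlan_offset + VLAN_HLEN, frame + vlan_offset,
	       frame_len - vlan_offset);

	return frame_len + VLAN_HLEN;
}
```

Note that the third piece copies frame_len - vlan_offset bytes, and the output is VLAN_HLEN bytes longer than the input, so the caller's buffer has to account for the extra 4 bytes.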
^ permalink raw reply related
* Re: [PATCH] net: davinci_emac: Add pre_open, post_stop platform callbacks
From: Kevin Hilman @ 2012-05-03 14:22 UTC (permalink / raw)
To: Bedia, Vaibhav
Cc: Mark A. Greer, netdev@vger.kernel.org, linux-omap@vger.kernel.org,
linux-arm-kernel@lists.infradead.org
In-Reply-To: <B5906170F1614E41A8A28DE3B8D121433EA72820@DBDE01.ent.ti.com>
"Bedia, Vaibhav" <vaibhav.bedia@ti.com> writes:
> On Thu, May 03, 2012 at 05:17:18, Mark A. Greer wrote:
>> From: "Mark A. Greer" <mgreer@animalcreek.com>
>>
>> The davinci EMAC driver has been incorporated into the am35x
>> family of SoC's which is OMAP-based. The incorporation is
>> incomplete in that the EMAC cannot unblock the [ARM] core if
>> it's blocked on a 'wfi' instruction. This is an issue with
>> the cpu_idle code because it has the core execute a 'wfi'
>> instruction.
>>
>> To work around this issue, add platform data callbacks which
>> are called at the beginning of the open routine and at the
>> end of the stop routine of the davinci_emac driver. The
>> callbacks allow the platform code to issue disable_hlt() and
>> enable_hlt() calls appropriately. Calling disable_hlt()
>> prevents cpu_idle from issuing the 'wfi' instruction.
>>
>> It is not sufficient to simply call disable_hlt() when
>> there is an EMAC present because it could be present but
>> not actually used in which case, we do want the 'wfi' to
>> be executed.
>>
>
> Are you trying to say that if ARM executes _just_ wfi and _absolutely
> nothing else_ is done in the OMAP PM code, EMAC stops working?
>
> However, if this is indeed the case, then probably a better solution would be
> to invoke disable_hlt() from the board file when EMAC support is compiled in.
No. As Mark stated in the changelog, doing that will prevent any
low-power states even if the EMAC is not in use. IMO, it is best
to only prevent WFI when absolutely needed.
Kevin
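To make the difference concrete, here is a small userspace model of the callback approach. It is only a sketch: the counter-based disable_hlt()/enable_hlt() below are stand-ins for the kernel functions, and struct emac_platform_data, emac_open(), emac_stop() and board_emac_pdata are hypothetical names, not the driver's actual code.

```c
#include <stddef.h>

/* Stand-in for the counter that gates 'wfi' in the idle loop. */
static int hlt_counter;

static void disable_hlt(void) { hlt_counter++; }
static void enable_hlt(void)  { hlt_counter--; }
static int  wfi_allowed(void) { return hlt_counter == 0; }

/* Hypothetical platform data carrying the two callbacks. */
struct emac_platform_data {
	void (*pre_open)(void);		/* called at the start of open */
	void (*post_stop)(void);	/* called at the end of stop */
};

/* Board code: block 'wfi' only while the interface is open. */
static const struct emac_platform_data board_emac_pdata = {
	.pre_open  = disable_hlt,
	.post_stop = enable_hlt,
};

/* Model of the driver's open routine. */
static int emac_open(const struct emac_platform_data *pdata)
{
	if (pdata && pdata->pre_open)
		pdata->pre_open();
	/* ... bring the hardware up ... */
	return 0;
}

/* Model of the driver's stop routine. */
static void emac_stop(const struct emac_platform_data *pdata)
{
	/* ... shut the hardware down ... */
	if (pdata && pdata->post_stop)
		pdata->post_stop();
}
```

In this model wfi_allowed() is true until emac_open() runs and becomes true again after emac_stop(), whereas calling disable_hlt() once from the board file would keep it false for the whole uptime.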
^ permalink raw reply
* Re: [PATCH 04/11] mm: Add support for a filesystem to activate swap files and use direct_IO for writing swap pages
From: Mel Gorman @ 2012-05-03 14:14 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux-MM, Linux-Netdev, Linux-NFS, LKML, David Miller,
Trond Myklebust, Neil Brown, Christoph Hellwig, Peter Zijlstra,
Mike Christie, Eric B Munson
In-Reply-To: <20120501155308.5679a09b.akpm@linux-foundation.org>
On Tue, May 01, 2012 at 03:53:08PM -0700, Andrew Morton wrote:
> > It is perfectly possible that direct_IO be used to read the swap
> > pages but it is an unnecessary complication. Similarly, it is possible
> > that ->writepage be used instead of direct_io to write the pages but
> > filesystem developers have stated that calling writepage from the VM
> > is undesirable for a variety of reasons and using direct_IO opens up
> > the possibility of writing back batches of swap pages in the future.
>
> This all seems a bit odd. And abusive.
>
> Yes, it would be more pleasing if direct-io was used for reading as
> well. How much more complication would it add?
>
Quite a bit.
Superficially it's easy because swap_readpage() just sets up a kiocb,
fills in the necessary details and calls ->direct_IO. The complexity is
around page locking and writing back pending writes in NFS.
read_swap_cache_async() calls swap_readpage with the page locked and
is expected to return with the page unlocked on successful completion of
the IO.
For swap-over-nfs, the readpage handler behaves exactly as
read_swap_cache_async() expects. For everything else, submit_bio() is used
with end_swap_bio_read() unlocking the page. Both of these handlers behave
the same with respect to locking. The direct_IO handler does not expect the
page to be locked and does not unlock it itself. Even if it works for NFS,
there might be other complications in the future around page locking in
direct_IO handlers.
The second complexity may be specific to NFS. The NFS readpage handler
flushes any pending writes with nfs_wb_page() before doing the read which it
can do because it holds the page lock. It was completely unclear how the same
could be achieved from swap_readpage() in a filesystem-independent manner.
As ->readpage() already knew how to do the right thing in all cases, I
used it.
> If I understand correctly, on the read path we're taking a fresh page
> which is destined for swapcache and then pretending that it is a
> pagecache page for the purpose of the I/O?
>
> If there already existed a
> pagecache page for that file offset then we let it just sit there and
> bypass it?
>
On the read path read_swap_cache_async() checks if a page is already in
swapcache and if not, allocates a new page, adds it to the swapcache
and calls swap_readpage. Hence I do not think we are tripping the
problem you are thinking of.
> I'm surprised that this works at all - I guess nothing under
> ->readpage() goes poking around in the address_space. For NFS, at
> least!
>
> >
> > ...
> >
> > @@ -93,11 +94,38 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> > {
> > struct bio *bio;
> > int ret = 0, rw = WRITE;
> > + struct swap_info_struct *sis = page_swap_info(page);
> >
> > if (try_to_free_swap(page)) {
> > unlock_page(page);
> > goto out;
> > }
> > +
> > + if (sis->flags & SWP_FILE) {
> > + struct kiocb kiocb;
> > + struct file *swap_file = sis->swap_file;
> > + struct address_space *mapping = swap_file->f_mapping;
> > + struct iovec iov = {
> > + .iov_base = page_address(page),
>
> Didn't we need to kmap the page?
>
.... Yep, that would be important all right. I'll look at this closely
and do a round of testing on x86-32.
> > + .iov_len = PAGE_SIZE,
> > + };
> > +
> > + init_sync_kiocb(&kiocb, swap_file);
> > + kiocb.ki_pos = page_file_offset(page);
> > + kiocb.ki_left = PAGE_SIZE;
> > + kiocb.ki_nbytes = PAGE_SIZE;
> > +
> > + unlock_page(page);
> > + ret = mapping->a_ops->direct_IO(KERNEL_WRITE,
> > + &kiocb, &iov,
> > + kiocb.ki_pos, 1);
>
> I wonder if there's any point in setting PG_writeback around the IO. I
> can't think of a reason.
>
One does not spring to mind.
> > + if (ret == PAGE_SIZE) {
> > + count_vm_event(PSWPOUT);
> > + ret = 0;
> > + }
> > + return ret;
> > + }
> > +
> > bio = get_swap_bio(GFP_NOIO, page, end_swap_bio_write);
> > if (bio == NULL) {
> > set_page_dirty(page);
> > @@ -119,9 +147,21 @@ int swap_readpage(struct page *page)
> > {
> > struct bio *bio;
> > int ret = 0;
> > + struct swap_info_struct *sis = page_swap_info(page);
> >
> > VM_BUG_ON(!PageLocked(page));
> > VM_BUG_ON(PageUptodate(page));
> > +
> > + if (sis->flags & SWP_FILE) {
> > + struct file *swap_file = sis->swap_file;
> > + struct address_space *mapping = swap_file->f_mapping;
> > +
> > + ret = mapping->a_ops->readpage(swap_file, page);
> > + if (!ret)
> > + count_vm_event(PSWPIN);
> > + return ret;
> > + }
>
> Confused. Where did we set up page->index with the file offset?
>
We don't use page->index in this case.
__add_to_swap_cache() records the swap entry in page->private.
nfs_readpage() looks up the page index with page_index() which for
SwapCache pages calls __page_file_index(). It in turn gets the swap
entry and looks up the index with swp_offset().
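The lookup chain can be modelled in a few lines of userspace C. This is a simplified sketch, not the kernel code: struct page_model and page_index_model() are hypothetical, and the 5-bit type field in the swap entry encoding is an assumption made for the model only.

```c
#include <stdbool.h>

/* Assumption for this model: the top 5 bits hold the swap type. */
#define SWP_TYPE_BITS  5
#define SWP_TYPE_SHIFT ((unsigned int)(sizeof(unsigned long) * 8) - SWP_TYPE_BITS)
#define SWP_OFFSET_MASK ((1UL << SWP_TYPE_SHIFT) - 1)

typedef struct { unsigned long val; } swp_entry_t;

/* Pack a swap type and an offset into one word. */
static swp_entry_t swp_entry(unsigned long type, unsigned long offset)
{
	swp_entry_t e = { (type << SWP_TYPE_SHIFT) | (offset & SWP_OFFSET_MASK) };
	return e;
}

/* Extract the offset within the swap area. */
static unsigned long swp_offset(swp_entry_t e)
{
	return e.val & SWP_OFFSET_MASK;
}

/* Hypothetical stand-in for the few struct page fields that matter here. */
struct page_model {
	bool swapcache;		/* models PageSwapCache() */
	unsigned long index;	/* page->index, used for pagecache pages */
	unsigned long private;	/* page->private, holds the swap entry */
};

/* Pagecache pages use ->index; swapcache pages derive it from ->private. */
static unsigned long page_index_model(const struct page_model *page)
{
	if (page->swapcache) {
		swp_entry_t e = { page->private };
		return swp_offset(e);
	}
	return page->index;
}
```

The point of the model is that a swapcache page never needs page->index to be set up with a file offset; whatever sits in index is ignored once the swapcache flag is set.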
> > bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read);
> > if (bio == NULL) {
> > unlock_page(page);
> > @@ -133,3 +173,15 @@ int swap_readpage(struct page *page)
> > out:
> > return ret;
> > }
> > +
> > +int swap_set_page_dirty(struct page *page)
> > +{
> > + struct swap_info_struct *sis = page_swap_info(page);
> > +
> > + if (sis->flags & SWP_FILE) {
> > + struct address_space *mapping = sis->swap_file->f_mapping;
> > + return mapping->a_ops->set_page_dirty(page);
> > + } else {
> > + return __set_page_dirty_nobuffers(page);
> > + }
> > +}
>
> More confused. This is a swapcache page, not a pagecache page? Why
> are we running set_page_dirty() against it?
>
I don't really get the question. swap-over-NFS is not doing anything
different here than what we do today. PageSwapCache pages still have to
be marked dirty so they get written to disk before being discarded.
> And what are we doing on the !SWP_FILE path?
Maintaining existing behaviour. This is what the swap ops look like
without the patchset:
static const struct address_space_operations swap_aops = {
.writepage = swap_writepage,
.set_page_dirty = __set_page_dirty_nobuffers,
.migratepage = migrate_page,
};
> Newly setting PG_dirty
> against block-device-backed swapcache pages? Why? Where does it get
> cleared again?
>
clear_page_dirty_for_io() in vmscan.c#pageout() ? I might be missing
something in your question again :(
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 9d3dd37..c25b9cf 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -26,7 +26,7 @@
> > */
> > static const struct address_space_operations swap_aops = {
> > .writepage = swap_writepage,
> > - .set_page_dirty = __set_page_dirty_nobuffers,
> > + .set_page_dirty = swap_set_page_dirty,
> > .migratepage = migrate_page,
> > };
> >
> > ...
> >
--
Mel Gorman
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
^ permalink raw reply
* Re: [PATCH v3 2/2] macvtap: restore vlan header on user read
From: Eric W. Biederman @ 2012-05-03 13:37 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Basil Gor, David S. Miller, netdev
In-Reply-To: <20120503130751.GB26366@redhat.com>
"Michael S. Tsirkin" <mst@redhat.com> writes:
> On Wed, Apr 25, 2012 at 10:31:25PM -0700, Eric W. Biederman wrote:
>> Basil Gor <basil.gor@gmail.com> writes:
>>
>> > Vlan tag is restored during buffer transmit to a network device (bridge
>> > port) in bridging code in case of tun/tap driver. In case of macvtap it
>> > has to be done explicitly. Otherwise vlan_tci is ignored and user always
>> > gets untagged packets.
>>
>> We could quibble about efficiencies but this looks good except for
>> macvtap_recvmsg which isn't setting the auxdata for the vlan header.
>>
>> Eric
>
> Right. I'm guessing we need to support old userspace
> so if there's auxdata, put vlan there but if not,
> put the vlan in the packet like this patch does.
This patch isn't horrible.
Still why copy the skb when you can just split the copy to userspace
into a couple of pieces?
We don't need to change the skb and changing the skb looks like
it is likely to confuse things and cause bugs because we are
not working with a consistent model of how vlan information
is encoded.
Still something needs to happen and this works in more cases even if it
isn't perfect.
Eric
>> > Signed-off-by: Basil Gor <basil.gor@gmail.com>
>> > ---
>> > drivers/net/macvtap.c | 11 ++++++++++-
>> > 1 files changed, 10 insertions(+), 1 deletions(-)
>> >
>> > diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
>> > index 0427c65..28d2678 100644
>> > --- a/drivers/net/macvtap.c
>> > +++ b/drivers/net/macvtap.c
>> > @@ -1,6 +1,7 @@
>> > #include <linux/etherdevice.h>
>> > #include <linux/if_macvlan.h>
>> > #include <linux/interrupt.h>
>> > +#include <linux/if_vlan.h>
>> > #include <linux/nsproxy.h>
>> > #include <linux/compat.h>
>> > #include <linux/if_tun.h>
>> > @@ -753,13 +754,21 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
>> >
>> > /* Put packet to the user space buffer */
>> > static ssize_t macvtap_put_user(struct macvtap_queue *q,
>> > - const struct sk_buff *skb,
>> > + struct sk_buff *skb,
>> > const struct iovec *iv, int len)
>> > {
>> > struct macvlan_dev *vlan;
>> > int ret;
>> > int vnet_hdr_len = 0;
>> >
>> > + if (vlan_tx_tag_present(skb)) {
>> > + skb = __vlan_put_tag(skb, vlan_tx_tag_get(skb));
>> > + if (unlikely(!skb))
>> > + return -ENOMEM;
>> > +
>> > + skb->vlan_tci = 0;
>> > + }
>> > +
>> > if (q->flags & IFF_VNET_HDR) {
>> > struct virtio_net_hdr vnet_hdr;
>> > vnet_hdr_len = q->vnet_hdr_sz;
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v3 1/2] vhost-net: fix handle_rx buffer size
From: Michael S. Tsirkin @ 2012-05-03 13:16 UTC (permalink / raw)
To: Basil Gor; +Cc: Eric W. Biederman, David S. Miller, netdev
In-Reply-To: <1335373275-336-1-git-send-email-basil.gor@gmail.com>
On Wed, Apr 25, 2012 at 09:01:15PM +0400, Basil Gor wrote:
> Take vlan header length into account, when vlan id is stored as
> vlan_tci. Otherwise tagged packets coming from macvtap will be
> truncated.
>
> Signed-off-by: Basil Gor <basil.gor@gmail.com>
So I'm inclined to apply these two patches. This doesn't fix the
packet socket backend, but that could be fixed by a follow-up patch.
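The length accounting itself is small enough to state as a userspace sketch. rx_len_needed() is a hypothetical helper; the only fact assumed from the kernel headers is VLAN_HLEN being 4 bytes.

```c
#include <stdbool.h>
#include <stddef.h>

#define VLAN_HLEN 4	/* size of the 802.1Q tag, as in <linux/if_vlan.h> */

/*
 * Bytes the reader must have room for: the payload length, plus the
 * tag when the VLAN id is carried out of band in vlan_tci and will be
 * re-inserted on the way to userspace.
 */
static size_t rx_len_needed(size_t skb_len, bool vlan_tag_present)
{
	return skb_len + (vlan_tag_present ? VLAN_HLEN : 0);
}
```

Sizing the buffer from skb_len alone would leave it 4 bytes short for tagged packets, which is the truncation the changelog describes.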
> ---
> drivers/vhost/net.c | 7 ++++++-
> 1 files changed, 6 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 1f21d2a..5c17010 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -24,6 +24,7 @@
> #include <linux/if_arp.h>
> #include <linux/if_tun.h>
> #include <linux/if_macvlan.h>
> +#include <linux/if_vlan.h>
>
> #include <net/sock.h>
>
> @@ -283,8 +284,12 @@ static int peek_head_len(struct sock *sk)
>
> spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
> head = skb_peek(&sk->sk_receive_queue);
> - if (likely(head))
> + if (likely(head)) {
> len = head->len;
> + if (vlan_tx_tag_present(head))
> + len += VLAN_HLEN;
> + }
> +
> spin_unlock_irqrestore(&sk->sk_receive_queue.lock, flags);
> return len;
> }
> --
> 1.7.6.5
>
^ permalink raw reply
* Re: [PATCH v3 2/2] macvtap: restore vlan header on user read
From: Michael S. Tsirkin @ 2012-05-03 13:07 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Basil Gor, David S. Miller, netdev
In-Reply-To: <m162cngitu.fsf@fess.ebiederm.org>
On Wed, Apr 25, 2012 at 10:31:25PM -0700, Eric W. Biederman wrote:
> Basil Gor <basil.gor@gmail.com> writes:
>
> > Vlan tag is restored during buffer transmit to a network device (bridge
> > port) in bridging code in case of tun/tap driver. In case of macvtap it
> > has to be done explicitly. Otherwise vlan_tci is ignored and user always
> > gets untagged packets.
>
> We could quibble about efficiencies but this looks good except for
> macvtap_recvmsg which isn't setting the auxdata for the vlan header.
>
> Eric
Right. I'm guessing we need to support old userspace,
so if there's auxdata, put the vlan there, but if not,
put the vlan in the packet like this patch does.
> > Signed-off-by: Basil Gor <basil.gor@gmail.com>
> > ---
> > drivers/net/macvtap.c | 11 ++++++++++-
> > 1 files changed, 10 insertions(+), 1 deletions(-)
> >
> > diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> > index 0427c65..28d2678 100644
> > --- a/drivers/net/macvtap.c
> > +++ b/drivers/net/macvtap.c
> > @@ -1,6 +1,7 @@
> > #include <linux/etherdevice.h>
> > #include <linux/if_macvlan.h>
> > #include <linux/interrupt.h>
> > +#include <linux/if_vlan.h>
> > #include <linux/nsproxy.h>
> > #include <linux/compat.h>
> > #include <linux/if_tun.h>
> > @@ -753,13 +754,21 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
> >
> > /* Put packet to the user space buffer */
> > static ssize_t macvtap_put_user(struct macvtap_queue *q,
> > -				const struct sk_buff *skb,
> > +				struct sk_buff *skb,
> > 				const struct iovec *iv, int len)
> > {
> > 	struct macvlan_dev *vlan;
> > 	int ret;
> > 	int vnet_hdr_len = 0;
> >
> > +	if (vlan_tx_tag_present(skb)) {
> > +		skb = __vlan_put_tag(skb, vlan_tx_tag_get(skb));
> > +		if (unlikely(!skb))
> > +			return -ENOMEM;
> > +
> > +		skb->vlan_tci = 0;
> > +	}
> > +
> > 	if (q->flags & IFF_VNET_HDR) {
> > 		struct virtio_net_hdr vnet_hdr;
> > 		vnet_hdr_len = q->vnet_hdr_sz;
* Re: [PATCH] wl12xx: fix size of two memset's in wl1271_cmd_build_arp_rsp()
From: Luciano Coelho @ 2012-05-03 12:56 UTC (permalink / raw)
To: Jesper Juhl; +Cc: linux-wireless, John W. Linville, netdev, linux-kernel
In-Reply-To: <alpine.LNX.2.00.1204222255460.27455@swampdragon.chaosbits.net>
On Sun, 2012-04-22 at 22:57 +0200, Jesper Juhl wrote:
> On Sun, 22 Apr 2012, Jesper Juhl wrote:
>
> > We currently do this:
> >
> > int wl1271_cmd_build_arp_rsp(struct wl1271 *wl, struct wl12xx_vif *wlvif)
> > ...
> > struct wl12xx_arp_rsp_template *tmpl;
> > struct ieee80211_hdr_3addr *hdr;
> > ...
> > tmpl = (struct wl12xx_arp_rsp_template *)skb_put(skb, sizeof(*tmpl));
> > memset(tmpl, 0, sizeof(tmpl));
> > ...
> > hdr = (struct ieee80211_hdr_3addr *)skb_push(skb, sizeof(*hdr));
> > memset(hdr, 0, sizeof(*hdr));
> > ...
> >
> > I believe we want to set the entire structures to 0 with those
> > memset() calls, not just zero the initial part of them (size of the
> > pointer bytes).
> >
>
> Sorry, I accidentally copied that code from the fixed version. The above
> should read:
>
>
> We currently do this:
>
> int wl1271_cmd_build_arp_rsp(struct wl1271 *wl, struct wl12xx_vif *wlvif)
> ...
> struct wl12xx_arp_rsp_template *tmpl;
> struct ieee80211_hdr_3addr *hdr;
> ...
> tmpl = (struct wl12xx_arp_rsp_template *)skb_put(skb, sizeof(*tmpl));
> memset(tmpl, 0, sizeof(tmpl));
> ...
> hdr = (struct ieee80211_hdr_3addr *)skb_push(skb, sizeof(*hdr));
> memset(hdr, 0, sizeof(hdr));
> ...
>
> I believe we want to set the entire structures to 0 with those
> memset() calls, not just zero the initial part of them (size of the
> pointer bytes).
>
>
>
>
>
> > Signed-off-by: Jesper Juhl <jj@chaosbits.net>
> > ---
Applied with the fixed commit log and merged into the new directory
structure. Thanks Jesper!