From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
stable@vger.kernel.org, Herbert Xu <herbert@gondor.apana.org.au>,
Marcelo Ricardo Leitner <mleitner@redhat.com>,
Florian Westphal <fw@strlen.de>,
"David S. Miller" <davem@davemloft.net>
Subject: [PATCH 3.10 36/97] net: ip, ipv6: handle gso skbs in forwarding path
Date: Tue, 4 Mar 2014 12:03:49 -0800 [thread overview]
Message-ID: <20140304200347.238636259@linuxfoundation.org> (raw)
In-Reply-To: <20140304200345.895517495@linuxfoundation.org>
3.10-stable review patch. If anyone has any objections, please let me know.
------------------
From: Florian Westphal <fw@strlen.de>
commit fe6cc55f3a9a053482a76f5a6b2257cee51b4663 upstream.
Marcelo Ricardo Leitner reported problems when the forwarding link path
has a lower mtu than the incoming one if the inbound interface supports GRO.
Given:
Host <mtu1500> R1 <mtu1200> R2
Host sends tcp stream which is routed via R1 and R2. R1 performs GRO.
In this case, the kernel will fail to send ICMP fragmentation needed
messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu
checks in forward path. Instead, Linux tries to send out packets exceeding
the mtu.
When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does
not fragment the packets when forwarding, and again tries to send out
packets exceeding R1-R2 link mtu.
This alters the forwarding dstmtu checks to take the individual gso
segment lengths into account.
For ipv6, we send out pkt too big error for gso if the individual
segments are too big.
For ipv4, we either send icmp fragmentation needed, or, if the DF bit
is not set, perform software segmentation and let the output path
create fragments when the packet is leaving the machine.
It is not 100% correct as the error message will contain the headers of
the GRO skb instead of the original/segmented one, but it seems to
work fine in my (limited) tests.
Eric Dumazet suggested to simply shrink mss via ->gso_size to avoid
sofware segmentation.
However it turns out that skb_segment() assumes skb nr_frags is related
to mss size so we would BUG there. I don't want to mess with it considering
Herbert and Eric disagree on what the correct behavior should be.
Hannes Frederic Sowa notes that when we would shrink gso_size
skb_segment would then also need to deal with the case where
SKB_MAX_FRAGS would be exceeded.
This uses sofware segmentation in the forward path when we hit ipv4
non-DF packets and the outgoing link mtu is too small. Its not perfect,
but given the lack of bug reports wrt. GRO fwd being broken this is a
rare case anyway. Also its not like this could not be improved later
once the dust settles.
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Reported-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
include/linux/skbuff.h | 17 +++++++++++
net/ipv4/ip_forward.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++--
net/ipv6/ip6_output.c | 17 ++++++++++-
3 files changed, 101 insertions(+), 4 deletions(-)
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2913,5 +2913,22 @@ static inline bool skb_head_is_locked(co
{
return !skb->head_frag || skb_cloned(skb);
}
+
+/**
+ * skb_gso_network_seglen - Return length of individual segments of a gso packet
+ *
+ * @skb: GSO skb
+ *
+ * skb_gso_network_seglen is used to determine the real size of the
+ * individual segments, including Layer3 (IP, IPv6) and L4 headers (TCP/UDP).
+ *
+ * The MAC/L2 header is not accounted for.
+ */
+static inline unsigned int skb_gso_network_seglen(const struct sk_buff *skb)
+{
+ unsigned int hdr_len = skb_transport_header(skb) -
+ skb_network_header(skb);
+ return hdr_len + skb_gso_transport_seglen(skb);
+}
#endif /* __KERNEL__ */
#endif /* _LINUX_SKBUFF_H */
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -39,6 +39,71 @@
#include <net/route.h>
#include <net/xfrm.h>
+static bool ip_may_fragment(const struct sk_buff *skb)
+{
+ return unlikely((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0) ||
+ !skb->local_df;
+}
+
+static bool ip_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
+{
+ if (skb->len <= mtu || skb->local_df)
+ return false;
+
+ if (skb_is_gso(skb) && skb_gso_network_seglen(skb) <= mtu)
+ return false;
+
+ return true;
+}
+
+static bool ip_gso_exceeds_dst_mtu(const struct sk_buff *skb)
+{
+ unsigned int mtu;
+
+ if (skb->local_df || !skb_is_gso(skb))
+ return false;
+
+ mtu = dst_mtu(skb_dst(skb));
+
+ /* if seglen > mtu, do software segmentation for IP fragmentation on
+ * output. DF bit cannot be set since ip_forward would have sent
+ * icmp error.
+ */
+ return skb_gso_network_seglen(skb) > mtu;
+}
+
+/* called if GSO skb needs to be fragmented on forward */
+static int ip_forward_finish_gso(struct sk_buff *skb)
+{
+ struct dst_entry *dst = skb_dst(skb);
+ netdev_features_t features;
+ struct sk_buff *segs;
+ int ret = 0;
+
+ features = netif_skb_dev_features(skb, dst->dev);
+ segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
+ if (IS_ERR(segs)) {
+ kfree_skb(skb);
+ return -ENOMEM;
+ }
+
+ consume_skb(skb);
+
+ do {
+ struct sk_buff *nskb = segs->next;
+ int err;
+
+ segs->next = NULL;
+ err = dst_output(segs);
+
+ if (err && ret == 0)
+ ret = err;
+ segs = nskb;
+ } while (segs);
+
+ return ret;
+}
+
static int ip_forward_finish(struct sk_buff *skb)
{
struct ip_options *opt = &(IPCB(skb)->opt);
@@ -49,6 +114,9 @@ static int ip_forward_finish(struct sk_b
if (unlikely(opt->optlen))
ip_forward_options(skb);
+ if (ip_gso_exceeds_dst_mtu(skb))
+ return ip_forward_finish_gso(skb);
+
return dst_output(skb);
}
@@ -88,8 +156,7 @@ int ip_forward(struct sk_buff *skb)
if (opt->is_strictroute && rt->rt_uses_gateway)
goto sr_failed;
- if (unlikely(skb->len > dst_mtu(&rt->dst) && !skb_is_gso(skb) &&
- (ip_hdr(skb)->frag_off & htons(IP_DF))) && !skb->local_df) {
+ if (!ip_may_fragment(skb) && ip_exceeds_mtu(skb, dst_mtu(&rt->dst))) {
IP_INC_STATS(dev_net(rt->dst.dev), IPSTATS_MIB_FRAGFAILS);
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
htonl(dst_mtu(&rt->dst)));
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -345,6 +345,20 @@ static inline int ip6_forward_finish(str
return dst_output(skb);
}
+static bool ip6_pkt_too_big(const struct sk_buff *skb, unsigned int mtu)
+{
+ if (skb->len <= mtu || skb->local_df)
+ return false;
+
+ if (IP6CB(skb)->frag_max_size && IP6CB(skb)->frag_max_size > mtu)
+ return true;
+
+ if (skb_is_gso(skb) && skb_gso_network_seglen(skb) <= mtu)
+ return false;
+
+ return true;
+}
+
int ip6_forward(struct sk_buff *skb)
{
struct dst_entry *dst = skb_dst(skb);
@@ -467,8 +481,7 @@ int ip6_forward(struct sk_buff *skb)
if (mtu < IPV6_MIN_MTU)
mtu = IPV6_MIN_MTU;
- if ((!skb->local_df && skb->len > mtu && !skb_is_gso(skb)) ||
- (IP6CB(skb)->frag_max_size && IP6CB(skb)->frag_max_size > mtu)) {
+ if (ip6_pkt_too_big(skb, mtu)) {
/* Again, force OUTPUT device used as source address */
skb->dev = dst->dev;
icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
next prev parent reply other threads:[~2014-03-04 20:03 UTC|newest]
Thread overview: 94+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-03-04 20:03 [PATCH 3.10 00/97] 3.10.33-stable review Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 01/97] drm/nouveau: set irq_enabled manually Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 02/97] drm/nv50/disp: use correct register to determine DP display bpp Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 03/97] ext4: fix error paths in swap_inode_boot_loader() Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 04/97] ext4: dont try to modify s_flags if the the file system is read-only Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 05/97] ext4: fix online resize with very large inode tables Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 06/97] ext4: fix online resize with a non-standard blocks per group setting Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 07/97] ext4: dont leave i_crtime.tv_sec uninitialized Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 08/97] ARM: dma-mapping: fix GFP_ATOMIC macro usage Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 09/97] ARM: 7953/1: mm: ensure TLB invalidation is complete before enabling MMU Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 10/97] ARM: 7957/1: add DSB after icache flush in __flush_icache_all() Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 11/97] ARM: OMAP2+: gpmc: fix: DT NAND child nodes not probed when MTD_NAND is built as module Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 12/97] ARM: OMAP2+: gpmc: fix: DT ONENAND child nodes not probed when MTD_ONENAND " Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 13/97] avr32: fix missing module.h causing build failure in mimc200/fram.c Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 14/97] avr32: Makefile: add -D__linux__ flag for gcc-4.4.7 use Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 15/97] cifs: ensure that uncached writes handle unmapped areas correctly Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 16/97] CIFS: Fix too big maxBuf size for SMB3 mounts Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 17/97] rtl8187: fix regression on MIPS without coherent DMA Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 18/97] rtlwifi: Fix incorrect return from rtl_ps_enable_nic() Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 19/97] rtlwifi: rtl8192ce: Fix too long disable of IRQs Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 20/97] 6lowpan: fix lockdep splats Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 21/97] 9p/trans_virtio.c: Fix broken zero-copy on vmalloc() buffers Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 22/97] can: add destructor for self generated skbs Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 23/97] ipv4: Fix runtime WARNING in rtmsg_ifa() Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 25/97] netpoll: fix netconsole IPv6 setup Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 27/97] tcp: tsq: fix nonagle handling Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 28/97] tg3: Fix deadlock in tg3_change_mtu() Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 30/97] usbnet: remove generic hard_header_len check Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 31/97] bonding: 802.3ad: make aggregator_identifier bond-private Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 32/97] ipv4: fix counter in_slow_tot Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 33/97] net: sctp: fix sctp_connectx abi for ia32 emulation/compat mode Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 34/97] net: add and use skb_gso_transport_seglen() Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 35/97] net: core: introduce netif_skb_dev_features Greg Kroah-Hartman
2014-03-04 20:03 ` Greg Kroah-Hartman [this message]
2014-03-04 20:03 ` [PATCH 3.10 37/97] net: use __GFP_NORETRY for high order allocations Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 38/97] memcg: fix endless loop caused by mem_cgroup_iter Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 39/97] fs: fix iversion handling Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 40/97] ALSA: usb-audio: work around KEF X300A firmware bug Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 41/97] ALSA: hda/ca0132 - setup/cleanup streams Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 42/97] ALSA: hda/ca0132 - Fix recording from mode id 0x8 Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 43/97] ALSA: hda - Enable front audio jacks on one HP desktop model Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 44/97] kvm: x86: fix emulator buffer overflow (CVE-2014-0049) Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 45/97] ASoC: max98090: sync regcache on entering STANDBY Greg Kroah-Hartman
2014-03-04 20:03 ` [PATCH 3.10 46/97] ASoC: wm8770: Fix wrong number of enum items Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 47/97] ASoC: da732x: Mark DC offset control registers volatile Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 48/97] ASoC: sta32x: Fix cache sync Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 50/97] ASoC: sta32x: Fix array access overflow Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 51/97] ASoC: wm8958-dsp: Fix firmware block loading Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 52/97] SUNRPC: Fix races in xs_nospace() Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 53/97] powerpc/le: Ensure that the stop-self RTAS token is handled correctly Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 54/97] powerpc/crashdump : Fix page frame number check in copy_oldmem_page Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 55/97] ahci: disable NCQ on Samsung pci-e SSDs on macbooks Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 56/97] x86: dma-mapping: fix GFP_ATOMIC macro usage Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 57/97] perf/x86: Fix event scheduling Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 58/97] ata: enable quirk from jmicron JMB350 for JMB394 Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 59/97] sata_sil: apply MOD15WRITE quirk to TOSHIBA MK2561GSYN Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 60/97] cpufreq: powernow-k8: Initialize per-cpu data-structures properly Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 61/97] PCI: Enable INTx if BIOS left them disabled Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 62/97] ACPI / PCI: Fix memory leak in acpi_pci_irq_enable() Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 63/97] i7core_edac: Fix PCI device reference count Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 64/97] ACPI / video: Filter the _BCL table for duplicate brightness values Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 65/97] ACPI / processor: Rework processor throttling with work_on_cpu() Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 66/97] can: kvaser_usb: check number of channels returned by HW Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 67/97] usb: chipidea: need to mask when writting endptflush and endptprime Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 68/97] usb: gadget: bcm63xx_udc: fix build failure on DMA channel code Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 69/97] USB: serial: option: blacklist interface 4 for Cinterion PHS8 and PXS8 Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 70/97] usb: ehci: fix deadlock when threadirqs option is used Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 71/97] USB: ftdi_sio: add Cressi Leonardo PID Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 72/97] mei: set clients read_cb to NULL when flow control fails Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 73/97] hwmon: (max1668) Fix writing the minimum temperature Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 74/97] workqueue: ensure @task is valid across kthread_stop() Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 76/97] iio:gyro: bug on L3GD20H gyroscope support Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 77/97] perf: Fix hotplug splat Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 78/97] ALSA: hda - Add a fixup for HP Folio 13 mute LED Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 79/97] xtensa: introduce spill_registers_kernel macro Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 80/97] SELinux: bigendian problems with filename trans rules Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 81/97] quota: Fix race between dqput() and dquot_scan_active() Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 82/97] ipc,mqueue: remove limits for the amount of system-wide queues Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 83/97] Input - arizona-haptics: Fix double lock of dapm_mutex Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 84/97] irq-metag*: stop set_affinity vectoring to offline cpus Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 85/97] ARM64: unwind: Fix PC calculation Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 86/97] ARM: tegra: only run PL310 init on systems with one Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 87/97] ARM: 7749/1: spinlock: retry trylock operation if strex fails on free lock Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 88/97] ARM: 7812/1: rwlocks: " Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 89/97] qla2xxx: Fix kernel panic on selective retransmission request Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 90/97] i7300_edac: Fix device reference count Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 91/97] dma: ste_dma40: dont dereference free:d descriptor Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 92/97] dm mpath: fix stalls when handling invalid ioctls Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 93/97] dm thin: avoid metadata commit if a pools thin devices havent changed Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 94/97] dm thin: fix the error path for the thin device constructor Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 95/97] drm/radeon: print the supported atpx function mask Greg Kroah-Hartman
2014-03-04 20:04 ` [PATCH 3.10 97/97] drm/radeon: disable pll sharing for DP on DCE4.1 Greg Kroah-Hartman
2014-03-05 1:16 ` [PATCH 3.10 00/97] 3.10.33-stable review Guenter Roeck
2014-03-05 22:31 ` Shuah Khan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140304200347.238636259@linuxfoundation.org \
--to=gregkh@linuxfoundation.org \
--cc=davem@davemloft.net \
--cc=fw@strlen.de \
--cc=herbert@gondor.apana.org.au \
--cc=linux-kernel@vger.kernel.org \
--cc=mleitner@redhat.com \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).