* [PATCH] net: sched: integer overflow fix
From: Stefan Hasko @ 2012-12-22 1:04 UTC (permalink / raw)
To: Jamal Hadi Salim, David S. Miller, netdev; +Cc: linux-kernel, Stefan Hasko
Sorry, I did not realize different sizes casting problem, now it's clear to me. Thanks for help.
Fixed integer overflow in function htb_dequeue
Signed-off-by: Stefan Hasko <hasko.stevo@gmail.com>
---
net/sched/sch_htb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index d2922c0..51561ea 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -919,7 +919,7 @@ ok:
q->now = ktime_to_ns(ktime_get());
start_at = jiffies;
- next_event = q->now + 5 * NSEC_PER_SEC;
+ next_event = q->now + 5LLU * NSEC_PER_SEC;
for (level = 0; level < TC_HTB_MAXDEPTH; level++) {
/* common case optimization - skip event handler quickly */
--
1.7.10.4
^ permalink raw reply related
* Re: [RFC] IP_MAX_MTU value
From: Alexey Kuznetsov @ 2012-12-21 23:45 UTC (permalink / raw)
To: David Miller; +Cc: erdnetdev, netdev, rick.jones2
In-Reply-To: <20121221.115910.1536088448298730991.davem@davemloft.net>
On Fri, Dec 21, 2012 at 11:59:10AM -0800, David Miller wrote:
> From: Eric Dumazet <erdnetdev@gmail.com>
> > Changing IP_MAX_MTU from 0xFFF0 to 0x10000 seems safe [1], but I might
> > miss something really obvious ?
Maximal IP MTU is 65535 (due to iph->len).
> > It might be because in old days we reserved 16 bytes for the ethernet
> > header, and we wanted to avoid kmalloc() round-up to kmalloc-131072
> > slab ?
> >
> > If so, we certainly can limit skb->head to 32 or 64 KB and complete with
> > page fragments the remaining space.
>
> I don't think it has to do with kmalloc() at all.
>
> Maybe something strange to do with the fact that each frag has to
> be an 8-byte multiple, or something like that?
>
> Alexey choose this value back in 1998, maybe he remembers the
> reason for the strange value.
Vaguely. I do not think it was something clever.
Apparently, this was done while implementing IPv6 jumbo frames.
dev->mtu was allowed to be arbitrarily high and for use with IPv4 it had
to be clamped at some reasonable value. This value should be 65535, of course.
Here was a problem: apparently, when experimenting with jumbo frames I used
virtual device (loopback?) setting some crazy values for mtu there. And it looked
dubious that nicely aligned device mtu sort of 256K converted to perverted odd
value 65535 for IP. This value would be translated to odd value for mss and so on.
Nothing fatal, but forcing tcp segmentation to an odd value looked too weird.
So, I just decided to apply some bound which looked sane, safe and aligned. :-)
Note that values of mtu > 65280 (or something like that, which was used by some funny hippi device)
were purely of theoritical value, they were used only to experiment with jumbo frames or for stress
testing, so actual value did not have any meaning.
^ permalink raw reply
* Re: [PATCH] net: sched: integer overflow fix
From: Eric Dumazet @ 2012-12-22 0:11 UTC (permalink / raw)
To: Stefan Hasko
Cc: Jamal Hadi Salim, David S. Miller, netdev, linux-kernel,
Vimalkumar
In-Reply-To: <1356130308.21834.8014.camel@edumazet-glaptop>
On Fri, 2012-12-21 at 14:51 -0800, Eric Dumazet wrote:
> On Fri, 2012-12-21 at 21:39 +0100, Stefan Hasko wrote:
> > Fixed integer overflow in function htb_dequeue
> >
> > Signed-off-by: Stefan Hasko <hasko.stevo@gmail.com>
> > ---
> > net/sched/sch_htb.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> > index d2922c0..1bd3faa 100644
> > --- a/net/sched/sch_htb.c
> > +++ b/net/sched/sch_htb.c
> > @@ -919,7 +919,7 @@ ok:
> > q->now = ktime_to_ns(ktime_get());
> > start_at = jiffies;
> >
> > - next_event = q->now + 5 * NSEC_PER_SEC;
> > + next_event = q->now + (u32)5 * NSEC_PER_SEC;
> >
> > for (level = 0; level < TC_HTB_MAXDEPTH; level++) {
> > /* common case optimization - skip event handler quickly */
>
> But this patch is wrong !
>
Please resend your patch using something like
It will work much better.
diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index d2922c0..51561ea 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -919,7 +919,7 @@ ok:
q->now = ktime_to_ns(ktime_get());
start_at = jiffies;
- next_event = q->now + 5 * NSEC_PER_SEC;
+ next_event = q->now + 5LLU * NSEC_PER_SEC;
for (level = 0; level < TC_HTB_MAXDEPTH; level++) {
/* common case optimization - skip event handler quickly */
^ permalink raw reply related
* Re: TCP sequence number inference attack on Linux
From: Zhiyun Qian @ 2012-12-22 0:08 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1356134676.21834.8096.camel@edumazet-glaptop>
I see. If you need any help, just let me know! E.g., I can definitely
help review the patches.
-Zhiyun
On Fri, Dec 21, 2012 at 7:04 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2012-12-21 at 16:01 -0800, Eric Dumazet wrote:
>
>> I began coding a serie of patches.
>
> My goal is to make very small patches, to ease code review, and
> eventually come to a state where we have more factorized code.
>
>
>
^ permalink raw reply
* Re: TCP sequence number inference attack on Linux
From: Eric Dumazet @ 2012-12-22 0:04 UTC (permalink / raw)
To: Zhiyun Qian; +Cc: netdev
In-Reply-To: <1356134481.21834.8090.camel@edumazet-glaptop>
On Fri, 2012-12-21 at 16:01 -0800, Eric Dumazet wrote:
> I began coding a serie of patches.
My goal is to make very small patches, to ease code review, and
eventually come to a state where we have more factorized code.
^ permalink raw reply
* Re: TCP sequence number inference attack on Linux
From: Eric Dumazet @ 2012-12-22 0:01 UTC (permalink / raw)
To: Zhiyun Qian; +Cc: netdev
In-Reply-To: <CALvgte-tpvyBF608MahqLc_JFZNzg2q1FpsUY72fDDicbdsBWA@mail.gmail.com>
On Fri, 2012-12-21 at 18:52 -0500, Zhiyun Qian wrote:
> It seems that it is not straightforward to simply move tcp_ack()
> before tcp_validate_incoming() as tcp_ack() currently assumes the tcp
> sequence number is already validated and it may adjust certain states
> purely depending on the ACK number. I guess the solution is to extract
> all safety checks out and put at the very beginning. The rest of the
> code in tcp_validate_incoming() and tcp_ack() may still need to
> perform the redundant checks since if some state changes are dependent
> on the sequence number or ACK number (e.g., window update).
>
> I'm willing to help on this. Perhaps I can draft an initial patch and
> you can help take a look before I submit it?
I began coding a serie of patches.
Part of the problem comes from a 'fast path' idea that was nice 20 years
ago but no longer a real killer these days.
So some tests are duplicated, others are not correctly done.
For example, the fast path misses the more complete tests done in
tcp_ack()
Its a big mess.
^ permalink raw reply
* Re: TCP sequence number inference attack on Linux
From: Zhiyun Qian @ 2012-12-21 23:52 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1356129948.21834.8002.camel@edumazet-glaptop>
On Fri, Dec 21, 2012 at 5:45 PM, Eric Dumazet <erdnetdev@gmail.com> wrote:
> On Fri, 2012-12-21 at 14:49 -0500, Zhiyun Qian wrote:
>
>> If I am not mistaken, line 6142 in kernel v3.7.1 corresponds to
>> tcp_rcv_state_process(). According to the comments, "This function
>> implements the receiving procedure of RFC 793 for all states except
>> ESTABLISHED and TIME_WAIT." Are you referring to a different kernel
>> version?
>
> You are not mistaken, it seems code is too permissive.
>
> We should reject a frame without ACK flag while in ESTABLISHED state.
>
> Thats explicitly stated in RFC 973.
>
> Then we should make all possible safety checks before even sending a
> frame or changing socket variables.
I completely agree!
>
> (For instance the tests done in tcp_ack() should be done before calling
> tcp_validate_incoming())
>
It seems that it is not straightforward to simply move tcp_ack()
before tcp_validate_incoming() as tcp_ack() currently assumes the tcp
sequence number is already validated and it may adjust certain states
purely depending on the ACK number. I guess the solution is to extract
all safety checks out and put at the very beginning. The rest of the
code in tcp_validate_incoming() and tcp_ack() may still need to
perform the redundant checks since if some state changes are dependent
on the sequence number or ACK number (e.g., window update).
I'm willing to help on this. Perhaps I can draft an initial patch and
you can help take a look before I submit it?
> John Dykstra in commit 96e0bf4b5193d0 (tcp: Discard segments that ack
> data not yet sent) did a step into right direction, but missed this.
>
> Current code assumes the incoming frame is mostly legitimate.
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index a136925..2ea4937 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5551,7 +5551,7 @@ slow_path:
> return 0;
>
> step5:
> - if (th->ack && tcp_ack(sk, skb, FLAG_SLOWPATH) < 0)
> + if (!th->ack || tcp_ack(sk, skb, FLAG_SLOWPATH) < 0)
> goto discard;
>
> /* ts_recent update must be made after we are sure that the packet
>
>
Neat change. This should enforce the ACK flag and ACK number check for
every packet received in established state.
^ permalink raw reply
* [PATCH] Drivers: network: more __dev* removal
From: Greg Kroah-Hartman @ 2012-12-21 23:42 UTC (permalink / raw)
To: netdev; +Cc: Bill Pemberton
Remove some __dev* markings that snuck in the 3.8-rc1 merge window in
the drivers/net/* directory.
Cc: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff --git a/drivers/net/ethernet/marvell/mvmdio.c b/drivers/net/ethernet/marvell/mvmdio.c
index 6d6002b..74f1c15 100644
--- a/drivers/net/ethernet/marvell/mvmdio.c
+++ b/drivers/net/ethernet/marvell/mvmdio.c
@@ -141,7 +141,7 @@ static int orion_mdio_reset(struct mii_bus *bus)
return 0;
}
-static int __devinit orion_mdio_probe(struct platform_device *pdev)
+static int orion_mdio_probe(struct platform_device *pdev)
{
struct device_node *np = pdev->dev.of_node;
struct mii_bus *bus;
@@ -197,7 +197,7 @@ static int __devinit orion_mdio_probe(struct platform_device *pdev)
return 0;
}
-static int __devexit orion_mdio_remove(struct platform_device *pdev)
+static int orion_mdio_remove(struct platform_device *pdev)
{
struct mii_bus *bus = platform_get_drvdata(pdev);
mdiobus_unregister(bus);
@@ -214,7 +214,7 @@ MODULE_DEVICE_TABLE(of, orion_mdio_match);
static struct platform_driver orion_mdio_driver = {
.probe = orion_mdio_probe,
- .remove = __devexit_p(orion_mdio_remove),
+ .remove = orion_mdio_remove,
.driver = {
.name = "orion-mdio",
.of_match_table = orion_mdio_match,
diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 3f8086b..b6025c3 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -635,7 +635,7 @@ static void mvneta_rxq_bm_disable(struct mvneta_port *pp,
/* Sets the RGMII Enable bit (RGMIIEn) in port MAC control register */
-static void __devinit mvneta_gmac_rgmii_set(struct mvneta_port *pp, int enable)
+static void mvneta_gmac_rgmii_set(struct mvneta_port *pp, int enable)
{
u32 val;
@@ -650,7 +650,7 @@ static void __devinit mvneta_gmac_rgmii_set(struct mvneta_port *pp, int enable)
}
/* Config SGMII port */
-static void __devinit mvneta_port_sgmii_config(struct mvneta_port *pp)
+static void mvneta_port_sgmii_config(struct mvneta_port *pp)
{
u32 val;
@@ -2564,7 +2564,7 @@ const struct ethtool_ops mvneta_eth_tool_ops = {
};
/* Initialize hw */
-static int __devinit mvneta_init(struct mvneta_port *pp, int phy_addr)
+static int mvneta_init(struct mvneta_port *pp, int phy_addr)
{
int queue;
@@ -2613,9 +2613,8 @@ static void mvneta_deinit(struct mvneta_port *pp)
}
/* platform glue : initialize decoding windows */
-static void __devinit
-mvneta_conf_mbus_windows(struct mvneta_port *pp,
- const struct mbus_dram_target_info *dram)
+static void mvneta_conf_mbus_windows(struct mvneta_port *pp,
+ const struct mbus_dram_target_info *dram)
{
u32 win_enable;
u32 win_protect;
@@ -2648,7 +2647,7 @@ mvneta_conf_mbus_windows(struct mvneta_port *pp,
}
/* Power up the port */
-static void __devinit mvneta_port_power_up(struct mvneta_port *pp, int phy_mode)
+static void mvneta_port_power_up(struct mvneta_port *pp, int phy_mode)
{
u32 val;
@@ -2671,7 +2670,7 @@ static void __devinit mvneta_port_power_up(struct mvneta_port *pp, int phy_mode)
}
/* Device initialization routine */
-static int __devinit mvneta_probe(struct platform_device *pdev)
+static int mvneta_probe(struct platform_device *pdev)
{
const struct mbus_dram_target_info *dram_target_info;
struct device_node *dn = pdev->dev.of_node;
@@ -2803,7 +2802,7 @@ err_free_netdev:
}
/* Device removal routine */
-static int __devexit mvneta_remove(struct platform_device *pdev)
+static int mvneta_remove(struct platform_device *pdev)
{
struct net_device *dev = platform_get_drvdata(pdev);
struct mvneta_port *pp = netdev_priv(dev);
@@ -2828,7 +2827,7 @@ MODULE_DEVICE_TABLE(of, mvneta_match);
static struct platform_driver mvneta_driver = {
.probe = mvneta_probe,
- .remove = __devexit_p(mvneta_remove),
+ .remove = mvneta_remove,
.driver = {
.name = MVNETA_DRIVER_NAME,
.of_match_table = mvneta_match,
diff --git a/drivers/net/wireless/rtlwifi/rtl8723ae/sw.c b/drivers/net/wireless/rtlwifi/rtl8723ae/sw.c
index 18b0bc5..bb7cc90 100644
--- a/drivers/net/wireless/rtlwifi/rtl8723ae/sw.c
+++ b/drivers/net/wireless/rtlwifi/rtl8723ae/sw.c
@@ -341,7 +341,7 @@ static struct rtl_hal_cfg rtl8723ae_hal_cfg = {
.maps[RTL_RC_HT_RATEMCS15] = DESC92_RATEMCS15,
};
-static struct pci_device_id rtl8723ae_pci_ids[] __devinitdata = {
+static struct pci_device_id rtl8723ae_pci_ids[] = {
{RTL_PCI_DEVICE(PCI_VENDOR_ID_REALTEK, 0x8723, rtl8723ae_hal_cfg)},
{},
};
^ permalink raw reply related
* [PATCH] CONFIG_HOTPLUG removal from networking core
From: Greg Kroah-Hartman @ 2012-12-21 23:44 UTC (permalink / raw)
To: netdev; +Cc: Bill Pemberton
CONFIG_HOTPLUG is always enabled now, so remove the unused code that was
trying to be compiled out when this option was disabled, in the
networking core.
Cc: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 334efd5..28c5f5a 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1334,7 +1334,6 @@ struct kobj_ns_type_operations net_ns_type_operations = {
};
EXPORT_SYMBOL_GPL(net_ns_type_operations);
-#ifdef CONFIG_HOTPLUG
static int netdev_uevent(struct device *d, struct kobj_uevent_env *env)
{
struct net_device *dev = to_net_dev(d);
@@ -1353,7 +1352,6 @@ static int netdev_uevent(struct device *d, struct kobj_uevent_env *env)
exit:
return retval;
}
-#endif
/*
* netdev_release -- destroy and free a dead device.
@@ -1382,9 +1380,7 @@ static struct class net_class = {
#ifdef CONFIG_SYSFS
.dev_attrs = net_class_attributes,
#endif /* CONFIG_SYSFS */
-#ifdef CONFIG_HOTPLUG
.dev_uevent = netdev_uevent,
-#endif
.ns_type = &net_ns_type_operations,
.namespace = net_namespace,
};
diff --git a/net/wireless/reg.c b/net/wireless/reg.c
index 6e53089..82c4fc7 100644
--- a/net/wireless/reg.c
+++ b/net/wireless/reg.c
@@ -2365,7 +2365,6 @@ int set_regdom(const struct ieee80211_regdomain *rd)
return r;
}
-#ifdef CONFIG_HOTPLUG
int reg_device_uevent(struct device *dev, struct kobj_uevent_env *env)
{
if (last_request && !last_request->processed) {
@@ -2377,12 +2376,6 @@ int reg_device_uevent(struct device *dev, struct kobj_uevent_env *env)
return 0;
}
-#else
-int reg_device_uevent(struct device *dev, struct kobj_uevent_env *env)
-{
- return -ENODEV;
-}
-#endif /* CONFIG_HOTPLUG */
void wiphy_regulatory_register(struct wiphy *wiphy)
{
diff --git a/net/wireless/sysfs.c b/net/wireless/sysfs.c
index 9bf6d5e..1f6f01e 100644
--- a/net/wireless/sysfs.c
+++ b/net/wireless/sysfs.c
@@ -77,13 +77,11 @@ static void wiphy_dev_release(struct device *dev)
cfg80211_dev_free(rdev);
}
-#ifdef CONFIG_HOTPLUG
static int wiphy_uevent(struct device *dev, struct kobj_uevent_env *env)
{
/* TODO, we probably need stuff here */
return 0;
}
-#endif
static int wiphy_suspend(struct device *dev, pm_message_t state)
{
@@ -134,9 +132,7 @@ struct class ieee80211_class = {
.owner = THIS_MODULE,
.dev_release = wiphy_dev_release,
.dev_attrs = ieee80211_dev_attrs,
-#ifdef CONFIG_HOTPLUG
.dev_uevent = wiphy_uevent,
-#endif
.suspend = wiphy_suspend,
.resume = wiphy_resume,
.ns_type = &net_ns_type_operations,
^ permalink raw reply related
* Re: IPv6 over Firewire
From: Stefan Richter @ 2012-12-21 23:12 UTC (permalink / raw)
To: YOSHIFUJI Hideaki, stephan.gatzka; +Cc: netdev, linux1394-devel
In-Reply-To: <50D4BD2F.7060006@linux-ipv6.org>
On Dec 22 YOSHIFUJI Hideaki wrote:
> Stephan Gatzka wrote:
> >
> >> If you are talking about how to build NS/NA/RS/Redirect messages, you
> >> can just use ndisc_opt_addr_space() and ndisc_fill_addr_option() here.
> >
> > Thanks, these functions are certainly helpful. But ndisc_opt_addr_space() calculates the required space from dev->addr_len and ndisc_addr_option_pad(dev->type). The latter is 0 for IEEE1394 (firewire). So the required option space just comes from dev->addr_len, which is 8 for firewire, resulting in an option address space of 16 (2 octets).
> >
> > But rfc3146 requires an option address space of 3 octets. So my main question is if in such a situation the best is to reserve additional skb tail room using needed_tailroom in struct netdevice. This directly affects the memory allocated in ndisc_build_skb().
>
> Something like this:
>
> static inline int ndisc_opt_addr_space(struct net_device *dev)
> {
> - return NDISC_OPT_SPACE(dev->addr_len + ndisc_addr_option_pad(dev->type));
> + switch (dev->type) {
> + case ARPHRD_IEEE1394:
> + return sizeof(struct ndisc_opt_ieee1394_llinfo);
> + default:
> + return NDISC_OPT_SPACE(dev->addr_len + ndisc_addr_option_pad(dev->type));
> + }
> }
Can't we increase dev->addr_len for RFC 3146 interfaces?
Can't we add another dev->type besides ARPHRD_IEEE1394 (RFC 2734)?
Is a single dev instance transporting both IPv4 and IPv6 or will there be
separate instances for those?
--
Stefan Richter
-=====-===-- ==-- =-==-
http://arcgraph.de/sr/
^ permalink raw reply
* Re: [PATCH] net: sched: integer overflow fix
From: Eric Dumazet @ 2012-12-21 22:51 UTC (permalink / raw)
To: Stefan Hasko
Cc: Jamal Hadi Salim, David S. Miller, netdev, linux-kernel,
Vimalkumar
In-Reply-To: <1356122364-1418-1-git-send-email-hasko.stevo@gmail.com>
On Fri, 2012-12-21 at 21:39 +0100, Stefan Hasko wrote:
> Fixed integer overflow in function htb_dequeue
>
> Signed-off-by: Stefan Hasko <hasko.stevo@gmail.com>
> ---
> net/sched/sch_htb.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> index d2922c0..1bd3faa 100644
> --- a/net/sched/sch_htb.c
> +++ b/net/sched/sch_htb.c
> @@ -919,7 +919,7 @@ ok:
> q->now = ktime_to_ns(ktime_get());
> start_at = jiffies;
>
> - next_event = q->now + 5 * NSEC_PER_SEC;
> + next_event = q->now + (u32)5 * NSEC_PER_SEC;
>
> for (level = 0; level < TC_HTB_MAXDEPTH; level++) {
> /* common case optimization - skip event handler quickly */
But this patch is wrong !
I am a bit surprised, as I remember spotting this error in one patch
submission from Vimalkumar.
Apparently it got lost.
Please fix the bug for good, not adding another one.
Thanks
^ permalink raw reply
* Re: TCP sequence number inference attack on Linux
From: Eric Dumazet @ 2012-12-21 22:45 UTC (permalink / raw)
To: Zhiyun Qian; +Cc: netdev
In-Reply-To: <CALvgte9WJRR95B5WCTavF0w3Kvf+MksZtyY4OQ=ioAe71BasCw@mail.gmail.com>
On Fri, 2012-12-21 at 14:49 -0500, Zhiyun Qian wrote:
> If I am not mistaken, line 6142 in kernel v3.7.1 corresponds to
> tcp_rcv_state_process(). According to the comments, "This function
> implements the receiving procedure of RFC 793 for all states except
> ESTABLISHED and TIME_WAIT." Are you referring to a different kernel
> version?
You are not mistaken, it seems code is too permissive.
We should reject a frame without ACK flag while in ESTABLISHED state.
Thats explicitly stated in RFC 973.
Then we should make all possible safety checks before even sending a
frame or changing socket variables.
(For instance the tests done in tcp_ack() should be done before calling
tcp_validate_incoming())
John Dykstra in commit 96e0bf4b5193d0 (tcp: Discard segments that ack
data not yet sent) did a step into right direction, but missed this.
Current code assumes the incoming frame is mostly legitimate.
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a136925..2ea4937 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5551,7 +5551,7 @@ slow_path:
return 0;
step5:
- if (th->ack && tcp_ack(sk, skb, FLAG_SLOWPATH) < 0)
+ if (!th->ack || tcp_ack(sk, skb, FLAG_SLOWPATH) < 0)
goto discard;
/* ts_recent update must be made after we are sure that the packet
^ permalink raw reply related
* Is keepalive behaving as expected in 3.7.0+/net-next?
From: Rick Jones @ 2012-12-21 22:05 UTC (permalink / raw)
To: netdev
I was looking to do a bit more documentation clean-up and thought I
would work on the descriptions of the "keepalive" sysctls, but first I
wanted to see if they behaved as the existing descriptions suggested:
> tcp_keepalive_time - INTEGER
> How often TCP sends out keepalive messages when keepalive is enabled.
> Default: 2hours.
>
> tcp_keepalive_probes - INTEGER
> How many keepalive probes TCP sends out, until it decides that the
> connection is broken. Default value: 9.
>
> tcp_keepalive_intvl - INTEGER
> How frequently the probes are send out. Multiplied by
> tcp_keepalive_probes it is time to kill not responding connection,
> after probes started. Default value: 75sec i.e. connection
> will be aborted after ~11 minutes of retries.
I interpreted all that that as: When a connection is idle, TCP will
send a keepalive probe every tcp_keepalive_time seconds. If a response
to a keepalive probe is not received, TCP will resend (retransmit) it
every tcp_keepalive_intvl seconds.
However, what I see is that on a connection where the remote is indeed
still there, only the first keepalive probe is sent after
tcp_keepalive_time, and thereafter it is sent every tcp_keepalive_intvl
seconds.
Now, some of this may relate to my being impatient - rather than wait
two hours for the first probe, I set tcp_keepalive_time to 3 seconds,
and tcp_keepalive_intvl to 7 seconds. I then kicked-off a ./configure
--intervals-enable netperf TCP_RR test with a burst of one and a wait
time of 90 seconds and got the following (trimmed) trace:
13:43:46.879133 IP netnextraj.43054 > netnextraj2.srvr: Flags [S], seq
807869796, win 14600, options [mss 1460,sackOK,TS val 133470 ecr
0,nop,wscale 7], length 0
13:43:46.880091 IP netnextraj2.srvr > netnextraj.43054: Flags [S.], seq
1522345902, ack 807869797, win 14480, options [mss 1460,sackOK,TS val
136186 ecr 133470,nop,wscale 4], length 0
13:43:46.880114 IP netnextraj.43054 > netnextraj2.srvr: Flags [.], ack
1, win 115, options [nop,nop,TS val 133470 ecr 136186], length 0
13:43:46.880306 IP netnextraj.43054 > netnextraj2.srvr: Flags [P.], seq
1:11, ack 1, win 115, options [nop,nop,TS val 133470 ecr 136186], length 10
13:43:46.880948 IP netnextraj2.srvr > netnextraj.43054: Flags [.], ack
11, win 905, options [nop,nop,TS val 136187 ecr 133470], length 0
13:43:46.880964 IP netnextraj2.srvr > netnextraj.43054: Flags [P.], seq
1:11, ack 11, win 905, options [nop,nop,TS val 136187 ecr 133470], length 10
13:43:46.881161 IP netnextraj.43054 > netnextraj2.srvr: Flags [.], ack
11, win 115, options [nop,nop,TS val 133470 ecr 136187], length 0
The first probe above comes after 3 seconds - tcp_keepalive_time - at
13:43:49
13:43:49.886752 IP netnextraj.43054 > netnextraj2.srvr: Flags [.], ack
11, win 115, options [nop,nop,TS val 134222 ecr 136187], length 0
And it does seem to elicit a response:
13:43:49.887530 IP netnextraj2.srvr > netnextraj.43054: Flags [.], ack
11, win 905, options [nop,nop,TS val 136938 ecr 133470], length 0
Now it starts sending probes every 7 seconds (tcp_keepalive_intvl):
13:43:56.903576 IP netnextraj.43054 > netnextraj2.srvr: Flags [.], ack
11, win 115, options [nop,nop,TS val 135976 ecr 136938], length 0
13:43:56.904480 IP netnextraj2.srvr > netnextraj.43054: Flags [.], ack
11, win 905, options [nop,nop,TS val 138693 ecr 133470], length 0
13:44:03.910744 IP netnextraj.43054 > netnextraj2.srvr: Flags [.], ack
11, win 115, options [nop,nop,TS val 137728 ecr 138693], length 0
13:44:03.911623 IP netnextraj2.srvr > netnextraj.43054: Flags [.], ack
11, win 905, options [nop,nop,TS val 140444 ecr 133470], length 0
I;ve deleted the next 9 or so probes... It continues, and doesn't
terminate the connection, so I assume it was happy with the responses to
the probes.
13:45:13.990746 IP netnextraj.43054 > netnextraj2.srvr: Flags [.], ack
11, win 115, options [nop,nop,TS val 155248 ecr 156213], length 0
13:45:13.991578 IP netnextraj2.srvr > netnextraj.43054: Flags [.], ack
11, win 905, options [nop,nop,TS val 157965 ecr 133470], length 0
Now the next netperf transaction happens:
13:45:16.879222 IP netnextraj.43054 > netnextraj2.srvr: Flags [P.], seq
11:21, ack 11, win 115, options [nop,nop,TS val 155970 ecr 157965],
length 10
13:45:16.880033 IP netnextraj2.srvr > netnextraj.43054: Flags [P.], seq
11:21, ack 21, win 905, options [nop,nop,TS val 158687 ecr 155970],
length 10
13:45:16.880220 IP netnextraj.43054 > netnextraj2.srvr: Flags [.], ack
21, win 115, options [nop,nop,TS val 155970 ecr 158687], length 0
But the next keepalive probe is tcp_keepalive_intvl seconds after the
last one, rather than that many, or tcp_keepalive_time seconds after the
connection was last "active."
13:45:20.998739 IP netnextraj.43054 > netnextraj2.srvr: Flags [.], ack
21, win 115, options [nop,nop,TS val 157000 ecr 158687], length 0
13:45:20.999754 IP netnextraj2.srvr > netnextraj.43054: Flags [.], ack
21, win 905, options [nop,nop,TS val 159717 ecr 155970], length 0
13:45:28.006747 IP netnextraj.43054 > netnextraj2.srvr: Flags [.], ack
21, win 115, options [nop,nop,TS val 158752 ecr 159717], length 0
13:45:28.007624 IP netnextraj2.srvr > netnextraj.43054: Flags [.], ack
21, win 905, options [nop,nop,TS val 161469 ecr 155970], length 0
Is this the expected behaviour? If I reverse the values - make
tcp_keepalive_time 7 and tcp_keepalive_intvl 3, it seems that all the
probes are after 7 seconds.
rick jones
^ permalink raw reply
* Re: [PATCH] bridge: call br_netpoll_disable in br_add_if
From: David Miller @ 2012-12-21 21:17 UTC (permalink / raw)
To: amwang; +Cc: gaofeng, netdev, shemminger
In-Reply-To: <1355999613.25310.11.camel@cr0>
From: Cong Wang <amwang@redhat.com>
Date: Thu, 20 Dec 2012 18:33:33 +0800
> On Thu, 2012-12-20 at 17:41 +0800, Gao feng wrote:
>> When netdev_set_master faild in br_add_if, we should
>> call br_netpoll_disable to do some cleanup jobs,such
>> as free the memory of struct netpoll which allocated
>> in br_netpoll_enable.
>>
>> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
>
> Looks good!
>
> Acked-by: Cong Wang <amwang@redhat.com>
Applied.
^ permalink raw reply
* Re: [patch] solos-pci: double lock in geos_gpio_store()
From: David Miller @ 2012-12-21 21:16 UTC (permalink / raw)
To: david.woodhouse
Cc: dan.carpenter, chas, nathan, linux-atm-general, netdev, kbuild
In-Reply-To: <1355995348.18919.104.camel@shinybook.infradead.org>
From: "Woodhouse, David" <david.woodhouse@intel.com>
Date: Thu, 20 Dec 2012 09:22:26 +0000
> On Thu, 2012-12-20 at 10:48 +0300, Dan Carpenter wrote:
>> There is a typo here so we do a double lock instead of an unlock.
>>
>> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
>> ---
>> Only needed in linux-next. Introduced in f9baad02e7411d9 [14/17]
>> solos-pci: add GPIO support for newer versions on Geos board
>
> Ah, crap. That was my fault; it was part of my "improvement" to Nathan's
> patch. Sorry.
>
> Acked-by: David Woodhouse <David.Woodhouse@intel.com>
Applied.
^ permalink raw reply
* Re: [PATCH] ipv4: arp: fix a lockdep splat in arp_solicit()
From: David Miller @ 2012-12-21 21:16 UTC (permalink / raw)
To: erdnetdev; +Cc: netdev, yanb, shemminger
In-Reply-To: <1356111130.21834.7565.camel@edumazet-glaptop>
From: Eric Dumazet <erdnetdev@gmail.com>
Date: Fri, 21 Dec 2012 09:32:10 -0800
> From: Eric Dumazet <edumazet@google.com>
>
> Yan Burman reported following lockdep warning :
...
> Bug is from arp_solicit(), releasing the neigh lock after arp_send()
> In case of vxlan, we eventually need to write lock a neigh lock later.
>
> Its a false positive, but we can get rid of it without lockdep
> annotations.
>
> We can instead use neigh_ha_snapshot() helper.
>
> Reported-by: Yan Burman <yanb@mellanox.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Applied.
^ permalink raw reply
* Re: TUN problems (regression?)
From: David Miller @ 2012-12-21 21:15 UTC (permalink / raw)
To: jasowang; +Cc: shemminger, eric.dumazet, pmoore, netdev
In-Reply-To: <50D3D85B.1070605@redhat.com>
From: Jason Wang <jasowang@redhat.com>
Date: Fri, 21 Dec 2012 11:32:43 +0800
> On 12/21/2012 07:50 AM, Stephen Hemminger wrote:
>> On Thu, 20 Dec 2012 15:38:17 -0800
>> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>> On Thu, 2012-12-20 at 18:16 -0500, Paul Moore wrote:
>>>> [CC'ing netdev in case this is a known problem I just missed ...]
>>>>
>>>> Hi Jason,
>>>>
>>>> I started doing some more testing with the multiqueue TUN changes and I ran
>>>> into a problem when running tunctl: running it once w/o arguments works as
>>>> expected, but running it a second time results in failure and a
>>>> kmem_cache_sanity_check() failure. The problem appears to be very repeatable
>>>> on my test VM and happens independent of the LSM/SELinux fixup patches.
>>>>
>>>> Have you seen this before?
>>>>
>>> Obviously code in tun_flow_init() is wrong...
>>>
>>> static int tun_flow_init(struct tun_struct *tun)
>>> {
>>> int i;
>>>
>>> tun->flow_cache = kmem_cache_create("tun_flow_cache",
>>> sizeof(struct tun_flow_entry), 0, 0,
>>> NULL);
>>> if (!tun->flow_cache)
>>> return -ENOMEM;
>>> ...
>>> }
>>>
>>>
>>> I have no idea why we would need a kmem_cache per tun_struct,
>>> and why we even need a kmem_cache.
>> Normally flow malloc/free should be good enough.
>> It might make sense to use private kmem_cache if doing hlist_nulls.
>>
>>
>> Acked-by: Stephen Hemminger <shemminger@vyatta.com>
>
> Should be at least a global cache, I thought I can get some speed-up by
> using kmem_cache.
>
> Acked-by: Jason Wang <jasowang@redhat.com>
Applied.
^ permalink raw reply
* Re: [PATCH] ip_gre: fix possible use after free
From: David Miller @ 2012-12-21 21:15 UTC (permalink / raw)
To: erdnetdev; +Cc: netdev, yamahata
In-Reply-To: <1356055227.21834.4097.camel@edumazet-glaptop>
From: Eric Dumazet <erdnetdev@gmail.com>
Date: Thu, 20 Dec 2012 18:00:27 -0800
> From: Eric Dumazet <edumazet@google.com>
>
> Once skb_realloc_headroom() is called, tiph might point to freed memory.
>
> Cache tiph->ttl value before the reallocation, to avoid unexpected
> behavior.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Isaku Yamahata <yamahata@valinux.co.jp>
Applied.
^ permalink raw reply
* Re: [PATCH] ipv4/ip_gre: make ipgre_tunnel_xmit() not parse network header as IP unconditionally
From: David Miller @ 2012-12-21 21:15 UTC (permalink / raw)
To: eric.dumazet; +Cc: yamahata, netdev
In-Reply-To: <1356054521.21834.4043.camel@edumazet-glaptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 20 Dec 2012 17:48:41 -0800
> On Fri, 2012-12-21 at 10:12 +0900, Isaku Yamahata wrote:
>> ipgre_tunnel_xmit() parses network header as IP unconditionally.
>> But transmitting packets are not always IP packet. For example such packet
>> can be sent by packet socket with sockaddr_ll.sll_protocol set.
>> So make the function check if skb->protocol is IP.
>>
>> Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
>> ---
>> net/ipv4/ip_gre.c | 5 ++++-
>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
>> index a85ae2f..8fcf0ed 100644
>> --- a/net/ipv4/ip_gre.c
>> +++ b/net/ipv4/ip_gre.c
>> @@ -760,7 +760,10 @@ static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device *dev
>>
>> if (dev->header_ops && dev->type == ARPHRD_IPGRE) {
>> gre_hlen = 0;
>> - tiph = (const struct iphdr *)skb->data;
>> + if (skb->protocol == htons(ETH_P_IP))
>> + tiph = (const struct iphdr *)skb->data;
>> + else
>> + tiph = &tunnel->parms.iph;
>> } else {
>> gre_hlen = tunnel->hlen;
>> tiph = &tunnel->parms.iph;
>
> Seems good to me thanks !
>
> Acked-by: Eric Dumazet <edumazet@google.com>
Applied.
^ permalink raw reply
* Re: sock_ioctl sleeping while atomic warning during boot.
From: David Miller @ 2012-12-21 21:15 UTC (permalink / raw)
To: brian.haley; +Cc: eric.dumazet, davej, netdev
In-Reply-To: <50D3D7F7.2030200@hp.com>
From: Brian Haley <brian.haley@hp.com>
Date: Thu, 20 Dec 2012 22:31:03 -0500
> On 12/20/2012 10:25 PM, Eric Dumazet wrote:
>> OK, thanks for the report.
>>
>> We need a seqcount, not a seqlock, as RTNL already protects multiple
>> writers.
>>
>> Please try following fix :
>>
>>
>> [PATCH] net: devnet_rename_seq should be a seqcount
>>
>> Using a seqlock for devnet_rename_seq is not a good idea,
>> as device_rename() can sleep.
>>
>> As we hold RTNL, we dont need a protection for writers,
>> and only need a seqcount so that readers can catch a change done
>> by a writer.
>>
>> Bug added in commit c91f6df2db4972d3 (sockopt: Change getsockopt() of
>> SO_BINDTODEVICE to return an interface name)
>>
>> Reported-by: Dave Jones <davej@redhat.com>
>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>> Cc: Brian Haley <brian.haley@hp.com>
>
> Sorry about that, thanks for the quick fix Eric.
Applied.
^ permalink raw reply
* [PATCH] net: sched: integer overflow fix
From: Stefan Hasko @ 2012-12-21 20:39 UTC (permalink / raw)
To: Jamal Hadi Salim, David S. Miller, netdev; +Cc: linux-kernel, Stefan Hasko
Fixed integer overflow in function htb_dequeue
Signed-off-by: Stefan Hasko <hasko.stevo@gmail.com>
---
net/sched/sch_htb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index d2922c0..1bd3faa 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -919,7 +919,7 @@ ok:
q->now = ktime_to_ns(ktime_get());
start_at = jiffies;
- next_event = q->now + 5 * NSEC_PER_SEC;
+ next_event = q->now + (u32)5 * NSEC_PER_SEC;
for (level = 0; level < TC_HTB_MAXDEPTH; level++) {
/* common case optimization - skip event handler quickly */
--
1.7.10.4
^ permalink raw reply related
* Re: [RFC] IP_MAX_MTU value
From: David Miller @ 2012-12-21 19:59 UTC (permalink / raw)
To: erdnetdev; +Cc: netdev, rick.jones2, kuznet
In-Reply-To: <1356072468.21834.4805.camel@edumazet-glaptop>
From: Eric Dumazet <erdnetdev@gmail.com>
Date: Thu, 20 Dec 2012 22:47:48 -0800
> We have the following definition in net/ipv4/route.c
>
> #define IP_MAX_MTU 0xFFF0
>
> This means that "netperf -t UDP_STREAM", using UDP messages of 65507
> bytes, are fragmented on loopback interface (while its MTU is now 65536
> and should allow those UDP messages being sent without fragments)
>
> I guess Rick chose 65507 bytes in netperf because it was related to the
> max IPv4 datagram length :
>
> 65507 + 28 = 65535
>
> Changing IP_MAX_MTU from 0xFFF0 to 0x10000 seems safe [1], but I might
> miss something really obvious ?
>
> It might be because in old days we reserved 16 bytes for the ethernet
> header, and we wanted to avoid kmalloc() round-up to kmalloc-131072
> slab ?
>
> If so, we certainly can limit skb->head to 32 or 64 KB and complete with
> page fragments the remaining space.
I don't think it has to do with kmalloc() at all.
Maybe something strange to do with the fact that each frag has to
be an 8-byte multiple, or something like that?
Alexey choose this value back in 1998, maybe he remembers the
reason for the strange value.
Alexey?
^ permalink raw reply
* Re: TCP sequence number inference attack on Linux
From: Zhiyun Qian @ 2012-12-21 19:49 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1356118052.21834.7793.camel@edumazet-glaptop>
On Fri, Dec 21, 2012 at 2:27 PM, Eric Dumazet <erdnetdev@gmail.com> wrote:
> On Fri, 2012-12-21 at 14:10 -0500, Zhiyun Qian wrote:
>> That's good to know. However, implementing RFC 5961 alone is not
>> sufficient. Like I said, since DelayedAckLost counter is incremented
>> purely upon looking at the sequence number, regardless of the ACK
>> number. An attacker thus can still infer the sequence number based on
>> DelayedAckLost counter without knowing the right ACK number.
>>
>
>
>
>> The next question is how can the attacker eventually know the right
>> ACK number in order to inject real data. It turns out that this is not
>> hard either. First, based on the current Linux TCP stack, it accepts
>> incoming packets without ACK flag.
>
> I dont really think so.
>
> We must discard frame is th->ack is not set. (Step 5, line 6142)
>
If I am not mistaken, line 6142 in kernel v3.7.1 corresponds to
tcp_rcv_state_process(). According to the comments, "This function
implements the receiving procedure of RFC 793 for all states except
ESTABLISHED and TIME_WAIT." Are you referring to a different kernel
version?
>
>
>> Further, if ACK flag is not set,
>> ACK number will not be checked at all. See code in
>> net/ipv4/tcp_input.c, function tcp_rcv_established()
>>
>> 5547 if (th->ack && tcp_ack(sk, skb, FLAG_SLOWPATH) < 0)
>> 5548 goto discard;
>>
>> Second, even if ACK number is always checked before accepting the
>> payload, it is still possible to infer the ACK number much like how
>> sequence number can be inferred. The details is described in Section
>> 3.4 of my paper, paragraph starting with "Client-side sequence number
>> inference".
>>
>> I'm looking at the latest kernel v3.7.1 right now. I believe the
>> problem do still exist in today's Linux.
>>
>
> It seems you know pretty well this code, I wonder why you dont send
> patches to fix the bugs...
>
> Its not like it has to be buggy forever.
>
I have never submitted any patch before...I would do it if no one else
wants to :)
>
>
^ permalink raw reply
* Re: IPv6 over Firewire
From: YOSHIFUJI Hideaki @ 2012-12-21 19:49 UTC (permalink / raw)
To: stephan.gatzka; +Cc: netdev, linux1394-devel, YOSHIFUJI Hideaki
In-Reply-To: <50D4ACFA.6040901@gmail.com>
Stephan Gatzka wrote:
>
>> If you are talking about how to build NS/NA/RS/Redirect messages, you
>> can just use ndisc_opt_addr_space() and ndisc_fill_addr_option() here.
>
> Thanks, these functions are certainly helpful. But ndisc_opt_addr_space() calculates the required space from dev->addr_len and ndisc_addr_option_pad(dev->type). The latter is 0 for IEEE1394 (firewire). So the required option space just comes from dev->addr_len, which is 8 for firewire, resulting in an option address space of 16 (2 octets).
>
> But rfc3146 requires an option address space of 3 octets. So my main question is if in such a situation the best is to reserve additional skb tail room using needed_tailroom in struct netdevice. This directly affects the memory allocated in ndisc_build_skb().
Something like this:
static inline int ndisc_opt_addr_space(struct net_device *dev)
{
- return NDISC_OPT_SPACE(dev->addr_len + ndisc_addr_option_pad(dev->type));
+ switch (dev->type) {
+ case ARPHRD_IEEE1394:
+ return sizeof(struct ndisc_opt_ieee1394_llinfo);
+ default:
+ return NDISC_OPT_SPACE(dev->addr_len + ndisc_addr_option_pad(dev->type));
+ }
}
--yoshfuji
^ permalink raw reply
* Re: TCP sequence number inference attack on Linux
From: Eric Dumazet @ 2012-12-21 19:30 UTC (permalink / raw)
To: Zhiyun Qian; +Cc: netdev
In-Reply-To: <CALvgte-=v=bRnQ2HPDxyqxBOjD6DXp_BUySW6Hm7h+81M+RYMA@mail.gmail.com>
On Fri, 2012-12-21 at 14:13 -0500, Zhiyun Qian wrote:
> That seems like a good idea. I am not sure how it is implemented
> though. Is it a new feature of Linux? Would you mind sending some
> pointers for this?
I dont have details, I only remember having to fix a memory allocation
error that was hitting Chrome users.
http://comments.gmane.org/gmane.linux.network/249535
Julien said at that time :
<quote>
It happens when users start Chrome. Chrome will create one new network
NS (for the sandbox).
This has been used for a few years now, but we had our first report in
January of this year and we've been getting a few reports very
recently at a rate that is starting to worry me (crbug.com/110756).
Thanks a lot for helping with this!
Julien
</quote>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox