* Re: [PATCH] ipv6: addrconf: clear IPv6 addresses and routes when losing link
From: David Miller @ 2010-10-27 15:51 UTC
To: lorenzo; +Cc: brian.haley, shemminger, netdev
In-Reply-To: <AANLkTik2Q=fF9j6eiBQx33e2JU7EG3Sc5hnXyDH24+mt@mail.gmail.com>
From: Lorenzo Colitti <lorenzo@google.com>
Date: Tue, 26 Oct 2010 15:53:50 -0700
> As the patch stands, they don't. Only autoconfigured addresses will be
> cleared, because addrconf_ifdown() does not remove any addresses that
> are permanent (unless they are link-local, in which case they are
> recreated as soon as link comes back).
Ok, and that brings us back to the issue of losing a TCP connection
over a link-local et al. address during a minor link flap.
I think some financial services people will really dislike that
behavior :)
* Re: [PATCH] af_packet: account for VLAN when checking packet size
From: David Miller @ 2010-10-27 15:48 UTC
To: horms; +Cc: mst, eric.dumazet, netdev, johann.baudy
In-Reply-To: <20101022084052.GA2118@verge.net.au>
From: Simon Horman <horms@verge.net.au>
Date: Fri, 22 Oct 2010 10:41:26 +0200
> Incidentally, I believe that this problem will only become more acute
> and complex if support for 802.1ad (Provider Bridges, aka Q-in-Q),
> 802.1ah (Provider Backbone Bridges, aka MAC-in-MAC) or other standards
> which further extend the maximum frame size.
No doubt.
> Dave, you were mentioning to me the other day that the kernel
> already supports some notion of Q-in-Q (though it's not 802.1ad).
> Does the current implementation allow for frames > 1504 bytes?
It's only going to hardware offload and allow the extra space
for the outer-most VLAN tag. Everything inside of the outer
tag will be handled in software as far as Linux is concerned.
> Is that a complication to the change proposed here?
For now, I don't think so.
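
For reference, the 1504-byte figure in the question above is the usual
1500-byte Ethernet payload plus one 4-byte 802.1Q tag, and each further
stacked tag adds another four bytes. A minimal sketch of that arithmetic
follows. The helper and constant names are made up for illustration and are
not kernel symbols; the Ethernet header and FCS are not counted.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants, not taken from the patch under discussion. */
#define EX_ETH_DATA_LEN 1500 /* standard Ethernet payload (MTU) */
#define EX_VLAN_TAG_LEN 4    /* one 802.1Q tag: 2-byte TPID + 2-byte TCI */

/* Bytes that follow the Ethernet header for a full-MTU frame that
 * carries `ntags` stacked VLAN tags. */
size_t ex_tagged_len(size_t mtu, unsigned int ntags)
{
	return mtu + (size_t)ntags * EX_VLAN_TAG_LEN;
}

/* The two cases mentioned in the thread. */
void ex_check_thread_values(void)
{
	assert(ex_tagged_len(EX_ETH_DATA_LEN, 1) == 1504); /* single tag */
	assert(ex_tagged_len(EX_ETH_DATA_LEN, 2) == 1508); /* two stacked tags */
}
```

Any size check that only allows for one tag would therefore be four bytes
short for every extra level of tagging handled in software.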
* Re: [PATCH net-next-2.6 1/2] be2net: Adding an option to use INTx instead of MSI-X
From: David Miller @ 2010-10-27 15:46 UTC
To: michael; +Cc: bhutchings, somnath.kotur, netdev, linux-pci
In-Reply-To: <1288135235.16778.14.camel@concordia>
From: Michael Ellerman <michael@ellerman.id.au>
Date: Wed, 27 Oct 2010 10:20:35 +1100
> On Tue, 2010-10-26 at 14:32 +0100, Ben Hutchings wrote:
>> Michael Ellerman wrote:
>> > On Mon, 2010-10-25 at 16:25 -0700, David Miller wrote:
>> > > From: Ben Hutchings <bhutchings@solarflare.com>
>> > > Date: Mon, 25 Oct 2010 23:38:53 +0100
>
>> > Ethtool would be nice, but only for network drivers. Is there a generic
>> > solution, quirks are obviously not keeping people happy.
>>
>> Since this is (normally) a property of the system, pci=nomsi is the
>> generic solution.
>
> Sort of, it's a big hammer. Did all these driver writers not know about
> pci=nomsi or did they prefer to add a parameter to their driver for some
> reason?
Every time I've actually done the work to try and track down the
true issue, it always turned out to be a PCI chipset problem rather
than a device specific issue.
* Re: [PATCH net-next-2.6 1/2] be2net: Adding an option to use INTx instead of MSI-X
From: David Miller @ 2010-10-27 15:45 UTC
To: michael; +Cc: bhutchings, somnath.kotur, netdev, linux-pci
In-Reply-To: <1288075928.6578.185.camel@concordia>
From: Michael Ellerman <michael@ellerman.id.au>
Date: Tue, 26 Oct 2010 17:52:08 +1100
> That horse has really really bolted, it's gawn.
>
> I count 26 drivers with "disable MSI/X" parameters. Some even have more
> than one.
>
> 11 of them are network drivers, 9 scsi, 3 ata.
>
> I agree it's a mess for users, but it's probably preferable to a
> non-working driver.
Stupid inappropriate things being in the tree doesn't mean I need
to accept more of them.
* [PATCH] tunnels: Fix tunnels change rcu protection
From: Pavel Emelyanov @ 2010-10-27 15:43 UTC
To: David Miller; +Cc: Eric Dumazet, Linux Netdev List
After RCU protection was added for tunnels (ipip, gre, sit and ip6), a bug
was introduced into the SIOCCHGTUNNEL code.
The tunnel is first unlinked, then its addresses are changed, then it is
linked back, probably into another bucket. But while the parms are being
changed, the hash table is not locked against readers, so they can look up
the wrong tunnel.
Respective commits are b7285b79 (ipip: get rid of ipip_lock), 1507850b
(gre: get rid of ipgre_lock), 3a43be3c (sit: get rid of ipip6_lock) and
94767632 (ip6tnl: get rid of ip6_tnl_lock).
The quick fix is to wait for a quiescent state to pass after unlinking.
If that is inappropriate I can come up with something better, just let me
know.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index d0ffcbe..01087e0 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -1072,6 +1072,7 @@ ipgre_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
break;
}
ipgre_tunnel_unlink(ign, t);
+ synchronize_net();
t->parms.iph.saddr = p.iph.saddr;
t->parms.iph.daddr = p.iph.daddr;
t->parms.i_key = p.i_key;
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index e9b816e..cd300aa 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -676,6 +676,7 @@ ipip_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
}
t = netdev_priv(dev);
ipip_tunnel_unlink(ipn, t);
+ synchronize_net();
t->parms.iph.saddr = p.iph.saddr;
t->parms.iph.daddr = p.iph.daddr;
memcpy(dev->dev_addr, &p.iph.saddr, 4);
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 38b9a56..2a59610 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1284,6 +1284,7 @@ ip6_tnl_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
t = netdev_priv(dev);
ip6_tnl_unlink(ip6n, t);
+ synchronize_net();
err = ip6_tnl_change(t, &p);
ip6_tnl_link(ip6n, t);
netdev_state_change(dev);
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 367a6cc..d6bfaec 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -963,6 +963,7 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
}
t = netdev_priv(dev);
ipip6_tunnel_unlink(sitn, t);
+ synchronize_net();
t->parms.iph.saddr = p.iph.saddr;
t->parms.iph.daddr = p.iph.daddr;
memcpy(dev->dev_addr, &p.iph.saddr, 4);
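
The ordering that the patch enforces (unlink, wait for readers, rewrite the
key, relink) can be sketched in user space as below. This is only a
single-threaded model under stated assumptions: the table, the hash function
and every ex_* name are hypothetical, and ex_wait_for_readers() is an empty
placeholder for where synchronize_net() sits in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_HASH_SIZE 16

struct ex_tunnel {
	uint32_t saddr, daddr;   /* lookup key */
	struct ex_tunnel *next;  /* bucket chain */
};

struct ex_tunnel *ex_table[EX_HASH_SIZE];

unsigned int ex_hash(uint32_t saddr, uint32_t daddr)
{
	uint32_t h = saddr ^ daddr;

	return (h ^ (h >> 4)) & (EX_HASH_SIZE - 1);
}

void ex_link(struct ex_tunnel *t)
{
	unsigned int b = ex_hash(t->saddr, t->daddr);

	t->next = ex_table[b];
	ex_table[b] = t;
}

void ex_unlink(struct ex_tunnel *t)
{
	struct ex_tunnel **pp = &ex_table[ex_hash(t->saddr, t->daddr)];

	for (; *pp; pp = &(*pp)->next) {
		if (*pp == t) {
			*pp = t->next;
			return;
		}
	}
}

struct ex_tunnel *ex_lookup(uint32_t saddr, uint32_t daddr)
{
	struct ex_tunnel *t = ex_table[ex_hash(saddr, daddr)];

	for (; t; t = t->next)
		if (t->saddr == saddr && t->daddr == daddr)
			return t;
	return NULL;
}

/* Placeholder: in the kernel this is where the code would block until
 * every reader that may still hold a pointer to the tunnel is done. */
void ex_wait_for_readers(void)
{
}

void ex_change(struct ex_tunnel *t, uint32_t saddr, uint32_t daddr)
{
	assert(t != NULL);
	ex_unlink(t);           /* new lookups can no longer find t */
	ex_wait_for_readers();  /* earlier lookups have finished with t */
	t->saddr = saddr;       /* only now is it safe to rewrite the key */
	t->daddr = daddr;
	ex_link(t);             /* reinsert, possibly into another bucket */
}
```

Without the wait step, a reader that found the tunnel just before the unlink
could see the key change underneath it, which is the window described in the
commit message.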
* [net-next PATCH 1/2] qlge: Add firmware info to ethtool get regs.
From: Ron Mercer @ 2010-10-27 14:58 UTC
To: davem; +Cc: netdev, ron.mercer, jitendra.kalsaria, ying.lok
By default we add firmware information to ethtool get regs.
Optionally firmware info can instead be sent to log.
Signed-off-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>
---
drivers/net/qlge/qlge.h | 2 ++
drivers/net/qlge/qlge_dbg.c | 21 ++++++++++++++++++++-
drivers/net/qlge/qlge_ethtool.c | 19 ++++++++++++++++---
3 files changed, 38 insertions(+), 4 deletions(-)
diff --git a/drivers/net/qlge/qlge.h b/drivers/net/qlge/qlge.h
index a478786..0474d20 100644
--- a/drivers/net/qlge/qlge.h
+++ b/drivers/net/qlge/qlge.h
@@ -2221,6 +2221,7 @@ int ql_write_mpi_reg(struct ql_adapter *qdev, u32 reg, u32 data);
int ql_unpause_mpi_risc(struct ql_adapter *qdev);
int ql_pause_mpi_risc(struct ql_adapter *qdev);
int ql_hard_reset_mpi_risc(struct ql_adapter *qdev);
+int ql_soft_reset_mpi_risc(struct ql_adapter *qdev);
int ql_dump_risc_ram_area(struct ql_adapter *qdev, void *buf,
u32 ram_addr, int word_count);
int ql_core_dump(struct ql_adapter *qdev,
@@ -2237,6 +2238,7 @@ int ql_mb_set_mgmnt_traffic_ctl(struct ql_adapter *qdev, u32 control);
int ql_mb_get_port_cfg(struct ql_adapter *qdev);
int ql_mb_set_port_cfg(struct ql_adapter *qdev);
int ql_wait_fifo_empty(struct ql_adapter *qdev);
+void ql_get_dump(struct ql_adapter *qdev, void *buff);
void ql_gen_reg_dump(struct ql_adapter *qdev,
struct ql_reg_dump *mpi_coredump);
netdev_tx_t ql_lb_send(struct sk_buff *skb, struct net_device *ndev);
diff --git a/drivers/net/qlge/qlge_dbg.c b/drivers/net/qlge/qlge_dbg.c
index 4747492..fca804f 100644
--- a/drivers/net/qlge/qlge_dbg.c
+++ b/drivers/net/qlge/qlge_dbg.c
@@ -1317,9 +1317,28 @@ void ql_gen_reg_dump(struct ql_adapter *qdev,
status = ql_get_ets_regs(qdev, &mpi_coredump->ets[0]);
if (status)
return;
+}
+
+void ql_get_dump(struct ql_adapter *qdev, void *buff)
+{
+ /*
+ * If the dump has already been taken and is stored
+ * in our internal buffer and if force dump is set then
+ * just start the spool to dump it to the log file
+ * and also, take a snapshot of the general regs to
+ * the user's buffer or else take complete dump
+ * to the user's buffer if force is not set.
+ */
- if (test_bit(QL_FRC_COREDUMP, &qdev->flags))
+ if (!test_bit(QL_FRC_COREDUMP, &qdev->flags)) {
+ if (!ql_core_dump(qdev, buff))
+ ql_soft_reset_mpi_risc(qdev);
+ else
+ netif_err(qdev, drv, qdev->ndev, "coredump failed!\n");
+ } else {
+ ql_gen_reg_dump(qdev, buff);
ql_get_core_dump(qdev);
+ }
}
/* Coredump to messages log file using separate worker thread */
diff --git a/drivers/net/qlge/qlge_ethtool.c b/drivers/net/qlge/qlge_ethtool.c
index 4892d64..8149cc9 100644
--- a/drivers/net/qlge/qlge_ethtool.c
+++ b/drivers/net/qlge/qlge_ethtool.c
@@ -375,7 +375,10 @@ static void ql_get_drvinfo(struct net_device *ndev,
strncpy(drvinfo->bus_info, pci_name(qdev->pdev), 32);
drvinfo->n_stats = 0;
drvinfo->testinfo_len = 0;
- drvinfo->regdump_len = 0;
+ if (!test_bit(QL_FRC_COREDUMP, &qdev->flags))
+ drvinfo->regdump_len = sizeof(struct ql_mpi_coredump);
+ else
+ drvinfo->regdump_len = sizeof(struct ql_reg_dump);
drvinfo->eedump_len = 0;
}
@@ -547,7 +550,12 @@ static void ql_self_test(struct net_device *ndev,
static int ql_get_regs_len(struct net_device *ndev)
{
- return sizeof(struct ql_reg_dump);
+ struct ql_adapter *qdev = netdev_priv(ndev);
+
+ if (!test_bit(QL_FRC_COREDUMP, &qdev->flags))
+ return sizeof(struct ql_mpi_coredump);
+ else
+ return sizeof(struct ql_reg_dump);
}
static void ql_get_regs(struct net_device *ndev,
@@ -555,7 +563,12 @@ static void ql_get_regs(struct net_device *ndev,
{
struct ql_adapter *qdev = netdev_priv(ndev);
- ql_gen_reg_dump(qdev, p);
+ ql_get_dump(qdev, p);
+ qdev->core_is_dumped = 0;
+ if (!test_bit(QL_FRC_COREDUMP, &qdev->flags))
+ regs->len = sizeof(struct ql_mpi_coredump);
+ else
+ regs->len = sizeof(struct ql_reg_dump);
}
static int ql_get_coalesce(struct net_device *dev, struct ethtool_coalesce *c)
--
1.6.0.2
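
The patch repeats the same flag test in three places to pick the dump size.
One way this could be kept in a single place is sketched below. This is a
hypothetical helper, not part of the patch: the struct sizes are arbitrary
stand-ins, and the flag is modelled as a plain bit mask rather than the
driver's test_bit() on qdev->flags.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for the driver's dump layouts; the sizes are arbitrary. */
struct ex_reg_dump     { unsigned int regs[64]; };
struct ex_mpi_coredump { unsigned int words[4096]; };

enum { EX_FRC_COREDUMP = 1 }; /* stand-in for QL_FRC_COREDUMP */

/* Size of the buffer that "get regs" will fill for the given flags. */
size_t ex_regs_len(unsigned long flags)
{
	if (!(flags & EX_FRC_COREDUMP))
		return sizeof(struct ex_mpi_coredump); /* full dump to the user */
	return sizeof(struct ex_reg_dump);         /* general registers only */
}

/* Fills `buf` and returns the number of bytes written. */
size_t ex_get_regs(unsigned long flags, void *buf, size_t len)
{
	size_t need = ex_regs_len(flags);

	assert(len >= need);  /* caller sized the buffer via ex_regs_len() */
	memset(buf, 0, need); /* placeholder for the real dump */
	return need;
}
```

With a helper of this shape, the drvinfo length, the get_regs_len callback
and the regs->len assignment would all agree by construction.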
* [net-next PATCH 2/2] qlge: Version change to v1.00.00.27
From: Ron Mercer @ 2010-10-27 14:58 UTC
To: davem; +Cc: netdev, ron.mercer, jitendra.kalsaria, ying.lok
In-Reply-To: <1288191507-1994-1-git-send-email-ron.mercer@qlogic.com>
Signed-off-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>
---
drivers/net/qlge/qlge.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/net/qlge/qlge.h b/drivers/net/qlge/qlge.h
index 0474d20..69c4780 100644
--- a/drivers/net/qlge/qlge.h
+++ b/drivers/net/qlge/qlge.h
@@ -16,7 +16,7 @@
*/
#define DRV_NAME "qlge"
#define DRV_STRING "QLogic 10 Gigabit PCI-E Ethernet Driver "
-#define DRV_VERSION "v1.00.00.25.00.00-01"
+#define DRV_VERSION "v1.00.00.27.00.00-01"
#define WQ_ADDR_ALIGN 0x3 /* 4 byte alignment */
--
1.6.0.2
* [net-2.6 PATCH 1/1] qlge: bugfix: Restoring the vlan setting.
From: Ron Mercer @ 2010-10-27 14:58 UTC
To: davem; +Cc: netdev, ron.mercer, jitendra.kalsaria, ying.lok
Signed-off-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>
---
drivers/net/qlge/qlge_main.c | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)
diff --git a/drivers/net/qlge/qlge_main.c b/drivers/net/qlge/qlge_main.c
index e621056..c30e0fe 100644
--- a/drivers/net/qlge/qlge_main.c
+++ b/drivers/net/qlge/qlge_main.c
@@ -2385,6 +2385,20 @@ static void qlge_vlan_rx_kill_vid(struct net_device *ndev, u16 vid)
}
+static void qlge_restore_vlan(struct ql_adapter *qdev)
+{
+ qlge_vlan_rx_register(qdev->ndev, qdev->vlgrp);
+
+ if (qdev->vlgrp) {
+ u16 vid;
+ for (vid = 0; vid < VLAN_N_VID; vid++) {
+ if (!vlan_group_get_device(qdev->vlgrp, vid))
+ continue;
+ qlge_vlan_rx_add_vid(qdev->ndev, vid);
+ }
+ }
+}
+
/* MSI-X Multiple Vector Interrupt Handler for inbound completions. */
static irqreturn_t qlge_msix_rx_isr(int irq, void *dev_id)
{
@@ -3960,6 +3974,9 @@ static int ql_adapter_up(struct ql_adapter *qdev)
clear_bit(QL_PROMISCUOUS, &qdev->flags);
qlge_set_multicast_list(qdev->ndev);
+ /* Restore vlan setting. */
+ qlge_restore_vlan(qdev);
+
ql_enable_interrupts(qdev);
ql_enable_all_completion_interrupts(qdev);
netif_tx_start_all_queues(qdev->ndev);
--
1.6.0.2
* Re: [PATCH 5/5] tcp: ipv4 listen state scaled
From: Alexey Kuznetsov @ 2010-10-27 15:04 UTC
To: Dmitry Popov, netdev
In-Reply-To: <AANLkTikRsOevLBHn0xb0S_YvfPMWpAdw373bxQUc+xbV@mail.gmail.com>
Hello!
It looks like there is at least one hole here.
You take the lock, check the syn table and drop the lock in tcp_v4_hnd_req().
Then you immediately enter tcp_v4_conn_request() and grab the lock again.
Oops, in the tiny hole while the lock was dropped the request may already
have been created (even funnier, the whole socket may already have been
created and even accepted).
So, if you drop the lock, you have to restart the whole tcp_v4_rcv_listen()
(which seems to be impossible without additional tricks).
Alexey
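
The hole described above is a check-then-act race: the lookup and the insert
happen in two separate critical sections. A deterministic user-space model of
it is sketched below. Everything here is a simplification with hypothetical
ex_* names; the "lock" is a flag that only asserts correct nesting, and the
two CPUs are interleaved by hand rather than with real threads.

```c
#include <assert.h>
#include <stdbool.h>

#define EX_MAX_REQ 8

struct ex_listener {
	bool locked;               /* stand-in for the socket lock */
	int  req_peer[EX_MAX_REQ]; /* pending requests, keyed by peer id */
	int  nreq;
};

void ex_lock(struct ex_listener *l)   { assert(!l->locked); l->locked = true; }
void ex_unlock(struct ex_listener *l) { assert(l->locked);  l->locked = false; }

int ex_count(const struct ex_listener *l, int peer)
{
	int n = 0;

	for (int i = 0; i < l->nreq; i++)
		if (l->req_peer[i] == peer)
			n++;
	return n;
}

void ex_add(struct ex_listener *l, int peer)
{
	assert(l->locked && l->nreq < EX_MAX_REQ);
	l->req_peer[l->nreq++] = peer;
}

/* Racy shape, step 1: look up under the lock, then drop it. */
bool ex_check_then_drop(struct ex_listener *l, int peer)
{
	ex_lock(l);
	bool found = ex_count(l, peer) > 0;
	ex_unlock(l); /* window: another CPU may insert right here */
	return found;
}

/* Racy shape, step 2: insert based on the (possibly stale) answer. */
void ex_insert_separately(struct ex_listener *l, int peer)
{
	ex_lock(l);
	ex_add(l, peer);
	ex_unlock(l);
}

/* Safe shape: decide and act under a single lock hold. */
bool ex_insert_if_absent(struct ex_listener *l, int peer)
{
	ex_lock(l);
	bool absent = ex_count(l, peer) == 0;
	if (absent)
		ex_add(l, peer);
	ex_unlock(l);
	return absent;
}
```

If two CPUs both run ex_check_then_drop() for the same peer before either
runs ex_insert_separately(), the request is created twice; with
ex_insert_if_absent() the second caller sees the first one's entry.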
* [PATCH] usbnet: runtime pm: fix usb_autopm_get_interface failure
From: tom.leiming @ 2010-10-27 14:45 UTC
To: netdev, oliver, davem
Cc: Ming Lei, David Brownell, Greg Kroah-Hartman, Ben Hutchings,
Joe Perches, Andy Shevchenko, stable
From: Ming Lei <tom.leiming@gmail.com>
Since usbnet already uses USB runtime PM, we have to
enable runtime PM for the USB interface of usbnet; otherwise
usb_autopm_get_interface() may fail and cause
'ifconfig usb0 up' to fail if USB_SUSPEND (RUNTIME_PM) is
enabled.
Cc: David Brownell <dbrownell@users.sourceforge.net>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Joe Perches <joe@perches.com>
Cc: Oliver Neukum <oliver@neukum.org>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
drivers/net/usb/usbnet.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)
diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index ca7fc9d..765308f 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -1273,6 +1273,16 @@ usbnet_probe (struct usb_interface *udev, const struct usb_device_id *prod)
struct usb_device *xdev;
int status;
const char *name;
+ struct usb_driver *driver = to_usb_driver(udev->dev.driver);
+
+ /*usbnet already took usb runtime pm, so have to enable the feature
+ * for usb interface, otherwise usb_autopm_get_interface may return
+ * failure if USB_SUSPEND(RUNTIME_PM) is enabled.
+ * */
+ if (!driver->supports_autosuspend) {
+ driver->supports_autosuspend = 1;
+ pm_runtime_enable(&udev->dev);
+ }
name = udev->dev.driver->name;
info = (struct driver_info *) prod->driver_info;
--
1.7.3
* Re: [RFC PATCH] macvlan: Introduce a PASSTHRU mode to takeover the underlying device
From: Arnd Bergmann @ 2010-10-27 14:05 UTC
To: Sridhar Samudrala; +Cc: kaber, netdev, kvm@vger.kernel.org
In-Reply-To: <1288131578.7582.49.camel@sridhar.beaverton.ibm.com>
On Wednesday 27 October 2010, Sridhar Samudrala wrote:
> With the current default macvtap mode, a KVM guest using virtio with
> macvtap backend has the following limitations.
> - cannot change/add a mac address on the guest virtio-net
> - cannot create a vlan device on the guest virtio-net
> - cannot enable promiscuous mode on guest virtio-net
>
> This patch introduces a new mode called 'passthru' when creating a
> macvlan device which allows takeover of the underlying device and
> passing it to a guest using virtio with macvtap backend.
>
> Only one macvlan device is allowed in passthru mode and it inherits
> the mac address from the underlying device and sets it in promiscuous
> mode to receive and forward all the packets.
Interesting approach. It somewhat stretches the definition of the
macvlan concept, but it does sound useful to have.
I was thinking about adding a new tap frontend driver that could
share some code with macvtap and do only the takeover but not
use macvlan as a base. I believe that would be a cleaner abstraction,
but your code has two advantages in that the implementation is much
simpler and that it can share a fair amount of the infrastructure
that we're putting into qemu/libvirt/etc.
Arnd
PS: Please add a Signed-off-by: line when sending a patch, even for
discussion.
* Re: [PATCH iproute2] Add passthru mode and support 'mode' parameter with macvtap devices
From: Arnd Bergmann @ 2010-10-27 14:00 UTC
To: Sridhar Samudrala; +Cc: kaber, netdev, kvm@vger.kernel.org, Stephen Hemminger
In-Reply-To: <1288131588.7582.50.camel@sridhar.beaverton.ibm.com>
On Wednesday 27 October 2010, Sridhar Samudrala wrote:
> Support a new 'passthru' mode with macvlan and 'mode' parameter
> with macvtap devices.
>
> Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Can you split this into two patches?
We definitely want the part adding support for macvtap device mode
setting now. The new passthru mode for macvlan and macvtap probably
needs some discussion and the patch in iproute2 will depend on
the kernel patch getting merged first.
I've added Stephen to the Cc list, he should also take a look.
Arnd
> diff --git a/include/linux/if_link.h b/include/linux/if_link.h
> index f5bb2dc..23de79e 100644
> --- a/include/linux/if_link.h
> +++ b/include/linux/if_link.h
> @@ -230,6 +230,7 @@ enum macvlan_mode {
> MACVLAN_MODE_PRIVATE = 1, /* don't talk to other macvlans */
> MACVLAN_MODE_VEPA = 2, /* talk to other ports through ext bridge */
> MACVLAN_MODE_BRIDGE = 4, /* talk to bridge ports directly */
> + MACVLAN_MODE_PASSTHRU = 8, /* take over the underlying device */
> };
>
> /* SR-IOV virtual function management section */
> diff --git a/ip/Makefile b/ip/Makefile
> index 2f223ca..6054e8a 100644
> --- a/ip/Makefile
> +++ b/ip/Makefile
> @@ -3,7 +3,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o \
> ipmaddr.o ipmonitor.o ipmroute.o ipprefix.o iptuntap.o \
> ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \
> iplink_vlan.o link_veth.o link_gre.o iplink_can.o \
> - iplink_macvlan.o
> + iplink_macvlan.o iplink_macvtap.o
>
> RTMONOBJ=rtmon.o
>
> diff --git a/ip/iplink_macvlan.c b/ip/iplink_macvlan.c
> index a3c78bd..97787f9 100644
> --- a/ip/iplink_macvlan.c
> +++ b/ip/iplink_macvlan.c
> @@ -48,6 +48,8 @@ static int macvlan_parse_opt(struct link_util *lu, int argc, char **argv,
> mode = MACVLAN_MODE_VEPA;
> else if (strcmp(*argv, "bridge") == 0)
> mode = MACVLAN_MODE_BRIDGE;
> + else if (strcmp(*argv, "passthru") == 0)
> + mode = MACVLAN_MODE_PASSTHRU;
> else
> return mode_arg();
>
> @@ -82,6 +84,7 @@ static void macvlan_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[]
> mode == MACVLAN_MODE_PRIVATE ? "private"
> : mode == MACVLAN_MODE_VEPA ? "vepa"
> : mode == MACVLAN_MODE_BRIDGE ? "bridge"
> + : mode == MACVLAN_MODE_PASSTHRU ? "passthru"
> : "unknown");
> }
>
> diff --git a/ip/iplink_macvtap.c b/ip/iplink_macvtap.c
> new file mode 100644
> index 0000000..040cc68
> --- /dev/null
> +++ b/ip/iplink_macvtap.c
> @@ -0,0 +1,93 @@
> +/*
> + * iplink_macvtap.c macvtap device support
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/socket.h>
> +#include <linux/if_link.h>
> +
> +#include "rt_names.h"
> +#include "utils.h"
> +#include "ip_common.h"
> +
> +static void explain(void)
> +{
> + fprintf(stderr,
> + "Usage: ... macvtap mode { private | vepa | bridge | passthru }\n"
> + );
> +}
> +
> +static int mode_arg(void)
> +{
> + fprintf(stderr, "Error: argument of \"mode\" must be \"private\", "
> + "\"vepa\", \"bridge\" or \"passthru\"\n");
> + return -1;
> +}
> +
> +static int macvtap_parse_opt(struct link_util *lu, int argc, char **argv,
> + struct nlmsghdr *n)
> +{
> + while (argc > 0) {
> + if (matches(*argv, "mode") == 0) {
> + __u32 mode = 0;
> + NEXT_ARG();
> +
> + if (strcmp(*argv, "private") == 0)
> + mode = MACVLAN_MODE_PRIVATE;
> + else if (strcmp(*argv, "vepa") == 0)
> + mode = MACVLAN_MODE_VEPA;
> + else if (strcmp(*argv, "bridge") == 0)
> + mode = MACVLAN_MODE_BRIDGE;
> + else if (strcmp(*argv, "passthru") == 0)
> + mode = MACVLAN_MODE_PASSTHRU;
> + else
> + return mode_arg();
> +
> + addattr32(n, 1024, IFLA_MACVLAN_MODE, mode);
> + } else if (matches(*argv, "help") == 0) {
> + explain();
> + return -1;
> + } else {
> + fprintf(stderr, "macvtap: what is \"%s\"?\n", *argv);
> + explain();
> + return -1;
> + }
> + argc--, argv++;
> + }
> +
> + return 0;
> +}
> +
> +static void macvtap_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
> +{
> + __u32 mode;
> +
> + if (!tb)
> + return;
> +
> + if (!tb[IFLA_MACVLAN_MODE] ||
> + RTA_PAYLOAD(tb[IFLA_MACVLAN_MODE]) < sizeof(__u32))
> + return;
> +
> + mode = *(__u32 *)RTA_DATA(tb[IFLA_MACVLAN_MODE]);
> + fprintf(f, " mode %s ",
> + mode == MACVLAN_MODE_PRIVATE ? "private"
> + : mode == MACVLAN_MODE_VEPA ? "vepa"
> + : mode == MACVLAN_MODE_BRIDGE ? "bridge"
> + : mode == MACVLAN_MODE_PASSTHRU ? "passthru"
> + : "unknown");
> +}
> +
> +struct link_util macvtap_link_util = {
> + .id = "macvtap",
> + .maxattr = IFLA_MACVLAN_MAX,
> + .parse_opt = macvtap_parse_opt,
> + .print_opt = macvtap_print_opt,
> +};
>
>
>
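
The quoted patch maps mode names to values with an if/else chain when parsing
and a ternary chain when printing, once for macvlan and once for macvtap. A
table-driven alternative is sketched below as one possible shape, not as what
iproute2 does. The numeric values are the ones listed in the quoted if_link.h
hunk; the ex_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Values as listed in the quoted if_link.h hunk. */
enum ex_macvlan_mode {
	EX_MODE_PRIVATE  = 1,
	EX_MODE_VEPA     = 2,
	EX_MODE_BRIDGE   = 4,
	EX_MODE_PASSTHRU = 8,
};

static const struct {
	const char *name;
	unsigned int mode;
} ex_modes[] = {
	{ "private",  EX_MODE_PRIVATE },
	{ "vepa",     EX_MODE_VEPA },
	{ "bridge",   EX_MODE_BRIDGE },
	{ "passthru", EX_MODE_PASSTHRU },
};

#define EX_NMODES (sizeof(ex_modes) / sizeof(ex_modes[0]))

/* Returns 0 and stores the mode on success, -1 for an unknown name. */
int ex_mode_from_name(const char *name, unsigned int *mode)
{
	assert(name != NULL && mode != NULL);
	for (size_t i = 0; i < EX_NMODES; i++) {
		if (strcmp(name, ex_modes[i].name) == 0) {
			*mode = ex_modes[i].mode;
			return 0;
		}
	}
	return -1;
}

/* Returns the name for a mode value, or "unknown". */
const char *ex_mode_to_name(unsigned int mode)
{
	for (size_t i = 0; i < EX_NMODES; i++)
		if (ex_modes[i].mode == mode)
			return ex_modes[i].name;
	return "unknown";
}
```

A shared table of this kind would let the macvlan and macvtap front ends pick
up a new mode such as passthru from a single added row.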
* [PATCH 5/5] tcp: ipv4 listen state scaled
From: Dmitry Popov @ 2010-10-27 13:32 UTC
To: David S. Miller, William.Allen.Simpson, Eric Dumazet,
Andreas Petlund, Shan Wei
From: Dmitry Popov <dp@highloadlab.com>
Add a fast path for TCP_LISTEN state processing.
tcp_v4_rcv_listen() is called from tcp_v4_rcv() without the socket lock.
However, it may acquire the main socket lock in 3 cases:
1) To check syn_table in tcp_v4_hnd_req.
2) To check syn_table and modify accept queue in tcp_v4_conn_request.
3) To modify accept queue in get_cookie_sock.
In cases 1 and 2 we check for user lock and add skb to sk_backlog if
socket is locked.
In case 3 we don't check for user lock and it may lead to wrong
behavior. That's why we need socket locking in tcp_set_state(sk,
TCP_CLOSE).
An additional state in sk->sk_lock.owned is needed to prevent an infinite
loop in backlog processing.
Signed-off-by: Dmitry Popov <dp@highloadlab.com>
---
include/net/sock.h | 6 ++-
net/core/sock.c | 4 +-
net/ipv4/syncookies.c | 20 +++++-
net/ipv4/tcp.c | 5 ++
net/ipv4/tcp_ipv4.c | 159 +++++++++++++++++++++++++++++++++++++++++--------
5 files changed, 162 insertions(+), 32 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index adab9dc..b6d0ca1 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -994,7 +994,11 @@ static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
* Since ~2.3.5 it is also exclusive sleep lock serializing
* accesses from user process context.
*/
-#define sock_owned_by_user(sk) ((sk)->sk_lock.owned)
+#define sock_owned_by_user(sk) ((sk)->sk_lock.owned)
+/* backlog processing, see __release_sock(sk) */
+#define sock_owned_by_backlog(sk) ((sk)->sk_lock.owned < 0)
+/* sock owned by user, but not for backlog processing */
+#define __sock_owned_by_user(sk) ((sk)->sk_lock.owned > 0)
/*
* Macro so as to not evaluate some arguments when
diff --git a/net/core/sock.c b/net/core/sock.c
index e73dfe3..f4233c7 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2015,8 +2015,10 @@ void release_sock(struct sock *sk)
mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
spin_lock_bh(&sk->sk_lock.slock);
- if (sk->sk_backlog.tail)
+ if (sk->sk_backlog.tail) {
+ sk->sk_lock.owned = -1;
__release_sock(sk);
+ }
sk->sk_lock.owned = 0;
if (waitqueue_active(&sk->sk_lock.wq))
wake_up(&sk->sk_lock.wq);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 650cace..a37f8e8 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -211,10 +211,22 @@ static inline struct sock *get_cookie_sock(struct sock *sk, struct sk_buff *skb,
struct inet_connection_sock *icsk = inet_csk(sk);
struct sock *child;
- child = icsk->icsk_af_ops->syn_recv_sock(sk, skb, req, dst);
- if (child)
- inet_csk_reqsk_queue_add(sk, req, child);
- else
+ bh_lock_sock_nested(sk);
+ /* TODO: move syn_recv_sock before this lock */
+ spin_lock(&icsk->icsk_accept_queue.rskq_accept_lock);
+
+ if (likely(icsk->icsk_accept_queue.rskq_active)) {
+ child = icsk->icsk_af_ops->syn_recv_sock(sk, skb, req, dst);
+ if (child)
+ inet_csk_reqsk_queue_do_add(sk, req, child);
+ } else {
+ child = NULL;
+ }
+
+ spin_unlock(&icsk->icsk_accept_queue.rskq_accept_lock);
+ bh_unlock_sock(sk);
+
+ if (unlikely(child == NULL))
reqsk_free(req);
return child;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ebb9d80..417f2d9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1812,10 +1812,15 @@ void tcp_set_state(struct sock *sk, int state)
if (oldstate == TCP_CLOSE_WAIT || oldstate == TCP_ESTABLISHED)
TCP_INC_STATS(sock_net(sk), TCP_MIB_ESTABRESETS);
+ if (oldstate == TCP_LISTEN)
+ /* We have to prevent race condition in syn_recv_sock */
+ bh_lock_sock_nested(sk);
sk->sk_prot->unhash(sk);
if (inet_csk(sk)->icsk_bind_hash &&
!(sk->sk_userlocks & SOCK_BINDPORT_LOCK))
inet_put_port(sk);
+ if (oldstate == TCP_LISTEN)
+ bh_unlock_sock(sk);
/* fall through */
default:
if (oldstate == TCP_ESTABLISHED)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 1e641b0..f22931d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1338,7 +1338,24 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
/* Never answer to SYNs send to broadcast or multicast */
if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
+ return 0;
+
+ bh_lock_sock_nested(sk);
+
+ if (__sock_owned_by_user(sk)) {
+ /* Some inefficiency: it leads to double syn_table lookup */
+ if (likely(!sk_add_backlog(sk, skb)))
+ skb_get(skb);
+ else
+ NET_INC_STATS_BH(dev_net(skb->dev),
+ LINUX_MIB_TCPBACKLOGDROP);
goto drop;
+ }
+
+ if (inet_csk(sk)->icsk_accept_queue.listen_opt == NULL) {
+ /* socket is closing */
+ goto drop;
+ }
/* TW buckets are converted to open requests without
* limitations, they conserve resources and peer is
@@ -1353,6 +1370,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
syn_flood_warning(skb);
if (sysctl_tcp_syncookies) {
tcp_inc_syncookie_stats(&tp->syncookie_stats);
+ bh_unlock_sock(sk);
want_cookie = 1;
} else
#else
@@ -1405,9 +1423,6 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
while (l-- > 0)
*c++ ^= *hash_location++;
-#ifdef CONFIG_SYN_COOKIES
- want_cookie = 0; /* not our kind of cookie */
-#endif
tmp_ext.cookie_out_never = 0; /* false */
tmp_ext.cookie_plus = tmp_opt.cookie_plus;
tmp_ext.cookie_in_always = tp->rx_opt.cookie_in_always;
@@ -1494,6 +1509,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
goto drop_and_free;
inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+ bh_unlock_sock(sk);
return 0;
drop_and_release:
@@ -1501,6 +1517,8 @@ drop_and_release:
drop_and_free:
reqsk_free(req);
drop:
+ if (!want_cookie)
+ bh_unlock_sock(sk);
return 0;
}
EXPORT_SYMBOL(tcp_v4_conn_request);
@@ -1588,10 +1606,35 @@ static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
struct sock *nsk;
struct request_sock **prev;
/* Find possible connection requests. */
- struct request_sock *req = inet_csk_search_req(sk, &prev, th->source,
+ struct request_sock *req;
+
+ bh_lock_sock_nested(sk);
+
+ if (__sock_owned_by_user(sk)) {
+ if (likely(!sk_add_backlog(sk, skb)))
+ skb_get(skb);
+ else
+ NET_INC_STATS_BH(dev_net(skb->dev),
+ LINUX_MIB_TCPBACKLOGDROP);
+ bh_unlock_sock(sk);
+ return NULL;
+ }
+
+ if (inet_csk(sk)->icsk_accept_queue.listen_opt == NULL) {
+ /* socket is closing */
+ bh_unlock_sock(sk);
+ return NULL;
+ }
+
+ req = inet_csk_search_req(sk, &prev, th->source,
iph->saddr, iph->daddr);
- if (req)
- return tcp_check_req(sk, skb, req, prev);
+ if (req) {
+ nsk = tcp_check_req(sk, skb, req, prev);
+ bh_unlock_sock(sk);
+ return nsk;
+ } else {
+ bh_unlock_sock(sk);
+ }
nsk = inet_lookup_established(sock_net(sk), &tcp_hashinfo, iph->saddr,
th->source, iph->daddr, th->dest, inet_iif(skb));
@@ -1633,6 +1676,72 @@ static __sum16 tcp_v4_checksum_init(struct sk_buff *skb)
return 0;
}
+/* Beware! This may be called without socket lock.
+ * TCP Checksum should be checked before this call.
+ */
+int tcp_v4_rcv_listen(struct sock *sk, struct sk_buff *skb)
+{
+ struct sock *nsk;
+ struct sock *rsk;
+ struct tcphdr *th = tcp_hdr(skb);
+
+ nsk = tcp_v4_hnd_req(sk, skb);
+
+ if (!nsk)
+ goto discard;
+
+ if (nsk != sk) {
+ /* Probable SYN-ACK */
+ if (tcp_child_process(sk, nsk, skb)) {
+ rsk = nsk;
+ goto reset;
+ }
+ return 0;
+ }
+
+ /* Probable SYN */
+ TCP_CHECK_TIMER(sk);
+
+ if (th->ack) {
+ rsk = sk;
+ goto reset;
+ }
+
+ if (!th->rst && th->syn) {
+ if (inet_csk(sk)->icsk_af_ops->conn_request(sk, skb) < 0) {
+ rsk = sk;
+ goto reset;
+ }
+ /* Now we have several options: In theory there is
+ * nothing else in the frame. KA9Q has an option to
+ * send data with the syn, BSD accepts data with the
+ * syn up to the [to be] advertised window and
+ * Solaris 2.1 gives you a protocol error. For now
+ * we just ignore it, that fits the spec precisely
+ * and avoids incompatibilities. It would be nice in
+ * future to drop through and process the data.
+ *
+ * Now that TTCP is starting to be used we ought to
+ * queue this data.
+ * But, this leaves one open to an easy denial of
+ * service attack, and SYN cookies can't defend
+ * against this problem. So, we drop the data
+ * in the interest of security over speed unless
+ * it's still in use.
+ */
+ }
+
+ TCP_CHECK_TIMER(sk);
+
+discard:
+ kfree_skb(skb);
+ return 0;
+
+reset:
+ tcp_v4_send_reset(rsk, skb);
+ goto discard;
+}
+
/* The socket must have it's spinlock held when we get
* here.
@@ -1644,15 +1753,11 @@ static __sum16 tcp_v4_checksum_init(struct sk_buff *skb)
*/
int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
- struct sock *rsk;
-
if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
sock_rps_save_rxhash(sk, skb->rxhash);
TCP_CHECK_TIMER(sk);
- if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len)) {
- rsk = sk;
+ if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len))
goto reset;
- }
TCP_CHECK_TIMER(sk);
return 0;
}
@@ -1660,32 +1765,23 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
if (skb->len < tcp_hdrlen(skb) || tcp_checksum_complete(skb))
goto csum_err;
- if (sk->sk_state == TCP_LISTEN) {
- struct sock *nsk = tcp_v4_hnd_req(sk, skb);
- if (!nsk)
- goto discard;
-
- if (nsk != sk) {
- if (tcp_child_process(sk, nsk, skb)) {
- rsk = nsk;
- goto reset;
- }
- return 0;
- }
- } else
+ if (sk->sk_state == TCP_LISTEN)
+ /* This is for IPv4-mapped IPv6 addresses
+ and backlog processing */
+ return tcp_v4_rcv_listen(sk, skb);
+ else
sock_rps_save_rxhash(sk, skb->rxhash);
TCP_CHECK_TIMER(sk);
if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb), skb->len)) {
- rsk = sk;
goto reset;
}
TCP_CHECK_TIMER(sk);
return 0;
reset:
- tcp_v4_send_reset(rsk, skb);
+ tcp_v4_send_reset(sk, skb);
discard:
kfree_skb(skb);
/* Be careful here. If this function gets more complicated and
@@ -1779,6 +1875,17 @@ process:
goto discard_and_relse;
#endif
+ if (sk->sk_state == TCP_LISTEN) {
+ /* Fast path for listening socket */
+ if (skb->len < tcp_hdrlen(skb) || tcp_checksum_complete(skb)) {
+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
+ goto discard_and_relse;
+ }
+ tcp_v4_rcv_listen(sk, skb);
+ sock_put(sk);
+ return 0;
+ }
+
bh_lock_sock_nested(sk);
ret = 0;
if (!sock_owned_by_user(sk)) {
^ permalink raw reply related
* [PATCH 4/5] tcp: syncookie stats counter
From: Dmitry Popov @ 2010-10-27 13:30 UTC (permalink / raw)
To: David S. Miller, William.Allen.Simpson, Eric Dumazet,
Andreas Petlund, Shan Wei
From: Dmitry Popov <dp@highloadlab.com>
Add an estimate of the number of SYNACKs sent, which is used to choose
between syn_backlog and syn_cookies.
The estimate is made over a TCP_TIMEOUT_INIT period. A tcp_syncookie_stats
struct is added to tcp_sock.
Signed-off-by: Dmitry Popov <dp@highloadlab.com>
---
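Note: the two-slot counter described above can be modelled in userspace
roughly as follows. This is only a sketch under stated assumptions, not the
kernel code: "now" stands in for jiffies, TIMEOUT_INIT is an arbitrary
stand-in for TCP_TIMEOUT_INIT, plain comparisons replace the wrap-safe
time_after()/time_before(), and the names stats_inc, stats_get and
queue_is_delta_full are hypothetical.

```c
#include <stdint.h>

/* Assumed stand-in for TCP_TIMEOUT_INIT, in arbitrary ticks. */
#define TIMEOUT_INIT 3000UL

/* Two counters; clock_hand selects the one currently being filled. */
struct syncookie_stats {
	unsigned long expires[2];
	uint32_t cookies_sent[2];
	uint8_t clock_hand;
};

/* Count one sent cookie; switch slots once the current one has expired. */
static void stats_inc(struct syncookie_stats *s, unsigned long now)
{
	uint8_t hand = s->clock_hand;

	if (now > s->expires[hand]) {
		hand ^= 1;
		s->expires[hand] = now + TIMEOUT_INIT;
		s->cookies_sent[hand] = 0;
		s->clock_hand = hand;
	}
	++s->cookies_sent[hand];
}

/* Return the count of the previous (complete) slot, or 0 if it is stale. */
static uint32_t stats_get(const struct syncookie_stats *s, unsigned long now)
{
	uint8_t old_hand = s->clock_hand ^ 1;

	if (now < s->expires[old_hand] + 2 * TIMEOUT_INIT)
		return s->cookies_sent[old_hand];
	return 0;
}

/* The queue counts as full when qlen + delta reaches 2^max_qlen_log. */
static int queue_is_delta_full(unsigned int qlen, unsigned int delta,
			       unsigned int max_qlen_log)
{
	return ((qlen + delta) >> max_qlen_log) != 0;
}
```

The estimate from stats_get() is passed as delta, so a listener that
recently sent many cookies is treated as full even when its queue has
some room left.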
include/linux/tcp.h | 15 +++++++++++++++
include/net/inet_connection_sock.h | 7 +++++++
include/net/request_sock.h | 8 ++++++++
include/net/tcp.h | 29 +++++++++++++++++++++++++++++
net/ipv4/tcp_ipv4.c | 12 ++++++++++--
net/ipv6/tcp_ipv6.c | 16 ++++++++++++----
6 files changed, 81 insertions(+), 6 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 3436176..7445d17 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -288,6 +288,17 @@ static inline struct tcp_request_sock
*tcp_rsk(const struct request_sock *req)
return (struct tcp_request_sock *)req;
}
+/* This structure is used to estimate amount of SYNs under (possible) SYNflood.
+ * @cookies_sent[@clock_hand] is incremented for each cookie sent.
+ * This counter is used while jiffies < @expires[@clock_hand],
+ * then clock_hand is switched (clock_hand ^= 1).
+ */
+struct tcp_syncookie_stats {
+ unsigned long expires[2];
+ u32 cookies_sent[2];
+ u8 clock_hand;
+};
+
struct tcp_sock {
/* inet_connection_sock has to be the first member of tcp_sock */
struct inet_connection_sock inet_conn;
@@ -471,6 +482,10 @@ struct tcp_sock {
* So readers which hold this lock may omit rcu_reader_lock.
*/
struct tcp_cookie_values *cookie_values;
+
+#ifdef CONFIG_SYN_COOKIES
+ struct tcp_syncookie_stats syncookie_stats;
+#endif
};
static inline struct tcp_sock *tcp_sk(const struct sock *sk)
diff --git a/include/net/inet_connection_sock.h
b/include/net/inet_connection_sock.h
index 430b58f..7c63bb0 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -293,6 +293,13 @@ static inline int
inet_csk_reqsk_queue_young(const struct sock *sk)
return reqsk_queue_len_young(&inet_csk(sk)->icsk_accept_queue);
}
+static inline int inet_csk_reqsk_queue_is_delta_full(const struct sock *sk,
+ int delta)
+{
+ const struct request_sock_queue *aq = &inet_csk(sk)->icsk_accept_queue;
+ return reqsk_queue_is_delta_full(aq, delta);
+}
+
static inline int inet_csk_reqsk_queue_is_full(const struct sock *sk)
{
return reqsk_queue_is_full(&inet_csk(sk)->icsk_accept_queue);
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 870c46b..1154277 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -272,6 +272,14 @@ static inline int reqsk_queue_len_young(const
struct request_sock_queue *queue)
return queue->listen_opt->qlen_young;
}
+static inline int
+ reqsk_queue_is_delta_full(const struct request_sock_queue *queue,
+ int delta)
+{
+ struct listen_sock *lopt = queue->listen_opt;
+ return (lopt->qlen + delta) >> lopt->max_qlen_log;
+}
+
static inline int reqsk_queue_is_full(const struct request_sock_queue *queue)
{
return queue->listen_opt->qlen >> queue->listen_opt->max_qlen_log;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 08e6ce1..c25d4a5 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1542,6 +1542,35 @@ static inline int tcp_s_data_size(const struct
tcp_sock *tp)
: 0;
}
+/* Updates syn cookie statistics,
+ * should be called after new cookie(SYNACK) is sent.
+ */
+static inline void tcp_inc_syncookie_stats(struct tcp_syncookie_stats *stats)
+{
+ u8 hand = stats->clock_hand;
+
+ if (unlikely(time_after(jiffies, stats->expires[hand]))) {
+ hand ^= 1;
+ stats->expires[hand] = jiffies + TCP_TIMEOUT_INIT;
+ stats->cookies_sent[hand] = 0;
+ stats->clock_hand = hand;
+ }
+ ++stats->cookies_sent[hand];
+}
+
+/* Returns previous syncookie stats(amount of cookies sent). */
+static inline u32 tcp_get_syncookie_stats(struct tcp_syncookie_stats *stats)
+{
+ u8 old_hand = stats->clock_hand ^ 1;
+
+ if (time_before(jiffies,
+ stats->expires[old_hand] + 2 * TCP_TIMEOUT_INIT))
+ /* old stats are not too old */
+ return stats->cookies_sent[old_hand];
+ else
+ return 0;
+}
+
/**
* struct tcp_extend_values - tcp_ipv?.c to tcp_output.c workspace.
*
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index c7e9b2a..1e641b0 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1344,13 +1344,21 @@ int tcp_v4_conn_request(struct sock *sk,
struct sk_buff *skb)
* limitations, they conserve resources and peer is
* evidently real one.
*/
- if (inet_csk_reqsk_queue_is_full(sk) && !isn) {
+#ifdef CONFIG_SYN_COOKIES
+ if (inet_csk_reqsk_queue_is_delta_full(sk,
+ tcp_get_syncookie_stats(&tp->syncookie_stats)) &&
+ !isn)
+ {
if (net_ratelimit())
syn_flood_warning(skb);
-#ifdef CONFIG_SYN_COOKIES
if (sysctl_tcp_syncookies) {
+ tcp_inc_syncookie_stats(&tp->syncookie_stats);
want_cookie = 1;
} else
+#else
+ if (inet_csk_reqsk_queue_is_full(sk) && !isn) {
+ if (net_ratelimit())
+ syn_flood_warning(skb);
#endif
goto drop;
}
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index a3fa1f9..767dfde 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1275,13 +1275,21 @@ static int tcp_v6_conn_request(struct sock
*sk, struct sk_buff *skb)
if (!ipv6_unicast_destination(skb))
goto drop;
- if (inet_csk_reqsk_queue_is_full(sk) && !isn) {
+#ifdef CONFIG_SYN_COOKIES
+ if (inet_csk_reqsk_queue_is_delta_full(sk,
+ tcp_get_syncookie_stats(&tp->syncookie_stats)) &&
+ !isn)
+ {
if (net_ratelimit())
syn_flood_warning(skb);
-#ifdef CONFIG_SYN_COOKIES
- if (sysctl_tcp_syncookies)
+ if (sysctl_tcp_syncookies) {
+ tcp_inc_syncookie_stats(&tp->syncookie_stats);
want_cookie = 1;
- else
+ } else
+#else
+ if (inet_csk_reqsk_queue_is_full(sk) && !isn) {
+ if (net_ratelimit())
+ syn_flood_warning(skb);
#endif
goto drop;
}
^ permalink raw reply related
* [PATCH 3/5] tcp: request sock accept queue spinlock protection
From: Dmitry Popov @ 2010-10-27 13:29 UTC (permalink / raw)
To: David S. Miller, William.Allen.Simpson, Eric Dumazet,
Andreas Petlund, Shan Wei
From: Dmitry Popov <dp@highloadlab.com>
Add a spinlock and an active flag to the request sock accept queue.
This is needed to access the queue without taking the main socket lock.
Signed-off-by: Dmitry Popov <dp@highloadlab.com>
---
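Note: the locking split described above (locked wrappers around "_do_"
helpers that expect the caller to hold the lock) can be sketched in
userspace as below. This is an illustration under assumptions, not the
kernel implementation: a C11 atomic_flag spin loop stands in for
spinlock_t, and struct req, struct accept_queue and the aq_* names are
hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

struct req {
	struct req *next;
	int id;
};

struct accept_queue {
	atomic_flag lock;	/* models rskq_accept_lock */
	int active;		/* models rskq_active */
	struct req *head;
	struct req *tail;
};

static void aq_init(struct accept_queue *q)
{
	atomic_flag_clear(&q->lock);
	q->active = 0;
	q->head = q->tail = NULL;
}

static void aq_lock(struct accept_queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock,
						 memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void aq_unlock(struct accept_queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Caller must hold q->lock. */
static void aq_do_add(struct accept_queue *q, struct req *r)
{
	r->next = NULL;
	if (q->head == NULL)
		q->head = r;
	else
		q->tail->next = r;
	q->tail = r;
}

/* Locked wrapper for callers that do not hold q->lock. */
static void aq_add(struct accept_queue *q, struct req *r)
{
	aq_lock(q);
	aq_do_add(q, r);
	aq_unlock(q);
}

/* Caller must hold q->lock. Returns NULL when the queue is empty. */
static struct req *aq_do_remove(struct accept_queue *q)
{
	struct req *r = q->head;

	if (r != NULL) {
		q->head = r->next;
		if (q->head == NULL)
			q->tail = NULL;
	}
	return r;
}

/* accept()-like path: check the flag and dequeue under one lock hold. */
static struct req *aq_accept(struct accept_queue *q)
{
	struct req *r = NULL;

	aq_lock(q);
	if (q->active)
		r = aq_do_remove(q);
	aq_unlock(q);
	return r;
}
```

Checking the active flag under the queue lock is what lets the accept
path stop consulting the socket state, which is otherwise protected by
the main socket lock.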
include/net/inet_connection_sock.h | 7 ++++
include/net/request_sock.h | 59 +++++++++++++++++++++++++++++------
net/core/request_sock.c | 4 ++-
net/ipv4/inet_connection_sock.c | 22 ++++++++-----
4 files changed, 73 insertions(+), 19 deletions(-)
diff --git a/include/net/inet_connection_sock.h
b/include/net/inet_connection_sock.h
index b6d3b55..430b58f 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -258,6 +258,13 @@ static inline void
inet_csk_reqsk_queue_add(struct sock *sk,
reqsk_queue_add(&inet_csk(sk)->icsk_accept_queue, req, sk, child);
}
+static inline void inet_csk_reqsk_queue_do_add(struct sock *sk,
+ struct request_sock *req,
+ struct sock *child)
+{
+ reqsk_queue_do_add(&inet_csk(sk)->icsk_accept_queue, req, sk, child);
+}
+
extern void inet_csk_reqsk_queue_hash_add(struct sock *sk,
struct request_sock *req,
unsigned long timeout);
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 99e6e19..870c46b 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -109,6 +109,8 @@ struct listen_sock {
*
* @rskq_accept_head - FIFO head of established children
* @rskq_accept_tail - FIFO tail of established children
+ * @rskq_accept_lock - guard for FIFO of established children
+ * @rskq_active - != 0 if we're ready for children (LISTEN state), 0 otherwise
* @rskq_defer_accept - User waits for some data after accept()
* @syn_wait_lock - serializer
*
@@ -124,9 +126,11 @@ struct listen_sock {
struct request_sock_queue {
struct request_sock *rskq_accept_head;
struct request_sock *rskq_accept_tail;
+ spinlock_t rskq_accept_lock;
rwlock_t syn_wait_lock;
u8 rskq_defer_accept;
- /* 3 bytes hole, try to pack */
+ u8 rskq_active;
+ /* 2 bytes hole, try to pack */
struct listen_sock *listen_opt;
};
@@ -137,11 +141,24 @@ extern void __reqsk_queue_destroy(struct
request_sock_queue *queue);
extern void reqsk_queue_destroy(struct request_sock_queue *queue);
static inline struct request_sock *
- reqsk_queue_yank_acceptq(struct request_sock_queue *queue)
+ reqsk_queue_do_yank_acceptq(struct request_sock_queue *queue)
{
struct request_sock *req = queue->rskq_accept_head;
queue->rskq_accept_head = NULL;
+
+ return req;
+}
+
+static inline struct request_sock *
+ reqsk_queue_yank_acceptq(struct request_sock_queue *queue)
+{
+ struct request_sock *req;
+
+ spin_lock_bh(&queue->rskq_accept_lock);
+ req = reqsk_queue_do_yank_acceptq(queue);
+ spin_unlock_bh(&queue->rskq_accept_lock);
+
return req;
}
@@ -159,13 +176,12 @@ static inline void reqsk_queue_unlink(struct
request_sock_queue *queue,
write_unlock(&queue->syn_wait_lock);
}
-static inline void reqsk_queue_add(struct request_sock_queue *queue,
+static inline void reqsk_queue_do_add(struct request_sock_queue *queue,
struct request_sock *req,
struct sock *parent,
struct sock *child)
{
req->sk = child;
- sk_acceptq_added(parent);
if (queue->rskq_accept_head == NULL)
queue->rskq_accept_head = req;
@@ -174,25 +190,48 @@ static inline void reqsk_queue_add(struct
request_sock_queue *queue,
queue->rskq_accept_tail = req;
req->dl_next = NULL;
+ sk_acceptq_added(parent);
}
-static inline struct request_sock *reqsk_queue_remove(struct
request_sock_queue *queue)
+static inline void reqsk_queue_add(struct request_sock_queue *queue,
+ struct request_sock *req,
+ struct sock *parent,
+ struct sock *child)
+{
+ spin_lock(&queue->rskq_accept_lock);
+ reqsk_queue_do_add(queue, req, parent, child);
+ spin_unlock(&queue->rskq_accept_lock);
+}
+
+static inline struct request_sock *
+ reqsk_queue_do_remove(struct request_sock_queue *queue)
{
struct request_sock *req = queue->rskq_accept_head;
WARN_ON(req == NULL);
queue->rskq_accept_head = req->dl_next;
- if (queue->rskq_accept_head == NULL)
- queue->rskq_accept_tail = NULL;
return req;
}
-static inline struct sock *reqsk_queue_get_child(struct
request_sock_queue *queue,
- struct sock *parent)
+static inline struct request_sock *
+ reqsk_queue_remove(struct request_sock_queue *queue)
+{
+ struct request_sock *req;
+
+ spin_lock_bh(&queue->rskq_accept_lock);
+ req = reqsk_queue_do_remove(queue);
+ spin_unlock_bh(&queue->rskq_accept_lock);
+
+ return req;
+}
+
+static inline struct sock *
+ reqsk_queue_do_get_child(struct request_sock_queue *queue,
+ struct sock *parent)
{
- struct request_sock *req = reqsk_queue_remove(queue);
+ struct request_sock *req = reqsk_queue_do_remove(queue);
struct sock *child = req->sk;
WARN_ON(child == NULL);
diff --git a/net/core/request_sock.c b/net/core/request_sock.c
index 7552495..a0f2955 100644
--- a/net/core/request_sock.c
+++ b/net/core/request_sock.c
@@ -58,8 +58,10 @@ int reqsk_queue_alloc(struct request_sock_queue *queue,
lopt->max_qlen_log++);
get_random_bytes(&lopt->hash_rnd, sizeof(lopt->hash_rnd));
- rwlock_init(&queue->syn_wait_lock);
+ spin_lock_init(&queue->rskq_accept_lock);
queue->rskq_accept_head = NULL;
+ queue->rskq_active = 0;
+ rwlock_init(&queue->syn_wait_lock);
lopt->nr_table_entries = nr_table_entries;
write_lock_bh(&queue->syn_wait_lock);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 7174370..ecf98d2 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -215,7 +215,7 @@ EXPORT_SYMBOL_GPL(inet_csk_get_port);
/*
* Wait for an incoming connection, avoid race conditions. This must be called
- * with the socket locked.
+ * with rskq_accept_lock locked.
*/
static int inet_csk_wait_for_connect(struct sock *sk, long timeo)
{
@@ -240,10 +240,12 @@ static int inet_csk_wait_for_connect(struct sock
*sk, long timeo)
for (;;) {
prepare_to_wait_exclusive(sk_sleep(sk), &wait,
TASK_INTERRUPTIBLE);
- release_sock(sk);
+ spin_unlock_bh(&icsk->icsk_accept_queue.rskq_accept_lock);
+
if (reqsk_queue_empty(&icsk->icsk_accept_queue))
timeo = schedule_timeout(timeo);
- lock_sock(sk);
+
+ spin_lock_bh(&icsk->icsk_accept_queue.rskq_accept_lock);
err = 0;
if (!reqsk_queue_empty(&icsk->icsk_accept_queue))
break;
@@ -270,13 +272,13 @@ struct sock *inet_csk_accept(struct sock *sk,
int flags, int *err)
struct sock *newsk;
int error;
- lock_sock(sk);
+ spin_lock_bh(&icsk->icsk_accept_queue.rskq_accept_lock);
/* We need to make sure that this socket is listening,
* and that it has something pending.
*/
error = -EINVAL;
- if (sk->sk_state != TCP_LISTEN)
+ if (!icsk->icsk_accept_queue.rskq_active)
goto out_err;
/* Find already established connection */
@@ -293,10 +295,10 @@ struct sock *inet_csk_accept(struct sock *sk,
int flags, int *err)
goto out_err;
}
- newsk = reqsk_queue_get_child(&icsk->icsk_accept_queue, sk);
+ newsk = reqsk_queue_do_get_child(&icsk->icsk_accept_queue, sk);
WARN_ON(newsk->sk_state == TCP_SYN_RECV);
out:
- release_sock(sk);
+ spin_unlock_bh(&icsk->icsk_accept_queue.rskq_accept_lock);
return newsk;
out_err:
newsk = NULL;
@@ -632,6 +634,7 @@ int inet_csk_listen_start(struct sock *sk, const
int nr_table_entries)
sk->sk_max_ack_backlog = 0;
sk->sk_ack_backlog = 0;
+ icsk->icsk_accept_queue.rskq_active = 1;
inet_csk_delack_init(sk);
/* There is race window here: we announce ourselves listening,
@@ -668,7 +671,10 @@ void inet_csk_listen_stop(struct sock *sk)
inet_csk_delete_keepalive_timer(sk);
/* make all the listen_opt local to us */
- acc_req = reqsk_queue_yank_acceptq(&icsk->icsk_accept_queue);
+ spin_lock_bh(&icsk->icsk_accept_queue.rskq_accept_lock);
+ icsk->icsk_accept_queue.rskq_active = 0;
+ acc_req = reqsk_queue_do_yank_acceptq(&icsk->icsk_accept_queue);
+ spin_unlock_bh(&icsk->icsk_accept_queue.rskq_accept_lock);
/* Following specs, it would be better either to send FIN
* (and enter FIN-WAIT-1, it is normal close)
^ permalink raw reply related
* [PATCH 2/5] tcp: small user_mss code change
From: Dmitry Popov @ 2010-10-27 13:28 UTC (permalink / raw)
To: David S. Miller, William.Allen.Simpson, Eric Dumazet,
Andreas Petlund, Shan Wei
From: Dmitry Popov <dp@highloadlab.com>
Read the user_mss field of the socket only once instead of twice.
This prevents possible race conditions when it is accessed without the
socket lock.
Signed-off-by: Dmitry Popov <dp@highloadlab.com>
---
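Note: the single-read pattern described above looks roughly like the
sketch below. It is a userspace illustration with a hypothetical
function name (clamp_advmss). The sketch reads through a
volatile-qualified pointer so that the compiler performs exactly one
load; the patch itself uses a plain local variable.

```c
/* Clamp the advertised MSS by a user-set limit (0 means "not set").
 * The shared value is loaded once, so the test and the assignment
 * always see the same number even if another thread changes it.
 */
static int clamp_advmss(const volatile int *user_mss_p, int advmss)
{
	int user_mss = *user_mss_p;	/* single snapshot */

	if (user_mss && user_mss < advmss)
		advmss = user_mss;
	return advmss;
}
```

With two separate reads, a concurrent writer could clear user_mss
between the comparison and the assignment, and the result would be an
MSS of 0.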
net/ipv4/tcp_ipv4.c | 7 ++++---
net/ipv4/tcp_output.c | 6 ++++--
2 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 3de881e..c7e9b2a 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1510,6 +1510,7 @@ struct sock *tcp_v4_syn_recv_sock(struct sock
*sk, struct sk_buff *skb,
struct inet_sock *newinet;
struct tcp_sock *newtp;
struct sock *newsk;
+ int user_mss;
#ifdef CONFIG_TCP_MD5SIG
struct tcp_md5sig_key *key;
#endif
@@ -1545,9 +1546,9 @@ struct sock *tcp_v4_syn_recv_sock(struct sock
*sk, struct sk_buff *skb,
tcp_mtup_init(newsk);
tcp_sync_mss(newsk, dst_mtu(dst));
newtp->advmss = dst_metric(dst, RTAX_ADVMSS);
- if (tcp_sk(sk)->rx_opt.user_mss &&
- tcp_sk(sk)->rx_opt.user_mss < newtp->advmss)
- newtp->advmss = tcp_sk(sk)->rx_opt.user_mss;
+ user_mss = tcp_sk(sk)->rx_opt.user_mss;
+ if (user_mss && user_mss < newtp->advmss)
+ newtp->advmss = user_mss;
tcp_initialize_rcv_mss(newsk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index eef2d66..561a7f3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2416,6 +2416,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk,
struct dst_entry *dst,
struct tcp_md5sig_key *md5;
int tcp_header_size;
int mss;
+ int user_mss;
int s_data_desired;
rcu_read_lock();
@@ -2439,8 +2440,9 @@ struct sk_buff *tcp_make_synack(struct sock *sk,
struct dst_entry *dst,
skb_dst_set(skb, dst_clone(dst));
mss = dst_metric(dst, RTAX_ADVMSS);
- if (tp->rx_opt.user_mss && tp->rx_opt.user_mss < mss)
- mss = tp->rx_opt.user_mss;
+ user_mss = tp->rx_opt.user_mss;
+ if (user_mss && user_mss < mss)
+ mss = user_mss;
if (req->rcv_wnd == 0) { /* ignored for retransmitted syns */
__u8 rcv_wscale;
^ permalink raw reply related
* [PATCH 1/5] tcp: cookie transactions scaling
From: Dmitry Popov @ 2010-10-27 13:27 UTC (permalink / raw)
To: David S. Miller, William.Allen.Simpson, Eric Dumazet,
Andreas Petlund, Shan Wei
From: Dmitry Popov <dp@highloadlab.com>
TCPCT usage no longer needs the socket lock.
TCPCT changes via setsockopt use RCU, and each tcp_cookie_values struct
has a reference count. def_cookie_values (default cookie values) is used
to eliminate NULL pointer checks.
Signed-off-by: Dmitry Popov <dp@highloadlab.com>
---
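Note: the "default object instead of NULL" idea described above can be
sketched in userspace as below. This is a single-threaded model under
assumptions: it leaves out RCU and locking entirely, uses a plain int
where the patch uses a kref, and every name in it (cookie_values,
def_values, conn, cv_put, conn_*) is hypothetical.

```c
#include <stdlib.h>

struct cookie_values {
	int refcnt;
	unsigned char cookie_desired;
	unsigned char s_data_constant;
	unsigned char s_data_desired;
};

/* Shared all-zero default; never freed and never reference counted. */
static struct cookie_values def_values;

struct conn {
	struct cookie_values *cv;	/* never NULL after conn_init() */
};

static void conn_init(struct conn *c)
{
	c->cv = &def_values;
}

/* Drop one reference; the shared default is skipped. */
static void cv_put(struct cookie_values *cv)
{
	if (cv != &def_values && --cv->refcnt == 0)
		free(cv);
}

/* Install a freshly built object and release the old one afterwards. */
static int conn_set(struct conn *c, unsigned char desired)
{
	struct cookie_values *newcv = calloc(1, sizeof(*newcv));
	struct cookie_values *old;

	if (newcv == NULL)
		return -1;
	newcv->refcnt = 1;
	newcv->cookie_desired = desired;

	old = c->cv;
	c->cv = newcv;
	cv_put(old);
	return 0;
}

/* Go back to the shared default. */
static void conn_reset(struct conn *c)
{
	struct cookie_values *old = c->cv;

	c->cv = &def_values;
	cv_put(old);
}

/* Readers dereference unconditionally; no NULL check is needed. */
static int s_data_size(const struct conn *c)
{
	return c->cv->s_data_constant ? c->cv->s_data_desired : 0;
}
```

Because the pointer always refers to a valid object, readers stay
branch-free; the price is that every release path has to compare
against the default before dropping a reference.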
include/linux/tcp.h | 6 ++
include/net/tcp.h | 18 ++++++-
net/ipv4/tcp.c | 124 +++++++++++++++++++++++++--------------------
net/ipv4/tcp_input.c | 4 +-
net/ipv4/tcp_ipv4.c | 28 +++++++---
net/ipv4/tcp_minisocks.c | 20 ++++++--
net/ipv4/tcp_output.c | 31 +++++++++---
net/ipv6/tcp_ipv6.c | 9 ++--
8 files changed, 156 insertions(+), 84 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 806266a..3436176 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -463,6 +463,12 @@ struct tcp_sock {
/* When the cookie options are generated and exchanged, then this
* object holds a reference to them (cookie_values->kref). Also
* contains related tcp_cookie_transactions fields.
+ *
+ * Note:
+ * cookie_values are partially under RCU:
+ * only s_data_constant, s_data_desired and s_data_payload should be
+ * changed via RCU. Writers should use a socket lock.
+ * So readers which hold this lock may omit rcu_reader_lock.
*/
struct tcp_cookie_values *cookie_values;
};
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3f0dbec..08e6ce1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1498,6 +1498,7 @@ extern int tcp_cookie_generator(u32 *bakery);
* cookie option is present.
*/
struct tcp_cookie_values {
+ struct rcu_head rcu_head;
struct kref kref;
u8 cookie_pair[TCP_COOKIE_PAIR_SIZE];
u8 cookie_pair_size;
@@ -1510,18 +1511,33 @@ struct tcp_cookie_values {
u8 s_data_payload[0];
};
+extern struct tcp_cookie_values def_cookie_values;
+
static inline void tcp_cookie_values_release(struct kref *kref)
{
kfree(container_of(kref, struct tcp_cookie_values, kref));
}
+/* RCU-protected desired size of cookie
+ */
+static inline u8 rcu_tcp_cookie_desired(const struct tcp_sock *tp)
+{
+ u8 res;
+
+ rcu_read_lock();
+ res = rcu_dereference(tp->cookie_values)->cookie_desired;
+ rcu_read_unlock();
+
+ return res;
+}
+
/* The length of constant payload data. Note that s_data_desired is
* overloaded, depending on s_data_constant: either the length of constant
* data (returned here) or the limit on variable data.
*/
static inline int tcp_s_data_size(const struct tcp_sock *tp)
{
- return (tp->cookie_values != NULL && tp->cookie_values->s_data_constant)
+ return tp->cookie_values->s_data_constant
? tp->cookie_values->s_data_desired
: 0;
}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f115ea6..ebb9d80 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -309,6 +309,20 @@ struct tcp_splice_state {
};
/*
+ * Default cookie values struct
+ * Initialization is for clarity only
+ */
+struct tcp_cookie_values def_cookie_values __read_mostly = {
+ .cookie_pair_size = 0,
+ .cookie_desired = 0,
+ .s_data_desired = 0,
+ .s_data_constant = 0,
+ .s_data_in = 0,
+ .s_data_out = 0
+};
+EXPORT_SYMBOL(def_cookie_values);
+
+/*
* Pressure flag: try to collapse.
* Technical note: it is used by multiple contexts non atomically.
* All the __sk_mem_schedule() is of this nature: accounting
@@ -2145,6 +2159,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
case TCP_COOKIE_TRANSACTIONS: {
struct tcp_cookie_transactions ctd;
struct tcp_cookie_values *cvp = NULL;
+ struct kref *oldkref = NULL;
if (sizeof(ctd) > optlen)
return -EINVAL;
@@ -2166,66 +2181,63 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
if (TCP_COOKIE_OUT_NEVER & ctd.tcpct_flags) {
/* Supercedes all other values */
lock_sock(sk);
- if (tp->cookie_values != NULL) {
- kref_put(&tp->cookie_values->kref,
- tcp_cookie_values_release);
- tp->cookie_values = NULL;
+
+ if (tp->cookie_values != &def_cookie_values) {
+ oldkref = &tp->cookie_values->kref;
+ rcu_assign_pointer(tp->cookie_values,
+ &def_cookie_values);
}
+
tp->rx_opt.cookie_in_always = 0; /* false */
tp->rx_opt.cookie_out_never = 1; /* true */
release_sock(sk);
+
+ if (oldkref) {
+ synchronize_rcu();
+ kref_put(oldkref, tcp_cookie_values_release);
+ }
+
return err;
}
- /* Allocate ancillary memory before locking.
- */
- if (ctd.tcpct_used > 0 ||
- (tp->cookie_values == NULL &&
- (sysctl_tcp_cookie_size > 0 ||
- ctd.tcpct_cookie_desired > 0 ||
- ctd.tcpct_s_data_desired > 0))) {
- cvp = kzalloc(sizeof(*cvp) + ctd.tcpct_used,
- GFP_KERNEL);
- if (cvp == NULL)
- return -ENOMEM;
-
- kref_init(&cvp->kref);
+ /* Setup cvp before locking */
+ cvp = kzalloc(sizeof(*cvp) + ctd.tcpct_used,
+ GFP_KERNEL);
+ if (cvp == NULL)
+ return -ENOMEM;
+
+ kref_init(&cvp->kref);
+
+ cvp->cookie_desired = ctd.tcpct_cookie_desired;
+
+ if (ctd.tcpct_used > 0) {
+ memcpy(cvp->s_data_payload, ctd.tcpct_value,
+ ctd.tcpct_used);
+ cvp->s_data_desired = ctd.tcpct_used;
+ cvp->s_data_constant = 1; /* true */
+ } else {
+ /* No constant payload data. */
+ cvp->s_data_desired = ctd.tcpct_s_data_desired;
+ cvp->s_data_constant = 0; /* false */
}
+
lock_sock(sk);
tp->rx_opt.cookie_in_always =
(TCP_COOKIE_IN_ALWAYS & ctd.tcpct_flags);
tp->rx_opt.cookie_out_never = 0; /* false */
- if (tp->cookie_values != NULL) {
- if (cvp != NULL) {
- /* Changed values are recorded by a changed
- * pointer, ensuring the cookie will differ,
- * without separately hashing each value later.
- */
- kref_put(&tp->cookie_values->kref,
- tcp_cookie_values_release);
- } else {
- cvp = tp->cookie_values;
- }
+ if (tp->cookie_values != &def_cookie_values) {
+ /* We have to release old structure, later */
+ oldkref = &tp->cookie_values->kref;
}
- if (cvp != NULL) {
- cvp->cookie_desired = ctd.tcpct_cookie_desired;
+ rcu_assign_pointer(tp->cookie_values, cvp);
+ release_sock(sk);
+ synchronize_rcu();
- if (ctd.tcpct_used > 0) {
- memcpy(cvp->s_data_payload, ctd.tcpct_value,
- ctd.tcpct_used);
- cvp->s_data_desired = ctd.tcpct_used;
- cvp->s_data_constant = 1; /* true */
- } else {
- /* No constant payload data. */
- cvp->s_data_desired = ctd.tcpct_s_data_desired;
- cvp->s_data_constant = 0; /* false */
- }
+ if (oldkref)
+ kref_put(oldkref, tcp_cookie_values_release);
- tp->cookie_values = cvp;
- }
- release_sock(sk);
return err;
}
default:
@@ -2572,7 +2584,7 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
case TCP_COOKIE_TRANSACTIONS: {
struct tcp_cookie_transactions ctd;
- struct tcp_cookie_values *cvp = tp->cookie_values;
+ struct tcp_cookie_values *cvp;
if (get_user(len, optlen))
return -EFAULT;
@@ -2585,19 +2597,21 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
| (tp->rx_opt.cookie_out_never ?
TCP_COOKIE_OUT_NEVER : 0);
- if (cvp != NULL) {
- ctd.tcpct_flags |= (cvp->s_data_in ?
- TCP_S_DATA_IN : 0)
- | (cvp->s_data_out ?
- TCP_S_DATA_OUT : 0);
+ rcu_read_lock();
- ctd.tcpct_cookie_desired = cvp->cookie_desired;
- ctd.tcpct_s_data_desired = cvp->s_data_desired;
+ cvp = rcu_dereference(tp->cookie_values);
+ ctd.tcpct_flags |= (cvp->s_data_in ?
+ TCP_S_DATA_IN : 0)
+ | (cvp->s_data_out ?
+ TCP_S_DATA_OUT : 0);
- memcpy(&ctd.tcpct_value[0], &cvp->cookie_pair[0],
- cvp->cookie_pair_size);
- ctd.tcpct_used = cvp->cookie_pair_size;
- }
+ ctd.tcpct_cookie_desired = cvp->cookie_desired;
+ ctd.tcpct_s_data_desired = cvp->s_data_desired;
+
+ memcpy(&ctd.tcpct_value[0], &cvp->cookie_pair[0],
+ cvp->cookie_pair_size);
+ ctd.tcpct_used = cvp->cookie_pair_size;
+ rcu_read_unlock();
if (put_user(sizeof(ctd), optlen))
return -EFAULT;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b55f60f..65fe78e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5450,6 +5450,7 @@ static int tcp_rcv_synsent_state_process(struct
sock *sk, struct sk_buff *skb,
u8 *hash_location;
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
+ /* We're under socket lock, no need for RCU */
struct tcp_cookie_values *cvp = tp->cookie_values;
int saved_clamp = tp->rx_opt.mss_clamp;
@@ -5551,8 +5552,7 @@ static int tcp_rcv_synsent_state_process(struct
sock *sk, struct sk_buff *skb,
* is initialized. */
tp->copied_seq = tp->rcv_nxt;
- if (cvp != NULL &&
- cvp->cookie_pair_size > 0 &&
+ if (cvp->cookie_pair_size > 0 &&
tp->rx_opt.cookie_plus > 0) {
int cookie_size = tp->rx_opt.cookie_plus
- TCPOLEN_COOKIE_BASE;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 6068b17..3de881e 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1380,8 +1380,7 @@ int tcp_v4_conn_request(struct sock *sk, struct
sk_buff *skb)
tmp_opt.saw_tstamp &&
!tp->rx_opt.cookie_out_never &&
(sysctl_tcp_cookie_size > 0 ||
- (tp->cookie_values != NULL &&
- tp->cookie_values->cookie_desired > 0))) {
+ rcu_tcp_cookie_desired(tp) > 0)) {
u8 *c;
u32 *mess = &tmp_ext.cookie_bakery[COOKIE_DIGEST_WORDS];
int l = tmp_opt.cookie_plus - TCPOLEN_COOKIE_BASE;
@@ -1403,14 +1402,15 @@ int tcp_v4_conn_request(struct sock *sk,
struct sk_buff *skb)
#endif
tmp_ext.cookie_out_never = 0; /* false */
tmp_ext.cookie_plus = tmp_opt.cookie_plus;
+ tmp_ext.cookie_in_always = tp->rx_opt.cookie_in_always;
} else if (!tp->rx_opt.cookie_in_always) {
/* redundant indications, but ensure initialization. */
tmp_ext.cookie_out_never = 1; /* true */
tmp_ext.cookie_plus = 0;
+ tmp_ext.cookie_in_always = 0;
} else {
goto drop_and_release;
}
- tmp_ext.cookie_in_always = tp->rx_opt.cookie_in_always;
if (want_cookie && !tmp_opt.saw_tstamp)
tcp_clear_options(&tmp_opt);
@@ -1990,9 +1990,11 @@ static int tcp_v4_init_sock(struct sock *sk)
tp->cookie_values =
kzalloc(sizeof(*tp->cookie_values),
sk->sk_allocation);
- if (tp->cookie_values != NULL)
- kref_init(&tp->cookie_values->kref);
}
+ if (tp->cookie_values != NULL)
+ kref_init(&tp->cookie_values->kref);
+ else
+ tp->cookie_values = &def_cookie_values;
/* Presumed zeroed, in order of appearance:
* cookie_in_always, cookie_out_never,
* s_data_constant, s_data_in, s_data_out
@@ -2007,6 +2009,13 @@ static int tcp_v4_init_sock(struct sock *sk)
return 0;
}
+static void tcp_cookie_values_put(struct rcu_head *rcu_head)
+{
+ struct tcp_cookie_values *cvp =
+ container_of(rcu_head, struct tcp_cookie_values, rcu_head);
+ kref_put(&cvp->kref, tcp_cookie_values_release);
+}
+
void tcp_v4_destroy_sock(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
@@ -2047,10 +2056,11 @@ void tcp_v4_destroy_sock(struct sock *sk)
}
/* TCP Cookie Transactions */
- if (tp->cookie_values != NULL) {
- kref_put(&tp->cookie_values->kref,
- tcp_cookie_values_release);
- tp->cookie_values = NULL;
+ if (tp->cookie_values != &def_cookie_values) {
+ struct rcu_head *rcu_head = &tp->cookie_values->rcu_head;
+
+ rcu_assign_pointer(tp->cookie_values, &def_cookie_values);
+ call_rcu(rcu_head, tcp_cookie_values_put);
}
percpu_counter_dec(&tcp_sockets_allocated);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 599752c..d62c65a 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -375,7 +375,8 @@ struct sock *tcp_create_openreq_child(struct sock
*sk, struct request_sock *req,
struct inet_connection_sock *newicsk = inet_csk(newsk);
struct tcp_sock *newtp = tcp_sk(newsk);
struct tcp_sock *oldtp = tcp_sk(sk);
- struct tcp_cookie_values *oldcvp = oldtp->cookie_values;
+ struct tcp_cookie_values *oldcvp;
+ int s_data_size;
/* TCP Cookie Transactions require space for the cookie pair,
* as it differs for each connection. There is no need to
@@ -385,7 +386,10 @@ struct sock *tcp_create_openreq_child(struct sock
*sk, struct request_sock *req,
* Presumed copied, in order of appearance:
* cookie_in_always, cookie_out_never
*/
- if (oldcvp != NULL) {
+
+ rcu_read_lock();
+ oldcvp = rcu_dereference(oldtp->cookie_values);
+ if (oldcvp != &def_cookie_values) {
struct tcp_cookie_values *newcvp =
kzalloc(sizeof(*newtp->cookie_values),
GFP_ATOMIC);
@@ -397,9 +401,15 @@ struct sock *tcp_create_openreq_child(struct sock
*sk, struct request_sock *req,
newtp->cookie_values = newcvp;
} else {
/* Not Yet Implemented */
- newtp->cookie_values = NULL;
+ newtp->cookie_values = &def_cookie_values;
}
+ } else {
+ newtp->cookie_values = &def_cookie_values;
}
+ s_data_size = oldcvp->s_data_constant ?
+ oldcvp->s_data_desired :
+ 0;
+ rcu_read_unlock();
/* Now setup tcp_sock */
newtp->pred_flags = 0;
@@ -409,7 +419,7 @@ struct sock *tcp_create_openreq_child(struct sock
*sk, struct request_sock *req,
newtp->snd_sml = newtp->snd_una =
newtp->snd_nxt = newtp->snd_up =
- treq->snt_isn + 1 + tcp_s_data_size(oldtp);
+ treq->snt_isn + 1 + s_data_size;
tcp_prequeue_init(newtp);
@@ -443,7 +453,7 @@ struct sock *tcp_create_openreq_child(struct sock
*sk, struct request_sock *req,
tcp_init_xmit_timers(newsk);
skb_queue_head_init(&newtp->out_of_order_queue);
newtp->write_seq = newtp->pushed_seq =
- treq->snt_isn + 1 + tcp_s_data_size(oldtp);
+ treq->snt_isn + 1 + s_data_size;
newtp->rx_opt.saw_tstamp = 0;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8954453..eef2d66 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -562,11 +562,14 @@ static unsigned tcp_syn_options(struct sock *sk,
struct sk_buff *skb,
struct tcp_out_options *opts,
struct tcp_md5sig_key **md5) {
struct tcp_sock *tp = tcp_sk(sk);
+ /* As long as tcp_syn_options is called under socket lock,
+ * we don't need RCU here */
struct tcp_cookie_values *cvp = tp->cookie_values;
unsigned remaining = MAX_TCP_OPTION_SPACE;
- u8 cookie_size = (!tp->rx_opt.cookie_out_never && cvp != NULL) ?
- tcp_cookie_size_check(cvp->cookie_desired) :
- 0;
+ u8 cookie_size =
+ (!tp->rx_opt.cookie_out_never && cvp != &def_cookie_values) ?
+ tcp_cookie_size_check(cvp->cookie_desired) :
+ 0;
#ifdef CONFIG_TCP_MD5SIG
*md5 = tp->af_specific->md5_get(sk, sk);
@@ -2407,19 +2410,28 @@ struct sk_buff *tcp_make_synack(struct sock
*sk, struct dst_entry *dst,
struct tcp_extend_values *xvp = tcp_xv(rvp);
struct inet_request_sock *ireq = inet_rsk(req);
struct tcp_sock *tp = tcp_sk(sk);
- const struct tcp_cookie_values *cvp = tp->cookie_values;
+ struct tcp_cookie_values *cvp;
struct tcphdr *th;
struct sk_buff *skb;
struct tcp_md5sig_key *md5;
int tcp_header_size;
int mss;
- int s_data_desired = 0;
+ int s_data_desired;
+
+ rcu_read_lock();
+ cvp = rcu_dereference(tp->cookie_values);
+ if (cvp != &def_cookie_values)
+ kref_get(&cvp->kref);
+ rcu_read_unlock();
+
+ s_data_desired = cvp->s_data_constant ? cvp->s_data_desired : 0;
- if (cvp != NULL && cvp->s_data_constant && cvp->s_data_desired)
- s_data_desired = cvp->s_data_desired;
skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15 + s_data_desired, 1, GFP_ATOMIC);
- if (skb == NULL)
+ if (skb == NULL) {
+ if (cvp != &def_cookie_values)
+ kref_put(&cvp->kref, tcp_cookie_values_release);
return NULL;
+ }
/* Reserve space for headers. */
skb_reserve(skb, MAX_TCP_HEADER);
@@ -2506,6 +2518,9 @@ struct sk_buff *tcp_make_synack(struct sock *sk,
struct dst_entry *dst,
}
}
+ if (cvp != &def_cookie_values)
+ kref_put(&cvp->kref, tcp_cookie_values_release);
+
th->seq = htonl(TCP_SKB_CB(skb)->seq);
th->ack_seq = htonl(tcp_rsk(req)->rcv_isn + 1);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 80d2d20..a3fa1f9 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1306,8 +1306,7 @@ static int tcp_v6_conn_request(struct sock *sk,
struct sk_buff *skb)
tmp_opt.saw_tstamp &&
!tp->rx_opt.cookie_out_never &&
(sysctl_tcp_cookie_size > 0 ||
- (tp->cookie_values != NULL &&
- tp->cookie_values->cookie_desired > 0))) {
+ rcu_tcp_cookie_desired(tp) > 0)) {
u8 *c;
u32 *d;
u32 *mess = &tmp_ext.cookie_bakery[COOKIE_DIGEST_WORDS];
@@ -2012,9 +2011,11 @@ static int tcp_v6_init_sock(struct sock *sk)
tp->cookie_values =
kzalloc(sizeof(*tp->cookie_values),
sk->sk_allocation);
- if (tp->cookie_values != NULL)
- kref_init(&tp->cookie_values->kref);
}
+ if (tp->cookie_values != NULL)
+ kref_init(&tp->cookie_values->kref);
+ else
+ tp->cookie_values = &def_cookie_values;
/* Presumed zeroed, in order of appearance:
* cookie_in_always, cookie_out_never,
* s_data_constant, s_data_in, s_data_out
^ permalink raw reply related
* [PATCH 0/5] tcp: ipv4 listen state scaling
From: Dmitry Popov @ 2010-10-27 13:23 UTC (permalink / raw)
To: David S. Miller, William.Allen.Simpson, Eric Dumazet,
Andreas Petlund, Shan Wei
*Note: this patch depends on "[PATCH] tcp: md5 signature check scaling"*
Hi.
The problem with the current TCP stack implementation is that it locks
the socket in tcp_v4_rcv on each incoming packet. However, with syncookies
incoming SYNs may be processed in parallel, which helps a lot under a
SYN flood. This proposed patch series fixes the problem, but (for now)
for IPv4 only.
6-core Xeon with 6 RX queues on NIC:
without patch: 530k syn-pkts/s;
with patch: 1620k syn-pkts/s.
Discussion?
^ permalink raw reply
* Re: [PATCH] tcp: md5 signature check scaling
From: Dmitry Popov @ 2010-10-27 13:18 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, William.Allen.Simpson, Andreas Petlund,
Ilpo Järvinen, Alexey Kuznetsov, Pekka Savola (ipv6),
James Morris, Hideaki YOSHIFUJI, Patrick McHardy,
Stephen Hemminger, Herbert Xu, Gilad Ben-Yossef, Yony Amit,
linux-kernel, netdev, Artyom Gavrichenkov
In-Reply-To: <1288184405.2709.108.camel@edumazet-laptop>
On Wed, Oct 27, 2010 at 5:00 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> This is a huge patch :(
>
> Reading changelog, I don't understand what you did, and why you did this.
>
> You want to avoid taking the socket lock ? But we need to take it anyway
> to process packets.
Hi.
Well, I removed the dependence on the socket lock from the md5* functions.
Yes, we need to take it to process packets, but sockets in LISTEN state may
process them without the socket lock (patch coming soon). And I find md5
signature check scaling interesting even without the LISTEN state scaling
patch.
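To illustrate the per-key reference counting idea, here is a minimal
userspace sketch. It is not the kernel code from the patch: an atomic
counter stands in for struct kref, the RCU protection of the lookup
itself is left out, and all demo_* names are made up for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for a reference-counted MD5 key. */
struct demo_key {
	atomic_int refs;          /* plays the role of struct kref */
	unsigned int keylen;
	unsigned char key[];      /* key bytes stored inline */
};

/* Counter used only to observe frees in this example (not thread safe). */
static int demo_keys_freed;

/* Allocate a key holding one reference (owned by the key table). */
static struct demo_key *demo_key_new(const void *data, unsigned int len)
{
	struct demo_key *k = malloc(sizeof(*k) + len);

	if (!k)
		return NULL;
	atomic_init(&k->refs, 1);
	k->keylen = len;
	memcpy(k->key, data, len);
	return k;
}

/* Take an extra reference so the key outlives removal from the table. */
static struct demo_key *demo_key_get(struct demo_key *k)
{
	if (k)
		atomic_fetch_add(&k->refs, 1);
	return k;
}

/* Drop a reference; the last holder frees the key. */
static void demo_key_put(struct demo_key *k)
{
	int old = atomic_fetch_sub(&k->refs, 1);

	assert(old > 0);
	if (old == 1) {
		free(k);
		demo_keys_freed++;
	}
}
```

A user that called demo_key_get() can keep hashing with the key even if
the table drops its own reference in the meantime; the memory is released
only by the last demo_key_put().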
Regards,
Dmitry.
^ permalink raw reply
* Re: [PATCH net-next-2.6 v2] can: Topcliff: PCH_CAN driver: Fix buildwarnings
From: Marc Kleine-Budde @ 2010-10-27 13:14 UTC (permalink / raw)
To: Tomoya MORINAGA
Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w,
masa-korg-ECg8zkTtlr0C6LszWs/t0g, sameo-VuQAYsv1563Yd54FQh9/CA,
margie.foster-ral2JQCrhuEAvxtiuMwx3w,
netdev-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
yong.y.wang-ral2JQCrhuEAvxtiuMwx3w,
socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w, chripell-VaTbYqLCNhc,
morinaga526-ECg8zkTtlr0C6LszWs/t0g,
joel.clark-ral2JQCrhuEAvxtiuMwx3w, David Miller,
Wolfgang Grandegger, qi.wang-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <008201cb75c9$f27ff720$66f8800a-a06+6cuVnkTSQfdrb5gaxUEOCMrvLtNR@public.gmane.org>
On 10/27/2010 01:27 PM, Tomoya MORINAGA wrote:
> On Wednesday, October 27, 2010 3:52 AM : Marc Kleine-Budde and Wolfgang Grandegge wrote:
>> Do I understand your code correctly? You have a big loop, but only do
>> two different things at certain values of the loop? Smells fishy.
> Uh, I can't understand your intention.
> Please show in detail.
It's easier to talk about code when we can see it, please don't delete it :)
>> +static void pch_can_config_rx_tx_buffers(struct pch_can_priv *priv)
>> > +{
>> > + int i;
>> > + unsigned long flags;
>> > +
>> > + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
>> > +
>> > + for (i = 0; i < PCH_OBJ_NUM; i++) {
>> > + if (priv->msg_obj[i] == MSG_OBJ_RX) {
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> > + iowrite32(CAN_CMASK_RX_TX_GET,
>> > + &priv->regs->if1_cmask);
>> > + pch_can_check_if_busy(&priv->regs->if1_creq, i+1);
>> > +
>> > + iowrite32(0x0, &priv->regs->if1_id1);
>> > + iowrite32(0x0, &priv->regs->if1_id2);
>> > +
>> > + pch_can_bit_set(&priv->regs->if1_mcont,
>> > + CAN_IF_MCONT_UMASK);
>> > +
>> > + /* Set FIFO mode set to 0 except last Rx Obj*/
>> > + pch_can_bit_clear(&priv->regs->if1_mcont,
>> > + CAN_IF_MCONT_EOB);
>> > + /* In case FIFO mode, Last EoB of Rx Obj must be 1 */
>> > + if (i == (PCH_RX_OBJ_NUM - 1))
>> > + pch_can_bit_set(&priv->regs->if1_mcont,
>> > + CAN_IF_MCONT_EOB);
>> > +
>> > + iowrite32(0, &priv->regs->if1_mask1);
>> > + pch_can_bit_clear(&priv->regs->if1_mask2,
>> > + 0x1fff | CAN_MASK2_MDIR_MXTD);
>> > +
>> > + /* Setting CMASK for writing */
>> > + iowrite32(CAN_CMASK_RDWR | CAN_CMASK_MASK |
>> > + CAN_CMASK_ARB | CAN_CMASK_CTRL,
>> > + &priv->regs->if1_cmask);
>> > +
>> > + pch_can_check_if_busy(&priv->regs->if1_creq, i+1);
>> > + } else if (priv->msg_obj[i] == MSG_OBJ_TX) {
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Do I understand your code correctly? You have a big loop, but only do
> two different things at certain values of the loop? Smells fishy.
Looking again at the code it makes sense as it is :) Sorry for the
confusion.
>> > + iowrite32(CAN_CMASK_RX_TX_GET,
>> > + &priv->regs->if2_cmask);
>> > + pch_can_check_if_busy(&priv->regs->if2_creq, i+1);
>> > +
>> > + /* Resetting DIR bit for reception */
>> > + iowrite32(0x0, &priv->regs->if2_id1);
>> > + iowrite32(0x0, &priv->regs->if2_id2);
>> > + pch_can_bit_set(&priv->regs->if2_id2, CAN_ID2_DIR);
>> > +
>> > + /* Setting EOB bit for transmitter */
>> > + iowrite32(CAN_IF_MCONT_EOB, &priv->regs->if2_mcont);
>> > +
>> > + pch_can_bit_set(&priv->regs->if2_mcont,
>> > + CAN_IF_MCONT_UMASK);
>> > +
>> > + iowrite32(0, &priv->regs->if2_mask1);
>> > + pch_can_bit_clear(&priv->regs->if2_mask2, 0x1fff);
>> > +
>> > + /* Setting CMASK for writing */
>> > + iowrite32(CAN_CMASK_RDWR | CAN_CMASK_MASK |
>> > + CAN_CMASK_ARB | CAN_CMASK_CTRL,
>> > + &priv->regs->if2_cmask);
>> > +
>> > + pch_can_check_if_busy(&priv->regs->if2_creq, i+1);
>> > + }
>> > + }
>> > + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
>> > +}
> This processing does configuration for all message objects.
Yeah, got it. However, I think you can get rid of the priv->msg_obj
variable altogether. Let me recapitulate:
- you set up priv->msg_obj[] in the probe function, which defines whether a
msg_obj is rx or tx
- this definition is never changed
- all objects of one kind are in a row
So you can identify the purpose of a msg_obj simply by looking at its
number. If you need to loop over them, you can even define helper
functions like for_each_rx_obj().
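A rough sketch of what such helpers could look like, assuming the RX
objects occupy the first indices and the TX objects follow directly
after them. The DEMO_* counts and demo_* names are placeholders for
illustration, not values taken from the driver or the datasheet.

```c
#include <assert.h>

/* Placeholder counts; the real driver would use its own constants. */
#define DEMO_RX_OBJ_NUM 26
#define DEMO_TX_OBJ_NUM 6
#define DEMO_OBJ_NUM    (DEMO_RX_OBJ_NUM + DEMO_TX_OBJ_NUM)

/* Iterate over the 0-based indices of all RX message objects. */
#define demo_for_each_rx_obj(i) \
	for ((i) = 0; (i) < DEMO_RX_OBJ_NUM; (i)++)

/* Iterate over the 0-based indices of all TX message objects. */
#define demo_for_each_tx_obj(i) \
	for ((i) = DEMO_RX_OBJ_NUM; (i) < DEMO_OBJ_NUM; (i)++)

/* The kind of an object follows from its index alone. */
static inline int demo_obj_is_rx(int i)
{
	return i >= 0 && i < DEMO_RX_OBJ_NUM;
}

static inline int demo_obj_is_tx(int i)
{
	return i >= DEMO_RX_OBJ_NUM && i < DEMO_OBJ_NUM;
}
```

With helpers of this shape the per-object type array becomes
unnecessary, because the index range already encodes whether an object
is used for reception or transmission.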
>> what does this loop do? why is it necessary? I don't like delay loops
>> in the hot path of a driver.
> This loop is for waiting for all tx Message Object completion.
> This is Topcliff CAN HW specification.
Can you give us a pointer to Intel's documentation?
I think Wolfgang already suggested checking if the chip is busy _before_
accessing it instead of waiting for the chip to finish after accessing it.
>> If you figured out how to use the endianness conversion functions from
>> the cpu_to_{le,be}/{le,be}_to_cpup family use them here, too.
> Uh, le32_to_cpu has already been used here.
Let's look at the code:
>> + for (i = 0, j = 0; i < cf->can_dlc; j++) {
>> > + reg = ioread32(&priv->regs->if1_dataa1 + j*4);
>> > + cf->data[i++] = cpu_to_le32(reg & 0xff);
>> > + if (i == cf->can_dlc)
>> > + break;
>> > + cf->data[i++] = cpu_to_le32((reg >> 8) & 0xff);
>> > + }
What does the code do? It swaps bytes because the data bytes in the CAN
core are arranged differently from the data in the struct can_frame.
According to the datasheet, if_dataa1 holds the 1st byte in bits 07:00 and
the 2nd byte in bits 15:08. (The rest is reserved.) So in memory it looks
like this:
xx xx byte1 byte0
The can_frame has a different layout:
__u8 data[8] __attribute__((aligned(8)));
which is in memory:
byte0 byte1 byte2 byte3 byte4 byte5 byte6 byte7
This is why you swap. However, in Linux there is no need to do this by hand.
The if_dataXX registers have a little endian layout, while the can frame has
a big endian layout. Furthermore, each if_dataXX holds only 16 bits of CAN data.
I think it should look like this:
for (i = 0; i < cf->can_dlc; i += 2) {
reg = ioread32(&priv->regs->if1_data[i >> 1]);
*(__be16 *)&cf->data[i] = cpu_to_be16(reg);
}
You have to change the definition of the regs struct a bit:
> u32 if1_mcont;
> u32 if1_data[4];
> u32 reserve2;
Totally untested, though.
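For reference, the register layout quoted from the datasheet above
(first byte in bits 07:00, second byte in bits 15:08) can also be
unpacked with explicit shifts, which is independent of host byte order.
This is a plain userspace sketch with made-up names, not driver code,
and it has not been run against the hardware either. Whichever
conversion helper ends up in the driver should yield the same byte
order as this loop.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Copy dlc payload bytes out of an array of data registers.
 * Assumed layout: regs[n] carries payload byte 2n in bits 7:0 and
 * payload byte 2n+1 in bits 15:8; the upper 16 bits are ignored.
 */
static void demo_unpack_data(const uint32_t *regs, uint8_t *data,
			     unsigned int dlc)
{
	unsigned int i;

	for (i = 0; i < dlc; i++) {
		uint32_t reg = regs[i / 2];

		/* Even payload index: low byte. Odd index: next byte up. */
		data[i] = (i & 1) ? (reg >> 8) & 0xff : reg & 0xff;
	}
}
```

Because it stops exactly at dlc, the loop never touches payload bytes
beyond the frame length, even for odd lengths.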
BTW: Where can I get this Intel Hardware to improve and test the driver?
> I can't understand your intention.
> Please show in detail.
Above we have the RX path; the TX path would probably use
"be16_to_cpup". Have a look at the flexcan driver. It uses the whole 32
bits for CAN data, though.
>>> All these busy checks in the code make me a bit nervous, can you
>>> please explain why they are needed? A pointer to the manual is okay, too.
>> Me too. I already asked in my previous mail how long that function
>> usually blocks.
> When reading from or writing to Message RAM, SW must check the busy flag
> of the CAN register, because transfers between the registers and Message
> RAM take a long time.
> This is a Topcliff HW specification.
see above.
>> is there some pdev->name instead of KBUILD_MODNAME that can be used?
> I can't understand your intention.
> pdev(struct pci_dev) doesn't have "name" member.
I was just asking :) As it doesn't have a name, KBUILD_MODNAME is fine.
regards,
Marc
--
Pengutronix e.K. | Marc Kleine-Budde |
Industrial Linux Solutions | Phone: +49-231-2826-924 |
Vertretung West/Dortmund | Fax: +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de |
^ permalink raw reply
* Re: [BUG net-2.6 vlan/bonding] lockdep splats
From: Eric Dumazet @ 2010-10-27 13:00 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: netdev, David Miller, Jesse Gross
In-Reply-To: <20101027120334.GA11247@ff.dom.local>
Le mercredi 27 octobre 2010 à 12:03 +0000, Jarek Poplawski a écrit :
> Seems to be even older. Could you try this patch?
Indeed this is the right fix, I wonder why I did not catch it before ?
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
^ permalink raw reply
* Re: [PATCH] tcp: md5 signature check scaling
From: Eric Dumazet @ 2010-10-27 13:00 UTC (permalink / raw)
To: Dmitry Popov
Cc: David S. Miller, William.Allen.Simpson, Andreas Petlund,
Ilpo Järvinen, Alexey Kuznetsov, Pekka Savola (ipv6),
James Morris, Hideaki YOSHIFUJI, Patrick McHardy,
Stephen Hemminger, Herbert Xu, Gilad Ben-Yossef, Yony Amit,
Zhu Yi, linux-kernel, netdev
In-Reply-To: <AANLkTi=U1smX6XXDMyBTicyqbUU5V-t56jmH7qtX2XW5@mail.gmail.com>
Le mercredi 27 octobre 2010 à 16:52 +0400, Dmitry Popov a écrit :
> From: Dmitry Popov <dp@highloadlab.com>
>
> TCP MD5 signature checking without socket lock.
>
> Each tcp_sock has 2 RCU-protected arrays (tcp[46]_md5sig_info) of
> tcp[46]_md5sig_key address-key pairs.
> Each key (tcp_md5sig_key) has kref struct so that there is no need to
> lock the whole array to work with one key.
>
> MD5 functions were rewritten according to above statement and hash
> check (tcp_v4_inbound_md5_hash) was moved before socket lock.
>
> Signed-off-by: Dmitry Popov <dp@highloadlab.com>
> ---
> include/linux/tcp.h | 14 ++-
> include/net/tcp.h | 82 +++++++----
> net/ipv4/tcp_ipv4.c | 370 ++++++++++++++++++++++++++++------------------
> net/ipv4/tcp_minisocks.c | 26 +--
> net/ipv4/tcp_output.c | 12 +-
> net/ipv6/tcp_ipv6.c | 358 +++++++++++++++++++++++++++-----------------
> 6 files changed, 531 insertions(+), 331 deletions(-)
This is a huge patch :(
Reading changelog, I don't understand what you did, and why you did this.
You want to avoid taking the socket lock ? But we need to take it anyway
to process packets.
^ permalink raw reply
* [PATCH] tcp: md5 signature check scaling
From: Dmitry Popov @ 2010-10-27 12:52 UTC (permalink / raw)
To: David S. Miller, William.Allen.Simpson, Eric Dumazet,
Andreas Petlund
From: Dmitry Popov <dp@highloadlab.com>
TCP MD5 signature checking without the socket lock.
Each tcp_sock has 2 RCU-protected arrays (tcp[46]_md5sig_info) of
tcp[46]_md5sig_key address-key pairs.
Each key (tcp_md5sig_key) has a kref, so there is no need to
lock the whole array to work with one key.
The MD5 functions were rewritten accordingly, and the hash
check (tcp_v4_inbound_md5_hash) was moved before the socket lock is taken.
Signed-off-by: Dmitry Popov <dp@highloadlab.com>
---
include/linux/tcp.h | 14 ++-
include/net/tcp.h | 82 +++++++----
net/ipv4/tcp_ipv4.c | 370 ++++++++++++++++++++++++++++------------------
net/ipv4/tcp_minisocks.c | 26 +--
net/ipv4/tcp_output.c | 12 +-
net/ipv6/tcp_ipv6.c | 358 +++++++++++++++++++++++++++-----------------
6 files changed, 531 insertions(+), 331 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index a778ee0..806266a 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -450,8 +450,14 @@ struct tcp_sock {
/* TCP AF-Specific parts; only used by MD5 Signature support so far */
const struct tcp_sock_af_ops *af_specific;
-/* TCP MD5 Signature Option information */
- struct tcp_md5sig_info *md5sig_info;
+/* IPV4 TCP MD5 Signature Option information */
+ struct tcp4_md5sig_info *md5sig_info4;
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+
+/* IPV6 TCP MD5 Signature Option information */
+ struct tcp6_md5sig_info *md5sig_info6;
+#endif
+
#endif
/* When the cookie options are generated and exchanged, then this
@@ -474,8 +480,8 @@ struct tcp_timewait_sock {
u32 tw_ts_recent;
long tw_ts_recent_stamp;
#ifdef CONFIG_TCP_MD5SIG
- u16 tw_md5_keylen;
- u8 tw_md5_key[TCP_MD5SIG_MAXKEYLEN];
+ /* MD5 key from parent socket */
+ struct tcp_md5sig_key *tw_md5sig_key;
#endif
/* Few sockets in timewait have cookies; in that case, then this
* object holds a reference to them (tw_cookie_values->kref).
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3e4b33e..3f0dbec 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1092,33 +1092,46 @@ struct crypto_hash;
/* - key database */
struct tcp_md5sig_key {
- u8 *key;
- u8 keylen;
+ struct kref kref;
+ /* Actually we need only 1 byte for keylen,
+ * but we want key to be aligned
+ */
+ u32 keylen;
+ u8 key[0];
};
struct tcp4_md5sig_key {
- struct tcp_md5sig_key base;
+ struct tcp_md5sig_key *base;
__be32 addr;
+#ifdef CONFIG_64BIT
+ u32 unused;
+#endif
};
struct tcp6_md5sig_key {
- struct tcp_md5sig_key base;
+ struct tcp_md5sig_key *base;
#if 0
u32 scope_id; /* XXX */
#endif
struct in6_addr addr;
};
-/* - sock block */
-struct tcp_md5sig_info {
- struct tcp4_md5sig_key *keys4;
-#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
- struct tcp6_md5sig_key *keys6;
- u32 entries6;
- u32 alloced6;
+struct tcp4_md5sig_info {
+ u32 entries;
+#ifdef CONFIG_64BIT
+ u32 unused;
+#endif
+ struct rcu_head rcu_head;
+ struct tcp4_md5sig_key keys[0];
+};
+
+struct tcp6_md5sig_info {
+ u32 entries;
+#ifdef CONFIG_64BIT
+ u32 unused;
#endif
- u32 entries4;
- u32 alloced4;
+ struct rcu_head rcu_head;
+ struct tcp6_md5sig_key keys[0];
};
/* - pseudo header */
@@ -1153,21 +1166,29 @@ struct tcp_md5sig_pool {
#define TCP_MD5SIG_MAXKEYS (~(u32)0) /* really?! */
/* - functions */
-extern int tcp_v4_md5_hash_skb(char *md5_hash, struct tcp_md5sig_key *key,
- struct sock *sk, struct request_sock *req,
- struct sk_buff *skb);
-extern struct tcp_md5sig_key * tcp_v4_md5_lookup(struct sock *sk,
- struct sock *addr_sk);
-extern int tcp_v4_md5_do_add(struct sock *sk, __be32 addr, u8 *newkey,
- u8 newkeylen);
-extern int tcp_v4_md5_do_del(struct sock *sk, __be32 addr);
+extern int tcp_v4_md5_hash_skb(char *md5_hash,
+ struct tcp_md5sig_key *key,
+ struct sock *sk,
+ struct request_sock *req,
+ struct sk_buff *skb);
+
+extern struct tcp_md5sig_key *tcp_v4_md5_lookup(struct sock *sk,
+ struct sock *addr_sk);
+
+extern struct tcp_md5sig_key *tcp_v4_md5_get(struct sock *sk,
+ struct sock *addr_sk);
+
+extern void tcp_md5_put(struct tcp_md5sig_key *key);
+
+extern int tcp_v4_md5_do_add(struct sock *sk,
+ __be32 addr,
+ struct tcp_md5sig_key *key);
+
+extern int tcp_v4_md5_do_del(struct sock *sk,
+ __be32 addr);
#ifdef CONFIG_TCP_MD5SIG
-#define tcp_twsk_md5_key(twsk) ((twsk)->tw_md5_keylen ? \
- &(struct tcp_md5sig_key) { \
- .key = (twsk)->tw_md5_key, \
- .keylen = (twsk)->tw_md5_keylen, \
- } : NULL)
+#define tcp_twsk_md5_key(twsk) ((twsk)->tw_md5sig_key)
#else
#define tcp_twsk_md5_key(twsk) NULL
#endif
@@ -1413,6 +1434,9 @@ struct tcp_sock_af_ops {
#ifdef CONFIG_TCP_MD5SIG
struct tcp_md5sig_key *(*md5_lookup) (struct sock *sk,
struct sock *addr_sk);
+ struct tcp_md5sig_key *(*md5_get) (struct sock *sk,
+ struct sock *addr_sk);
+ void (*md5_put) (struct tcp_md5sig_key *key);
int (*calc_md5_hash) (char *location,
struct tcp_md5sig_key *md5,
struct sock *sk,
@@ -1420,8 +1444,7 @@ struct tcp_sock_af_ops {
struct sk_buff *skb);
int (*md5_add) (struct sock *sk,
struct sock *addr_sk,
- u8 *newkey,
- u8 len);
+ struct tcp_md5sig_key *key);
int (*md5_parse) (struct sock *sk,
char __user *optval,
int optlen);
@@ -1432,6 +1455,9 @@ struct tcp_request_sock_ops {
#ifdef CONFIG_TCP_MD5SIG
struct tcp_md5sig_key *(*md5_lookup) (struct sock *sk,
struct request_sock *req);
+ struct tcp_md5sig_key *(*md5_get) (struct sock *sk,
+ struct request_sock *req);
+ void (*md5_put) (struct tcp_md5sig_key *key);
int (*calc_md5_hash) (char *location,
struct tcp_md5sig_key *md5,
struct sock *sk,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 0207662..6068b17 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -88,13 +88,13 @@ EXPORT_SYMBOL(sysctl_tcp_low_latency);
#ifdef CONFIG_TCP_MD5SIG
-static struct tcp_md5sig_key *tcp_v4_md5_do_lookup(struct sock *sk,
- __be32 addr);
+static struct tcp_md5sig_key *tcp_v4_md5_do_get(struct sock *sk,
+ __be32 addr);
static int tcp_v4_md5_hash_hdr(char *md5_hash, struct tcp_md5sig_key *key,
__be32 daddr, __be32 saddr, struct tcphdr *th);
#else
static inline
-struct tcp_md5sig_key *tcp_v4_md5_do_lookup(struct sock *sk, __be32 addr)
+struct tcp_md5sig_key *tcp_v4_md5_do_get(struct sock *sk, __be32 addr)
{
return NULL;
}
@@ -621,7 +621,7 @@ static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
arg.iov[0].iov_len = sizeof(rep.th);
#ifdef CONFIG_TCP_MD5SIG
- key = sk ? tcp_v4_md5_do_lookup(sk, ip_hdr(skb)->daddr) : NULL;
+ key = sk ? tcp_v4_md5_do_get(sk, ip_hdr(skb)->daddr) : NULL;
if (key) {
rep.opt[0] = htonl((TCPOPT_NOP << 24) |
(TCPOPT_NOP << 16) |
@@ -634,6 +634,7 @@ static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
tcp_v4_md5_hash_hdr((__u8 *) &rep.opt[1],
key, ip_hdr(skb)->saddr,
ip_hdr(skb)->daddr, &rep.th);
+ tcp_md5_put(key);
}
#endif
arg.csum = csum_tcpudp_nofold(ip_hdr(skb)->daddr,
@@ -743,12 +744,17 @@ static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
struct request_sock *req)
{
+ struct tcp_md5sig_key *key = tcp_v4_md5_do_get(sk, ip_hdr(skb)->daddr);
+
tcp_v4_send_ack(skb, tcp_rsk(req)->snt_isn + 1,
tcp_rsk(req)->rcv_isn + 1, req->rcv_wnd,
req->ts_recent,
0,
- tcp_v4_md5_do_lookup(sk, ip_hdr(skb)->daddr),
+ key,
inet_rsk(req)->no_srccheck ? IP_REPLY_ARG_NOSRCCHECK : 0);
+
+ if (key)
+ tcp_md5_put(key);
}
/*
@@ -844,20 +850,38 @@ static struct ip_options *tcp_v4_save_options(struct sock *sk,
/* Find the Key structure for an address. */
static struct tcp_md5sig_key *
- tcp_v4_md5_do_lookup(struct sock *sk, __be32 addr)
+ __tcp_v4_md5_do_lookup(struct sock *sk, __be32 addr)
{
struct tcp_sock *tp = tcp_sk(sk);
+ struct tcp4_md5sig_info *md5_info = rcu_dereference(tp->md5sig_info4);
int i;
- if (!tp->md5sig_info || !tp->md5sig_info->entries4)
+ if (!md5_info)
return NULL;
- for (i = 0; i < tp->md5sig_info->entries4; i++) {
- if (tp->md5sig_info->keys4[i].addr == addr)
- return &tp->md5sig_info->keys4[i].base;
+
+ for (i = 0; i < md5_info->entries; i++) {
+ if (md5_info->keys[i].addr == addr)
+ return md5_info->keys[i].base;
}
return NULL;
}
+static struct tcp_md5sig_key *
+ tcp_v4_md5_do_lookup(struct sock *sk, __be32 addr)
+{
+ struct tcp_md5sig_key *res;
+
+ /* Short path */
+ if (!tcp_sk(sk)->md5sig_info4)
+ return NULL;
+
+ rcu_read_lock();
+ res = __tcp_v4_md5_do_lookup(sk, addr);
+ rcu_read_unlock();
+
+ return res;
+}
+
struct tcp_md5sig_key *tcp_v4_md5_lookup(struct sock *sk,
struct sock *addr_sk)
{
@@ -871,96 +895,171 @@ static struct tcp_md5sig_key *tcp_v4_reqsk_md5_lookup(struct sock *sk,
return tcp_v4_md5_do_lookup(sk, inet_rsk(req)->rmt_addr);
}
+/* Find and lock the Key structure for an address. */
+static struct tcp_md5sig_key *
+ tcp_v4_md5_do_get(struct sock *sk, __be32 addr)
+{
+ struct tcp_md5sig_key *res;
+
+ /* Short path */
+ if (!tcp_sk(sk)->md5sig_info4)
+ return NULL;
+
+ rcu_read_lock();
+ res = __tcp_v4_md5_do_lookup(sk, addr);
+ if (res)
+ kref_get(&res->kref);
+ rcu_read_unlock();
+
+ return res;
+}
+
+struct tcp_md5sig_key *tcp_v4_md5_get(struct sock *sk,
+ struct sock *addr_sk)
+{
+ return tcp_v4_md5_do_get(sk, inet_sk(addr_sk)->inet_daddr);
+}
+EXPORT_SYMBOL(tcp_v4_md5_get);
+
+static struct tcp_md5sig_key *tcp_v4_reqsk_md5_get(struct sock *sk,
+ struct request_sock *req)
+{
+ return tcp_v4_md5_do_get(sk, inet_rsk(req)->rmt_addr);
+}
+
+static void md5sig_key_release(struct kref *kref)
+{
+ kfree(container_of(kref, struct tcp_md5sig_key, kref));
+ tcp_free_md5sig_pool();
+}
+
+/* Put md5sig key */
+void tcp_md5_put(struct tcp_md5sig_key *key)
+{
+ kref_put(&key->kref, md5sig_key_release);
+}
+EXPORT_SYMBOL(tcp_md5_put);
+
/* This can be called on a newly created socket, from other files */
int tcp_v4_md5_do_add(struct sock *sk, __be32 addr,
- u8 *newkey, u8 newkeylen)
+ struct tcp_md5sig_key *key)
{
/* Add Key to the list */
- struct tcp_md5sig_key *key;
struct tcp_sock *tp = tcp_sk(sk);
- struct tcp4_md5sig_key *keys;
+ struct tcp4_md5sig_info *old_info = NULL;
+ struct tcp4_md5sig_info *new_info;
+ int place = 0;
+ u32 entries;
+
+ if (tp->md5sig_info4) {
+ old_info = tp->md5sig_info4;
+ /* Check if we have to replace old key */
+ for (; place < old_info->entries; place++) {
+ if (old_info->keys[place].addr == addr)
+ break;
+ }
+ } else {
+ sk_nocaps_add(sk, NETIF_F_GSO_MASK);
+ }
- key = tcp_v4_md5_do_lookup(sk, addr);
- if (key) {
- /* Pre-existing entry - just update that one. */
- kfree(key->key);
- key->key = newkey;
- key->keylen = newkeylen;
+ /* Number of entries in new_info */
+ if (old_info) {
+ entries = old_info->entries;
+ if (place == old_info->entries)
+ ++entries;
} else {
- struct tcp_md5sig_info *md5sig;
-
- if (!tp->md5sig_info) {
- tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info),
- GFP_ATOMIC);
- if (!tp->md5sig_info) {
- kfree(newkey);
- return -ENOMEM;
- }
- sk_nocaps_add(sk, NETIF_F_GSO_MASK);
- }
- if (tcp_alloc_md5sig_pool(sk) == NULL) {
- kfree(newkey);
- return -ENOMEM;
- }
- md5sig = tp->md5sig_info;
-
- if (md5sig->alloced4 == md5sig->entries4) {
- keys = kmalloc((sizeof(*keys) *
- (md5sig->entries4 + 1)), GFP_ATOMIC);
- if (!keys) {
- kfree(newkey);
- tcp_free_md5sig_pool();
- return -ENOMEM;
- }
+ entries = 1;
+ }
- if (md5sig->entries4)
- memcpy(keys, md5sig->keys4,
- sizeof(*keys) * md5sig->entries4);
+ new_info = kmalloc(sizeof(*new_info) +
+ sizeof(new_info->keys[0]) * entries,
+ GFP_ATOMIC);
- /* Free old key list, and reference new one */
- kfree(md5sig->keys4);
- md5sig->keys4 = keys;
- md5sig->alloced4++;
- }
- md5sig->entries4++;
- md5sig->keys4[md5sig->entries4 - 1].addr = addr;
- md5sig->keys4[md5sig->entries4 - 1].base.key = newkey;
- md5sig->keys4[md5sig->entries4 - 1].base.keylen = newkeylen;
+ if (!new_info) {
+ tcp_md5_put(key);
+ return -ENOMEM;
+ }
+
+ new_info->entries = entries;
+
+ if (old_info)
+ memcpy(new_info->keys, old_info->keys,
+ old_info->entries * sizeof(old_info->keys[0]));
+
+ new_info->keys[place].addr = addr;
+ new_info->keys[place].base = key;
+ rcu_assign_pointer(tp->md5sig_info4, new_info);
+
+ /* This function may be called from setsockopt (synchronize_rcu is ok)
+ * or on a newly created socket (old_info == NULL)
+ */
+ if (old_info) {
+ synchronize_rcu();
+ if (place != old_info->entries) /* Put old key */
+ tcp_md5_put(old_info->keys[place].base);
+ kfree(old_info);
}
+
return 0;
}
EXPORT_SYMBOL(tcp_v4_md5_do_add);
static int tcp_v4_md5_add_func(struct sock *sk, struct sock *addr_sk,
- u8 *newkey, u8 newkeylen)
+ struct tcp_md5sig_key *key)
+{
+ return tcp_v4_md5_do_add(sk, inet_sk(addr_sk)->inet_daddr, key);
+}
+
+static int
+ tcp_v4_md5_do_del_ith(struct tcp4_md5sig_info **new_info,
+ struct tcp4_md5sig_info *old_info,
+ int i)
{
- return tcp_v4_md5_do_add(sk, inet_sk(addr_sk)->inet_daddr,
- newkey, newkeylen);
+ struct tcp4_md5sig_info *res_info = NULL;
+
+ if (old_info->entries > 1) {
+ res_info = kmalloc(sizeof(*res_info) +
+ sizeof(res_info->keys[0]) *
+ (old_info->entries - 1),
+ GFP_ATOMIC);
+ if (!res_info)
+ return -ENOMEM;
+ res_info->entries = old_info->entries - 1;
+
+ memcpy(res_info->keys,
+ old_info->keys,
+ i * sizeof(res_info->keys[0]));
+ memcpy(&res_info->keys[i],
+ &old_info->keys[i + 1],
+ (res_info->entries - i) * sizeof(res_info->keys[0]));
+ }
+
+ *new_info = res_info;
+ return 0;
}
int tcp_v4_md5_do_del(struct sock *sk, __be32 addr)
{
struct tcp_sock *tp = tcp_sk(sk);
+ struct tcp4_md5sig_info *old_info = tp->md5sig_info4;
int i;
- for (i = 0; i < tp->md5sig_info->entries4; i++) {
- if (tp->md5sig_info->keys4[i].addr == addr) {
- /* Free the key */
- kfree(tp->md5sig_info->keys4[i].base.key);
- tp->md5sig_info->entries4--;
-
- if (tp->md5sig_info->entries4 == 0) {
- kfree(tp->md5sig_info->keys4);
- tp->md5sig_info->keys4 = NULL;
- tp->md5sig_info->alloced4 = 0;
- } else if (tp->md5sig_info->entries4 != i) {
- /* Need to do some manipulation */
- memmove(&tp->md5sig_info->keys4[i],
- &tp->md5sig_info->keys4[i+1],
- (tp->md5sig_info->entries4 - i) *
- sizeof(struct tcp4_md5sig_key));
- }
- tcp_free_md5sig_pool();
+ if (!old_info)
+ return -ENOENT;
+
+ for (i = 0; i < old_info->entries; i++) {
+ if (old_info->keys[i].addr == addr) {
+ struct tcp4_md5sig_info *new_info;
+ int res;
+
+ res = tcp_v4_md5_do_del_ith(&new_info, old_info, i);
+ if (res)
+ return res;
+
+ rcu_assign_pointer(tp->md5sig_info4, new_info);
+ synchronize_rcu();
+ tcp_md5_put(old_info->keys[i].base);
+ kfree(old_info);
return 0;
}
}
@@ -968,25 +1067,27 @@ int tcp_v4_md5_do_del(struct sock *sk, __be32 addr)
}
EXPORT_SYMBOL(tcp_v4_md5_do_del);
+static void tcp_v4_md5_clear_info(struct rcu_head *head)
+{
+ struct tcp4_md5sig_info *md5_info =
+ container_of(head, struct tcp4_md5sig_info, rcu_head);
+ int i;
+
+ /* Free each key, then the set of key keys
+ */
+ for (i = 0; i < md5_info->entries; i++)
+ tcp_md5_put(md5_info->keys[i].base);
+ kfree(md5_info);
+}
+
static void tcp_v4_clear_md5_list(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
+ struct tcp4_md5sig_info *md5_info = tp->md5sig_info4;
- /* Free each key, then the set of key keys,
- * the crypto element, and then decrement our
- * hold on the last resort crypto.
- */
- if (tp->md5sig_info->entries4) {
- int i;
- for (i = 0; i < tp->md5sig_info->entries4; i++)
- kfree(tp->md5sig_info->keys4[i].base.key);
- tp->md5sig_info->entries4 = 0;
- tcp_free_md5sig_pool();
- }
- if (tp->md5sig_info->keys4) {
- kfree(tp->md5sig_info->keys4);
- tp->md5sig_info->keys4 = NULL;
- tp->md5sig_info->alloced4 = 0;
+ if (md5_info) {
+ rcu_assign_pointer(tp->md5sig_info4, NULL);
+ call_rcu(&md5_info->rcu_head, tcp_v4_md5_clear_info);
}
}
@@ -995,7 +1096,7 @@ static int tcp_v4_parse_md5_keys(struct sock *sk, char __user *optval,
{
struct tcp_md5sig cmd;
struct sockaddr_in *sin = (struct sockaddr_in *)&cmd.tcpm_addr;
- u8 *newkey;
+ struct tcp_md5sig_key *newkey;
if (optlen < sizeof(cmd))
return -EINVAL;
@@ -1006,32 +1107,26 @@ static int tcp_v4_parse_md5_keys(struct sock *sk, char __user *optval,
if (sin->sin_family != AF_INET)
return -EINVAL;
- if (!cmd.tcpm_key || !cmd.tcpm_keylen) {
- if (!tcp_sk(sk)->md5sig_info)
- return -ENOENT;
+ if (!cmd.tcpm_keylen)
return tcp_v4_md5_do_del(sk, sin->sin_addr.s_addr);
- }
if (cmd.tcpm_keylen > TCP_MD5SIG_MAXKEYLEN)
return -EINVAL;
- if (!tcp_sk(sk)->md5sig_info) {
- struct tcp_sock *tp = tcp_sk(sk);
- struct tcp_md5sig_info *p;
-
- p = kzalloc(sizeof(*p), sk->sk_allocation);
- if (!p)
- return -EINVAL;
+ newkey = kmalloc(sizeof(*newkey) + cmd.tcpm_keylen, sk->sk_allocation);
+ if (!newkey)
+ return -ENOMEM;
- tp->md5sig_info = p;
- sk_nocaps_add(sk, NETIF_F_GSO_MASK);
+ if (tcp_alloc_md5sig_pool(sk) == NULL) {
+ kfree(newkey);
+ return -ENOMEM;
}
- newkey = kmemdup(cmd.tcpm_key, cmd.tcpm_keylen, sk->sk_allocation);
- if (!newkey)
- return -ENOMEM;
- return tcp_v4_md5_do_add(sk, sin->sin_addr.s_addr,
- newkey, cmd.tcpm_keylen);
+ kref_init(&newkey->kref);
+ newkey->keylen = cmd.tcpm_keylen;
+ memcpy(newkey->key, cmd.tcpm_key, cmd.tcpm_keylen);
+
+ return tcp_v4_md5_do_add(sk, sin->sin_addr.s_addr, newkey);
}
static int tcp_v4_md5_hash_pseudoheader(struct tcp_md5sig_pool *hp,
@@ -1157,7 +1252,7 @@ static int tcp_v4_inbound_md5_hash(struct sock *sk, struct sk_buff *skb)
int genhash;
unsigned char newhash[16];
- hash_expected = tcp_v4_md5_do_lookup(sk, iph->saddr);
+ hash_expected = tcp_v4_md5_do_get(sk, iph->saddr);
hash_location = tcp_parse_md5sig_option(th);
/* We've parsed the options - do we have a hash? */
@@ -1165,6 +1260,7 @@ static int tcp_v4_inbound_md5_hash(struct sock *sk, struct sk_buff *skb)
return 0;
if (hash_expected && !hash_location) {
+ tcp_md5_put(hash_expected);
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPMD5NOTFOUND);
return 1;
}
@@ -1181,6 +1277,8 @@ static int tcp_v4_inbound_md5_hash(struct sock *sk, struct sk_buff *skb)
hash_expected,
NULL, NULL, skb);
+ tcp_md5_put(hash_expected);
+
if (genhash || memcmp(hash_location, newhash, 16) != 0) {
if (net_ratelimit()) {
printk(KERN_INFO "MD5 Hash failed for (%pI4, %d)->(%pI4, %d)%s\n",
@@ -1207,6 +1305,8 @@ struct request_sock_ops tcp_request_sock_ops __read_mostly = {
#ifdef CONFIG_TCP_MD5SIG
static const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
+ .md5_get = tcp_v4_reqsk_md5_get,
+ .md5_put = tcp_md5_put,
.md5_lookup = tcp_v4_reqsk_md5_lookup,
.calc_md5_hash = tcp_v4_md5_hash_skb,
};
@@ -1453,20 +1553,9 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
#ifdef CONFIG_TCP_MD5SIG
/* Copy over the MD5 key from the original socket */
- key = tcp_v4_md5_do_lookup(sk, newinet->inet_daddr);
- if (key != NULL) {
- /*
- * We're using one, so create a matching key
- * on the newsk structure. If we fail to get
- * memory, then we end up not copying the key
- * across. Shucks.
- */
- char *newkey = kmemdup(key->key, key->keylen, GFP_ATOMIC);
- if (newkey != NULL)
- tcp_v4_md5_do_add(newsk, newinet->inet_daddr,
- newkey, key->keylen);
- sk_nocaps_add(newsk, NETIF_F_GSO_MASK);
- }
+ key = tcp_v4_md5_do_get(sk, newinet->inet_daddr);
+ if (key != NULL)
+ tcp_v4_md5_do_add(newsk, newinet->inet_daddr, key);
#endif
__inet_hash_nolisten(newsk, NULL);
@@ -1547,16 +1636,6 @@ static __sum16 tcp_v4_checksum_init(struct sk_buff *skb)
int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
struct sock *rsk;
-#ifdef CONFIG_TCP_MD5SIG
- /*
- * We really want to reject the packet as early as possible
- * if:
- * o We're expecting an MD5'd packet and this is no MD5 tcp option
- * o There is an MD5 option and we're not expecting one
- */
- if (tcp_v4_inbound_md5_hash(sk, skb))
- goto discard;
-#endif
if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
sock_rps_save_rxhash(sk, skb->rxhash);
@@ -1680,6 +1759,17 @@ process:
skb->dev = NULL;
+#ifdef CONFIG_TCP_MD5SIG
+ /*
+ * We really want to reject the packet as early as possible
+ * if:
+ * o We're expecting an MD5'd packet and this is no MD5 tcp option
+ * o There is an MD5 option and we're not expecting one
+ */
+ if (tcp_v4_inbound_md5_hash(sk, skb))
+ goto discard_and_relse;
+#endif
+
bh_lock_sock_nested(sk);
ret = 0;
if (!sock_owned_by_user(sk)) {
@@ -1842,6 +1932,8 @@ EXPORT_SYMBOL(ipv4_specific);
#ifdef CONFIG_TCP_MD5SIG
static const struct tcp_sock_af_ops tcp_sock_ipv4_specific = {
+ .md5_get = tcp_v4_md5_get,
+ .md5_put = tcp_md5_put,
.md5_lookup = tcp_v4_md5_lookup,
.calc_md5_hash = tcp_v4_md5_hash_skb,
.md5_add = tcp_v4_md5_add_func,
@@ -1931,11 +2023,7 @@ void tcp_v4_destroy_sock(struct sock *sk)
#ifdef CONFIG_TCP_MD5SIG
/* Clean up the MD5 key list, if any */
- if (tp->md5sig_info) {
- tcp_v4_clear_md5_list(sk);
- kfree(tp->md5sig_info);
- tp->md5sig_info = NULL;
- }
+ tcp_v4_clear_md5_list(sk);
#endif
#ifdef CONFIG_NET_DMA
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index f25b56c..599752c 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -306,22 +306,11 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
#ifdef CONFIG_TCP_MD5SIG
/*
* The timewait bucket does not have the key DB from the
- * sock structure. We just make a quick copy of the
- * md5 key being used (if indeed we are using one)
+ * sock structure. We just get the md5 key being used
+ * (if indeed we are using one)
* so the timewait ack generating code has the key.
*/
- do {
- struct tcp_md5sig_key *key;
- memset(tcptw->tw_md5_key, 0, sizeof(tcptw->tw_md5_key));
- tcptw->tw_md5_keylen = 0;
- key = tp->af_specific->md5_lookup(sk, sk);
- if (key != NULL) {
- memcpy(&tcptw->tw_md5_key, key->key, key->keylen);
- tcptw->tw_md5_keylen = key->keylen;
- if (tcp_alloc_md5sig_pool(sk) == NULL)
- BUG();
- }
- } while (0);
+ tcptw->tw_md5sig_key = tp->af_specific->md5_get(sk, sk);
#endif
/* Linkage updates. */
@@ -358,8 +347,8 @@ void tcp_twsk_destructor(struct sock *sk)
{
#ifdef CONFIG_TCP_MD5SIG
struct tcp_timewait_sock *twsk = tcp_twsk(sk);
- if (twsk->tw_md5_keylen)
- tcp_free_md5sig_pool();
+ if (twsk->tw_md5sig_key)
+ tcp_md5_put(twsk->tw_md5sig_key);
#endif
}
EXPORT_SYMBOL_GPL(tcp_twsk_destructor);
@@ -496,7 +485,10 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
newtp->tcp_header_len = sizeof(struct tcphdr);
}
#ifdef CONFIG_TCP_MD5SIG
- newtp->md5sig_info = NULL; /*XXX*/
+ newtp->md5sig_info4 = NULL; /*XXX*/
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+ newtp->md5sig_info6 = NULL; /*XXX*/
+#endif
if (newtp->af_specific->md5_lookup(sk, newsk))
newtp->tcp_header_len += TCPOLEN_MD5SIG_ALIGNED;
#endif
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index de3bd84..8954453 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -569,7 +569,7 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
0;
#ifdef CONFIG_TCP_MD5SIG
- *md5 = tp->af_specific->md5_lookup(sk, sk);
+ *md5 = tp->af_specific->md5_get(sk, sk);
if (*md5) {
opts->options |= OPTION_MD5;
remaining -= TCPOLEN_MD5SIG_ALIGNED;
@@ -671,7 +671,7 @@ static unsigned tcp_synack_options(struct sock *sk,
0;
#ifdef CONFIG_TCP_MD5SIG
- *md5 = tcp_rsk(req)->af_specific->md5_lookup(sk, req);
+ *md5 = tcp_rsk(req)->af_specific->md5_get(sk, req);
if (*md5) {
opts->options |= OPTION_MD5;
remaining -= TCPOLEN_MD5SIG_ALIGNED;
@@ -745,7 +745,7 @@ static unsigned tcp_established_options(struct sock *sk, struct sk_buff *skb,
unsigned int eff_sacks;
#ifdef CONFIG_TCP_MD5SIG
- *md5 = tp->af_specific->md5_lookup(sk, sk);
+ *md5 = tp->af_specific->md5_get(sk, sk);
if (unlikely(*md5)) {
opts->options |= OPTION_MD5;
size += TCPOLEN_MD5SIG_ALIGNED;
@@ -876,6 +876,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
sk_nocaps_add(sk, NETIF_F_GSO_MASK);
tp->af_specific->calc_md5_hash(opts.hash_location,
md5, sk, NULL, skb);
+ tp->af_specific->md5_put(md5);
}
#endif
@@ -1258,6 +1259,10 @@ unsigned int tcp_current_mss(struct sock *sk)
header_len = tcp_established_options(sk, NULL, &opts, &md5) +
sizeof(struct tcphdr);
+#ifdef CONFIG_TCP_MD5SIG
+ if (unlikely(md5))
+ tp->af_specific->md5_put(md5);
+#endif
/* The mss_cache is sized based on tp->tcp_header_len, which assumes
* some common options. If this is an odd packet (because we have SACK
* blocks etc) then our calculated header_len will be different, and
@@ -2515,6 +2520,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
if (md5) {
tcp_rsk(req)->af_specific->calc_md5_hash(opts.hash_location,
md5, NULL, req, skb);
+ tcp_rsk(req)->af_specific->md5_put(md5);
}
#endif
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index fe6d404..80d2d20 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -85,8 +85,8 @@ static const struct inet_connection_sock_af_ops ipv6_specific;
static const struct tcp_sock_af_ops tcp_sock_ipv6_specific;
static const struct tcp_sock_af_ops tcp_sock_ipv6_mapped_specific;
#else
-static struct tcp_md5sig_key *tcp_v6_md5_do_lookup(struct sock *sk,
- struct in6_addr *addr)
+static struct tcp_md5sig_key *tcp_v6_md5_do_get(struct sock *sk,
+ struct in6_addr *addr)
{
return NULL;
}
@@ -542,24 +542,41 @@ static void tcp_v6_reqsk_destructor(struct request_sock *req)
}
#ifdef CONFIG_TCP_MD5SIG
-static struct tcp_md5sig_key *tcp_v6_md5_do_lookup(struct sock *sk,
+static struct tcp_md5sig_key *__tcp_v6_md5_do_lookup(struct sock *sk,
struct in6_addr *addr)
{
struct tcp_sock *tp = tcp_sk(sk);
+ struct tcp6_md5sig_info *md5_info = rcu_dereference(tp->md5sig_info6);
int i;
BUG_ON(tp == NULL);
- if (!tp->md5sig_info || !tp->md5sig_info->entries6)
+ if (!md5_info)
return NULL;
- for (i = 0; i < tp->md5sig_info->entries6; i++) {
- if (ipv6_addr_equal(&tp->md5sig_info->keys6[i].addr, addr))
- return &tp->md5sig_info->keys6[i].base;
+ for (i = 0; i < md5_info->entries; i++) {
+ if (ipv6_addr_equal(&md5_info->keys[i].addr, addr))
+ return md5_info->keys[i].base;
}
return NULL;
}
+static struct tcp_md5sig_key *tcp_v6_md5_do_lookup(struct sock *sk,
+ struct in6_addr *addr)
+{
+ struct tcp_md5sig_key *res;
+
+ /* Short path */
+ if (!tcp_sk(sk)->md5sig_info6)
+ return NULL;
+
+ rcu_read_lock();
+ res = __tcp_v6_md5_do_lookup(sk, addr);
+ rcu_read_unlock();
+
+ return res;
+}
+
static struct tcp_md5sig_key *tcp_v6_md5_lookup(struct sock *sk,
struct sock *addr_sk)
{
@@ -572,135 +589,194 @@ static struct tcp_md5sig_key *tcp_v6_reqsk_md5_lookup(struct sock *sk,
return tcp_v6_md5_do_lookup(sk, &inet6_rsk(req)->rmt_addr);
}
-static int tcp_v6_md5_do_add(struct sock *sk, struct in6_addr *peer,
- char *newkey, u8 newkeylen)
+/* Find and lock the Key structure for an address. */
+static struct tcp_md5sig_key *
+ tcp_v6_md5_do_get(struct sock *sk, struct in6_addr *addr)
{
- /* Add key to the list */
- struct tcp_md5sig_key *key;
+ struct tcp_md5sig_key *res;
+
+ /* Short path */
+ if (!tcp_sk(sk)->md5sig_info6)
+ return NULL;
+
+ rcu_read_lock();
+ res = __tcp_v6_md5_do_lookup(sk, addr);
+ if (res)
+ kref_get(&res->kref);
+ rcu_read_unlock();
+
+ return res;
+}
+
+struct tcp_md5sig_key *tcp_v6_md5_get(struct sock *sk,
+ struct sock *addr_sk)
+{
+ return tcp_v6_md5_do_get(sk, &inet6_sk(addr_sk)->daddr);
+}
+
+static struct tcp_md5sig_key *tcp_v6_reqsk_md5_get(struct sock *sk,
+ struct request_sock *req)
+{
+ return tcp_v6_md5_do_get(sk, &inet6_rsk(req)->rmt_addr);
+}
+
+static int tcp_v6_md5_do_add(struct sock *sk, struct in6_addr *addr,
+ struct tcp_md5sig_key *key)
+{
+ /* Add Key to the list */
struct tcp_sock *tp = tcp_sk(sk);
- struct tcp6_md5sig_key *keys;
+ struct tcp6_md5sig_info *old_info = NULL;
+ struct tcp6_md5sig_info *new_info;
+ int place = 0;
+ u32 entries;
+
+ if (tp->md5sig_info6) {
+ old_info = tp->md5sig_info6;
+ /* Check if we have to replace old key */
+ for (; place < old_info->entries; place++) {
+ if (ipv6_addr_equal(&old_info->keys[place].addr,
+ addr))
+ break;
+ }
+ } else {
+ sk_nocaps_add(sk, NETIF_F_GSO_MASK);
+ }
- key = tcp_v6_md5_do_lookup(sk, peer);
- if (key) {
- /* modify existing entry - just update that one */
- kfree(key->key);
- key->key = newkey;
- key->keylen = newkeylen;
+ /* Number of entries in new_info */
+ if (old_info) {
+ entries = old_info->entries;
+ if (place == old_info->entries)
+ ++entries;
} else {
- /* reallocate new list if current one is full. */
- if (!tp->md5sig_info) {
- tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info), GFP_ATOMIC);
- if (!tp->md5sig_info) {
- kfree(newkey);
- return -ENOMEM;
- }
- sk_nocaps_add(sk, NETIF_F_GSO_MASK);
- }
- if (tcp_alloc_md5sig_pool(sk) == NULL) {
- kfree(newkey);
- return -ENOMEM;
- }
- if (tp->md5sig_info->alloced6 == tp->md5sig_info->entries6) {
- keys = kmalloc((sizeof (tp->md5sig_info->keys6[0]) *
- (tp->md5sig_info->entries6 + 1)), GFP_ATOMIC);
-
- if (!keys) {
- tcp_free_md5sig_pool();
- kfree(newkey);
- return -ENOMEM;
- }
+ entries = 1;
+ }
- if (tp->md5sig_info->entries6)
- memmove(keys, tp->md5sig_info->keys6,
- (sizeof (tp->md5sig_info->keys6[0]) *
- tp->md5sig_info->entries6));
+ new_info = kmalloc(sizeof(*new_info) +
+ sizeof(new_info->keys[0]) * entries,
+ GFP_ATOMIC);
- kfree(tp->md5sig_info->keys6);
- tp->md5sig_info->keys6 = keys;
- tp->md5sig_info->alloced6++;
- }
+ if (!new_info) {
+ tcp_md5_put(key);
+ return -ENOMEM;
+ }
+
+ new_info->entries = entries;
- ipv6_addr_copy(&tp->md5sig_info->keys6[tp->md5sig_info->entries6].addr,
- peer);
- tp->md5sig_info->keys6[tp->md5sig_info->entries6].base.key = newkey;
- tp->md5sig_info->keys6[tp->md5sig_info->entries6].base.keylen = newkeylen;
+ if (old_info)
+ memcpy(new_info->keys, old_info->keys,
+ old_info->entries * sizeof(old_info->keys[0]));
- tp->md5sig_info->entries6++;
+ if (!old_info || place == old_info->entries)
+ ipv6_addr_copy(&new_info->keys[place].addr, addr);
+
+ new_info->keys[place].base = key;
+ rcu_assign_pointer(tp->md5sig_info6, new_info);
+
+ /* This function may be called from setsockopt (synchronize_rcu is ok)
+ * or on a newly created socket (old_info == NULL)
+ */
+ if (old_info) {
+ synchronize_rcu();
+ if (place != old_info->entries) /* Put old key */
+ tcp_md5_put(old_info->keys[place].base);
+ kfree(old_info);
}
+
return 0;
}
static int tcp_v6_md5_add_func(struct sock *sk, struct sock *addr_sk,
- u8 *newkey, __u8 newkeylen)
+ struct tcp_md5sig_key *key)
{
- return tcp_v6_md5_do_add(sk, &inet6_sk(addr_sk)->daddr,
- newkey, newkeylen);
+ return tcp_v6_md5_do_add(sk, &inet6_sk(addr_sk)->daddr, key);
+}
+
+static int
+ tcp_v6_md5_do_del_ith(struct tcp6_md5sig_info **new_info,
+ struct tcp6_md5sig_info *old_info,
+ int i)
+{
+ struct tcp6_md5sig_info *res_info = NULL;
+
+ if (old_info->entries > 1) {
+ res_info = kmalloc(sizeof(*res_info) +
+ sizeof(res_info->keys[0]) *
+ (old_info->entries - 1),
+ GFP_ATOMIC);
+ if (!res_info)
+ return -ENOMEM;
+ res_info->entries = old_info->entries - 1;
+
+ memcpy(res_info->keys,
+ old_info->keys,
+ i * sizeof(res_info->keys[0]));
+ memcpy(&res_info->keys[i],
+ &old_info->keys[i + 1],
+ (res_info->entries - i) * sizeof(res_info->keys[0]));
+ }
+
+ *new_info = res_info;
+ return 0;
}
static int tcp_v6_md5_do_del(struct sock *sk, struct in6_addr *peer)
{
struct tcp_sock *tp = tcp_sk(sk);
+ struct tcp6_md5sig_info *old_info = tp->md5sig_info6;
int i;
- for (i = 0; i < tp->md5sig_info->entries6; i++) {
- if (ipv6_addr_equal(&tp->md5sig_info->keys6[i].addr, peer)) {
- /* Free the key */
- kfree(tp->md5sig_info->keys6[i].base.key);
- tp->md5sig_info->entries6--;
-
- if (tp->md5sig_info->entries6 == 0) {
- kfree(tp->md5sig_info->keys6);
- tp->md5sig_info->keys6 = NULL;
- tp->md5sig_info->alloced6 = 0;
- } else {
- /* shrink the database */
- if (tp->md5sig_info->entries6 != i)
- memmove(&tp->md5sig_info->keys6[i],
- &tp->md5sig_info->keys6[i+1],
- (tp->md5sig_info->entries6 - i)
- * sizeof (tp->md5sig_info->keys6[0]));
- }
- tcp_free_md5sig_pool();
+ if (!old_info)
+ return -ENOENT;
+
+ for (i = 0; i < old_info->entries; i++) {
+ if (ipv6_addr_equal(&old_info->keys[i].addr, peer)) {
+ struct tcp6_md5sig_info *new_info;
+ int res;
+
+ res = tcp_v6_md5_do_del_ith(&new_info, old_info, i);
+ if (res)
+ return res;
+
+ rcu_assign_pointer(tp->md5sig_info6, new_info);
+ synchronize_rcu();
+ tcp_md5_put(old_info->keys[i].base);
+ kfree(old_info);
return 0;
}
}
return -ENOENT;
}
-static void tcp_v6_clear_md5_list (struct sock *sk)
+static void tcp_v6_md5_clear_info(struct rcu_head *head)
{
- struct tcp_sock *tp = tcp_sk(sk);
+ struct tcp6_md5sig_info *md5_info =
+ container_of(head, struct tcp6_md5sig_info, rcu_head);
int i;
- if (tp->md5sig_info->entries6) {
- for (i = 0; i < tp->md5sig_info->entries6; i++)
- kfree(tp->md5sig_info->keys6[i].base.key);
- tp->md5sig_info->entries6 = 0;
- tcp_free_md5sig_pool();
- }
+ /* Free each key, then the set of keys
+ */
+ for (i = 0; i < md5_info->entries; i++)
+ tcp_md5_put(md5_info->keys[i].base);
+ kfree(md5_info);
+}
- kfree(tp->md5sig_info->keys6);
- tp->md5sig_info->keys6 = NULL;
- tp->md5sig_info->alloced6 = 0;
+static void tcp_v6_clear_md5_list(struct sock *sk)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct tcp6_md5sig_info *md5_info = tp->md5sig_info6;
- if (tp->md5sig_info->entries4) {
- for (i = 0; i < tp->md5sig_info->entries4; i++)
- kfree(tp->md5sig_info->keys4[i].base.key);
- tp->md5sig_info->entries4 = 0;
- tcp_free_md5sig_pool();
+ if (md5_info) {
+ rcu_assign_pointer(tp->md5sig_info6, NULL);
+ call_rcu(&md5_info->rcu_head, tcp_v6_md5_clear_info);
}
-
- kfree(tp->md5sig_info->keys4);
- tp->md5sig_info->keys4 = NULL;
- tp->md5sig_info->alloced4 = 0;
}
-static int tcp_v6_parse_md5_keys (struct sock *sk, char __user *optval,
- int optlen)
+static int tcp_v6_parse_md5_keys(struct sock *sk, char __user *optval,
+ int optlen)
{
struct tcp_md5sig cmd;
struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)&cmd.tcpm_addr;
- u8 *newkey;
+ struct tcp_md5sig_key *newkey;
if (optlen < sizeof(cmd))
return -EINVAL;
@@ -712,8 +788,6 @@ static int tcp_v6_parse_md5_keys (struct sock *sk, char __user *optval,
return -EINVAL;
if (!cmd.tcpm_keylen) {
- if (!tcp_sk(sk)->md5sig_info)
- return -ENOENT;
if (ipv6_addr_v4mapped(&sin6->sin6_addr))
return tcp_v4_md5_do_del(sk, sin6->sin6_addr.s6_addr32[3]);
return tcp_v6_md5_do_del(sk, &sin6->sin6_addr);
@@ -722,26 +796,24 @@ static int tcp_v6_parse_md5_keys (struct sock *sk, char __user *optval,
if (cmd.tcpm_keylen > TCP_MD5SIG_MAXKEYLEN)
return -EINVAL;
- if (!tcp_sk(sk)->md5sig_info) {
- struct tcp_sock *tp = tcp_sk(sk);
- struct tcp_md5sig_info *p;
-
- p = kzalloc(sizeof(struct tcp_md5sig_info), GFP_KERNEL);
- if (!p)
- return -ENOMEM;
+ newkey = kmalloc(sizeof(*newkey) + cmd.tcpm_keylen, GFP_KERNEL);
+ if (!newkey)
+ return -ENOMEM;
- tp->md5sig_info = p;
- sk_nocaps_add(sk, NETIF_F_GSO_MASK);
+ if (tcp_alloc_md5sig_pool(sk) == NULL) {
+ kfree(newkey);
+ return -ENOMEM;
}
- newkey = kmemdup(cmd.tcpm_key, cmd.tcpm_keylen, GFP_KERNEL);
- if (!newkey)
- return -ENOMEM;
- if (ipv6_addr_v4mapped(&sin6->sin6_addr)) {
+ kref_init(&newkey->kref);
+ newkey->keylen = cmd.tcpm_keylen;
+ memcpy(newkey->key, cmd.tcpm_key, cmd.tcpm_keylen);
+
+ if (ipv6_addr_v4mapped(&sin6->sin6_addr))
return tcp_v4_md5_do_add(sk, sin6->sin6_addr.s6_addr32[3],
- newkey, cmd.tcpm_keylen);
- }
- return tcp_v6_md5_do_add(sk, &sin6->sin6_addr, newkey, cmd.tcpm_keylen);
+ newkey);
+
+ return tcp_v6_md5_do_add(sk, &sin6->sin6_addr, newkey);
}
static int tcp_v6_md5_hash_pseudoheader(struct tcp_md5sig_pool *hp,
@@ -854,7 +926,7 @@ static int tcp_v6_inbound_md5_hash (struct sock *sk, struct sk_buff *skb)
int genhash;
u8 newhash[16];
- hash_expected = tcp_v6_md5_do_lookup(sk, &ip6h->saddr);
+ hash_expected = tcp_v6_md5_do_get(sk, &ip6h->saddr);
hash_location = tcp_parse_md5sig_option(th);
/* We've parsed the options - do we have a hash? */
@@ -862,6 +934,7 @@ static int tcp_v6_inbound_md5_hash (struct sock *sk, struct sk_buff *skb)
return 0;
if (hash_expected && !hash_location) {
+ tcp_md5_put(hash_expected);
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPMD5NOTFOUND);
return 1;
}
@@ -876,6 +949,8 @@ static int tcp_v6_inbound_md5_hash (struct sock *sk, struct sk_buff *skb)
hash_expected,
NULL, NULL, skb);
+ tcp_md5_put(hash_expected);
+
if (genhash || memcmp(hash_location, newhash, 16) != 0) {
if (net_ratelimit()) {
printk(KERN_INFO "MD5 Hash %s for [%pI6c]:%u->[%pI6c]:%u\n",
@@ -901,6 +976,8 @@ struct request_sock_ops tcp6_request_sock_ops __read_mostly = {
#ifdef CONFIG_TCP_MD5SIG
static const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
+ .md5_get = tcp_v6_reqsk_md5_get,
+ .md5_put = tcp_md5_put,
.md5_lookup = tcp_v6_reqsk_md5_lookup,
.calc_md5_hash = tcp_v6_md5_hash_skb,
};
@@ -1092,7 +1169,7 @@ static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
#ifdef CONFIG_TCP_MD5SIG
if (sk)
- key = tcp_v6_md5_do_lookup(sk, &ipv6_hdr(skb)->daddr);
+ key = tcp_v6_md5_do_get(sk, &ipv6_hdr(skb)->daddr);
#endif
if (th->ack)
@@ -1102,6 +1179,9 @@ static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
(th->doff << 2);
tcp_v6_send_response(skb, seq, ack_seq, 0, 0, key, 1);
+
+ if (key)
+ tcp_md5_put(key);
}
static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
u32 win, u32 ts,
@@ -1125,8 +1205,15 @@ static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb)
static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
struct request_sock *req)
{
- tcp_v6_send_ack(skb, tcp_rsk(req)->snt_isn + 1, tcp_rsk(req)->rcv_isn + 1, req->rcv_wnd, req->ts_recent,
- tcp_v6_md5_do_lookup(sk, &ipv6_hdr(skb)->daddr));
+ struct tcp_md5sig_key *key = tcp_v6_md5_do_get(sk,
+ &ipv6_hdr(skb)->daddr);
+
+ tcp_v6_send_ack(skb, tcp_rsk(req)->snt_isn + 1,
+ tcp_rsk(req)->rcv_isn + 1, req->rcv_wnd,
+ req->ts_recent, key);
+
+ if (key)
+ tcp_md5_put(key);
}
@@ -1484,17 +1571,9 @@ static struct sock * tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
#ifdef CONFIG_TCP_MD5SIG
/* Copy over the MD5 key from the original socket */
- if ((key = tcp_v6_md5_do_lookup(sk, &newnp->daddr)) != NULL) {
- /* We're using one, so create a matching key
- * on the newsk structure. If we fail to get
- * memory, then we end up not copying the key
- * across. Shucks.
- */
- char *newkey = kmemdup(key->key, key->keylen, GFP_ATOMIC);
- if (newkey != NULL)
- tcp_v6_md5_do_add(newsk, &newnp->daddr,
- newkey, key->keylen);
- }
+ key = tcp_v6_md5_do_get(sk, &newnp->daddr);
+ if (key != NULL)
+ tcp_v6_md5_do_add(newsk, &newnp->daddr, key);
#endif
__inet6_hash(newsk, NULL);
@@ -1557,11 +1636,6 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
if (skb->protocol == htons(ETH_P_IP))
return tcp_v4_do_rcv(sk, skb);
-#ifdef CONFIG_TCP_MD5SIG
- if (tcp_v6_inbound_md5_hash (sk, skb))
- goto discard;
-#endif
-
if (sk_filter(sk, skb))
goto discard;
@@ -1726,6 +1800,11 @@ process:
skb->dev = NULL;
+#ifdef CONFIG_TCP_MD5SIG
+ if (tcp_v6_inbound_md5_hash(sk, skb))
+ goto discard_and_relse;
+#endif
+
bh_lock_sock_nested(sk);
ret = 0;
if (!sock_owned_by_user(sk)) {
@@ -1841,6 +1920,8 @@ static const struct inet_connection_sock_af_ops ipv6_specific = {
#ifdef CONFIG_TCP_MD5SIG
static const struct tcp_sock_af_ops tcp_sock_ipv6_specific = {
+ .md5_get = tcp_v6_md5_get,
+ .md5_put = tcp_md5_put,
.md5_lookup = tcp_v6_md5_lookup,
.calc_md5_hash = tcp_v6_md5_hash_skb,
.md5_add = tcp_v6_md5_add_func,
@@ -1873,6 +1954,8 @@ static const struct inet_connection_sock_af_ops ipv6_mapped = {
#ifdef CONFIG_TCP_MD5SIG
static const struct tcp_sock_af_ops tcp_sock_ipv6_mapped_specific = {
+ .md5_get = tcp_v4_md5_get,
+ .md5_put = tcp_md5_put,
.md5_lookup = tcp_v4_md5_lookup,
.calc_md5_hash = tcp_v4_md5_hash_skb,
.md5_add = tcp_v6_md5_add_func,
@@ -1950,8 +2033,7 @@ static void tcp_v6_destroy_sock(struct sock *sk)
{
#ifdef CONFIG_TCP_MD5SIG
/* Clean up the MD5 key list */
- if (tcp_sk(sk)->md5sig_info)
- tcp_v6_clear_md5_list(sk);
+ tcp_v6_clear_md5_list(sk);
#endif
tcp_v4_destroy_sock(sk);
inet6_destroy_sock(sk);
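
For readers following the get/put pairing introduced by this patch, the lifetime rule can be sketched in userspace C. This is only an illustration under stated assumptions: `struct md5_key`, `key_new`, `key_get`, `key_put`, `freed_keys` and `demo_share_key` are hypothetical names that do not appear in the patch, a C11 atomic counter stands in for the kernel's `struct kref`, and RCU is not modelled at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for a reference-counted MD5 key. */
struct md5_key {
	atomic_int refs;        /* plays the role of struct kref */
	unsigned char keylen;
	unsigned char key[];    /* key bytes live in the same allocation */
};

static int freed_keys;          /* demo-only counter of released keys */

/* Allocate a key holding one reference, owned by the caller. */
static struct md5_key *key_new(const void *data, unsigned char len)
{
	struct md5_key *k = malloc(sizeof(*k) + len);

	if (!k)
		return NULL;
	atomic_init(&k->refs, 1);
	k->keylen = len;
	memcpy(k->key, data, len);
	return k;
}

/* Take an extra reference; every successful get needs a matching put. */
static struct md5_key *key_get(struct md5_key *k)
{
	if (k)
		atomic_fetch_add(&k->refs, 1);
	return k;
}

/* Drop one reference; the last put frees the key. */
static void key_put(struct md5_key *k)
{
	if (k && atomic_fetch_sub(&k->refs, 1) == 1) {
		free(k);
		freed_keys++;
	}
}

/*
 * A listener and a child socket share one key instead of duplicating it.
 * Returns how many keys were freed during the demo.
 */
static int demo_share_key(void)
{
	int before = freed_keys;
	struct md5_key *listener_key = key_new("secret", 6);
	struct md5_key *child_key;

	assert(listener_key != NULL);
	child_key = key_get(listener_key);  /* child now shares the key */
	key_put(listener_key);              /* listener goes away first */
	assert(freed_keys == before);       /* key survives for the child */
	key_put(child_key);                 /* last user releases it */
	return freed_keys - before;
}
```

The point of the sketch is only the pairing: a key is shared rather than copied, and it is freed exactly once, when the last holder puts it.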
^ permalink raw reply related
* Re: [BUG net-2.6 vlan/bonding] lockdep splats
From: Jarek Poplawski @ 2010-10-27 12:03 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, David Miller, Jesse Gross
In-Reply-To: <1288175070.2709.86.camel@edumazet-laptop>
On Wed, Oct 27, 2010 at 12:24:30PM +0200, Eric Dumazet wrote:
> On latest net-2.6 kernel I got following splat, not sure it's related to
> vlan changes...
Seems to be even older. Could you try this patch?
Thanks,
Jarek P.
---
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index beb3b7c..bdb68a6 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -493,9 +493,9 @@ static void bond_vlan_rx_register(struct net_device *bond_dev,
struct slave *slave;
int i;
- write_lock(&bond->lock);
+ write_lock_bh(&bond->lock);
bond->vlgrp = grp;
- write_unlock(&bond->lock);
+ write_unlock_bh(&bond->lock);
bond_for_each_slave(bond, slave, i) {
struct net_device *slave_dev = slave->dev;
^ permalink raw reply related
* Re: [PATCH net-next-2.6 v2] can: Topcliff: PCH_CAN driver: Fix buildwarnings
From: Marc Kleine-Budde @ 2010-10-27 11:58 UTC (permalink / raw)
To: Wolfgang Grandegger
Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w,
socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
sameo-VuQAYsv1563Yd54FQh9/CA,
margie.foster-ral2JQCrhuEAvxtiuMwx3w,
netdev-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
yong.y.wang-ral2JQCrhuEAvxtiuMwx3w,
masa-korg-ECg8zkTtlr0C6LszWs/t0g,
kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w, chripell-VaTbYqLCNhc,
morinaga526-ECg8zkTtlr0C6LszWs/t0g, David Miller,
joel.clark-ral2JQCrhuEAvxtiuMwx3w, qi.wang-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <4CC813B9.3080506-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>
On 10/27/2010 01:57 PM, Wolfgang Grandegger wrote:
> On 10/27/2010 01:27 PM, Tomoya MORINAGA wrote:
>> On Wednesday, October 27, 2010 3:52 AM : Marc Kleine-Budde and Wolfgang Grandegger wrote:
>>
>> The following is some inarticulate points I have for your questions.
>> Please give me more information.
>>
>>> Do I understand your code correctly? You have a big loop, but only do
>>> two different things at certain values of the loop? Smells fishy.
>> Uh, I can't understand your intention.
>> Please show in detail.
>> This processing does configuration for all message objects.
>
> Not all, but just a few of them. We believe it can be implemented more
> efficiently.
I misread the code...sorry - I'm just writing a longer answer.
cheers, Marc
--
Pengutronix e.K. | Marc Kleine-Budde |
Industrial Linux Solutions | Phone: +49-231-2826-924 |
Vertretung West/Dortmund | Fax: +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de |
^ permalink raw reply