* [PATCH v5 0/7] ARM: davinci: add support for the am1808 based enbw_cmc board
From: Heiko Schocher @ 2012-05-30 10:18 UTC (permalink / raw)
To: davinci-linux-open-source-VycZQUHpC/PFrsHnngEfi1aTQe2KTcn/
Cc: Heiko Schocher, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ,
linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
linux-i2c-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
David Woodhouse, Ben Dooks, Wolfram Sang, Sekhar Nori,
Kevin Hilman, Wolfgang Denk, Sergei Shtylyov, Grant Likely
This patch series adds support for the DaVinci AM1808-based
enbw_cmc board.
changes for v2:
Posted as v2, reworked to address the comments I got on the
RFC series.
changes for v3:
- Interrupt Controller:
- comment from Sergei Shtylyov:
- rename "compatible" prop to "ti,cp_intc"
- cp_intc_init() is now also the name of the init function
for the OF case (it calls the "new" __cp_intc_init()
function, which was the "old" cp_intc_init()). With this
rework the changes for OF are easier to see.
As the OF case uses the irq_domain rework from
Grant Likely, maybe the non-OF case can use
this too, but this should be tested on hardware ...
changes for v4:
- Interrupt Controller:
- split in two patches as Nori Sekhar suggested
one for the irq_domain change
one for DT support
- add comment from Grant Likely for the DT part:
remove if/else clause, not needed.
Make use of DT runtime configurable
The non OF case is not tested!
changes for v5:
- Interrupt Controller:
add comments from Sergei Shtylyov:
- s/intc/cp_intc in commit subject
- Codingstyle fixes
add comment from Grant Likely:
- rename "compatible" prop to "ti,cp-intc"
- call irq_domain_add also in the non DT case
(was fixed in v4)
- switched from using d->irq to d->hwirq for the hardware
irq number in irq_chip hooks
- I2C DT support:
add comments from Grant Likely:
- do not change value of dev->dev->platform_data, instead
hold a copy in davinci_i2c_dev.
I got no comments on the following points, which I noted in the
RFC series, so I am posting this patch series with them:
- ARM: davinci: configure davinci aemif chipselects through OF
not moved to mfd, as mentioned in this discussion:
http://davinci-linux-open-source.1494791.n2.nabble.com/PATCH-arm-davinci-configure-davinci-aemif-chipselects-through-OF-td7059739.html
instead use a phandle in the DTS, so drivers which
use the davinci aemif can call davinci_aemif_setup_timing_of()
This is just meant as an RFC ... The enbw_cmc board
support does not really need to set up these bus timings, as
they are set up in U-Boot ... but I want to post this,
as I think it is nice to have, and I am not really
sure whether this has to be an MFD device (if so, all bus
interfaces for other SoCs should also be converted to
MFD devices) ... as an example of how this can be used,
I added this to the davinci nand controller OF support
patch in this patch series.
- ARM: davinci: mux: add OF support
I want to get rid of the pin setup code in board code ...
This patch introduces a davinci_cfg_reg_of() function,
which davinci drivers can call if they find a
"pinmux-handle"; it is used in the following drivers in
this patch series:
drivers/net/ethernet/ti/davinci_emac
drivers/i2c/busses/i2c-davinci.c
drivers/mtd/nand/davinci_nand.c
This was removed for the v4 series, as Nori Sekhar suggested.
- post this board support with USB support, even though
USB is only working with the 10 ms "workaround", posted here:
http://comments.gmane.org/gmane.linux.usb.general/54505
I see this issue also on the AM1808 TMDXEXP1808L evalboard.
change for v4:
The 10 ms delay is no longer needed, see discussion here:
http://www.spinics.net/lists/linux-usb/msg64232.html
shows the way to go ...
- MMC and USB are not using OF support yet; ideas on how to port
this are welcome. I need board-specific callbacks for USB and
MMC; how can this be solved with OF support?
Signed-off-by: Heiko Schocher <hs-ynQEQJNshbs@public.gmane.org>
Cc: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
Cc: devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ@public.gmane.org
Cc: davinci-linux-open-source-VycZQUHpC/PFrsHnngEfi1aTQe2KTcn/@public.gmane.org
Cc: linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
Cc: linux-i2c-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: David Woodhouse <dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
Cc: Ben Dooks <ben-linux-elnMNo+KYs3YtjvyW6yDsg@public.gmane.org>
Cc: Wolfram Sang <w.sang-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>
Cc: Sekhar Nori <nsekhar-l0cyMroinI0@public.gmane.org>
Cc: Kevin Hilman <khilman-l0cyMroinI0@public.gmane.org>
Cc: Wolfgang Denk <wd-ynQEQJNshbs@public.gmane.org>
Cc: Sergei Shtylyov <sshtylyov-Igf4POYTYCDQT0dZR+AlfA@public.gmane.org>
Cc: Grant Likely <grant.likely-s3s/WqlpOiPyB63q8FvJNQ@public.gmane.org>
Heiko Schocher (7):
ARM: davinci, cp_intc: Add irq domain support
ARM: davinci, cp_intc: Add OF support for TI interrupt controller
ARM: davinci: configure davinci aemif chipselects through OF
ARM: davinci: net: davinci_emac: add OF support
ARM: davinci: i2c: add OF support
ARM: mtd: nand: davinci: add OF support for davinci nand controller
ARM: davinci: add support for the am1808 based enbw_cmc board
.../devicetree/bindings/arm/davinci/aemif.txt | 119 +++++++
.../devicetree/bindings/arm/davinci/i2c.txt | 31 ++
.../devicetree/bindings/arm/davinci/intc.txt | 27 ++
.../devicetree/bindings/arm/davinci/nand.txt | 72 ++++
.../devicetree/bindings/net/davinci_emac.txt | 41 +++
arch/arm/boot/dts/enbw_cmc.dts | 172 +++++++++
arch/arm/configs/enbw_cmc_defconfig | 123 +++++++
arch/arm/mach-davinci/Kconfig | 9 +
arch/arm/mach-davinci/Makefile | 1 +
arch/arm/mach-davinci/aemif.c | 86 +++++-
arch/arm/mach-davinci/board-enbw-cmc.c | 374 ++++++++++++++++++++
arch/arm/mach-davinci/cp_intc.c | 83 ++++-
arch/arm/mach-davinci/include/mach/aemif.h | 1 +
arch/arm/mach-davinci/include/mach/uncompress.h | 1 +
drivers/i2c/busses/i2c-davinci.c | 49 +++-
drivers/mtd/nand/davinci_nand.c | 80 ++++-
drivers/net/ethernet/ti/davinci_emac.c | 87 +++++-
17 files changed, 1335 insertions(+), 21 deletions(-)
create mode 100644 Documentation/devicetree/bindings/arm/davinci/aemif.txt
create mode 100644 Documentation/devicetree/bindings/arm/davinci/i2c.txt
create mode 100644 Documentation/devicetree/bindings/arm/davinci/intc.txt
create mode 100644 Documentation/devicetree/bindings/arm/davinci/nand.txt
create mode 100644 Documentation/devicetree/bindings/net/davinci_emac.txt
create mode 100644 arch/arm/boot/dts/enbw_cmc.dts
create mode 100644 arch/arm/configs/enbw_cmc_defconfig
create mode 100644 arch/arm/mach-davinci/board-enbw-cmc.c
--
1.7.7.6
* Re: [PATCH RFC] virtio-net: remove useless disable on freeze
From: Rusty Russell @ 2012-05-30 10:11 UTC (permalink / raw)
To: Michael S. Tsirkin, netdev; +Cc: Amit Shah, linux-kernel, kvm, virtualization
In-Reply-To: <20120528125325.GA22576@redhat.com>
On Mon, 28 May 2012 15:53:25 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Wed, Apr 04, 2012 at 12:19:54PM +0300, Michael S. Tsirkin wrote:
> > disable_cb is just an optimization: it
> > can not guarantee that there are no callbacks.
> >
> > I didn't yet figure out whether a callback
> > in freeze will trigger a bug, but disable_cb
> > won't address it in any case. So let's remove
> > the useless calls as a first step.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>
> Looks like this isn't in the 3.5 pull request -
> just lost in the shuffle?
> disable_cb is advisory so can't be relied upon.
I always (try to?) reply as I accept patches.
This one did slip by, but it's harmless so no need to push AFAICT.
Applied.
Thanks!
Rusty.
* Re: Difficulties to get 1Gbps on be2net ethernet card
From: Jean-Michel Hautbois @ 2012-05-30 10:07 UTC (permalink / raw)
To: Sathya.Perla; +Cc: eric.dumazet, netdev
In-Reply-To: <3367B80B08154D42A3B2BC708B5D41F647C678B73F@EXMAIL.ad.emulex.com>
2012/5/30 <Sathya.Perla@emulex.com>:
>>-----Original Message-----
>>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
>>Behalf Of Jean-Michel Hautbois
>>
>>2012/5/30 Jean-Michel Hautbois <jhautbois@gmail.com>:
>>
>>I used vmstat in order to see the differences between the two kernels.
>>The main difference is the number of interrupts per second.
>>I have an average of 87500 on 3.2 and 7500 on 2.6, 10 times lower !
>>I suspect the be2net driver to be the main cause, and I checkes the
>>/proc/interrupts file in order to be sure.
>>
>>I have for eth1-tx on 2.6.26 about 2200 interrupts per second and 23000 on 3.2.
>>BTW, it is named eth1-q0 on 3.2 (and tx and rx are the same IRQ)
>>whereas there is eth1-rx0 and eth1-tx on 2.6.26.
>
> Yes, there is an issue with be2net interrupt mitigation in the recent code with
> RX and TX on the same Evt-Q (commit 10ef9ab4). The high interrupt rate happens when a TX blast is
> done while RX is relatively silent on a queue pair. Interrupt rate due to TX completions is not being
> mitigated.
>
> I have a fix and will send it out soon..
>
> thanks,
> -Sathya
Hi Sathya !
Thanks for this information !
I had the correct diagnosis :). I am waiting for your fix.
Regards,
JM
* Re: Difficulties to get 1Gbps on be2net ethernet card
From: Jean-Michel Hautbois @ 2012-05-30 10:06 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1338371774.2760.134.camel@edumazet-glaptop>
2012/5/30 Eric Dumazet <eric.dumazet@gmail.com>:
> On Wed, 2012-05-30 at 11:40 +0200, Jean-Michel Hautbois wrote:
>
>> I used vmstat in order to see the differences between the two kernels.
>> The main difference is the number of interrupts per second.
>> I have an average of 87500 on 3.2 and 7500 on 2.6, 10 times lower !
>> I suspect the be2net driver to be the main cause, and I checkes the
>> /proc/interrupts file in order to be sure.
>>
>> I have for eth1-tx on 2.6.26 about 2200 interrupts per second and 23000 on 3.2.
>> BTW, it is named eth1-q0 on 3.2 (and tx and rx are the same IRQ)
>> whereas there is eth1-rx0 and eth1-tx on 2.6.26.
>>
>
> Might be different coalescing params :
>
> ethtool -c eth1
>
Yes, as stated in my first e-mail, this is different, in 2.6.26 the
adaptive-tx coalescing is off, while it is on for 3.4 (sorry, I said
3.2 before but it is 3.4).
But I can't change this setting since commit 10ef9ab...
JM
* RE: Difficulties to get 1Gbps on be2net ethernet card
From: Sathya.Perla @ 2012-05-30 10:04 UTC (permalink / raw)
To: jhautbois, eric.dumazet; +Cc: netdev
In-Reply-To: <CAL8zT=gJT1Frn_44SU0CrpZjPxwC_VuHFE4k9jvOGNmFomzhHA@mail.gmail.com>
>-----Original Message-----
>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
>Behalf Of Jean-Michel Hautbois
>
>2012/5/30 Jean-Michel Hautbois <jhautbois@gmail.com>:
>
>I used vmstat in order to see the differences between the two kernels.
>The main difference is the number of interrupts per second.
>I have an average of 87500 on 3.2 and 7500 on 2.6, 10 times lower !
>I suspect the be2net driver to be the main cause, and I checkes the
>/proc/interrupts file in order to be sure.
>
>I have for eth1-tx on 2.6.26 about 2200 interrupts per second and 23000 on 3.2.
>BTW, it is named eth1-q0 on 3.2 (and tx and rx are the same IRQ)
>whereas there is eth1-rx0 and eth1-tx on 2.6.26.
Yes, there is an issue with be2net interrupt mitigation in the recent code with
RX and TX on the same Evt-Q (commit 10ef9ab4). The high interrupt rate happens when a TX blast is
done while RX is relatively silent on a queue pair. Interrupt rate due to TX completions is not being
mitigated.
I have a fix and will send it out soon..
thanks,
-Sathya
* Re: Difficulties to get 1Gbps on be2net ethernet card
From: Eric Dumazet @ 2012-05-30 9:56 UTC (permalink / raw)
To: Jean-Michel Hautbois; +Cc: netdev
In-Reply-To: <CAL8zT=gJT1Frn_44SU0CrpZjPxwC_VuHFE4k9jvOGNmFomzhHA@mail.gmail.com>
On Wed, 2012-05-30 at 11:40 +0200, Jean-Michel Hautbois wrote:
> I used vmstat in order to see the differences between the two kernels.
> The main difference is the number of interrupts per second.
> I have an average of 87500 on 3.2 and 7500 on 2.6, 10 times lower !
> I suspect the be2net driver to be the main cause, and I checkes the
> /proc/interrupts file in order to be sure.
>
> I have for eth1-tx on 2.6.26 about 2200 interrupts per second and 23000 on 3.2.
> BTW, it is named eth1-q0 on 3.2 (and tx and rx are the same IRQ)
> whereas there is eth1-rx0 and eth1-tx on 2.6.26.
>
Might be different coalescing params :
ethtool -c eth1
* Re: [RFC PATCH 2/2] tcp: Early SYN limit and SYN cookie handling to mitigate SYN floods
From: Eric Dumazet @ 2012-05-30 9:46 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Andi Kleen, netdev, Christoph Paasch, David S. Miller,
Martin Topholm, Florian Westphal, Hans Schillstrom
In-Reply-To: <1338369863.7747.96.camel@localhost>
On Wed, 2012-05-30 at 11:24 +0200, Jesper Dangaard Brouer wrote:
> I don't dare to go into that battle with the network ninja, I surrender.
> DaveM, Eric's patches take precedence over mine...
>
> /me Crawing back into my cave, and switching to boring bugzilla cases of
> backporting kernel patches instead...
>
Hey, I only wanted to say that we were working on the same area and that
we should expect conflicts.
In the long term, we want a scalable listener solution, but I can
understand if some customers want an immediate solution (SYN flood
mitigation)
* Re: Difficulties to get 1Gbps on be2net ethernet card
From: Jean-Michel Hautbois @ 2012-05-30 9:40 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
In-Reply-To: <CAL8zT=jHWhDdv-DyXL8XnL8qRc7o-jZfA7ADU-qt54hB5aLyCw@mail.gmail.com>
2012/5/30 Jean-Michel Hautbois <jhautbois@gmail.com>:
> 2012/5/30 Eric Dumazet <eric.dumazet@gmail.com>:
>> On Wed, 2012-05-30 at 08:51 +0200, Jean-Michel Hautbois wrote:
>>> 2012/5/30 Eric Dumazet <eric.dumazet@gmail.com>:
>>> > On Wed, 2012-05-30 at 08:28 +0200, Jean-Michel Hautbois wrote:
>>> >
>>> >> If this can help, setting tx queue length to 5000 seems to make the
>>> >> problem disappear.
>>> >
>>> > Then you should have drops at Qdisc layer (before your change to 5000)
>>> >
>>> > tc -s -d qdisc
>>> >
>>> >> I didn't specified it : MTU is 4096, UDP packets are 4000 bytes.
>>> >
>>>
>>> Yes :
>>> qdisc mq 0: dev eth1 root
>>> Sent 5710049154383 bytes 1413544639 pkt (dropped 73078, overlimits 0
>>> requeues 281540)
>>> backlog 0b 0p requeues 281540
>>>
>>> Why ? With a 2.6.26 kernel it works well with a tx queue length of 1000.
>>
>> If you send big bursts of packets, then you need a large enough queue.
>>
>> Maybe your kernel is now faster than before and queue fills faster, or
>> TX ring is smaller ?
>>
>> ethtool -g eth0
>>
>> Note that everybody try to reduce dumb queue sizes because of latencies.
>>
>
> TX ring is not the same :
> On 3.2 :
> $> ethtool -g eth1
> Ring parameters for eth1:
> Pre-set maximums:
> RX: 1024
> RX Mini: 0
> RX Jumbo: 0
> TX: 2048
> Current hardware settings:
> RX: 1024
> RX Mini: 0
> RX Jumbo: 0
> TX: 2048
>
>
> On 2.6.26 :
> $>ethtool -g eth1
> Ring parameters for eth1:
> Pre-set maximums:
> RX: 1024
> RX Mini: 0
> RX Jumbo: 0
> TX: 2048
> Current hardware settings:
> RX: 1003
> RX Mini: 0
> RX Jumbo: 0
> TX: 0
>
> I can't set TX ring using ethtool -G eth1 tx N : operation not supported
> I am not really impacted by latency, but the lower the better.
>
> JM
I used vmstat in order to see the differences between the two kernels.
The main difference is the number of interrupts per second.
I have an average of 87500 on 3.2 and 7500 on 2.6, 10 times lower!
I suspect the be2net driver to be the main cause, and I checked the
/proc/interrupts file in order to be sure.
I have for eth1-tx on 2.6.26 about 2200 interrupts per second and 23000 on 3.2.
BTW, it is named eth1-q0 on 3.2 (and tx and rx are the same IRQ)
whereas there is eth1-rx0 and eth1-tx on 2.6.26.
JM
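For reference, per-second figures like the ones above can be derived from two readings of the same /proc/interrupts counter taken a known interval apart. A minimal sketch of that arithmetic follows; the helper name and its parameters are assumptions made for illustration and are not taken from any tool mentioned in this thread.

```c
/* Hypothetical helper: average events per second between two readings
 * of a monotonically increasing counter taken interval_s seconds apart. */
static unsigned long long rate_per_sec(unsigned long long earlier,
				       unsigned long long later,
				       unsigned int interval_s)
{
	if (interval_s == 0 || later < earlier)
		return 0;	/* no interval, or counter went backwards */
	return (later - earlier) / interval_s;
}
```

Sampling the eth1-q0 line twice and feeding both counts to such a helper gives a number that can be compared directly between the two kernels.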
* Re: [RFC PATCH 2/2] tcp: Early SYN limit and SYN cookie handling to mitigate SYN floods
From: Jesper Dangaard Brouer @ 2012-05-30 9:24 UTC (permalink / raw)
To: Eric Dumazet
Cc: Andi Kleen, netdev, Christoph Paasch, David S. Miller,
Martin Topholm, Florian Westphal, Hans Schillstrom,
Martin Topholm
In-Reply-To: <1338365702.2760.112.camel@edumazet-glaptop>
On Wed, 2012-05-30 at 10:15 +0200, Eric Dumazet wrote:
> On Wed, 2012-05-30 at 09:45 +0200, Jesper Dangaard Brouer wrote:
>
> > Sounds interesting, but TCP Fast Open is primarily concerned with
> > enabling data exchange during SYN establishment. I don't see any
> > indication that they have implemented parallel SYN handling.
> >
>
> Not at all, TCP fast open main goal is to allow connection establishment
> with a single packet (thus removing one RTT). This also removes the
> whole idea of having half-sockets (in SYN_RCV state)
>
> Then, allowing DATA in the SYN packet is an extra bonus, only if the
> whole request can fit in the packet (it is unlikely for typical http
> requests)
>
>
> > Implementing parallel SYN handling, should also benefit their work.
>
> Why do you think I am working on this ? Hint : I am a Google coworker.
I did know you work for Google, but didn't know you worked actively on
parallel SYN handling. Your previous quote "eventually in a short
time" indicated to me that I should solve the issue myself first, and
then we would replace my code with your full solution later.
> > After studying this code path, I also see great performance benefit in
> > also optimizing the normal 3WHS on sock's in sk_state == LISTEN.
> > Perhaps we should split up the code path for LISTEN vs. ESTABLISHED, as
> > they are very entangled at the moment AFAIKS.
> >
> > > Yuchung Cheng and Jerry Chu should upstream this code in a very near
> > > future.
> >
> > Looking forward to see the code, and the fallout discussions, on
> > transferring data on SYN packets.
> >
>
> Problem is this code will be delayed if we change net-next code in this
> area, because we'll have to rebase and retest everything.
Okay, I don't want to delay your work. We can wait with merging my cleanup
patches, and I can take the pain of rebasing them after your work is
merged. And then we will see if my performance patches have become
obsolete.
I'm going to post some updated v2 patches, just because I know some
people who are desperate for a quick solution to their DDoS issues, and
are willing to patch their kernels for production.
> > > Another way to mitigate SYN scalability issues before the full RCU
> > > solution I was cooking is to either :
> > >
> > > 1) Use a hardware filter (like on Intel NICS) to force all SYN packets
> > > going to one queue (so that they are all serviced on one CPU)
> > >
> > > 2) Tweak RPS (__skb_get_rxhash()) so that SYN packets rxhash is not
> > > dependent on src port/address, to get same effect (All SYN packets
> > > processed by one cpu). Note this only address the SYN flood problem, not
> > > the general 3WHS scalability one, since if real connection is
> > > established, the third packet (ACK from client) will have the 'real'
> > > rxhash and will be processed by another cpu.
> >
> > I don't like the idea of overloading one CPU with SYN packets. As the
> > attacker can still cause a DoS on new connections.
> >
>
> One CPU can handle more than one million SYN per second, while 32 cpus
> fighting on socket lock can not handle 1 % of this load.
I'm not sure one CPU can handle 1 Mpps on this particular path. And Hans
has some other measurements, although I'm assuming he has small CPUs.
But if you are working on the real solution, we don't need to discuss
this :-)
> If Intel chose to implement this hardware filter in their NIC, its for a
> good reason.
>
>
> > My "unlocked" parallel SYN cookie approach, should favor established
> > connections, as they are allowed to run under a BH lock, and thus don't
> > let new SYN packets in (on this CPU), until the establish conn packet is
> > finished. Unless I have misunderstood something... I think I have,
> > established connections have their own/seperate struck sock, and thus
> > this is another slock spinlock, right?. (Well let Eric bash me for
> > this ;-))
>
> It seems you forgot I have patches to have full parallelism, not only
> the SYNCOOKIE hack.
I'm very much looking forward to this :-)
> I am still polishing them, its a _long_ process, especially if network
> tree changes a lot.
>
> If you believe you can beat me on this, please let me know so that I can
> switch to other tasks.
I don't dare to go into that battle with the network ninja, I surrender.
DaveM, Eric's patches take precedence over mine...
/me Crawling back into my cave, and switching to boring bugzilla cases of
backporting kernel patches instead...
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
* [PATCH stable] l2tp: fix oops in L2TP IP sockets for connect() AF_UNSPEC case
From: James Chapman @ 2012-05-30 9:13 UTC (permalink / raw)
To: netdev; +Cc: levinsasha928, James Chapman
An application may call connect() to disconnect a socket using an
address with family AF_UNSPEC. The L2TP IP sockets were not handling
this case when the socket is not bound and an attempt to connect()
using AF_UNSPEC in such cases would result in an oops. This patch
addresses the problem by protecting the sk_prot->disconnect() call
against trying to unhash the socket before it is bound.
The patch also adds more checks that the sockaddr supplied to bind()
and connect() calls is valid.
RIP: 0010:[<ffffffff82e133b0>] [<ffffffff82e133b0>] inet_unhash+0x50/0xd0
RSP: 0018:ffff88001989be28 EFLAGS: 00010293
Stack:
ffff8800407a8000 0000000000000000 ffff88001989be78 ffffffff82e3a249
ffffffff82e3a050 ffff88001989bec8 ffff88001989be88 ffff8800407a8000
0000000000000010 ffff88001989bec8 ffff88001989bea8 ffffffff82e42639
Call Trace:
[<ffffffff82e3a249>] udp_disconnect+0x1f9/0x290
[<ffffffff82e42639>] inet_dgram_connect+0x29/0x80
[<ffffffff82d012fc>] sys_connect+0x9c/0x100
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
---
A version of this patch is already applied to the net tree.
net/l2tp/l2tp_ip.c | 30 ++++++++++++++++++++++++------
1 files changed, 24 insertions(+), 6 deletions(-)
diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index 6274f0b..cc8ad7b 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -251,9 +251,16 @@ static int l2tp_ip_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
struct inet_sock *inet = inet_sk(sk);
struct sockaddr_l2tpip *addr = (struct sockaddr_l2tpip *) uaddr;
- int ret = -EINVAL;
+ int ret;
int chk_addr_ret;
+ if (!sock_flag(sk, SOCK_ZAPPED))
+ return -EINVAL;
+ if (addr_len < sizeof(struct sockaddr_l2tpip))
+ return -EINVAL;
+ if (addr->l2tp_family != AF_INET)
+ return -EINVAL;
+
ret = -EADDRINUSE;
read_lock_bh(&l2tp_ip_lock);
if (__l2tp_ip_bind_lookup(&init_net, addr->l2tp_addr.s_addr, sk->sk_bound_dev_if, addr->l2tp_conn_id))
@@ -284,6 +291,8 @@ static int l2tp_ip_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len)
sk_del_node_init(sk);
write_unlock_bh(&l2tp_ip_lock);
ret = 0;
+ sock_reset_flag(sk, SOCK_ZAPPED);
+
out:
release_sock(sk);
@@ -304,13 +313,14 @@ static int l2tp_ip_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
__be32 saddr;
int oif, rc;
- rc = -EINVAL;
+ if (sock_flag(sk, SOCK_ZAPPED)) /* Must bind first - autobinding does not work */
+ return -EINVAL;
+
if (addr_len < sizeof(*lsa))
- goto out;
+ return -EINVAL;
- rc = -EAFNOSUPPORT;
if (lsa->l2tp_family != AF_INET)
- goto out;
+ return -EAFNOSUPPORT;
lock_sock(sk);
@@ -364,6 +374,14 @@ out:
return rc;
}
+static int l2tp_ip_disconnect(struct sock *sk, int flags)
+{
+ if (sock_flag(sk, SOCK_ZAPPED))
+ return 0;
+
+ return udp_disconnect(sk, flags);
+}
+
static int l2tp_ip_getname(struct socket *sock, struct sockaddr *uaddr,
int *uaddr_len, int peer)
{
@@ -599,7 +617,7 @@ static struct proto l2tp_ip_prot = {
.close = l2tp_ip_close,
.bind = l2tp_ip_bind,
.connect = l2tp_ip_connect,
- .disconnect = udp_disconnect,
+ .disconnect = l2tp_ip_disconnect,
.ioctl = udp_ioctl,
.destroy = l2tp_ip_destroy_sock,
.setsockopt = ip_setsockopt,
--
1.7.0.4
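The state logic the patch relies on can be modelled in a few lines of userspace C. This is only a sketch under assumptions: struct toy_sock, its fields and the toy_* functions are hypothetical stand-ins, with "zapped" playing the role of SOCK_ZAPPED and a call counter standing in for the real udp_disconnect().

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model; not kernel code. */
struct toy_sock {
	bool zapped;			/* true until a successful bind */
	bool connected;
	int udp_disconnect_calls;	/* times the real disconnect would run */
};

static void toy_init(struct toy_sock *sk)
{
	sk->zapped = true;
	sk->connected = false;
	sk->udp_disconnect_calls = 0;
}

static int toy_bind(struct toy_sock *sk)
{
	if (!sk->zapped)
		return -EINVAL;		/* already bound */
	sk->zapped = false;		/* mirrors sock_reset_flag() on success */
	return 0;
}

static int toy_connect(struct toy_sock *sk)
{
	if (sk->zapped)
		return -EINVAL;		/* must bind first */
	sk->connected = true;
	return 0;
}

static int toy_disconnect(struct toy_sock *sk)
{
	if (sk->zapped)
		return 0;		/* never bound: skip the real disconnect */
	sk->connected = false;
	sk->udp_disconnect_calls++;	/* stands in for udp_disconnect() */
	return 0;
}
```

In this model a disconnect on a never-bound socket returns 0 without reaching the counted call, which is the path the oops above went through before the fix.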
* Re: [PATCH] l2tp: fix oops in L2TP IP sockets for connect() AF_UNSPEC case
From: David Miller @ 2012-05-30 9:05 UTC (permalink / raw)
To: jchapman; +Cc: netdev, levinsasha928
In-Reply-To: <4FC5E022.6020609@katalix.com>
From: James Chapman <jchapman@katalix.com>
Date: Wed, 30 May 2012 09:53:54 +0100
> The patch doesn't apply to stable due to recent l2tp_ip changes (IPv6
> support) already merged. I'll spin a version for -stable.
That would be helpful, please do.
* Re: [PATCH] l2tp: fix oops in L2TP IP sockets for connect() AF_UNSPEC case
From: James Chapman @ 2012-05-30 8:53 UTC (permalink / raw)
To: David Miller; +Cc: netdev, levinsasha928
In-Reply-To: <20120529.172008.875375243438479060.davem@davemloft.net>
On 29/05/12 22:20, David Miller wrote:
> From: James Chapman <jchapman@katalix.com>
> Date: Tue, 29 May 2012 14:30:42 +0100
>
>> An application may call connect() to disconnect a socket using an
>> address with family AF_UNSPEC. The L2TP IP sockets were not handling
>> this case when the socket is not bound and an attempt to connect()
>> using AF_UNSPEC in such cases would result in an oops. This patch
>> addresses the problem by protecting the sk_prot->disconnect() call
>> against trying to unhash the socket before it is bound.
>>
>> The L2TP IPv4 and IPv6 sockets have the same problem. Both are fixed
>> by this patch.
>>
>> The patch also adds more checks that the sockaddr supplied to bind()
>> and connect() calls is valid.
>>
>> RIP: 0010:[<ffffffff82e133b0>] [<ffffffff82e133b0>] inet_unhash+0x50/0xd0
>> RSP: 0018:ffff88001989be28 EFLAGS: 00010293
>> Stack:
>> ffff8800407a8000 0000000000000000 ffff88001989be78 ffffffff82e3a249
>> ffffffff82e3a050 ffff88001989bec8 ffff88001989be88 ffff8800407a8000
>> 0000000000000010 ffff88001989bec8 ffff88001989bea8 ffffffff82e42639
>> Call Trace:
>> [<ffffffff82e3a249>] udp_disconnect+0x1f9/0x290
>> [<ffffffff82e42639>] inet_dgram_connect+0x29/0x80
>> [<ffffffff82d012fc>] sys_connect+0x9c/0x100
>>
>> Reported-by: Sasha Levin <levinsasha928@gmail.com>
>> Signed-off-by: James Chapman <jchapman@katalix.com>
>
> Applied and queued up for -stable, thanks James.
The patch doesn't apply to stable due to recent l2tp_ip changes (IPv6
support) already merged. I'll spin a version for -stable.
--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development
* Re: [RFC PATCH 0/2] Faster/parallel SYN handling to mitigate SYN floods
From: Christoph Paasch @ 2012-05-30 8:53 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netdev, Eric Dumazet, David S. Miller, Martin Topholm,
Florian Westphal, opurdila, Hans Schillstrom, Andi Kleen
In-Reply-To: <1338367497.7747.72.camel@localhost>
On 05/30/2012 10:44 AM, Jesper Dangaard Brouer wrote:
>> >
>> > Then the receiver will receive two SYN/ACK's for the same SYN with
>> > different sequence-numbers. As the "SYN cookie SYN-ACK" will arrive
>> > second, it will be discarded and seq-numbers from the first one will be
>> > taken on the client-side.
> I thought that the retransmitted SYN packet, were caused by the SYN-ACK
> didn't reach the client?
Or, if the SYN/ACK got somehow delayed in the network and the
SYN-retransmission timer on the client-side fires before the SYN/ACK
reaches the client.
Christoph
--
Christoph Paasch
PhD Student
IP Networking Lab --- http://inl.info.ucl.ac.be
MultiPath TCP in the Linux Kernel --- http://mptcp.info.ucl.ac.be
Université Catholique de Louvain
--
* Re: [RFC PATCH 0/2] Faster/parallel SYN handling to mitigate SYN floods
From: Eric Dumazet @ 2012-05-30 8:50 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: christoph.paasch, netdev, David S. Miller, Martin Topholm,
Florian Westphal, opurdila, Hans Schillstrom, Andi Kleen
In-Reply-To: <1338367497.7747.72.camel@localhost>
On Wed, 2012-05-30 at 10:44 +0200, Jesper Dangaard Brouer wrote:
> Choosing that code path, should be easy by simply returning 0 (no_limit)
> from my function tcp_v4_syn_conn_limit(), to indicate that the normal
> slow code path should be chosen.
>
> I guess this will not pose a big attack angle, as the entries in
> reqsk_queue will be fairly small.
Not sure what you mean.
I know some people have 64K entries in it.
(sk_ack_backlog / sk_max_ack_backlog being 16bits,
listen(fd, 65536 + 1) can give unexpected results)
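To illustrate the 16-bit remark: storing a larger value in an unsigned 16-bit field keeps only its low 16 bits. The sketch below shows just that truncation; stored_backlog() is a hypothetical helper, and any limit the kernel applies to the backlog before storing it is deliberately ignored here.

```c
#include <stdint.h>

/* Hypothetical helper: what a backlog value becomes when stored in a
 * 16-bit unsigned field, which is the width the message above gives
 * for sk_max_ack_backlog. */
static uint16_t stored_backlog(int requested)
{
	return (uint16_t)requested;	/* keeps only the low 16 bits */
}
```

Under this assumption a request of 65536 + 1 ends up stored as 1, which is the kind of unexpected result meant above.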
* Re: [RFC PATCH 0/2] Faster/parallel SYN handling to mitigate SYN floods
From: Jesper Dangaard Brouer @ 2012-05-30 8:44 UTC (permalink / raw)
To: christoph.paasch
Cc: netdev, Eric Dumazet, David S. Miller, Martin Topholm,
Florian Westphal, opurdila, Hans Schillstrom, Andi Kleen
In-Reply-To: <4FC53353.2050801@uclouvain.be>
On Tue, 2012-05-29 at 22:36 +0200, Christoph Paasch wrote:
[...cut...]
> >> Concerning (2):
> >>
> >> Imagine, a SYN coming in, when the reqsk-queue is not yet full. A
> >> request-sock will be added to the reqsk-queue. Then, a retransmission of
> >> this SYN comes in and the queue got full by the time. This time
> >> tcp_v4_syn_conn_limit will do syn-cookies and thus generate a different
> >> seq-number for the SYN/ACK.
> >
> > I have addressed your issue, by checking the reqsk_queue in
> > tcp_v4_syn_conn_limit() before allocating a new req via
> > inet_reqsk_alloc().
> > If I find an existing reqsk, I choose to drop it, so the SYN cookie
> > SYN-ACK takes precedence, as the path/handling of the last ACK doesn't
> > find this reqsk. This is done under the lock.
>
> Then the receiver will receive two SYN/ACK's for the same SYN with
> different sequence-numbers. As the "SYN cookie SYN-ACK" will arrive
> second, it will be discarded and seq-numbers from the first one will be
> taken on the client-side.
I thought that the retransmitted SYN packet was caused by the SYN-ACK
not reaching the client?
> Then, the connection will never establish, as both sides "agreed" on
> different sequence numbers.
>
> I would say, you have to handle the retransmitted SYN as in
> tcp_v4_hnd_req by calling tcp_check_req.
Choosing that code path should be easy: simply return 0 (no_limit)
from my function tcp_v4_syn_conn_limit() to indicate that the normal
slow code path should be chosen.
I guess this will not pose a big attack angle, as the entries in
reqsk_queue will be fairly small.
* Re: Strange latency spikes/TX network stalls on Sun Fire X4150(x86) and e1000e
From: Eric Dumazet @ 2012-05-30 8:40 UTC (permalink / raw)
To: Hiroaki SHIMODA
Cc: Tom Herbert, Denys Fedoryshchenko, netdev, e1000-devel,
jeffrey.t.kirsher, jesse.brandeburg, davem
In-Reply-To: <20120530090602.6204d857.shimoda.hiroaki@gmail.com>
On Wed, 2012-05-30 at 09:06 +0900, Hiroaki SHIMODA wrote:
> While reading the bql code, I have some questions.
>
> 1) dql_completed() and dql_queued() can be called concurrently,
> so dql->num_queued could change while processing
> dql_completed().
> Is it intentional to refer num_queued from "dql->" each time ?
>
Not sure it can cause problems, but doing the read once is indeed a good
plan.
> 2) From the comment in the code
> * - The queue was over-limit in the previous interval and
> * when enqueuing it was possible that all queued data
> * had been consumed.
>
> and
>
> * Queue was not starved, check if the limit can be decreased.
> * A decrease is only considered if the queue has been busy in
> * the whole interval (the check above).
>
> the calculation of all_prev_completed should take into account
> completed == dql->prev_num_queued case ?
> On current implementation, limit shrinks easily and some NIC
> hit TX stalls.
> To mitigate TX stalls, should we fix all_prev_completed rather
> than individual driver ?
>
Not sure what you mean
> 3) limit calculation fails to consider integer wrap around in
> one place ?
>
Yes
> Here is the patch what I meant.
>
> diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
> @@ -11,22 +11,27 @@
> #include <linux/dynamic_queue_limits.h>
>
> #define POSDIFF(A, B) ((A) > (B) ? (A) - (B) : 0)
> +#define POSDIFFI(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0)
> +#define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)
>
> /* Records completed count and recalculates the queue limit */
> void dql_completed(struct dql *dql, unsigned int count)
> {
> unsigned int inprogress, prev_inprogress, limit;
> - unsigned int ovlimit, all_prev_completed, completed;
> + unsigned int ovlimit, completed, num_queued;
> + bool all_prev_completed;
> +
> + num_queued = dql->num_queued;
I suggest:
num_queued = ACCESS_ONCE(dql->num_queued);
Otherwise the compiler is free to do whatever it wants, including reading
dql->num_queued more than once.
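The reason the quoted patch adds POSDIFFI() and AFTER_EQ() can be illustrated with a minimal userspace sketch. This is not kernel code: the function names are hypothetical, and it assumes the two counters are never more than INT_MAX apart and that converting an out-of-range unsigned value to int wraps in the usual two's-complement way (as it does with gcc and clang).

```c
#include <assert.h>
#include <limits.h>

/* Same logic as POSDIFF(): compares raw values, so it breaks across a wrap. */
static unsigned int posdiff(unsigned int a, unsigned int b)
{
	return a > b ? a - b : 0;
}

/* Same logic as POSDIFFI(): looks at the sign of the wrapped difference. */
static unsigned int posdiffi(unsigned int a, unsigned int b)
{
	return (int)(a - b) > 0 ? a - b : 0;
}

/* Same logic as AFTER_EQ(): true if a is at or after b in sequence order. */
static int after_eq(unsigned int a, unsigned int b)
{
	return (int)(a - b) >= 0;
}

/* Self-check: a counter that wrapped from UINT_MAX - 1 to 3 is 5 steps ahead. */
static void wraparound_demo(void)
{
	unsigned int before = UINT_MAX - 1u;
	unsigned int after = before + 5u;	/* wraps around to 3 */

	assert(posdiff(after, before) == 0u);	/* plain compare misses the progress */
	assert(posdiffi(after, before) == 5u);	/* signed difference sees it */
	assert(after_eq(after, before) && !after_eq(before, after));
}
```

Calling wraparound_demo() exercises the wrapped case; for counters that have not wrapped, posdiff() and posdiffi() agree.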
* Re: [RFC PATCH 2/2] tcp: Early SYN limit and SYN cookie handling to mitigate SYN floods
From: Eric Dumazet @ 2012-05-30 8:24 UTC (permalink / raw)
To: Hans Schillstrom
Cc: Andi Kleen, Jesper Dangaard Brouer, Jesper Dangaard Brouer,
netdev@vger.kernel.org, Christoph Paasch, David S. Miller,
Martin Topholm, Florian Westphal, Tom Herbert
In-Reply-To: <201205301013.10797.hans.schillstrom@ericsson.com>
On Wed, 2012-05-30 at 10:03 +0200, Hans Schillstrom wrote:
> We have this option running right now, and it gave slightly higher values.
> The upside is only one core is running at 100% load.
>
> To be able to process more SYN an attempt was made to spread them with RPS to
> 2 other cores gave 60% more SYN:s per sec
> i.e. syn filter in NIC sending all irq:s to one core gave ~ 52k syn. pkts/sec
> adding RPS and sending syn to two other core:s gave ~80k syn. pkts/sec
> Adding more cores than two didn't help that much.
When you say 52,000 pkt/s, is that for fully established sockets, or
SYNFLOOD?
19.23 us (1 s / 52,000) to handle _one_ SYN message seems pretty wrong to me,
if there is no contention on the listener socket.
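The per-packet figure follows directly from the packet rate. A minimal sketch of that arithmetic, with a hypothetical helper name:

```c
#include <assert.h>

/* Average time budget per packet, in microseconds, at a given packet rate. */
static double usec_per_packet(double packets_per_sec)
{
	assert(packets_per_sec > 0.0);
	return 1e6 / packets_per_sec;
}
```

usec_per_packet(52000.0) gives roughly 19.23, the figure quoted above, and usec_per_packet(80000.0) gives 12.5 for the ~80k pkts/sec RPS result.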
* Re: [RFC PATCH 2/2] tcp: Early SYN limit and SYN cookie handling to mitigate SYN floods
From: Eric Dumazet @ 2012-05-30 8:15 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Andi Kleen, netdev, Christoph Paasch, David S. Miller,
Martin Topholm, Florian Westphal, opurdila, Hans Schillstrom,
Tom Herbert
In-Reply-To: <1338363926.7747.55.camel@localhost>
On Wed, 2012-05-30 at 09:45 +0200, Jesper Dangaard Brouer wrote:
> Sounds interesting, but TCP Fast Open is primarily concerned with
> enabling data exchange during SYN establishment. I don't see any
> indication that they have implemented parallel SYN handling.
>
Not at all. The main goal of TCP Fast Open is to allow connection establishment
with a single packet (thus removing one RTT). This also removes the
whole idea of having half-sockets (in SYN_RECV state).
Then, allowing DATA in the SYN packet is an extra bonus, only if the
whole request can fit in the packet (it is unlikely for typical HTTP
requests).
> Implementing parallel SYN handling, should also benefit their work.
Why do you think I am working on this? Hint: I am their coworker at Google.
> After studying this code path, I also see great performance benefit in
> also optimizing the normal 3WHS on sock's in sk_state == LISTEN.
> Perhaps we should split up the code path for LISTEN vs. ESTABLISHED, as
> they are very entangled at the moment AFAIKS.
>
> > Yuchung Cheng and Jerry Chu should upstream this code in a very near
> > future.
>
> Looking forward to see the code, and the fallout discussions, on
> transferring data on SYN packets.
>
Problem is this code will be delayed if we change net-next code in this
area, because we'll have to rebase and retest everything.
>
> > Another way to mitigate SYN scalability issues before the full RCU
> > solution I was cooking is to either :
> >
> > 1) Use a hardware filter (like on Intel NICS) to force all SYN packets
> > going to one queue (so that they are all serviced on one CPU)
> >
> > 2) Tweak RPS (__skb_get_rxhash()) so that SYN packets rxhash is not
> > dependent on src port/address, to get same effect (All SYN packets
> > processed by one cpu). Note this only address the SYN flood problem, not
> > the general 3WHS scalability one, since if real connection is
> > established, the third packet (ACK from client) will have the 'real'
> > rxhash and will be processed by another cpu.
>
> I don't like the idea of overloading one CPU with SYN packets. As the
> attacker can still cause a DoS on new connections.
>
One CPU can handle more than one million SYN per second, while 32 CPUs
fighting over the socket lock cannot handle 1% of this load.
If Intel chose to implement this hardware filter in their NICs, it's for a
good reason.
> My "unlocked" parallel SYN cookie approach, should favor established
> connections, as they are allowed to run under a BH lock, and thus don't
> let new SYN packets in (on this CPU), until the establish conn packet is
> finished. Unless I have misunderstood something... I think I have,
> established connections have their own/separate struct sock, and thus
> this is another slock spinlock, right?. (Well let Eric bash me for
> this ;-))
It seems you forgot I have patches to get full parallelism, not only
the SYNCOOKIE hack.
I am still polishing them; it's a _long_ process, especially if the network
tree changes a lot.
If you believe you can beat me on this, please let me know so that I can
switch to other tasks.
* Re: [RFC PATCH 2/2] tcp: Early SYN limit and SYN cookie handling to mitigate SYN floods
From: Hans Schillstrom @ 2012-05-30 8:03 UTC (permalink / raw)
To: Eric Dumazet
Cc: Andi Kleen, Jesper Dangaard Brouer, Jesper Dangaard Brouer,
netdev@vger.kernel.org, Christoph Paasch, David S. Miller,
Martin Topholm, Florian Westphal, opurdila@ixiacom.com,
Tom Herbert
In-Reply-To: <1338360073.2760.81.camel@edumazet-glaptop>
On Wednesday 30 May 2012 08:41:13 Eric Dumazet wrote:
> On Tue, 2012-05-29 at 12:37 -0700, Andi Kleen wrote:
>
> > So basically handling syncookie lockless?
> >
> > Makes sense. Syncookies is a bit obsolete these days of course, due
> > to the lack of options. But may be still useful for this.
> >
> > Obviously you'll need to clean up the patch and support IPv6,
> > but the basic idea looks good to me.
>
> Also TCP Fast Open should be a good way to make the SYN flood no more
> effective.
>
> Yuchung Cheng and Jerry Chu should upstream this code in a very near
> future.
>
> Another way to mitigate SYN scalability issues before the full RCU
> solution I was cooking is to either :
>
> 1) Use a hardware filter (like on Intel NICS) to force all SYN packets
> going to one queue (so that they are all serviced on one CPU)
We have this option running right now, and it gave slightly higher values.
The upside is that only one core is running at 100% load.
To be able to process more SYNs, we tried spreading them with RPS to
2 other cores, which gave 60% more SYNs per second:
- SYN filter in the NIC sending all IRQs to one core: ~52k SYN pkts/sec
- adding RPS and sending SYNs to two other cores: ~80k SYN pkts/sec
Adding more than two cores didn't help much.
> 2) Tweak RPS (__skb_get_rxhash()) so that SYN packets rxhash is not
> dependent on src port/address, to get same effect (All SYN packets
> processed by one cpu). Note this only address the SYN flood problem, not
> the general 3WHS scalability one, since if real connection is
> established, the third packet (ACK from client) will have the 'real'
> rxhash and will be processed by another cpu.
Neither the NIC's SYN filter nor this scales that well.
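The SYN-only test that option 2 depends on can be sketched in userspace C. This is only an illustration: the macro and function names are hypothetical, while the flag values (SYN = 0x02, ACK = 0x10) and the flags byte at offset 13 of the TCP header match what the quoted patch reads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_TCP_SYN		0x02u
#define SKETCH_TCP_ACK		0x10u
#define SKETCH_TCP_FLAGS_OFFSET	13u	/* flags byte within the TCP header */

/* True for an initial SYN (SYN set, ACK clear), false for SYN-ACK or plain ACK. */
static int is_initial_syn(uint8_t tcpflags)
{
	return (tcpflags & (SKETCH_TCP_SYN | SKETCH_TCP_ACK)) == SKETCH_TCP_SYN;
}

/* Same test on a raw TCP header; returns 0 if the header is too short. */
static int header_is_initial_syn(const uint8_t *tcp_hdr, size_t len)
{
	assert(tcp_hdr != NULL);
	if (len <= SKETCH_TCP_FLAGS_OFFSET)
		return 0;
	return is_initial_syn(tcp_hdr[SKETCH_TCP_FLAGS_OFFSET]);
}
```

Only packets for which this test is true would be hashed on destination address and port alone; the client's final ACK fails the test and keeps the normal rxhash.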
> (Of course, RPS must be enabled to benefit from this)
>
> Untested patch to get the idea :
>
> include/net/flow_keys.h | 1 +
> net/core/dev.c | 8 ++++++++
> net/core/flow_dissector.c | 9 +++++++++
> 3 files changed, 18 insertions(+)
>
> diff --git a/include/net/flow_keys.h b/include/net/flow_keys.h
> index 80461c1..b5bae21 100644
> --- a/include/net/flow_keys.h
> +++ b/include/net/flow_keys.h
> @@ -10,6 +10,7 @@ struct flow_keys {
> __be16 port16[2];
> };
> u8 ip_proto;
> + u8 tcpflags;
> };
>
> extern bool skb_flow_dissect(const struct sk_buff *skb, struct flow_keys *flow);
> diff --git a/net/core/dev.c b/net/core/dev.c
> index cd09819..c9c039e 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -135,6 +135,7 @@
> #include <linux/net_tstamp.h>
> #include <linux/static_key.h>
> #include <net/flow_keys.h>
> +#include <net/tcp.h>
>
> #include "net-sysfs.h"
>
> @@ -2614,6 +2615,12 @@ void __skb_get_rxhash(struct sk_buff *skb)
> return;
>
> if (keys.ports) {
> + if ((keys.tcpflags & (TCPHDR_SYN | TCPHDR_ACK)) == TCPHDR_SYN) {
> + hash = jhash_2words((__force u32)keys.dst,
> + (__force u32)keys.port16[1],
> + hashrnd);
> + goto end;
> + }
> if ((__force u16)keys.port16[1] < (__force u16)keys.port16[0])
> swap(keys.port16[0], keys.port16[1]);
> skb->l4_rxhash = 1;
> @@ -2626,6 +2633,7 @@ void __skb_get_rxhash(struct sk_buff *skb)
> hash = jhash_3words((__force u32)keys.dst,
> (__force u32)keys.src,
> (__force u32)keys.ports, hashrnd);
> +end:
> if (!hash)
> hash = 1;
>
> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> index a225089..cd4aedf 100644
> --- a/net/core/flow_dissector.c
> +++ b/net/core/flow_dissector.c
> @@ -137,6 +137,15 @@ ipv6:
> ports = skb_header_pointer(skb, nhoff, sizeof(_ports), &_ports);
> if (ports)
> flow->ports = *ports;
> + if (ip_proto == IPPROTO_TCP) {
> + __u8 *tcpflags, _tcpflags;
> +
> + tcpflags = skb_header_pointer(skb, nhoff + 13,
> + sizeof(_tcpflags),
> + &_tcpflags);
> + if (tcpflags)
> + flow->tcpflags = *tcpflags;
> + }
> }
>
> return true;
>
>
>
--
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>
* Re: [RFC PATCH 1/4] inet: add counter to inet_bind_hashbucket
From: Eric Dumazet @ 2012-05-30 8:00 UTC (permalink / raw)
To: Alexandru Copot
Cc: davem, gerrit, kuznet, jmorris, yoshfuji, kaber, netdev,
Daniel Baluta, Lucian Grijincu
In-Reply-To: <1338363410-6562-2-git-send-email-alex.mihai.c@gmail.com>
On Wed, 2012-05-30 at 10:36 +0300, Alexandru Copot wrote:
> The counter will be used by the upcoming INET lookup algorithm to
> choose the shortest chain after secondary hash is added.
>
> Signed-off-by: Alexandru Copot <alex.mihai.c@gmail.com>
> Cc: Daniel Baluta <dbaluta@ixiacom.com>
> Cc: Lucian Grijincu <lucian.grijincu@gmail.com>
> ---
> include/net/inet_hashtables.h | 4 +++-
> include/net/inet_timewait_sock.h | 4 +++-
> net/dccp/proto.c | 1 +
> net/ipv4/inet_hashtables.c | 9 ++++++---
> net/ipv4/inet_timewait_sock.c | 7 ++++---
> net/ipv4/tcp.c | 1 +
> 6 files changed, 18 insertions(+), 8 deletions(-)
>
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index 808fc5f..8c6addc 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -98,6 +98,7 @@ static inline struct net *ib_net(struct inet_bind_bucket *ib)
> struct inet_bind_hashbucket {
> spinlock_t lock;
> struct hlist_head chain;
> + unsigned int count;
> };
>
Are you still using a 32-bit kernel?
Better use:
struct inet_bind_hashbucket {
	spinlock_t lock;
	unsigned int count;
	struct hlist_head chain;
};
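The likely motivation for this ordering is structure padding on 64-bit builds. The userspace sketch below uses hypothetical stand-in types, assuming a 4-byte lock (spinlock_t can be larger with lock debugging enabled) and a single-pointer list head.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_lock { uint32_t raw; };		/* stand-in for spinlock_t, assumed 4 bytes */
struct sketch_hlist_head { void *first; };	/* stand-in for struct hlist_head */

/* Layout from the patch: count appended after the pointer. */
struct bucket_count_last {
	struct sketch_lock lock;
	struct sketch_hlist_head chain;
	unsigned int count;
};

/* Suggested layout: count sits right after the lock. */
struct bucket_count_after_lock {
	struct sketch_lock lock;
	unsigned int count;
	struct sketch_hlist_head chain;
};

/* Bytes saved per bucket by the suggested ordering (0 where nothing is gained). */
static size_t bytes_saved_by_reordering(void)
{
	assert(sizeof(struct bucket_count_after_lock) <=
	       sizeof(struct bucket_count_last));
	return sizeof(struct bucket_count_last) -
	       sizeof(struct bucket_count_after_lock);
}
```

Under these assumptions, with 8-byte pointers the first layout takes 24 bytes and the second 16, because count fills the hole between the lock and the pointer. With 4-byte pointers both take 12 bytes, which is presumably why the question about a 32-bit kernel comes first.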
* Re: [RFC PATCH 0/4] inet: add second hash table
From: Eric Dumazet @ 2012-05-30 7:57 UTC (permalink / raw)
To: Alexandru Copot
Cc: davem, gerrit, kuznet, jmorris, yoshfuji, kaber, netdev,
Daniel Baluta, Lucian Grijincu
In-Reply-To: <1338363410-6562-1-git-send-email-alex.mihai.c@gmail.com>
On Wed, 2012-05-30 at 10:36 +0300, Alexandru Copot wrote:
> This patchset implements all the operations needed to use a second
> (port,address) bind hash table for inet. It uses a similar approach
> as the UDP implementation.
>
> The performance improvements for port allocation are very good and
> detailed in the last message.
>
> This is based on a series of patches written by Lucian Grijincu at Ixia.
>
> Signed-off-by: Alexandru Copot <alex.mihai.c@gmail.com>
> Cc: Daniel Baluta <dbaluta@ixiacom.com>
> Cc: Lucian Grijincu <lucian.grijincu@gmail.com>
> ---
> Alexandru Copot (4):
> inet: add counter to inet_bind_hashbucket
> inet: add a second bind hash
> inet: add/remove inet buckets in the second bind hash
> inet: use second hash in inet_csk_get_port
>
> include/net/inet_hashtables.h | 140 +++++++++++++++++++++++++++++++--
> include/net/inet_timewait_sock.h | 5 +-
> net/dccp/proto.c | 37 ++++++++-
> net/ipv4/inet_connection_sock.c | 66 ++++++++--------
> net/ipv4/inet_hashtables.c | 158 ++++++++++++++++++++++++++++++++++++--
> net/ipv4/inet_timewait_sock.c | 16 ++--
> net/ipv4/tcp.c | 17 ++++
> net/ipv6/inet6_hashtables.c | 95 +++++++++++++++++++++++
> 8 files changed, 477 insertions(+), 57 deletions(-)
It's a huge change (with many details to look at), for a need that is yet to be
understood.
What sensible workload needs this at all?
* Re: [RFC PATCH 2/2] tcp: Early SYN limit and SYN cookie handling to mitigate SYN floods
From: Jesper Dangaard Brouer @ 2012-05-30 7:45 UTC (permalink / raw)
To: Eric Dumazet
Cc: Andi Kleen, netdev, Christoph Paasch, David S. Miller,
Martin Topholm, Florian Westphal, opurdila, Hans Schillstrom,
Tom Herbert
In-Reply-To: <1338360073.2760.81.camel@edumazet-glaptop>
On Wed, 2012-05-30 at 08:41 +0200, Eric Dumazet wrote:
> On Tue, 2012-05-29 at 12:37 -0700, Andi Kleen wrote:
>
> > So basically handling syncookie lockless?
> >
> > Makes sense. Syncookies is a bit obsolete these days of course, due
> > to the lack of options. But may be still useful for this.
> >
> > Obviously you'll need to clean up the patch and support IPv6,
> > but the basic idea looks good to me.
>
> Also TCP Fast Open should be a good way to make the SYN flood no more
> effective.
Sounds interesting, but TCP Fast Open is primarily concerned with
enabling data exchange during SYN establishment. I don't see any
indication that they have implemented parallel SYN handling.
Implementing parallel SYN handling should also benefit their work.
After studying this code path, I also see a great performance benefit in
optimizing the normal 3WHS on socks in sk_state == LISTEN.
Perhaps we should split up the code path for LISTEN vs. ESTABLISHED, as
they are very entangled at the moment AFAIKS.
> Yuchung Cheng and Jerry Chu should upstream this code in a very near
> future.
Looking forward to seeing the code, and the fallout discussions, on
transferring data on SYN packets.
> Another way to mitigate SYN scalability issues before the full RCU
> solution I was cooking is to either :
>
> 1) Use a hardware filter (like on Intel NICS) to force all SYN packets
> going to one queue (so that they are all serviced on one CPU)
>
> 2) Tweak RPS (__skb_get_rxhash()) so that SYN packets rxhash is not
> dependent on src port/address, to get same effect (All SYN packets
> processed by one cpu). Note this only address the SYN flood problem, not
> the general 3WHS scalability one, since if real connection is
> established, the third packet (ACK from client) will have the 'real'
> rxhash and will be processed by another cpu.
I don't like the idea of overloading one CPU with SYN packets, as the
attacker can still cause a DoS on new connections.
My "unlocked" parallel SYN cookie approach should favor established
connections, as they are allowed to run under a BH lock, and thus don't
let new SYN packets in (on this CPU) until the established-connection packet is
finished. Unless I have misunderstood something... I think I have:
established connections have their own/separate struct sock, and thus
this is another slock spinlock, right? (Well, let Eric bash me for
this ;-))
[...cut...]
* [RFC PATCH 4/4] inet: use second hash in inet_csk_get_port
From: Alexandru Copot @ 2012-05-30 7:36 UTC (permalink / raw)
To: davem
Cc: gerrit, kuznet, jmorris, yoshfuji, kaber, netdev, Alexandru Copot,
Daniel Baluta, Lucian Grijincu
In-Reply-To: <1338363410-6562-1-git-send-email-alex.mihai.c@gmail.com>
This results in a massive improvement when there are many sockets
bound to the same port, but different addresses for both bind() and
listen() system calls (both call inet_csk_get_port).
Tests were run with 16000 subinterfaces each with a distinct
IPv4 address. The sockets are first bound to the same port and
then put on listen().
* Without patch and without SO_REUSEADDR:
* bind: 1.543 s
* listen: 3.050 s
* Without patch and with SO_REUSEADDR set:
* bind: 0.066 s
* listen: 3.050 s
* With patch and SO_REUSEADDR set / without SO_REUSEADDR:
* bind: 0.066 s
* listen: 0.095 s
Signed-off-by: Alexandru Copot <alex.mihai.c@gmail.com>
Cc: Daniel Baluta <dbaluta@ixiacom.com>
Cc: Lucian Grijincu <lucian.grijincu@gmail.com>
---
include/net/inet_hashtables.h | 48 +++++++++++++++
net/ipv4/inet_connection_sock.c | 63 ++++++++------------
net/ipv4/inet_hashtables.c | 125 ++++++++++++++++++++++++++++++++++++++-
net/ipv6/inet6_hashtables.c | 95 +++++++++++++++++++++++++++++
4 files changed, 292 insertions(+), 39 deletions(-)
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index bc06168..2f589bb 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -81,6 +81,15 @@ struct inet_bind_bucket {
struct net *ib_net;
#endif
unsigned short port;
+ union {
+ struct in6_addr ib_addr_ipv6;
+ struct {
+ __be32 _1;
+ __be32 _2;
+ __be32 _3;
+ __be32 ib_addr_ipv4;
+ };
+ };
signed short fastreuse;
int num_owners;
struct hlist_node node;
@@ -226,6 +235,7 @@ static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo)
extern struct inet_bind_bucket *
inet_bind_bucket_create(struct kmem_cache *cachep,
+ struct sock *sk,
struct net *net,
struct inet_bind_hashbucket *head,
struct inet_bind_hashbucket *portaddr_head,
@@ -257,6 +267,14 @@ static inline struct inet_bind_hashbucket *
return &hinfo->portaddr_bhash[h & (hinfo->portaddr_bhash_size - 1)];
}
+
+struct inet_bind_bucket *
+inet4_find_bind_buckets(struct sock *sk,
+ unsigned short port,
+ struct inet_bind_hashbucket **p_bhead,
+ struct inet_bind_hashbucket **p_portaddr_bhead);
+
+
#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
static inline unsigned int inet6_portaddr_bhashfn(struct net *net,
const struct in6_addr *addr6,
@@ -283,6 +301,14 @@ static inline struct inet_bind_hashbucket *
unsigned int h = inet6_portaddr_bhashfn(net, addr6, port);
return &hinfo->portaddr_bhash[h & (hinfo->portaddr_bhash_size - 1)];
}
+
+
+struct inet_bind_bucket *
+ inet6_find_bind_buckets(struct sock *sk,
+ unsigned short port,
+ struct inet_bind_hashbucket **p_bhead,
+ struct inet_bind_hashbucket **p_portaddr_bhead);
+
#endif
@@ -306,6 +332,28 @@ static inline struct inet_bind_hashbucket *
return inet4_portaddr_hashbucket(hinfo, net, INADDR_ANY, port);
}
+
+static inline struct inet_bind_bucket *
+ inet_find_bind_buckets(struct sock *sk,
+ unsigned short port,
+ struct inet_bind_hashbucket **p_bhead,
+ struct inet_bind_hashbucket **p_portaddr_bhead)
+{
+ switch (sk->sk_family) {
+ case AF_INET:
+ return inet4_find_bind_buckets(sk, port, p_bhead,
+ p_portaddr_bhead);
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+ case AF_INET6:
+ return inet6_find_bind_buckets(sk, port, p_bhead,
+ p_portaddr_bhead);
+#endif
+ }
+ WARN(1, "unrecognised sk->sk_family in inet_find_bind_buckets");
+ return NULL;
+}
+
+
extern void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
const unsigned short snum);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 336531a..bd92466 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -100,8 +100,7 @@ EXPORT_SYMBOL_GPL(inet_csk_bind_conflict);
int inet_csk_get_port(struct sock *sk, unsigned short snum)
{
struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
- struct inet_bind_hashbucket *head;
- struct hlist_node *node;
+ struct inet_bind_hashbucket *head, *portaddr_bhead;
struct inet_bind_bucket *tb;
int ret, attempts = 5;
struct net *net = sock_net(sk);
@@ -120,31 +119,26 @@ again:
do {
if (inet_is_reserved_local_port(rover))
goto next_nolock;
- head = &hashinfo->bhash[inet_bhashfn(net, rover,
- hashinfo->bhash_size)];
- spin_lock(&head->lock);
- inet_bind_bucket_for_each(tb, node, &head->chain)
- if (net_eq(ib_net(tb), net) && tb->port == rover) {
- if (tb->fastreuse > 0 &&
- sk->sk_reuse &&
- sk->sk_state != TCP_LISTEN &&
- (tb->num_owners < smallest_size || smallest_size == -1)) {
- smallest_size = tb->num_owners;
- smallest_rover = rover;
- if (atomic_read(&hashinfo->bsockets) > (high - low) + 1 &&
- !inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, false)) {
- snum = smallest_rover;
- goto tb_found;
- }
- }
- if (!inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, false)) {
- snum = rover;
- goto tb_found;
- }
- goto next;
+
+ tb = inet_find_bind_buckets(sk, rover, &head, &portaddr_bhead);
+ if (!tb)
+ break;
+ if (tb->fastreuse > 0 && sk->sk_reuse &&
+ sk->sk_state != TCP_LISTEN &&
+ (tb->num_owners < smallest_size || smallest_size == -1)) {
+ smallest_size = tb->num_owners;
+ smallest_rover = rover;
+ if (atomic_read(&hashinfo->bsockets) > (high - low) + 1 &&
+ !inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, false)) {
+ snum = smallest_rover;
+ goto tb_found;
}
- break;
- next:
+ }
+ if (!inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, false)) {
+ snum = rover;
+ goto tb_found;
+ }
+ spin_unlock(&portaddr_bhead->lock);
spin_unlock(&head->lock);
next_nolock:
if (++rover > high)
@@ -171,12 +165,9 @@ again:
snum = rover;
} else {
have_snum:
- head = &hashinfo->bhash[inet_bhashfn(net, snum,
- hashinfo->bhash_size)];
- spin_lock(&head->lock);
- inet_bind_bucket_for_each(tb, node, &head->chain)
- if (net_eq(ib_net(tb), net) && tb->port == snum)
- goto tb_found;
+ tb = inet_find_bind_buckets(sk, snum, &head, &portaddr_bhead);
+ if (tb)
+ goto tb_found;
}
tb = NULL;
goto tb_not_found;
@@ -194,6 +185,7 @@ tb_found:
if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
smallest_size != -1 && --attempts >= 0) {
+ spin_unlock(&portaddr_bhead->lock);
spin_unlock(&head->lock);
goto again;
}
@@ -205,12 +197,8 @@ tb_found:
tb_not_found:
ret = 1;
if (!tb) {
- struct inet_bind_hashbucket *portaddr_head;
- portaddr_head = inet_portaddr_hashbucket(hashinfo, sk, snum);
- spin_lock(&portaddr_head->lock);
tb = inet_bind_bucket_create(hashinfo->bind_bucket_cachep,
- net, head, portaddr_head, snum);
- spin_unlock(&portaddr_head->lock);
+ sk, net, head, portaddr_bhead, snum);
if (!tb)
goto fail_unlock;
}
@@ -229,6 +217,7 @@ success:
ret = 0;
fail_unlock:
+ spin_unlock(&portaddr_bhead->lock);
spin_unlock(&head->lock);
fail:
local_bh_enable();
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index edb2a4e..26c7f9d 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -29,6 +29,7 @@
* The bindhash mutex for snum's hash chain must be held here.
*/
struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
+ struct sock *sk,
struct net *net,
struct inet_bind_hashbucket *head,
struct inet_bind_hashbucket *portaddr_head,
@@ -37,6 +38,32 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
struct inet_bind_bucket *tb = kmem_cache_alloc(cachep, GFP_ATOMIC);
if (tb != NULL) {
+ switch (sk->sk_family) {
+ case AF_INET:
+ /* ::ffff:x.y.z.t is the IPv4-mapped IPv6 address for
+ * IPv4 address x.y.z.t, but only if it's not the any addr */
+ if (INADDR_ANY == sk_rcv_saddr(sk))
+ memset(&tb->ib_addr_ipv6, 0, sizeof(struct in6_addr));
+ else
+ ipv6_addr_set(&tb->ib_addr_ipv6, 0, 0,
+ htonl(0x0000FFFF),
+ sk_rcv_saddr(sk));
+
+ /* if no alignment problems appear, the IPv4 address
+ * should be written to ib_addr_ipv6. If this gets
+ * triggered check the inet_bind_bucket structure. */
+ WARN_ON(tb->ib_addr_ipv4 != sk_rcv_saddr(sk));
+ break;
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+ case AF_INET6:
+ memcpy(&tb->ib_addr_ipv6, &inet6_sk(sk)->rcv_saddr,
+ sizeof(struct in6_addr));
+ break;
+#endif
+ default:
+ WARN(1, "unrecognised sk_family in inet_bind_bucket_create");
+ }
+
write_pnet(&tb->ib_net, hold_net(net));
tb->port = snum;
tb->fastreuse = 0;
@@ -142,8 +169,10 @@ int __inet_inherit_port(struct sock *sk, struct sock *child)
break;
}
if (!node) {
+ portaddr_head = inet_portaddr_hashbucket(table, sk, port);
+
tb = inet_bind_bucket_create(table->bind_bucket_cachep,
- sock_net(sk), head,
+ sk, sock_net(sk), head,
portaddr_head, port);
if (!tb) {
spin_unlock(&head->lock);
@@ -521,7 +550,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
portaddr_head = inet_portaddr_hashbucket(hinfo, sk, port);
spin_lock(&portaddr_head->lock);
tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep,
- net, head, portaddr_head, port);
+ sk, net, head, portaddr_head, port);
spin_unlock(&portaddr_head->lock);
if (!tb) {
@@ -584,6 +613,98 @@ out:
}
}
+struct inet_bind_bucket *
+inet4_find_bind_buckets(struct sock *sk,
+ unsigned short port,
+ struct inet_bind_hashbucket **p_bhead,
+ struct inet_bind_hashbucket **p_portaddr_bhead)
+{
+ struct net *net = sock_net(sk);
+ struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
+ struct inet_bind_bucket *tb = NULL;
+ struct hlist_node *node;
+
+ struct inet_bind_hashbucket *bhead, *portaddr_bhead, *portaddrany_bhead;
+ bhead = &hinfo->bhash[inet_bhashfn(net, port, hinfo->bhash_size)];
+ portaddr_bhead = inet4_portaddr_hashbucket(hinfo, net,
+ sk_rcv_saddr(sk), port);
+ portaddrany_bhead = inet4_portaddr_hashbucket(hinfo, net,
+ INADDR_ANY, port);
+
+ *p_portaddr_bhead = portaddr_bhead;
+ *p_bhead = bhead;
+
+ /*
+ * prevent dead locks by always taking locks in a fixed order:
+ * - always take the port-only lock first. This is done because in some
+ * other places this is the lock taken, being followed in only some
+ * cases by the portaddr lock.
+ * - between portaddr and portaddrany always choose the one with the
+ * lower address. Unlock ordering is not important, as long as the
+ * locking order is consistent.
+ * - make sure to not take the same lock twice
+ */
+ spin_lock(&bhead->lock);
+ if (portaddr_bhead > portaddrany_bhead) {
+ spin_lock(&portaddrany_bhead->lock);
+ spin_lock(&portaddr_bhead->lock);
+ } else if (portaddr_bhead < portaddrany_bhead) {
+ spin_lock(&portaddr_bhead->lock);
+ spin_lock(&portaddrany_bhead->lock);
+ } else {
+ spin_lock(&portaddr_bhead->lock);
+ }
+
+ if (sk_rcv_saddr(sk) != INADDR_ANY) {
+ struct inet_bind_hashbucket *_head;
+
+ _head = portaddr_bhead;
+ if (bhead->count < portaddr_bhead->count) {
+ _head = bhead;
+ inet_bind_bucket_for_each(tb, node, &_head->chain)
+ if ((net_eq(ib_net(tb), net)) &&
+ (tb->port == port) &&
+ (tb->ib_addr_ipv4 == sk_rcv_saddr(sk)))
+ goto found;
+ } else {
+ inet_portaddr_bind_bucket_for_each(tb, node, &_head->chain)
+ if ((net_eq(ib_net(tb), net)) &&
+ (tb->port == port) &&
+ (tb->ib_addr_ipv4 == sk_rcv_saddr(sk)))
+ goto found;
+ }
+ _head = portaddrany_bhead;
+ if (bhead->count < portaddrany_bhead->count) {
+ _head = bhead;
+ inet_bind_bucket_for_each(tb, node, &_head->chain)
+ if ((ib_net(tb) == net) &&
+ (tb->port == port) &&
+ (tb->ib_addr_ipv4 == INADDR_ANY))
+ goto found;
+ } else {
+ inet_portaddr_bind_bucket_for_each(tb, node, &_head->chain)
+ if ((ib_net(tb) == net) &&
+ (tb->port == port) &&
+ (tb->ib_addr_ipv4 == INADDR_ANY))
+ goto found;
+ }
+ } else {
+ inet_bind_bucket_for_each(tb, node, &bhead->chain)
+ if ((ib_net(tb) == net) && (tb->port == port))
+ goto found;
+ }
+
+ tb = NULL;
+found:
+ if (portaddr_bhead != portaddrany_bhead)
+ spin_unlock(&portaddrany_bhead->lock);
+
+ /* the other locks remain taken, as the caller
+ * may want to change the hash tables */
+ return tb;
+}
+
+
/*
* Bind a port for a connect operation and hash it.
*/
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 73f1a00..62f1eff 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -294,6 +294,101 @@ static inline u32 inet6_sk_port_offset(const struct sock *sk)
inet->inet_dport);
}
+
+struct inet_bind_bucket *
+inet6_find_bind_buckets(struct sock *sk,
+ unsigned short port,
+ struct inet_bind_hashbucket **p_bhead,
+ struct inet_bind_hashbucket **p_portaddr_bhead)
+{
+ struct net *net = sock_net(sk);
+ struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
+ struct inet_bind_bucket *tb = NULL;
+ struct hlist_node *node;
+
+ struct inet_bind_hashbucket *bhead, *portaddr_bhead, *portaddrany_bhead;
+ bhead = &hinfo->bhash[inet_bhashfn(net, port, hinfo->bhash_size)];
+ portaddr_bhead = inet6_portaddr_hashbucket(hinfo, net,
+ inet6_rcv_saddr(sk), port);
+ portaddrany_bhead = inet6_portaddr_hashbucket(hinfo, net,
+ &in6addr_any, port);
+
+ *p_portaddr_bhead = portaddr_bhead;
+ *p_bhead = bhead;
+
+ /*
+ * prevent dead locks by always taking locks in a fixed order:
+ * - always take the port-only lock first. This is done because in some
+ * other places this is the lock taken, being followed in only some
+ * cases by the portaddr lock.
+ * - between portaddr and portaddrany always choose the one with the
+ * lower address. Unlock ordering is not important, as long as the
+ * locking order is consistent.
+ * - make sure to not take the same lock twice
+ */
+ spin_lock(&bhead->lock);
+ if (portaddr_bhead > portaddrany_bhead) {
+ spin_lock(&portaddrany_bhead->lock);
+ spin_lock(&portaddr_bhead->lock);
+ } else if (portaddr_bhead < portaddrany_bhead) {
+ spin_lock(&portaddr_bhead->lock);
+ spin_lock(&portaddrany_bhead->lock);
+ } else {
+ spin_lock(&portaddr_bhead->lock);
+ }
+
+ if (!ipv6_addr_any(inet6_rcv_saddr(sk))) {
+ struct inet_bind_hashbucket *_head;
+
+ _head = portaddr_bhead;
+ if (bhead->count < portaddr_bhead->count) {
+ _head = bhead;
+ inet_bind_bucket_for_each(tb, node, &_head->chain)
+ if ((net_eq(ib_net(tb), net)) &&
+ (tb->port == port) &&
+ ipv6_addr_equal(&tb->ib_addr_ipv6,
+ inet6_rcv_saddr(sk)))
+ goto found;
+ } else {
+ inet_portaddr_bind_bucket_for_each(tb, node, &_head->chain)
+ if ((net_eq(ib_net(tb), net)) &&
+ (tb->port == port) &&
+ ipv6_addr_equal(&tb->ib_addr_ipv6,
+ inet6_rcv_saddr(sk)))
+ goto found;
+ }
+ _head = portaddrany_bhead;
+ if (bhead->count < portaddrany_bhead->count) {
+ _head = bhead;
+ inet_bind_bucket_for_each(tb, node, &_head->chain)
+ if ((ib_net(tb) == net) &&
+ (tb->port == port) &&
+ ipv6_addr_any(&tb->ib_addr_ipv6))
+ goto found;
+ } else {
+ inet_portaddr_bind_bucket_for_each(tb, node, &_head->chain)
+ if ((ib_net(tb) == net) &&
+ (tb->port == port) &&
+ ipv6_addr_any(&tb->ib_addr_ipv6))
+ goto found;
+ }
+ } else {
+ inet_bind_bucket_for_each(tb, node, &bhead->chain)
+ if ((ib_net(tb) == net) && (tb->port == port))
+ goto found;
+ }
+
+ tb = NULL;
+found:
+ if (portaddr_bhead != portaddrany_bhead)
+ spin_unlock(&portaddrany_bhead->lock);
+
+ /* the other locks remain taken, as the caller
+ * may want to change the hash tables */
+ return tb;
+}
+
+
int inet6_hash_connect(struct inet_timewait_death_row *death_row,
struct sock *sk)
{
--
1.7.10.2
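The lock-ordering rule described in the comments of inet4_find_bind_buckets() and inet6_find_bind_buckets() (lower address first, never take the same lock twice) can be sketched in userspace C. All names below are hypothetical, and the "lock" is only a counter that records acquisition order, not a real spinlock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bucket: depth counts holders, locked_at records acquisition order. */
struct sketch_bucket {
	unsigned int locked_at;
	int depth;
};

static unsigned int sketch_seq;	/* global acquisition counter */

static void sketch_lock(struct sketch_bucket *b)
{
	b->locked_at = ++sketch_seq;
	b->depth++;
}

static void sketch_unlock(struct sketch_bucket *b)
{
	b->depth--;
}

/* Lock two buckets, lower address first; lock only once if they are the same. */
static void lock_pair(struct sketch_bucket *a, struct sketch_bucket *b)
{
	assert(a != NULL && b != NULL);
	if (a == b) {
		sketch_lock(a);
	} else if ((uintptr_t)a < (uintptr_t)b) {
		sketch_lock(a);
		sketch_lock(b);
	} else {
		sketch_lock(b);
		sketch_lock(a);
	}
}

/* Unlock order does not matter for deadlock avoidance. */
static void unlock_pair(struct sketch_bucket *a, struct sketch_bucket *b)
{
	assert(a != NULL && b != NULL);
	sketch_unlock(a);
	if (a != b)
		sketch_unlock(b);
}
```

Because every caller takes the pair in the same address order regardless of which bucket it names first, two CPUs locking the same two buckets cannot each hold one lock while waiting for the other.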
* [RFC PATCH 3/4] inet: add/remove inet buckets in the second bind hash
From: Alexandru Copot @ 2012-05-30 7:36 UTC (permalink / raw)
To: davem
Cc: gerrit, kuznet, jmorris, yoshfuji, kaber, netdev, Alexandru Copot,
Daniel Baluta, Lucian Grijincu
In-Reply-To: <1338363410-6562-1-git-send-email-alex.mihai.c@gmail.com>
Signed-off-by: Alexandru Copot <alex.mihai.c@gmail.com>
Cc: Daniel Baluta <dbaluta@ixiacom.com>
Cc: Lucian Grijincu <lucian.grijincu@gmail.com>
---
include/net/inet_hashtables.h | 77 +++++++++++++++++++++++++++++++++++---
include/net/inet_timewait_sock.h | 3 +-
net/ipv4/inet_connection_sock.c | 13 +++++--
net/ipv4/inet_hashtables.c | 34 ++++++++++++++---
net/ipv4/inet_timewait_sock.c | 15 +++++---
5 files changed, 122 insertions(+), 20 deletions(-)
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index a6d0db2..bc06168 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -225,13 +225,15 @@ static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo)
}
extern struct inet_bind_bucket *
- inet_bind_bucket_create(struct kmem_cache *cachep,
- struct net *net,
- struct inet_bind_hashbucket *head,
- const unsigned short snum);
+ inet_bind_bucket_create(struct kmem_cache *cachep,
+ struct net *net,
+ struct inet_bind_hashbucket *head,
+ struct inet_bind_hashbucket *portaddr_head,
+ const unsigned short snum);
extern void inet_bind_bucket_destroy(struct kmem_cache *cachep,
struct inet_bind_bucket *tb,
- struct inet_bind_hashbucket *head);
+ struct inet_bind_hashbucket *head,
+ struct inet_bind_hashbucket *portaddr_head);
static inline int inet_bhashfn(struct net *net,
const __u16 lport, const int bhash_size)
@@ -239,6 +241,71 @@ static inline int inet_bhashfn(struct net *net,
return (lport + net_hash_mix(net)) & (bhash_size - 1);
}
+static inline unsigned int inet4_portaddr_bhashfn(struct net *net, __be32 saddr,
+ unsigned int port)
+{
+ return jhash_1word(saddr, net_hash_mix(net)) ^ port;
+}
+
+static inline struct inet_bind_hashbucket *
+ inet4_portaddr_hashbucket(struct inet_hashinfo *hinfo,
+ struct net *net,
+ __be32 saddr,
+ unsigned int port)
+{
+ unsigned int h = inet4_portaddr_bhashfn(net, saddr, port);
+ return &hinfo->portaddr_bhash[h & (hinfo->portaddr_bhash_size - 1)];
+}
+
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+static inline unsigned int inet6_portaddr_bhashfn(struct net *net,
+ const struct in6_addr *addr6,
+ unsigned int port)
+{
+ unsigned int hash, mix = net_hash_mix(net);
+
+ if (ipv6_addr_any(addr6))
+ hash = jhash_1word(0, mix);
+ else if (ipv6_addr_v4mapped(addr6))
+ hash = jhash_1word(addr6->s6_addr32[3], mix);
+ else
+ hash = jhash2(addr6->s6_addr32, 4, mix);
+
+ return hash ^ port;
+}
+
+static inline struct inet_bind_hashbucket *
+ inet6_portaddr_hashbucket(struct inet_hashinfo *hinfo,
+ struct net *net,
+ const struct in6_addr *addr6,
+ unsigned int port)
+{
+ unsigned int h = inet6_portaddr_bhashfn(net, addr6, port);
+ return &hinfo->portaddr_bhash[h & (hinfo->portaddr_bhash_size - 1)];
+}
+#endif
+
+
+static inline struct inet_bind_hashbucket *
+ inet_portaddr_hashbucket(struct inet_hashinfo *hinfo,
+ struct sock *sk,
+ unsigned int port)
+{
+ struct net *net = sock_net(sk);
+ switch (sk->sk_family) {
+ case AF_INET:
+ return inet4_portaddr_hashbucket(hinfo, net,
+ inet_sk(sk)->inet_rcv_saddr, port);
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+ case AF_INET6:
+ return inet6_portaddr_hashbucket(hinfo, net,
+ &inet6_sk(sk)->rcv_saddr, port);
+#endif
+ }
+ WARN(1, "unrecognised sk->sk_family in inet_portaddr_hashbucket");
+ return inet4_portaddr_hashbucket(hinfo, net, INADDR_ANY, port);
+}
+
extern void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
const unsigned short snum);
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index 725e903..d60d8a9 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -199,7 +199,8 @@ extern int inet_twsk_unhash(struct inet_timewait_sock *tw);
extern int inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
struct inet_hashinfo *hashinfo,
- struct inet_bind_hashbucket *head);
+ struct inet_bind_hashbucket *head,
+ struct inet_bind_hashbucket *portaddr_head);
extern struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
const int state);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 95e61596..336531a 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -204,9 +204,16 @@ tb_found:
}
tb_not_found:
ret = 1;
- if (!tb && (tb = inet_bind_bucket_create(hashinfo->bind_bucket_cachep,
- net, head, snum)) == NULL)
- goto fail_unlock;
+ if (!tb) {
+ struct inet_bind_hashbucket *portaddr_head;
+ portaddr_head = inet_portaddr_hashbucket(hashinfo, sk, snum);
+ spin_lock(&portaddr_head->lock);
+ tb = inet_bind_bucket_create(hashinfo->bind_bucket_cachep,
+ net, head, portaddr_head, snum);
+ spin_unlock(&portaddr_head->lock);
+ if (!tb)
+ goto fail_unlock;
+ }
if (hlist_empty(&tb->owners)) {
if (sk->sk_reuse && sk->sk_state != TCP_LISTEN)
tb->fastreuse = 1;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index c1f6f28..edb2a4e 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -31,6 +31,7 @@
struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
struct net *net,
struct inet_bind_hashbucket *head,
+ struct inet_bind_hashbucket *portaddr_head,
const unsigned short snum)
{
struct inet_bind_bucket *tb = kmem_cache_alloc(cachep, GFP_ATOMIC);
@@ -43,6 +44,8 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
INIT_HLIST_HEAD(&tb->owners);
hlist_add_head(&tb->node, &head->chain);
head->count++;
+ hlist_add_head(&tb->portaddr_node, &portaddr_head->chain);
+ portaddr_head->count++;
}
return tb;
}
@@ -51,11 +54,14 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
* Caller must hold hashbucket lock for this tb with local BH disabled
*/
void inet_bind_bucket_destroy(struct kmem_cache *cachep, struct inet_bind_bucket *tb,
- struct inet_bind_hashbucket *head)
+ struct inet_bind_hashbucket *head,
+ struct inet_bind_hashbucket *portaddr_head)
{
if (hlist_empty(&tb->owners)) {
head->count--;
__hlist_del(&tb->node);
+ portaddr_head->count--;
+ __hlist_del(&tb->portaddr_node);
release_net(ib_net(tb));
kmem_cache_free(cachep, tb);
}
@@ -83,17 +89,22 @@ static void __inet_put_port(struct sock *sk)
const int bhash = inet_bhashfn(sock_net(sk), inet_sk(sk)->inet_num,
hashinfo->bhash_size);
struct inet_bind_hashbucket *head = &hashinfo->bhash[bhash];
+ struct inet_bind_hashbucket *portaddr_head =
+ inet_portaddr_hashbucket(hashinfo, sk, inet_sk(sk)->inet_num);
struct inet_bind_bucket *tb;
atomic_dec(&hashinfo->bsockets);
spin_lock(&head->lock);
+ spin_lock(&portaddr_head->lock);
tb = inet_csk(sk)->icsk_bind_hash;
__sk_del_bind_node(sk);
tb->num_owners--;
inet_csk(sk)->icsk_bind_hash = NULL;
inet_sk(sk)->inet_num = 0;
- inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb, head);
+ inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb,
+ head, portaddr_head);
+ spin_unlock(&portaddr_head->lock);
spin_unlock(&head->lock);
}
@@ -112,6 +123,8 @@ int __inet_inherit_port(struct sock *sk, struct sock *child)
const int bhash = inet_bhashfn(sock_net(sk), port,
table->bhash_size);
struct inet_bind_hashbucket *head = &table->bhash[bhash];
+ struct inet_bind_hashbucket *portaddr_head =
+ inet_portaddr_hashbucket(table, sk, port);
struct inet_bind_bucket *tb;
spin_lock(&head->lock);
@@ -130,7 +143,8 @@ int __inet_inherit_port(struct sock *sk, struct sock *child)
}
if (!node) {
tb = inet_bind_bucket_create(table->bind_bucket_cachep,
- sock_net(sk), head, port);
+ sock_net(sk), head,
+ portaddr_head, port);
if (!tb) {
spin_unlock(&head->lock);
return -ENOMEM;
@@ -462,7 +476,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
{
struct inet_hashinfo *hinfo = death_row->hashinfo;
const unsigned short snum = inet_sk(sk)->inet_num;
- struct inet_bind_hashbucket *head;
+ struct inet_bind_hashbucket *head, *portaddr_head;
struct inet_bind_bucket *tb;
int ret;
struct net *net = sock_net(sk);
@@ -504,8 +518,12 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
}
}
+ portaddr_head = inet_portaddr_hashbucket(hinfo, sk, port);
+ spin_lock(&portaddr_head->lock);
tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep,
- net, head, port);
+ net, head, portaddr_head, port);
+ spin_unlock(&portaddr_head->lock);
+
if (!tb) {
spin_unlock(&head->lock);
break;
@@ -529,8 +547,12 @@ ok:
inet_sk(sk)->inet_sport = htons(port);
twrefcnt += hash(sk, tw);
}
+ portaddr_head = inet_portaddr_hashbucket(hinfo, sk, port);
+ spin_lock(&portaddr_head->lock);
if (tw)
- twrefcnt += inet_twsk_bind_unhash(tw, hinfo, head);
+ twrefcnt += inet_twsk_bind_unhash(tw, hinfo,
+ head, portaddr_head);
+ spin_unlock(&portaddr_head->lock);
spin_unlock(&head->lock);
if (tw) {
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 5b7bcd0..29f8061 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -50,7 +50,8 @@ int inet_twsk_unhash(struct inet_timewait_sock *tw)
*/
int inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
struct inet_hashinfo *hashinfo,
- struct inet_bind_hashbucket *head)
+ struct inet_bind_hashbucket *head,
+ struct inet_bind_hashbucket *portaddr_head)
{
struct inet_bind_bucket *tb = tw->tw_tb;
@@ -59,7 +60,8 @@ int inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
__hlist_del(&tw->tw_bind_node);
tw->tw_tb = NULL;
- inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb, head);
+ inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb,
+ head, portaddr_head);
/*
* We cannot call inet_twsk_put() ourself under lock,
* caller must call it for us.
@@ -71,7 +73,7 @@ int inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
static void __inet_twsk_kill(struct inet_timewait_sock *tw,
struct inet_hashinfo *hashinfo)
{
- struct inet_bind_hashbucket *bhead;
+ struct inet_bind_hashbucket *bhead, *portaddr_bhead;
int refcnt;
/* Unlink from established hashes. */
spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
@@ -83,9 +85,12 @@ static void __inet_twsk_kill(struct inet_timewait_sock *tw,
/* Disassociate with bind bucket. */
bhead = &hashinfo->bhash[inet_bhashfn(twsk_net(tw), tw->tw_num,
hashinfo->bhash_size)];
-
+ portaddr_bhead = inet_portaddr_hashbucket(hashinfo, (struct sock *)tw,
+ tw->tw_num);
spin_lock(&bhead->lock);
- refcnt += inet_twsk_bind_unhash(tw, hashinfo, bhead);
+ spin_lock(&portaddr_bhead->lock);
+ refcnt += inet_twsk_bind_unhash(tw, hashinfo, bhead, portaddr_bhead);
+ spin_unlock(&portaddr_bhead->lock);
spin_unlock(&bhead->lock);
#ifdef SOCK_REFCNT_DEBUG
--
1.7.10.2
* [RFC PATCH 2/4] inet: add a second bind hash
From: Alexandru Copot @ 2012-05-30 7:36 UTC (permalink / raw)
To: davem
Cc: gerrit, kuznet, jmorris, yoshfuji, kaber, netdev, Alexandru Copot,
Daniel Baluta, Lucian Grijincu
In-Reply-To: <1338363410-6562-1-git-send-email-alex.mihai.c@gmail.com>
Add a second bind hash table which hashes by bound port and address.
Signed-off-by: Alexandru Copot <alex.mihai.c@gmail.com>
Cc: Daniel Baluta <dbaluta@ixiacom.com>
Cc: Lucian Grijincu <lucian.grijincu@gmail.com>
---
include/net/inet_hashtables.h | 13 ++++++++++---
net/dccp/proto.c | 36 ++++++++++++++++++++++++++++++++++--
net/ipv4/tcp.c | 16 ++++++++++++++++
3 files changed, 60 insertions(+), 5 deletions(-)
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 8c6addc..a6d0db2 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -84,6 +84,7 @@ struct inet_bind_bucket {
signed short fastreuse;
int num_owners;
struct hlist_node node;
+ struct hlist_node portaddr_node;
struct hlist_head owners;
};
@@ -94,6 +95,8 @@ static inline struct net *ib_net(struct inet_bind_bucket *ib)
#define inet_bind_bucket_for_each(tb, pos, head) \
hlist_for_each_entry(tb, pos, head, node)
+#define inet_portaddr_bind_bucket_for_each(tb, pos, head) \
+ hlist_for_each_entry(tb, pos, head, portaddr_node)
struct inet_bind_hashbucket {
spinlock_t lock;
@@ -129,13 +132,17 @@ struct inet_hashinfo {
unsigned int ehash_mask;
unsigned int ehash_locks_mask;
- /* Ok, let's try this, I give up, we do need a local binding
- * TCP hash as well as the others for fast bind/connect.
+ /*
+ * bhash: hashes the buckets by port.
+ * portaddr_bhash: hashes bind buckets by bound port and address.
+ * When bhash gets too large, we try to look up in
+ * portaddr_bhash instead.
*/
struct inet_bind_hashbucket *bhash;
+ struct inet_bind_hashbucket *portaddr_bhash;
unsigned int bhash_size;
- /* 4 bytes hole on 64 bit */
+ unsigned int portaddr_bhash_size;
struct kmem_cache *bind_bucket_cachep;
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index e777beb..298f5c1 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1109,7 +1109,7 @@ EXPORT_SYMBOL_GPL(dccp_debug);
static int __init dccp_init(void)
{
unsigned long goal;
- int ehash_order, bhash_order, i;
+ int ehash_order, bhash_order, portaddr_bhash_order, i;
int rc;
BUILD_BUG_ON(sizeof(struct dccp_skb_cb) >
@@ -1189,9 +1189,34 @@ static int __init dccp_init(void)
INIT_HLIST_HEAD(&dccp_hashinfo.bhash[i].chain);
}
+ portaddr_bhash_order = bhash_order;
+
+ do {
+ dccp_hashinfo.portaddr_bhash_size =
+ (1UL << portaddr_bhash_order) *
+ PAGE_SIZE / sizeof(struct inet_bind_hashbucket);
+ if ((dccp_hashinfo.portaddr_bhash_size > (64 * 1024)) &&
+ portaddr_bhash_order > 0)
+ continue;
+ dccp_hashinfo.portaddr_bhash = (struct inet_bind_hashbucket *)
+ __get_free_pages(GFP_ATOMIC|__GFP_NOWARN,
+ portaddr_bhash_order);
+ } while (!dccp_hashinfo.portaddr_bhash && --portaddr_bhash_order >= 0);
+
+ if (!dccp_hashinfo.portaddr_bhash) {
+ DCCP_CRIT("Failed to allocate DCCP portaddr bind hash table");
+ goto out_free_dccp_bhash;
+ }
+
+ for (i = 0; i < dccp_hashinfo.portaddr_bhash_size; i++) {
+ dccp_hashinfo.portaddr_bhash[i].count = 0;
+ spin_lock_init(&dccp_hashinfo.portaddr_bhash[i].lock);
+ INIT_HLIST_HEAD(&dccp_hashinfo.portaddr_bhash[i].chain);
+ }
+
rc = dccp_mib_init();
if (rc)
- goto out_free_dccp_bhash;
+ goto out_free_dccp_portaddr_bhash;
rc = dccp_ackvec_init();
if (rc)
@@ -1215,6 +1240,10 @@ out_ackvec_exit:
dccp_ackvec_exit();
out_free_dccp_mib:
dccp_mib_exit();
+out_free_dccp_portaddr_bhash:
+ free_pages((unsigned long)dccp_hashinfo.portaddr_bhash,
+ portaddr_bhash_order);
+ dccp_hashinfo.portaddr_bhash = NULL;
out_free_dccp_bhash:
free_pages((unsigned long)dccp_hashinfo.bhash, bhash_order);
out_free_dccp_locks:
@@ -1239,6 +1268,9 @@ static void __exit dccp_fini(void)
free_pages((unsigned long)dccp_hashinfo.bhash,
get_order(dccp_hashinfo.bhash_size *
sizeof(struct inet_bind_hashbucket)));
+ free_pages((unsigned long)dccp_hashinfo.portaddr_bhash,
+ get_order(dccp_hashinfo.portaddr_bhash_size *
+ sizeof(struct inet_bind_hashbucket)));
free_pages((unsigned long)dccp_hashinfo.ehash,
get_order((dccp_hashinfo.ehash_mask + 1) *
sizeof(struct inet_ehash_bucket)));
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 52cdf67..7dd3e19 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3538,6 +3538,22 @@ void __init tcp_init(void)
INIT_HLIST_HEAD(&tcp_hashinfo.bhash[i].chain);
}
+ tcp_hashinfo.portaddr_bhash =
+ alloc_large_system_hash("TCP portaddr_bind",
+ sizeof(struct inet_bind_hashbucket),
+ tcp_hashinfo.bhash_size,
+ (totalram_pages >= 128 * 1024) ?
+ 13 : 15,
+ 0,
+ &tcp_hashinfo.portaddr_bhash_size,
+ NULL,
+ 64 * 1024);
+ tcp_hashinfo.portaddr_bhash_size = 1U << tcp_hashinfo.portaddr_bhash_size;
+ for (i = 0; i < tcp_hashinfo.portaddr_bhash_size; i++) {
+ tcp_hashinfo.portaddr_bhash[i].count = 0;
+ spin_lock_init(&tcp_hashinfo.portaddr_bhash[i].lock);
+ INIT_HLIST_HEAD(&tcp_hashinfo.portaddr_bhash[i].chain);
+ }
cnt = tcp_hashinfo.ehash_mask + 1;
--
1.7.10.2