* Re: [bug?] r8169: hangs under heavy load
From: Eric Dumazet @ 2011-11-25 20:32 UTC (permalink / raw)
To: Jonathan Nieder
Cc: netdev, nic_swsd, Francois Romieu, linux-kernel, Armin Kazmi,
Gerd
In-Reply-To: <20111125201936.GA26692@elie.hsd1.il.comcast.net>
On Friday, 25 November 2011 at 14:19 -0600, Jonathan Nieder wrote:
> Hi,
>
> Gerd writes[1]:
>
> > Today I installed the 3.1 kernel and did some testing by copying files
> > with samba.
> [...]
> > Now the CPU hangs during the interrupts:
> [...]
> > r8169 0000:02:00.0: eth0: link up
> > ------------[ cut here ]------------
> > WARNING: at [...]/net/core/dev.c:3827 net_rx_action+0xda/0x17e()
> >
> > Hardware name: CM-iAM/SBC-FITPC2i
> > Modules linked in: cpufreq_userspace cpufreq_conservative cpufreq_stats nfsd lockd nfs_acl auth_rpcgss sunrpc speedstep_lib cpufreq_powersave fuse ext2 coretemp acpi_cpufreq mperf loop arc4 snd_hda_codec_realtek rt2800usb rt2800lib crc_ccitt rt2x00usb rt2x00lib mac80211 psb_gfx(C) snd_hda_intel cfg80211 snd_hda_codec drm_kms_helper i2c_isch drm snd_hwdep rfkill tpm_tis tpm tpm_bios lpc_sch mfd_core evdev snd_pcm snd_seq snd_timer snd_seq_device battery processor button psmouse snd pcspkr serio_raw ac i2c_algo_bit power_supply i2c_core soundcore snd_page_alloc video usbhid ext4 hid mbcache jbd2 crc16 sd_mod crc_t10dif ata_generic uhci_hcd pata_sch libata ehci_hcd scsi_mod usbcore sdhci_pci sdhci r8169 mii thermal thermal_sys mmc_core [last unloaded: scsi_wait_scan]
> > Pid: 0, comm: swapper Tainted: G C 3.1.0-1-686-pae #1
> > Call Trace:
> > [<c1037698>] ? warn_slowpath_common+0x68/0x79
> > [<c120d061>] ? net_rx_action+0xda/0x17e
> > [<c10376b6>] ? warn_slowpath_null+0xd/0x10
> > [<c120d061>] ? net_rx_action+0xda/0x17e
> > [<c103c05d>] ? local_bh_enable+0x2/0x2
> > [<c103c0f1>] ? __do_softirq+0x94/0x12f
> > [<c103c05d>] ? local_bh_enable+0x2/0x2
> > <IRQ> [<c103c2e2>] ? irq_exit+0x32/0x80
> > [<c100ca6e>] ? do_IRQ+0x65/0x76
> > [<c12b2a30>] ? common_interrupt+0x30/0x38
> > [<c103007b>] ? sched_debug_show+0x165/0xb17
> > [<c118519b>] ? intel_idle+0xb9/0xde
> > [<c11f54d2>] ? cpuidle_idle_call+0xcd/0x140
> > [<c100aef1>] ? cpu_idle+0x86/0xaa
> > [<c143e708>] ? start_kernel+0x32a/0x32f
> > ---[ end trace 6d03368d0e01d4ae ]---
> > r8169 0000:02:00.0: eth0: link up
> > r8169 0000:02:00.0: eth0: link up
> > r8169 0000:02:00.0: eth0: link up
> > r8169 0000:02:00.0: eth0: link up
> [...]
> > If you need any more information or log files please send me a mail.
>
> This is
>
> work = n->poll(n, weight);
> [...]
> WARN_ON_ONCE(work > weight);
>
> From the same log:
>
> > [ 1.478652] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> > [ 1.478722] r8169 0000:02:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
> > [ 1.478779] r8169 0000:02:00.0: setting latency timer to 64
> > [ 1.478850] r8169 0000:02:00.0: irq 40 for MSI/MSI-X
> [...]
> > [ 1.509507] r8169 0000:02:00.0: eth0: RTL8168c/8111c at 0xf82c2000, 00:01:c0:08:aa:31, XID 1c4000c0 IRQ 40
> > [ 1.509569] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> > [ 1.509633] r8169 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
> > [ 1.509691] r8169 0000:03:00.0: setting latency timer to 64
> > [ 1.509765] r8169 0000:03:00.0: irq 41 for MSI/MSI-X
> > [ 1.511188] r8169 0000:03:00.0: eth1: RTL8168c/8111c at 0xf82d0000, 00:01:c0:08:aa:32, XID 1c4000c0 IRQ 41
>
> From another log, using a kernel without pae support:
>
> > [ 844.056012] BUG: soft lockup - CPU#0 stuck for 23s! [smbd:2770]
> [...]
> > [ 872.056011] BUG: soft lockup - CPU#0 stuck for 23s! [smbd:2770]
> [...]
> > [ 900.056011] BUG: soft lockup - CPU#0 stuck for 23s! [smbd:2770]
>
> Gerd previously was getting transmit queue timeouts with a v2.6.32-based
> kernel; with Debian's 3.1.1-1 kernel, the system hangs instead. See [1]
> for the details, including full logs.
>
> Thanks for keeping the r8169 driver well maintained. Any ideas for
> tracking this down?
>
> Looking forward to your thoughts,
> Jonathan
>
> [1] http://bugs.debian.org/642911
> http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=34;filename=dmesg_20111125_kernel_3.1-pae.txt;att=2;bug=642911
rtl8169_rx_interrupt(..., budget) can sometimes return budget + 1
because of:
        /* Work around for AMD plateform. */
        if ((desc->opts2 & cpu_to_le32(0xfffe000)) &&
            (tp->mac_version == RTL_GIGA_MAC_VER_05)) {
                desc->opts2 = 0;
                cur_rx++;
        }
Sorry, I won't patch this today; it's Black Friday, and David said to
patch submitters:
"stick to turkey and wine you're better at it"
:)
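A minimal user-space sketch of the overrun, assuming the receive loop is bounded by the budget but reports how far the ring index advanced. The struct and function names here are hypothetical and are not the driver's; the second function shows one possible way to honor the "work <= weight" contract, not necessarily the fix that was eventually merged.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified RX descriptor; not the real r8169 layout. */
struct rx_desc {
    bool ready;      /* a frame is waiting in this slot */
    bool needs_skip; /* stands in for the MAC_VER_05 workaround condition */
};

/*
 * Reports how far the ring index advanced. The loop runs at most
 * `budget` times, but the workaround advances the index a second time
 * inside the body, so the result can exceed the budget.
 */
static int rx_poll_ring_advance(const struct rx_desc *ring, size_t len,
                                int budget)
{
    size_t cur = 0;
    int left = budget;

    for (; left > 0 && cur < len && ring[cur].ready; left--, cur++) {
        if (ring[cur].needs_skip)
            cur++; /* extra increment, like cur_rx++ in the workaround */
    }
    return (int)cur;
}

/*
 * Same loop, but reports the number of iterations performed, which can
 * never exceed the budget.
 */
static int rx_poll_iterations(const struct rx_desc *ring, size_t len,
                              int budget)
{
    size_t cur = 0;
    int work = 0;

    for (; work < budget && cur < len && ring[cur].ready; work++, cur++) {
        if (ring[cur].needs_skip)
            cur++;
    }
    return work;
}
```

With a budget of 4 and the workaround condition set on the fourth descriptor, the first function reports 5 while the second reports 4, which is the difference that WARN_ON_ONCE(work > weight) catches.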
^ permalink raw reply
* Re: [bug?] r8169: hangs under heavy load
From: Jonathan Nieder @ 2011-11-25 20:31 UTC (permalink / raw)
To: netdev; +Cc: nic_swsd, Francois Romieu, linux-kernel, Armin Kazmi, Gerd
In-Reply-To: <20111125201936.GA26692@elie.hsd1.il.comcast.net>
Jonathan Nieder wrote:
> From another log, using a kernel without pae support:
>
>> [ 844.056012] BUG: soft lockup - CPU#0 stuck for 23s! [smbd:2770]
> [...]
>> [ 872.056011] BUG: soft lockup - CPU#0 stuck for 23s! [smbd:2770]
> [...]
>> [ 900.056011] BUG: soft lockup - CPU#0 stuck for 23s! [smbd:2770]
What I meant is a kernel without SMP support (it lacks PAE, too, but
that's less relevant). Description:
http://packages.debian.org/sid/linux-image-3.1.0-1-486
.configs for the two kernels mentioned:
http://alioth.debian.org/~jrnieder-guest/temp/config-3.1.0-1-686-pae
http://alioth.debian.org/~jrnieder-guest/temp/config-3.1.0-1-486
Sorry for the lack of clarity.
^ permalink raw reply
* Re: Open vSwitch Design
From: Jesse Gross @ 2011-11-25 20:20 UTC (permalink / raw)
To: jhs-jkUAjuhPggJWk0Htik3J/w
Cc: dev-yBygre7rU0TnMu66kgdUjQ, chrisw-H+wXaHxf7aLQT0dZR+AlfA,
Eric Dumazet, netdev-u79uwXL29TY76Z2rM5mHXA, Florian Westphal,
john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
herbert-F6s6mLieUQo7FNHlEwC/lvQIK84fMopw,
shemminger-ZtmgI6mnKB3QT0dZR+AlfA, David Miller
In-Reply-To: <1322220862.1908.79.camel@mojatatu>
On Fri, Nov 25, 2011 at 3:34 AM, jamal <hadi-fAAogVwAN2Kw5LPnMra/2Q@public.gmane.org> wrote:
>
> Hrm. I forgot about the flow classifier - it may be what the openflow
> folks need. It is more friendly for the well defined tuples than u32.
The flow classifier isn't really designed to do rule lookup in the way
that OpenFlow/Open vSwitch does, since it's more about choosing which
fields are considered significant to the flow. I'm sure that it could
be extended in some way but it seems that the better approach would be
to factor out the common pieces (such as the header extraction
mentioned before) than try to cram both models into one component.
I understand that you see some commonalities with various parts of the
system but often there are enough conceptual differences that you end
up trying to shove a square peg into a round hole. As Stephen
mentioned about the bridge, many of these components are already
fairly complex and combining more functionality into them isn't always
a win.
^ permalink raw reply
* [bug?] r8169: hangs under heavy load
From: Jonathan Nieder @ 2011-11-25 20:19 UTC (permalink / raw)
To: netdev; +Cc: nic_swsd, Francois Romieu, linux-kernel, Armin Kazmi, Gerd
In-Reply-To: <4ECFE7A7.5070300@wolke7.net>
Hi,
Gerd writes[1]:
> Today I installed the 3.1 kernel and did some testing by copying files
> with samba.
[...]
> Now the CPU hangs during the interrupts:
[...]
> r8169 0000:02:00.0: eth0: link up
> ------------[ cut here ]------------
> WARNING: at [...]/net/core/dev.c:3827 net_rx_action+0xda/0x17e()
>
> Hardware name: CM-iAM/SBC-FITPC2i
> Modules linked in: cpufreq_userspace cpufreq_conservative cpufreq_stats nfsd lockd nfs_acl auth_rpcgss sunrpc speedstep_lib cpufreq_powersave fuse ext2 coretemp acpi_cpufreq mperf loop arc4 snd_hda_codec_realtek rt2800usb rt2800lib crc_ccitt rt2x00usb rt2x00lib mac80211 psb_gfx(C) snd_hda_intel cfg80211 snd_hda_codec drm_kms_helper i2c_isch drm snd_hwdep rfkill tpm_tis tpm tpm_bios lpc_sch mfd_core evdev snd_pcm snd_seq snd_timer snd_seq_device battery processor button psmouse snd pcspkr serio_raw ac i2c_algo_bit power_supply i2c_core soundcore snd_page_alloc video usbhid ext4 hid mbcache jbd2 crc16 sd_mod crc_t10dif ata_generic uhci_hcd pata_sch libata ehci_hcd scsi_mod usbcore sdhci_pci sdhci r8169 mii thermal thermal_sys mmc_core [last unloaded: scsi_wait_scan]
> Pid: 0, comm: swapper Tainted: G C 3.1.0-1-686-pae #1
> Call Trace:
> [<c1037698>] ? warn_slowpath_common+0x68/0x79
> [<c120d061>] ? net_rx_action+0xda/0x17e
> [<c10376b6>] ? warn_slowpath_null+0xd/0x10
> [<c120d061>] ? net_rx_action+0xda/0x17e
> [<c103c05d>] ? local_bh_enable+0x2/0x2
> [<c103c0f1>] ? __do_softirq+0x94/0x12f
> [<c103c05d>] ? local_bh_enable+0x2/0x2
> <IRQ> [<c103c2e2>] ? irq_exit+0x32/0x80
> [<c100ca6e>] ? do_IRQ+0x65/0x76
> [<c12b2a30>] ? common_interrupt+0x30/0x38
> [<c103007b>] ? sched_debug_show+0x165/0xb17
> [<c118519b>] ? intel_idle+0xb9/0xde
> [<c11f54d2>] ? cpuidle_idle_call+0xcd/0x140
> [<c100aef1>] ? cpu_idle+0x86/0xaa
> [<c143e708>] ? start_kernel+0x32a/0x32f
> ---[ end trace 6d03368d0e01d4ae ]---
> r8169 0000:02:00.0: eth0: link up
> r8169 0000:02:00.0: eth0: link up
> r8169 0000:02:00.0: eth0: link up
> r8169 0000:02:00.0: eth0: link up
[...]
> If you need any more information or log files please send me a mail.
This is
work = n->poll(n, weight);
[...]
WARN_ON_ONCE(work > weight);
From the same log:
> [ 1.478652] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> [ 1.478722] r8169 0000:02:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
> [ 1.478779] r8169 0000:02:00.0: setting latency timer to 64
> [ 1.478850] r8169 0000:02:00.0: irq 40 for MSI/MSI-X
[...]
> [ 1.509507] r8169 0000:02:00.0: eth0: RTL8168c/8111c at 0xf82c2000, 00:01:c0:08:aa:31, XID 1c4000c0 IRQ 40
> [ 1.509569] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> [ 1.509633] r8169 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
> [ 1.509691] r8169 0000:03:00.0: setting latency timer to 64
> [ 1.509765] r8169 0000:03:00.0: irq 41 for MSI/MSI-X
> [ 1.511188] r8169 0000:03:00.0: eth1: RTL8168c/8111c at 0xf82d0000, 00:01:c0:08:aa:32, XID 1c4000c0 IRQ 41
From another log, using a kernel without pae support:
> [ 844.056012] BUG: soft lockup - CPU#0 stuck for 23s! [smbd:2770]
[...]
> [ 872.056011] BUG: soft lockup - CPU#0 stuck for 23s! [smbd:2770]
[...]
> [ 900.056011] BUG: soft lockup - CPU#0 stuck for 23s! [smbd:2770]
Gerd previously was getting transmit queue timeouts with a v2.6.32-based
kernel; with Debian's 3.1.1-1 kernel, the system hangs instead. See [1]
for the details, including full logs.
Thanks for keeping the r8169 driver well maintained. Any ideas for
tracking this down?
Looking forward to your thoughts,
Jonathan
[1] http://bugs.debian.org/642911
http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=34;filename=dmesg_20111125_kernel_3.1-pae.txt;att=2;bug=642911
^ permalink raw reply
* Re: Open vSwitch Design
From: Jesse Gross @ 2011-11-25 20:14 UTC (permalink / raw)
To: Eric Dumazet
Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu, netdev,
hadi-fAAogVwAN2Kw5LPnMra/2Q, jhs-jkUAjuhPggJWk0Htik3J/w,
John Fastabend, Stephen Hemminger, David Miller
In-Reply-To: <1322201883.2872.19.camel@edumazet-laptop>
On Thu, Nov 24, 2011 at 10:18 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thursday, 24 November 2011 at 21:20 -0800, Stephen Hemminger wrote:
>
>> The problem is that there are two flow classifiers, one in OpenVswitch
>> in the kernel, and the other in the user space flow manager. I think the
>> issue is that the two have different code.
>
> We have kind of same duplication in kernel already :)
>
> __skb_get_rxhash() and net/sched/cls_flow.c contain roughly the same
> logic...
>
> Maybe its time to factorize the thing, eventually use it in a third
> component (Open vSwitch...)
I agree, there's no need to have three copies of packet header parsing
code and that's certainly something that we would be willing to work
on improving.
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev
^ permalink raw reply
* Re: [PATCH net] net: Revert ARCNET and PHYLIB to tristate options
From: David Miller @ 2011-11-25 20:05 UTC (permalink / raw)
To: ben; +Cc: jeffrey.t.kirsher, netdev, debian-kernel
In-Reply-To: <1322249863.2839.375.camel@deadeye>
From: Ben Hutchings <ben@decadent.org.uk>
Date: Fri, 25 Nov 2011 19:37:43 +0000
> On Fri, 2011-11-25 at 13:50 -0500, David Miller wrote:
>> From: Ben Hutchings <ben@decadent.org.uk>
>> Date: Fri, 25 Nov 2011 18:40:42 +0000
>>
>> > On Fri, 2011-11-25 at 13:22 -0500, David Miller wrote:
>> >> Try allmodconfig for yourself.
>> >
>> > OK, on x86_64, this does end up with PHYLIB=y but only because
>> > NET_DSA=y. And I don't believe NET_DSA is appropriate for a distro
>> > kernel.
>>
>> Do you think we can modularize the NET_DSA reference somehow?
>
> Maybe, but I just don't care about DSA. It requires platform data to
> work, so AFAICS it's only useful in a custom kernel for some platform
> that has one of the supported chips.
As far as I understand it, that's not true in the case of device tree,
which can instantiate the platform information dynamically.
And, in any event, having only one specific case force PHYLIB to 'y' is
at best disappointing.
^ permalink raw reply
* Re: Open vSwitch Design
From: Justin Pettit @ 2011-11-25 19:52 UTC (permalink / raw)
To: Stephen Hemminger
Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu,
Eric Dumazet, netdev, hadi-fAAogVwAN2Kw5LPnMra/2Q,
jhs-jkUAjuhPggJWk0Htik3J/w, John Fastabend, David Miller
In-Reply-To: <20111124212021.2ae2fb7f-QE31Isp8l5DVJhW05BI4jyWSNWFUUkiGXqFh9Ls21Oc@public.gmane.org>
On Nov 24, 2011, at 9:20 PM, Stephen Hemminger wrote:
>> This can be achieved easily with zero changes to the kernel code.
>> You need to have default filters that redirect flows to user space
>> when you fail to match.
>
> Actually, this is what puts me off on the current implementation.
> I would prefer that the kernel implementation was just a software
> implementation of a hardware OpenFlow switch. That way it would
> be transparent that the control plane in user space was talking to kernel
> or hardware.
A big difficulty is finding an appropriate hardware abstraction. I've worked on porting Open vSwitch to a few different vendors' switching ASICs, and they've all looked quite different from each other. Even within a vendor, there can be fairly substantial differences. Packet processing is broken up into stages (e.g., VLAN preprocessing, ingress ACL processing, L2 lookup, L3 lookup, packet modification, packet queuing, packet replication, egress ACL processing, etc.) and these can be done in different orders and have quite different behaviors. Also, the size of the various tables varies widely between ASICs--even within the same family.
Hardware typically makes use of TCAMs, which support fast lookups of wildcarded flows. They're expensive, though, so they're typically limited to entries in the very low thousands. In software, we can trivially store 100,000s of entries, but supporting wildcarded lookups is very slow. If we only use exact-match flows in the kernel (and leave the wildcarding in userspace for kernel misses), we can do extremely fast lookups with hashing on what becomes the fastpath.
Using exact-match entries has another big advantage: we can innovate the userspace portion without requiring changes to the kernel. For example, we recently went from supporting a single OpenFlow table to 255 without any kernel changes. This has an added benefit that a flow requiring multiple table lookups becomes a single hash lookup in the kernel, which is a huge performance gain in the fastpath. Another example is our introduction of a number of metadata "registers" between tables that are never seen in the kernel, but open up a lot of interesting applications for OpenFlow controller writers.
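To make the exact-match idea concrete, here is a minimal sketch of a hashed flow table in which a miss is the signal to hand the packet to userspace. Everything in it is an assumption made for illustration (the key fields, the table size, the hash mix, the FLOW_MISS value); it is not the Open vSwitch datapath code.

```c
#include <stdint.h>

#define FLOW_TABLE_SIZE 1024 /* power of two, chosen arbitrarily */
#define FLOW_MISS (-1)       /* caller would punt the packet to userspace */

/* Hypothetical key; a real datapath key carries many more fields. */
struct flow_key {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
    uint8_t proto;
};

struct flow_entry {
    struct flow_key key;
    int action; /* e.g. an output port number */
    int in_use;
};

struct flow_table {
    struct flow_entry slots[FLOW_TABLE_SIZE];
};

/* Simple multiplicative mix using FNV-style constants. */
static uint32_t mix(uint32_t h, uint32_t v)
{
    h ^= v;
    h *= 16777619u;
    return h;
}

static uint32_t flow_hash(const struct flow_key *k)
{
    uint32_t h = 2166136261u;

    h = mix(h, k->src_ip);
    h = mix(h, k->dst_ip);
    h = mix(h, ((uint32_t)k->src_port << 16) | k->dst_port);
    h = mix(h, k->proto);
    return h;
}

/* Field-by-field compare, so struct padding never matters. */
static int flow_key_equal(const struct flow_key *a, const struct flow_key *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->src_port == b->src_port && a->dst_port == b->dst_port &&
           a->proto == b->proto;
}

/* Open addressing with linear probing; returns -1 if the table is full. */
static int flow_insert(struct flow_table *t, const struct flow_key *k,
                       int action)
{
    uint32_t start = flow_hash(k) & (FLOW_TABLE_SIZE - 1);

    for (uint32_t n = 0; n < FLOW_TABLE_SIZE; n++) {
        struct flow_entry *e = &t->slots[(start + n) & (FLOW_TABLE_SIZE - 1)];

        if (!e->in_use || flow_key_equal(&e->key, k)) {
            e->key = *k;
            e->action = action;
            e->in_use = 1;
            return 0;
        }
    }
    return -1;
}

/* Fast path: one hash plus a short probe; no wildcard matching at all. */
static int flow_lookup(const struct flow_table *t, const struct flow_key *k)
{
    uint32_t start = flow_hash(k) & (FLOW_TABLE_SIZE - 1);

    for (uint32_t n = 0; n < FLOW_TABLE_SIZE; n++) {
        const struct flow_entry *e =
            &t->slots[(start + n) & (FLOW_TABLE_SIZE - 1)];

        if (!e->in_use)
            return FLOW_MISS;
        if (flow_key_equal(&e->key, k))
            return e->action;
    }
    return FLOW_MISS;
}
```

In this sketch the wildcard logic lives entirely outside the table: on FLOW_MISS the slow path would evaluate its rules once and call flow_insert() with the resulting action, so later packets of the same flow hit the hash directly.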
If you're interested, we include a porting guide in the distribution that describes how one would go about bringing Open vSwitch to a new hardware or software platform:
http://openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=PORTING
Obviously, it's not that relevant here, since there's already a port to Linux. :-) But we've iterated over a few different designs and worked on other ports, and we've found this hardware/software abstraction layer to work pretty well. In fact, multiple ports of Open vSwitch have been done by name-brand third party vendors (this is the avenue most vendors use to get their OpenFlow support) and are now shipping.
We're always open to discussing ways that we can improve these interfaces, too, of course!
--Justin
^ permalink raw reply
* Re: [PATCH net] net: Revert ARCNET and PHYLIB to tristate options
From: Ben Hutchings @ 2011-11-25 19:37 UTC (permalink / raw)
To: David Miller; +Cc: jeffrey.t.kirsher, netdev, debian-kernel
In-Reply-To: <20111125.135022.2150651060225144120.davem@davemloft.net>
[-- Attachment #1: Type: text/plain, Size: 746 bytes --]
On Fri, 2011-11-25 at 13:50 -0500, David Miller wrote:
> From: Ben Hutchings <ben@decadent.org.uk>
> Date: Fri, 25 Nov 2011 18:40:42 +0000
>
> > On Fri, 2011-11-25 at 13:22 -0500, David Miller wrote:
> >> Try allmodconfig for yourself.
> >
> > OK, on x86_64, this does end up with PHYLIB=y but only because
> > NET_DSA=y. And I don't believe NET_DSA is appropriate for a distro
> > kernel.
>
> Do you think we can modularize the NET_DSA reference somehow?
Maybe, but I just don't care about DSA. It requires platform data to
work, so AFAICS it's only useful in a custom kernel for some platform
that has one of the supported chips.
Ben.
--
Ben Hutchings
Teamwork is essential - it allows you to blame someone else.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply
* Re: [PATCH net] net: Revert ARCNET and PHYLIB to tristate options
From: David Miller @ 2011-11-25 18:50 UTC (permalink / raw)
To: ben; +Cc: jeffrey.t.kirsher, netdev, debian-kernel
In-Reply-To: <1322246442.2839.358.camel@deadeye>
From: Ben Hutchings <ben@decadent.org.uk>
Date: Fri, 25 Nov 2011 18:40:42 +0000
> On Fri, 2011-11-25 at 13:22 -0500, David Miller wrote:
>> Try allmodconfig for yourself.
>
> OK, on x86_64, this does end up with PHYLIB=y but only because
> NET_DSA=y. And I don't believe NET_DSA is appropriate for a distro
> kernel.
Do you think we can modularize the NET_DSA reference somehow?
^ permalink raw reply
* Re: [PATCH net] net: Revert ARCNET and PHYLIB to tristate options
From: Ben Hutchings @ 2011-11-25 18:40 UTC (permalink / raw)
To: David Miller; +Cc: jeffrey.t.kirsher, netdev, debian-kernel
In-Reply-To: <20111125.132220.150546128040363058.davem@davemloft.net>
[-- Attachment #1: Type: text/plain, Size: 625 bytes --]
On Fri, 2011-11-25 at 13:22 -0500, David Miller wrote:
> From: Ben Hutchings <ben@decadent.org.uk>
> Date: Fri, 25 Nov 2011 14:07:51 +0000
>
> > Well, I can't think why it would be built in, since PHY modules can be
> > auto-loaded now.
>
> It's because drivers select the thing.
Drivers are also built as modules in a distribution kernel.
> Try allmodconfig for yourself.
OK, on x86_64, this does end up with PHYLIB=y but only because
NET_DSA=y. And I don't believe NET_DSA is appropriate for a distro
kernel.
Ben.
--
Ben Hutchings
Teamwork is essential - it allows you to blame someone else.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply
* Re: [v4 PATCH 1/2] NETFILTER module xt_hmark, new target for HASH based fwmark
From: Jan Engelhardt @ 2011-11-25 18:31 UTC (permalink / raw)
To: Pablo Neira Ayuso
Cc: Hans Schillstrom, kaber, netfilter-devel, netdev,
hans.schillstrom
In-Reply-To: <20111125173649.GA9304@1984>
On Friday 2011-11-25 18:36, Pablo Neira Ayuso wrote:
>On Fri, Nov 25, 2011 at 10:36:26AM +0100, Hans Schillstrom wrote:
>> diff --git a/include/net/ipv6.h b/include/net/ipv6.h
>> index 3f0258d..9e4d4f9 100644
>> --- a/include/net/ipv6.h
>> +++ b/include/net/ipv6.h
>> @@ -39,6 +39,7 @@
>> #define NEXTHDR_ICMP 58 /* ICMP for IPv6. */
>> #define NEXTHDR_NONE 59 /* No next header */
>> #define NEXTHDR_DEST 60 /* Destination options header. */
>> +#define NEXTHDR_SCTP 132 /* Stream Control Transport Protocol */
>> #define NEXTHDR_MOBILITY 135 /* Mobility header. */
>>
>> #define NEXTHDR_MAX 255
>
>This has to go in a separated patch. Please, send it to netdev. I
>think davem can pick that for 3.2-rc
I do have to wonder a little why we need the l4proto values twice
(IPPROTO_SCTP plus NEXTHDR_SCTP). Has nobody ever thought of
doing one foobar_<PROTOCOL>?
>> + icmph->type != ICMP_REDIRECT)
>> + return nhoff;
>> + /* Checkin full IP header plus 8 bytes of protocol to
>> + * avoid additional coding at protocol handlers.
>> + */
>> + if (!pskb_may_pull(skb, nhoff + iphsz + sizeof(_ih) + 8))
>> + return nhoff;
NB: I point out that the preferred long comment style begins with /*\n
(to match the trailing \n*/, naturally) like in
>> +/*
>> + * ICMPv6
>> + * Input nhoff Offset into network header
>> + * offset where ICMPv6 header starts
>> + * Returns true if it's a icmp error and updates nhoff
>> + */
^ permalink raw reply
* Re: [PATCH v2] netns: fix proxy ARP entries listing on a netns
From: David Miller @ 2011-11-25 18:24 UTC (permalink / raw)
To: jorge; +Cc: netdev
In-Reply-To: <1322232089-31902-1-git-send-email-jorge@dti2.net>
From: "Jorge Boncompte [DTI2]" <jorge@dti2.net>
Date: Fri, 25 Nov 2011 15:41:29 +0100
> From: "Jorge Boncompte [DTI2]" <jorge@dti2.net>
>
> Skip entries from foreign network namespaces.
>
> V2:
> Fixed as suggested by David Miller to avoid a goto.
>
> Signed-off-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH net] net: Revert ARCNET and PHYLIB to tristate options
From: David Miller @ 2011-11-25 18:22 UTC (permalink / raw)
To: ben; +Cc: jeffrey.t.kirsher, netdev, debian-kernel
In-Reply-To: <1322230071.2839.337.camel@deadeye>
From: Ben Hutchings <ben@decadent.org.uk>
Date: Fri, 25 Nov 2011 14:07:51 +0000
> Well, I can't think why it would be built in, since PHY modules can be
> auto-loaded now.
It's because drivers select the thing.
Try allmodconfig for yourself.
^ permalink raw reply
* Re: [PATCH 2/2] ehea: Use round_jiffies_relative to align workqueue
From: David Miller @ 2011-11-25 18:00 UTC (permalink / raw)
To: anton; +Cc: cascardo, netdev
In-Reply-To: <20111123211354.74a81a88@kryten>
From: Anton Blanchard <anton@samba.org>
Date: Wed, 23 Nov 2011 21:13:54 +1100
>
> Use round_jiffies_relative to align the ehea workqueue and avoid
> extra wakeups.
>
> Signed-off-by: Anton Blanchard <anton@samba.org>
Applied.
^ permalink raw reply
* Re: [PATCH 1/2] ehea: Reduce memory usage in buffer pools
From: David Miller @ 2011-11-25 18:00 UTC (permalink / raw)
To: anton; +Cc: cascardo, netdev
In-Reply-To: <20111123211302.2a37debb@kryten>
From: Anton Blanchard <anton@samba.org>
Date: Wed, 23 Nov 2011 21:13:02 +1100
>
> Now that we enable multiqueue by default the ehea driver is using
> quite a lot of memory for its buffer pools. With 4 queues we
> consume 64MB in the jumbo packet ring, 16MB in the medium packet
> ring and 16MB in the tiny packet ring.
>
> We should only fill the jumbo ring once the MTU is increased but
> for now halve it's size so it consumes 32MB. Also reduce the tiny
> packet ring, with 4 queues we had 16k entries which is overkill.
>
> Signed-off-by: Anton Blanchard <anton@samba.org>
Applied.
^ permalink raw reply
* Re: Open vSwitch Design
From: Jesse Gross @ 2011-11-25 17:55 UTC (permalink / raw)
To: Stephen Hemminger
Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu,
Eric Dumazet, netdev, hadi-fAAogVwAN2Kw5LPnMra/2Q,
jhs-jkUAjuhPggJWk0Htik3J/w, John Fastabend, David Miller
In-Reply-To: <20111124212021.2ae2fb7f-QE31Isp8l5DVJhW05BI4jyWSNWFUUkiGXqFh9Ls21Oc@public.gmane.org>
On Thu, Nov 24, 2011 at 9:20 PM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
> On Thu, 24 Nov 2011 17:30:33 -0500
> jamal <hadi@cyberus.ca> wrote:
>> On Thu, 2011-11-24 at 12:10 -0800, Jesse Gross wrote:
>> > * Userspace interfaces: One of the difficulties of having a
>> > specialized, exact match flow lookup engine is maintaining
>> > compatibility across differing kernel/userspace versions. This
>> > compatibility shows up heavily in the userspace interfaces and is
>> > achieved by passing the kernel's version of the flow along with packet
>> > information. This allows userspace to install appropriate flows even
>> > if its interpretation of a packet differs from the kernel's without
>> > version checks or maintaining multiple implementations of the flow
>> > extraction code in the kernel.
>>
>> I didnt quiet follow - are we talking about backward/forward
>> compatibility?
>
> The problem is that there are two flow classifiers, one in OpenVswitch
> in the kernel, and the other in the user space flow manager. I think the
> issue is that the two have different code.
Yes, since userspace is installing exact match entries, these flows
obviously need to be of the same form that the kernel would extract
from the packet. Over time, I'm sure that additional packet formats
will be added so it's important to handle the case where there is a
mismatch.
> Is the kernel/userspace API for OpenVswitch nailed down and documented
> well enough that alternative control plane software could be built?
Yes, that's actually the reason why it took so long to actually submit
the code for upstream - we spent a lot of time cleaning up and
stripping down the interfaces so they could be locked down (or cleanly
extended).
There's a fair amount of documentation on how to maintain
compatibility for flows as mentioned above in the patch that I
submitted and we're certainly happy to write more if other things are
unclear.
^ permalink raw reply
* Re: cache forver in 3.2.0-rc2-00400-g866d43c ?
From: Arkadiusz Miśkiewicz @ 2011-11-25 17:53 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
In-Reply-To: <201111232135.55122.a.miskiewicz@gmail.com>
On Wednesday 23 of November 2011, Arkadiusz Miśkiewicz wrote:
> On Wednesday 23 of November 2011, Eric Dumazet wrote:
> > On Wednesday, 23 November 2011 at 19:37 +0100, Eric Dumazet wrote:
> > > On Wednesday, 23 November 2011 at 19:31 +0100, Arkadiusz Miśkiewicz wrote:
> > > > Mine 00400-g866d43c was after
> > > > 6fe4c6d466e95d31164f14b1ac4aefb51f0f4f82 (which is merge of ipv4:
> > > > fix redirect handling), so I have it.
> > > >
> > > > (I'm using pure linus git repo)
> > >
> > > OK thanks for this information, I am working on a patch.
> >
> > Please test the following patch, thanks !
>
> Applied and running, results in a few days (since using net A only once per
> day).
Two days and no problems. More tests possible in next week.
Thanks!
--
Arkadiusz Miśkiewicz PLD/Linux Team
arekm / maven.pl http://ftp.pld-linux.org/
^ permalink raw reply
* Re: [PATCH iproute2 1/2] utils: add s32 parser
From: Hagen Paul Pfeifer @ 2011-11-25 17:47 UTC (permalink / raw)
To: David Laight; +Cc: Stephen Hemminger, netdev
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6D8AEE6@saturn3.aculab.com>
* David Laight | 2011-11-25 17:34:21 [-0000]:
>If you are that worried about numeric overflow (IIRC) you have
>have to check the result for LONG_MIN/MAX (etc) before looking
>at errno.
>
>strtoul() is defined to support -ve values, and I think the
>C rules for conversion between signed and unsigned ints
>DTRT even for non 2's compliment systems.
>
>Some of these bound checks are a waste of time.
>The SUS doesn't require standard utilities to perform them.
David: are you able to fix all conversion functions in utils.c (and add
get_s32)? If not, I will repost get_s32 with strtol() and the same error-check
mechanism as get_* (to be consistent).
Stephen, any other ideas?
Hagen
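For reference, a strtol()-based signed 32-bit parser along the lines discussed could look roughly like the sketch below. The name parse_s32 and its return convention (0 on success, -1 on error) are assumptions made for illustration; this is not the code that was posted or merged.

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns 0 on success, -1 on any parse error. */
static int parse_s32(int32_t *val, const char *arg, int base)
{
    char *end;
    long res;

    if (!arg || !*arg)
        return -1;

    errno = 0;
    res = strtol(arg, &end, base);

    /* No digits consumed, or trailing garbage after the number. */
    if (end == arg || *end != '\0')
        return -1;

    /* strtol() clamps to LONG_MIN/LONG_MAX and sets ERANGE on overflow. */
    if ((res == LONG_MIN || res == LONG_MAX) && errno == ERANGE)
        return -1;

    /* Where long is 64 bits wide, the value may still not fit in 32 bits. */
    if (res < INT32_MIN || res > INT32_MAX)
        return -1;

    *val = (int32_t)res;
    return 0;
}
```

The sketch looks at the LONG_MIN/LONG_MAX result before consulting errno, as suggested in the quoted text, and the explicit 32-bit range test covers platforms where long is wider than 32 bits.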
^ permalink raw reply
* [PATCH] Fix skb_update_prio
From: igorm @ 2011-11-25 17:44 UTC (permalink / raw)
To: netdev; +Cc: Igor Maravic
In-Reply-To: <1322243094-10420-1-git-send-email-igorm@etf.rs>
From: Igor Maravic <igorm@etf.rs>
Change function rcu_dereference to rcu_dereference_bh to avoid warning
[ INFO: suspicious RCU usage. ]
-------------------------------
net/core/dev.c:2459 suspicious rcu_dereference_check() usage!
because we are locking with
rcu_read_lock_bh();
in function dev_queue_xmit(struct sk_buff *skb)
Signed-off-by: Igor Maravic <igorm@etf.rs>
---
net/core/dev.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 8afb244..d1f1071 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2452,7 +2452,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
#if IS_ENABLED(CONFIG_NETPRIO_CGROUP)
static void skb_update_prio(struct sk_buff *skb)
{
- struct netprio_map *map = rcu_dereference(skb->dev->priomap);
+ struct netprio_map *map = rcu_dereference_bh(skb->dev->priomap);
if ((!skb->priority) && (skb->sk) && map)
skb->priority = map->priomap[skb->sk->sk_cgrp_prioidx];
--
1.7.5.4
^ permalink raw reply related
* [PATCH] Fix skb_update_prio
From: igorm @ 2011-11-25 17:44 UTC (permalink / raw)
To: netdev; +Cc: Igor Maravic
From: Igor Maravic <igorm@etf.rs>
Fixed warning
Nov 25 14:50:07 igor-PC kernel: [ 94.864139]
Nov 25 14:50:07 igor-PC kernel: [ 94.864143] ===============================
Nov 25 14:50:07 igor-PC kernel: [ 94.864144] [ INFO: suspicious RCU usage. ]
Nov 25 14:50:07 igor-PC kernel: [ 94.864146] -------------------------------
Nov 25 14:50:07 igor-PC kernel: [ 94.864149] net/core/dev.c:2459 suspicious rcu_dereference_check() usage!
Nov 25 14:50:07 igor-PC kernel: [ 94.864151]
Nov 25 14:50:07 igor-PC kernel: [ 94.864151] other info that might help us debug this:
Nov 25 14:50:07 igor-PC kernel: [ 94.864152]
Nov 25 14:50:07 igor-PC kernel: [ 94.864153]
Nov 25 14:50:07 igor-PC kernel: [ 94.864154] rcu_scheduler_active = 1, debug_locks = 1
Nov 25 14:50:07 igor-PC kernel: [ 94.864156] 3 locks held by kworker/0:0/0:
Nov 25 14:50:07 igor-PC kernel: [ 94.864158] #0: (&n->timer){+.-...}, at: [<c1074320>] call_timer_fn+0x0/0x2d0
Nov 25 14:50:07 igor-PC kernel: [ 94.864168] #1: (&n->lock){++--..}, at: [<c156d3d8>] arp_solicit+0x228/0x300
Nov 25 14:50:07 igor-PC kernel: [ 94.864173] #2: (rcu_read_lock_bh){.+....}, at: [<c14fdd10>] dev_queue_xmit+0x0/0x6e0
Nov 25 14:50:07 igor-PC kernel: [ 94.864179]
Nov 25 14:50:07 igor-PC kernel: [ 94.864180] stack backtrace:
Nov 25 14:50:07 igor-PC kernel: [ 94.864183] Pid: 0, comm: kworker/0:0 Not tainted 3.2.0-rc2-net-next-mpls+ #13
Nov 25 14:50:07 igor-PC kernel: [ 94.864185] Call Trace:
Nov 25 14:50:07 igor-PC kernel: [ 94.864189] [<c1617d55>] ? printk+0x2d/0x2f
Nov 25 14:50:07 igor-PC kernel: [ 94.864194] [<c109dfaa>] lockdep_rcu_suspicious+0xaa/0xc0
Nov 25 14:50:07 igor-PC kernel: [ 94.864197] [<c14fe17a>] dev_queue_xmit+0x46a/0x6e0
Nov 25 14:50:07 igor-PC kernel: [ 94.864200] [<c14fdd10>] ? dev_hard_start_xmit+0x650/0x650
Nov 25 14:50:07 igor-PC kernel: [ 94.864203] [<c156cc6f>] ? arp_create+0x1ff/0x220
Nov 25 14:50:07 igor-PC kernel: [ 94.864205] [<c156cf5d>] arp_xmit+0x1d/0x60
Nov 25 14:50:07 igor-PC kernel: [ 94.864209] [<c1516fd0>] ? eth_rebuild_header+0x80/0x80
Nov 25 14:50:07 igor-PC kernel: [ 94.864211] [<c156cff5>] arp_send+0x55/0x60
Nov 25 14:50:07 igor-PC kernel: [ 94.864214] [<c156d40a>] arp_solicit+0x25a/0x300
Nov 25 14:50:07 igor-PC kernel: [ 94.864217] [<c15041dd>] neigh_probe+0x3d/0x60
Nov 25 14:50:07 igor-PC kernel: [ 94.864220] [<c1506f5f>] neigh_timer_handler+0x16f/0x260
Nov 25 14:50:07 igor-PC kernel: [ 94.864223] [<c107439d>] call_timer_fn+0x7d/0x2d0
Nov 25 14:50:07 igor-PC kernel: [ 94.864226] [<c1074320>] ? init_timer_deferrable_key+0x20/0x20
Nov 25 14:50:07 igor-PC kernel: [ 94.864230] [<c1630837>] ? _raw_spin_unlock_irq+0x27/0x40
Nov 25 14:50:07 igor-PC kernel: [ 94.864233] [<c1506df0>] ? neigh_update+0x560/0x560
Nov 25 14:50:07 igor-PC kernel: [ 94.864236] [<c10746e0>] run_timer_softirq+0xf0/0x260
Nov 25 14:50:07 igor-PC kernel: [ 94.864239] [<c12e1802>] ? blk_done_softirq+0x42/0x80
Nov 25 14:50:07 igor-PC kernel: [ 94.864242] [<c106b6e0>] ? remote_softirq_receive+0x80/0x80
Nov 25 14:50:07 igor-PC kernel: [ 94.864245] [<c1506df0>] ? neigh_update+0x560/0x560
Nov 25 14:50:07 igor-PC kernel: [ 94.864248] [<c106b777>] __do_softirq+0x97/0x320
Nov 25 14:50:07 igor-PC kernel: [ 94.864251] [<c106b6e0>] ? remote_softirq_receive+0x80/0x80
Nov 25 14:50:07 igor-PC kernel: [ 94.864252] <IRQ> [<c106bcd6>] ? irq_exit+0x86/0xb0
Nov 25 14:50:07 igor-PC kernel: [ 94.864258] [<c1639139>] ? smp_apic_timer_interrupt+0x59/0x88
Nov 25 14:50:07 igor-PC kernel: [ 94.864262] [<c12ff648>] ? trace_hardirqs_off_thunk+0xc/0x14
Nov 25 14:50:07 igor-PC kernel: [ 94.864265] [<c1631492>] ? apic_timer_interrupt+0x36/0x3c
Nov 25 14:50:07 igor-PC kernel: [ 94.864269] [<c103ac4a>] ? native_safe_halt+0xa/0x10
Nov 25 14:50:07 igor-PC kernel: [ 94.864272] [<c101cdf1>] ? default_idle.part.5+0x41/0x230
Nov 25 14:50:07 igor-PC kernel: [ 94.864275] [<c101cfff>] ? default_idle+0x1f/0x50
Nov 25 14:50:07 igor-PC kernel: [ 94.864277] [<c101d0da>] ? amd_e400_idle+0xaa/0x140
Nov 25 14:50:07 igor-PC kernel: [ 94.864280] [<c10146d6>] ? cpu_idle+0xb6/0x120
Nov 25 14:50:07 igor-PC kernel: [ 94.864284] [<c16103f2>] ? start_secondary+0x101/0x106
To avoid this warning, I replaced rcu_dereference() with rcu_dereference_bh()
in skb_update_prio().
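
For readers unfamiliar with the RCU flavors, here is a rough userspace model of
why the splat appears. It is only a sketch: struct reader_state and both
model_* helpers are made-up names, not kernel API, and the real bh check also
accepts softirq context. The lock list above shows only rcu_read_lock_bh held
at dev_queue_xmit(), so a check that looks for the plain flavor complains.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the lockdep state of the current context. */
struct reader_state {
	bool in_rcu;     /* inside rcu_read_lock()    (assumed flag) */
	bool in_rcu_bh;  /* inside rcu_read_lock_bh() (assumed flag) */
};

/* Models the debug check made by rcu_dereference(): plain flavor expected. */
static bool model_deref_ok(const struct reader_state *s)
{
	return s->in_rcu;
}

/* Models the debug check made by rcu_dereference_bh(): bh flavor expected. */
static bool model_deref_bh_ok(const struct reader_state *s)
{
	return s->in_rcu_bh;
}
```

With only the bh flavor held, the first check fails and the second passes,
which matches the one-line change in the patch.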
BR
Igor
Igor Maravic (1):
Fix skb_update_prio
net/core/dev.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
--
1.7.5.4
^ permalink raw reply
* [PATCH v6 10/10] Disable task moving when using kernel memory accounting
From: Glauber Costa @ 2011-11-25 17:38 UTC (permalink / raw)
To: linux-kernel
Cc: lizf, kamezawa.hiroyu, ebiederm, davem, paul, gthelen, netdev,
linux-mm, kirill, avagin, devel, eric.dumazet, cgroups,
Glauber Costa
In-Reply-To: <1322242696-27682-1-git-send-email-glommer@parallels.com>
Since this code is still experimental, we are leaving the exact
details of how to move tasks between cgroups when kernel memory
accounting is used as future work.
For now, we simply disallow movement if there is any pending
accounted memory.
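
As a reading aid, the condition added to the can_attach hooks can be modelled
in plain C roughly as follows. This is a sketch only: struct memcg_model and
move_blocked() are hypothetical names, and the real code reads the usage
through res_counter_read_u64().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a memory cgroup. */
struct memcg_model {
	bool is_root;              /* root cgroup is never accounted */
	uint64_t tcp_usage_bytes;  /* kernel (tcp) memory currently charged */
};

/*
 * Returns true when the task move must be refused: the task would change
 * groups, the source is not the root group, and the source still holds
 * accounted kernel memory.
 */
static bool move_blocked(const struct memcg_model *from,
			 const struct memcg_model *to)
{
	return from != to && !from->is_root && from->tcp_usage_bytes != 0;
}
```

A move out of the root group, or out of a group with nothing charged, is
still allowed under this model.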
Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 23 ++++++++++++++++++++++-
1 files changed, 22 insertions(+), 1 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2df5d3c..ab7e57b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5451,10 +5451,19 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
{
int ret = 0;
struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
+ struct mem_cgroup *from = mem_cgroup_from_task(p);
+
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) && defined(CONFIG_INET)
+ if (from != mem && !mem_cgroup_is_root(from) &&
+ res_counter_read_u64(&from->tcp_mem.tcp_memory_allocated, RES_USAGE)) {
+ printk(KERN_WARNING "Can't move tasks between cgroups: "
+ "Kernel memory held. task: %s\n", p->comm);
+ return 1;
+ }
+#endif
if (mem->move_charge_at_immigrate) {
struct mm_struct *mm;
- struct mem_cgroup *from = mem_cgroup_from_task(p);
VM_BUG_ON(from == mem);
@@ -5622,6 +5631,18 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
struct cgroup *cgroup,
struct task_struct *p)
{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
+ struct mem_cgroup *from = mem_cgroup_from_task(p);
+
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) && defined(CONFIG_INET)
+ if (from != mem && !mem_cgroup_is_root(from) &&
+ res_counter_read_u64(&from->tcp_mem.tcp_memory_allocated, RES_USAGE)) {
+ printk(KERN_WARNING "Can't move tasks between cgroups: "
+ "Kernel memory held. task: %s\n", p->comm);
+ return 1;
+ }
+#endif
+
return 0;
}
static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
--
1.7.6.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related
* [PATCH v6 09/10] Display maximum tcp memory allocation in kmem cgroup
From: Glauber Costa @ 2011-11-25 17:38 UTC (permalink / raw)
To: linux-kernel
Cc: lizf, kamezawa.hiroyu, ebiederm, davem, paul, gthelen, netdev,
linux-mm, kirill, avagin, devel, eric.dumazet, cgroups,
Glauber Costa
In-Reply-To: <1322242696-27682-1-git-send-email-glommer@parallels.com>
This patch introduces kmem.tcp.max_usage_in_bytes file, living in the
kmem_cgroup filesystem. The root cgroup will display a value equal
to RESOURCE_MAX. This is to avoid introducing any locking schemes in
the network paths when cgroups are not being actively used.
All other cgroups will see the maximum memory ever used by that cgroup.
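
To make the "maximum ever used" semantics concrete, here is a small userspace
sketch of a high-water-mark counter. The names (struct counter_model and the
model_* helpers) are invented for illustration. The reset behaviour assumes
that resetting the maximum sets it back to the current usage, which is how I
understand res_counter_reset_max().

```c
#include <stdint.h>

/* Hypothetical counter tracking current usage and its high-water mark. */
struct counter_model {
	uint64_t usage;
	uint64_t max_usage;
};

/* Charge memory and raise the high-water mark if it was exceeded. */
static void model_charge(struct counter_model *c, uint64_t bytes)
{
	c->usage += bytes;
	if (c->usage > c->max_usage)
		c->max_usage = c->usage;
}

/* Uncharge memory; the high-water mark is deliberately left alone. */
static void model_uncharge(struct counter_model *c, uint64_t bytes)
{
	c->usage -= (bytes > c->usage) ? c->usage : bytes;
}

/* Reset: the maximum restarts from the current usage (assumed behaviour). */
static void model_reset_max(struct counter_model *c)
{
	c->max_usage = c->usage;
}
```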
Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
net/ipv4/tcp_memcg.c | 12 +++++++++++-
1 files changed, 11 insertions(+), 1 deletions(-)
diff --git a/net/ipv4/tcp_memcg.c b/net/ipv4/tcp_memcg.c
index da8d9c0..e702f6a 100644
--- a/net/ipv4/tcp_memcg.c
+++ b/net/ipv4/tcp_memcg.c
@@ -28,6 +28,12 @@ static struct cftype tcp_files[] = {
.trigger = tcp_cgroup_reset,
.read_u64 = tcp_cgroup_read,
},
+ {
+ .name = "kmem.tcp.max_usage_in_bytes",
+ .private = RES_MAX_USAGE,
+ .trigger = tcp_cgroup_reset,
+ .read_u64 = tcp_cgroup_read,
+ },
};
static inline struct tcp_memcontrol *tcp_from_cgproto(struct cg_proto *cg_proto)
@@ -196,7 +202,8 @@ static u64 tcp_cgroup_read(struct cgroup *cont, struct cftype *cft)
val = tcp_read_usage(memcg);
break;
case RES_FAILCNT:
- val = tcp_read_stat(memcg, RES_FAILCNT, 0);
+ case RES_MAX_USAGE:
+ val = tcp_read_stat(memcg, cft->private, 0);
break;
default:
BUG();
@@ -217,6 +224,9 @@ static int tcp_cgroup_reset(struct cgroup *cont, unsigned int event)
tcp = tcp_from_cgproto(cg_proto);
switch (event) {
+ case RES_MAX_USAGE:
+ res_counter_reset_max(&tcp->tcp_memory_allocated);
+ break;
case RES_FAILCNT:
res_counter_reset_failcnt(&tcp->tcp_memory_allocated);
break;
--
1.7.6.4
^ permalink raw reply related
* [PATCH v6 08/10] Display current tcp failcnt in kmem cgroup
From: Glauber Costa @ 2011-11-25 17:38 UTC (permalink / raw)
To: linux-kernel
Cc: lizf, kamezawa.hiroyu, ebiederm, davem, paul, gthelen, netdev,
linux-mm, kirill, avagin, devel, eric.dumazet, cgroups,
Glauber Costa
In-Reply-To: <1322242696-27682-1-git-send-email-glommer@parallels.com>
This patch introduces the kmem.tcp.failcnt file, living in the
kmem_cgroup filesystem. Following the pattern in the other
memcg resources, this file keeps a counter of how many times
allocation failed due to limits being hit in this cgroup.
The root cgroup will always show a failcnt of 0.
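
A minimal sketch of what such a failure counter does, in userspace C. The
struct and function names are hypothetical and only mirror the idea of a
limited counter. The real accounting lives in res_counter.

```c
#include <stdint.h>

/* Hypothetical limited counter with a failure count. */
struct limited_counter {
	uint64_t usage;
	uint64_t limit;
	uint64_t failcnt;
};

/*
 * Try to charge 'bytes'. On success returns 0. If the charge would exceed
 * the limit, nothing is charged, failcnt is bumped and -1 is returned.
 */
static int model_try_charge(struct limited_counter *c, uint64_t bytes)
{
	if (c->usage > c->limit || bytes > c->limit - c->usage) {
		c->failcnt++;
		return -1;
	}
	c->usage += bytes;
	return 0;
}

/* Models the trigger on the failcnt file: start counting from zero again. */
static void model_reset_failcnt(struct limited_counter *c)
{
	c->failcnt = 0;
}
```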
Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
net/ipv4/tcp_memcg.c | 31 +++++++++++++++++++++++++++++++
1 files changed, 31 insertions(+), 0 deletions(-)
diff --git a/net/ipv4/tcp_memcg.c b/net/ipv4/tcp_memcg.c
index a1ab613..da8d9c0 100644
--- a/net/ipv4/tcp_memcg.c
+++ b/net/ipv4/tcp_memcg.c
@@ -8,6 +8,7 @@
static u64 tcp_cgroup_read(struct cgroup *cont, struct cftype *cft);
static int tcp_cgroup_write(struct cgroup *cont, struct cftype *cft,
const char *buffer);
+static int tcp_cgroup_reset(struct cgroup *cont, unsigned int event);
static struct cftype tcp_files[] = {
{
@@ -21,6 +22,12 @@ static struct cftype tcp_files[] = {
.read_u64 = tcp_cgroup_read,
.private = RES_USAGE,
},
+ {
+ .name = "kmem.tcp.failcnt",
+ .private = RES_FAILCNT,
+ .trigger = tcp_cgroup_reset,
+ .read_u64 = tcp_cgroup_read,
+ },
};
static inline struct tcp_memcontrol *tcp_from_cgproto(struct cg_proto *cg_proto)
@@ -188,12 +195,36 @@ static u64 tcp_cgroup_read(struct cgroup *cont, struct cftype *cft)
case RES_USAGE:
val = tcp_read_usage(memcg);
break;
+ case RES_FAILCNT:
+ val = tcp_read_stat(memcg, RES_FAILCNT, 0);
+ break;
default:
BUG();
}
return val;
}
+static int tcp_cgroup_reset(struct cgroup *cont, unsigned int event)
+{
+ struct mem_cgroup *memcg;
+ struct tcp_memcontrol *tcp;
+ struct cg_proto *cg_proto;
+
+ memcg = mem_cgroup_from_cont(cont);
+ cg_proto = tcp_prot.proto_cgroup(memcg);
+ if (!cg_proto)
+ return 0;
+ tcp = tcp_from_cgproto(cg_proto);
+
+ switch (event) {
+ case RES_FAILCNT:
+ res_counter_reset_failcnt(&tcp->tcp_memory_allocated);
+ break;
+ }
+
+ return 0;
+}
+
unsigned long long tcp_max_memory(const struct mem_cgroup *memcg)
{
struct tcp_memcontrol *tcp;
--
1.7.6.4
^ permalink raw reply related
* [PATCH v6 07/10] Display current tcp memory allocation in kmem cgroup
From: Glauber Costa @ 2011-11-25 17:38 UTC (permalink / raw)
To: linux-kernel
Cc: lizf, kamezawa.hiroyu, ebiederm, davem, paul, gthelen, netdev,
linux-mm, kirill, avagin, devel, eric.dumazet, cgroups,
Glauber Costa
In-Reply-To: <1322242696-27682-1-git-send-email-glommer@parallels.com>
This patch introduces kmem.tcp.usage_in_bytes file, living in the
kmem_cgroup filesystem. It is a simple read-only file that displays the
amount of kernel memory currently consumed by the cgroup.
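
The read path in the diff below falls back to the global counter when the
cgroup has no per-cgroup state, converting pages to bytes. A rough model,
assuming 4 KiB pages (MODEL_PAGE_SHIFT, struct usage_model and
model_read_usage() are made-up names for this sketch):

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Hypothetical view of a cgroup as seen by the read handler. */
struct usage_model {
	int has_cgroup_state;         /* 0 models a NULL cg_proto */
	uint64_t cgroup_usage_bytes;  /* per-cgroup usage, already in bytes */
};

/*
 * Report usage in bytes. Without per-cgroup state, the global allocation
 * (kept in pages) is shifted into bytes; otherwise the per-cgroup counter
 * is returned unchanged.
 */
static uint64_t model_read_usage(const struct usage_model *m, long global_pages)
{
	if (!m->has_cgroup_state)
		return (uint64_t)global_pages << MODEL_PAGE_SHIFT;
	return m->cgroup_usage_bytes;
}
```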
Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
Documentation/cgroups/memory.txt | 1 +
net/ipv4/tcp_memcg.c | 21 +++++++++++++++++++++
2 files changed, 22 insertions(+), 0 deletions(-)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index c1db134..00f1a88 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -79,6 +79,7 @@ Brief summary of control files.
memory.independent_kmem_limit # select whether or not kernel memory limits are
independent of user limits
memory.kmem.tcp.limit_in_bytes # set/show hard limit for tcp buf memory
+ memory.kmem.tcp.usage_in_bytes # show current tcp buf memory allocation
1. History
diff --git a/net/ipv4/tcp_memcg.c b/net/ipv4/tcp_memcg.c
index b3721c3..a1ab613 100644
--- a/net/ipv4/tcp_memcg.c
+++ b/net/ipv4/tcp_memcg.c
@@ -16,6 +16,11 @@ static struct cftype tcp_files[] = {
.read_u64 = tcp_cgroup_read,
.private = RES_LIMIT,
},
+ {
+ .name = "kmem.tcp.usage_in_bytes",
+ .read_u64 = tcp_cgroup_read,
+ .private = RES_USAGE,
+ },
};
static inline struct tcp_memcontrol *tcp_from_cgproto(struct cg_proto *cg_proto)
@@ -158,6 +163,19 @@ static u64 tcp_read_stat(struct mem_cgroup *memcg, int type, u64 default_val)
return res_counter_read_u64(&tcp->tcp_memory_allocated, type);
}
+static u64 tcp_read_usage(struct mem_cgroup *memcg)
+{
+ struct tcp_memcontrol *tcp;
+ struct cg_proto *cg_proto;
+
+ cg_proto = tcp_prot.proto_cgroup(memcg);
+ if (!cg_proto)
+ return atomic_long_read(&tcp_memory_allocated) << PAGE_SHIFT;
+
+ tcp = tcp_from_cgproto(cg_proto);
+ return res_counter_read_u64(&tcp->tcp_memory_allocated, RES_USAGE);
+}
+
static u64 tcp_cgroup_read(struct cgroup *cont, struct cftype *cft)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
@@ -167,6 +185,9 @@ static u64 tcp_cgroup_read(struct cgroup *cont, struct cftype *cft)
case RES_LIMIT:
val = tcp_read_stat(memcg, RES_LIMIT, RESOURCE_MAX);
break;
+ case RES_USAGE:
+ val = tcp_read_usage(memcg);
+ break;
default:
BUG();
}
--
1.7.6.4
^ permalink raw reply related
* [PATCH v6 06/10] tcp buffer limitation: per-cgroup limit
From: Glauber Costa @ 2011-11-25 17:38 UTC (permalink / raw)
To: linux-kernel
Cc: lizf, kamezawa.hiroyu, ebiederm, davem, paul, gthelen, netdev,
linux-mm, kirill, avagin, devel, eric.dumazet, cgroups,
Glauber Costa
In-Reply-To: <1322242696-27682-1-git-send-email-glommer@parallels.com>
This patch uses the "tcp.limit_in_bytes" field of the kmem_cgroup to
effectively control the amount of kernel memory pinned by a cgroup.
This value is ignored in the root cgroup, and in all others,
caps the value specified by the admin in the net namespaces'
view of tcp_sysctl_mem.
If namespaces are being used, the admin is allowed to set a
value bigger than the cgroup's maximum, the same way it is allowed
to set pretty much unlimited values on a real box.
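
The capping described above can be sketched in userspace C like this. It only
models the arithmetic: the limit is given in bytes, the three tcp_mem values
are in pages, and each effective value is the smaller of the two. The names
and the 4 KiB page size are assumptions made for the example.

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/*
 * Convert the cgroup limit from bytes to pages and cap each of the three
 * tcp_mem thresholds (also in pages) at that value.
 */
static void model_apply_limit(uint64_t limit_bytes,
			      const long sysctl_tcp_mem[3], long out[3])
{
	long limit_pages = (long)(limit_bytes >> MODEL_PAGE_SHIFT);
	int i;

	for (i = 0; i < 3; i++)
		out[i] = limit_pages < sysctl_tcp_mem[i] ?
			 limit_pages : sysctl_tcp_mem[i];
}
```

For instance, an 8 MiB limit is 2048 pages, so thresholds above 2048 pages are
pulled down to 2048 while smaller ones are left alone.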
Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
Documentation/cgroups/memory.txt | 1 +
include/net/tcp_memcg.h | 3 +
net/ipv4/sysctl_net_ipv4.c | 14 ++++
net/ipv4/tcp_memcg.c | 138 +++++++++++++++++++++++++++++++++++++-
4 files changed, 154 insertions(+), 2 deletions(-)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index bf00cd2..c1db134 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -78,6 +78,7 @@ Brief summary of control files.
memory.independent_kmem_limit # select whether or not kernel memory limits are
independent of user limits
+ memory.kmem.tcp.limit_in_bytes # set/show hard limit for tcp buf memory
1. History
diff --git a/include/net/tcp_memcg.h b/include/net/tcp_memcg.h
index 5f5e158..2c8bb6b 100644
--- a/include/net/tcp_memcg.h
+++ b/include/net/tcp_memcg.h
@@ -14,4 +14,7 @@ struct tcp_memcontrol {
struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg);
int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss);
void tcp_destroy_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss);
+unsigned long long tcp_max_memory(const struct mem_cgroup *memcg);
+void tcp_prot_mem(struct mem_cgroup *memcg, long val, int idx);
+int tcp_update_limit(struct mem_cgroup *memcg, u64 val);
#endif /* _TCP_MEMCG_H */
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index bbd67ab..17aaa1b 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -24,6 +24,7 @@
#include <net/cipso_ipv4.h>
#include <net/inet_frag.h>
#include <net/ping.h>
+#include <net/tcp_memcg.h>
static int zero;
static int tcp_retr1_max = 255;
@@ -182,6 +183,9 @@ static int ipv4_tcp_mem(ctl_table *ctl, int write,
int ret;
unsigned long vec[3];
struct net *net = current->nsproxy->net_ns;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+ struct mem_cgroup *cg;
+#endif
ctl_table tmp = {
.data = &vec,
@@ -198,6 +202,16 @@ static int ipv4_tcp_mem(ctl_table *ctl, int write,
if (ret)
return ret;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+ rcu_read_lock();
+ cg = mem_cgroup_from_task(current);
+
+ tcp_prot_mem(cg, vec[0], 0);
+ tcp_prot_mem(cg, vec[1], 1);
+ tcp_prot_mem(cg, vec[2], 2);
+ rcu_read_unlock();
+#endif
+
net->ipv4.sysctl_tcp_mem[0] = vec[0];
net->ipv4.sysctl_tcp_mem[1] = vec[1];
net->ipv4.sysctl_tcp_mem[2] = vec[2];
diff --git a/net/ipv4/tcp_memcg.c b/net/ipv4/tcp_memcg.c
index 1dbc0f3..b3721c3 100644
--- a/net/ipv4/tcp_memcg.c
+++ b/net/ipv4/tcp_memcg.c
@@ -5,6 +5,19 @@
#include <linux/nsproxy.h>
#include <linux/memcontrol.h>
+static u64 tcp_cgroup_read(struct cgroup *cont, struct cftype *cft);
+static int tcp_cgroup_write(struct cgroup *cont, struct cftype *cft,
+ const char *buffer);
+
+static struct cftype tcp_files[] = {
+ {
+ .name = "kmem.tcp.limit_in_bytes",
+ .write_string = tcp_cgroup_write,
+ .read_u64 = tcp_cgroup_read,
+ .private = RES_LIMIT,
+ },
+};
+
static inline struct tcp_memcontrol *tcp_from_cgproto(struct cg_proto *cg_proto)
{
return container_of(cg_proto, struct tcp_memcontrol, cg_proto);
@@ -26,7 +39,7 @@ int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
cg_proto = tcp_prot.proto_cgroup(memcg);
if (!cg_proto)
- return 0;
+ goto create_files;
tcp = tcp_from_cgproto(cg_proto);
cg_proto->parent = tcp_prot.proto_cgroup(parent);
@@ -47,7 +60,9 @@ int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
cg_proto->memory_allocated = &tcp->tcp_memory_allocated;
cg_proto->sockets_allocated = &tcp->tcp_sockets_allocated;
- return 0;
+create_files:
+ return cgroup_add_files(cgrp, ss, tcp_files,
+ ARRAY_SIZE(tcp_files));
}
EXPORT_SYMBOL(tcp_init_cgroup);
@@ -56,6 +71,7 @@ void tcp_destroy_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
struct cg_proto *cg_proto;
struct tcp_memcontrol *tcp;
+ u64 val;
cg_proto = tcp_prot.proto_cgroup(memcg);
if (!cg_proto)
@@ -63,5 +79,123 @@ void tcp_destroy_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
tcp = tcp_from_cgproto(cg_proto);
percpu_counter_destroy(&tcp->tcp_sockets_allocated);
+
+ val = res_counter_read_u64(&tcp->tcp_memory_allocated, RES_USAGE);
+
+ if (val != RESOURCE_MAX)
+ jump_label_dec(&memcg_socket_limit_enabled);
}
EXPORT_SYMBOL(tcp_destroy_cgroup);
+
+int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
+{
+ struct net *net = current->nsproxy->net_ns;
+ struct tcp_memcontrol *tcp;
+ struct cg_proto *cg_proto;
+ int i;
+ int ret;
+
+ cg_proto = tcp_prot.proto_cgroup(memcg);
+ if (!cg_proto)
+ return -EINVAL;
+
+ tcp = tcp_from_cgproto(cg_proto);
+
+ ret = res_counter_set_limit(&tcp->tcp_memory_allocated, val);
+ if (ret)
+ return ret;
+
+ val >>= PAGE_SHIFT;
+
+ for (i = 0; i < 3; i++)
+ tcp->tcp_prot_mem[i] = min_t(long, val,
+ net->ipv4.sysctl_tcp_mem[i]);
+
+ if (val == RESOURCE_MAX)
+ jump_label_dec(&memcg_socket_limit_enabled);
+ else {
+ u64 old_lim;
+ old_lim = res_counter_read_u64(&tcp->tcp_memory_allocated,
+ RES_LIMIT);
+ if (old_lim == RESOURCE_MAX)
+ jump_label_inc(&memcg_socket_limit_enabled);
+ }
+ return 0;
+}
+
+static int tcp_cgroup_write(struct cgroup *cont, struct cftype *cft,
+ const char *buffer)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+ unsigned long long val;
+ int ret = 0;
+
+ switch (cft->private) {
+ case RES_LIMIT:
+ /* see memcontrol.c */
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ ret = tcp_update_limit(memcg, val);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ return ret;
+}
+
+static u64 tcp_read_stat(struct mem_cgroup *memcg, int type, u64 default_val)
+{
+ struct tcp_memcontrol *tcp;
+ struct cg_proto *cg_proto;
+
+ cg_proto = tcp_prot.proto_cgroup(memcg);
+ if (!cg_proto)
+ return default_val;
+
+ tcp = tcp_from_cgproto(cg_proto);
+ return res_counter_read_u64(&tcp->tcp_memory_allocated, type);
+}
+
+static u64 tcp_cgroup_read(struct cgroup *cont, struct cftype *cft)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+ u64 val;
+
+ switch (cft->private) {
+ case RES_LIMIT:
+ val = tcp_read_stat(memcg, RES_LIMIT, RESOURCE_MAX);
+ break;
+ default:
+ BUG();
+ }
+ return val;
+}
+
+unsigned long long tcp_max_memory(const struct mem_cgroup *memcg)
+{
+ struct tcp_memcontrol *tcp;
+ struct cg_proto *cg_proto;
+
+ cg_proto = tcp_prot.proto_cgroup((struct mem_cgroup *)memcg);
+ if (!cg_proto)
+ return 0;
+
+ tcp = tcp_from_cgproto(cg_proto);
+ return res_counter_read_u64(&tcp->tcp_memory_allocated, RES_LIMIT);
+}
+
+void tcp_prot_mem(struct mem_cgroup *memcg, long val, int idx)
+{
+ struct tcp_memcontrol *tcp;
+ struct cg_proto *cg_proto;
+
+ cg_proto = tcp_prot.proto_cgroup(memcg);
+ if (!cg_proto)
+ return;
+
+ tcp = tcp_from_cgproto(cg_proto);
+
+ tcp->tcp_prot_mem[idx] = val;
+}
--
1.7.6.4
^ permalink raw reply related