From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Pankaj Gupta <pagupta@redhat.com>
Cc: Tariq Toukan <ttoukan.linux@gmail.com>,
Mel Gorman <mgorman@techsingularity.net>,
Tariq Toukan <tariqt@mellanox.com>,
netdev@vger.kernel.org, akpm@linux-foundation.org,
linux-mm <linux-mm@kvack.org>,
Saeed Mahameed <saeedm@mellanox.com>,
brouer@redhat.com
Subject: Re: Page allocator order-0 optimizations merged
Date: Mon, 27 Mar 2017 10:55:14 +0200
Message-ID: <20170327105514.1ed5b1ba@redhat.com>
In-Reply-To: <2123321554.7161128.1490599967015.JavaMail.zimbra@redhat.com>
On Mon, 27 Mar 2017 03:32:47 -0400 (EDT)
Pankaj Gupta <pagupta@redhat.com> wrote:
> Hello,
>
> It looks like a race between softirq and normal process context.
>
> Just wondering if we really want allocations from softirqs to be
> done using the per-cpu list?
Yes, softirqs need fast page allocs. The softirq use-case is refilling
the DMA RX rings, which is time critical, especially for NIC drivers.
For this reason most drivers implement their own page recycling tricks.
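For illustration only, a minimal, hypothetical per-RX-ring recycle
cache (made-up code, not taken from any driver; real drivers like
mlx4/mlx5 also handle DMA mapping and more corner cases):

/* assumes <linux/mm.h> and <linux/gfp.h>; sketch, not driver code */
struct rx_page_cache {
	struct page *pages[64];
	int count;
};

static struct page *rx_cache_get(struct rx_page_cache *c)
{
	if (c->count)
		return c->pages[--c->count];	/* recycle, no allocator call */
	return alloc_page(GFP_ATOMIC);		/* fall back to page allocator */
}

static void rx_cache_put(struct rx_page_cache *c, struct page *page)
{
	/* only recycle pages nobody else still holds a reference on */
	if (c->count < 64 && page_ref_count(page) == 1) {
		c->pages[c->count++] = page;
		return;
	}
	put_page(page);
}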
> Or we could add a check in 'free_hot_cold_page': if we are on a path
> returning from a hard interrupt, don't use the per-cpu list.
A possible solution would be to use local_bh_{disable,enable} instead
of the preempt_{disable,enable} calls. But it is slower; using numbers
from [1] (19 vs 11 cycles), the expected saving is then only
38-19=19 cycles. The problematic part of using local_bh_enable is that
it adds a softirq/bottom-halves rescheduling point (as it checks for
pending BHs), which might affect real workloads.
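I.e. roughly something like this (untested sketch against v4.11-rc1),
in the pcp fast paths of free_hot_cold_page() and rmqueue_pcplist():

-	preempt_disable();
+	local_bh_disable();	/* also excludes softirq users of the pcp */
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	[...]
-	preempt_enable();
+	local_bh_enable();	/* the BH rescheduling point mentioned above */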
I'm unsure what the best option is. I'm leaning towards partly
reverting [1] and going back to the slower local_irq_save +
local_irq_restore as before.
Afterwards we can add a bulk page alloc+free call that amortizes this
38-cycle cost (of local_irq_{save,restore}). Or add a function call
that MUST only be called from contexts with IRQs enabled, which allows
using local_irq_{disable,enable} unconditionally, as that only costs
7 cycles.
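A hypothetical sketch of what such a bulk alloc call could look like
(function and helper names are made up here, this is not an existing
API):

/* Pull up to @nr order-0 pages, paying local_irq_save/restore once */
unsigned int alloc_pages_bulk(gfp_t gfp, unsigned int nr,
			      struct page **pages)
{
	unsigned long flags;
	unsigned int i = 0;

	local_irq_save(flags);		/* the ~38 cycles, paid once per batch */
	while (i < nr) {
		/* hypothetical helper: one order-0 page off the pcp list */
		struct page *page = pcplist_alloc_one(gfp);

		if (!page)
			break;
		pages[i++] = page;
	}
	local_irq_restore(flags);

	return i;			/* number of pages actually allocated */
}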
[1] commit 374ad05ab64d ("mm, page_alloc: only use per-cpu allocator for irq-safe requests")
https://git.kernel.org/torvalds/c/374ad05ab64d
> > On 26/03/2017 11:21 AM, Tariq Toukan wrote:
> > >
> > >
> > > On 23/03/2017 4:51 PM, Mel Gorman wrote:
> > >> On Thu, Mar 23, 2017 at 02:43:47PM +0100, Jesper Dangaard Brouer wrote:
> > >>> On Wed, 22 Mar 2017 23:40:04 +0000
> > >>> Mel Gorman <mgorman@techsingularity.net> wrote:
> > >>>
> > >>>> On Wed, Mar 22, 2017 at 07:39:17PM +0200, Tariq Toukan wrote:
> > >>>>>>>> This modification may slow allocations from IRQ context slightly
> > >>>>>>>> but the main gain from the per-cpu allocator is that it scales
> > >>>>>>>> better for allocations from multiple contexts. There is an
> > >>>>>>>> implicit assumption that intensive allocations from IRQ contexts
> > >>>>>>>> on multiple CPUs from a single NUMA node are rare
> > >>>>> Hi Mel, Jesper, and all.
> > >>>>>
> > >>>>> This assumption contradicts regular multi-stream traffic that is
> > >>>>> naturally handled over close NUMA cores. I compared iperf TCP
> > >>>>> multistream (8 streams) over CX4 (mlx5 driver) with kernels v4.10
> > >>>>> (before this series) vs kernel v4.11-rc1 (with this series).
> > >>>>> I disabled the page-cache (recycle) mechanism to stress the page
> > >>>>> allocator, and saw a drastic degradation in BW, from 47.5 G in
> > >>>>> v4.10 to 31.4 G in v4.11-rc1 (a 34% drop).
> > >>>>> I noticed queued_spin_lock_slowpath occupies 62.87% of CPU time.
> > >>>>
> > >>>> Can you get the stack trace for the spin lock slowpath to confirm it's
> > >>>> from IRQ context?
> > >>>
> > >>> AFAIK allocations happen in softirq. Argh, during review I missed
> > >>> that in_interrupt() also covers softirq. Mel, can we use an in_irq()
> > >>> check instead?
> > >>>
> > >>> (p.s. just landed and got home)
> > >
> > > Glad to hear. Thanks for your suggestion.
> > >
> > >>
> > >> Not built or even boot tested. I'm unable to run tests at the moment
> > >
> > > Thanks Mel, I will test it soon.
> > >
> > Crashed in an iperf single-stream test:
> >
> > [ 3974.123386] ------------[ cut here ]------------
> > [ 3974.128778] WARNING: CPU: 2 PID: 8754 at lib/list_debug.c:53
> > __list_del_entry_valid+0xa3/0xd0
> > [ 3974.138751] list_del corruption. prev->next should be
> > ffffea0040369c60, but was dead000000000100
> > [ 3974.149016] Modules linked in: netconsole nfsv3 nfs fscache dm_mirror
> > dm_region_hash dm_log dm_mod sb_edac edac_core x86_pkg_temp_thermal
> > coretemp i2c_diolan_u2c kvm irqbypass ipmi_si ipmi_devintf crc32_pclmul
> > iTCO_wdt ghash_clmulni_intel ipmi_msghandler dcdbas iTCO_vendor_support
> > sg pcspkr lpc_ich shpchp wmi mfd_core acpi_power_meter nfsd auth_rpcgss
> > nfs_acl lockd grace sunrpc binfmt_misc ip_tables mlx4_en sr_mod cdrom
> > sd_mod i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
> > fb_sys_fops mlx5_core ttm tg3 ahci libahci mlx4_core drm libata ptp
> > megaraid_sas crc32c_intel i2c_core pps_core [last unloaded: netconsole]
> > [ 3974.212743] CPU: 2 PID: 8754 Comm: iperf Not tainted 4.11.0-rc2+ #30
> > [ 3974.220073] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS
> > 1.5.4 10/002/2015
> > [ 3974.228974] Call Trace:
> > [ 3974.231925] <IRQ>
> > [ 3974.234405] dump_stack+0x63/0x8c
> > [ 3974.238355] __warn+0xd1/0xf0
> > [ 3974.241891] warn_slowpath_fmt+0x4f/0x60
> > [ 3974.246494] __list_del_entry_valid+0xa3/0xd0
> > [ 3974.251583] get_page_from_freelist+0x84c/0xb40
> > [ 3974.256868] ? napi_gro_receive+0x38/0x140
> > [ 3974.261666] __alloc_pages_nodemask+0xca/0x200
> > [ 3974.266866] mlx5e_alloc_rx_wqe+0x49/0x130 [mlx5_core]
> > [ 3974.272862] mlx5e_post_rx_wqes+0x84/0xc0 [mlx5_core]
> > [ 3974.278725] mlx5e_napi_poll+0xc7/0x450 [mlx5_core]
> > [ 3974.284409] net_rx_action+0x23d/0x3a0
> > [ 3974.288819] __do_softirq+0xd1/0x2a2
> > [ 3974.293054] irq_exit+0xb5/0xc0
> > [ 3974.296783] do_IRQ+0x51/0xd0
> > [ 3974.300353] common_interrupt+0x89/0x89
> > [ 3974.304859] RIP: 0010:free_hot_cold_page+0x228/0x280
> > [ 3974.310629] RSP: 0018:ffffc9000ea07c90 EFLAGS: 00000202 ORIG_RAX:
> > ffffffffffffffa8
> > [ 3974.319565] RAX: 0000000000000001 RBX: ffff88103f85f158 RCX:
> > ffffea0040369c60
> > [ 3974.327764] RDX: ffffea0040369c60 RSI: ffff88103f85f168 RDI:
> > ffffea0040369ca0
> > [ 3974.335961] RBP: ffffc9000ea07cc0 R08: ffff88103f85f168 R09:
> > 00000000000005a8
> > [ 3974.344178] R10: 00000000000005a8 R11: 0000000000010468 R12:
> > ffffea0040369c80
> > [ 3974.352387] R13: ffff88103f85f168 R14: ffff88107ffdeb80 R15:
> > ffffea0040369ca0
> > [ 3974.360577] </IRQ>
> > [ 3974.363145] __put_page+0x34/0x40
> > [ 3974.367068] skb_release_data+0xca/0xe0
> > [ 3974.371575] skb_release_all+0x24/0x30
> > [ 3974.375984] __kfree_skb+0x12/0x20
> > [ 3974.380003] tcp_recvmsg+0x6ac/0xaf0
> > [ 3974.384251] inet_recvmsg+0x3c/0xa0
> > [ 3974.388394] sock_recvmsg+0x3d/0x50
> > [ 3974.392511] SYSC_recvfrom+0xd3/0x140
> > [ 3974.396826] ? handle_mm_fault+0xce/0x240
> > [ 3974.401535] ? SyS_futex+0x71/0x150
> > [ 3974.405653] SyS_recvfrom+0xe/0x10
> > [ 3974.409673] entry_SYSCALL_64_fastpath+0x1a/0xa9
> > [ 3974.415056] RIP: 0033:0x7f04ca9315bb
> > [ 3974.419309] RSP: 002b:00007f04c955de70 EFLAGS: 00000246 ORIG_RAX:
> > 000000000000002d
> > [ 3974.428243] RAX: ffffffffffffffda RBX: 0000000000020000 RCX:
> > 00007f04ca9315bb
> > [ 3974.436450] RDX: 0000000000020000 RSI: 00007f04bc0008f0 RDI:
> > 0000000000000004
> > [ 3974.444653] RBP: 0000000000000000 R08: 0000000000000000 R09:
> > 0000000000000000
> > [ 3974.452851] R10: 0000000000000000 R11: 0000000000000246 R12:
> > 00007f04bc0008f0
> > [ 3974.461051] R13: 0000000000034ac8 R14: 00007f04bc020910 R15:
> > 000000000001c480
> > [ 3974.469297] ---[ end trace 6fd472c9e1973d53 ]---
> >
> >
> > >>
> > >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > >> index 6cbde310abed..f82225725bc1 100644
> > >> --- a/mm/page_alloc.c
> > >> +++ b/mm/page_alloc.c
> > >> @@ -2481,7 +2481,7 @@ void free_hot_cold_page(struct page *page, bool cold)
> > >>  	unsigned long pfn = page_to_pfn(page);
> > >>  	int migratetype;
> > >>  
> > >> -	if (in_interrupt()) {
> > >> +	if (in_irq()) {
> > >>  		__free_pages_ok(page, 0);
> > >>  		return;
> > >>  	}
> > >> @@ -2647,7 +2647,7 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
> > >>  {
> > >>  	struct page *page;
> > >>  
> > >> -	VM_BUG_ON(in_interrupt());
> > >> +	VM_BUG_ON(in_irq());
> > >>  
> > >>  	do {
> > >>  		if (list_empty(list)) {
> > >> @@ -2704,7 +2704,7 @@ struct page *rmqueue(struct zone *preferred_zone,
> > >>  	unsigned long flags;
> > >>  	struct page *page;
> > >>  
> > >> -	if (likely(order == 0) && !in_interrupt()) {
> > >> +	if (likely(order == 0) && !in_irq()) {
> > >>  		page = rmqueue_pcplist(preferred_zone, zone, order,
> > >>  				gfp_flags, migratetype);
> > >>  		goto out;
> > >>
> >
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer