From: Mel Gorman <mgorman@techsingularity.net>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>,
Andreas Gruenbacher <agruenba@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Andy Lutomirski <luto@kernel.org>,
LKML <linux-kernel@vger.kernel.org>,
Bob Peterson <rpeterso@redhat.com>,
Steven Whitehouse <swhiteho@redhat.com>,
linux-mm <linux-mm@kvack.org>
Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit
Date: Wed, 26 Oct 2016 23:03:39 +0100 [thread overview]
Message-ID: <20161026220339.GE2699@techsingularity.net> (raw)
In-Reply-To: <CA+55aFy21NqcYTeLVVz4x4kfQ7A+o4HEv7srone6ppKAjCwn7g@mail.gmail.com>
On Wed, Oct 26, 2016 at 02:26:57PM -0700, Linus Torvalds wrote:
> On Wed, Oct 26, 2016 at 1:31 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > IO wait activity is not all that matters. We hit the lock/unlock paths
> > during a lot of operations like reclaim.
>
> I doubt we do.
>
> Yes, we hit the lock/unlock itself, but do we hit the *contention*?
>
> The current code is nasty, and always ends up touching the wait-queue
> regardless of whether it needs to or not, but we have a fix for that.
>
To be clear, are you referring to PeterZ's patch that avoids the lookup? If
so, I see your point.
> With that fixed, do we actually get contention on a per-page basis?
Reclaim would have to running parallel to migrations, faults, clearing
write-protect etc. I can't think of a situation where a normal workload
would hit it regularly and/or for long durations.
> Because without contention, we'd never actually look up the wait-queue
> at all.
>
> I suspect that without IO, it's really really hard to actually get
> that contention, because things like reclaim end up looking at the LRU
> queue etc wioth their own locking, so it should look at various
> individual pages one at a time, not have multiple queues look at the
> same page.
>
Except many direct reclaimers on small LRUs while a system is thrashing --
not a case that really matters, you've already lost.
> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index 7f2ae99e5daf..0f088f3a2fed 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -440,33 +440,7 @@ struct zone {
> >> + int initialized;
> >>
> >> /* Write-intensive fields used from the page allocator */
> >> ZONE_PADDING(_pad1_)
> >
> > zone_is_initialized is mostly the domain of hotplug. A potential cleanup
> > is to use a page flag and shrink the size of zone slightly. Nothing to
> > panic over.
>
> I really did that to make it very obvious that there was no semantic
> change. I just set the "initialized" flag in the same place where it
> used to initialize the wait_table, so that this:
>
> >> static inline bool zone_is_initialized(struct zone *zone)
> >> {
> >> - return !!zone->wait_table;
> >> + return zone->initialized;
> >> }
>
> ends up being obviously equivalent.
>
No problem with that.
> >> +#define WAIT_TABLE_BITS 8
> >> +#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
> >> +static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;
> >> +
> >> +wait_queue_head_t *bit_waitqueue(void *word, int bit)
> >> +{
> >> + const int shift = BITS_PER_LONG == 32 ? 5 : 6;
> >> + unsigned long val = (unsigned long)word << shift | bit;
> >> +
> >> + return bit_wait_table + hash_long(val, WAIT_TABLE_BITS);
> >> +}
> >> +EXPORT_SYMBOL(bit_waitqueue);
> >> +
> >
> > Minor nit that it's unfortunate this moved to the scheduler core. It
> > wouldn't have been a complete disaster to add a page_waitqueue_init() or
> > something similar after sched_init.
>
> I considered that, but decided that "minimal patch" was better. Plus,
> with that bit_waitqueue() actually also being used for the page
> locking queues (which act _kind of_ but not quite, like a bitlock),
> the bit_wait_table is actually more core than just the bit-wait code.
>
> In fact, I considered just renaming it to "hashed_wait_queue", because
> that's effectively how we use it now, rather than being particularly
> specific to the bit-waiting. But again, that would have made the patch
> bigger, which I wanted to avoid since this is a post-rc2 thing due to
> the gfs2 breakage.
>
No objection. Shuffling it around does not make it obviously better in
any way.
In the meantime, a machine freed up. FWIW, it survived booting on a 2-socket
and about 20 minutes of bashing on reclaim paths from multiple processes
to beat on lock/unlock. I didn't do a performance comparison or gather
profile data but I wouldn't expect anything interesting from profiles
other than some cycles saved.
--
Mel Gorman
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
WARNING: multiple messages have this Message-ID (diff)
From: Mel Gorman <mgorman@techsingularity.net>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>,
Andreas Gruenbacher <agruenba@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Andy Lutomirski <luto@kernel.org>,
LKML <linux-kernel@vger.kernel.org>,
Bob Peterson <rpeterso@redhat.com>,
Steven Whitehouse <swhiteho@redhat.com>,
linux-mm <linux-mm@kvack.org>
Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit
Date: Wed, 26 Oct 2016 23:03:39 +0100 [thread overview]
Message-ID: <20161026220339.GE2699@techsingularity.net> (raw)
In-Reply-To: <CA+55aFy21NqcYTeLVVz4x4kfQ7A+o4HEv7srone6ppKAjCwn7g@mail.gmail.com>
On Wed, Oct 26, 2016 at 02:26:57PM -0700, Linus Torvalds wrote:
> On Wed, Oct 26, 2016 at 1:31 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > IO wait activity is not all that matters. We hit the lock/unlock paths
> > during a lot of operations like reclaim.
>
> I doubt we do.
>
> Yes, we hit the lock/unlock itself, but do we hit the *contention*?
>
> The current code is nasty, and always ends up touching the wait-queue
> regardless of whether it needs to or not, but we have a fix for that.
>
To be clear, are you referring to PeterZ's patch that avoids the lookup? If
so, I see your point.
> With that fixed, do we actually get contention on a per-page basis?
Reclaim would have to running parallel to migrations, faults, clearing
write-protect etc. I can't think of a situation where a normal workload
would hit it regularly and/or for long durations.
> Because without contention, we'd never actually look up the wait-queue
> at all.
>
> I suspect that without IO, it's really really hard to actually get
> that contention, because things like reclaim end up looking at the LRU
> queue etc wioth their own locking, so it should look at various
> individual pages one at a time, not have multiple queues look at the
> same page.
>
Except many direct reclaimers on small LRUs while a system is thrashing --
not a case that really matters, you've already lost.
> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index 7f2ae99e5daf..0f088f3a2fed 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -440,33 +440,7 @@ struct zone {
> >> + int initialized;
> >>
> >> /* Write-intensive fields used from the page allocator */
> >> ZONE_PADDING(_pad1_)
> >
> > zone_is_initialized is mostly the domain of hotplug. A potential cleanup
> > is to use a page flag and shrink the size of zone slightly. Nothing to
> > panic over.
>
> I really did that to make it very obvious that there was no semantic
> change. I just set the "initialized" flag in the same place where it
> used to initialize the wait_table, so that this:
>
> >> static inline bool zone_is_initialized(struct zone *zone)
> >> {
> >> - return !!zone->wait_table;
> >> + return zone->initialized;
> >> }
>
> ends up being obviously equivalent.
>
No problem with that.
> >> +#define WAIT_TABLE_BITS 8
> >> +#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
> >> +static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;
> >> +
> >> +wait_queue_head_t *bit_waitqueue(void *word, int bit)
> >> +{
> >> + const int shift = BITS_PER_LONG == 32 ? 5 : 6;
> >> + unsigned long val = (unsigned long)word << shift | bit;
> >> +
> >> + return bit_wait_table + hash_long(val, WAIT_TABLE_BITS);
> >> +}
> >> +EXPORT_SYMBOL(bit_waitqueue);
> >> +
> >
> > Minor nit that it's unfortunate this moved to the scheduler core. It
> > wouldn't have been a complete disaster to add a page_waitqueue_init() or
> > something similar after sched_init.
>
> I considered that, but decided that "minimal patch" was better. Plus,
> with that bit_waitqueue() actually also being used for the page
> locking queues (which act _kind of_ but not quite, like a bitlock),
> the bit_wait_table is actually more core than just the bit-wait code.
>
> In fact, I considered just renaming it to "hashed_wait_queue", because
> that's effectively how we use it now, rather than being particularly
> specific to the bit-waiting. But again, that would have made the patch
> bigger, which I wanted to avoid since this is a post-rc2 thing due to
> the gfs2 breakage.
>
No objection. Shuffling it around does not make it obviously better in
any way.
In the meantime, a machine freed up. FWIW, it survived booting on a 2-socket
and about 20 minutes of bashing on reclaim paths from multiple processes
to beat on lock/unlock. I didn't do a performance comparison or gather
profile data but I wouldn't expect anything interesting from profiles
other than some cycles saved.
--
Mel Gorman
SUSE Labs
next prev parent reply other threads:[~2016-10-26 22:03 UTC|newest]
Thread overview: 54+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-10-26 12:51 CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit Andreas Gruenbacher
2016-10-26 15:51 ` Andy Lutomirski
2016-10-26 16:32 ` Linus Torvalds
2016-10-26 16:32 ` Linus Torvalds
2016-10-26 17:15 ` Linus Torvalds
2016-10-26 17:58 ` Linus Torvalds
2016-10-26 17:58 ` Linus Torvalds
2016-10-26 18:04 ` Bob Peterson
2016-10-26 18:04 ` Bob Peterson
2016-10-26 18:10 ` Linus Torvalds
2016-10-26 18:10 ` Linus Torvalds
2016-10-26 19:11 ` Bob Peterson
2016-10-26 19:11 ` Bob Peterson
2016-10-26 21:01 ` Bob Peterson
2016-10-26 21:01 ` Bob Peterson
2016-10-26 21:30 ` Linus Torvalds
2016-10-26 21:30 ` Linus Torvalds
2016-10-26 22:45 ` Borislav Petkov
2016-10-26 22:45 ` Borislav Petkov
2016-10-26 23:13 ` Borislav Petkov
2016-10-26 23:13 ` Borislav Petkov
2016-10-27 0:37 ` Bob Peterson
2016-10-27 12:36 ` Borislav Petkov
2016-10-27 12:36 ` Borislav Petkov
2016-10-27 18:51 ` Bob Peterson
2016-10-27 18:51 ` Bob Peterson
2016-10-27 19:19 ` Borislav Petkov
2016-10-27 19:19 ` Borislav Petkov
2016-10-27 21:03 ` Bob Peterson
2016-10-27 21:03 ` Bob Peterson
2016-10-27 21:19 ` Borislav Petkov
2016-10-27 21:19 ` Borislav Petkov
2016-10-28 8:37 ` [tip:x86/urgent] x86/microcode/AMD: Fix more fallout from CONFIG_RANDOMIZE_MEMORY=y tip-bot for Borislav Petkov
2016-10-28 8:37 ` tip-bot for Borislav Petkov
2016-10-26 20:31 ` CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit Mel Gorman
2016-10-26 20:31 ` Mel Gorman
2016-10-26 21:26 ` Linus Torvalds
2016-10-26 21:26 ` Linus Torvalds
2016-10-26 22:03 ` Mel Gorman [this message]
2016-10-26 22:03 ` Mel Gorman
2016-10-26 22:09 ` Linus Torvalds
2016-10-26 22:09 ` Linus Torvalds
2016-10-26 23:07 ` Mel Gorman
2016-10-26 23:07 ` Mel Gorman
2016-10-27 8:08 ` Peter Zijlstra
2016-10-27 8:08 ` Peter Zijlstra
2016-10-27 9:07 ` Mel Gorman
2016-10-27 9:07 ` Mel Gorman
2016-10-27 9:44 ` Peter Zijlstra
2016-10-27 9:44 ` Peter Zijlstra
2016-10-27 9:59 ` Mel Gorman
2016-10-27 9:59 ` Mel Gorman
2016-10-27 11:56 ` Nicholas Piggin
2016-10-27 11:56 ` Nicholas Piggin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20161026220339.GE2699@techsingularity.net \
--to=mgorman@techsingularity.net \
--cc=agruenba@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=luto@amacapital.net \
--cc=luto@kernel.org \
--cc=peterz@infradead.org \
--cc=rpeterso@redhat.com \
--cc=swhiteho@redhat.com \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.