* Truncate regression due to commit 69b6c1319b6
@ 2019-02-26 16:56 Jan Kara
  2019-02-26 17:27 ` Matthew Wilcox
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2019-02-26 16:56 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, mgorman

[-- Attachment #1: Type: text/plain, Size: 1322 bytes --]

Hello Matthew,

after some peripeties, I was able to bisect down to a regression in
truncate performance caused by commit 69b6c1319b6 "mm: Convert truncate to
XArray". The initial benchmark that indicated the regression was a bonnie++
file delete test (some 4-5% regression). It is however much easier (and
faster!) to see the regression with the attached benchmark. With it I can
see about a 10% regression on my test machine:

Average of 10 runs, time in us.

COMMIT		AVG		STDDEV
a97e7904c0	1431256.500000	1489.361759
69b6c1319b	1566944.000000	2252.692877

The benchmark has to be run so that the total file size is about 2x the
memory size (as the regression seems to be in truncating existing
workingset entries). So on my test machine with 32 GB of RAM I run it like:

# This prepares the environment
mkfs.xfs -f /dev/sdb1
mount -t xfs /dev/sdb1 /mnt
./trunc-bench /mnt/file 64 1
# This does the truncation
./trunc-bench /mnt/file 64 2

I've gathered also perf profiles but from the first look they don't show
anything surprising besides xas_load() and xas_store() taking up more time
than original counterparts did. I'll try to dig more into this but any idea
is appreciated.

My test machine is an 8 CPU Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: trunc-bench.c --]
[-- Type: text/x-c, Size: 1478 bytes --]

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/time.h>

#define COUNT 1024
#define BUFSIZE (1024*1024)

char *buf;

void read_file(int fd)
{
	int i;

	if (ftruncate(fd, BUFSIZE*COUNT) < 0) {
		perror("truncate");
		exit(1);
	}
	for (i = 0; i < COUNT; i++) {
		if (read(fd, buf, BUFSIZE) != BUFSIZE) {
			perror("read");
			exit(1);
		}
	}
}

int main(int argc, char **argv)
{
	int fd;
	int fcount, i, stages;
	char fname[128];
	struct timeval start, end;

	if (argc != 4) {
		fprintf(stderr, "Usage: trunc-bench <file> <file-count> <stages>\n");
		return 1;
	}
	fcount = strtol(argv[2], NULL, 0);
	stages = strtol(argv[3], NULL, 0);
	buf = malloc(BUFSIZE);
	if (!buf) {
		fprintf(stderr, "Cannot allocate write buffer.\n");
		return 1;
	}
	memset(buf, 'a', BUFSIZE);

	if (stages & 1) {
		fprintf(stderr, "Creating files...\n");
		for (i = 0; i < fcount; i++ ) {
			sprintf(fname, "%s%d", argv[1], i);
			fd = open(fname, O_RDWR | O_TRUNC | O_CREAT, 0644);
			if (fd < 0) {
				perror("open");
				return 1;
			}
			read_file(fd);
			close(fd);
		}
	}

	if (stages & 2) {
		fprintf(stderr, "Removing files...\n");
		gettimeofday(&start, NULL);
		for (i = 0; i < fcount; i++ ) {
			sprintf(fname, "%s%d", argv[1], i);
			truncate(fname, 0);
		}
		gettimeofday(&end, NULL);
		printf("%lu us\n", (end.tv_sec - start.tv_sec) * 1000000UL +
				   (end.tv_usec - start.tv_usec));
	}
	fprintf(stderr, "Done.\n");

	return 0;
}

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Truncate regression due to commit 69b6c1319b6 2019-02-26 16:56 Truncate regression due to commit 69b6c1319b6 Jan Kara @ 2019-02-26 17:27 ` Matthew Wilcox 2019-02-27 6:03 ` Nicholas Piggin 2019-02-27 11:27 ` Jan Kara 0 siblings, 2 replies; 13+ messages in thread From: Matthew Wilcox @ 2019-02-26 17:27 UTC (permalink / raw) To: Jan Kara; +Cc: linux-mm, mgorman On Tue, Feb 26, 2019 at 05:56:28PM +0100, Jan Kara wrote: > after some peripeties, I was able to bisect down to a regression in > truncate performance caused by commit 69b6c1319b6 "mm: Convert truncate to > XArray". [...] > I've gathered also perf profiles but from the first look they don't show > anything surprising besides xas_load() and xas_store() taking up more time > than original counterparts did. I'll try to dig more into this but any idea > is appreciated. Well, that's a short and sweet little commit. Stripped of comment changes, it's just: - struct radix_tree_node *node; - void **slot; + XA_STATE(xas, &mapping->i_pages, index); - if (!__radix_tree_lookup(&mapping->i_pages, index, &node, &slot)) + xas_set_update(&xas, workingset_update_node); + if (xas_load(&xas) != entry) return; - if (*slot != entry) - return; - __radix_tree_replace(&mapping->i_pages, node, slot, NULL, - workingset_update_node); + xas_store(&xas, NULL); I have a few reactions to this: 1. I'm concerned that the XArray may generally be slower than the radix tree was. I didn't notice that in my testing, but maybe I didn't do the right tests. 2. The setup overhead of the XA_STATE might be a problem. If so, we can do some batching in order to improve things. I suspect your test is calling __clear_shadow_entry through the truncate_exceptional_pvec_entries() path, which is already a batch. Maybe something like patch [1] at the end of this mail. 3. Perhaps we can actually get rid of truncate_exceptional_pvec_entries(). It seems a little daft for page_cache_delete_batch() to skip value entries, only for truncate_exceptional_pvec_entries() to erase them in a second pass. Truncation is truncation, and perhaps we can handle all of it in one place? 4. Now that calling through a function pointer is expensive, thanks to Spectre/Meltdown/..., I've been considering removing the general-purpose update function, which is only used by the page cache. Instead move parts of workingset.c into the XArray code and use a bit in the xa_flags to indicate that the node should be tracked on an LRU if it contains only value entries. [1] diff --git a/mm/truncate.c b/mm/truncate.c index 798e7ccfb030..9384f48eff2a 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -31,23 +31,23 @@ * lock. 
*/ static inline void __clear_shadow_entry(struct address_space *mapping, - pgoff_t index, void *entry) + struct xa_state *xas, void *entry) { - XA_STATE(xas, &mapping->i_pages, index); - - xas_set_update(&xas, workingset_update_node); - if (xas_load(&xas) != entry) + if (xas_load(xas) != entry) return; - xas_store(&xas, NULL); + xas_store(xas, NULL); mapping->nrexceptional--; } static void clear_shadow_entry(struct address_space *mapping, pgoff_t index, void *entry) { - xa_lock_irq(&mapping->i_pages); - __clear_shadow_entry(mapping, index, entry); - xa_unlock_irq(&mapping->i_pages); + XA_STATE(xas, &mapping->i_pages, index); + xas_set_update(&xas, workingset_update_node); + + xas_lock_irq(&xas); + __clear_shadow_entry(mapping, &xas, entry); + xas_unlock_irq(&xas); } /* @@ -59,9 +59,12 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping, struct pagevec *pvec, pgoff_t *indices, pgoff_t end) { + XA_STATE(xas, &mapping->i_pages, 0); int i, j; bool dax, lock; + xas_set_update(&xas, workingset_update_node); + /* Handled by shmem itself */ if (shmem_mapping(mapping)) return; @@ -95,7 +98,8 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping, continue; } - __clear_shadow_entry(mapping, index, page); + xas_set(&xas, index); + __clear_shadow_entry(mapping, &xas, page); } if (lock) ^ permalink raw reply related [flat|nested] 13+ messages in thread
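The batching idea in patch [1], stated on its own: build the iteration state once per batch and reposition it for each index with xas_set(), instead of constructing a fresh XA_STATE for every entry. A minimal sketch of that pattern (a fragment only, reusing the names from the mm/truncate.c hunks above; the DAX, locking and pagevec-release details of the real function are omitted):

	XA_STATE(xas, &mapping->i_pages, 0);
	int i;

	xas_set_update(&xas, workingset_update_node);
	for (i = 0; i < pagevec_count(pvec); i++) {
		xas_set(&xas, indices[i]);	/* reposition, don't rebuild */
		/* ... per-entry work against &xas ... */
	}
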
* Re: Truncate regression due to commit 69b6c1319b6 2019-02-26 17:27 ` Matthew Wilcox @ 2019-02-27 6:03 ` Nicholas Piggin 2019-02-27 12:35 ` Matthew Wilcox 2019-02-27 11:27 ` Jan Kara 1 sibling, 1 reply; 13+ messages in thread From: Nicholas Piggin @ 2019-02-27 6:03 UTC (permalink / raw) To: Jan Kara, Matthew Wilcox; +Cc: linux-mm, mgorman Matthew Wilcox's on February 27, 2019 3:27 am: > On Tue, Feb 26, 2019 at 05:56:28PM +0100, Jan Kara wrote: >> after some peripeties, I was able to bisect down to a regression in >> truncate performance caused by commit 69b6c1319b6 "mm: Convert truncate to >> XArray". > > [...] > >> I've gathered also perf profiles but from the first look they don't show >> anything surprising besides xas_load() and xas_store() taking up more time >> than original counterparts did. I'll try to dig more into this but any idea >> is appreciated. > > Well, that's a short and sweet little commit. Stripped of comment > changes, it's just: > > - struct radix_tree_node *node; > - void **slot; > + XA_STATE(xas, &mapping->i_pages, index); > > - if (!__radix_tree_lookup(&mapping->i_pages, index, &node, &slot)) > + xas_set_update(&xas, workingset_update_node); > + if (xas_load(&xas) != entry) > return; > - if (*slot != entry) > - return; > - __radix_tree_replace(&mapping->i_pages, node, slot, NULL, > - workingset_update_node); > + xas_store(&xas, NULL); > > I have a few reactions to this: > > 1. I'm concerned that the XArray may generally be slower than the radix > tree was. I didn't notice that in my testing, but maybe I didn't do > the right tests. > > 2. The setup overhead of the XA_STATE might be a problem. > If so, we can do some batching in order to improve things. > I suspect your test is calling __clear_shadow_entry through the > truncate_exceptional_pvec_entries() path, which is already a batch. > Maybe something like patch [1] at the end of this mail. One nasty thing about the XA_STATE stack object as opposed to just passing the parameters (in the same order) down to children is that you get the same memory accessed nearby, but in different ways (different base register, offset, addressing mode etc). Which can reduce effectiveness of memory disambiguation prediction, at least in cold predictor case. I've seen (on some POWER CPUs at least) flushes due to aliasing access in some of these xarray call chains, although no idea if that actually makes a noticable difference in microbenchmark like this. But it's not the greatest pattern to use for passing to low level performance critical functions :( Ideally the compiler could just do a big LTO pass right at the end and unwind it all back into registers and fix everything, but that will never happen. Thanks, Nick ^ permalink raw reply [flat|nested] 13+ messages in thread
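A minimal userspace sketch of the aliasing pattern Nick describes (the names are invented for illustration; this is not kernel code): the same values reach the callee either through a stack object, which means stores through one pointer followed by loads through a different base register that the CPU's memory-disambiguation logic has to sort out, or directly as scalar arguments, which avoids the store/load round trip entirely.

/* Illustrative only -- invented names, not kernel code. */
struct walk_state {
	unsigned long	index;
	unsigned char	shift;
	unsigned char	offset;
};

/* State passed via a stack object: the caller stores the fields, the
 * callee reloads the same bytes through another pointer. */
int walk_by_state(const struct walk_state *ws)
{
	return (int)((ws->index >> ws->shift) + ws->offset);
}

/* The same values passed as scalars stay in argument registers. */
int walk_by_args(unsigned long index, unsigned char shift,
		 unsigned char offset)
{
	return (int)((index >> shift) + offset);
}
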
* Re: Truncate regression due to commit 69b6c1319b6 2019-02-27 6:03 ` Nicholas Piggin @ 2019-02-27 12:35 ` Matthew Wilcox 2019-02-28 9:31 ` Nicholas Piggin 0 siblings, 1 reply; 13+ messages in thread From: Matthew Wilcox @ 2019-02-27 12:35 UTC (permalink / raw) To: Nicholas Piggin; +Cc: Jan Kara, linux-mm, mgorman On Wed, Feb 27, 2019 at 04:03:25PM +1000, Nicholas Piggin wrote: > Matthew Wilcox's on February 27, 2019 3:27 am: > > 2. The setup overhead of the XA_STATE might be a problem. > > If so, we can do some batching in order to improve things. > > I suspect your test is calling __clear_shadow_entry through the > > truncate_exceptional_pvec_entries() path, which is already a batch. > > Maybe something like patch [1] at the end of this mail. > > One nasty thing about the XA_STATE stack object as opposed to just > passing the parameters (in the same order) down to children is that > you get the same memory accessed nearby, but in different ways > (different base register, offset, addressing mode etc). Which can > reduce effectiveness of memory disambiguation prediction, at least > in cold predictor case. That is nasty. At the C level, it's a really attractive pattern. Shame it doesn't work out so well on hardware. I wouldn't mind turning shift/sibs/offset into a manually-extracted unsigned long if that'll help with the addressing mode mispredictions? > I've seen (on some POWER CPUs at least) flushes due to aliasing > access in some of these xarray call chains, although no idea if > that actually makes a noticable difference in microbenchmark like > this. > > But it's not the greatest pattern to use for passing to low level > performance critical functions :( Ideally the compiler could just > do a big LTO pass right at the end and unwind it all back into > registers and fix everything, but that will never happen. I wonder if we could get the compiler people to introduce a structure attribute telling the compiler to pass this whole thing back-and-forth in registers ... 6 registers is a lot to ask the compiler to reserve though. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Truncate regression due to commit 69b6c1319b6 2019-02-27 12:35 ` Matthew Wilcox @ 2019-02-28 9:31 ` Nicholas Piggin 0 siblings, 0 replies; 13+ messages in thread From: Nicholas Piggin @ 2019-02-28 9:31 UTC (permalink / raw) To: Matthew Wilcox; +Cc: Jan Kara, linux-mm, mgorman Matthew Wilcox's on February 27, 2019 10:35 pm: > On Wed, Feb 27, 2019 at 04:03:25PM +1000, Nicholas Piggin wrote: >> Matthew Wilcox's on February 27, 2019 3:27 am: >> > 2. The setup overhead of the XA_STATE might be a problem. >> > If so, we can do some batching in order to improve things. >> > I suspect your test is calling __clear_shadow_entry through the >> > truncate_exceptional_pvec_entries() path, which is already a batch. >> > Maybe something like patch [1] at the end of this mail. >> >> One nasty thing about the XA_STATE stack object as opposed to just >> passing the parameters (in the same order) down to children is that >> you get the same memory accessed nearby, but in different ways >> (different base register, offset, addressing mode etc). Which can >> reduce effectiveness of memory disambiguation prediction, at least >> in cold predictor case. > > That is nasty. At the C level, it's a really attractive pattern. > Shame it doesn't work out so well on hardware. I wouldn't mind > turning shift/sibs/offset into a manually-extracted unsigned long > if that'll help with the addressing mode mispredictions? If you can get it to pass in registers by value. Some shifts or masks should be ~zero cost by comparison. > >> I've seen (on some POWER CPUs at least) flushes due to aliasing >> access in some of these xarray call chains, although no idea if >> that actually makes a noticable difference in microbenchmark like >> this. >> >> But it's not the greatest pattern to use for passing to low level >> performance critical functions :( Ideally the compiler could just >> do a big LTO pass right at the end and unwind it all back into >> registers and fix everything, but that will never happen. > > I wonder if we could get the compiler people to introduce a structure > attribute telling the compiler to pass this whole thing back-and-forth in > registers ... 6 registers is a lot to ask the compiler to reserve though. > Yeah I don't have a good idea, I think it may be a fundamentally hard problem for hardware, and it's very difficult for compiler. But yeah some special option for non-standard calling convention might be interesting. Thanks, Nick ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Truncate regression due to commit 69b6c1319b6 2019-02-26 17:27 ` Matthew Wilcox 2019-02-27 6:03 ` Nicholas Piggin @ 2019-02-27 11:27 ` Jan Kara 2019-02-27 12:24 ` Matthew Wilcox 1 sibling, 1 reply; 13+ messages in thread From: Jan Kara @ 2019-02-27 11:27 UTC (permalink / raw) To: Matthew Wilcox; +Cc: Jan Kara, linux-mm, mgorman On Tue 26-02-19 09:27:44, Matthew Wilcox wrote: > On Tue, Feb 26, 2019 at 05:56:28PM +0100, Jan Kara wrote: > > after some peripeties, I was able to bisect down to a regression in > > truncate performance caused by commit 69b6c1319b6 "mm: Convert truncate to > > XArray". > > [...] > > > I've gathered also perf profiles but from the first look they don't show > > anything surprising besides xas_load() and xas_store() taking up more time > > than original counterparts did. I'll try to dig more into this but any idea > > is appreciated. > > Well, that's a short and sweet little commit. Stripped of comment > changes, it's just: > > - struct radix_tree_node *node; > - void **slot; > + XA_STATE(xas, &mapping->i_pages, index); > > - if (!__radix_tree_lookup(&mapping->i_pages, index, &node, &slot)) > + xas_set_update(&xas, workingset_update_node); > + if (xas_load(&xas) != entry) > return; > - if (*slot != entry) > - return; > - __radix_tree_replace(&mapping->i_pages, node, slot, NULL, > - workingset_update_node); > + xas_store(&xas, NULL); Yes, the code change is small. Thanks to you splitting the changes, the regression is easier to analyze so thanks for that :) > I have a few reactions to this: > > 1. I'm concerned that the XArray may generally be slower than the radix > tree was. I didn't notice that in my testing, but maybe I didn't do > the right tests. So one difference I've noticed when staring into the code and annotated perf traces is that xas_store() will call xas_init_marks() when stored entry is 0. __radix_tree_replace() didn't do this. And the cache misses we take from checking tags do add up. After hacking the code in xas_store() so that __clear_shadow_entry() does not touch tags, I get around half of the regression back. For now I didn't think how to do this so that the API remains reasonably clean. So now we are at: COMMIT AVG STDDEV a97e7904c0 1431256.500000 1489.361759 69b6c1319b 1566944.000000 2252.692877 notaginit 1483740.700000 7680.583455 > 2. The setup overhead of the XA_STATE might be a problem. > If so, we can do some batching in order to improve things. > I suspect your test is calling __clear_shadow_entry through the > truncate_exceptional_pvec_entries() path, which is already a batch. > Maybe something like patch [1] at the end of this mail. So this apparently contributes as well but not too much. With your patch applied on top of 'notaginit' kernel above I've got to: batched-xas 1473900.300000 950.439377 > 3. Perhaps we can actually get rid of truncate_exceptional_pvec_entries(). > It seems a little daft for page_cache_delete_batch() to skip value > entries, only for truncate_exceptional_pvec_entries() to erase them in > a second pass. Truncation is truncation, and perhaps we can handle all > of it in one place? > > 4. Now that calling through a function pointer is expensive, thanks to > Spectre/Meltdown/..., I've been considering removing the general-purpose > update function, which is only used by the page cache. Instead move parts > of workingset.c into the XArray code and use a bit in the xa_flags to > indicate that the node should be tracked on an LRU if it contains only > value entries. 
I agree these two are good ideas to improve the speed. But old radix tree
code has these issues as well so they are not the reason of this
regression. So I'd like to track down where Xarray code is slower first.

I'm going to dig more into annotated profiles...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply [flat|nested] 13+ messages in thread
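For readers following the cache-miss argument above: the tag (mark) bitmaps live in the same node as the slots, but typically on different cache lines, so a store that also has to inspect or clear tags touches memory that a plain slot overwrite never does. A deliberately simplified picture of the layout (field names and sizes here are illustrative; see struct xa_node in include/linux/xarray.h for the real thing):

/* Deliberately simplified -- not the real struct xa_node. */
#define FAKE_CHUNK	64

struct fake_xa_node {
	unsigned char		shift;
	unsigned char		offset;
	unsigned char		count;			/* touched by any store */
	unsigned char		nr_values;
	struct fake_xa_node	*parent;
	void			*slots[FAKE_CHUNK];	/* the entry being cleared */
	unsigned long		marks[3][FAKE_CHUNK / 64];	/* only touched when
								 * tags are checked
								 * or cleared */
};
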
* Re: Truncate regression due to commit 69b6c1319b6 2019-02-27 11:27 ` Jan Kara @ 2019-02-27 12:24 ` Matthew Wilcox 2019-02-27 16:55 ` Jan Kara 0 siblings, 1 reply; 13+ messages in thread From: Matthew Wilcox @ 2019-02-27 12:24 UTC (permalink / raw) To: Jan Kara; +Cc: linux-mm, mgorman On Wed, Feb 27, 2019 at 12:27:21PM +0100, Jan Kara wrote: > On Tue 26-02-19 09:27:44, Matthew Wilcox wrote: > > On Tue, Feb 26, 2019 at 05:56:28PM +0100, Jan Kara wrote: > > > after some peripeties, I was able to bisect down to a regression in > > > truncate performance caused by commit 69b6c1319b6 "mm: Convert truncate to > > > XArray". > > > > [...] > > > > > I've gathered also perf profiles but from the first look they don't show > > > anything surprising besides xas_load() and xas_store() taking up more time > > > than original counterparts did. I'll try to dig more into this but any idea > > > is appreciated. > > > > Well, that's a short and sweet little commit. Stripped of comment > > changes, it's just: > > > > - struct radix_tree_node *node; > > - void **slot; > > + XA_STATE(xas, &mapping->i_pages, index); > > > > - if (!__radix_tree_lookup(&mapping->i_pages, index, &node, &slot)) > > + xas_set_update(&xas, workingset_update_node); > > + if (xas_load(&xas) != entry) > > return; > > - if (*slot != entry) > > - return; > > - __radix_tree_replace(&mapping->i_pages, node, slot, NULL, > > - workingset_update_node); > > + xas_store(&xas, NULL); > > Yes, the code change is small. Thanks to you splitting the changes, the > regression is easier to analyze so thanks for that :) That was why I split the patches so small. I'm kind of glad it paid off ... not glad to have caused a performance regression! > > I have a few reactions to this: > > > > 1. I'm concerned that the XArray may generally be slower than the radix > > tree was. I didn't notice that in my testing, but maybe I didn't do > > the right tests. > > So one difference I've noticed when staring into the code and annotated > perf traces is that xas_store() will call xas_init_marks() when stored > entry is 0. __radix_tree_replace() didn't do this. And the cache misses we > take from checking tags do add up. After hacking the code in xas_store() so > that __clear_shadow_entry() does not touch tags, I get around half of the > regression back. For now I didn't think how to do this so that the API > remains reasonably clean. So now we are at: > > COMMIT AVG STDDEV > a97e7904c0 1431256.500000 1489.361759 > 69b6c1319b 1566944.000000 2252.692877 > notaginit 1483740.700000 7680.583455 Well, that seems worth doing. For the page cache case, we know that shadow entries have no tags set (at least right now), so it seems reasonable to move the xas_init_marks() from xas_store() to its various callers. > > 2. The setup overhead of the XA_STATE might be a problem. > > If so, we can do some batching in order to improve things. > > I suspect your test is calling __clear_shadow_entry through the > > truncate_exceptional_pvec_entries() path, which is already a batch. > > Maybe something like patch [1] at the end of this mail. > > So this apparently contributes as well but not too much. With your patch > applied on top of 'notaginit' kernel above I've got to: > > batched-xas 1473900.300000 950.439377 Fascinating that it reduces the stddev so much. We can probably take this further (getting into the realm of #3 below) -- the call to xas_set() will restart the walk from the top of the tree each time. 
Clearly this usecase (many thousands of shadow entries) is going to construct a very deep tree, and we're effectively doing a linear scan over the bottom of the tree, so starting from the top each time is O(n.log n) instead of O(n). I think you said the file was 64GB, which is 16 million 4k entries, or 24 bits of tree index. That's 4 levels deep so it'll add up. > > 3. Perhaps we can actually get rid of truncate_exceptional_pvec_entries(). > > It seems a little daft for page_cache_delete_batch() to skip value > > entries, only for truncate_exceptional_pvec_entries() to erase them in > > a second pass. Truncation is truncation, and perhaps we can handle all > > of it in one place? > > > > 4. Now that calling through a function pointer is expensive, thanks to > > Spectre/Meltdown/..., I've been considering removing the general-purpose > > update function, which is only used by the page cache. Instead move parts > > of workingset.c into the XArray code and use a bit in the xa_flags to > > indicate that the node should be tracked on an LRU if it contains only > > value entries. > > I agree these two are good ideas to improve the speed. But old radix tree > code has these issues as well so they are not the reason of this > regression. So I'd like to track down where Xarray code is slower first. > > I'm going to dig more into annotated profiles... Thanks! I'll work on a patch to remove the xas_init_marks() from xas_store(). ^ permalink raw reply [flat|nested] 13+ messages in thread
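The depth arithmetic above is easy to double-check with a few lines of userspace C (a sketch only; 6 index bits per level corresponds to the usual 64-slot nodes, i.e. XA_CHUNK_SHIFT on a typical 64-bit configuration, and 4k pages are assumed):

#include <stdio.h>

/* Rough tree-depth estimate for a page-cache index space. */
static unsigned int tree_depth(unsigned long long file_bytes)
{
	unsigned long long entries = file_bytes >> 12;	/* 4k pages */
	unsigned int bits = 0;

	while ((1ULL << bits) < entries)
		bits++;
	return (bits + 5) / 6;		/* round up to whole 6-bit levels */
}

int main(void)
{
	printf("64GB file: %u levels\n", tree_depth(64ULL << 30));
	printf(" 1GB file: %u levels\n", tree_depth(1ULL << 30));
	return 0;
}
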
* Re: Truncate regression due to commit 69b6c1319b6 2019-02-27 12:24 ` Matthew Wilcox @ 2019-02-27 16:55 ` Jan Kara 2019-02-28 22:53 ` Matthew Wilcox 0 siblings, 1 reply; 13+ messages in thread From: Jan Kara @ 2019-02-27 16:55 UTC (permalink / raw) To: Matthew Wilcox; +Cc: Jan Kara, linux-mm, mgorman On Wed 27-02-19 04:24:51, Matthew Wilcox wrote: > On Wed, Feb 27, 2019 at 12:27:21PM +0100, Jan Kara wrote: > > On Tue 26-02-19 09:27:44, Matthew Wilcox wrote: > > > On Tue, Feb 26, 2019 at 05:56:28PM +0100, Jan Kara wrote: > > > > after some peripeties, I was able to bisect down to a regression in > > > > truncate performance caused by commit 69b6c1319b6 "mm: Convert truncate to > > > > XArray". > > > > > > [...] > > > > > > > I've gathered also perf profiles but from the first look they don't show > > > > anything surprising besides xas_load() and xas_store() taking up more time > > > > than original counterparts did. I'll try to dig more into this but any idea > > > > is appreciated. > > > > > > Well, that's a short and sweet little commit. Stripped of comment > > > changes, it's just: > > > > > > - struct radix_tree_node *node; > > > - void **slot; > > > + XA_STATE(xas, &mapping->i_pages, index); > > > > > > - if (!__radix_tree_lookup(&mapping->i_pages, index, &node, &slot)) > > > + xas_set_update(&xas, workingset_update_node); > > > + if (xas_load(&xas) != entry) > > > return; > > > - if (*slot != entry) > > > - return; > > > - __radix_tree_replace(&mapping->i_pages, node, slot, NULL, > > > - workingset_update_node); > > > + xas_store(&xas, NULL); > > > > Yes, the code change is small. Thanks to you splitting the changes, the > > regression is easier to analyze so thanks for that :) > > That was why I split the patches so small. I'm kind of glad it paid off ... > not glad to have caused a performance regression! > > > > I have a few reactions to this: > > > > > > 1. I'm concerned that the XArray may generally be slower than the radix > > > tree was. I didn't notice that in my testing, but maybe I didn't do > > > the right tests. > > > > So one difference I've noticed when staring into the code and annotated > > perf traces is that xas_store() will call xas_init_marks() when stored > > entry is 0. __radix_tree_replace() didn't do this. And the cache misses we > > take from checking tags do add up. After hacking the code in xas_store() so > > that __clear_shadow_entry() does not touch tags, I get around half of the > > regression back. For now I didn't think how to do this so that the API > > remains reasonably clean. So now we are at: > > > > COMMIT AVG STDDEV > > a97e7904c0 1431256.500000 1489.361759 > > 69b6c1319b 1566944.000000 2252.692877 > > notaginit 1483740.700000 7680.583455 > > Well, that seems worth doing. For the page cache case, we know that > shadow entries have no tags set (at least right now), so it seems > reasonable to move the xas_init_marks() from xas_store() to its various > callers. > > > > 2. The setup overhead of the XA_STATE might be a problem. > > > If so, we can do some batching in order to improve things. > > > I suspect your test is calling __clear_shadow_entry through the > > > truncate_exceptional_pvec_entries() path, which is already a batch. > > > Maybe something like patch [1] at the end of this mail. > > > > So this apparently contributes as well but not too much. With your patch > > applied on top of 'notaginit' kernel above I've got to: > > > > batched-xas 1473900.300000 950.439377 > > Fascinating that it reduces the stddev so much. 
> We can probably take this

I would not concentrate on the reduction of the stddev too much. The high
stddev for 'notaginit' is caused by one relatively big outlier (otherwise
we are at ~2200). But still, the patch reduced the stddev to about a half,
so there is some improvement.

> further (getting into the realm of #3 below) -- the call to xas_set() will
> restart the walk from the top of the tree each time. Clearly this usecase
> (many thousands of shadow entries) is going to construct a very deep tree,
> and we're effectively doing a linear scan over the bottom of the tree, so
> starting from the top each time is O(n.log n) instead of O(n). I think
> you said the file was 64GB, which is 16 million 4k entries, or 24 bits of
> tree index. That's 4 levels deep so it'll add up.

Actually the benchmark creates 64 files, 1GB each, so the depth of the tree
will be just 3. But yes, traversing from the top of the tree each time only
to zero out a single word there looks just wasteful. It seems we should be
able to have a much more efficient truncate implementation which would just
trim a whole node's worth of the xarray - workingset entries would be
directly destroyed, pages returned locked (provided they can be locked with
trylock).

Looking more into perf traces, I didn't notice any other obvious low
hanging fruit. There is one suboptimality in the fact that
__clear_shadow_entry() does xas_load() and the first thing xas_store() does
is xas_load() again. Removing this double tree traversal does bring
something back but since the traversals are so close together, everything
is cache hot and so the overall gain is small (but still):

COMMIT		AVG		STDDEV
singleiter	1467763.900000	1078.261049

So still 34 ms to go to the original time.

What profiles do show is there's slightly more time spent here and there
adding to overall larger xas_store() time (compared to
__radix_tree_replace()) mostly due to what I'd blame to cache misses
(xas_store() is responsible for ~3.4% of cache misses after the patch while
xas_store() + __radix_tree_replace() caused only 1.5% together before).

Some of the expensive loads seem to be from 'xas' structure (kind
of matches with what Nick said), other expensive loads seem to be loads
from xa_node. And I don't have a great explanation why e.g. a load of
xa_node->count is expensive when we looked at xa_node->shift before -
either the cache line fell out of cache or the byte accesses somehow
confuse the CPU. Also xas_store() has some new accesses compared to
__radix_tree_replace() - e.g. it did not previously touch node->shift.

So overall I don't see easy way how to speed up xarray code further so
maybe just batching truncate to make up for some of the losses and live
with them where we cannot batch will be as good as it gets...

								Honza
> > > > I agree these two are good ideas to improve the speed. But old radix tree > > code has these issues as well so they are not the reason of this > > regression. So I'd like to track down where Xarray code is slower first. > > > > I'm going to dig more into annotated profiles... > > Thanks! I'll work on a patch to remove the xas_init_marks() from xas_store(). -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 13+ messages in thread
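One way to picture the "trim a whole range in one pass" direction Jan sketches above is a single xas_for_each() walk that erases shadow entries as it goes, instead of re-walking from the root for every index. This is purely illustrative, not a proposed patch, and it ignores the lock ordering, DAX and partial-page cases that truncate_exceptional_pvec_entries() has to handle:

/* Illustrative sketch only: clear all shadow (value) entries in a range
 * with one tree walk.  Ignores DAX, huge pages and lock ordering. */
static void clear_shadow_entries_range(struct address_space *mapping,
				       pgoff_t start, pgoff_t end)
{
	XA_STATE(xas, &mapping->i_pages, start);
	void *entry;

	xas_set_update(&xas, workingset_update_node);
	xas_lock_irq(&xas);
	xas_for_each(&xas, entry, end) {
		if (!xa_is_value(entry))	/* skip real pages */
			continue;
		xas_store(&xas, NULL);
		mapping->nrexceptional--;
	}
	xas_unlock_irq(&xas);
}
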
* Re: Truncate regression due to commit 69b6c1319b6 2019-02-27 16:55 ` Jan Kara @ 2019-02-28 22:53 ` Matthew Wilcox 2019-03-14 11:10 ` Jan Kara 0 siblings, 1 reply; 13+ messages in thread From: Matthew Wilcox @ 2019-02-28 22:53 UTC (permalink / raw) To: Jan Kara; +Cc: linux-mm, mgorman On Wed, Feb 27, 2019 at 05:55:38PM +0100, Jan Kara wrote: > On Wed 27-02-19 04:24:51, Matthew Wilcox wrote: > Looking more into perf traces, I didn't notice any other obvious low > hanging fruit. There is one suboptimality in the fact that > __clear_shadow_entry() does xas_load() and the first thing xas_store() does > is xas_load() again. Removing this double tree traversal does bring > something back but since the traversals are so close together, everything > is cache hot and so the overall gain is small (but still): Calling xas_load() twice isn't too bad; the iterator stays at the base of the tree, so it's just one pointer reload. Still, I think we can avoid it. > COMMIT AVG STDDEV > singleiter 1467763.900000 1078.261049 > > So still 34 ms to go to the original time. > > What profiles do show is there's slightly more time spent here and there > adding to overall larger xas_store() time (compared to > __radix_tree_replace()) mostly due to what I'd blame to cache misses > (xas_store() is responsible for ~3.4% of cache misses after the patch while > xas_store() + __radix_tree_replace() caused only 1.5% together before). > > Some of the expensive loads seem to be from 'xas' structure (kind > of matches with what Nick said), other expensive loads seem to be loads from > xa_node. And I don't have a great explanation why e.g. a load of > xa_node->count is expensive when we looked at xa_node->shift before - > either the cache line fell out of cache or the byte accesses somehow > confuse the CPU. Also xas_store() has some new accesses compared to > __radix_tree_replace() - e.g. it did not previously touch node->shift. > > So overall I don't see easy way how to speed up xarray code further so > maybe just batching truncate to make up for some of the losses and live > with them where we cannot batch will be as good as it gets... Here's what I'm currently looking at. 
xas_store() becomes a wrapper around xas_replace() and xas_replace() avoids the xas_init_marks() and xas_load() calls: diff --git a/include/linux/xarray.h b/include/linux/xarray.h index 0e01e6129145..26fdba17ce67 100644 --- a/include/linux/xarray.h +++ b/include/linux/xarray.h @@ -1455,6 +1455,7 @@ static inline bool xas_retry(struct xa_state *xas, const void *entry) void *xas_load(struct xa_state *); void *xas_store(struct xa_state *, void *entry); +void xas_replace(struct xa_state *, void *curr, void *entry); void *xas_find(struct xa_state *, unsigned long max); void *xas_find_conflict(struct xa_state *); diff --git a/lib/test_xarray.c b/lib/test_xarray.c index 5d4bad8bd96a..b2e2cdf4eb74 100644 --- a/lib/test_xarray.c +++ b/lib/test_xarray.c @@ -38,6 +38,12 @@ static void *xa_store_index(struct xarray *xa, unsigned long index, gfp_t gfp) return xa_store(xa, index, xa_mk_index(index), gfp); } +static void xa_insert_index(struct xarray *xa, unsigned long index) +{ + XA_BUG_ON(xa, xa_insert(xa, index, xa_mk_index(index), + GFP_KERNEL) != 0); +} + static void xa_alloc_index(struct xarray *xa, unsigned long index, gfp_t gfp) { u32 id; @@ -338,6 +344,20 @@ static noinline void check_xa_shrink(struct xarray *xa) } } +static noinline void check_insert(struct xarray *xa) +{ + unsigned long i; + + for (i = 0; i < 1024; i++) { + xa_insert_index(xa, i); + XA_BUG_ON(xa, xa_load(xa, i - 1) != NULL); + XA_BUG_ON(xa, xa_load(xa, i + 1) != NULL); + xa_erase_index(xa, i); + } + + XA_BUG_ON(xa, !xa_empty(xa)); +} + static noinline void check_cmpxchg(struct xarray *xa) { void *FIVE = xa_mk_value(5); @@ -1527,6 +1547,7 @@ static int xarray_checks(void) check_xa_mark(&array); check_xa_shrink(&array); check_xas_erase(&array); + check_insert(&array); check_cmpxchg(&array); check_reserve(&array); check_reserve(&xa0); diff --git a/lib/xarray.c b/lib/xarray.c index 6be3acbb861f..8ff605bd0fee 100644 --- a/lib/xarray.c +++ b/lib/xarray.c @@ -613,7 +613,7 @@ static int xas_expand(struct xa_state *xas, void *head) /* * xas_create() - Create a slot to store an entry in. * @xas: XArray operation state. - * @allow_root: %true if we can store the entry in the root directly + * @entry: Entry which will be stored in the slot. * * Most users will not need to call this function directly, as it is called * by xas_store(). It is useful for doing conditional store operations @@ -623,14 +623,14 @@ static int xas_expand(struct xa_state *xas, void *head) * If the slot was newly created, returns %NULL. If it failed to create the * slot, returns %NULL and indicates the error in @xas. */ -static void *xas_create(struct xa_state *xas, bool allow_root) +static void *xas_create(struct xa_state *xas, void *entry) { struct xarray *xa = xas->xa; - void *entry; void __rcu **slot; struct xa_node *node = xas->xa_node; int shift; unsigned int order = xas->xa_shift; + bool allow_root = !xa_is_node(entry) && !xa_is_zero(entry); if (xas_top(node)) { entry = xa_head_locked(xa); @@ -701,7 +701,7 @@ void xas_create_range(struct xa_state *xas) xas->xa_sibs = 0; for (;;) { - xas_create(xas, true); + xas_create(xas, XA_ZERO_ENTRY); if (xas_error(xas)) goto restore; if (xas->xa_index <= (index | XA_CHUNK_MASK)) @@ -745,44 +745,36 @@ static void update_node(struct xa_state *xas, struct xa_node *node, } /** - * xas_store() - Store this entry in the XArray. + * xas_replace() - Replace one XArray entry with another. * @xas: XArray operation state. + * @curr: Current entry. * @entry: New entry. 
* - * If @xas is operating on a multi-index entry, the entry returned by this - * function is essentially meaningless (it may be an internal entry or it - * may be %NULL, even if there are non-NULL entries at some of the indices - * covered by the range). This is not a problem for any current users, - * and can be changed if needed. - * - * Return: The old entry at this index. + * This is not a cmpxchg operation. The caller asserts that @curr is the + * current entry at the index referred to by @xas and wishes to replace it + * with @entry. The slot must have already been created by xas_create() + * or by virtue of @curr being non-NULL. The marks are not changed by + * this operation. */ -void *xas_store(struct xa_state *xas, void *entry) +void xas_replace(struct xa_state *xas, void *curr, void *entry) { struct xa_node *node; void __rcu **slot = &xas->xa->xa_head; unsigned int offset, max; int count = 0; int values = 0; - void *first, *next; + void *next; bool value = xa_is_value(entry); - if (entry) { - bool allow_root = !xa_is_node(entry) && !xa_is_zero(entry); - first = xas_create(xas, allow_root); - } else { - first = xas_load(xas); - } - if (xas_invalid(xas)) - return first; + return; node = xas->xa_node; if (node && (xas->xa_shift < node->shift)) xas->xa_sibs = 0; - if ((first == entry) && !xas->xa_sibs) - return first; + if ((curr == entry) && !xas->xa_sibs) + return; - next = first; + next = curr; offset = xas->xa_offset; max = xas->xa_offset + xas->xa_sibs; if (node) { @@ -790,8 +782,6 @@ void *xas_store(struct xa_state *xas, void *entry) if (xas->xa_sibs) xas_squash_marks(xas); } - if (!entry) - xas_init_marks(xas); for (;;) { /* @@ -807,7 +797,7 @@ void *xas_store(struct xa_state *xas, void *entry) if (!node) break; count += !next - !entry; - values += !xa_is_value(first) - !value; + values += !xa_is_value(curr) - !value; if (entry) { if (offset == max) break; @@ -821,13 +811,41 @@ void *xas_store(struct xa_state *xas, void *entry) if (!xa_is_sibling(next)) { if (!entry && (offset > max)) break; - first = next; + curr = next; } slot++; } update_node(xas, node, count, values); - return first; +} + +/** + * xas_store() - Store this entry in the XArray. + * @xas: XArray operation state. + * @entry: New entry. + * + * If @xas is operating on a multi-index entry, the entry returned by this + * function is essentially meaningless (it may be an internal entry or it + * may be %NULL, even if there are non-NULL entries at some of the indices + * covered by the range). This is not a problem for any current users, + * and can be changed if needed. + * + * Return: The old entry at this index. 
+ */ +void *xas_store(struct xa_state *xas, void *entry) +{ + void *curr; + + if (entry) { + curr = xas_create(xas, entry); + } else { + curr = xas_load(xas); + if (curr) + xas_init_marks(xas); + } + + xas_replace(xas, curr, entry); + return curr; } EXPORT_SYMBOL_GPL(xas_store); @@ -1472,9 +1490,9 @@ int __xa_insert(struct xarray *xa, unsigned long index, void *entry, gfp_t gfp) entry = XA_ZERO_ENTRY; do { - curr = xas_load(&xas); + curr = xas_create(&xas, entry); if (!curr) { - xas_store(&xas, entry); + xas_replace(&xas, curr, entry); if (xa_track_free(xa)) xas_clear_mark(&xas, XA_FREE_MARK); } else { @@ -1553,7 +1571,7 @@ void *xa_store_range(struct xarray *xa, unsigned long first, if (last + 1) order = __ffs(last + 1); xas_set_order(&xas, last, order); - xas_create(&xas, true); + xas_create(&xas, entry); if (xas_error(&xas)) goto unlock; } @@ -1606,11 +1624,13 @@ int __xa_alloc(struct xarray *xa, u32 *id, void *entry, do { xas.xa_index = limit.min; xas_find_marked(&xas, limit.max, XA_FREE_MARK); - if (xas.xa_node == XAS_RESTART) + if (xas.xa_node == XAS_RESTART) { xas_set_err(&xas, -EBUSY); - else + } else { + xas_create(&xas, entry); *id = xas.xa_index; - xas_store(&xas, entry); + } + xas_replace(&xas, NULL, entry); xas_clear_mark(&xas, XA_FREE_MARK); } while (__xas_nomem(&xas, gfp)); diff --git a/mm/filemap.c b/mm/filemap.c index 9f5e323e883e..56a7ef579879 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -131,8 +131,8 @@ static void page_cache_delete(struct address_space *mapping, VM_BUG_ON_PAGE(PageTail(page), page); VM_BUG_ON_PAGE(nr != 1 && shadow, page); - xas_store(&xas, shadow); xas_init_marks(&xas); + xas_replace(&xas, page, shadow); page->mapping = NULL; /* Leave page->index set: truncation lookup relies upon it */ @@ -326,7 +326,7 @@ static void page_cache_delete_batch(struct address_space *mapping, != pvec->pages[i]->index, page); tail_pages--; } - xas_store(&xas, NULL); + xas_replace(&xas, page, NULL); total_pages++; } mapping->nrpages -= total_pages; @@ -771,7 +771,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask) new->index = offset; xas_lock_irqsave(&xas, flags); - xas_store(&xas, new); + xas_replace(&xas, old, new); old->mapping = NULL; /* hugetlb pages do not participate in page cache accounting. 
*/ diff --git a/mm/migrate.c b/mm/migrate.c index d4fd680be3b0..083f52797d11 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -459,13 +459,13 @@ int migrate_page_move_mapping(struct address_space *mapping, SetPageDirty(newpage); } - xas_store(&xas, newpage); + xas_replace(&xas, page, newpage); if (PageTransHuge(page)) { int i; for (i = 1; i < HPAGE_PMD_NR; i++) { xas_next(&xas); - xas_store(&xas, newpage + i); + xas_replace(&xas, page + i, newpage + i); } } @@ -536,7 +536,7 @@ int migrate_huge_page_move_mapping(struct address_space *mapping, get_page(newpage); - xas_store(&xas, newpage); + xas_replace(&xas, page, newpage); page_ref_unfreeze(page, expected_count - 1); diff --git a/mm/shmem.c b/mm/shmem.c index 6ece1e2fe76e..83925601089d 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -341,7 +341,7 @@ static int shmem_replace_entry(struct address_space *mapping, item = xas_load(&xas); if (item != expected) return -ENOENT; - xas_store(&xas, replacement); + xas_replace(&xas, item, replacement); return 0; } diff --git a/mm/truncate.c b/mm/truncate.c index 798e7ccfb030..0682b2f9ac0e 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -38,7 +38,7 @@ static inline void __clear_shadow_entry(struct address_space *mapping, xas_set_update(&xas, workingset_update_node); if (xas_load(&xas) != entry) return; - xas_store(&xas, NULL); + xas_replace(&xas, entry, NULL); mapping->nrexceptional--; } ^ permalink raw reply related [flat|nested] 13+ messages in thread
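To spell out the calling convention the patch introduces: a user of xas_replace() must already know the current entry, typically from an immediately preceding xas_load() under the same lock, and becomes responsible for any mark handling that xas_store() used to do implicitly when storing NULL. A condensed sketch of that contract, distilled from the call sites in the patch above ('expected' is a placeholder; retry and error handling are omitted):

/* Sketch of the xas_replace() contract; not an additional patch. */
static void erase_expected_entry(struct address_space *mapping,
				 pgoff_t index, void *expected)
{
	XA_STATE(xas, &mapping->i_pages, index);
	void *curr;

	xas_lock_irq(&xas);
	curr = xas_load(&xas);			/* look up the current entry */
	if (curr == expected) {
		xas_init_marks(&xas);		/* caller handles marks now */
		xas_replace(&xas, curr, NULL);	/* then replaces in place */
	}
	xas_unlock_irq(&xas);
}
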
* Re: Truncate regression due to commit 69b6c1319b6 2019-02-28 22:53 ` Matthew Wilcox @ 2019-03-14 11:10 ` Jan Kara 2019-05-31 19:04 ` Matthew Wilcox 0 siblings, 1 reply; 13+ messages in thread From: Jan Kara @ 2019-03-14 11:10 UTC (permalink / raw) To: Matthew Wilcox; +Cc: Jan Kara, linux-mm, mgorman Hi Matthew, On Thu 28-02-19 14:53:17, Matthew Wilcox wrote: > On Wed, Feb 27, 2019 at 05:55:38PM +0100, Jan Kara wrote: > > On Wed 27-02-19 04:24:51, Matthew Wilcox wrote: > > Looking more into perf traces, I didn't notice any other obvious low > > hanging fruit. There is one suboptimality in the fact that > > __clear_shadow_entry() does xas_load() and the first thing xas_store() does > > is xas_load() again. Removing this double tree traversal does bring > > something back but since the traversals are so close together, everything > > is cache hot and so the overall gain is small (but still): > > Calling xas_load() twice isn't too bad; the iterator stays at the base of > the tree, so it's just one pointer reload. Still, I think we can avoid it. > > > COMMIT AVG STDDEV > > singleiter 1467763.900000 1078.261049 > > > > So still 34 ms to go to the original time. > > > > What profiles do show is there's slightly more time spent here and there > > adding to overall larger xas_store() time (compared to > > __radix_tree_replace()) mostly due to what I'd blame to cache misses > > (xas_store() is responsible for ~3.4% of cache misses after the patch while > > xas_store() + __radix_tree_replace() caused only 1.5% together before). > > > > Some of the expensive loads seem to be from 'xas' structure (kind > > of matches with what Nick said), other expensive loads seem to be loads from > > xa_node. And I don't have a great explanation why e.g. a load of > > xa_node->count is expensive when we looked at xa_node->shift before - > > either the cache line fell out of cache or the byte accesses somehow > > confuse the CPU. Also xas_store() has some new accesses compared to > > __radix_tree_replace() - e.g. it did not previously touch node->shift. > > > > So overall I don't see easy way how to speed up xarray code further so > > maybe just batching truncate to make up for some of the losses and live > > with them where we cannot batch will be as good as it gets... > > Here's what I'm currently looking at. xas_store() becomes a wrapper > around xas_replace() and xas_replace() avoids the xas_init_marks() and > xas_load() calls: This looks reasonable to me. Do you have some official series I could test or where do we stand? 
Honza > > diff --git a/include/linux/xarray.h b/include/linux/xarray.h > index 0e01e6129145..26fdba17ce67 100644 > --- a/include/linux/xarray.h > +++ b/include/linux/xarray.h > @@ -1455,6 +1455,7 @@ static inline bool xas_retry(struct xa_state *xas, const void *entry) > > void *xas_load(struct xa_state *); > void *xas_store(struct xa_state *, void *entry); > +void xas_replace(struct xa_state *, void *curr, void *entry); > void *xas_find(struct xa_state *, unsigned long max); > void *xas_find_conflict(struct xa_state *); > > diff --git a/lib/test_xarray.c b/lib/test_xarray.c > index 5d4bad8bd96a..b2e2cdf4eb74 100644 > --- a/lib/test_xarray.c > +++ b/lib/test_xarray.c > @@ -38,6 +38,12 @@ static void *xa_store_index(struct xarray *xa, unsigned long index, gfp_t gfp) > return xa_store(xa, index, xa_mk_index(index), gfp); > } > > +static void xa_insert_index(struct xarray *xa, unsigned long index) > +{ > + XA_BUG_ON(xa, xa_insert(xa, index, xa_mk_index(index), > + GFP_KERNEL) != 0); > +} > + > static void xa_alloc_index(struct xarray *xa, unsigned long index, gfp_t gfp) > { > u32 id; > @@ -338,6 +344,20 @@ static noinline void check_xa_shrink(struct xarray *xa) > } > } > > +static noinline void check_insert(struct xarray *xa) > +{ > + unsigned long i; > + > + for (i = 0; i < 1024; i++) { > + xa_insert_index(xa, i); > + XA_BUG_ON(xa, xa_load(xa, i - 1) != NULL); > + XA_BUG_ON(xa, xa_load(xa, i + 1) != NULL); > + xa_erase_index(xa, i); > + } > + > + XA_BUG_ON(xa, !xa_empty(xa)); > +} > + > static noinline void check_cmpxchg(struct xarray *xa) > { > void *FIVE = xa_mk_value(5); > @@ -1527,6 +1547,7 @@ static int xarray_checks(void) > check_xa_mark(&array); > check_xa_shrink(&array); > check_xas_erase(&array); > + check_insert(&array); > check_cmpxchg(&array); > check_reserve(&array); > check_reserve(&xa0); > diff --git a/lib/xarray.c b/lib/xarray.c > index 6be3acbb861f..8ff605bd0fee 100644 > --- a/lib/xarray.c > +++ b/lib/xarray.c > @@ -613,7 +613,7 @@ static int xas_expand(struct xa_state *xas, void *head) > /* > * xas_create() - Create a slot to store an entry in. > * @xas: XArray operation state. > - * @allow_root: %true if we can store the entry in the root directly > + * @entry: Entry which will be stored in the slot. > * > * Most users will not need to call this function directly, as it is called > * by xas_store(). It is useful for doing conditional store operations > @@ -623,14 +623,14 @@ static int xas_expand(struct xa_state *xas, void *head) > * If the slot was newly created, returns %NULL. If it failed to create the > * slot, returns %NULL and indicates the error in @xas. > */ > -static void *xas_create(struct xa_state *xas, bool allow_root) > +static void *xas_create(struct xa_state *xas, void *entry) > { > struct xarray *xa = xas->xa; > - void *entry; > void __rcu **slot; > struct xa_node *node = xas->xa_node; > int shift; > unsigned int order = xas->xa_shift; > + bool allow_root = !xa_is_node(entry) && !xa_is_zero(entry); > > if (xas_top(node)) { > entry = xa_head_locked(xa); > @@ -701,7 +701,7 @@ void xas_create_range(struct xa_state *xas) > xas->xa_sibs = 0; > > for (;;) { > - xas_create(xas, true); > + xas_create(xas, XA_ZERO_ENTRY); > if (xas_error(xas)) > goto restore; > if (xas->xa_index <= (index | XA_CHUNK_MASK)) > @@ -745,44 +745,36 @@ static void update_node(struct xa_state *xas, struct xa_node *node, > } > > /** > - * xas_store() - Store this entry in the XArray. > + * xas_replace() - Replace one XArray entry with another. > * @xas: XArray operation state. 
> + * @curr: Current entry. > * @entry: New entry. > * > - * If @xas is operating on a multi-index entry, the entry returned by this > - * function is essentially meaningless (it may be an internal entry or it > - * may be %NULL, even if there are non-NULL entries at some of the indices > - * covered by the range). This is not a problem for any current users, > - * and can be changed if needed. > - * > - * Return: The old entry at this index. > + * This is not a cmpxchg operation. The caller asserts that @curr is the > + * current entry at the index referred to by @xas and wishes to replace it > + * with @entry. The slot must have already been created by xas_create() > + * or by virtue of @curr being non-NULL. The marks are not changed by > + * this operation. > */ > -void *xas_store(struct xa_state *xas, void *entry) > +void xas_replace(struct xa_state *xas, void *curr, void *entry) > { > struct xa_node *node; > void __rcu **slot = &xas->xa->xa_head; > unsigned int offset, max; > int count = 0; > int values = 0; > - void *first, *next; > + void *next; > bool value = xa_is_value(entry); > > - if (entry) { > - bool allow_root = !xa_is_node(entry) && !xa_is_zero(entry); > - first = xas_create(xas, allow_root); > - } else { > - first = xas_load(xas); > - } > - > if (xas_invalid(xas)) > - return first; > + return; > node = xas->xa_node; > if (node && (xas->xa_shift < node->shift)) > xas->xa_sibs = 0; > - if ((first == entry) && !xas->xa_sibs) > - return first; > + if ((curr == entry) && !xas->xa_sibs) > + return; > > - next = first; > + next = curr; > offset = xas->xa_offset; > max = xas->xa_offset + xas->xa_sibs; > if (node) { > @@ -790,8 +782,6 @@ void *xas_store(struct xa_state *xas, void *entry) > if (xas->xa_sibs) > xas_squash_marks(xas); > } > - if (!entry) > - xas_init_marks(xas); > > for (;;) { > /* > @@ -807,7 +797,7 @@ void *xas_store(struct xa_state *xas, void *entry) > if (!node) > break; > count += !next - !entry; > - values += !xa_is_value(first) - !value; > + values += !xa_is_value(curr) - !value; > if (entry) { > if (offset == max) > break; > @@ -821,13 +811,41 @@ void *xas_store(struct xa_state *xas, void *entry) > if (!xa_is_sibling(next)) { > if (!entry && (offset > max)) > break; > - first = next; > + curr = next; > } > slot++; > } > > update_node(xas, node, count, values); > - return first; > +} > + > +/** > + * xas_store() - Store this entry in the XArray. > + * @xas: XArray operation state. > + * @entry: New entry. > + * > + * If @xas is operating on a multi-index entry, the entry returned by this > + * function is essentially meaningless (it may be an internal entry or it > + * may be %NULL, even if there are non-NULL entries at some of the indices > + * covered by the range). This is not a problem for any current users, > + * and can be changed if needed. > + * > + * Return: The old entry at this index. 
> + */ > +void *xas_store(struct xa_state *xas, void *entry) > +{ > + void *curr; > + > + if (entry) { > + curr = xas_create(xas, entry); > + } else { > + curr = xas_load(xas); > + if (curr) > + xas_init_marks(xas); > + } > + > + xas_replace(xas, curr, entry); > + return curr; > } > EXPORT_SYMBOL_GPL(xas_store); > > @@ -1472,9 +1490,9 @@ int __xa_insert(struct xarray *xa, unsigned long index, void *entry, gfp_t gfp) > entry = XA_ZERO_ENTRY; > > do { > - curr = xas_load(&xas); > + curr = xas_create(&xas, entry); > if (!curr) { > - xas_store(&xas, entry); > + xas_replace(&xas, curr, entry); > if (xa_track_free(xa)) > xas_clear_mark(&xas, XA_FREE_MARK); > } else { > @@ -1553,7 +1571,7 @@ void *xa_store_range(struct xarray *xa, unsigned long first, > if (last + 1) > order = __ffs(last + 1); > xas_set_order(&xas, last, order); > - xas_create(&xas, true); > + xas_create(&xas, entry); > if (xas_error(&xas)) > goto unlock; > } > @@ -1606,11 +1624,13 @@ int __xa_alloc(struct xarray *xa, u32 *id, void *entry, > do { > xas.xa_index = limit.min; > xas_find_marked(&xas, limit.max, XA_FREE_MARK); > - if (xas.xa_node == XAS_RESTART) > + if (xas.xa_node == XAS_RESTART) { > xas_set_err(&xas, -EBUSY); > - else > + } else { > + xas_create(&xas, entry); > *id = xas.xa_index; > - xas_store(&xas, entry); > + } > + xas_replace(&xas, NULL, entry); > xas_clear_mark(&xas, XA_FREE_MARK); > } while (__xas_nomem(&xas, gfp)); > > diff --git a/mm/filemap.c b/mm/filemap.c > index 9f5e323e883e..56a7ef579879 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -131,8 +131,8 @@ static void page_cache_delete(struct address_space *mapping, > VM_BUG_ON_PAGE(PageTail(page), page); > VM_BUG_ON_PAGE(nr != 1 && shadow, page); > > - xas_store(&xas, shadow); > xas_init_marks(&xas); > + xas_replace(&xas, page, shadow); > > page->mapping = NULL; > /* Leave page->index set: truncation lookup relies upon it */ > @@ -326,7 +326,7 @@ static void page_cache_delete_batch(struct address_space *mapping, > != pvec->pages[i]->index, page); > tail_pages--; > } > - xas_store(&xas, NULL); > + xas_replace(&xas, page, NULL); > total_pages++; > } > mapping->nrpages -= total_pages; > @@ -771,7 +771,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask) > new->index = offset; > > xas_lock_irqsave(&xas, flags); > - xas_store(&xas, new); > + xas_replace(&xas, old, new); > > old->mapping = NULL; > /* hugetlb pages do not participate in page cache accounting. 
*/ > diff --git a/mm/migrate.c b/mm/migrate.c > index d4fd680be3b0..083f52797d11 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -459,13 +459,13 @@ int migrate_page_move_mapping(struct address_space *mapping, > SetPageDirty(newpage); > } > > - xas_store(&xas, newpage); > + xas_replace(&xas, page, newpage); > if (PageTransHuge(page)) { > int i; > > for (i = 1; i < HPAGE_PMD_NR; i++) { > xas_next(&xas); > - xas_store(&xas, newpage + i); > + xas_replace(&xas, page + i, newpage + i); > } > } > > @@ -536,7 +536,7 @@ int migrate_huge_page_move_mapping(struct address_space *mapping, > > get_page(newpage); > > - xas_store(&xas, newpage); > + xas_replace(&xas, page, newpage); > > page_ref_unfreeze(page, expected_count - 1); > > diff --git a/mm/shmem.c b/mm/shmem.c > index 6ece1e2fe76e..83925601089d 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -341,7 +341,7 @@ static int shmem_replace_entry(struct address_space *mapping, > item = xas_load(&xas); > if (item != expected) > return -ENOENT; > - xas_store(&xas, replacement); > + xas_replace(&xas, item, replacement); > return 0; > } > > diff --git a/mm/truncate.c b/mm/truncate.c > index 798e7ccfb030..0682b2f9ac0e 100644 > --- a/mm/truncate.c > +++ b/mm/truncate.c > @@ -38,7 +38,7 @@ static inline void __clear_shadow_entry(struct address_space *mapping, > xas_set_update(&xas, workingset_update_node); > if (xas_load(&xas) != entry) > return; > - xas_store(&xas, NULL); > + xas_replace(&xas, entry, NULL); > mapping->nrexceptional--; > } > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Truncate regression due to commit 69b6c1319b6 2019-03-14 11:10 ` Jan Kara @ 2019-05-31 19:04 ` Matthew Wilcox 2019-08-27 13:29 ` Jan Kara 0 siblings, 1 reply; 13+ messages in thread From: Matthew Wilcox @ 2019-05-31 19:04 UTC (permalink / raw) To: Jan Kara; +Cc: linux-mm, mgorman On Thu, Mar 14, 2019 at 12:10:12PM +0100, Jan Kara wrote: > On Thu 28-02-19 14:53:17, Matthew Wilcox wrote: > > Here's what I'm currently looking at. xas_store() becomes a wrapper > > around xas_replace() and xas_replace() avoids the xas_init_marks() and > > xas_load() calls: > > This looks reasonable to me. Do you have some official series I could test > or where do we stand? Hi Jan, Sorry for the delay; I've put this into the xarray tree: http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray I'm planning to ask Linus to pull it in about a week. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Truncate regression due to commit 69b6c1319b6 2019-05-31 19:04 ` Matthew Wilcox @ 2019-08-27 13:29 ` Jan Kara 2019-08-29 11:59 ` Matthew Wilcox 0 siblings, 1 reply; 13+ messages in thread From: Jan Kara @ 2019-08-27 13:29 UTC (permalink / raw) To: Matthew Wilcox; +Cc: Jan Kara, linux-mm, mgorman On Fri 31-05-19 12:04:31, Matthew Wilcox wrote: > On Thu, Mar 14, 2019 at 12:10:12PM +0100, Jan Kara wrote: > > On Thu 28-02-19 14:53:17, Matthew Wilcox wrote: > > > Here's what I'm currently looking at. xas_store() becomes a wrapper > > > around xas_replace() and xas_replace() avoids the xas_init_marks() and > > > xas_load() calls: > > > > This looks reasonable to me. Do you have some official series I could test > > or where do we stand? > > Hi Jan, > > Sorry for the delay; I've put this into the xarray tree: > > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray > > I'm planning to ask Linus to pull it in about a week. Hum, I still don't see these xarray changes (the change to use xas_replace() in particular) upstream. What has happened? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Truncate regression due to commit 69b6c1319b6 2019-08-27 13:29 ` Jan Kara @ 2019-08-29 11:59 ` Matthew Wilcox 0 siblings, 0 replies; 13+ messages in thread From: Matthew Wilcox @ 2019-08-29 11:59 UTC (permalink / raw) To: Jan Kara; +Cc: linux-mm, mgorman On Tue, Aug 27, 2019 at 03:29:25PM +0200, Jan Kara wrote: > On Fri 31-05-19 12:04:31, Matthew Wilcox wrote: > > On Thu, Mar 14, 2019 at 12:10:12PM +0100, Jan Kara wrote: > > > On Thu 28-02-19 14:53:17, Matthew Wilcox wrote: > > > > Here's what I'm currently looking at. xas_store() becomes a wrapper > > > > around xas_replace() and xas_replace() avoids the xas_init_marks() and > > > > xas_load() calls: > > > > > > This looks reasonable to me. Do you have some official series I could test > > > or where do we stand? > > > > Hi Jan, > > > > Sorry for the delay; I've put this into the xarray tree: > > > > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray > > > > I'm planning to ask Linus to pull it in about a week. > > Hum, I still don't see these xarray changes (the change to use > xas_replace() in particular) upstream. What has happened? It had a bug and I decided to pull the patch for now rather than find the bug ... this regression is still on my mind. Thanks for the ping. ^ permalink raw reply [flat|nested] 13+ messages in thread