* Truncate regression due to commit 69b6c1319b6
@ 2019-02-26 16:56 Jan Kara
  2019-02-26 17:27 ` Matthew Wilcox
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2019-02-26 16:56 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, mgorman

[-- Attachment #1: Type: text/plain, Size: 1322 bytes --]

Hello Matthew,

after some peripeties, I was able to bisect down to a regression in
truncate performance caused by commit 69b6c1319b6 "mm: Convert truncate to
XArray". The initial benchmark that indicated the regression was a bonnie++
file delete test (some 4-5% regression). It is however much easier (and
faster!) to see the regression with the attached benchmark. With it I can
see about a 10% regression on my test machine:

Average of 10 runs, time in us.

COMMIT		AVG		STDDEV
a97e7904c0	1431256.500000	1489.361759
69b6c1319b	1566944.000000	2252.692877

The benchmark has to be run so that the total file size is about 2x the
memory size (as the regression seems to be in truncating existing
workingset entries). So on my test machine with 32 GB of RAM I run it like:

# This prepares the environment
mkfs.xfs -f /dev/sdb1
mount -t xfs /dev/sdb1 /mnt
./trunc-bench /mnt/file 64 1
# This does the truncation
./trunc-bench /mnt/file 64 2

I've gathered also perf profiles but from the first look they don't show
anything surprising besides xas_load() and xas_store() taking up more time
than original counterparts did. I'll try to dig more into this but any idea
is appreciated.

My test machine is an 8 CPU Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: trunc-bench.c --]
[-- Type: text/x-c, Size: 1478 bytes --]

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/time.h>

#define COUNT 1024
#define BUFSIZE (1024*1024)

char *buf;

void read_file(int fd)
{
	int i;

	if (ftruncate(fd, BUFSIZE*COUNT) < 0) {
		perror("truncate");
		exit(1);
	}
	for (i = 0; i < COUNT; i++) {
		if (read(fd, buf, BUFSIZE) != BUFSIZE) {
			perror("read");
			exit(1);
		}
	}
}

int main(int argc, char **argv)
{
	int fd;
	int fcount, i, stages;
	char fname[128];
	struct timeval start, end;

	if (argc != 4) {
		fprintf(stderr, "Usage: trunc-bench <file> <file-count> <stages>\n");
		return 1;
	}
	fcount = strtol(argv[2], NULL, 0);
	stages = strtol(argv[3], NULL, 0);
	buf = malloc(BUFSIZE);
	if (!buf) {
		fprintf(stderr, "Cannot allocate write buffer.\n");
		return 1;
	}
	memset(buf, 'a', BUFSIZE);

	if (stages & 1) {
		fprintf(stderr, "Creating files...\n");
		for (i = 0; i < fcount; i++ ) {
			sprintf(fname, "%s%d", argv[1], i);
			fd = open(fname, O_RDWR | O_TRUNC | O_CREAT, 0644);
			if (fd < 0) {
				perror("open");
				return 1;
			}
			read_file(fd);
			close(fd);
		}
	}

	if (stages & 2) {
		fprintf(stderr, "Removing files...\n");
		gettimeofday(&start, NULL);
		for (i = 0; i < fcount; i++ ) {
			sprintf(fname, "%s%d", argv[1], i);
			truncate(fname, 0);
		}
		gettimeofday(&end, NULL);
		printf("%lu us\n", (end.tv_sec - start.tv_sec) * 1000000UL +
				   (end.tv_usec - start.tv_usec));
	}
	fprintf(stderr, "Done.\n");

	return 0;
}

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Truncate regression due to commit 69b6c1319b6 2019-02-26 16:56 Truncate regression due to commit 69b6c1319b6 Jan Kara @ 2019-02-26 17:27 ` Matthew Wilcox 2019-02-27 6:03 ` Nicholas Piggin 2019-02-27 11:27 ` Jan Kara 0 siblings, 2 replies; 13+ messages in thread From: Matthew Wilcox @ 2019-02-26 17:27 UTC (permalink / raw) To: Jan Kara; +Cc: linux-mm, mgorman On Tue, Feb 26, 2019 at 05:56:28PM +0100, Jan Kara wrote: > after some peripeties, I was able to bisect down to a regression in > truncate performance caused by commit 69b6c1319b6 "mm: Convert truncate to > XArray". [...] > I've gathered also perf profiles but from the first look they don't show > anything surprising besides xas_load() and xas_store() taking up more time > than original counterparts did. I'll try to dig more into this but any idea > is appreciated. Well, that's a short and sweet little commit. Stripped of comment changes, it's just: - struct radix_tree_node *node; - void **slot; + XA_STATE(xas, &mapping->i_pages, index); - if (!__radix_tree_lookup(&mapping->i_pages, index, &node, &slot)) + xas_set_update(&xas, workingset_update_node); + if (xas_load(&xas) != entry) return; - if (*slot != entry) - return; - __radix_tree_replace(&mapping->i_pages, node, slot, NULL, - workingset_update_node); + xas_store(&xas, NULL); I have a few reactions to this: 1. I'm concerned that the XArray may generally be slower than the radix tree was. I didn't notice that in my testing, but maybe I didn't do the right tests. 2. The setup overhead of the XA_STATE might be a problem. If so, we can do some batching in order to improve things. I suspect your test is calling __clear_shadow_entry through the truncate_exceptional_pvec_entries() path, which is already a batch. Maybe something like patch [1] at the end of this mail. 3. Perhaps we can actually get rid of truncate_exceptional_pvec_entries(). It seems a little daft for page_cache_delete_batch() to skip value entries, only for truncate_exceptional_pvec_entries() to erase them in a second pass. Truncation is truncation, and perhaps we can handle all of it in one place? 4. Now that calling through a function pointer is expensive, thanks to Spectre/Meltdown/..., I've been considering removing the general-purpose update function, which is only used by the page cache. Instead move parts of workingset.c into the XArray code and use a bit in the xa_flags to indicate that the node should be tracked on an LRU if it contains only value entries. [1] diff --git a/mm/truncate.c b/mm/truncate.c index 798e7ccfb030..9384f48eff2a 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -31,23 +31,23 @@ * lock. 
*/ static inline void __clear_shadow_entry(struct address_space *mapping, - pgoff_t index, void *entry) + struct xa_state *xas, void *entry) { - XA_STATE(xas, &mapping->i_pages, index); - - xas_set_update(&xas, workingset_update_node); - if (xas_load(&xas) != entry) + if (xas_load(xas) != entry) return; - xas_store(&xas, NULL); + xas_store(xas, NULL); mapping->nrexceptional--; } static void clear_shadow_entry(struct address_space *mapping, pgoff_t index, void *entry) { - xa_lock_irq(&mapping->i_pages); - __clear_shadow_entry(mapping, index, entry); - xa_unlock_irq(&mapping->i_pages); + XA_STATE(xas, &mapping->i_pages, index); + xas_set_update(&xas, workingset_update_node); + + xas_lock_irq(&xas); + __clear_shadow_entry(mapping, &xas, entry); + xas_unlock_irq(&xas); } /* @@ -59,9 +59,12 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping, struct pagevec *pvec, pgoff_t *indices, pgoff_t end) { + XA_STATE(xas, &mapping->i_pages, 0); int i, j; bool dax, lock; + xas_set_update(&xas, workingset_update_node); + /* Handled by shmem itself */ if (shmem_mapping(mapping)) return; @@ -95,7 +98,8 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping, continue; } - __clear_shadow_entry(mapping, index, page); + xas_set(&xas, index); + __clear_shadow_entry(mapping, &xas, page); } if (lock) ^ permalink raw reply related [flat|nested] 13+ messages in thread
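The batching idea in patch [1], stated on its own: build the iteration state once per batch and reposition it for each index with xas_set(), instead of constructing a fresh XA_STATE for every entry. A minimal sketch of that pattern (a fragment only, reusing the names from the mm/truncate.c hunks above; the DAX, locking and pagevec-release details of the real function are omitted):

	XA_STATE(xas, &mapping->i_pages, 0);
	int i;

	xas_set_update(&xas, workingset_update_node);
	for (i = 0; i < pagevec_count(pvec); i++) {
		xas_set(&xas, indices[i]);	/* reposition, don't rebuild */
		/* ... per-entry work against &xas ... */
	}
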
* Re: Truncate regression due to commit 69b6c1319b6 2019-02-26 17:27 ` Matthew Wilcox @ 2019-02-27 6:03 ` Nicholas Piggin 2019-02-27 12:35 ` Matthew Wilcox 2019-02-27 11:27 ` Jan Kara 1 sibling, 1 reply; 13+ messages in thread From: Nicholas Piggin @ 2019-02-27 6:03 UTC (permalink / raw) To: Jan Kara, Matthew Wilcox; +Cc: linux-mm, mgorman Matthew Wilcox's on February 27, 2019 3:27 am: > On Tue, Feb 26, 2019 at 05:56:28PM +0100, Jan Kara wrote: >> after some peripeties, I was able to bisect down to a regression in >> truncate performance caused by commit 69b6c1319b6 "mm: Convert truncate to >> XArray". > > [...] > >> I've gathered also perf profiles but from the first look they don't show >> anything surprising besides xas_load() and xas_store() taking up more time >> than original counterparts did. I'll try to dig more into this but any idea >> is appreciated. > > Well, that's a short and sweet little commit. Stripped of comment > changes, it's just: > > - struct radix_tree_node *node; > - void **slot; > + XA_STATE(xas, &mapping->i_pages, index); > > - if (!__radix_tree_lookup(&mapping->i_pages, index, &node, &slot)) > + xas_set_update(&xas, workingset_update_node); > + if (xas_load(&xas) != entry) > return; > - if (*slot != entry) > - return; > - __radix_tree_replace(&mapping->i_pages, node, slot, NULL, > - workingset_update_node); > + xas_store(&xas, NULL); > > I have a few reactions to this: > > 1. I'm concerned that the XArray may generally be slower than the radix > tree was. I didn't notice that in my testing, but maybe I didn't do > the right tests. > > 2. The setup overhead of the XA_STATE might be a problem. > If so, we can do some batching in order to improve things. > I suspect your test is calling __clear_shadow_entry through the > truncate_exceptional_pvec_entries() path, which is already a batch. > Maybe something like patch [1] at the end of this mail. One nasty thing about the XA_STATE stack object as opposed to just passing the parameters (in the same order) down to children is that you get the same memory accessed nearby, but in different ways (different base register, offset, addressing mode etc). Which can reduce effectiveness of memory disambiguation prediction, at least in cold predictor case. I've seen (on some POWER CPUs at least) flushes due to aliasing access in some of these xarray call chains, although no idea if that actually makes a noticable difference in microbenchmark like this. But it's not the greatest pattern to use for passing to low level performance critical functions :( Ideally the compiler could just do a big LTO pass right at the end and unwind it all back into registers and fix everything, but that will never happen. Thanks, Nick ^ permalink raw reply [flat|nested] 13+ messages in thread
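A minimal userspace sketch of the aliasing pattern Nick describes (the names are invented for illustration; this is not kernel code): the same values reach the callee either through a stack object, which means stores through one pointer followed by loads through a different base register that the CPU's memory-disambiguation logic has to sort out, or directly as scalar arguments, which avoids the store/load round trip entirely.

/* Illustrative only -- invented names, not kernel code. */
struct walk_state {
	unsigned long	index;
	unsigned char	shift;
	unsigned char	offset;
};

/* State passed via a stack object: the caller stores the fields, the
 * callee reloads the same bytes through another pointer. */
int walk_by_state(const struct walk_state *ws)
{
	return (int)((ws->index >> ws->shift) + ws->offset);
}

/* The same values passed as scalars stay in argument registers. */
int walk_by_args(unsigned long index, unsigned char shift,
		 unsigned char offset)
{
	return (int)((index >> shift) + offset);
}
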
* Re: Truncate regression due to commit 69b6c1319b6 2019-02-27 6:03 ` Nicholas Piggin @ 2019-02-27 12:35 ` Matthew Wilcox 2019-02-28 9:31 ` Nicholas Piggin 0 siblings, 1 reply; 13+ messages in thread From: Matthew Wilcox @ 2019-02-27 12:35 UTC (permalink / raw) To: Nicholas Piggin; +Cc: Jan Kara, linux-mm, mgorman On Wed, Feb 27, 2019 at 04:03:25PM +1000, Nicholas Piggin wrote: > Matthew Wilcox's on February 27, 2019 3:27 am: > > 2. The setup overhead of the XA_STATE might be a problem. > > If so, we can do some batching in order to improve things. > > I suspect your test is calling __clear_shadow_entry through the > > truncate_exceptional_pvec_entries() path, which is already a batch. > > Maybe something like patch [1] at the end of this mail. > > One nasty thing about the XA_STATE stack object as opposed to just > passing the parameters (in the same order) down to children is that > you get the same memory accessed nearby, but in different ways > (different base register, offset, addressing mode etc). Which can > reduce effectiveness of memory disambiguation prediction, at least > in cold predictor case. That is nasty. At the C level, it's a really attractive pattern. Shame it doesn't work out so well on hardware. I wouldn't mind turning shift/sibs/offset into a manually-extracted unsigned long if that'll help with the addressing mode mispredictions? > I've seen (on some POWER CPUs at least) flushes due to aliasing > access in some of these xarray call chains, although no idea if > that actually makes a noticable difference in microbenchmark like > this. > > But it's not the greatest pattern to use for passing to low level > performance critical functions :( Ideally the compiler could just > do a big LTO pass right at the end and unwind it all back into > registers and fix everything, but that will never happen. I wonder if we could get the compiler people to introduce a structure attribute telling the compiler to pass this whole thing back-and-forth in registers ... 6 registers is a lot to ask the compiler to reserve though. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Truncate regression due to commit 69b6c1319b6 2019-02-27 12:35 ` Matthew Wilcox @ 2019-02-28 9:31 ` Nicholas Piggin 0 siblings, 0 replies; 13+ messages in thread From: Nicholas Piggin @ 2019-02-28 9:31 UTC (permalink / raw) To: Matthew Wilcox; +Cc: Jan Kara, linux-mm, mgorman Matthew Wilcox's on February 27, 2019 10:35 pm: > On Wed, Feb 27, 2019 at 04:03:25PM +1000, Nicholas Piggin wrote: >> Matthew Wilcox's on February 27, 2019 3:27 am: >> > 2. The setup overhead of the XA_STATE might be a problem. >> > If so, we can do some batching in order to improve things. >> > I suspect your test is calling __clear_shadow_entry through the >> > truncate_exceptional_pvec_entries() path, which is already a batch. >> > Maybe something like patch [1] at the end of this mail. >> >> One nasty thing about the XA_STATE stack object as opposed to just >> passing the parameters (in the same order) down to children is that >> you get the same memory accessed nearby, but in different ways >> (different base register, offset, addressing mode etc). Which can >> reduce effectiveness of memory disambiguation prediction, at least >> in cold predictor case. > > That is nasty. At the C level, it's a really attractive pattern. > Shame it doesn't work out so well on hardware. I wouldn't mind > turning shift/sibs/offset into a manually-extracted unsigned long > if that'll help with the addressing mode mispredictions? If you can get it to pass in registers by value. Some shifts or masks should be ~zero cost by comparison. > >> I've seen (on some POWER CPUs at least) flushes due to aliasing >> access in some of these xarray call chains, although no idea if >> that actually makes a noticable difference in microbenchmark like >> this. >> >> But it's not the greatest pattern to use for passing to low level >> performance critical functions :( Ideally the compiler could just >> do a big LTO pass right at the end and unwind it all back into >> registers and fix everything, but that will never happen. > > I wonder if we could get the compiler people to introduce a structure > attribute telling the compiler to pass this whole thing back-and-forth in > registers ... 6 registers is a lot to ask the compiler to reserve though. > Yeah I don't have a good idea, I think it may be a fundamentally hard problem for hardware, and it's very difficult for compiler. But yeah some special option for non-standard calling convention might be interesting. Thanks, Nick ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Truncate regression due to commit 69b6c1319b6 2019-02-26 17:27 ` Matthew Wilcox 2019-02-27 6:03 ` Nicholas Piggin @ 2019-02-27 11:27 ` Jan Kara 2019-02-27 12:24 ` Matthew Wilcox 1 sibling, 1 reply; 13+ messages in thread From: Jan Kara @ 2019-02-27 11:27 UTC (permalink / raw) To: Matthew Wilcox; +Cc: Jan Kara, linux-mm, mgorman On Tue 26-02-19 09:27:44, Matthew Wilcox wrote: > On Tue, Feb 26, 2019 at 05:56:28PM +0100, Jan Kara wrote: > > after some peripeties, I was able to bisect down to a regression in > > truncate performance caused by commit 69b6c1319b6 "mm: Convert truncate to > > XArray". > > [...] > > > I've gathered also perf profiles but from the first look they don't show > > anything surprising besides xas_load() and xas_store() taking up more time > > than original counterparts did. I'll try to dig more into this but any idea > > is appreciated. > > Well, that's a short and sweet little commit. Stripped of comment > changes, it's just: > > - struct radix_tree_node *node; > - void **slot; > + XA_STATE(xas, &mapping->i_pages, index); > > - if (!__radix_tree_lookup(&mapping->i_pages, index, &node, &slot)) > + xas_set_update(&xas, workingset_update_node); > + if (xas_load(&xas) != entry) > return; > - if (*slot != entry) > - return; > - __radix_tree_replace(&mapping->i_pages, node, slot, NULL, > - workingset_update_node); > + xas_store(&xas, NULL); Yes, the code change is small. Thanks to you splitting the changes, the regression is easier to analyze so thanks for that :) > I have a few reactions to this: > > 1. I'm concerned that the XArray may generally be slower than the radix > tree was. I didn't notice that in my testing, but maybe I didn't do > the right tests. So one difference I've noticed when staring into the code and annotated perf traces is that xas_store() will call xas_init_marks() when stored entry is 0. __radix_tree_replace() didn't do this. And the cache misses we take from checking tags do add up. After hacking the code in xas_store() so that __clear_shadow_entry() does not touch tags, I get around half of the regression back. For now I didn't think how to do this so that the API remains reasonably clean. So now we are at: COMMIT AVG STDDEV a97e7904c0 1431256.500000 1489.361759 69b6c1319b 1566944.000000 2252.692877 notaginit 1483740.700000 7680.583455 > 2. The setup overhead of the XA_STATE might be a problem. > If so, we can do some batching in order to improve things. > I suspect your test is calling __clear_shadow_entry through the > truncate_exceptional_pvec_entries() path, which is already a batch. > Maybe something like patch [1] at the end of this mail. So this apparently contributes as well but not too much. With your patch applied on top of 'notaginit' kernel above I've got to: batched-xas 1473900.300000 950.439377 > 3. Perhaps we can actually get rid of truncate_exceptional_pvec_entries(). > It seems a little daft for page_cache_delete_batch() to skip value > entries, only for truncate_exceptional_pvec_entries() to erase them in > a second pass. Truncation is truncation, and perhaps we can handle all > of it in one place? > > 4. Now that calling through a function pointer is expensive, thanks to > Spectre/Meltdown/..., I've been considering removing the general-purpose > update function, which is only used by the page cache. Instead move parts > of workingset.c into the XArray code and use a bit in the xa_flags to > indicate that the node should be tracked on an LRU if it contains only > value entries. 
I agree these two are good ideas to improve the speed. But old radix tree
code has these issues as well so they are not the reason of this
regression. So I'd like to track down where Xarray code is slower first.

I'm going to dig more into annotated profiles...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply [flat|nested] 13+ messages in thread
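For readers following the cache-miss argument above: the tag (mark) bitmaps live in the same node as the slots, but typically on different cache lines, so a store that also has to inspect or clear tags touches memory that a plain slot overwrite never does. A deliberately simplified picture of the layout (field names and sizes here are illustrative; see struct xa_node in include/linux/xarray.h for the real thing):

/* Deliberately simplified -- not the real struct xa_node. */
#define FAKE_CHUNK	64

struct fake_xa_node {
	unsigned char		shift;
	unsigned char		offset;
	unsigned char		count;			/* touched by any store */
	unsigned char		nr_values;
	struct fake_xa_node	*parent;
	void			*slots[FAKE_CHUNK];	/* the entry being cleared */
	unsigned long		marks[3][FAKE_CHUNK / 64];	/* only touched when
								 * tags are checked
								 * or cleared */
};
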
* Re: Truncate regression due to commit 69b6c1319b6 2019-02-27 11:27 ` Jan Kara @ 2019-02-27 12:24 ` Matthew Wilcox 2019-02-27 16:55 ` Jan Kara 0 siblings, 1 reply; 13+ messages in thread From: Matthew Wilcox @ 2019-02-27 12:24 UTC (permalink / raw) To: Jan Kara; +Cc: linux-mm, mgorman On Wed, Feb 27, 2019 at 12:27:21PM +0100, Jan Kara wrote: > On Tue 26-02-19 09:27:44, Matthew Wilcox wrote: > > On Tue, Feb 26, 2019 at 05:56:28PM +0100, Jan Kara wrote: > > > after some peripeties, I was able to bisect down to a regression in > > > truncate performance caused by commit 69b6c1319b6 "mm: Convert truncate to > > > XArray". > > > > [...] > > > > > I've gathered also perf profiles but from the first look they don't show > > > anything surprising besides xas_load() and xas_store() taking up more time > > > than original counterparts did. I'll try to dig more into this but any idea > > > is appreciated. > > > > Well, that's a short and sweet little commit. Stripped of comment > > changes, it's just: > > > > - struct radix_tree_node *node; > > - void **slot; > > + XA_STATE(xas, &mapping->i_pages, index); > > > > - if (!__radix_tree_lookup(&mapping->i_pages, index, &node, &slot)) > > + xas_set_update(&xas, workingset_update_node); > > + if (xas_load(&xas) != entry) > > return; > > - if (*slot != entry) > > - return; > > - __radix_tree_replace(&mapping->i_pages, node, slot, NULL, > > - workingset_update_node); > > + xas_store(&xas, NULL); > > Yes, the code change is small. Thanks to you splitting the changes, the > regression is easier to analyze so thanks for that :) That was why I split the patches so small. I'm kind of glad it paid off ... not glad to have caused a performance regression! > > I have a few reactions to this: > > > > 1. I'm concerned that the XArray may generally be slower than the radix > > tree was. I didn't notice that in my testing, but maybe I didn't do > > the right tests. > > So one difference I've noticed when staring into the code and annotated > perf traces is that xas_store() will call xas_init_marks() when stored > entry is 0. __radix_tree_replace() didn't do this. And the cache misses we > take from checking tags do add up. After hacking the code in xas_store() so > that __clear_shadow_entry() does not touch tags, I get around half of the > regression back. For now I didn't think how to do this so that the API > remains reasonably clean. So now we are at: > > COMMIT AVG STDDEV > a97e7904c0 1431256.500000 1489.361759 > 69b6c1319b 1566944.000000 2252.692877 > notaginit 1483740.700000 7680.583455 Well, that seems worth doing. For the page cache case, we know that shadow entries have no tags set (at least right now), so it seems reasonable to move the xas_init_marks() from xas_store() to its various callers. > > 2. The setup overhead of the XA_STATE might be a problem. > > If so, we can do some batching in order to improve things. > > I suspect your test is calling __clear_shadow_entry through the > > truncate_exceptional_pvec_entries() path, which is already a batch. > > Maybe something like patch [1] at the end of this mail. > > So this apparently contributes as well but not too much. With your patch > applied on top of 'notaginit' kernel above I've got to: > > batched-xas 1473900.300000 950.439377 Fascinating that it reduces the stddev so much. We can probably take this further (getting into the realm of #3 below) -- the call to xas_set() will restart the walk from the top of the tree each time. 
Clearly this usecase (many thousands of shadow entries) is going to construct a very deep tree, and we're effectively doing a linear scan over the bottom of the tree, so starting from the top each time is O(n.log n) instead of O(n). I think you said the file was 64GB, which is 16 million 4k entries, or 24 bits of tree index. That's 4 levels deep so it'll add up. > > 3. Perhaps we can actually get rid of truncate_exceptional_pvec_entries(). > > It seems a little daft for page_cache_delete_batch() to skip value > > entries, only for truncate_exceptional_pvec_entries() to erase them in > > a second pass. Truncation is truncation, and perhaps we can handle all > > of it in one place? > > > > 4. Now that calling through a function pointer is expensive, thanks to > > Spectre/Meltdown/..., I've been considering removing the general-purpose > > update function, which is only used by the page cache. Instead move parts > > of workingset.c into the XArray code and use a bit in the xa_flags to > > indicate that the node should be tracked on an LRU if it contains only > > value entries. > > I agree these two are good ideas to improve the speed. But old radix tree > code has these issues as well so they are not the reason of this > regression. So I'd like to track down where Xarray code is slower first. > > I'm going to dig more into annotated profiles... Thanks! I'll work on a patch to remove the xas_init_marks() from xas_store(). ^ permalink raw reply [flat|nested] 13+ messages in thread
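The depth arithmetic above is easy to double-check with a few lines of userspace C (a sketch only; 6 index bits per level corresponds to the usual 64-slot nodes, i.e. XA_CHUNK_SHIFT on a typical 64-bit configuration, and 4k pages are assumed):

#include <stdio.h>

/* Rough tree-depth estimate for a page-cache index space. */
static unsigned int tree_depth(unsigned long long file_bytes)
{
	unsigned long long entries = file_bytes >> 12;	/* 4k pages */
	unsigned int bits = 0;

	while ((1ULL << bits) < entries)
		bits++;
	return (bits + 5) / 6;		/* round up to whole 6-bit levels */
}

int main(void)
{
	printf("64GB file: %u levels\n", tree_depth(64ULL << 30));
	printf(" 1GB file: %u levels\n", tree_depth(1ULL << 30));
	return 0;
}
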
* Re: Truncate regression due to commit 69b6c1319b6 2019-02-27 12:24 ` Matthew Wilcox @ 2019-02-27 16:55 ` Jan Kara 2019-02-28 22:53 ` Matthew Wilcox 0 siblings, 1 reply; 13+ messages in thread From: Jan Kara @ 2019-02-27 16:55 UTC (permalink / raw) To: Matthew Wilcox; +Cc: Jan Kara, linux-mm, mgorman On Wed 27-02-19 04:24:51, Matthew Wilcox wrote: > On Wed, Feb 27, 2019 at 12:27:21PM +0100, Jan Kara wrote: > > On Tue 26-02-19 09:27:44, Matthew Wilcox wrote: > > > On Tue, Feb 26, 2019 at 05:56:28PM +0100, Jan Kara wrote: > > > > after some peripeties, I was able to bisect down to a regression in > > > > truncate performance caused by commit 69b6c1319b6 "mm: Convert truncate to > > > > XArray". > > > > > > [...] > > > > > > > I've gathered also perf profiles but from the first look they don't show > > > > anything surprising besides xas_load() and xas_store() taking up more time > > > > than original counterparts did. I'll try to dig more into this but any idea > > > > is appreciated. > > > > > > Well, that's a short and sweet little commit. Stripped of comment > > > changes, it's just: > > > > > > - struct radix_tree_node *node; > > > - void **slot; > > > + XA_STATE(xas, &mapping->i_pages, index); > > > > > > - if (!__radix_tree_lookup(&mapping->i_pages, index, &node, &slot)) > > > + xas_set_update(&xas, workingset_update_node); > > > + if (xas_load(&xas) != entry) > > > return; > > > - if (*slot != entry) > > > - return; > > > - __radix_tree_replace(&mapping->i_pages, node, slot, NULL, > > > - workingset_update_node); > > > + xas_store(&xas, NULL); > > > > Yes, the code change is small. Thanks to you splitting the changes, the > > regression is easier to analyze so thanks for that :) > > That was why I split the patches so small. I'm kind of glad it paid off ... > not glad to have caused a performance regression! > > > > I have a few reactions to this: > > > > > > 1. I'm concerned that the XArray may generally be slower than the radix > > > tree was. I didn't notice that in my testing, but maybe I didn't do > > > the right tests. > > > > So one difference I've noticed when staring into the code and annotated > > perf traces is that xas_store() will call xas_init_marks() when stored > > entry is 0. __radix_tree_replace() didn't do this. And the cache misses we > > take from checking tags do add up. After hacking the code in xas_store() so > > that __clear_shadow_entry() does not touch tags, I get around half of the > > regression back. For now I didn't think how to do this so that the API > > remains reasonably clean. So now we are at: > > > > COMMIT AVG STDDEV > > a97e7904c0 1431256.500000 1489.361759 > > 69b6c1319b 1566944.000000 2252.692877 > > notaginit 1483740.700000 7680.583455 > > Well, that seems worth doing. For the page cache case, we know that > shadow entries have no tags set (at least right now), so it seems > reasonable to move the xas_init_marks() from xas_store() to its various > callers. > > > > 2. The setup overhead of the XA_STATE might be a problem. > > > If so, we can do some batching in order to improve things. > > > I suspect your test is calling __clear_shadow_entry through the > > > truncate_exceptional_pvec_entries() path, which is already a batch. > > > Maybe something like patch [1] at the end of this mail. > > > > So this apparently contributes as well but not too much. With your patch > > applied on top of 'notaginit' kernel above I've got to: > > > > batched-xas 1473900.300000 950.439377 > > Fascinating that it reduces the stddev so much. 
> We can probably take this

I would not concentrate on the reduction of the stddev too much. The high
stddev for 'notaginit' is caused by one relatively big outlier (otherwise
we are at ~2200). But still, the patch reduced the stddev to about a half,
so there is some improvement.

> further (getting into the realm of #3 below) -- the call to xas_set() will
> restart the walk from the top of the tree each time. Clearly this usecase
> (many thousands of shadow entries) is going to construct a very deep tree,
> and we're effectively doing a linear scan over the bottom of the tree, so
> starting from the top each time is O(n.log n) instead of O(n). I think
> you said the file was 64GB, which is 16 million 4k entries, or 24 bits of
> tree index. That's 4 levels deep so it'll add up.

Actually the benchmark creates 64 files, 1GB each, so the depth of the tree
will be just 3. But yes, traversing from the top of the tree each time only
to zero out a single word there looks just wasteful. It seems we should be
able to have a much more efficient truncate implementation which would just
trim a whole node's worth of the xarray - workingset entries would be
directly destroyed, pages returned locked (provided they can be locked with
trylock).

Looking more into perf traces, I didn't notice any other obvious low
hanging fruit. There is one suboptimality in the fact that
__clear_shadow_entry() does xas_load() and the first thing xas_store() does
is xas_load() again. Removing this double tree traversal does bring
something back but since the traversals are so close together, everything
is cache hot and so the overall gain is small (but still):

COMMIT		AVG		STDDEV
singleiter	1467763.900000	1078.261049

So still 34 ms to go to the original time.

What profiles do show is there's slightly more time spent here and there
adding to overall larger xas_store() time (compared to
__radix_tree_replace()) mostly due to what I'd blame to cache misses
(xas_store() is responsible for ~3.4% of cache misses after the patch while
xas_store() + __radix_tree_replace() caused only 1.5% together before).

Some of the expensive loads seem to be from 'xas' structure (kind
of matches with what Nick said), other expensive loads seem to be loads
from xa_node. And I don't have a great explanation why e.g. a load of
xa_node->count is expensive when we looked at xa_node->shift before -
either the cache line fell out of cache or the byte accesses somehow
confuse the CPU. Also xas_store() has some new accesses compared to
__radix_tree_replace() - e.g. it did not previously touch node->shift.

So overall I don't see easy way how to speed up xarray code further so
maybe just batching truncate to make up for some of the losses and live
with them where we cannot batch will be as good as it gets...

								Honza
> > > > I agree these two are good ideas to improve the speed. But old radix tree > > code has these issues as well so they are not the reason of this > > regression. So I'd like to track down where Xarray code is slower first. > > > > I'm going to dig more into annotated profiles... > > Thanks! I'll work on a patch to remove the xas_init_marks() from xas_store(). -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 13+ messages in thread
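One way to picture the "trim a whole range in one pass" direction Jan sketches above is a single xas_for_each() walk that erases shadow entries as it goes, instead of re-walking from the root for every index. This is purely illustrative, not a proposed patch, and it ignores the lock ordering, DAX and partial-page cases that truncate_exceptional_pvec_entries() has to handle:

/* Illustrative sketch only: clear all shadow (value) entries in a range
 * with one tree walk.  Ignores DAX, huge pages and lock ordering. */
static void clear_shadow_entries_range(struct address_space *mapping,
				       pgoff_t start, pgoff_t end)
{
	XA_STATE(xas, &mapping->i_pages, start);
	void *entry;

	xas_set_update(&xas, workingset_update_node);
	xas_lock_irq(&xas);
	xas_for_each(&xas, entry, end) {
		if (!xa_is_value(entry))	/* skip real pages */
			continue;
		xas_store(&xas, NULL);
		mapping->nrexceptional--;
	}
	xas_unlock_irq(&xas);
}
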
* Re: Truncate regression due to commit 69b6c1319b6 2019-02-27 16:55 ` Jan Kara @ 2019-02-28 22:53 ` Matthew Wilcox 2019-03-14 11:10 ` Jan Kara 0 siblings, 1 reply; 13+ messages in thread From: Matthew Wilcox @ 2019-02-28 22:53 UTC (permalink / raw) To: Jan Kara; +Cc: linux-mm, mgorman On Wed, Feb 27, 2019 at 05:55:38PM +0100, Jan Kara wrote: > On Wed 27-02-19 04:24:51, Matthew Wilcox wrote: > Looking more into perf traces, I didn't notice any other obvious low > hanging fruit. There is one suboptimality in the fact that > __clear_shadow_entry() does xas_load() and the first thing xas_store() does > is xas_load() again. Removing this double tree traversal does bring > something back but since the traversals are so close together, everything > is cache hot and so the overall gain is small (but still): Calling xas_load() twice isn't too bad; the iterator stays at the base of the tree, so it's just one pointer reload. Still, I think we can avoid it. > COMMIT AVG STDDEV > singleiter 1467763.900000 1078.261049 > > So still 34 ms to go to the original time. > > What profiles do show is there's slightly more time spent here and there > adding to overall larger xas_store() time (compared to > __radix_tree_replace()) mostly due to what I'd blame to cache misses > (xas_store() is responsible for ~3.4% of cache misses after the patch while > xas_store() + __radix_tree_replace() caused only 1.5% together before). > > Some of the expensive loads seem to be from 'xas' structure (kind > of matches with what Nick said), other expensive loads seem to be loads from > xa_node. And I don't have a great explanation why e.g. a load of > xa_node->count is expensive when we looked at xa_node->shift before - > either the cache line fell out of cache or the byte accesses somehow > confuse the CPU. Also xas_store() has some new accesses compared to > __radix_tree_replace() - e.g. it did not previously touch node->shift. > > So overall I don't see easy way how to speed up xarray code further so > maybe just batching truncate to make up for some of the losses and live > with them where we cannot batch will be as good as it gets... Here's what I'm currently looking at. 
xas_store() becomes a wrapper around xas_replace() and xas_replace() avoids the xas_init_marks() and xas_load() calls: diff --git a/include/linux/xarray.h b/include/linux/xarray.h index 0e01e6129145..26fdba17ce67 100644 --- a/include/linux/xarray.h +++ b/include/linux/xarray.h @@ -1455,6 +1455,7 @@ static inline bool xas_retry(struct xa_state *xas, const void *entry) void *xas_load(struct xa_state *); void *xas_store(struct xa_state *, void *entry); +void xas_replace(struct xa_state *, void *curr, void *entry); void *xas_find(struct xa_state *, unsigned long max); void *xas_find_conflict(struct xa_state *); diff --git a/lib/test_xarray.c b/lib/test_xarray.c index 5d4bad8bd96a..b2e2cdf4eb74 100644 --- a/lib/test_xarray.c +++ b/lib/test_xarray.c @@ -38,6 +38,12 @@ static void *xa_store_index(struct xarray *xa, unsigned long index, gfp_t gfp) return xa_store(xa, index, xa_mk_index(index), gfp); } +static void xa_insert_index(struct xarray *xa, unsigned long index) +{ + XA_BUG_ON(xa, xa_insert(xa, index, xa_mk_index(index), + GFP_KERNEL) != 0); +} + static void xa_alloc_index(struct xarray *xa, unsigned long index, gfp_t gfp) { u32 id; @@ -338,6 +344,20 @@ static noinline void check_xa_shrink(struct xarray *xa) } } +static noinline void check_insert(struct xarray *xa) +{ + unsigned long i; + + for (i = 0; i < 1024; i++) { + xa_insert_index(xa, i); + XA_BUG_ON(xa, xa_load(xa, i - 1) != NULL); + XA_BUG_ON(xa, xa_load(xa, i + 1) != NULL); + xa_erase_index(xa, i); + } + + XA_BUG_ON(xa, !xa_empty(xa)); +} + static noinline void check_cmpxchg(struct xarray *xa) { void *FIVE = xa_mk_value(5); @@ -1527,6 +1547,7 @@ static int xarray_checks(void) check_xa_mark(&array); check_xa_shrink(&array); check_xas_erase(&array); + check_insert(&array); check_cmpxchg(&array); check_reserve(&array); check_reserve(&xa0); diff --git a/lib/xarray.c b/lib/xarray.c index 6be3acbb861f..8ff605bd0fee 100644 --- a/lib/xarray.c +++ b/lib/xarray.c @@ -613,7 +613,7 @@ static int xas_expand(struct xa_state *xas, void *head) /* * xas_create() - Create a slot to store an entry in. * @xas: XArray operation state. - * @allow_root: %true if we can store the entry in the root directly + * @entry: Entry which will be stored in the slot. * * Most users will not need to call this function directly, as it is called * by xas_store(). It is useful for doing conditional store operations @@ -623,14 +623,14 @@ static int xas_expand(struct xa_state *xas, void *head) * If the slot was newly created, returns %NULL. If it failed to create the * slot, returns %NULL and indicates the error in @xas. */ -static void *xas_create(struct xa_state *xas, bool allow_root) +static void *xas_create(struct xa_state *xas, void *entry) { struct xarray *xa = xas->xa; - void *entry; void __rcu **slot; struct xa_node *node = xas->xa_node; int shift; unsigned int order = xas->xa_shift; + bool allow_root = !xa_is_node(entry) && !xa_is_zero(entry); if (xas_top(node)) { entry = xa_head_locked(xa); @@ -701,7 +701,7 @@ void xas_create_range(struct xa_state *xas) xas->xa_sibs = 0; for (;;) { - xas_create(xas, true); + xas_create(xas, XA_ZERO_ENTRY); if (xas_error(xas)) goto restore; if (xas->xa_index <= (index | XA_CHUNK_MASK)) @@ -745,44 +745,36 @@ static void update_node(struct xa_state *xas, struct xa_node *node, } /** - * xas_store() - Store this entry in the XArray. + * xas_replace() - Replace one XArray entry with another. * @xas: XArray operation state. + * @curr: Current entry. * @entry: New entry. 
* - * If @xas is operating on a multi-index entry, the entry returned by this - * function is essentially meaningless (it may be an internal entry or it - * may be %NULL, even if there are non-NULL entries at some of the indices - * covered by the range). This is not a problem for any current users, - * and can be changed if needed. - * - * Return: The old entry at this index. + * This is not a cmpxchg operation. The caller asserts that @curr is the + * current entry at the index referred to by @xas and wishes to replace it + * with @entry. The slot must have already been created by xas_create() + * or by virtue of @curr being non-NULL. The marks are not changed by + * this operation. */ -void *xas_store(struct xa_state *xas, void *entry) +void xas_replace(struct xa_state *xas, void *curr, void *entry) { struct xa_node *node; void __rcu **slot = &xas->xa->xa_head; unsigned int offset, max; int count = 0; int values = 0; - void *first, *next; + void *next; bool value = xa_is_value(entry); - if (entry) { - bool allow_root = !xa_is_node(entry) && !xa_is_zero(entry); - first = xas_create(xas, allow_root); - } else { - first = xas_load(xas); - } - if (xas_invalid(xas)) - return first; + return; node = xas->xa_node; if (node && (xas->xa_shift < node->shift)) xas->xa_sibs = 0; - if ((first == entry) && !xas->xa_sibs) - return first; + if ((curr == entry) && !xas->xa_sibs) + return; - next = first; + next = curr; offset = xas->xa_offset; max = xas->xa_offset + xas->xa_sibs; if (node) { @@ -790,8 +782,6 @@ void *xas_store(struct xa_state *xas, void *entry) if (xas->xa_sibs) xas_squash_marks(xas); } - if (!entry) - xas_init_marks(xas); for (;;) { /* @@ -807,7 +797,7 @@ void *xas_store(struct xa_state *xas, void *entry) if (!node) break; count += !next - !entry; - values += !xa_is_value(first) - !value; + values += !xa_is_value(curr) - !value; if (entry) { if (offset == max) break; @@ -821,13 +811,41 @@ void *xas_store(struct xa_state *xas, void *entry) if (!xa_is_sibling(next)) { if (!entry && (offset > max)) break; - first = next; + curr = next; } slot++; } update_node(xas, node, count, values); - return first; +} + +/** + * xas_store() - Store this entry in the XArray. + * @xas: XArray operation state. + * @entry: New entry. + * + * If @xas is operating on a multi-index entry, the entry returned by this + * function is essentially meaningless (it may be an internal entry or it + * may be %NULL, even if there are non-NULL entries at some of the indices + * covered by the range). This is not a problem for any current users, + * and can be changed if needed. + * + * Return: The old entry at this index. 
+ */ +void *xas_store(struct xa_state *xas, void *entry) +{ + void *curr; + + if (entry) { + curr = xas_create(xas, entry); + } else { + curr = xas_load(xas); + if (curr) + xas_init_marks(xas); + } + + xas_replace(xas, curr, entry); + return curr; } EXPORT_SYMBOL_GPL(xas_store); @@ -1472,9 +1490,9 @@ int __xa_insert(struct xarray *xa, unsigned long index, void *entry, gfp_t gfp) entry = XA_ZERO_ENTRY; do { - curr = xas_load(&xas); + curr = xas_create(&xas, entry); if (!curr) { - xas_store(&xas, entry); + xas_replace(&xas, curr, entry); if (xa_track_free(xa)) xas_clear_mark(&xas, XA_FREE_MARK); } else { @@ -1553,7 +1571,7 @@ void *xa_store_range(struct xarray *xa, unsigned long first, if (last + 1) order = __ffs(last + 1); xas_set_order(&xas, last, order); - xas_create(&xas, true); + xas_create(&xas, entry); if (xas_error(&xas)) goto unlock; } @@ -1606,11 +1624,13 @@ int __xa_alloc(struct xarray *xa, u32 *id, void *entry, do { xas.xa_index = limit.min; xas_find_marked(&xas, limit.max, XA_FREE_MARK); - if (xas.xa_node == XAS_RESTART) + if (xas.xa_node == XAS_RESTART) { xas_set_err(&xas, -EBUSY); - else + } else { + xas_create(&xas, entry); *id = xas.xa_index; - xas_store(&xas, entry); + } + xas_replace(&xas, NULL, entry); xas_clear_mark(&xas, XA_FREE_MARK); } while (__xas_nomem(&xas, gfp)); diff --git a/mm/filemap.c b/mm/filemap.c index 9f5e323e883e..56a7ef579879 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -131,8 +131,8 @@ static void page_cache_delete(struct address_space *mapping, VM_BUG_ON_PAGE(PageTail(page), page); VM_BUG_ON_PAGE(nr != 1 && shadow, page); - xas_store(&xas, shadow); xas_init_marks(&xas); + xas_replace(&xas, page, shadow); page->mapping = NULL; /* Leave page->index set: truncation lookup relies upon it */ @@ -326,7 +326,7 @@ static void page_cache_delete_batch(struct address_space *mapping, != pvec->pages[i]->index, page); tail_pages--; } - xas_store(&xas, NULL); + xas_replace(&xas, page, NULL); total_pages++; } mapping->nrpages -= total_pages; @@ -771,7 +771,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask) new->index = offset; xas_lock_irqsave(&xas, flags); - xas_store(&xas, new); + xas_replace(&xas, old, new); old->mapping = NULL; /* hugetlb pages do not participate in page cache accounting. 
*/ diff --git a/mm/migrate.c b/mm/migrate.c index d4fd680be3b0..083f52797d11 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -459,13 +459,13 @@ int migrate_page_move_mapping(struct address_space *mapping, SetPageDirty(newpage); } - xas_store(&xas, newpage); + xas_replace(&xas, page, newpage); if (PageTransHuge(page)) { int i; for (i = 1; i < HPAGE_PMD_NR; i++) { xas_next(&xas); - xas_store(&xas, newpage + i); + xas_replace(&xas, page + i, newpage + i); } } @@ -536,7 +536,7 @@ int migrate_huge_page_move_mapping(struct address_space *mapping, get_page(newpage); - xas_store(&xas, newpage); + xas_replace(&xas, page, newpage); page_ref_unfreeze(page, expected_count - 1); diff --git a/mm/shmem.c b/mm/shmem.c index 6ece1e2fe76e..83925601089d 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -341,7 +341,7 @@ static int shmem_replace_entry(struct address_space *mapping, item = xas_load(&xas); if (item != expected) return -ENOENT; - xas_store(&xas, replacement); + xas_replace(&xas, item, replacement); return 0; } diff --git a/mm/truncate.c b/mm/truncate.c index 798e7ccfb030..0682b2f9ac0e 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -38,7 +38,7 @@ static inline void __clear_shadow_entry(struct address_space *mapping, xas_set_update(&xas, workingset_update_node); if (xas_load(&xas) != entry) return; - xas_store(&xas, NULL); + xas_replace(&xas, entry, NULL); mapping->nrexceptional--; } ^ permalink raw reply related [flat|nested] 13+ messages in thread
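To spell out the calling convention the patch introduces: a user of xas_replace() must already know the current entry, typically from an immediately preceding xas_load() under the same lock, and becomes responsible for any mark handling that xas_store() used to do implicitly when storing NULL. A condensed sketch of that contract, distilled from the call sites in the patch above ('expected' is a placeholder; retry and error handling are omitted):

/* Sketch of the xas_replace() contract; not an additional patch. */
static void erase_expected_entry(struct address_space *mapping,
				 pgoff_t index, void *expected)
{
	XA_STATE(xas, &mapping->i_pages, index);
	void *curr;

	xas_lock_irq(&xas);
	curr = xas_load(&xas);			/* look up the current entry */
	if (curr == expected) {
		xas_init_marks(&xas);		/* caller handles marks now */
		xas_replace(&xas, curr, NULL);	/* then replaces in place */
	}
	xas_unlock_irq(&xas);
}
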
* Re: Truncate regression due to commit 69b6c1319b6 2019-02-28 22:53 ` Matthew Wilcox @ 2019-03-14 11:10 ` Jan Kara 2019-05-31 19:04 ` Matthew Wilcox 0 siblings, 1 reply; 13+ messages in thread From: Jan Kara @ 2019-03-14 11:10 UTC (permalink / raw) To: Matthew Wilcox; +Cc: Jan Kara, linux-mm, mgorman Hi Matthew, On Thu 28-02-19 14:53:17, Matthew Wilcox wrote: > On Wed, Feb 27, 2019 at 05:55:38PM +0100, Jan Kara wrote: > > On Wed 27-02-19 04:24:51, Matthew Wilcox wrote: > > Looking more into perf traces, I didn't notice any other obvious low > > hanging fruit. There is one suboptimality in the fact that > > __clear_shadow_entry() does xas_load() and the first thing xas_store() does > > is xas_load() again. Removing this double tree traversal does bring > > something back but since the traversals are so close together, everything > > is cache hot and so the overall gain is small (but still): > > Calling xas_load() twice isn't too bad; the iterator stays at the base of > the tree, so it's just one pointer reload. Still, I think we can avoid it. > > > COMMIT AVG STDDEV > > singleiter 1467763.900000 1078.261049 > > > > So still 34 ms to go to the original time. > > > > What profiles do show is there's slightly more time spent here and there > > adding to overall larger xas_store() time (compared to > > __radix_tree_replace()) mostly due to what I'd blame to cache misses > > (xas_store() is responsible for ~3.4% of cache misses after the patch while > > xas_store() + __radix_tree_replace() caused only 1.5% together before). > > > > Some of the expensive loads seem to be from 'xas' structure (kind > > of matches with what Nick said), other expensive loads seem to be loads from > > xa_node. And I don't have a great explanation why e.g. a load of > > xa_node->count is expensive when we looked at xa_node->shift before - > > either the cache line fell out of cache or the byte accesses somehow > > confuse the CPU. Also xas_store() has some new accesses compared to > > __radix_tree_replace() - e.g. it did not previously touch node->shift. > > > > So overall I don't see easy way how to speed up xarray code further so > > maybe just batching truncate to make up for some of the losses and live > > with them where we cannot batch will be as good as it gets... > > Here's what I'm currently looking at. xas_store() becomes a wrapper > around xas_replace() and xas_replace() avoids the xas_init_marks() and > xas_load() calls: This looks reasonable to me. Do you have some official series I could test or where do we stand? 
Honza > > diff --git a/include/linux/xarray.h b/include/linux/xarray.h > index 0e01e6129145..26fdba17ce67 100644 > --- a/include/linux/xarray.h > +++ b/include/linux/xarray.h > @@ -1455,6 +1455,7 @@ static inline bool xas_retry(struct xa_state *xas, const void *entry) > > void *xas_load(struct xa_state *); > void *xas_store(struct xa_state *, void *entry); > +void xas_replace(struct xa_state *, void *curr, void *entry); > void *xas_find(struct xa_state *, unsigned long max); > void *xas_find_conflict(struct xa_state *); > > diff --git a/lib/test_xarray.c b/lib/test_xarray.c > index 5d4bad8bd96a..b2e2cdf4eb74 100644 > --- a/lib/test_xarray.c > +++ b/lib/test_xarray.c > @@ -38,6 +38,12 @@ static void *xa_store_index(struct xarray *xa, unsigned long index, gfp_t gfp) > return xa_store(xa, index, xa_mk_index(index), gfp); > } > > +static void xa_insert_index(struct xarray *xa, unsigned long index) > +{ > + XA_BUG_ON(xa, xa_insert(xa, index, xa_mk_index(index), > + GFP_KERNEL) != 0); > +} > + > static void xa_alloc_index(struct xarray *xa, unsigned long index, gfp_t gfp) > { > u32 id; > @@ -338,6 +344,20 @@ static noinline void check_xa_shrink(struct xarray *xa) > } > } > > +static noinline void check_insert(struct xarray *xa) > +{ > + unsigned long i; > + > + for (i = 0; i < 1024; i++) { > + xa_insert_index(xa, i); > + XA_BUG_ON(xa, xa_load(xa, i - 1) != NULL); > + XA_BUG_ON(xa, xa_load(xa, i + 1) != NULL); > + xa_erase_index(xa, i); > + } > + > + XA_BUG_ON(xa, !xa_empty(xa)); > +} > + > static noinline void check_cmpxchg(struct xarray *xa) > { > void *FIVE = xa_mk_value(5); > @@ -1527,6 +1547,7 @@ static int xarray_checks(void) > check_xa_mark(&array); > check_xa_shrink(&array); > check_xas_erase(&array); > + check_insert(&array); > check_cmpxchg(&array); > check_reserve(&array); > check_reserve(&xa0); > diff --git a/lib/xarray.c b/lib/xarray.c > index 6be3acbb861f..8ff605bd0fee 100644 > --- a/lib/xarray.c > +++ b/lib/xarray.c > @@ -613,7 +613,7 @@ static int xas_expand(struct xa_state *xas, void *head) > /* > * xas_create() - Create a slot to store an entry in. > * @xas: XArray operation state. > - * @allow_root: %true if we can store the entry in the root directly > + * @entry: Entry which will be stored in the slot. > * > * Most users will not need to call this function directly, as it is called > * by xas_store(). It is useful for doing conditional store operations > @@ -623,14 +623,14 @@ static int xas_expand(struct xa_state *xas, void *head) > * If the slot was newly created, returns %NULL. If it failed to create the > * slot, returns %NULL and indicates the error in @xas. > */ > -static void *xas_create(struct xa_state *xas, bool allow_root) > +static void *xas_create(struct xa_state *xas, void *entry) > { > struct xarray *xa = xas->xa; > - void *entry; > void __rcu **slot; > struct xa_node *node = xas->xa_node; > int shift; > unsigned int order = xas->xa_shift; > + bool allow_root = !xa_is_node(entry) && !xa_is_zero(entry); > > if (xas_top(node)) { > entry = xa_head_locked(xa); > @@ -701,7 +701,7 @@ void xas_create_range(struct xa_state *xas) > xas->xa_sibs = 0; > > for (;;) { > - xas_create(xas, true); > + xas_create(xas, XA_ZERO_ENTRY); > if (xas_error(xas)) > goto restore; > if (xas->xa_index <= (index | XA_CHUNK_MASK)) > @@ -745,44 +745,36 @@ static void update_node(struct xa_state *xas, struct xa_node *node, > } > > /** > - * xas_store() - Store this entry in the XArray. > + * xas_replace() - Replace one XArray entry with another. > * @xas: XArray operation state. 
> + * @curr: Current entry. > * @entry: New entry. > * > - * If @xas is operating on a multi-index entry, the entry returned by this > - * function is essentially meaningless (it may be an internal entry or it > - * may be %NULL, even if there are non-NULL entries at some of the indices > - * covered by the range). This is not a problem for any current users, > - * and can be changed if needed. > - * > - * Return: The old entry at this index. > + * This is not a cmpxchg operation. The caller asserts that @curr is the > + * current entry at the index referred to by @xas and wishes to replace it > + * with @entry. The slot must have already been created by xas_create() > + * or by virtue of @curr being non-NULL. The marks are not changed by > + * this operation. > */ > -void *xas_store(struct xa_state *xas, void *entry) > +void xas_replace(struct xa_state *xas, void *curr, void *entry) > { > struct xa_node *node; > void __rcu **slot = &xas->xa->xa_head; > unsigned int offset, max; > int count = 0; > int values = 0; > - void *first, *next; > + void *next; > bool value = xa_is_value(entry); > > - if (entry) { > - bool allow_root = !xa_is_node(entry) && !xa_is_zero(entry); > - first = xas_create(xas, allow_root); > - } else { > - first = xas_load(xas); > - } > - > if (xas_invalid(xas)) > - return first; > + return; > node = xas->xa_node; > if (node && (xas->xa_shift < node->shift)) > xas->xa_sibs = 0; > - if ((first == entry) && !xas->xa_sibs) > - return first; > + if ((curr == entry) && !xas->xa_sibs) > + return; > > - next = first; > + next = curr; > offset = xas->xa_offset; > max = xas->xa_offset + xas->xa_sibs; > if (node) { > @@ -790,8 +782,6 @@ void *xas_store(struct xa_state *xas, void *entry) > if (xas->xa_sibs) > xas_squash_marks(xas); > } > - if (!entry) > - xas_init_marks(xas); > > for (;;) { > /* > @@ -807,7 +797,7 @@ void *xas_store(struct xa_state *xas, void *entry) > if (!node) > break; > count += !next - !entry; > - values += !xa_is_value(first) - !value; > + values += !xa_is_value(curr) - !value; > if (entry) { > if (offset == max) > break; > @@ -821,13 +811,41 @@ void *xas_store(struct xa_state *xas, void *entry) > if (!xa_is_sibling(next)) { > if (!entry && (offset > max)) > break; > - first = next; > + curr = next; > } > slot++; > } > > update_node(xas, node, count, values); > - return first; > +} > + > +/** > + * xas_store() - Store this entry in the XArray. > + * @xas: XArray operation state. > + * @entry: New entry. > + * > + * If @xas is operating on a multi-index entry, the entry returned by this > + * function is essentially meaningless (it may be an internal entry or it > + * may be %NULL, even if there are non-NULL entries at some of the indices > + * covered by the range). This is not a problem for any current users, > + * and can be changed if needed. > + * > + * Return: The old entry at this index. 
> + */ > +void *xas_store(struct xa_state *xas, void *entry) > +{ > + void *curr; > + > + if (entry) { > + curr = xas_create(xas, entry); > + } else { > + curr = xas_load(xas); > + if (curr) > + xas_init_marks(xas); > + } > + > + xas_replace(xas, curr, entry); > + return curr; > } > EXPORT_SYMBOL_GPL(xas_store); > > @@ -1472,9 +1490,9 @@ int __xa_insert(struct xarray *xa, unsigned long index, void *entry, gfp_t gfp) > entry = XA_ZERO_ENTRY; > > do { > - curr = xas_load(&xas); > + curr = xas_create(&xas, entry); > if (!curr) { > - xas_store(&xas, entry); > + xas_replace(&xas, curr, entry); > if (xa_track_free(xa)) > xas_clear_mark(&xas, XA_FREE_MARK); > } else { > @@ -1553,7 +1571,7 @@ void *xa_store_range(struct xarray *xa, unsigned long first, > if (last + 1) > order = __ffs(last + 1); > xas_set_order(&xas, last, order); > - xas_create(&xas, true); > + xas_create(&xas, entry); > if (xas_error(&xas)) > goto unlock; > } > @@ -1606,11 +1624,13 @@ int __xa_alloc(struct xarray *xa, u32 *id, void *entry, > do { > xas.xa_index = limit.min; > xas_find_marked(&xas, limit.max, XA_FREE_MARK); > - if (xas.xa_node == XAS_RESTART) > + if (xas.xa_node == XAS_RESTART) { > xas_set_err(&xas, -EBUSY); > - else > + } else { > + xas_create(&xas, entry); > *id = xas.xa_index; > - xas_store(&xas, entry); > + } > + xas_replace(&xas, NULL, entry); > xas_clear_mark(&xas, XA_FREE_MARK); > } while (__xas_nomem(&xas, gfp)); > > diff --git a/mm/filemap.c b/mm/filemap.c > index 9f5e323e883e..56a7ef579879 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -131,8 +131,8 @@ static void page_cache_delete(struct address_space *mapping, > VM_BUG_ON_PAGE(PageTail(page), page); > VM_BUG_ON_PAGE(nr != 1 && shadow, page); > > - xas_store(&xas, shadow); > xas_init_marks(&xas); > + xas_replace(&xas, page, shadow); > > page->mapping = NULL; > /* Leave page->index set: truncation lookup relies upon it */ > @@ -326,7 +326,7 @@ static void page_cache_delete_batch(struct address_space *mapping, > != pvec->pages[i]->index, page); > tail_pages--; > } > - xas_store(&xas, NULL); > + xas_replace(&xas, page, NULL); > total_pages++; > } > mapping->nrpages -= total_pages; > @@ -771,7 +771,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask) > new->index = offset; > > xas_lock_irqsave(&xas, flags); > - xas_store(&xas, new); > + xas_replace(&xas, old, new); > > old->mapping = NULL; > /* hugetlb pages do not participate in page cache accounting. 
*/ > diff --git a/mm/migrate.c b/mm/migrate.c > index d4fd680be3b0..083f52797d11 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -459,13 +459,13 @@ int migrate_page_move_mapping(struct address_space *mapping, > SetPageDirty(newpage); > } > > - xas_store(&xas, newpage); > + xas_replace(&xas, page, newpage); > if (PageTransHuge(page)) { > int i; > > for (i = 1; i < HPAGE_PMD_NR; i++) { > xas_next(&xas); > - xas_store(&xas, newpage + i); > + xas_replace(&xas, page + i, newpage + i); > } > } > > @@ -536,7 +536,7 @@ int migrate_huge_page_move_mapping(struct address_space *mapping, > > get_page(newpage); > > - xas_store(&xas, newpage); > + xas_replace(&xas, page, newpage); > > page_ref_unfreeze(page, expected_count - 1); > > diff --git a/mm/shmem.c b/mm/shmem.c > index 6ece1e2fe76e..83925601089d 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -341,7 +341,7 @@ static int shmem_replace_entry(struct address_space *mapping, > item = xas_load(&xas); > if (item != expected) > return -ENOENT; > - xas_store(&xas, replacement); > + xas_replace(&xas, item, replacement); > return 0; > } > > diff --git a/mm/truncate.c b/mm/truncate.c > index 798e7ccfb030..0682b2f9ac0e 100644 > --- a/mm/truncate.c > +++ b/mm/truncate.c > @@ -38,7 +38,7 @@ static inline void __clear_shadow_entry(struct address_space *mapping, > xas_set_update(&xas, workingset_update_node); > if (xas_load(&xas) != entry) > return; > - xas_store(&xas, NULL); > + xas_replace(&xas, entry, NULL); > mapping->nrexceptional--; > } > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Truncate regression due to commit 69b6c1319b6 2019-03-14 11:10 ` Jan Kara @ 2019-05-31 19:04 ` Matthew Wilcox 2019-08-27 13:29 ` Jan Kara 0 siblings, 1 reply; 13+ messages in thread From: Matthew Wilcox @ 2019-05-31 19:04 UTC (permalink / raw) To: Jan Kara; +Cc: linux-mm, mgorman On Thu, Mar 14, 2019 at 12:10:12PM +0100, Jan Kara wrote: > On Thu 28-02-19 14:53:17, Matthew Wilcox wrote: > > Here's what I'm currently looking at. xas_store() becomes a wrapper > > around xas_replace() and xas_replace() avoids the xas_init_marks() and > > xas_load() calls: > > This looks reasonable to me. Do you have some official series I could test > or where do we stand? Hi Jan, Sorry for the delay; I've put this into the xarray tree: http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray I'm planning to ask Linus to pull it in about a week. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Truncate regression due to commit 69b6c1319b6 2019-05-31 19:04 ` Matthew Wilcox @ 2019-08-27 13:29 ` Jan Kara 2019-08-29 11:59 ` Matthew Wilcox 0 siblings, 1 reply; 13+ messages in thread From: Jan Kara @ 2019-08-27 13:29 UTC (permalink / raw) To: Matthew Wilcox; +Cc: Jan Kara, linux-mm, mgorman On Fri 31-05-19 12:04:31, Matthew Wilcox wrote: > On Thu, Mar 14, 2019 at 12:10:12PM +0100, Jan Kara wrote: > > On Thu 28-02-19 14:53:17, Matthew Wilcox wrote: > > > Here's what I'm currently looking at. xas_store() becomes a wrapper > > > around xas_replace() and xas_replace() avoids the xas_init_marks() and > > > xas_load() calls: > > > > This looks reasonable to me. Do you have some official series I could test > > or where do we stand? > > Hi Jan, > > Sorry for the delay; I've put this into the xarray tree: > > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray > > I'm planning to ask Linus to pull it in about a week. Hum, I still don't see these xarray changes (the change to use xas_replace() in particular) upstream. What has happened? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Truncate regression due to commit 69b6c1319b6 2019-08-27 13:29 ` Jan Kara @ 2019-08-29 11:59 ` Matthew Wilcox 0 siblings, 0 replies; 13+ messages in thread From: Matthew Wilcox @ 2019-08-29 11:59 UTC (permalink / raw) To: Jan Kara; +Cc: linux-mm, mgorman On Tue, Aug 27, 2019 at 03:29:25PM +0200, Jan Kara wrote: > On Fri 31-05-19 12:04:31, Matthew Wilcox wrote: > > On Thu, Mar 14, 2019 at 12:10:12PM +0100, Jan Kara wrote: > > > On Thu 28-02-19 14:53:17, Matthew Wilcox wrote: > > > > Here's what I'm currently looking at. xas_store() becomes a wrapper > > > > around xas_replace() and xas_replace() avoids the xas_init_marks() and > > > > xas_load() calls: > > > > > > This looks reasonable to me. Do you have some official series I could test > > > or where do we stand? > > > > Hi Jan, > > > > Sorry for the delay; I've put this into the xarray tree: > > > > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray > > > > I'm planning to ask Linus to pull it in about a week. > > Hum, I still don't see these xarray changes (the change to use > xas_replace() in particular) upstream. What has happened? It had a bug and I decided to pull the patch for now rather than find the bug ... this regression is still on my mind. Thanks for the ping. ^ permalink raw reply [flat|nested] 13+ messages in thread