linux-ext4.vger.kernel.org archive mirror
* ext4 extent status tree LRU locking
@ 2013-06-11 23:22 Dave Hansen
  2013-06-12  7:17 ` Zheng Liu
  2013-06-14 14:09 ` Zheng Liu
  0 siblings, 2 replies; 22+ messages in thread
From: Dave Hansen @ 2013-06-11 23:22 UTC (permalink / raw)
  To: linux-ext4, LKML, Theodore Ts'o, Jan kara

I've got a test case which I intended to use to stress the VM a bit.  It
fills memory up with page cache a couple of times.  It essentially runs
30 or so cp's in parallel.

98% of my CPU is system time, and 96% of _that_ is being spent on the
spinlock in ext4_es_lru_add().  I think the LRU list head and its lock
end up being *REALLY* hot cachelines and are *the* bottleneck on this
test.  Note that this is _before_ we go in to reclaim and actually start
calling in to the shrinker.  There is zero memory pressure in this test.

I'm not sure the benefits of having a proper in-order LRU during reclaim
outweigh such a drastic downside for the common case.

Any thoughts?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ext4 extent status tree LRU locking
  2013-06-11 23:22 ext4 extent status tree LRU locking Dave Hansen
@ 2013-06-12  7:17 ` Zheng Liu
  2013-06-12 15:09   ` Dave Hansen
  2013-06-14 14:09 ` Zheng Liu
  1 sibling, 1 reply; 22+ messages in thread
From: Zheng Liu @ 2013-06-12  7:17 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-ext4, LKML, Theodore Ts'o, Jan kara

Hi Dave

Thanks for reporting this.

On Tue, Jun 11, 2013 at 04:22:16PM -0700, Dave Hansen wrote:
> I've got a test case which I intended to use to stress the VM a bit.  It
> fills memory up with page cache a couple of times.  It essentially runs
> 30 or so cp's in parallel.

Could you please share your test case with me? I am glad to look at it
and think about how to improve LRU locking.

> 
> 98% of my CPU is system time, and 96% of _that_ is being spent on the
> spinlock in ext4_es_lru_add().  I think the LRU list head and its lock
> end up being *REALLY* hot cachelines and are *the* bottleneck on this
> test.  Note that this is _before_ we go in to reclaim and actually start
> calling in to the shrinker.  There is zero memory pressure in this test.
> 
> I'm not sure the benefits of having a proper in-order LRU during reclaim
> outweigh such a drastic downside for the common case.

A proper in-order LRU can help us to reclaim some memory from extent
status tree when we are under heavy memory pressure.  When shrinker
tries to reclaim extents from these trees, some extents of files that
are accessed infrequently will be reclaimed because we hope that
frequently accessed files' extents can be kept in memory as much as
possible.  That is why we need a proper in-order LRU list.

Regards,
                                                - Zheng

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ext4 extent status tree LRU locking
  2013-06-12  7:17 ` Zheng Liu
@ 2013-06-12 15:09   ` Dave Hansen
  2013-06-12 16:03     ` Zheng Liu
  2013-06-12 20:48     ` Theodore Ts'o
  0 siblings, 2 replies; 22+ messages in thread
From: Dave Hansen @ 2013-06-12 15:09 UTC (permalink / raw)
  To: linux-ext4, LKML, Theodore Ts'o, Jan kara, gnehzuil.liu

On 06/12/2013 12:17 AM, Zheng Liu wrote:
> On Tue, Jun 11, 2013 at 04:22:16PM -0700, Dave Hansen wrote:
>> I've got a test case which I intended to use to stress the VM a bit.  It
>> fills memory up with page cache a couple of times.  It essentially runs
>> 30 or so cp's in parallel.
> 
> Could you please share your test case with me? I am glad to look at it
> and think about how to improve LRU locking.

I'll look in to giving you the actual test case.  But I'm not sure of
the licensing on it.

Essentially, the test creates a small (~256MB) ext4 fs on a
loopback-mounted ramfs device.  It then goes and creates 160 64GB sparse
files (one per cpu) and then cp's them all to /dev/null.

>> 98% of my CPU is system time, and 96% of _that_ is being spent on the
>> spinlock in ext4_es_lru_add().  I think the LRU list head and its lock
>> end up being *REALLY* hot cachelines and are *the* bottleneck on this
>> test.  Note that this is _before_ we go in to reclaim and actually start
>> calling in to the shrinker.  There is zero memory pressure in this test.
>>
>> I'm not sure the benefits of having a proper in-order LRU during reclaim
>> outweigh such a drastic downside for the common case.
> 
> A proper in-order LRU can help us to reclaim some memory from extent
> status tree when we are under heavy memory pressure.  When shrinker
> tries to reclaim extents from these trees, some extents of files that
> are accessed infrequently will be reclaimed because we hope that
> frequently accessed files' extents can be kept in memory as much as
> possible.  That is why we need a proper in-order LRU list.

Does it need to be _strictly_ in order, though?  In other words, do you
truly need a *global*, perfectly in-order LRU?

You could make per-cpu LRUs, and batch movement on and off the global
LRU once the local ones get to be a certain size.  Or, you could keep
them cpu-local *until* the shrinker is called, when the shrinker could
go drain all the percpu ones.

Or, you could tag each extent in memory with its last-used time.  You
write an algorithm to go and walk the tree and attempt to _generally_
free the oldest objects out of a limited window.
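
A rough sketch of the first (per-cpu batching) idea above, illustrative
only: the es_lru_pcpu structure, the ES_LRU_BATCH constant and the helper
name are made up, the per-cpu lists are assumed to be initialized at
mount time, and cross-CPU races are glossed over.

#define ES_LRU_BATCH	16

struct es_lru_pcpu {
	struct list_head list;		/* inodes queued on this CPU */
	unsigned int nr;		/* number of queued inodes */
};

static DEFINE_PER_CPU(struct es_lru_pcpu, es_lru_pcpu);

static void es_lru_add_batched(struct ext4_sb_info *sbi,
			       struct ext4_inode_info *ei)
{
	struct es_lru_pcpu *pcpu = get_cpu_ptr(&es_lru_pcpu);

	/* Queue the inode locally; no global lock is taken yet. */
	if (list_empty(&ei->i_es_lru)) {
		list_add_tail(&ei->i_es_lru, &pcpu->list);
		pcpu->nr++;
	}

	/* Only take the contended global lock once per batch. */
	if (pcpu->nr >= ES_LRU_BATCH) {
		spin_lock(&sbi->s_es_lru_lock);
		list_splice_tail_init(&pcpu->list, &sbi->s_es_lru);
		spin_unlock(&sbi->s_es_lru_lock);
		pcpu->nr = 0;
	}
	put_cpu_ptr(&es_lru_pcpu);
}

The shrinker would then splice any remaining per-cpu lists into the
global list before walking it.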

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ext4 extent status tree LRU locking
  2013-06-12 15:09   ` Dave Hansen
@ 2013-06-12 16:03     ` Zheng Liu
  2013-06-12 17:52       ` Dave Hansen
  2013-06-12 20:48     ` Theodore Ts'o
  1 sibling, 1 reply; 22+ messages in thread
From: Zheng Liu @ 2013-06-12 16:03 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-ext4, LKML, Theodore Ts'o, Jan kara

On Wed, Jun 12, 2013 at 08:09:14AM -0700, Dave Hansen wrote:
> On 06/12/2013 12:17 AM, Zheng Liu wrote:
> > On Tue, Jun 11, 2013 at 04:22:16PM -0700, Dave Hansen wrote:
> >> I've got a test case which I intended to use to stress the VM a bit.  It
> >> fills memory up with page cache a couple of times.  It essentially runs
> >> 30 or so cp's in parallel.
> > 
> > Could you please share your test case with me? I am glad to look at it
> > and think about how to improve LRU locking.
> 
> I'll look in to giving you the actual test case.  But I'm not sure of
> the licensing on it.

That would be great if you could share it.

> 
> Essentially, the test creates a small (~256MB) ext4 fs on a
> loopback-mounted ramfs device.  It then goes and creates 160 64GB sparse
> files (one per cpu) and then cp's them all to /dev/null.

Thanks for letting me know.

> 
> >> 98% of my CPU is system time, and 96% of _that_ is being spent on the
> >> spinlock in ext4_es_lru_add().  I think the LRU list head and its lock
> >> end up being *REALLY* hot cachelines and are *the* bottleneck on this
> >> test.  Note that this is _before_ we go in to reclaim and actually start
> >> calling in to the shrinker.  There is zero memory pressure in this test.
> >>
> >> I'm not sure the benefits of having a proper in-order LRU during reclaim
> >> outweigh such a drastic downside for the common case.
> > 
> > A proper in-order LRU can help us to reclaim some memory from extent
> > status tree when we are under heavy memory pressure.  When shrinker
> > tries to reclaim extents from these trees, some extents of files that
> > are accessed infrequently will be reclaimed because we hope that
> > frequently accessed files' extents can be kept in memory as much as
> > possible.  That is why we need a proper in-order LRU list.
> 
> Does it need to be _strictly_ in order, though?  In other words, do you
> truly need a *global*, perfectly in-order LRU?
> 
> You could make per-cpu LRUs, and batch movement on and off the global
> LRU once the local ones get to be a certain size.  Or, you could keep
> them cpu-local *until* the shrinker is called, when the shrinker could
> go drain all the percpu ones.
> 
> Or, you could tag each extent in memory with its last-used time.  You
> write an algorithm to go and walk the tree and attempt to _generally_
> free the oldest objects out of a limited window.

Thanks for your suggestions.  I will try these solutions, and look at
which one is best for us.

Regards,
                                                - Zheng

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ext4 extent status tree LRU locking
  2013-06-12 16:03     ` Zheng Liu
@ 2013-06-12 17:52       ` Dave Hansen
  0 siblings, 0 replies; 22+ messages in thread
From: Dave Hansen @ 2013-06-12 17:52 UTC (permalink / raw)
  To: Dave Hansen, linux-ext4, LKML, Theodore Ts'o, Jan kara,
	gnehzuil.liu

FWIW, I'm also seeing this contention to a much smaller extent on much
more "normal" hardware.  I've got a 6-core (as opposed to 80) desktop
machine where I'm doing the same test on a real disk (instead of a
loopback-mounted ramfs file).

I'm seeing _about_ 10% CPU time being used on just spinning on the same
ext4 LRU lock.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ext4 extent status tree LRU locking
  2013-06-12 15:09   ` Dave Hansen
  2013-06-12 16:03     ` Zheng Liu
@ 2013-06-12 20:48     ` Theodore Ts'o
  2013-06-13 13:27       ` Zheng Liu
  1 sibling, 1 reply; 22+ messages in thread
From: Theodore Ts'o @ 2013-06-12 20:48 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-ext4, LKML, Jan kara, gnehzuil.liu

On Wed, Jun 12, 2013 at 08:09:14AM -0700, Dave Hansen wrote:
> You could make per-cpu LRUs, and batch movement on and off the global
> LRU once the local ones get to be a certain size.  Or, you could keep
> them cpu-local *until* the shrinker is called, when the shrinker could
> go drain all the percpu ones.
> 
> Or, you could tag each extent in memory with its last-used time.  You
> write an algorithm to go and walk the tree and attempt to _generally_
> free the oldest objects out of a limited window.

Another approach might be to tag each inode with the last time an
ext4_map_blocks() was called on the inode, and keep an unsorted list
of inodes which have one or more entries in the extent cache.

When we decide to discard entries from the extent cache, we should
drop all of the entries for the inode --- and then when we read in
part of the extent tree leaf block, we should create entries in the
extent cache for all of the extents found in the extent leaf block.
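
As a rough illustration of the repopulation half of this idea (a sketch
only: the helper name is made up, it assumes the ext4_es_insert_extent()
signature used elsewhere in this thread, and it ignores unwritten
extents, which would need EXTENT_STATUS_UNWRITTEN instead):

static void es_cache_whole_leaf(struct inode *inode,
				struct ext4_extent_header *eh)
{
	struct ext4_extent *ex = EXT_FIRST_EXTENT(eh);
	int i;

	/* Push every extent found in this leaf into the extent cache. */
	for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++)
		ext4_es_insert_extent(inode,
				      le32_to_cpu(ex->ee_block),
				      ext4_ext_get_actual_len(ex),
				      ext4_ext_pblock(ex),
				      EXTENT_STATUS_WRITTEN);
}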

       	     	     	    	    	     - Ted

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ext4 extent status tree LRU locking
  2013-06-12 20:48     ` Theodore Ts'o
@ 2013-06-13 13:27       ` Zheng Liu
  2013-06-13 13:35         ` Theodore Ts'o
  0 siblings, 1 reply; 22+ messages in thread
From: Zheng Liu @ 2013-06-13 13:27 UTC (permalink / raw)
  To: Theodore Ts'o, Dave Hansen, linux-ext4, LKML, Jan kara

Hi Ted,

On Wed, Jun 12, 2013 at 04:48:54PM -0400, Theodore Ts'o wrote:
> On Wed, Jun 12, 2013 at 08:09:14AM -0700, Dave Hansen wrote:
> > You could make per-cpu LRUs, and batch movement on and off the global
> > LRU once the local ones get to be a certain size.  Or, you could keep
> > them cpu-local *until* the shrinker is called, when the shrinker could
> > go drain all the percpu ones.
> > 
> > Or, you could tag each extent in memory with its last-used time.  You
> > write an algorithm to go and walk the tree and attempt to _generally_
> > free the oldest objects out of a limited window.
> 
> Another approach might be to tag each inode with the last time an
> ext4_map_blocks() was called on the inode, and keep an unsorted list
> of inodes which have one or more entries in the extent cache.
> 
> When we decide to discard entries from the extent cache, we should
> drop all of the entries for the inode --- and then when we read in
> part of the extent tree leaf block, we should create entries in the
> extent cache for all of the extents found in the extent leaf block.

Thanks for your suggestion.  But, sorry, I couldn't get your point here.
As you suggested above, we can tag each inode with the last access time
when ext4_map_blocks() is called.  Then we will get an unsorted list of
inodes with some extent entries.  When we try to reclaim some entries
from the extent cache, we can call list_sort() to get a sorted list of
inodes and try to reclaim some entries according to the last access
time.  My question is: why do we need to drop all of the entries from all
inodes?

As far as I understand, we can do the following improvement.  We tag the
last access time for each inode.  That would avoid moving the inode in
the LRU list frequently.  When we try to reclaim some entries, we call
list_sort() to get a sorted LRU list, and drop some entries from the LRU
list.  Please correct me if I misunderstood.

As for creating cache entries for all extents in a leaf block, that is
another improvement.  I will try to add it after I solve the current
problem that Dave reported.

Thanks,
                                                - Zheng

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ext4 extent status tree LRU locking
  2013-06-13 13:27       ` Zheng Liu
@ 2013-06-13 13:35         ` Theodore Ts'o
  2013-06-14  3:27           ` Zheng Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Theodore Ts'o @ 2013-06-13 13:35 UTC (permalink / raw)
  To: Dave Hansen, linux-ext4, LKML, Jan kara

On Thu, Jun 13, 2013 at 09:27:15PM +0800, Zheng Liu wrote:
> Thanks for your suggestion.  But, sorry, I couldn't get your point here.
> As you suggested above, we can tag each inode with the last access time
> when ext4_map_blocks() is called.  Then we will get an unsorted list of
> inodes with some extent entries.  When we try to reclaim some entries
> from the extent cache, we can call list_sort() to get a sorted list of
> inodes and try to reclaim some entries according to the last access
> time.  My question is: why do we need to drop all of the entries from all
> inodes?

Sorry, what I meant to say is to drop all of the entries from a
particular inode.  That is, instead of making the decision to drop one
or two extents from inode A because they haven't been used in a long
time, or one or two extents from inode B because they haven't been
used in a long time, we make the decision on an inode by inode basis.

> As far as I understand, we can do the following improvement.  We tag the
> last access time for each inode.  That would avoid moving the inode in
> the LRU list frequently.  When we try to reclaim some entries, we call
> list_sort() to get a sorted LRU list, and drop some entries from the LRU
> list.  Please correct me if I misunderstood.

Yes, that's what I meant; although note that depending on how many
items we need to drop, it might not be necessary to do a full sort of
the list --- we could just look for the inode with the oldest time,
which would be O(n) rather than O(n log n).
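
For instance, a single pass over the list could find the least recently
touched inode without sorting (a sketch that borrows the i_touch_when
and i_es_lru fields from the patch later in this thread; the helper
itself is hypothetical and assumes the caller holds s_es_lru_lock):

static struct ext4_inode_info *es_find_oldest(struct ext4_sb_info *sbi)
{
	struct ext4_inode_info *ei, *oldest = NULL;

	list_for_each_entry(ei, &sbi->s_es_lru, i_es_lru) {
		if (!oldest ||
		    time_before(ei->i_touch_when, oldest->i_touch_when))
			oldest = ei;
	}
	return oldest;
}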

If we are under enough memory pressure that the shrinker is getting
called a lot, maybe it would be worthwhile to keep a list of the oldest N
inodes, which only gets updated every so many seconds.  (Since we can
always validate whether or not an inode has been touched since it had
been put on the oldest N inodes very easily.)

> As for creating cache entries for all extents in a leaf block, that is
> another improvement.  I will try to add it after I solve the current
> problem that Dave reported.

Yes, it's a separable issue, but it's related to the same idea.  We
should drop items in the cache which can be easily instantiated at the
same time (i.e., drop all of the items related to a particular
extent leaf block, or all of the items related to a particular inode).

There are, of course, other heuristics we could use, such as only
searching inodes which have open file descriptors.  It's not clear the
benefit is worth the cost of determining which inodes fit a particular
criteria or not, though.

One other possibility while we're brainstorming is using hints from
posix_fadvise or madvise to drop items from the extent cache.  The
cautions with this idea are (a) not that many applications provide
those hints, and (b) just because one process declares that it doesn't
anticipate needing to access a particular file, doesn't mean that it
is true for other users of the same file.  A related classical problem
is the backup program which reads every single file in the file system
once.  There's no particular good reason to have the backup program's
activities affect the caches in the system, whether it be the page
cache or the extent cache.  The trick is determining whether or not a
particular program is behaving like a backup program or not, either
via a heuristic or via an explicit hint/declaration from the program.

Cheers,

      		       	  	   		    - Ted

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ext4 extent status tree LRU locking
  2013-06-13 13:35         ` Theodore Ts'o
@ 2013-06-14  3:27           ` Zheng Liu
  0 siblings, 0 replies; 22+ messages in thread
From: Zheng Liu @ 2013-06-14  3:27 UTC (permalink / raw)
  To: Theodore Ts'o, Dave Hansen, linux-ext4, LKML, Jan kara

On Thu, Jun 13, 2013 at 09:35:17AM -0400, Theodore Ts'o wrote:
> On Thu, Jun 13, 2013 at 09:27:15PM +0800, Zheng Liu wrote:
> > Thanks for your suggestion.  But, sorry, I couldn't get your point here.
> > As you suggested above, we can tag each inode with the last access time
> > when ext4_map_blocks() is called.  Then we will get an unsorted list of
> > inodes with some extent entries.  When we try to reclaim some entries
> > from the extent cache, we can call list_sort() to get a sorted list of
> > inodes and try to reclaim some entries according to the last access
> > time.  My question is: why do we need to drop all of the entries from all
> > inodes?
> 
> Sorry, what I meant to say is to drop all of the entries from a
> particular inode.  That is, instead of making the decision to drop one
> or two extents from inode A because they haven't been used in a long
> time, or one or two extents from inode B because they haven't been
> used in a long time, we make the decision on an inode by inode basis.

Sorry, but it seems that we have already implemented this algorithm,
which tries to reclaim all extents except delayed extents from a
particular inode.

Let me describe it, please.  When we access the extent status tree of a
particular inode, this inode will be moved to the tail of the LRU list
(sbi->s_es_lru).  When we try to reclaim some extents, we will traverse
this list and reclaim all extents except delayed extents from each
inode, until the number of reclaimed objects is equal to 'nr_to_scan' or
all extents have been discarded.

BTW, I have a patch that avoids traversing all entries in the LRU list
to speed up this process.

> 
> > As far as I understand, we can do the following improvement.  We tag the
> > last access time for each inode.  That would avoid moving the inode in
> > the LRU list frequently.  When we try to reclaim some entries, we call
> > list_sort() to get a sorted LRU list, and drop some entries from the LRU
> > list.  Please correct me if I misunderstood.
> 
> Yes, that's what I meant;

I have a new patch to implement it.  But when I measure it using 'perf
top', there is no improvement.  I see that the spinlock burns about 1.3%
of CPU time on my desktop with 2 Intel CPUs.  Maybe an x86 server with
8 CPUs could highlight this improvement.

> although note that depending on how many
> items we need to drop, it might not be necessary to do a full sort of
> the list --- we could just look for the inode with the oldest time,
> which would be O(n) rather than O(n log n).
> 
> If we are under enough memory pressure that the shrinker is getting
> called a lot, maybe it would be worthwhile to keep a list of the oldest N
> inodes, which only gets updated every so many seconds.  (Since we can
> always validate whether or not an inode has been touched since it had
> been put on the oldest N inodes very easily.)

Yes, I will give this a try.

> 
> > As for creating cache entries for all extents in a leaf block, that is
> > another improvement.  I will try to add it after I solve the current
> > problem that Dave reported.
> 
> Yes, it's a separable issue, but it's related to the same idea.  We
> should drop items in the cache which can be easily instantiated at the
> same time (i.e., drop all of the items related to a particular
> extent leaf block, or all of the items related to a particular inode).
> 
> There are, of course, other heuristics we could use, such as only
> searching inodes which have open file descriptors.  It's not clear the
> benefit is worth the cost of determining which inodes fit a particular
> criteria or not, though.

Yes, but as you said, we need to measure the benefit from it.

> 
> One other possibility while we're brainstorming is using hints from
> posix_fadvise or madvise to drop items from the extent cache.  The
> cautions with this idea are (a) not that many applications provide
> those hints, and (b) just because one process declares that it doesn't
> anticipate needing to access a particular file, doesn't mean that it
> is true for other users of the same file.  A related classical problem
> is the backup program which reads every single file in the file system
> once.  There's no particular good reason to have the backup program's
> activities affect the caches in the system, whether it be the page
> cache or the extent cache.  The trick is determining whether or not a
> particular program is behaving like a backup program or not, either
> via a heuristic or via an explicit hint/declaration from the program.

To be honest, it is difficult for us to add a new flag in
fadvise/madvise because other file systems might not provide this hint,
and app A using this hint could affect app B as you described.  But I
agree with you that we could provide an interface to let applications
manage the extent cache.

Regards,
                                                - Zheng

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ext4 extent status tree LRU locking
  2013-06-14 14:09 ` Zheng Liu
@ 2013-06-14 14:02   ` Theodore Ts'o
  2013-06-14 17:00     ` Zheng Liu
  2013-06-14 15:57   ` Dave Hansen
  1 sibling, 1 reply; 22+ messages in thread
From: Theodore Ts'o @ 2013-06-14 14:02 UTC (permalink / raw)
  To: Dave Hansen, linux-ext4, LKML, Jan kara

On Fri, Jun 14, 2013 at 10:09:40PM +0800, Zheng Liu wrote:
> This commit tries to fix this problem.  Now a new member called
> i_touch_when is added into ext4_inode_info to record the last access
> time for an inode.  Meanwhile we no longer need to keep a proper
> in-order LRU list, so this avoids burning some CPU time.  When we try to
> reclaim some entries from the extent status tree, we use list_sort() to
> get a proper in-order list.  Then we traverse this list to discard some
> entries.

I think this approach is very sound.  I have one reservation, however.
If there are a large number of inodes on this list, list_sort() could
burn a lot of CPU time, especially if the shrinker is called a very
large number of times in a short time period (i.e., when the system is
thrashing, or when one or more memory containers are under a large
amount of memory pressure).

I have a suggestion for how to address this: Keep a timestamp of when
the list last has been sorted in struct ext4_super_info.  When
iterating over the list, looking for a candidate inode, if inode's
i_touch_when is greater than the last time sorted, skip the inode.  If
there are no inodes left in the list, or more than some threshold
inodes have been skipped, only then resort the list (and update the
timestamp in the ext4_super_info structure).

What do you think?

					- Ted

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ext4 extent status tree LRU locking
  2013-06-11 23:22 ext4 extent status tree LRU locking Dave Hansen
  2013-06-12  7:17 ` Zheng Liu
@ 2013-06-14 14:09 ` Zheng Liu
  2013-06-14 14:02   ` Theodore Ts'o
  2013-06-14 15:57   ` Dave Hansen
  1 sibling, 2 replies; 22+ messages in thread
From: Zheng Liu @ 2013-06-14 14:09 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-ext4, LKML, Theodore Ts'o, Jan kara

On Tue, Jun 11, 2013 at 04:22:16PM -0700, Dave Hansen wrote:
> I've got a test case which I intended to use to stress the VM a bit.  It
> fills memory up with page cache a couple of times.  It essentially runs
> 30 or so cp's in parallel.
> 
> 98% of my CPU is system time, and 96% of _that_ is being spent on the
> spinlock in ext4_es_lru_add().  I think the LRU list head and its lock
> end up being *REALLY* hot cachelines and are *the* bottleneck on this
> test.  Note that this is _before_ we go in to reclaim and actually start
> calling in to the shrinker.  There is zero memory pressure in this test.
> 
> I'm not sure the benefits of having a proper in-order LRU during reclaim
> outweigh such a drastic downside for the common case.

Hi Dave,

I have pasted a patch below that tries to fix the problem that you
reported.  I ran the following test in my sandbox, and it seems that
this patch fixes the problem.  This patch is not very mature, but I hope
to be sure that my direction is right, so I would really appreciate it
if you could confirm that this patch fixes the problem in your test
case.

I tried to reproduce the test as you described.  I created an ext4 file
system on a loopback device on an x86_64 server with 16 CPUs and 24G of
memory.  Then I created 160 64G sparse files and copied them to
/dev/null.  I used 'perf record/report' to observe the overhead, and the
results are as below:

unpatched kernel
----------------
65.32%               cp  [kernel.kallsyms]            [k] _raw_spin_lock                                                                                                       
                         |
                         --- _raw_spin_lock
                            |          
                            |--99.30%-- ext4_es_lru_add
                            |          ext4_es_lookup_extent
                            |          ext4_map_blocks
                            |          _ext4_get_block
                            |          ext4_get_block
                            |          do_mpage_readpage

The result shows that ext4_es_lru_add() burns about 65% of the CPU time.

patched kernel
--------------
2.26%               cp  [kernel.kallsyms]          [k] _raw_spin_lock                                                                                                         
                         |
                         --- _raw_spin_lock
                            |          
                            |--96.36%-- rmqueue_bulk.clone.0
                            |          get_page_from_freelist
                            |          |          
                            |          |--60.13%-- __alloc_pages_nodemask

After applying the patch, ext4_es_lru_add() is no longer a bottleneck.

I am not sure whether I have reproduced your test, so it would be great
if you could confirm that this patch fixes the problem.  The patch is
based against the master branch of the ext4 tree.

Any comment or suggestion is welcome.

Thanks,
                                                - Zheng


Subject: [PATCH] ext4: improve extent cache shrink mechanism to avoid to burn CPU time

From: Zheng Liu <wenqing.lz@taobao.com>

Now we maintain a proper in-order LRU list in ext4 to reclaim entries
from the extent status tree when we are under heavy memory pressure.  For
keeping this order, a spin lock is used to protect this list.  But this
lock burns a lot of CPU time.

This commit tries to fix this problem.  Now a new member called
i_touch_when is added into ext4_inode_info to record the last access
time for an inode.  Meanwhile we no longer need to keep a proper
in-order LRU list, so this avoids burning some CPU time.  When we try to
reclaim some entries from the extent status tree, we use list_sort() to
get a proper in-order list.  Then we traverse this list to discard some
entries.

In this commit, we break the loop if s_extent_cache_cnt == 0 because
that means that all extents in extent status tree have been reclaimed.

Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/ext4.h           |    1 +
 fs/ext4/extents_status.c |   43 ++++++++++++++++++++++++++++++++++---------
 fs/ext4/inode.c          |    4 ++++
 fs/ext4/super.c          |    1 +
 4 files changed, 40 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 019db3c..1c31f27 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -874,6 +874,7 @@ struct ext4_inode_info {
 	rwlock_t i_es_lock;
 	struct list_head i_es_lru;
 	unsigned int i_es_lru_nr;	/* protected by i_es_lock */
+	unsigned long i_touch_when;	/* jiffies of last accessing */
 
 	/* ialloc */
 	ext4_group_t	i_last_alloc_group;
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index e6941e6..0b867b8 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -10,6 +10,7 @@
  * Ext4 extents status tree core functions.
  */
 #include <linux/rbtree.h>
+#include <linux/list_sort.h>
 #include "ext4.h"
 #include "extents_status.h"
 #include "ext4_extents.h"
@@ -291,7 +292,6 @@ out:
 
 	read_unlock(&EXT4_I(inode)->i_es_lock);
 
-	ext4_es_lru_add(inode);
 	trace_ext4_es_find_delayed_extent_range_exit(inode, es);
 }
 
@@ -672,7 +672,6 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 error:
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 
-	ext4_es_lru_add(inode);
 	ext4_es_print_tree(inode);
 
 	return err;
@@ -734,7 +733,6 @@ out:
 
 	read_unlock(&EXT4_I(inode)->i_es_lock);
 
-	ext4_es_lru_add(inode);
 	trace_ext4_es_lookup_extent_exit(inode, es, found);
 	return found;
 }
@@ -878,12 +876,30 @@ int ext4_es_zeroout(struct inode *inode, struct ext4_extent *ex)
 				     EXTENT_STATUS_WRITTEN);
 }
 
+static int ext4_inode_touch_time_cmp(void *priv, struct list_head *a,
+				     struct list_head *b)
+{
+	struct ext4_inode_info *eia, *eib;
+	long diff;
+
+	eia = list_entry(a, struct ext4_inode_info, i_es_lru);
+	eib = list_entry(b, struct ext4_inode_info, i_es_lru);
+
+	diff = eia->i_touch_when - eib->i_touch_when;
+	if (diff < 0)
+		return -1;
+	if (diff > 0)
+		return 1;
+	return 0;
+}
+
 static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct ext4_sb_info *sbi = container_of(shrink,
 					struct ext4_sb_info, s_es_shrinker);
 	struct ext4_inode_info *ei;
-	struct list_head *cur, *tmp, scanned;
+	struct list_head *cur, *tmp;
+	LIST_HEAD(scanned);
 	int nr_to_scan = sc->nr_to_scan;
 	int ret, nr_shrunk = 0;
 
@@ -893,10 +909,16 @@ static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
 	if (!nr_to_scan)
 		return ret;
 
-	INIT_LIST_HEAD(&scanned);
-
 	spin_lock(&sbi->s_es_lru_lock);
+	list_sort(NULL, &sbi->s_es_lru, ext4_inode_touch_time_cmp);
 	list_for_each_safe(cur, tmp, &sbi->s_es_lru) {
+		/*
+		 * If we have already reclaimed all extents from extent
+		 * status tree, just stop the loop immediately.
+		 */
+		if (percpu_counter_read_positive(&sbi->s_extent_cache_cnt) == 0)
+			break;
+
 		list_move_tail(cur, &scanned);
 
 		ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
@@ -947,11 +969,14 @@ void ext4_es_lru_add(struct inode *inode)
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 
+	ei->i_touch_when = jiffies;
+
+	if (!list_empty(&ei->i_es_lru))
+		return;
+
 	spin_lock(&sbi->s_es_lru_lock);
 	if (list_empty(&ei->i_es_lru))
-		list_add_tail(&ei->i_es_lru, &sbi->s_es_lru);
-	else
-		list_move_tail(&ei->i_es_lru, &sbi->s_es_lru);
+		list_add(&ei->i_es_lru, &sbi->s_es_lru);
 	spin_unlock(&sbi->s_es_lru_lock);
 }
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 38f03dc..1477406 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -571,6 +571,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		  "logical block %lu\n", inode->i_ino, flags, map->m_len,
 		  (unsigned long) map->m_lblk);
 
+	ext4_es_lru_add(inode);
+
 	/* Lookup extent status tree firstly */
 	if (ext4_es_lookup_extent(inode, map->m_lblk, &es)) {
 		if (ext4_es_is_written(&es) || ext4_es_is_unwritten(&es)) {
@@ -1888,6 +1890,8 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 		  "logical block %lu\n", inode->i_ino, map->m_len,
 		  (unsigned long) map->m_lblk);
 
+	ext4_es_lru_add(inode);
+
 	/* Lookup extent status tree firstly */
 	if (ext4_es_lookup_extent(inode, iblock, &es)) {
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a9c1438..b365695 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -849,6 +849,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	rwlock_init(&ei->i_es_lock);
 	INIT_LIST_HEAD(&ei->i_es_lru);
 	ei->i_es_lru_nr = 0;
+	ei->i_touch_when = 0;
 	ei->i_reserved_data_blocks = 0;
 	ei->i_reserved_meta_blocks = 0;
 	ei->i_allocated_meta_blocks = 0;
-- 
1.7.9.7

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: ext4 extent status tree LRU locking
  2013-06-14 14:09 ` Zheng Liu
  2013-06-14 14:02   ` Theodore Ts'o
@ 2013-06-14 15:57   ` Dave Hansen
  2013-06-14 17:11     ` Zheng Liu
  1 sibling, 1 reply; 22+ messages in thread
From: Dave Hansen @ 2013-06-14 15:57 UTC (permalink / raw)
  To: linux-ext4, LKML, Theodore Ts'o, Jan kara

On 06/14/2013 07:09 AM, Zheng Liu wrote:
> -	INIT_LIST_HEAD(&scanned);
> -
>  	spin_lock(&sbi->s_es_lru_lock);
> +	list_sort(NULL, &sbi->s_es_lru, ext4_inode_touch_time_cmp);
>  	list_for_each_safe(cur, tmp, &sbi->s_es_lru) {

How long can this list get?  I have the feeling this might get a bit
painful, especially on a NUMA machine.

But, it definitely eliminates the spinlock contention that I was seeing.
 The top ext4 function in my profiles is way down under 1% of CPU time
now.  Thanks for the quick response, and please let me know if you need
any further testing.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ext4 extent status tree LRU locking
  2013-06-14 17:11     ` Zheng Liu
@ 2013-06-14 16:55       ` Dave Hansen
  0 siblings, 0 replies; 22+ messages in thread
From: Dave Hansen @ 2013-06-14 16:55 UTC (permalink / raw)
  To: linux-ext4, LKML, Theodore Ts'o, Jan kara

On 06/14/2013 10:11 AM, Zheng Liu wrote:
> On Fri, Jun 14, 2013 at 08:57:44AM -0700, Dave Hansen wrote:
>> > On 06/14/2013 07:09 AM, Zheng Liu wrote:
>>> > > -	INIT_LIST_HEAD(&scanned);
>>> > > -
>>> > >  	spin_lock(&sbi->s_es_lru_lock);
>>> > > +	list_sort(NULL, &sbi->s_es_lru, ext4_inode_touch_time_cmp);
>>> > >  	list_for_each_safe(cur, tmp, &sbi->s_es_lru) {
>> > 
>> > How long can this list get?  I have the feeling this might get a bit
>> > painful, especially on a NUMA machine.
> I guess that you are worried about the time of sorting the LRU list, right?

Yeah, just worried about the amount of time it takes to examine (and
possibly write to) every item on the list.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ext4 extent status tree LRU locking
  2013-06-14 14:02   ` Theodore Ts'o
@ 2013-06-14 17:00     ` Zheng Liu
  2013-06-14 18:00       ` Theodore Ts'o
  0 siblings, 1 reply; 22+ messages in thread
From: Zheng Liu @ 2013-06-14 17:00 UTC (permalink / raw)
  To: Theodore Ts'o, Dave Hansen, linux-ext4, LKML, Jan kara

On Fri, Jun 14, 2013 at 10:02:15AM -0400, Theodore Ts'o wrote:
> On Fri, Jun 14, 2013 at 10:09:40PM +0800, Zheng Liu wrote:
> > This commit tries to fix this problem.  Now a new member called
> > i_touch_when is added into ext4_inode_info to record the last access
> > time for an inode.  Meanwhile we no longer need to keep a proper
> > in-order LRU list, so this avoids burning some CPU time.  When we try to
> > reclaim some entries from the extent status tree, we use list_sort() to
> > get a proper in-order list.  Then we traverse this list to discard some
> > entries.
> 
> I think this approach is very sound.  I have one reservation, however.
> If there are a large number of inodes on this list, list_sort() could
> burn a lot of CPU time, especially if the shrinker is called a very
> large number of times in a short time period (i.e., when the system is
> thrashing, or when one or more memory containers are under a large
> amount of memory pressure).
> 
> I have a suggestion for how to address this: Keep a timestamp of when
> the list last has been sorted in struct ext4_super_info.  When
> iterating over the list, looking for a candidate inode, if inode's
> i_touch_when is greater than the last time sorted, skip the inode.  If
> there are no inodes left in the list, or more than some threshold
> inodes have been skipped, only then resort the list (and update the
> timestamp in the ext4_super_info structure).
> 
> What do you think?

Thanks for your suggestion.  I fully agree with you.  What concerns me
is how to define this threshold.  A fixed value is very simple but too
inflexible.  If this value is too big, this LRU list could never be
sorted.  So my proposal is that we set this value according to the
number of used inodes.  For example, if the number of used inodes is
10,000, this threshold could be set to 10 (10,000 x 0.1%).  Meanwhile,
we need to provide a sysfs interface to let the user set this value.
Does it make sense?

Thanks,
                                                - Zheng

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ext4 extent status tree LRU locking
  2013-06-14 15:57   ` Dave Hansen
@ 2013-06-14 17:11     ` Zheng Liu
  2013-06-14 16:55       ` Dave Hansen
  0 siblings, 1 reply; 22+ messages in thread
From: Zheng Liu @ 2013-06-14 17:11 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-ext4, LKML, Theodore Ts'o, Jan kara

On Fri, Jun 14, 2013 at 08:57:44AM -0700, Dave Hansen wrote:
> On 06/14/2013 07:09 AM, Zheng Liu wrote:
> > -	INIT_LIST_HEAD(&scanned);
> > -
> >  	spin_lock(&sbi->s_es_lru_lock);
> > +	list_sort(NULL, &sbi->s_es_lru, ext4_inode_touch_time_cmp);
> >  	list_for_each_safe(cur, tmp, &sbi->s_es_lru) {
> 
> How long can this list get?  I have the feeling this might get a bit
> painful, especially on a NUMA machine.

I guess that you are worried about the time of sorting the LRU list,
right?  Ted and I also worry about this, especially when the shrinker is
called very frequently.  Ted has proposed a solution.  I will give it a
try.

> 
> But, it definitely eliminates the spinlock contention that I was seeing.
>  The top ext4 function in my profiles is way down under 1% of CPU time
> now.  Thanks for the quick response, and please let me know if you need
> any further testing.

Thanks for your help.  As I said above, I will try to improve this patch
later.  NUMA has been turned off in my server's BIOS and I don't have
the privilege to turn it on, so it would be great if you could help me
do some further testing.

Thanks in advance,
                                                - Zheng

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ext4 extent status tree LRU locking
  2013-06-14 17:00     ` Zheng Liu
@ 2013-06-14 18:00       ` Theodore Ts'o
  2013-06-17 10:10         ` Zheng Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Theodore Ts'o @ 2013-06-14 18:00 UTC (permalink / raw)
  To: Dave Hansen, linux-ext4, LKML, Jan kara

On Sat, Jun 15, 2013 at 01:00:28AM +0800, Zheng Liu wrote:
> > I have a suggestion for how to address this: Keep a timestamp of when
> > the list last has been sorted in struct ext4_super_info.  When
> > iterating over the list, looking for a candidate inode, if inode's
> > i_touch_when is greater than the last time sorted, skip the inode.  If
> > there are no inodes left in the list, or more than some threshold
> > inodes have been skipped, only then resort the list (and update the
> > timestamp in the ext4_super_info structure).
> > 
> 
> Thanks for your suggestion.  I fully agree with you.  What concerns me
> is how to define this threshold.  A fixed value is very simple but too
> inflexible.  If this value is too big, this LRU list could never be
> sorted.

In my proposal, if the list has never been sorted before, we would
definitely sort the list the first time there is a request to shrink
the number of extent caches.  So the list would always be sorted at
least once.

If an inode gets accessed after the time when the list has been
sorted, then the last_touch time will be greater than the last_sorted
time, right?  And our goal is to use the least recently used inode, so
as long as we know they have been used more recently than the time the
list was sorted, and there are still inodes on the list where the
last_touch time is older than the last_sort time, we don't have a
problem.  So we will evict all of the non-delalloc extent cache items
from that inode, and remove it from the list.

The original reason for having the threshold at all was that skipping a
huge number of inodes on the list would take a long time.  But thinking
about this some more, we don't even need a threshold.  What we do
instead is, if the last_touch time is newer than the last_sort time, we
just move the inode to the end of the list.  That
means the end of the list will be unsorted, but that's OK, because the
oldest inodes are still at the beginning of the list.  Once the head
of the list is newer than the last_sorted time, then we know we need
to resort the entire list.

Does that make sense to you?

					- Ted

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ext4 extent status tree LRU locking
  2013-06-14 18:00       ` Theodore Ts'o
@ 2013-06-17 10:10         ` Zheng Liu
  2013-06-17 21:12           ` Dave Hansen
  2013-06-18  2:47           ` Theodore Ts'o
  0 siblings, 2 replies; 22+ messages in thread
From: Zheng Liu @ 2013-06-17 10:10 UTC (permalink / raw)
  To: Theodore Ts'o, Dave Hansen, linux-ext4, LKML, Jan kara

On Fri, Jun 14, 2013 at 02:00:54PM -0400, Theodore Ts'o wrote:
> On Sat, Jun 15, 2013 at 01:00:28AM +0800, Zheng Liu wrote:
> > > I have a suggestion for how to address this: Keep a timestamp of when
> > > the list last has been sorted in struct ext4_super_info.  When
> > > iterating over the list, looking for a candidate inode, if inode's
> > > i_touch_when is greater than the last time sorted, skip the inode.  If
> > > there are no inodes left in the list, or more than some threshold
> > > inodes have been skipped, only then resort the list (and update the
> > > timestamp in the ext4_super_info structure).
> > > 
> > 
> > Thanks for your suggestion.  I fully agree with you.  What concerns me
> > is how to define this threshold.  A fixed value is very simple but too
> > inflexible.  If this value is too big, this LRU list could never be
> > sorted.
> 
> In my proposal, if the list has never been sorted before, we would
> definitely sort the list the first time there is a request to shrink
> the number of extent caches.  So the list would always be sorted at
> least once.
> 
> If an inode gets accessed after the time when the list has been
> sorted, then the last_touch time will be greater than the last_sorted
> time, right?  And our goal is to use the least recently used inode, so
> as long as we know they have been used more recently than the time the
> list was sorted, and there are still inodes on the list where the
> last_touch time is older than the last_sort time, we don't have a
> problem.  So we will evict all of the non-delalloc extent cache items
> from that inode, and remove it from the list.
> 
> The original reason for having the threshold at all was that skipping a
> huge number of inodes on the list would take a long time.  But thinking
> about this some more, we don't even need a threshold.  What we do
> instead is, if the last_touch time is newer than the last_sort time, we
> just move the inode to the end of the list.  That
> means the end of the list will be unsorted, but that's OK, because the
> oldest inodes are still at the beginning of the list.  Once the head
> of the list is newer than the last_sorted time, then we know we need
> to resort the entire list.
> 
> Does that make sense to you?

Hi Ted and Dave,

Sorry for the late reply.  It makes sense to me.  Here is the second
version of the patch.  Could you please review it?

Dave, that would be great if you could do your testing again to confirm
this patch is useful.

Any feedback is welcome.

Thanks,
                                                - Zheng


Subject: [PATCH v2] ext4: improve extent cache shrink mechanism to avoid to burn CPU time

From: Zheng Liu <wenqing.lz@taobao.com>

Now we maintain a proper in-order LRU list in ext4 to reclaim entries
from the extent status tree when we are under heavy memory pressure.  For
keeping this order, a spin lock is used to protect this list.  But this
lock burns a lot of CPU time.  We can use the following steps to trigger
it.

  % cd /dev/shm
  % dd if=/dev/zero of=ext4-img bs=1M count=2k
  % mkfs.ext4 ext4-img
  % mount -t ext4 -o loop ext4-img /mnt
  % cd /mnt
  % for ((i=0;i<160;i++)); do truncate -s 64g $i; done
  % for ((i=0;i<160;i++)); do cp $i /dev/null &; done
  % perf record -a -g
  % perf report

This commit tries to fix this problem.  Now a new member called
i_touch_when is added into ext4_inode_info to record the last access
time for an inode.  Meanwhile we no longer need to keep a proper
in-order LRU list, so this avoids burning some CPU time.  When we try to
reclaim some entries from the extent status tree, we use list_sort() to
get a proper in-order list.  Then we traverse this list to discard some
entries.  In ext4_sb_info, we use s_es_last_sorted to record the last
time this list was sorted.  When we traverse the list, we skip any inode
that is newer than this time, and move that inode to the tail of the LRU
list.  When the head of the list is newer than s_es_last_sorted, we will
sort the LRU list again.

In this commit, we break the loop if s_extent_cache_cnt == 0 because
that means that all extents in extent status tree have been reclaimed.

Meanwhile in this commit, ext4_es_{un}register_shrinker()'s prototype is
changed to save a local variable in these functions.

Reported-by: Dave Hansen <dave.hansen@intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
changelog:
v2:
	* avoid to sort the LRU list
	* change ext4_es_{un}register_shrinker()'s prototype to save a local
	  variable

 fs/ext4/ext4.h           |    2 ++
 fs/ext4/extents_status.c |   77 ++++++++++++++++++++++++++++++++++------------
 fs/ext4/extents_status.h |    5 +--
 fs/ext4/inode.c          |    4 +++
 fs/ext4/super.c          |    7 +++--
 5 files changed, 70 insertions(+), 25 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 019db3c..f30ff8c 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -874,6 +874,7 @@ struct ext4_inode_info {
 	rwlock_t i_es_lock;
 	struct list_head i_es_lru;
 	unsigned int i_es_lru_nr;	/* protected by i_es_lock */
+	unsigned long i_touch_when;	/* jiffies of last accessing */
 
 	/* ialloc */
 	ext4_group_t	i_last_alloc_group;
@@ -1302,6 +1303,7 @@ struct ext4_sb_info {
 	/* Reclaim extents from extent status tree */
 	struct shrinker s_es_shrinker;
 	struct list_head s_es_lru;
+	unsigned long s_es_last_sorted;
 	struct percpu_counter s_extent_cache_cnt;
 	spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;
 };
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index e6941e6..80dcc59 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -10,6 +10,7 @@
  * Ext4 extents status tree core functions.
  */
 #include <linux/rbtree.h>
+#include <linux/list_sort.h>
 #include "ext4.h"
 #include "extents_status.h"
 #include "ext4_extents.h"
@@ -291,7 +292,6 @@ out:
 
 	read_unlock(&EXT4_I(inode)->i_es_lock);
 
-	ext4_es_lru_add(inode);
 	trace_ext4_es_find_delayed_extent_range_exit(inode, es);
 }
 
@@ -672,7 +672,6 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 error:
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 
-	ext4_es_lru_add(inode);
 	ext4_es_print_tree(inode);
 
 	return err;
@@ -734,7 +733,6 @@ out:
 
 	read_unlock(&EXT4_I(inode)->i_es_lock);
 
-	ext4_es_lru_add(inode);
 	trace_ext4_es_lookup_extent_exit(inode, es, found);
 	return found;
 }
@@ -878,12 +876,30 @@ int ext4_es_zeroout(struct inode *inode, struct ext4_extent *ex)
 				     EXTENT_STATUS_WRITTEN);
 }
 
+static int ext4_inode_touch_time_cmp(void *priv, struct list_head *a,
+				     struct list_head *b)
+{
+	struct ext4_inode_info *eia, *eib;
+	long diff;
+
+	eia = list_entry(a, struct ext4_inode_info, i_es_lru);
+	eib = list_entry(b, struct ext4_inode_info, i_es_lru);
+
+	diff = eia->i_touch_when - eib->i_touch_when;
+	if (diff < 0)
+		return -1;
+	if (diff > 0)
+		return 1;
+	return 0;
+}
+
 static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct ext4_sb_info *sbi = container_of(shrink,
 					struct ext4_sb_info, s_es_shrinker);
 	struct ext4_inode_info *ei;
-	struct list_head *cur, *tmp, scanned;
+	struct list_head *cur, *tmp;
+	LIST_HEAD(skipped);
 	int nr_to_scan = sc->nr_to_scan;
 	int ret, nr_shrunk = 0;
 
@@ -893,23 +909,41 @@ static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
 	if (!nr_to_scan)
 		return ret;
 
-	INIT_LIST_HEAD(&scanned);
-
 	spin_lock(&sbi->s_es_lru_lock);
+
+	/*
+	 * If the inode that is at the head of LRU list is newer than
+	 * last_sorted time, that means that we need to sort this list.
+	 */
+	ei = list_first_entry(&sbi->s_es_lru, struct ext4_inode_info, i_es_lru);
+	if (sbi->s_es_last_sorted < ei->i_touch_when) {
+		list_sort(NULL, &sbi->s_es_lru, ext4_inode_touch_time_cmp);
+		sbi->s_es_last_sorted = jiffies;
+	}
+
 	list_for_each_safe(cur, tmp, &sbi->s_es_lru) {
-		list_move_tail(cur, &scanned);
+		/*
+		 * If we have already reclaimed all extents from extent
+		 * status tree, just stop the loop immediately.
+		 */
+		if (percpu_counter_read_positive(&sbi->s_extent_cache_cnt) == 0)
+			break;
 
 		ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
 
-		read_lock(&ei->i_es_lock);
-		if (ei->i_es_lru_nr == 0) {
-			read_unlock(&ei->i_es_lock);
+		/* Skip the inode that is newer than the last_sorted time */
+		if (sbi->s_es_last_sorted < ei->i_touch_when) {
+			list_move_tail(cur, &skipped);
 			continue;
 		}
-		read_unlock(&ei->i_es_lock);
+
+		if (ei->i_es_lru_nr == 0)
+			continue;
 
 		write_lock(&ei->i_es_lock);
 		ret = __es_try_to_reclaim_extents(ei, nr_to_scan);
+		if (ei->i_es_lru_nr == 0)
+			list_del_init(&ei->i_es_lru);
 		write_unlock(&ei->i_es_lock);
 
 		nr_shrunk += ret;
@@ -917,7 +951,9 @@ static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
 		if (nr_to_scan == 0)
 			break;
 	}
-	list_splice_tail(&scanned, &sbi->s_es_lru);
+
+	/* Move the newer inodes into the tail of the LRU list. */
+	list_splice_tail(&skipped, &sbi->s_es_lru);
 	spin_unlock(&sbi->s_es_lru_lock);
 
 	ret = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
@@ -925,21 +961,19 @@ static int ext4_es_shrink(struct shrinker *shrink, struct shrink_control *sc)
 	return ret;
 }
 
-void ext4_es_register_shrinker(struct super_block *sb)
+void ext4_es_register_shrinker(struct ext4_sb_info *sbi)
 {
-	struct ext4_sb_info *sbi;
-
-	sbi = EXT4_SB(sb);
 	INIT_LIST_HEAD(&sbi->s_es_lru);
 	spin_lock_init(&sbi->s_es_lru_lock);
+	sbi->s_es_last_sorted = 0;
 	sbi->s_es_shrinker.shrink = ext4_es_shrink;
 	sbi->s_es_shrinker.seeks = DEFAULT_SEEKS;
 	register_shrinker(&sbi->s_es_shrinker);
 }
 
-void ext4_es_unregister_shrinker(struct super_block *sb)
+void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
 {
-	unregister_shrinker(&EXT4_SB(sb)->s_es_shrinker);
+	unregister_shrinker(&sbi->s_es_shrinker);
 }
 
 void ext4_es_lru_add(struct inode *inode)
@@ -947,11 +981,14 @@ void ext4_es_lru_add(struct inode *inode)
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 
+	ei->i_touch_when = jiffies;
+
+	if (!list_empty(&ei->i_es_lru))
+		return;
+
 	spin_lock(&sbi->s_es_lru_lock);
 	if (list_empty(&ei->i_es_lru))
 		list_add_tail(&ei->i_es_lru, &sbi->s_es_lru);
-	else
-		list_move_tail(&ei->i_es_lru, &sbi->s_es_lru);
 	spin_unlock(&sbi->s_es_lru_lock);
 }
 
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index f740eb03..e936730 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -39,6 +39,7 @@
 				 EXTENT_STATUS_DELAYED | \
 				 EXTENT_STATUS_HOLE)
 
+struct ext4_sb_info;
 struct ext4_extent;
 
 struct extent_status {
@@ -119,8 +120,8 @@ static inline void ext4_es_store_status(struct extent_status *es,
 	es->es_pblk = block;
 }
 
-extern void ext4_es_register_shrinker(struct super_block *sb);
-extern void ext4_es_unregister_shrinker(struct super_block *sb);
+extern void ext4_es_register_shrinker(struct ext4_sb_info *sbi);
+extern void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi);
 extern void ext4_es_lru_add(struct inode *inode);
 extern void ext4_es_lru_del(struct inode *inode);
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 38f03dc..1477406 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -571,6 +571,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		  "logical block %lu\n", inode->i_ino, flags, map->m_len,
 		  (unsigned long) map->m_lblk);
 
+	ext4_es_lru_add(inode);
+
 	/* Lookup extent status tree firstly */
 	if (ext4_es_lookup_extent(inode, map->m_lblk, &es)) {
 		if (ext4_es_is_written(&es) || ext4_es_is_unwritten(&es)) {
@@ -1888,6 +1890,8 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 		  "logical block %lu\n", inode->i_ino, map->m_len,
 		  (unsigned long) map->m_lblk);
 
+	ext4_es_lru_add(inode);
+
 	/* Lookup extent status tree firstly */
 	if (ext4_es_lookup_extent(inode, iblock, &es)) {
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a9c1438..a7054e7 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -760,7 +760,7 @@ static void ext4_put_super(struct super_block *sb)
 			ext4_abort(sb, "Couldn't clean up the journal");
 	}
 
-	ext4_es_unregister_shrinker(sb);
+	ext4_es_unregister_shrinker(sbi);
 	del_timer(&sbi->s_err_report);
 	ext4_release_system_zone(sb);
 	ext4_mb_release(sb);
@@ -849,6 +849,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	rwlock_init(&ei->i_es_lock);
 	INIT_LIST_HEAD(&ei->i_es_lru);
 	ei->i_es_lru_nr = 0;
+	ei->i_touch_when = 0;
 	ei->i_reserved_data_blocks = 0;
 	ei->i_reserved_meta_blocks = 0;
 	ei->i_allocated_meta_blocks = 0;
@@ -3766,7 +3767,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	sbi->s_err_report.data = (unsigned long) sb;
 
 	/* Register extent status tree shrinker */
-	ext4_es_register_shrinker(sb);
+	ext4_es_register_shrinker(sbi);
 
 	err = percpu_counter_init(&sbi->s_freeclusters_counter,
 			ext4_count_free_clusters(sb));
@@ -4084,7 +4085,7 @@ failed_mount_wq:
 		sbi->s_journal = NULL;
 	}
 failed_mount3:
-	ext4_es_unregister_shrinker(sb);
+	ext4_es_unregister_shrinker(sbi);
 	del_timer(&sbi->s_err_report);
 	if (sbi->s_flex_groups)
 		ext4_kvfree(sbi->s_flex_groups);
-- 
1.7.9.7

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: ext4 extent status tree LRU locking
  2013-06-17 10:10         ` Zheng Liu
@ 2013-06-17 21:12           ` Dave Hansen
  2013-06-18  2:25             ` Zheng Liu
  2013-06-18  2:47           ` Theodore Ts'o
  1 sibling, 1 reply; 22+ messages in thread
From: Dave Hansen @ 2013-06-17 21:12 UTC (permalink / raw)
  To: Theodore Ts'o, linux-ext4, LKML, Jan kara

On 06/17/2013 03:10 AM, Zheng Liu wrote:
> Dave, that would be great if you could do your testing again to confirm
> this patch is useful.

I was able to apply this to Ted's

	https://git.kernel.org/cgit/linux/kernel/git/tytso/ext4.git/

"ext4/dev" tree with a bit of fuzz.  What did you generate this against,
btw?

It does seem to be functioning OK for me, and passes the other tests
that saw so much spinlock contention previously.  You can add my
Tested-by if you like.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ext4 extent status tree LRU locking
  2013-06-17 21:12           ` Dave Hansen
@ 2013-06-18  2:25             ` Zheng Liu
  2013-06-18  2:51               ` Theodore Ts'o
  0 siblings, 1 reply; 22+ messages in thread
From: Zheng Liu @ 2013-06-18  2:25 UTC (permalink / raw)
  To: Dave Hansen, Theodore Ts'o; +Cc: linux-ext4, LKML, Jan kara

On Mon, Jun 17, 2013 at 02:12:10PM -0700, Dave Hansen wrote:
> On 06/17/2013 03:10 AM, Zheng Liu wrote:
> > Dave, that would be great if you could do your testing again to confirm
> > this patch is useful.
> 
> I was able to apply this to Ted's
> 
> 	https://git.kernel.org/cgit/linux/kernel/git/tytso/ext4.git/
> 
> "ext4/dev" tree with a bit of fuzz.  What did you generate this against,
> btw?

Ah, sorry, I forgot to mention that this patch is based against the
ext4/master branch.  The ext4/dev branch currently has some regressions
when I run xfstests.  So I didn't base my patch against that branch.

Ted, I notice that now in ext4 tree we have 'dev', 'dev-with-revert',
and 'dev2' branches.  Which one is the best to generate a new patch for
the next merge window?

> 
> It does seem to be functioning OK for me, and passes the other tests
> that saw so much spinlock contention previously.  You can add my
> Tested-by if you like.

Thanks for your testing.

Regards,
                                                - Zheng

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ext4 extent status tree LRU locking
  2013-06-17 10:10         ` Zheng Liu
  2013-06-17 21:12           ` Dave Hansen
@ 2013-06-18  2:47           ` Theodore Ts'o
  1 sibling, 0 replies; 22+ messages in thread
From: Theodore Ts'o @ 2013-06-18  2:47 UTC (permalink / raw)
  To: Dave Hansen, linux-ext4, LKML, Jan kara

> Subject: [PATCH v2] ext4: improve extent cache shrink mechanism to avoid to burn CPU time
> 
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> Now we maintain a proper in-order LRU list in ext4 to reclaim entries
> from the extent status tree when we are under heavy memory pressure.  For
> keeping this order, a spin lock is used to protect this list.  But this
> lock burns a lot of CPU time.  We can use the following steps to trigger
> it.
> 
>   % cd /dev/shm
>   % dd if=/dev/zero of=ext4-img bs=1M count=2k
>   % mkfs.ext4 ext4-img
>   % mount -t ext4 -o loop ext4-img /mnt
>   % cd /mnt
>   % for ((i=0;i<160;i++)); do truncate -s 64g $i; done
>   % for ((i=0;i<160;i++)); do cp $i /dev/null &; done
>   % perf record -a -g
>   % perf report
> 
> This commit tries to fix this problem.  Now a new member called
> i_touch_when is added into ext4_inode_info to record the last access
> time for an inode.  Meanwhile we no longer need to keep a proper
> in-order LRU list, so this avoids burning some CPU time.  When we try to
> reclaim some entries from the extent status tree, we use list_sort() to
> get a proper in-order list.  Then we traverse this list to discard some
> entries.  In ext4_sb_info, we use s_es_last_sorted to record the last
> time this list was sorted.  When we traverse the list, we skip any inode
> that is newer than this time, and move that inode to the tail of the LRU
> list.  When the head of the list is newer than s_es_last_sorted, we will
> sort the LRU list again.
> 
> In this commit, we break the loop if s_extent_cache_cnt == 0 because
> that means that all extents in extent status tree have been reclaimed.
> 
> Meanwhile in this commit, ext4_es_{un}register_shrinker()'s prototype is
> changed to save a local variable in these functions.
> 
> Reported-by: Dave Hansen <dave.hansen@intel.com>
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>

Applied, thanks.

					- Ted

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ext4 extent status tree LRU locking
  2013-06-18  2:25             ` Zheng Liu
@ 2013-06-18  2:51               ` Theodore Ts'o
  2013-06-18  3:49                 ` Zheng Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Theodore Ts'o @ 2013-06-18  2:51 UTC (permalink / raw)
  To: Dave Hansen, linux-ext4, LKML, Jan kara

On Tue, Jun 18, 2013 at 10:25:48AM +0800, Zheng Liu wrote:
> Ah, sorry, I forgot to mention that this patch is based against the
> ext4/master branch.  The ext4/dev branch currently has some regressions
> when I run xfstests.

What regressions are you seeing?

> Ted, I notice that now in ext4 tree we have 'dev', 'dev-with-revert',
> and 'dev2' branches.  Which one is the best to generate a new patch for
> the next merge window?

Either the dev branch or the master branch.   

The dev-with-revert and dev2 were branches that I had created when
investigating a potential regression with the invalidate page range
patch set.  I've since determined that it's a timing issue and it's
not a new regression --- we've had xfstests failures with test
generic/300 for a while now.

					- Ted

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: ext4 extent status tree LRU locking
  2013-06-18  2:51               ` Theodore Ts'o
@ 2013-06-18  3:49                 ` Zheng Liu
  0 siblings, 0 replies; 22+ messages in thread
From: Zheng Liu @ 2013-06-18  3:49 UTC (permalink / raw)
  To: Theodore Ts'o, Dave Hansen, linux-ext4, LKML, Jan kara

On Mon, Jun 17, 2013 at 10:51:34PM -0400, Theodore Ts'o wrote:
> On Tue, Jun 18, 2013 at 10:25:48AM +0800, Zheng Liu wrote:
> > Ah, sorry, I forgot to mention that this patch is based against the
> > ext4/master branch.  The ext4/dev branch currently has some regressions
> > when I run xfstests.
> 
> What regressions are you seeing?

generic/300.

When I was testing my patch, I knew that there was a report that the
invalidate page range patch set causes a regression, and I was not sure
whether that patch set causes it or not.  So I decided to generate my
patch against ext4/master.  So, don't worry. :-)

BTW, I will run xfstests this week.  If I meet any regression, I will
let you know.

> 
> > Ted, I notice that now in ext4 tree we have 'dev', 'dev-with-revert',
> > and 'dev2' branches.  Which one is the best to generate a new patch for
> > the next merge window?
> 
> Either the dev branch or the master branch.   
> 
> The dev-with-revert and dev2 were branches that I had created when
> investigating a potential regression with the invalidate page range
> patch set.  I've since determined that it's a timing issue and it's
> not a new regression --- we've had xfstests failures with test
> generic/300 for a while now.

Thanks for pointing it out.

                                                - Zheng

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread

Thread overview: 22+ messages
2013-06-11 23:22 ext4 extent status tree LRU locking Dave Hansen
2013-06-12  7:17 ` Zheng Liu
2013-06-12 15:09   ` Dave Hansen
2013-06-12 16:03     ` Zheng Liu
2013-06-12 17:52       ` Dave Hansen
2013-06-12 20:48     ` Theodore Ts'o
2013-06-13 13:27       ` Zheng Liu
2013-06-13 13:35         ` Theodore Ts'o
2013-06-14  3:27           ` Zheng Liu
2013-06-14 14:09 ` Zheng Liu
2013-06-14 14:02   ` Theodore Ts'o
2013-06-14 17:00     ` Zheng Liu
2013-06-14 18:00       ` Theodore Ts'o
2013-06-17 10:10         ` Zheng Liu
2013-06-17 21:12           ` Dave Hansen
2013-06-18  2:25             ` Zheng Liu
2013-06-18  2:51               ` Theodore Ts'o
2013-06-18  3:49                 ` Zheng Liu
2013-06-18  2:47           ` Theodore Ts'o
2013-06-14 15:57   ` Dave Hansen
2013-06-14 17:11     ` Zheng Liu
2013-06-14 16:55       ` Dave Hansen
