Re: [patch] Converting writeback linked lists to a tree based data structure

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Fengguang Wu <wfg@mail.ustc.edu.cn>
To: Michael Rubin <mrubin@google.com>
Cc: a.p.zijlstra@chello.nl, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure
Date: Fri, 18 Jan 2008 12:56:09 +0800	[thread overview]
Message-ID: <400632190.14601@ustc.edu.cn> (raw)
Message-ID: <E1JFjGz-0001eU-3O@localhost.localdomain> (raw)
In-Reply-To: <532480950801171307q4b540ewa3acb6bfbea5dbc8@mail.gmail.com>

On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote:
> On Jan 17, 2008 1:41 AM, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > On Tue, Jan 15, 2008 at 12:09:21AM -0800, Michael Rubin wrote:
> > The main benefit of rbtree is possibly better support of future policies.
> > Can you demonstrate an example?
> 
> These are ill-formed thoughts as of now on my end but the idea was
> that keeping one tree sorted via a scheme might be simpler than
> multiple list_heads.

Suppose we want to grant longer expiration window for temp files,
adding a new list named s_dirty_tmpfile would be a handy solution.

So the question is: should we need more than 3 QoS classes?

> > The most tricky writeback issues could be starvation prevention
> > between
> 
> 
> >         - small/large files
> >         - new/old files
> >         - superblocks
> 
> So I have written tests and believe I have covered these issues. If
> you are concerned in specific on any and have a test case please let
> me know.

OK.

> > Some kind of limit should be applied for each. They used to be:
> >         - requeue to s_more_io whenever MAX_WRITEBACK_PAGES is reached
> >           this preempts big files
> 
> The patch uses th same limit.
> 
> >         - refill s_io iif it is drained
> >           this prevents promotion of big/old files
> 
> Once a big file gets its first do_writepages it is moved behind the
> other smaller files via i_flushed_when. And the same in reverse for
> big vs old.

You mean i_flush_gen? No, sync_sb_inodes() will abort on every
MAX_WRITEBACK_PAGES, and s_flush_gen will be updated accordingly.
Hence the sync will restart from big/old files.

> 
> >         - return from sync_sb_inodes() after one go of s_io
> 
> I am not sure how this limit helps things out. Is this for superblock
> starvation? Can you elaborate?

We should have a way to go to next superblock even if new dirty inodes
or pages are emerging fast in this superblock. Fill and drain s_io
only once and then abort helps.

s_io is a stable and bounded working set in one go of superblock.

> > Michael, could you sort out and document the new starvation prevention schemes?
> 
> The basic idea behind the writeback algorithm to handle starvation.
> The over arching idea is that we want to preserve order of writeback
> based on when an inode was dirtied and also preserve the dirtied_when
> contents until the inode has been written back (partially or fully)
> 
> Every sync_sb_inodes we find the least recent inodes dirtied. To deal
> with large or small starvation we have a s_flush_gen for each
> iteration of sync_sb_inodes every time we issue a writeback we mark
> that the inode cannot be processed until the next s_flush_gen. This
> way we don't process one get to the rest since we keep pushing them
> into subsequent s_fush_gen's.
> 
> Let me know if you want more detail or structured responses.
> 
> > Introduce i_flush_gen to help restarting from the last inode?
> > Well, it's not as simple as list_heads.

Basically you make one list_head in each rbtree node.
That list_head is recycled cyclic, and is an analog to the old
fashioned s_dirty. We need to know 'where we are' and 'where it ends'.
So an extra indicator must be introduced - i_flush_gen. It's awkward.

We are simply repeating the aged list_heads' problem.

> > > 2) Added an inode flag to allow inodes to be marked so that they
> > >    are never written back to disk.
> > >
> > >    The motivation behind this change is several fold. The first is
> > >    to insure fairness in the writeback algorithm. The second is to
> >
> > What do you mean by fairness?
> 
> So originally this comment was written when I was trying to fix a bug
> in 2.6.23. The one where we were starving large files from being
> flushed. There was a fairness issue where small files were being
> flushed but the large ones were just ballooning in memory.

In fact the bug is turned-around rather than fixed - now the small
files could be starved.

> > Why cannot I_WRITEBACK_NEVER be in a decoupled standalone patch?
> 
> The WRITEBACK_NEVER could be in a previous patch to the rbtree. But
> not a subsequent patch to the rbtree. The rbtree depends on the
> WRITEBACK_NEVER patch otherwise we run in to problems in
> generic_delete_inode. Now that you point it out I think I can split
> this patch into two patches and make the WRITEBACK_NEVER in the first
> one.

OK.

WARNING: multiple messages have this Message-ID (diff)

From: Fengguang Wu <wfg@mail.ustc.edu.cn>
To: Michael Rubin <mrubin@google.com>
Cc: a.p.zijlstra@chello.nl, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure
Date: Fri, 18 Jan 2008 12:56:09 +0800	[thread overview]
Message-ID: <400632190.14601@ustc.edu.cn> (raw)
Message-ID: <E1JFjGz-0001eU-3O@localhost.localdomain> (raw)
In-Reply-To: <532480950801171307q4b540ewa3acb6bfbea5dbc8@mail.gmail.com>

On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote:
> On Jan 17, 2008 1:41 AM, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > On Tue, Jan 15, 2008 at 12:09:21AM -0800, Michael Rubin wrote:
> > The main benefit of rbtree is possibly better support of future policies.
> > Can you demonstrate an example?
> 
> These are ill-formed thoughts as of now on my end but the idea was
> that keeping one tree sorted via a scheme might be simpler than
> multiple list_heads.

Suppose we want to grant longer expiration window for temp files,
adding a new list named s_dirty_tmpfile would be a handy solution.

So the question is: should we need more than 3 QoS classes?

> > The most tricky writeback issues could be starvation prevention
> > between
> 
> 
> >         - small/large files
> >         - new/old files
> >         - superblocks
> 
> So I have written tests and believe I have covered these issues. If
> you are concerned in specific on any and have a test case please let
> me know.

OK.

> > Some kind of limit should be applied for each. They used to be:
> >         - requeue to s_more_io whenever MAX_WRITEBACK_PAGES is reached
> >           this preempts big files
> 
> The patch uses th same limit.
> 
> >         - refill s_io iif it is drained
> >           this prevents promotion of big/old files
> 
> Once a big file gets its first do_writepages it is moved behind the
> other smaller files via i_flushed_when. And the same in reverse for
> big vs old.

You mean i_flush_gen? No, sync_sb_inodes() will abort on every
MAX_WRITEBACK_PAGES, and s_flush_gen will be updated accordingly.
Hence the sync will restart from big/old files.

> 
> >         - return from sync_sb_inodes() after one go of s_io
> 
> I am not sure how this limit helps things out. Is this for superblock
> starvation? Can you elaborate?

We should have a way to go to next superblock even if new dirty inodes
or pages are emerging fast in this superblock. Fill and drain s_io
only once and then abort helps.

s_io is a stable and bounded working set in one go of superblock.

> > Michael, could you sort out and document the new starvation prevention schemes?
> 
> The basic idea behind the writeback algorithm to handle starvation.
> The over arching idea is that we want to preserve order of writeback
> based on when an inode was dirtied and also preserve the dirtied_when
> contents until the inode has been written back (partially or fully)
> 
> Every sync_sb_inodes we find the least recent inodes dirtied. To deal
> with large or small starvation we have a s_flush_gen for each
> iteration of sync_sb_inodes every time we issue a writeback we mark
> that the inode cannot be processed until the next s_flush_gen. This
> way we don't process one get to the rest since we keep pushing them
> into subsequent s_fush_gen's.
> 
> Let me know if you want more detail or structured responses.
> 
> > Introduce i_flush_gen to help restarting from the last inode?
> > Well, it's not as simple as list_heads.

Basically you make one list_head in each rbtree node.
That list_head is recycled cyclic, and is an analog to the old
fashioned s_dirty. We need to know 'where we are' and 'where it ends'.
So an extra indicator must be introduced - i_flush_gen. It's awkward.

We are simply repeating the aged list_heads' problem.

> > > 2) Added an inode flag to allow inodes to be marked so that they
> > >    are never written back to disk.
> > >
> > >    The motivation behind this change is several fold. The first is
> > >    to insure fairness in the writeback algorithm. The second is to
> >
> > What do you mean by fairness?
> 
> So originally this comment was written when I was trying to fix a bug
> in 2.6.23. The one where we were starving large files from being
> flushed. There was a fairness issue where small files were being
> flushed but the large ones were just ballooning in memory.

In fact the bug is turned-around rather than fixed - now the small
files could be starved.

> > Why cannot I_WRITEBACK_NEVER be in a decoupled standalone patch?
> 
> The WRITEBACK_NEVER could be in a previous patch to the rbtree. But
> not a subsequent patch to the rbtree. The rbtree depends on the
> WRITEBACK_NEVER patch otherwise we run in to problems in
> generic_delete_inode. Now that you point it out I think I can split
> this patch into two patches and make the WRITEBACK_NEVER in the first
> one.

OK.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2008-01-18  4:56 UTC|newest]

Thread overview: 68+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-01-15  8:09 [patch] Converting writeback linked lists to a tree based data structure Michael Rubin
2008-01-15  8:09 ` Michael Rubin, Michael Rubin
2008-01-15  8:46 ` Peter Zijlstra
2008-01-15  8:46   ` Peter Zijlstra
2008-01-15 17:53   ` Michael Rubin
2008-01-15 17:53     ` Michael Rubin
2008-01-16  3:01     ` Fengguang Wu
2008-01-16  3:01       ` Fengguang Wu
2008-01-16  3:01         ` Fengguang Wu
2008-01-16  3:44         ` Andrew Morton
2008-01-16  3:44           ` Andrew Morton
2008-01-16  4:25           ` Fengguang Wu
2008-01-16  4:25             ` Fengguang Wu
2008-01-16  4:25               ` Fengguang Wu
2008-01-16  4:42               ` Andrew Morton
2008-01-16  4:42                 ` Andrew Morton
2008-01-16  4:55                 ` Fengguang Wu
2008-01-16  4:55                   ` Fengguang Wu
2008-01-16  4:55                     ` Fengguang Wu
2008-01-16  5:51                     ` Andrew Morton
2008-01-16  5:51                       ` Andrew Morton
2008-01-16  9:07                       ` Fengguang Wu
2008-01-16  9:07                         ` Fengguang Wu
2008-01-16  9:07                           ` Fengguang Wu
2008-01-18  7:36                           ` Mike Waychison
2008-01-18  7:36                             ` Mike Waychison
2008-01-16 22:35                         ` David Chinner
2008-01-16 22:35                           ` David Chinner
2008-01-17  3:16                           ` Fengguang Wu
2008-01-17  3:16                             ` Fengguang Wu
2008-01-17  3:16                               ` Fengguang Wu
2008-01-17  5:21                             ` David Chinner
2008-01-17  5:21                               ` David Chinner
2008-01-16  7:55           ` David Chinner
2008-01-16  7:55             ` David Chinner
2008-01-16  8:13             ` Andrew Morton
2008-01-16  8:13               ` Andrew Morton
2008-01-16 13:06               ` Fengguang Wu
2008-01-16 13:06                 ` Fengguang Wu
2008-01-16 13:06                   ` Fengguang Wu
2008-01-16 18:55         ` Michael Rubin
2008-01-16 18:55           ` Michael Rubin
2008-01-17  3:31           ` Fengguang Wu
2008-01-17  3:31             ` Fengguang Wu
2008-01-17  3:31               ` Fengguang Wu
2008-01-17  9:41 ` Fengguang Wu
2008-01-17  9:41   ` Fengguang Wu
2008-01-17  9:41     ` Fengguang Wu
2008-01-17 21:07     ` Michael Rubin
2008-01-17 21:07       ` Michael Rubin
2008-01-18  4:56       ` Fengguang Wu [this message]
2008-01-18  4:56         ` Fengguang Wu
2008-01-18  4:56           ` Fengguang Wu
2008-01-18  5:41           ` Andi Kleen
2008-01-18  5:41             ` Andi Kleen
2008-01-18  6:01             ` Fengguang Wu
2008-01-18  6:01               ` Fengguang Wu
2008-01-18  6:01                 ` Fengguang Wu
2008-01-18  7:48             ` Mike Waychison
2008-01-18  7:48               ` Mike Waychison
2008-01-18  6:43           ` Michael Rubin
2008-01-18  6:43             ` Michael Rubin
2008-01-18  9:32             ` Fengguang Wu
2008-01-18  9:32               ` Fengguang Wu
2008-01-18  9:32                 ` Fengguang Wu
2008-01-18  5:01       ` David Chinner
2008-01-18  5:01         ` David Chinner
2008-01-18  5:38         ` Michael Rubin
2008-01-18  5:38           ` Michael Rubin
2008-01-18  8:54           ` David Chinner
2008-01-18  8:54             ` David Chinner
2008-01-18  9:26             ` Michael Rubin
2008-01-18  9:26               ` Michael Rubin
2008-01-18  5:41         ` Fengguang Wu
2008-01-18  5:41           ` Fengguang Wu
2008-01-18  5:41             ` Fengguang Wu
2008-01-19  2:50           ` David Chinner
2008-01-19  2:50             ` David Chinner
  -- strict thread matches above, loose matches on Subject: below --
2007-12-13  0:32 Michael Rubin
2007-12-13  0:32 ` Michael Rubin, Michael Rubin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=400632190.14601@ustc.edu.cn \
    --to=wfg@mail.ustc.edu.cn \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mrubin@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.