Re: [PATCH 0/2] pdflush fix and enhancement


From: Andi Kleen <andi@firstfloor.org>
To: "Peter W. Morreale" <pmorreale@novell.com>
Cc: Andi Kleen <andi@firstfloor.org>, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/2] pdflush fix and enhancement
Date: Wed, 31 Dec 2008 14:27:39 +0100
Message-ID: <20081231132738.GS496@one.firstfloor.org>
In-Reply-To: <1230696664.3470.105.camel@hermosa.site>

> I say most because the assumption would be that we will be successful in
> creating the new thread.  Not that bad an assumption I think.  Besides,

And that the memory read is not reordered (rmb()).

> the consequences of a miss are not harmful. 

Nod. Sounds reasonable.

>  
> > 
> > > More to the point, on small systems with few file systems, what is the
> > > point of having 8 (the current max) threads competing with each other on
> > > a single disk?  Likewise, on a 64-way, or larger system with dozens of
> > > filesystems/disks, why wouldn't I want more background flushing?
> > 
> > That makes some sense, but perhaps it would be better to base the default
> > size on the number of underlying block devices then?
> > 
> > Ok one issue there is that there are lots of different types of 
> > block devices, e.g. a big raid array may look like a single disk.
> > Still I suspect defaults based on the block devices would do reasonably
> > well.
> 
> Could be...  However bear in mind that we traverse *filesystems*, not
> block devs with background_writeout() (the pdflush work function). 

My thinking was that on traditional block devices you roughly
want only a small number N of flushers per spindle, because
otherwise they will just seek too much.

Anyway, IIRC there is now a way to distinguish SSDs from normal
block devices based on hints from the block layer, but that still
doesn't handle the big RAID array case well.
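
A minimal sketch of what a block-device-based default could look like,
assuming the block layer exposes a per-device rotational hint. Every name
and constant below (struct blkdev_hint, FLUSHERS_PER_SPINDLE,
FLUSHERS_PER_SSD, the clamp values) is hypothetical and chosen only for
illustration; none of it is an existing kernel interface. A big RAID array
that shows up as one device is still counted as one device here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tuning constants, not taken from any kernel source. */
#define FLUSHERS_PER_SPINDLE 2   /* small N, to avoid seek storms */
#define FLUSHERS_PER_SSD     4   /* no seek penalty, allow more */
#define FLUSHERS_MIN         2
#define FLUSHERS_MAX         32  /* hard cap on the computed default */

/* Hypothetical per-device hint, e.g. derived from the block layer. */
struct blkdev_hint {
	bool rotational;
};

/* Compute a default flusher count from the set of block devices. */
static unsigned int default_flushers(const struct blkdev_hint *devs, size_t n)
{
	unsigned int total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += devs[i].rotational ? FLUSHERS_PER_SPINDLE
					    : FLUSHERS_PER_SSD;

	/* Clamp so tiny and huge systems both get a sane default. */
	if (total < FLUSHERS_MIN)
		total = FLUSHERS_MIN;
	if (total > FLUSHERS_MAX)
		total = FLUSHERS_MAX;
	return total;
}
```

This only picks the initial default; it says nothing about how fast each
device actually is.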

> 
> But even if we did block devices, consider that we still don't know the
> speed of those devices (consider SSD v. raid v. disk) and consequently,
> we don't know how many threads to throw at the device before it becomes
> congested and we're merely spinning our wheels.  I mean, an SSD at
> 500MB/s (or greater) certainly could handle more pages being thrown at
> it than an IDE drive... 

I was thinking just of the initial default, but you're right,
the upper limit really needs to be tuned as well.

> 
> And this ties back to MAX_WRITEBACK_PAGES (currently 1k) which is the
> chunk that we write out in one pass.   In order to not "hold the inode
> lock too long", this is the chunk we attempt to write out.  
> 
> What is the right magic number for the various types of block devs?  1k
> for all?  for all time?  :-)

Ok it probably needs some kind of feedback mechanism.

Perhaps keep an estimate of the average IO time for a single
flush and start more threads when it reaches some threshold?
Or get feedback from the elevators on how busy they are.

Of course it would still need an upper limit to prevent
a thread explosion in case IO suddenly becomes very slow
(e.g. in an error recovery case), but it could be much
higher than today.
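
The feedback idea above can be sketched roughly as follows: keep a running
average of the time a single flush takes, ask for one more thread once the
average crosses a threshold, and clamp the result to a hard upper limit.
This is only an illustration under stated assumptions. struct flush_stats,
the helper names, the 7/8 weighting and all constants are hypothetical and
do not come from the pdflush code.

```c
#include <stddef.h>

/* Hypothetical limits and threshold, for illustration only. */
#define FLUSH_MIN_THREADS 2
#define FLUSH_MAX_THREADS 64      /* cap against a thread explosion */
#define FLUSH_SLOW_USEC   50000UL /* average flush time treated as "slow" */

struct flush_stats {
	unsigned long avg_usec;   /* running average of one flush, in usec */
	unsigned int nr_threads;  /* flusher threads currently running */
};

/* Fold one measured flush time into the average (7/8 old, 1/8 new). */
static void flush_stats_update(struct flush_stats *s, unsigned long sample_usec)
{
	if (s->avg_usec == 0)
		s->avg_usec = sample_usec;  /* first sample seeds the average */
	else
		s->avg_usec = s->avg_usec - s->avg_usec / 8 + sample_usec / 8;
}

/* Decide how many threads we would like, never exceeding the cap. */
static unsigned int flush_threads_wanted(const struct flush_stats *s)
{
	unsigned int want = s->nr_threads;

	if (s->avg_usec > FLUSH_SLOW_USEC)
		want++;  /* flushes are slow on average: add one thread */

	if (want < FLUSH_MIN_THREADS)
		want = FLUSH_MIN_THREADS;
	if (want > FLUSH_MAX_THREADS)
		want = FLUSH_MAX_THREADS;
	return want;
}
```

Growing by one thread per decision keeps the reaction gradual, and the cap
bounds the damage if IO stalls for a long time.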

> 
> Anyway, back to the traversal of filesystems.  In writeback_inodes(), we
> currently traverse the super block list in reverse.  I don't quite
> understand why we do this, but <shrug>.  
> 
> What this does mean is that we unfairly penalize certain file systems when
> attempting to clean dirty pages.  If I have 5 filesystems, all getting
> hit on, then the last one in will always be the 'cleanest'.   Not sure
> that makes sense.  

Probably not.

> 
> I was thinking about a patch that would go both directions - forward and
> reverse depending upon, say, a bit in jiffies...  Certainly not perfect,
> but a bit more fair.  

Better a real RNG. But such probabilistic schemes unfortunately tend to drive
benchmarkers crazy, which is why it is better to avoid them.

I suppose you could just keep some state per fs to ensure fairness.
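
One way to read "state per fs" is a deterministic least-recently-served
pick: remember for each filesystem the pass in which it was last written
back, and always start with the one that has waited longest. The sketch
below assumes that reading. struct fs_state and both helpers are
hypothetical names, not part of writeback_inodes() or the super block list
code.

```c
#include <stddef.h>

/* Hypothetical per-filesystem fairness state. */
struct fs_state {
	const char *name;
	unsigned long last_served;  /* pass number of the last writeback */
};

/*
 * Return the index of the filesystem that was served longest ago.
 * Ties go to the lowest index. The caller must pass n > 0.
 */
static size_t pick_next_fs(const struct fs_state *fs, size_t n)
{
	size_t best = 0;
	size_t i;

	for (i = 1; i < n; i++)
		if (fs[i].last_served < fs[best].last_served)
			best = i;
	return best;
}

/* Pick the next filesystem and record that it was served in this pass. */
static size_t serve_next_fs(struct fs_state *fs, size_t n, unsigned long *pass)
{
	size_t i = pick_next_fs(fs, n);

	fs[i].last_served = ++(*pass);
	return i;
}
```

Unlike a jiffies bit or an RNG, the order is fully reproducible, which
should keep benchmark results stable.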

-Andi
