From: Andi Kleen <andi@firstfloor.org>
To: "Peter W. Morreale" <pmorreale@novell.com>
Cc: Andi Kleen <andi@firstfloor.org>, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/2] pdflush fix and enhancement
Date: Wed, 31 Dec 2008 14:27:39 +0100
Message-ID: <20081231132738.GS496@one.firstfloor.org>
In-Reply-To: <1230696664.3470.105.camel@hermosa.site>

> I say most because the assumption would be that we will be successful in
> creating the new thread.  Not that bad an assumption I think.  Besides,

And that the memory read is not reordered (rmb()).

> the consequences of a miss are not harmful. 

Nod. Sounds reasonable.

>  
> > 
> > > More to the point, on small systems with few file systems, what is the
> > > point of having 8 (the current max) threads competing with each other on
> > > a single disk?  Likewise, on a 64-way, or larger system with dozens of
> > > filesystems/disks, why wouldn't I want more background flushing?
> > 
> > That makes some sense, but perhaps it would be better to base the default
> > size on the number of underlying block devices then?
> > 
> > Ok one issue there is that there are lots of different types of 
> > block devices, e.g. a big raid array may look like a single disk.
> > Still I suspect defaults based on the block devices would do reasonably
> > well.
> 
> Could be...  However bear in mind that we traverse *filesystems*, not
> block devs with background_writeout() (the pdflush work function). 

My thinking was that on traditional block devices you only want
roughly N flushers per spindle, with N a small number, because
otherwise they will just seek too much.

Anyway, IIRC there's now a way to distinguish SSDs from normal
block devices based on hints from the block layer, but that still
doesn't handle the big RAID array case well.
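
Something like the following (just a sketch, not from the patches; the
helper name and the numbers are invented) could use that block-layer
rotational hint to pick a per-device default:

    /*
     * Hypothetical sketch: derive a default flusher count per block device
     * from the queue's rotational hint.  blk_queue_nonrot() is the hint
     * referred to above; the actual counts are made up for illustration.
     */
    #include <linux/blkdev.h>

    static unsigned int default_flushers_for(struct block_device *bdev)
    {
            struct request_queue *q = bdev_get_queue(bdev);

            if (q && blk_queue_nonrot(q))
                    return 4;   /* SSD: seeks are cheap, allow more flushers */

            return 1;           /* rotating disk: extra flushers mostly add seeks */
    }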

> 
> But even if we did block devices, consider that we still don't know the
> speed of those devices (consider SSD v. raid v. disk) and consequently,
> we don't know how many threads to throw at the device before it becomes
> congested and we're merely spinning our wheels.  I mean, an SSD at
> 500MB/s (or greater) certainly could handle more pages being thrown at
> it than an IDE drive... 

I was thinking just of the initial default, but you're right that
it really needs to tune the upper limit too.

> 
> And this ties back to MAX_WRITEBACK_PAGES (currently 1k) which is the
> chunk that we write out in one pass.   In order to not "hold the inode
> lock too long", this is the chunk we attempt to write out.  
> 
> What is the right magic number for the various types of block devs?  1k
> for all?  for all time?  :-)

OK, it probably needs some kind of feedback mechanism.

Perhaps keep an estimate of the average I/O time for a single
flush and start more threads when it crosses some threshold?
Or get feedback from the elevators about how busy they are.

Of course it would still need an upper limit to prevent
a thread explosion in case I/O suddenly becomes very slow
(e.g. in an error recovery case), but it could be much
higher than today's.
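
As a rough illustration of that feedback idea (plain userspace C,
all names and numbers invented for the example): keep an exponential
moving average of the per-flush I/O time and only allow another
flusher thread while the average stays below a threshold, with a hard
cap so slow I/O can never cause a thread explosion.

    #include <stdio.h>

    #define MAX_FLUSHERS    32      /* hard cap, but higher than today's 8 */
    #define SLOW_FLUSH_MS   200     /* above this the device looks congested */

    static unsigned int avg_flush_ms;       /* EMA of recent flush times */

    static void note_flush_time(unsigned int ms)
    {
            /* EMA with weight 1/8: cheap, and smooths out single slow flushes */
            avg_flush_ms = (avg_flush_ms * 7 + ms) / 8;
    }

    static int may_start_flusher(unsigned int nr_running)
    {
            if (nr_running >= MAX_FLUSHERS)
                    return 0;               /* never exceed the hard cap */
            return avg_flush_ms < SLOW_FLUSH_MS;
    }

    int main(void)
    {
            int i;

            note_flush_time(50);
            printf("fast device, start another? %d\n", may_start_flusher(2));

            for (i = 0; i < 5; i++)
                    note_flush_time(800);   /* device got slow (error recovery?) */
            printf("slow device, start another? %d\n", may_start_flusher(2));
            return 0;
    }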

> 
> Anyway, back to the traversal of filesystems.  In writeback_inodes(), we
> currently traverse the super block list in reverse.  I don't quite
> understand why we do this, but <shrug>.  
> 
> What this does mean is that we unfairly penalize certain file systems when
> attempting to clean dirty pages.  If I have 5 filesystems, all getting
> hit on, then the last one in will always be the 'cleanest'.   Not sure
> that makes sense.  

Probably not.

> 
> I was thinking about a patch that would go both directions - forward and
> reverse depending upon, say, a bit in jiffies...  Certainly not perfect,
> but a bit more fair.  

Better a real RNG. But such probabilistic schemes unfortunately tend to drive
benchmarkers crazy, which is why it is better to avoid them.

I suppose you could just keep some state per fs to ensure fairness.
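
For example (just a sketch, everything here is invented for
illustration and not taken from fs/fs-writeback.c): instead of always
walking the super block list from the same end, remember where the
previous pass stopped and start the next pass there, so each
filesystem takes its turn at being serviced first.

    #include <stdio.h>

    #define NR_FS 5

    static unsigned int next_fs;    /* persistent cursor: the state kept across passes */

    static void writeback_pass(void)
    {
            unsigned int i;

            /* start where the previous pass left off instead of always at fs 0 */
            for (i = 0; i < NR_FS; i++)
                    printf("flushing fs %u\n", (next_fs + i) % NR_FS);

            next_fs = (next_fs + 1) % NR_FS;
    }

    int main(void)
    {
            writeback_pass();
            writeback_pass();       /* this pass starts at fs 1, not fs 0 again */
            return 0;
    }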

-Andi
