* [Lustre-devel] proposal on implementing a new readahead in clio
@ 2010-01-20 12:37 jay
2010-01-22 10:53 ` jay
2010-01-26 10:02 ` Alex Zhuravlev
0 siblings, 2 replies; 11+ messages in thread
From: jay @ 2010-01-20 12:37 UTC (permalink / raw)
To: lustre-devel
Hello,
We have discussed the implementation of a new readahead in CLIO. I am
sending out the design document here to ask for comments.
We have already had input from Z (aka bzzz) and other engineers.
I'm not going to copy those ideas here directly, because I'm afraid of
distorting them, so please reply to this email with your ideas -
thanks in advance.
Jay
--
Good good study, day day up
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ra.tar
Type: application/x-tar
Size: 194560 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20100120/d07ce14d/attachment.tar>
* [Lustre-devel] proposal on implementing a new readahead in clio
2010-01-20 12:37 [Lustre-devel] proposal on implementing a new readahead in clio jay
@ 2010-01-22 10:53 ` jay
2010-01-23 7:09 ` Alexey Lyashkov
2010-01-26 10:02 ` Alex Zhuravlev
1 sibling, 1 reply; 11+ messages in thread
From: jay @ 2010-01-22 10:53 UTC (permalink / raw)
To: lustre-devel
UP!!
Here is a PDF version of the design document. I'm also attaching a picture,
because the pictures in the PDF are not clear.
Thanks,
Jay
--
Good good study, day day up
-------------- next part --------------
A non-text attachment was scrubbed...
Name: readahead.pdf
Type: application/pdf
Size: 198695 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20100122/9c461a12/attachment.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lazy-readahead-1.jpg
Type: image/jpeg
Size: 142721 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20100122/9c461a12/attachment.jpg>
* [Lustre-devel] proposal on implementing a new readahead in clio
2010-01-22 10:53 ` jay
@ 2010-01-23 7:09 ` Alexey Lyashkov
2010-01-24 1:01 ` jay
0 siblings, 1 reply; 11+ messages in thread
From: Alexey Lyashkov @ 2010-01-23 7:09 UTC (permalink / raw)
To: lustre-devel
>>
We have an idea to spawn a per file readahead thread for each process,
and this thread can be used to issue the readahead RPC async.
>>
Do I understand correctly: you suggest spawning one new thread per open
file?
So if a client has 10 processes, and each process has 100 files open, you
need to spawn 1000 new threads?
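To make the arithmetic in this question concrete, here is a minimal sketch that counts readahead threads under the schemes being discussed. The function name and the scheme labels are hypothetical, chosen only for illustration; they do not come from the design document.

```python
def readahead_threads(processes, files_per_process, scheme, pool_size=0):
    """Count readahead threads for a given (hypothetical) threading scheme."""
    if scheme == "per_open_file":
        # One thread for every file opened by every process.
        return processes * files_per_process
    if scheme == "per_process":
        # One thread per process, shared by all of its open files.
        return processes
    if scheme == "fixed_pool":
        # A system-wide pool whose size does not depend on the workload.
        return pool_size
    raise ValueError(f"unknown scheme: {scheme}")
```

With 10 processes and 100 open files each, the per-open-file scheme gives the 1000 threads mentioned above, while a per-process scheme would need only 10.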
--
Alexey Lyashkov <alexey.lyashkov@clusterstor.com>
ClusterStor
* [Lustre-devel] proposal on implementing a new readahead in clio
2010-01-23 7:09 ` Alexey Lyashkov
@ 2010-01-24 1:01 ` jay
2010-01-24 9:18 ` Alexey Lyashkov
2010-01-25 4:05 ` Nicolas Williams
0 siblings, 2 replies; 11+ messages in thread
From: jay @ 2010-01-24 1:01 UTC (permalink / raw)
To: lustre-devel
Alexey Lyashkov wrote:
> We have an idea to spawn a per file readahead thread for each process,
> and this thread can be used to issue the readahead RPC async.
>
> I correctly understand: you suggest a spawn one new thread per open
> file?
> so if client have 10 processes, and each process is open 100 files, you
> need spawn 1000 new threads?
>
No: one readahead thread per process, or a system-wide readahead thread
pool. This is because most of those threads are sleeping, and it takes
little time to issue readahead requests. The idea behind the scheme is to
issue readahead RPCs asynchronously.
BTW, I'm not going to implement what you mentioned on Linux, because I
don't think it is a good idea, as I said in the design doc. However,
we HAVE to have an async thread pool to implement readahead on Windows.
Windows doesn't have an interface for issuing async read requests, and it
lacks a page-lock mechanism or anything similar - what a pity!
Jay
--
Good good study, day day up
* [Lustre-devel] proposal on implementing a new readahead in clio
2010-01-24 1:01 ` jay
@ 2010-01-24 9:18 ` Alexey Lyashkov
2010-01-25 6:17 ` jay
2010-01-25 4:05 ` Nicolas Williams
1 sibling, 1 reply; 11+ messages in thread
From: Alexey Lyashkov @ 2010-01-24 9:18 UTC (permalink / raw)
To: lustre-devel
On Sun, 2010-01-24 at 09:01 +0800, jay wrote:
> Alexey Lyashkov wrote:
> > We have an idea to spawn a per file readahead thread for each process,
> > and this thread can be used to issue the readahead RPC async.
> >
> > I correctly understand: you suggest a spawn one new thread per open
> > file?
> > so if client have 10 processes, and each process is open 100 files, you
> > need spawn 1000 new threads?
> >
> No, per process readahead, or some system readahead thread pool, this is
> because most of those threads are sleeping, and it consumes little time
> to issue readahead requests. The idea behind the scheme is to issue
> readahead rpcs async.
The first case is the same as what I described (I think): 10 processes each
reading from their own files, so 1000 new threads will be spawned.
In the second case you will lose readahead requests on a heavily loaded client.
>
> BTW, I'm not going to implement what you mentioned in linux, because I
> don't think this is a good idea, as what I said in design doc. However,
> we HAVE to have an async thread pool to implement readahead for windows.
> Windows doesn't have an interface of issuing async read request, lack of
> a mechanism to have page lock or similar things - what a pity!
Hm, it looks like I don't understand the problem. Currently the Linux client
uses ->readpage() to generate an OST_READ RPC and sends it via ptlrpcd-io.
Why not generate this RPC directly on Windows? Or do you mean
updating the VM cache asynchronously?
--
Alexey Lyashkov <alexey.lyashkov@clusterstor.com>
ClusterStor
* [Lustre-devel] proposal on implementing a new readahead in clio
2010-01-24 1:01 ` jay
2010-01-24 9:18 ` Alexey Lyashkov
@ 2010-01-25 4:05 ` Nicolas Williams
2010-01-25 6:55 ` Matt Wu
1 sibling, 1 reply; 11+ messages in thread
From: Nicolas Williams @ 2010-01-25 4:05 UTC (permalink / raw)
To: lustre-devel
On Sun, Jan 24, 2010 at 09:01:46AM +0800, jay wrote:
> Alexey Lyashkov wrote:
> > I correctly understand: you suggest a spawn one new thread per open
> > file?
> > so if client have 10 processes, and each process is open 100 files, you
> > need spawn 1000 new threads?
> >
> No, per process readahead, or some system readahead thread pool, this is
> because most of those threads are sleeping, and it consumes little time
> to issue readahead requests. The idea behind the scheme is to issue
> readahead rpcs async.
Sleeping threads do consume memory resources, and context switches
between them do add cache pressure. The read ahead work should all be
async, in which case you need no more readahead threads than you have
CPUs.
> BTW, I'm not going to implement what you mentioned in linux, because I
> don't think this is a good idea, as what I said in design doc. However,
> we HAVE to have an async thread pool to implement readahead for windows.
> Windows doesn't have an interface of issuing async read request, lack of
> a mechanism to have page lock or similar things - what a pity!
But surely you can still do the readaheads asynchronously. Say you
think that block N of some file will be needed soon: so you issue the
read ahead of time. You'll need to place the data somewhere, and
hopefully that will be somewhere that the host OS's VFS sub-system
(Windows in your case) can either provide or accept -- if not you'll
need to do a copy later, but you're still able to send the read request,
and process the reply, asynchronously.
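The flow described above (issue the read early, park the data somewhere, copy it out later) could be sketched roughly as follows. This is only an illustration in Python with asyncio, not Lustre or Windows code: fetch_block(), ReadaheadBuffer and BLOCK_SIZE are hypothetical names, and the "RPC" is simulated.

```python
import asyncio

BLOCK_SIZE = 4096  # hypothetical block size, for illustration only


async def fetch_block(n):
    """Stand-in for the read RPC; a real client would go to the network."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return bytes([n % 256]) * BLOCK_SIZE


class ReadaheadBuffer:
    """Holds reads issued ahead of time until the caller asks for them."""

    def __init__(self):
        self._inflight = {}  # block number -> asyncio.Task

    def prefetch(self, n):
        # Send the request now; do not wait for the reply.
        if n not in self._inflight:
            self._inflight[n] = asyncio.create_task(fetch_block(n))

    async def read(self, n, dest):
        task = self._inflight.pop(n, None)
        if task is None:
            data = await fetch_block(n)  # miss: nothing was read ahead
        else:
            data = await task            # hit: reply may already be here
        dest[:len(data)] = data          # the "copy later" step
        return len(data)


async def demo():
    buf = ReadaheadBuffer()
    buf.prefetch(7)                      # we guess block 7 is needed soon
    dest = bytearray(BLOCK_SIZE)
    copied = await buf.read(7, dest)
    return copied, dest[0]
```

Note that no extra thread is involved: one event loop sends the request and processes the reply, and the only cost of not having an OS-provided page to read into is the final copy.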
Nico
--
* [Lustre-devel] proposal on implementing a new readahead in clio
2010-01-24 9:18 ` Alexey Lyashkov
@ 2010-01-25 6:17 ` jay
0 siblings, 0 replies; 11+ messages in thread
From: jay @ 2010-01-25 6:17 UTC (permalink / raw)
To: lustre-devel
Nico and shadow,
Since you both have the same question about Windows, I am replying to both
of you in one email. I have also got Matt involved - he is a Windows expert.
Alexey Lyashkov wrote:
> On Sun, 2010-01-24 at 09:01 +0800, jay wrote:
>
>> Alexey Lyashkov wrote:
>>
>>> We have an idea to spawn a per file readahead thread for each process,
>>> and this thread can be used to issue the readahead RPC async.
>>>
>>> I correctly understand: you suggest a spawn one new thread per open
>>> file?
>>> so if client have 10 processes, and each process is open 100 files, you
>>> need spawn 1000 new threads?
>>>
>>>
>> No, per process readahead, or some system readahead thread pool, this is
>> because most of those threads are sleeping, and it consumes little time
>> to issue readahead requests. The idea behind the scheme is to issue
>> readahead rpcs async.
>>
> first case is same as i say (i think) - 10 processes reading from own
> files, so will be spawn 1000 new threads.
> in second case you will be lost readahead requests on hardloaded client.
>
Nod - that's why I'm not going to do it on Linux, as the design doc
says - did you see section 8.1? :-)
>
>> BTW, I'm not going to implement what you mentioned in linux, because I
>> don't think this is a good idea, as what I said in design doc. However,
>> we HAVE to have an async thread pool to implement readahead for windows.
>> Windows doesn't have an interface of issuing async read request, lack of
>> a mechanism to have page lock or similar things - what a pity!
>>
> hm.. looks i don't understand problem. Currently linux client is using
> ->readpage() to generate OST_READ RPC and sending via ptlrpcd-io.
> Why isn't generate this RPC directly for Windows? Or you mean about
> update asynchronous update VM cache ?
>
The problem is that we have to wait for the RPC (which may contain only
readahead pages) to finish before we can return to user space. You
may ask why we can't do this; the answer is that we should pipeline the
readahead requests, instead of reading a whole chunk of data to support
readahead.
The problem with Windows is that it lacks interfaces to manipulate
pages. I'm not a Windows expert - please ask Matt if you have
Windows-specific questions.
>
>
--
Good good study, day day up
* [Lustre-devel] proposal on implementing a new readahead in clio
2010-01-25 4:05 ` Nicolas Williams
@ 2010-01-25 6:55 ` Matt Wu
2010-01-25 7:23 ` Andreas Dilger
0 siblings, 1 reply; 11+ messages in thread
From: Matt Wu @ 2010-01-25 6:55 UTC (permalink / raw)
To: lustre-devel
We need to do readahead asynchronously, but the Windows kernel doesn't give
us an easy solution. Here are the issues for Windows readahead:
1, The Windows kernel (VM) doesn't provide kernel drivers with an equivalent
of grab_cache_page_nowait_gfp() to allocate an empty/invalid page. So in
ll_readpage(), it's too late for WNC to grab more pages for readahead.
2, The routines provided by the Windows kernel to allocate page cache are
synchronous, and they won't return until the requested pages are fetched.
So we plan to start a thread pool and dispatch the readahead requests to
these threads instead of blocking the user thread.
We can group the threads in several ways:
1, request per random thread, without any specific order: we just start a
fixed number of threads and queue each readahead request to any thread in
the thread pool.
This is the decision we made during the WNC readahead meeting last week.
2, thread per file (file) or thread per open instance (fd)
3, thread per OST: we need to divide the readahead request into several
requests that are aligned to stripe boundaries.
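As a rough illustration of options 1 and 3 above, the sketch below splits a readahead request on stripe boundaries and hands the pieces to a fixed number of worker threads through a shared queue. It is plain Python, not WNC code; split_on_stripes(), run_pool() and the handler argument are hypothetical names introduced here.

```python
import queue
import threading


def split_on_stripes(offset, length, stripe_size):
    """Split [offset, offset + length) so no piece crosses a stripe boundary."""
    chunks = []
    end = offset + length
    while offset < end:
        boundary = (offset // stripe_size + 1) * stripe_size
        chunk_end = min(boundary, end)
        chunks.append((offset, chunk_end - offset))
        offset = chunk_end
    return chunks


def run_pool(requests, num_threads, handler):
    """Queue requests to a fixed pool; any idle thread takes the next one."""
    work = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:       # sentinel: no more work for this thread
                break
            outcome = handler(item)
            with lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for req in requests:
        work.put(req)
    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    return results
```

For example, a request at offset 3 of length 10 with a stripe size of 4 becomes four pieces, (3, 1), (4, 4), (8, 4) and (12, 1), and with option 1 it does not matter which pool thread picks up which piece.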
regards,
matt
* [Lustre-devel] proposal on implementing a new readahead in clio
2010-01-25 6:55 ` Matt Wu
@ 2010-01-25 7:23 ` Andreas Dilger
2010-01-25 15:34 ` Nicolas Williams
0 siblings, 1 reply; 11+ messages in thread
From: Andreas Dilger @ 2010-01-25 7:23 UTC (permalink / raw)
To: lustre-devel
On 2010-01-24, at 23:55, Matt Wu wrote:
> We can group the threads by several ways:
> 1, request per random thread, without any specify order. we just start a
> fixed number of threads and queue the readahead request to any thread of
> the thread pool.
> this is the decision we made during WNC readahead meeting last week.
> 2, thread per file (file) or thread per open instance (fd)
> 3, thread per ost, we need divide the readahead request to several which
> are stripe boundary aligned.
In order to keep the readahead pages local to the NUMA node that the
userspace thread is running on, I'd recommend at most a single
readahead thread per core. That way, when the readahead thread is
allocating pages they will be on the right NUMA node.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
* [Lustre-devel] proposal on implementing a new readahead in clio
2010-01-25 7:23 ` Andreas Dilger
@ 2010-01-25 15:34 ` Nicolas Williams
0 siblings, 0 replies; 11+ messages in thread
From: Nicolas Williams @ 2010-01-25 15:34 UTC (permalink / raw)
To: lustre-devel
On Mon, Jan 25, 2010 at 12:23:03AM -0700, Andreas Dilger wrote:
> On 2010-01-24, at 23:55, Matt Wu wrote:
> >We can group the threads by several ways:
> >1, request per random thread, without any specify order. we just
> >start a fixed number of threads and queue the readahead request to
> >any thread of the thread pool. this is the decision we made during
> >WNC readahead meeting last week.
> >2, thread per file (file) or thread per open instance (fd)
> >3, thread per ost, we need divide the readahead request to several
> >which are stripe boundary aligned.
>
> In order to keep the readahead pages local to the NUMA node that the
> userspace thread is running on, I'd recommend at most a single
> readahead thread per core. That way, when the readahead thread is
> allocating pages they will be on the right NUMA node.
That was my recommendation as well, but if I understand Matt correctly,
the Windows VFS makes it impossible to do readahead asynchronously,
which is why Matt suggests having many threads. I have no clue as to
the relevant Windows kernel APIs, but if Matt's right about Windows,
then color me surprised. Assuming that's correct and that there's no
reasonable way around the problem, then I'd recommend having a pool with
some number of threads (say, 3 * CPUs), with readaheads done only when
there are threads available in the pool.
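A minimal sketch of that fallback, assuming a cap of 3 * CPUs concurrent readahead workers and simply skipping a readahead when none is free. It is illustrative Python rather than kernel code; OpportunisticReadahead and try_submit() are hypothetical names, and a slot counter with short-lived threads stands in for a real pool.

```python
import os
import threading


class OpportunisticReadahead:
    """Run readahead only while a worker slot is free; otherwise skip it."""

    def __init__(self, max_threads=None):
        if max_threads is None:
            max_threads = 3 * (os.cpu_count() or 1)  # the "3 * CPUs" guess
        self._slots = threading.BoundedSemaphore(max_threads)
        self.dropped = 0  # readaheads skipped because every slot was busy

    def try_submit(self, fn, *args):
        # Never block the caller: readahead is only a hint, so drop it
        # when no worker is available.
        if not self._slots.acquire(blocking=False):
            self.dropped += 1
            return None

        def run():
            try:
                fn(*args)
            finally:
                self._slots.release()

        worker = threading.Thread(target=run, daemon=True)
        worker.start()
        return worker
```

The trade-off is the one raised earlier in the thread: on a heavily loaded client some readahead requests are lost, but the reading process is never delayed by them.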
Nico
--
* [Lustre-devel] proposal on implementing a new readahead in clio
2010-01-20 12:37 [Lustre-devel] proposal on implementing a new readahead in clio jay
2010-01-22 10:53 ` jay
@ 2010-01-26 10:02 ` Alex Zhuravlev
1 sibling, 0 replies; 11+ messages in thread
From: Alex Zhuravlev @ 2010-01-26 10:02 UTC (permalink / raw)
To: lustre-devel
Hi,
I think we could help a lot more if you restructured the proposal a bit:
first of all, describe the algorithm without implementation details,
probably using the notion of events: a read event extending the window,
I/O to get data ahead, readahead I/O completion, hit/miss, etc.
Then map these events to specific code paths, and explain what kind of
information or mechanisms the layers are missing to implement the algorithm.
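To show what such an event-based description might look like, here is a small sketch of a readahead window driven by two events, a read and a readahead I/O completion. It is an assumption made for illustration and not the algorithm from the design document: the doubling policy, the default sizes and the ReadaheadWindow name are all hypothetical.

```python
class ReadaheadWindow:
    """Toy readahead window driven by read and I/O-completion events."""

    def __init__(self, initial=4, maximum=32):
        self.initial = initial
        self.maximum = maximum
        self.size = initial        # current window size, in pages
        self.next_expected = None  # page a sequential reader would ask for
        self.cached = set()        # pages whose readahead I/O completed
        self.inflight = set()      # pages requested but not yet completed
        self.hits = 0
        self.misses = 0

    def on_read(self, page):
        """Read event: record hit/miss, resize the window, pick pages to fetch."""
        if page in self.cached:
            self.hits += 1
        else:
            self.misses += 1
        if page == self.next_expected:
            # Sequential read: extend the window (assumed doubling policy).
            self.size = min(self.size * 2, self.maximum)
        else:
            # Non-sequential read: fall back to the initial window.
            self.size = self.initial
        self.next_expected = page + 1
        wanted = range(page + 1, page + 1 + self.size)
        to_issue = [p for p in wanted
                    if p not in self.cached and p not in self.inflight]
        self.inflight.update(to_issue)
        return to_issue

    def on_io_complete(self, pages):
        """Readahead I/O completion event: pages become cache hits."""
        for p in pages:
            self.inflight.discard(p)
            self.cached.add(p)
```

Each method corresponds to one event, so mapping the algorithm onto code paths becomes a question of which layer calls on_read() and which one calls on_io_complete().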
z.