From: "Dilger, Andreas" <andreas.dilger@intel.com>
To: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"Drokin, Oleg" <oleg.drokin@intel.com>,
	Peng Tao <bergwolf@gmail.com>, "greg@kroah.com" <greg@kroah.com>,
	"Xiong, Jinshan" <jinshan.xiong@intel.com>
Subject: Re: RFC: [PATCH] staging/lustre/llite: fix O_TMPFILE/O_LOV_DELAY_CREATE conflict
Date: Tue, 11 Feb 2014 00:31:39 +0000	[thread overview]
Message-ID: <CF1EB9B4.913BD%andreas.dilger@intel.com> (raw)
In-Reply-To: <20140210225130.GH18016@ZenIV.linux.org.uk>

On 2014/02/10, 3:51 PM, "Al Viro" <viro@ZenIV.linux.org.uk> wrote:
>
>Egads.  Correct me if I'm wrong, please, but it looks like you have
>
>1) ->aio_write() stash its arguments in an off-stack struct and pass it
>and &iocb->ki_pos to
>
>2) ll_file_io_generic() which calls
>
>3) cl_io_rw_init() which copies the ki_pos value passed to it into
>io->u.ci_rw.crw_pos (in another off-stack struct) and calls

The off-stack storage of per-thread info is used to avoid stack overflow
under the 4kB stack limitation, and to avoid kmalloc/kfree of the same
data structures on every call into the IO path.  Other code in the
kernel handles similar problems by having a separate thread do the
lower-level IO, but that adds context switches to the IO path.  I'm not
a big fan of this either, but it is what we live with given the limited
stack size.
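
To make that concrete, the pattern looks roughly like this (an
illustrative sketch with made-up names like ll_io_state, not the
actual Lustre structures): the bulky state is allocated once and
cached, here in file->private_data, so each write neither burns stack
nor pays a kmalloc/kfree round trip.

#include <linux/fs.h>
#include <linux/types.h>
#include <linux/uio.h>

/* Illustrative per-thread/per-file IO state, kept off the 4kB stack. */
struct ll_io_state {
	struct kiocb		*lis_iocb;	/* stashed ->aio_write() args */
	const struct iovec	*lis_iov;
	unsigned long		 lis_nr_segs;
	loff_t			 lis_pos;
	u8			 lis_lower[512];	/* lower-layer scratch */
};

static ssize_t ll_file_io_generic_sketch(struct ll_io_state *lis);

static ssize_t ll_aio_write_sketch(struct kiocb *iocb, const struct iovec *iov,
				   unsigned long nr_segs, loff_t pos)
{
	/* Allocated at open time and reused for every IO on this file. */
	struct ll_io_state *lis = iocb->ki_filp->private_data;

	lis->lis_iocb	 = iocb;
	lis->lis_iov	 = iov;
	lis->lis_nr_segs = nr_segs;
	lis->lis_pos	 = iocb->ki_pos;
	return ll_file_io_generic_sketch(lis);	/* hypothetical helper */
}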

>4) cl_io_init() which calls
>
>5) cl_io_init0() which possibly calls a bunch of instances of
>->coo_io_init(),
>which may or may not return 0; I _hope_ that in this case that's what'll
>happen and we get back to ll_file_io_generic(), where
>
>3') we stash iocb and iov/iovlen into the third off-stack structure (cio)
>and call

The reason for separate IO structures at different parts of the stack is
that the Lustre client code has different upper-level VFS interfaces for
Linux, MacOS, WinNT, and a userspace library.  The kernel iocb is not
available in all of those cases.

Those non-Linux VFS interfaces are deprecated, and we are looking to
remove these abstractions, but that will take time.
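
Concretely, the split looks something like this (illustrative names
except crw_pos, which is the field under discussion): the core
descriptor stays OS-neutral with no struct kiocb in sight, and a
Linux-only wrapper carries the kernel objects that the other front
ends cannot provide.

#include <linux/fs.h>
#include <linux/uio.h>

/* OS-neutral core: shared by the Linux, MacOS, WinNT, and userspace
 * front ends, so it may only use plain offsets and counts. */
struct crw_sketch {
	loff_t	crw_pos;	/* file position for this IO */
	size_t	crw_count;	/* bytes requested */
};

/* Linux-only glue around the neutral core. */
struct linux_io_sketch {
	struct crw_sketch	 lio_rw;	/* embedded neutral part */
	struct kiocb		*lio_iocb;	/* Linux-specific objects... */
	const struct iovec	*lio_iov;	/* ...the other ports lack */
	unsigned long		 lio_nr_segs;
};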

>4') cl_io_loop() where we (in a loop) call cl_io_iter_init(),
>cl_io_lock() and
>
>5') cl_io_start() which calls a bunch (how large in that case?) of
>
>6') ->cio_start() instances.  Hopefully that'll be vvp_io_write_start()
>which will pass the value of io->u.ci_rw.crw_pos (picked via an
>overlapping field of union) to generic_file_aio_write().  Which, BTW,
> updates ->ki_pos. Then we return into cl_io_loop(), where
>
>4'') we call cl_io_end(), cl_io_unlock() and
>
>5'') cl_io_rw_advance() which increments ....crw_pos, hopefully in sync
>with what we have in iocb->ki_pos.  And calls a bunch of methods.  And
>return into cl_io_loop(), where
>
>4''') we call cl_io_iter_fini() (_another_ pile of methods called) and
> possibly repeat everything from (4') on (apparently only if nothing had
> been done so far).  Eventually we return into ll_file_io_generic() and
>there
>
>3''') we copy ....crw_pos into iocb->ki_pos.  WTF do we need that?  Hadn't
>generic_file_aio_write() been good enough?
>
>Is that correct?  I have *not* traced it into all methods that might've
>been called in process - stuff called from cl_io_loop() is chock-full of
>those.  Have I missed anything relevant wrt file position handling in
>there?

This internal separation of the IO submission path was done to allow
parallel IO generation for multiple target devices and parallel RPC
generation.  Pretty much everyone (except the original author, who no
longer works on Lustre) agrees that the level of abstraction is too
much, and we need to simplify it so that more than a handful of people
can even understand it.
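
For what it's worth, the loop you traced condenses to roughly the
following shape (a paraphrase with the error handling and per-layer
method fan-out elided, not the verbatim code; the primed numbers match
your steps above):

/* Condensed paraphrase of the cl_io_loop() structure traced above. */
static int cl_io_loop_sketch(const struct lu_env *env, struct cl_io *io)
{
	int result;

	do {
		size_t nob;

		io->ci_continue = 0;
		result = cl_io_iter_init(env, io);		/* (4') */
		if (result == 0) {
			nob = io->ci_nob;
			result = cl_io_lock(env, io);		/* (4') */
			if (result == 0) {
				/* (5') fans out to the ->cio_start() methods (6') */
				result = cl_io_start(env, io);
				cl_io_end(env, io);		/* (4'') */
				cl_io_unlock(env, io);		/* (4'') */
				/* (5'') advances crw_pos by the bytes done */
				cl_io_rw_advance(env, io, io->ci_nob - nob);
			}
		}
		cl_io_iter_fini(env, io);	/* (4'''), then maybe repeat */
	} while (result == 0 && io->ci_continue);

	return result;
}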

>You guys really should be forced to hand-draw a call graph for that thing.
>Both as a punishment and a deterrent...

Believe me, this is one of my least favourite and most complex parts
of the Lustre code, and we've had many internal discussions and plans
to clean it up.  Some of that cleanup work has already started, and more
is planned, but it will take a while to test it and push it upstream.

Cheers, Andreas
-- 
Andreas Dilger

Lustre Software Architect
Intel High Performance Data Division


