[RFC] what's going on with file->f_pos uses in orangefs_file_write

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC] what's going on with file->f_pos uses in orangefs_file_write_iter()?
@ 2017-12-07 22:26 Al Viro
  2017-12-08 16:39 ` martin
  0 siblings, 1 reply; 3+ messages in thread
From: Al Viro @ 2017-12-07 22:26 UTC (permalink / raw)
  To: Mike Marshall; +Cc: linux-fsdevel

I'd missed that back then, but...

        if (file->f_pos > i_size_read(file->f_mapping->host))
                orangefs_i_size_write(file->f_mapping->host, file->f_pos);

        rc = generic_write_checks(iocb, iter);

        if (rc <= 0) {
                gossip_err("%s: generic_write_checks failed, rc:%zd:.\n",
                           __func__, rc);
                goto out;
        }

        /*
         * if we are appending, generic_write_checks would have updated
         * pos to the end of the file, so we will wait till now to set
         * pos...
         */
        pos = *(&iocb->ki_pos);

looks suspicious as hell.  What's going on there?  Not to mention anything
else file->f_pos might be completely unrelated to any IO going on -
consider e.g. pwrite(2), where the position (in iocb->ki_pos) has nothing
to do with file->f_pos.  Then there's the question of WTF is write()
(or pwrite()) past the current EOF doing bumping the file size, before
it even gets a chance to decide whether it'll be trying to do any IO at
all.

_Then_ there's the deadlock on 32bit SMP in that code.  Look: several
lines prior we'd done
        inode_lock(file->f_mapping->host);
and hadn't unlocked the sucker since then.  And
static inline void orangefs_i_size_write(struct inode *inode, loff_t i_size)
{
#if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
        inode_lock(inode);
#endif
        i_size_write(inode, i_size);
#if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
        inode_unlock(inode);
#endif
}
means that if we get around to calling it there in SMP/32bit case, we'll
get as plain a deadlock as possible.  And AFAICS it had been that way
since the initial merge.

What the hell is that code about and what is it trying to do?

PS: While we are at it, what's the point of that *(&...) in there?

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [RFC] what's going on with file->f_pos uses in orangefs_file_write_iter()?
  2017-12-07 22:26 [RFC] what's going on with file->f_pos uses in orangefs_file_write_iter()? Al Viro
@ 2017-12-08 16:39 ` martin
  2017-12-12 16:31   ` Mike Marshall
  0 siblings, 1 reply; 3+ messages in thread
From: martin @ 2017-12-08 16:39 UTC (permalink / raw)
  To: Al Viro; +Cc: Mike Marshall, linux-fsdevel

On Thu, Dec 07, 2017 at 10:26:10PM +0000, Al Viro wrote:
> I'd missed that back then, but...
> 
>         if (file->f_pos > i_size_read(file->f_mapping->host))
>                 orangefs_i_size_write(file->f_mapping->host, file->f_pos);
> 
>         rc = generic_write_checks(iocb, iter);
> 
>         if (rc <= 0) {
>                 gossip_err("%s: generic_write_checks failed, rc:%zd:.\n",
>                            __func__, rc);
>                 goto out;
>         }
> 
>         /*
>          * if we are appending, generic_write_checks would have updated
>          * pos to the end of the file, so we will wait till now to set
>          * pos...
>          */
>         pos = *(&iocb->ki_pos);
> 
> looks suspicious as hell.  What's going on there?  Not to mention anything
> else file->f_pos might be completely unrelated to any IO going on -
> consider e.g. pwrite(2), where the position (in iocb->ki_pos) has nothing
> to do with file->f_pos.  Then there's the question of WTF is write()
> (or pwrite()) past the current EOF doing bumping the file size, before
> it even gets a chance to decide whether it'll be trying to do any IO at
> all.

It seems to be a poor attempt at doing what the code immediately above
it does.

http://dev.orangefs.org/trac/orangefs/changeset/4828

"put the i_size update at the beginning instead of the end. The issue is
that generic_write_checks sets the offset to i_size in the case of
append, and if i_pos has changed (so that its > i_size) outside of
write, than we need to update the i_size to reflect that value before
the offset is updated."

Oh lovely, he's just moving it.

http://dev.orangefs.org/trac/orangefs/changeset/4827

"fix for pwrite04 test that does a pwrite with open(O_APPEND).

"fix for f_frsize (fragment size) to be equal to f_bsize. This should fix
newer statfs tools that use frsize as the block size instead of bsize.
Hopefully, it won't break anything that depended on frsize being
hardcoded to 1024."

Which is just a rework of

http://dev.orangefs.org/trac/orangefs/changeset/4723

"(from branch): fix for append bug"

I'm not sure what branch he's talking about, and SVN seems to be
designed so as to make it impossible to find out.

Note that nothing updates i_size after a successful write.  So now we do
the getattr which he claims is ridiculous (it may be cached today though
it probably won't be) anywhere i_size is needed.  In do_readv_writev we
invalidate the cached attribute so as to force an update from the
server.

Well none of this does particular wonders for performance.  None of it
really solves the problem, either.  OrangeFS does not properly support
append writes, so we send a offset equal to the size of the file.  It
completely fails when the file is being written to from another machine
(or even the same machine but not through the kernel).  Without some
work on the OrangeFS server, solving that problem is impossible.

Given that limitation, the best we can hope for is consistency on a
single machine.  We could try to avoid the getattr by updating i_size
then not doing a getattr if within a cache timeout.  That would be more
in line with the getattr cache I've implemented.

I avoided it here because there was already an explicit getattr.  I am
not sure under what circumstances it was added.  I've looked through
Mike's pre-merge code; it's in there as early (3.15) as I can see.  It's
not in the out-of-tree module.  Perhaps Mike remembers why it was added.

https://github.com/hubcapsc/linux/commit/60eb8a4f7fc5931ccec0372483ed405ef9ca9110

> 
> _Then_ there's the deadlock on 32bit SMP in that code.  Look: several
> lines prior we'd done
>         inode_lock(file->f_mapping->host);
> and hadn't unlocked the sucker since then.  And
> static inline void orangefs_i_size_write(struct inode *inode, loff_t i_size)
> {
> #if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
>         inode_lock(inode);
> #endif
>         i_size_write(inode, i_size);
> #if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
>         inode_unlock(inode);
> #endif
> }
> means that if we get around to calling it there in SMP/32bit case, we'll
> get as plain a deadlock as possible.  And AFAICS it had been that way
> since the initial merge.
> 
> What the hell is that code about and what is it trying to do?

Guess what our original commit message is: "Trac 16"

http://dev.orangefs.org/trac/orangefs/changeset/6709

Hardly seems to be any use trying to figure out what motivated it.  I
don't think it was a deadlock then except to the extent that all the
OrangeFS code was a snarl of deadlocks then.  But the obfuscation only
serves to hide it from us.

> 
> PS: While we are at it, what's the point of that *(&...) in there?

https://github.com/hubcapsc/linux/commit/cae140731030471ec4782d65efefe8e819d6a467

Not sure why it's obfuscated but there's another one in
orangefs_file_read_iter.

I intended to send a series that implements pagecache support in
OrangeFS at some point in the future.  It's not quite done, but I
suppose it would do just as well to send it for review sooner rather
than later.  All of this code has been significantly re-written there.

Martin

There is no xfstests regression with the following.

diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index 1668fd645c45..0d228cd087e6 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -452,7 +452,7 @@ ssize_t orangefs_inode_read(struct inode *inode,
 static ssize_t orangefs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct file *file = iocb->ki_filp;
-	loff_t pos = *(&iocb->ki_pos);
+	loff_t pos = iocb->ki_pos;
 	ssize_t rc = 0;

 	BUG_ON(iocb->private);
@@ -492,9 +492,6 @@ static ssize_t orangefs_file_write_iter(struct kiocb *iocb, struct iov_iter *ite
 		}
 	}

-	if (file->f_pos > i_size_read(file->f_mapping->host))
-		orangefs_i_size_write(file->f_mapping->host, file->f_pos);
-
 	rc = generic_write_checks(iocb, iter);

 	if (rc <= 0) {
@@ -508,7 +505,7 @@ static ssize_t orangefs_file_write_iter(struct kiocb *iocb, struct iov_iter *ite
 	 * pos to the end of the file, so we will wait till now to set
 	 * pos...
 	 */
-	pos = *(&iocb->ki_pos);
+	pos = iocb->ki_pos;

 	rc = do_readv_writev(ORANGEFS_IO_WRITE,
 			     file,

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [RFC] what's going on with file->f_pos uses in orangefs_file_write_iter()?
  2017-12-08 16:39 ` martin
@ 2017-12-12 16:31   ` Mike Marshall
  0 siblings, 0 replies; 3+ messages in thread
From: Mike Marshall @ 2017-12-12 16:31 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Al Viro, Martin Brandenburg, Mike Marshall

AL> What's going on there?

Around 3.18 (before we went upstream) I got rid of .aio_read and
.aio_write and "replaced" them with implementations of .read_iter
and .write_iter.

Then, in Linux 4.0, generic_write_checks got iterized too.

-inline int generic_write_checks(struct file *file, loff_t *pos, size_t *count)
+inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)

In my 4.1 rebase I refactored a bunch of un-iterized stuff in file.c and
tried to retain functionality... some code (the "suspicious as hell"
code) I iterized and moved from do_readv_writev to orangefs_file_write_iter.

AL> _Then_ there's the deadlock on 32bit SMP in that code.

Yeah, that doesn't need to be there... I'm a little afraid to say
who put it there in commit 5ecfcb2... <g>

Hopefully Martin's suggested fix will do for now. We'll try to send
it up - I guess next merge window is the right time? In one of the
upcoming merge windows, I hope we have some page-cache related code
to send up that will get rid of all of that...

As far as the cumbersome C expression is concerned... I'm reminded
of this cartoon I drew...

https://sites.google.com/site/hubcapsite3/misc/tinFoilHat.jpg

-Mike

On Fri, Dec 8, 2017 at 11:39 AM,  <martin@omnibond.com> wrote:
> On Thu, Dec 07, 2017 at 10:26:10PM +0000, Al Viro wrote:
>> I'd missed that back then, but...
>>
>>         if (file->f_pos > i_size_read(file->f_mapping->host))
>>                 orangefs_i_size_write(file->f_mapping->host, file->f_pos);
>>
>>         rc = generic_write_checks(iocb, iter);
>>
>>         if (rc <= 0) {
>>                 gossip_err("%s: generic_write_checks failed, rc:%zd:.\n",
>>                            __func__, rc);
>>                 goto out;
>>         }
>>
>>         /*
>>          * if we are appending, generic_write_checks would have updated
>>          * pos to the end of the file, so we will wait till now to set
>>          * pos...
>>          */
>>         pos = *(&iocb->ki_pos);
>>
>> looks suspicious as hell.  What's going on there?  Not to mention anything
>> else file->f_pos might be completely unrelated to any IO going on -
>> consider e.g. pwrite(2), where the position (in iocb->ki_pos) has nothing
>> to do with file->f_pos.  Then there's the question of WTF is write()
>> (or pwrite()) past the current EOF doing bumping the file size, before
>> it even gets a chance to decide whether it'll be trying to do any IO at
>> all.
>
> It seems to be a poor attempt at doing what the code immediately above
> it does.
>
> http://dev.orangefs.org/trac/orangefs/changeset/4828
>
> "put the i_size update at the beginning instead of the end. The issue is
> that generic_write_checks sets the offset to i_size in the case of
> append, and if i_pos has changed (so that its > i_size) outside of
> write, than we need to update the i_size to reflect that value before
> the offset is updated."
>
> Oh lovely, he's just moving it.
>
> http://dev.orangefs.org/trac/orangefs/changeset/4827
>
> "fix for pwrite04 test that does a pwrite with open(O_APPEND).
>
> "fix for f_frsize (fragment size) to be equal to f_bsize. This should fix
> newer statfs tools that use frsize as the block size instead of bsize.
> Hopefully, it won't break anything that depended on frsize being
> hardcoded to 1024."
>
> Which is just a rework of
>
> http://dev.orangefs.org/trac/orangefs/changeset/4723
>
> "(from branch): fix for append bug"
>
> I'm not sure what branch he's talking about, and SVN seems to be
> designed so as to make it impossible to find out.
>
> Note that nothing updates i_size after a successful write.  So now we do
> the getattr which he claims is ridiculous (it may be cached today though
> it probably won't be) anywhere i_size is needed.  In do_readv_writev we
> invalidate the cached attribute so as to force an update from the
> server.
>
> Well none of this does particular wonders for performance.  None of it
> really solves the problem, either.  OrangeFS does not properly support
> append writes, so we send a offset equal to the size of the file.  It
> completely fails when the file is being written to from another machine
> (or even the same machine but not through the kernel).  Without some
> work on the OrangeFS server, solving that problem is impossible.
>
> Given that limitation, the best we can hope for is consistency on a
> single machine.  We could try to avoid the getattr by updating i_size
> then not doing a getattr if within a cache timeout.  That would be more
> in line with the getattr cache I've implemented.
>
> I avoided it here because there was already an explicit getattr.  I am
> not sure under what circumstances it was added.  I've looked through
> Mike's pre-merge code; it's in there as early (3.15) as I can see.  It's
> not in the out-of-tree module.  Perhaps Mike remembers why it was added.
>
> https://github.com/hubcapsc/linux/commit/60eb8a4f7fc5931ccec0372483ed405ef9ca9110
>
>>
>> _Then_ there's the deadlock on 32bit SMP in that code.  Look: several
>> lines prior we'd done
>>         inode_lock(file->f_mapping->host);
>> and hadn't unlocked the sucker since then.  And
>> static inline void orangefs_i_size_write(struct inode *inode, loff_t i_size)
>> {
>> #if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
>>         inode_lock(inode);
>> #endif
>>         i_size_write(inode, i_size);
>> #if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
>>         inode_unlock(inode);
>> #endif
>> }
>> means that if we get around to calling it there in SMP/32bit case, we'll
>> get as plain a deadlock as possible.  And AFAICS it had been that way
>> since the initial merge.
>>
>> What the hell is that code about and what is it trying to do?
>
> Guess what our original commit message is: "Trac 16"
>
> http://dev.orangefs.org/trac/orangefs/changeset/6709
>
> Hardly seems to be any use trying to figure out what motivated it.  I
> don't think it was a deadlock then except to the extent that all the
> OrangeFS code was a snarl of deadlocks then.  But the obfuscation only
> serves to hide it from us.
>
>>
>> PS: While we are at it, what's the point of that *(&...) in there?
>
> https://github.com/hubcapsc/linux/commit/cae140731030471ec4782d65efefe8e819d6a467
>
> Not sure why it's obfuscated but there's another one in
> orangefs_file_read_iter.
>
> I intended to send a series that implements pagecache support in
> OrangeFS at some point in the future.  It's not quite done, but I
> suppose it would do just as well to send it for review sooner rather
> than later.  All of this code has been significantly re-written there.
>
> Martin
>
> There is no xfstests regression with the following.
>
> diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
> index 1668fd645c45..0d228cd087e6 100644
> --- a/fs/orangefs/file.c
> +++ b/fs/orangefs/file.c
> @@ -452,7 +452,7 @@ ssize_t orangefs_inode_read(struct inode *inode,
>  static ssize_t orangefs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>  {
>         struct file *file = iocb->ki_filp;
> -       loff_t pos = *(&iocb->ki_pos);
> +       loff_t pos = iocb->ki_pos;
>         ssize_t rc = 0;
>
>         BUG_ON(iocb->private);
> @@ -492,9 +492,6 @@ static ssize_t orangefs_file_write_iter(struct kiocb *iocb, struct iov_iter *ite
>                 }
>         }
>
> -       if (file->f_pos > i_size_read(file->f_mapping->host))
> -               orangefs_i_size_write(file->f_mapping->host, file->f_pos);
> -
>         rc = generic_write_checks(iocb, iter);
>
>         if (rc <= 0) {
> @@ -508,7 +505,7 @@ static ssize_t orangefs_file_write_iter(struct kiocb *iocb, struct iov_iter *ite
>          * pos to the end of the file, so we will wait till now to set
>          * pos...
>          */
> -       pos = *(&iocb->ki_pos);
> +       pos = iocb->ki_pos;
>
>         rc = do_readv_writev(ORANGEFS_IO_WRITE,
>                              file,

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2017-12-12 16:31 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-12-07 22:26 [RFC] what's going on with file->f_pos uses in orangefs_file_write_iter()? Al Viro
2017-12-08 16:39 ` martin
2017-12-12 16:31   ` Mike Marshall

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).