All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Yan, Zheng" <zheng.z.yan@intel.com>
To: Gregory Farnum <greg@inktank.com>
Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>,
	Sage Weil <sage@inktank.com>
Subject: Re: [PATCH 6/7] ceph: don't early drop Fw cap
Date: Tue, 05 Mar 2013 09:57:18 +0800	[thread overview]
Message-ID: <513550FE.1020401@intel.com> (raw)
In-Reply-To: <CAPYLRziF-BTvksCM3LDfdwvjrLWXQhSdoF+PB3XJAd7KJp4cmw@mail.gmail.com>

On 03/05/2013 02:26 AM, Gregory Farnum wrote:
> On Thu, Feb 28, 2013 at 10:46 PM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>>
>> ceph_aio_write() has an optimization that marks CEPH_CAP_FILE_WR
>> cap dirty before data is copied to page cache and inode size is
>> updated. The optimization avoids slow cap revocation caused by
>> balance_dirty_pages(), but introduces inode size update race. If
>> ceph_check_caps() flushes the dirty cap before the inode size is
>> updated, MDS can miss the new inode size. So just remove the
>> optimization.
>>
>> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
>> ---
>>  fs/ceph/file.c | 42 +++++++++++++++++-------------------------
>>  1 file changed, 17 insertions(+), 25 deletions(-)
>>
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index a949805..28ef273 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -724,9 +724,12 @@ static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
>>         if (ceph_snap(inode) != CEPH_NOSNAP)
>>                 return -EROFS;
>>
>> +       sb_start_write(inode->i_sb);
>>  retry_snap:
>> -       if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
>> -               return -ENOSPC;
>> +       if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL)) {
>> +               ret = -ENOSPC;
>> +               goto out;
>> +       }
>>         __ceph_do_pending_vmtruncate(inode);
>>         dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
>>              inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
>> @@ -750,29 +753,10 @@ retry_snap:
>>                 ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
>>                         &iocb->ki_pos);
>>         } else {
>> -               /*
>> -                * buffered write; drop Fw early to avoid slow
>> -                * revocation if we get stuck on balance_dirty_pages
>> -                */
>> -               int dirty;
>> -
>> -               spin_lock(&ci->i_ceph_lock);
>> -               dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
>> -               spin_unlock(&ci->i_ceph_lock);
>> -               ceph_put_cap_refs(ci, got);
>> -
>> -               ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
>> -               if ((ret >= 0 || ret == -EIOCBQUEUED) &&
>> -                   ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host)
>> -                    || ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
>> -                       err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
>> -                       if (err < 0)
>> -                               ret = err;
>> -               }
>> -
>> -               if (dirty)
>> -                       __mark_inode_dirty(inode, dirty);
>> -               goto out;
>> +               mutex_lock(&inode->i_mutex);
>> +               ret = __generic_file_aio_write(iocb, iov, nr_segs,
>> +                                              &iocb->ki_pos);
>> +               mutex_unlock(&inode->i_mutex);
> 
> Hmm, you're here passing in a different value than the removed
> generic_file_aio_write() call did — &iocb->ki_pos instead of pos.
> Everything else is using the pos parameter so I rather expect that
> should still be used here?

They always have the same value, see the BUG_ON in generic_file_aio_write()

> Also a quick skim of the interfaces makes me think that the two
> versions aren't interchangeable — __generic_file_aio_write() also
> handles O_SYNC in addition to grabbing i_mutex. Why'd you switch them?

ceph has its own code that handles O_SYNC case. I want to make sb_start_write()
covers ceph_sync_write(), that's the reason I use __generic_file_aio_write() here.

Regards
Yan, Zheng

> (I haven't checked deep enough to see if the unlocked version is
> correct or not, but it does say that's what most filesystems should
> use.)
> -Greg
> 
> 
>>         }
>>
>>         if (ret >= 0) {
>> @@ -790,12 +774,20 @@ out_put:
>>              ceph_cap_string(got));
>>         ceph_put_cap_refs(ci, got);
>>
>> +       if (ret >= 0 &&
>> +           ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host) ||
>> +            ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
>> +               err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
>> +               if (err < 0)
>> +                       ret = err;
>> +       }
>>  out:
>>         if (ret == -EOLDSNAPC) {
>>                 dout("aio_write %p %llx.%llx %llu~%u got EOLDSNAPC, retrying\n",
>>                      inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len);
>>                 goto retry_snap;
>>         }
>> +       sb_end_write(inode->i_sb);
>>
>>         return ret;
>>  }
>> --
>> 1.7.11.7
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

  reply	other threads:[~2013-03-05  1:57 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-03-01  6:46 [PATCH 0/7] ceph: misc fixes Yan, Zheng
2013-03-01  6:46 ` [PATCH 1/7] ceph: fix LSSNAP regression Yan, Zheng
2013-03-01  6:46 ` [PATCH 2/7] ceph: queue cap release when trimming cap Yan, Zheng
2013-03-01  6:46 ` [PATCH 3/7] ceph: set mds_want according to cap import message Yan, Zheng
2013-03-01  6:46 ` [PATCH 4/7] ceph: use I_COMPLETE inode flag instead of D_COMPLETE flag Yan, Zheng
2013-03-04 18:19   ` Sage Weil
2013-03-01  6:46 ` [PATCH 5/7] ceph: revert commit 22cddde104 Yan, Zheng
2013-03-01  6:46 ` [PATCH 6/7] ceph: don't early drop Fw cap Yan, Zheng
2013-03-04 18:26   ` Gregory Farnum
2013-03-05  1:57     ` Yan, Zheng [this message]
2013-03-05 20:42       ` Greg Farnum
2013-03-01  6:46 ` [PATCH 7/7] ceph: acquire i_mutex in __ceph_do_pending_vmtruncate Yan, Zheng
2013-03-04 18:49 ` [PATCH 0/7] ceph: misc fixes Gregory Farnum
     [not found]   ` <513568D1.60305@intel.com>
2013-03-05 20:44     ` Greg Farnum

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=513550FE.1020401@intel.com \
    --to=zheng.z.yan@intel.com \
    --cc=ceph-devel@vger.kernel.org \
    --cc=greg@inktank.com \
    --cc=sage@inktank.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.