* [Ocfs2-devel] New update for sync in reflink. [Re: some thoughts about data integrity test for reflink.] [not found] ` <4A7C4708.9030303@oracle.com> @ 2009-08-08 0:10 ` Joel Becker 2009-08-08 14:28 ` Tao Ma 0 siblings, 1 reply; 5+ messages in thread From: Joel Becker @ 2009-08-08 0:10 UTC (permalink / raw) To: ocfs2-devel [I've added ocfs2-devel to the Cc list. Please copy the list, otherwise we can't get Mark's thoughts, and he knows some of the code better than I do. ] On Fri, Aug 07, 2009 at 11:23:52PM +0800, Tao Ma wrote: > Joel Becker wrote: > >>I called ocfs2_sync_inode before reflink to flush the page > >>cache. So I think we should get all B's and C's are missed. > >>The reason I call ocfs2_sync_inode is the following test case: > >>1. Write a file with all A's. > >>2. reflink A to B. > >>3. read B. > >>If the page cache of A isn't synced to the disk, we may get the > >>garbage from reading of B. > > > > We won't get garbage. We get 'A's, which is a valid state of > >the file. Whether it's valid in the face of relink is what we're trying > >to decide here. > We don't get 'A' here actually and we would get garbage. Strange, > uh? I just added some printk and it seems that while reading file B, > the buffer_head is created > and read from the disk. I used to think that while we write file A, > we will create buffer_head. And reading file B will use this > buffer_head. > But that is not the case actually. The only right sequence is that > we add a sync between step 1 and 2. Um, what? Oh, we do the check for PageDirty() in CoW, but not in straight read... > > Now, whether we want a reflink to force a sync or not in the > >writeback case...on the one hand, people will probably expect write(); > >reflink(); read() to work. On the other hand, they really should > >write(); fsync(); reflink(); if they care. We're already dealing > >with this regarding rename(), and its ugly. Force the fsync() for them, > >and we help the naive caller but penalize (performance-wise) the > >intelligent caller. > I have checked my mount mode, it is data=ordered, so we also meet > with this issue in ordered mode. Yes, the original discussion was concerning the page cache contents during CoW. You've coded that up correctly. Here we're talking about simple read. > > I really lean towards just requiring the caller to deal with it. > >It's what unix-like systems do for all other operations. I poked Chris, > >and he agrees. So a caller if they knew they had such a situation, does > >'fsync(); reflink();'. > So a user always need to do fsync before he do reflink. Is it OK? If > yes, I will ask tristan to update his test case. Well... I would normally say yes, because stale vs not-stale is one thing. But complete garbage from before the first write is not OK. > >>I think it is OK if I call ocfs2_sync_inode before reflink?(See > >>my git tree for details). > > > > Calling it unconditionally is just a performance disaster > >waiting to happen. Even if we want to require reflink() to happen with > >data flushed, you don't want to call ocfs2_sync_inode() in the ordered > >case. You want the ordered journal to handle it. > Since it doesn't work in order mode which is more common for ocfs2 > usage, do you change your mind here? I do not. Having to call ocfs2_sync_inode() for every reflink is a disaster. We need to think about this. Basically, how can the reflink target notice it has to either sync the source or perhaps copy the data over? Joel -- Life's Little Instruction Book #43 "Never give up on somebody. Miracles happen every day." Joel Becker Principal Software Developer Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 5+ messages in thread
* [Ocfs2-devel] New update for sync in reflink. [Re: some thoughts about data integrity test for reflink.] 2009-08-08 0:10 ` [Ocfs2-devel] New update for sync in reflink. [Re: some thoughts about data integrity test for reflink.] Joel Becker @ 2009-08-08 14:28 ` Tao Ma 2009-08-10 18:59 ` Joel Becker 0 siblings, 1 reply; 5+ messages in thread From: Tao Ma @ 2009-08-08 14:28 UTC (permalink / raw) To: ocfs2-devel Joel Becker wrote: > [I've added ocfs2-devel to the Cc list. Please copy the list, otherwise > we can't get Mark's thoughts, and he knows some of the code better > than I do. ] > > On Fri, Aug 07, 2009 at 11:23:52PM +0800, Tao Ma wrote: > >> Joel Becker wrote: >> >>>> I called ocfs2_sync_inode before reflink to flush the page >>>> cache. So I think we should get all B's and C's are missed. >>>> The reason I call ocfs2_sync_inode is the following test case: >>>> 1. Write a file with all A's. >>>> 2. reflink A to B. >>>> 3. read B. >>>> If the page cache of A isn't synced to the disk, we may get the >>>> garbage from reading of B. >>>> >>> We won't get garbage. We get 'A's, which is a valid state of >>> the file. Whether it's valid in the face of relink is what we're trying >>> to decide here. >>> >> We don't get 'A' here actually and we would get garbage. Strange, >> uh? I just added some printk and it seems that while reading file B, >> the buffer_head is created >> and read from the disk. I used to think that while we write file A, >> we will create buffer_head. And reading file B will use this >> buffer_head. >> But that is not the case actually. The only right sequence is that >> we add a sync between step 1 and 2. >> > > Um, what? > Oh, we do the check for PageDirty() in CoW, but not in straight > read... > So what do you mean here? I need to add another patch for straight read check because of reflink? I am lack of this type of technique. So please give me some tips. Thanks. > >>> Now, whether we want a reflink to force a sync or not in the >>> writeback case...on the one hand, people will probably expect write(); >>> reflink(); read() to work. On the other hand, they really should >>> write(); fsync(); reflink(); if they care. We're already dealing >>> with this regarding rename(), and its ugly. Force the fsync() for them, >>> and we help the naive caller but penalize (performance-wise) the >>> intelligent caller. >>> >> I have checked my mount mode, it is data=ordered, so we also meet >> with this issue in ordered mode. >> > > Yes, the original discussion was concerning the page cache > contents during CoW. You've coded that up correctly. Here we're > talking about simple read. > > >>> I really lean towards just requiring the caller to deal with it. >>> It's what unix-like systems do for all other operations. I poked Chris, >>> and he agrees. So a caller if they knew they had such a situation, does >>> 'fsync(); reflink();'. >>> >> So a user always need to do fsync before he do reflink. Is it OK? If >> yes, I will ask tristan to update his test case. >> > > Well... I would normally say yes, because stale vs not-stale is > one thing. But complete garbage from before the first write is not OK. > No problem, I will ask tristan to do it. And we may also need to add some explanations about it when we finally release reflink. > >>>> I think it is OK if I call ocfs2_sync_inode before reflink?(See >>>> my git tree for details). >>>> >>> Calling it unconditionally is just a performance disaster >>> waiting to happen. Even if we want to require reflink() to happen with >>> data flushed, you don't want to call ocfs2_sync_inode() in the ordered >>> case. You want the ordered journal to handle it. >>> >> Since it doesn't work in order mode which is more common for ocfs2 >> usage, do you change your mind here? >> > > I do not. Having to call ocfs2_sync_inode() for every reflink > is a disaster. We need to think about this. Basically, how can the > reflink target notice it has to either sync the source or perhaps copy > the data over? > OK, I agree with you for ocfs2_sync_inode part. I am just a little worried about the above case I mentioned, so if we can resolve it gracefully, I would be very glad about it. Regards, Tao ^ permalink raw reply [flat|nested] 5+ messages in thread
* [Ocfs2-devel] New update for sync in reflink. [Re: some thoughts about data integrity test for reflink.] 2009-08-08 14:28 ` Tao Ma @ 2009-08-10 18:59 ` Joel Becker 2009-08-11 1:29 ` Tao Ma 0 siblings, 1 reply; 5+ messages in thread From: Joel Becker @ 2009-08-10 18:59 UTC (permalink / raw) To: ocfs2-devel On Sat, Aug 08, 2009 at 10:28:34PM +0800, Tao Ma wrote: > > I do not. Having to call ocfs2_sync_inode() for every reflink > > is a disaster. We need to think about this. Basically, how can the > > reflink target notice it has to either sync the source or perhaps copy > > the data over? > > > OK, I agree with you for ocfs2_sync_inode part. I am just a little > worried about the above case I mentioned, so if we can > resolve it gracefully, I would be very glad about it. Well, I can't find a good solution. But we're not going to use ocfs2_sync_inode(), as that syncs metadata, which is pointless for reflink. Also, we need to code up waiting on the sync (ocfs2_sync_inode() starts the sync, but does not wait on it). We need to start the sync as early as possible. As an aside, ocfs2_sync_inode() is a completely useless function. filemap_fdatawrite() is called by the vfs fsync code, and we don't use the mapping buffers so sync_mapping_buffers() is a noop. A future patch should just remove ocfs2_sync_inode(). So, here's what needs to happen. At the very top of __ocfs2_reflink(), right after you have ip_alloc_sem and have locked out concurrent writers, you need to call filemap_fdatawrite(). This will start the writeback in the background while you go on to building the refcount trees. At the very bottom of __ocfs2_reflink(), before you drop the locks on the new_inode, you then filemap_fdatawait(). I think that this means we can drop ocfs2_cow_sync_writeback() because the data must have been up to date before the reflink could have been CoWd. We can also drop the filemap_fdatawrite_range() in ocfs2_refcount_cow(). In ocfs2_duplicate_clusters() PageDirty() is now a BUG_ON(). Right? Joel -- "If the human brain were so simple we could understand it, we would be so simple that we could not." - W. A. Clouston Joel Becker Principal Software Developer Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 5+ messages in thread
* [Ocfs2-devel] New update for sync in reflink. [Re: some thoughts about data integrity test for reflink.] 2009-08-10 18:59 ` Joel Becker @ 2009-08-11 1:29 ` Tao Ma 2009-08-11 16:37 ` Joel Becker 0 siblings, 1 reply; 5+ messages in thread From: Tao Ma @ 2009-08-11 1:29 UTC (permalink / raw) To: ocfs2-devel Joel Becker wrote: > On Sat, Aug 08, 2009 at 10:28:34PM +0800, Tao Ma wrote: >>> I do not. Having to call ocfs2_sync_inode() for every reflink >>> is a disaster. We need to think about this. Basically, how can the >>> reflink target notice it has to either sync the source or perhaps copy >>> the data over? >>> >> OK, I agree with you for ocfs2_sync_inode part. I am just a little >> worried about the above case I mentioned, so if we can >> resolve it gracefully, I would be very glad about it. > > Well, I can't find a good solution. But we're not going to use > ocfs2_sync_inode(), as that syncs metadata, which is pointless for > reflink. Also, we need to code up waiting on the sync > (ocfs2_sync_inode() starts the sync, but does not wait on it). We need > to start the sync as early as possible. > As an aside, ocfs2_sync_inode() is a completely useless > function. filemap_fdatawrite() is called by the vfs fsync code, and we > don't use the mapping buffers so sync_mapping_buffers() is a noop. A > future patch should just remove ocfs2_sync_inode(). > So, here's what needs to happen. At the very top of > __ocfs2_reflink(), right after you have ip_alloc_sem and have locked out > concurrent writers, you need to call filemap_fdatawrite(). This will > start the writeback in the background while you go on to building the > refcount trees. At the very bottom of __ocfs2_reflink(), before you > drop the locks on the new_inode, you then filemap_fdatawait(). We may call filemap_fdatawait after we drop the locks of new_inode since it is just about the old inode's data writeback(And actually at the very top of __ocfs2_reflink where we call filemap_fdatawrite, we don't have i_mutex of new inode locked, so these 2 functions are both outside of new_inode's i_mutex lock)? And what' more, now the new inode is still in orphan dir, so other nodes or process should not be able to touch its data at that time. > I think that this means we can drop ocfs2_cow_sync_writeback() > because the data must have been up to date before the reflink could have > been CoWd. We can also drop the filemap_fdatawrite_range() in > ocfs2_refcount_cow(). In ocfs2_duplicate_clusters() PageDirty() is now > a BUG_ON(). Right? ocfs2_cow_sync_writeback is used to sync data to the new clusters after CoW in writeback mode, so it can't be removed. As for filemap_fdatawrite_range and PageDirty, you are right, we can remove them and add BUG_ON if we meet with PageDirty. Regards, Tao ^ permalink raw reply [flat|nested] 5+ messages in thread
* [Ocfs2-devel] New update for sync in reflink. [Re: some thoughts about data integrity test for reflink.] 2009-08-11 1:29 ` Tao Ma @ 2009-08-11 16:37 ` Joel Becker 0 siblings, 0 replies; 5+ messages in thread From: Joel Becker @ 2009-08-11 16:37 UTC (permalink / raw) To: ocfs2-devel On Tue, Aug 11, 2009 at 09:29:45AM +0800, Tao Ma wrote: > Joel Becker wrote: > We may call filemap_fdatawait after we drop the locks of new_inode since > it is just about the old inode's data writeback(And actually at the very > top of __ocfs2_reflink where we call filemap_fdatawrite, we don't have > i_mutex of new inode locked, so these 2 functions are both outside of > new_inode's i_mutex lock)? And what' more, now the new inode is still in > orphan dir, so other nodes or process should not be able to touch its > data at that time. You're right that the data writeout is from the source inode. And we want to start the fdatawrite as soon as we've locked the source inode, which is why we do it at the very top of __ocfs2_reflink(). If you really want to do the fdatawaite outside of the new inode's mutex, that's up to you. I figured we'd do it inside the cluster lock so that the refcounted areas are on disk before the journal commits the reflink, but since file is in the orphan dir, it won't be visible someone uses the tools to pull it out. > ocfs2_cow_sync_writeback is used to sync data to the new clusters after > CoW in writeback mode, so it can't be removed. Oh, right. Good catch. Joel -- Life's Little Instruction Book #226 "When someone hugs you, let them be the first to let go." Joel Becker Principal Software Developer Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127 ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2009-08-11 16:37 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <4A1B8E15.20005@oracle.com>
[not found] ` <1243322353.6616.18.camel@tristan-laptop.cn.oracle.com>
[not found] ` <4A1C8708.8050704@oracle.com>
[not found] ` <1243389610.6616.30.camel@tristan-laptop.cn.oracle.com>
[not found] ` <4A1DDFCF.4060203@oracle.com>
[not found] ` <4A1E3B9A.7000807@oracle.com>
[not found] ` <20090528171406.GA315@mail.oracle.com>
[not found] ` <4A7C4708.9030303@oracle.com>
2009-08-08 0:10 ` [Ocfs2-devel] New update for sync in reflink. [Re: some thoughts about data integrity test for reflink.] Joel Becker
2009-08-08 14:28 ` Tao Ma
2009-08-10 18:59 ` Joel Becker
2009-08-11 1:29 ` Tao Ma
2009-08-11 16:37 ` Joel Becker
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.