* Async direct IO write vs buffered read race
From: Lukas Czerner @ 2017-06-22 15:57 UTC
To: linux-fsdevel; +Cc: viro, esandeen

Hello,

I am dealing with a problem where, if a buffered read happens to land
between direct IO submission and completion, the page cache will contain
stale data while the new data is on disk.

We try to avoid such problems by calling invalidate_inode_pages2_range()
before and after direct_IO() in generic_file_direct_write(); however,
that does not seem to be enough, because nothing prevents buffered reads
from coming in afterwards and populating the page cache.

Aside from the fact that mixing direct and buffered IO is not such a
good idea, we end up with the page cache showing different content than
what's on disk even after the AIO DIO completes, which seems very
strange to me.

I can reproduce this on ext4 as well as xfs, with kernel versions going
back at least to v3.10, which leads me to believe that this might
actually be known behaviour?

I was trying to avoid it by moving invalidate_inode_pages2_range() to
after the AIO DIO completion, into dio_complete() (or the file system's
->end_io callback), but that has its own problems - sometimes this
appears to be called from atomic context and I do not really see why...

Do you have any comments on this? Is it actually expected behaviour?

Thanks!
-Lukas
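(The race Lukas describes can be exercised from userspace. Below is a
minimal reproducer sketch, assuming libaio (link with -laio) and an
existing test file pre-filled with data different from the write buffer;
the path, offset and size are arbitrary, error handling is omitted, and
since the window is timing-dependent a real reproducer would loop.)

/* Sketch only: submit an O_DIRECT AIO write, issue a buffered read of
 * the same range before the write completes, then check what the page
 * cache returns after completion. */
#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SZ 4096

int main(void)
{
	int dfd = open("/mnt/test/file", O_WRONLY | O_DIRECT);
	int bfd = open("/mnt/test/file", O_RDONLY);	/* buffered fd */
	void *wbuf;
	char rbuf[SZ];
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;

	posix_memalign(&wbuf, SZ, SZ);
	memset(wbuf, 'N', SZ);			/* the "new" data */

	io_setup(1, &ctx);
	io_prep_pwrite(&cb, dfd, wbuf, SZ, 0);
	io_submit(ctx, 1, cbs);			/* AIO DIO write in flight */

	pread(bfd, rbuf, SZ, 0);		/* may populate the page cache
						   with the pre-write data */

	io_getevents(ctx, 1, 1, &ev, NULL);	/* wait for DIO completion */

	pread(bfd, rbuf, SZ, 0);		/* if the race hit, this still
						   returns the old contents */
	printf("first byte after completion: %c\n", rbuf[0]);
	return 0;
}

If the buffered read slips in between submission and completion, the
second pread can keep returning the pre-write contents even though the
new data is on disk.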
* Re: Async direct IO write vs buffered read race
From: Jeff Moyer @ 2017-06-22 16:55 UTC
To: Lukas Czerner, linux-fsdevel; +Cc: viro, esandeen, Christoph Hellwig, Jan Kara

Lukas Czerner <lczerner@redhat.com> writes:

> Hello,
>
> I am dealing with a problem where, if a buffered read happens to land
> between direct IO submission and completion, the page cache will contain
> stale data while the new data is on disk.
>
> We try to avoid such problems by calling invalidate_inode_pages2_range()
> before and after direct_IO() in generic_file_direct_write(); however,
> that does not seem to be enough, because nothing prevents buffered reads
> from coming in afterwards and populating the page cache.

Ugh, right.  With aio, we're doing the invalidate after the submission,
not the completion.

> Aside from the fact that mixing direct and buffered IO is not such a
> good idea, we end up with the page cache showing different content than
> what's on disk even after the AIO DIO completes, which seems very
> strange to me.
>
> I can reproduce this on ext4 as well as xfs, with kernel versions going
> back at least to v3.10, which leads me to believe that this might
> actually be known behaviour?

At least I didn't know about it. ;-)

> I was trying to avoid it by moving invalidate_inode_pages2_range() to
> after the AIO DIO completion, into dio_complete() (or the file system's
> ->end_io callback), but that has its own problems - sometimes this
> appears to be called from atomic context and I do not really see why...

Well, I/O completion processing of course happens in atomic context.  We
do defer some things (like O_(D)SYNC processing) to process context.  I
guess we could add another qualifier inside of dio_bio_end_aio:

	bool defer_completion = false;

	if (dio->result)
		defer_completion = dio->defer_completion ||
				   (dio->op == REQ_OP_WRITE &&
				    dio->inode->i_mapping->nrpages);

	if (remaining == 0) {
		if (defer_completion) {
			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
			queue_work(dio->inode->i_sb->s_dio_done_wq,
				   &dio->complete_work);
		...

(I'm not sure whether we also have to take into account exceptional
entries.)

And then call invalidate_inode_pages2_range from dio_aio_complete_work.
That at least wouldn't defer /all/ completion processing to a workqueue.
However, it will slow things down when there is mixed buffered and
direct I/O.

Christoph or Jan, any thoughts on this?

> Do you have any comments on this? Is it actually expected behaviour?

It's not expected behaviour.

-Jeff
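(For concreteness, the step Jeff leaves implicit - invalidating the
written range once the write has completed and we are back in process
context, e.g. from dio_aio_complete_work - could look roughly like the
sketch below. It is only a sketch of the idea: the offset and ret names
are assumed to be available in the completion path and are not taken
from an actual patch.)

	/* Hedged sketch: after a completed AIO DIO write, drop any page
	 * cache pages covering the written range so later buffered reads
	 * see the new data.  'offset' is the starting file offset of the
	 * write and 'ret' the number of bytes transferred (both assumed). */
	if (ret > 0 && dio->op == REQ_OP_WRITE &&
	    dio->inode->i_mapping->nrpages) {
		int err = invalidate_inode_pages2_range(
				dio->inode->i_mapping,
				offset >> PAGE_SHIFT,
				(offset + ret - 1) >> PAGE_SHIFT);
		/* Best effort: a failure means a racing buffered access
		 * pinned a page; there is not much more to do here. */
		if (err)
			pr_debug("dio: failed to invalidate page cache\n");
	}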
* Re: Async direct IO write vs buffered read race
From: Jan Kara @ 2017-06-23 7:59 UTC
To: Jeff Moyer; +Cc: Lukas Czerner, linux-fsdevel, viro, esandeen, Christoph Hellwig, Jan Kara

On Thu 22-06-17 12:55:50, Jeff Moyer wrote:
> Lukas Czerner <lczerner@redhat.com> writes:
>
> > Hello,
> >
> > I am dealing with a problem where, if a buffered read happens to land
> > between direct IO submission and completion, the page cache will
> > contain stale data while the new data is on disk.
> >
> > We try to avoid such problems by calling
> > invalidate_inode_pages2_range() before and after direct_IO() in
> > generic_file_direct_write(); however, that does not seem to be enough,
> > because nothing prevents buffered reads from coming in afterwards and
> > populating the page cache.
>
> Ugh, right.  With aio, we're doing the invalidate after the submission,
> not the completion.
>
> > Aside from the fact that mixing direct and buffered IO is not such a
> > good idea, we end up with the page cache showing different content
> > than what's on disk even after the AIO DIO completes, which seems very
> > strange to me.
> >
> > I can reproduce this on ext4 as well as xfs, with kernel versions
> > going back at least to v3.10, which leads me to believe that this
> > might actually be known behaviour?
>
> At least I didn't know about it. ;-)

I'm actually aware of it :)

> > I was trying to avoid it by moving invalidate_inode_pages2_range() to
> > after the AIO DIO completion, into dio_complete() (or the file
> > system's ->end_io callback), but that has its own problems - sometimes
> > this appears to be called from atomic context and I do not really see
> > why...
>
> Well, I/O completion processing of course happens in atomic context.
> We do defer some things (like O_(D)SYNC processing) to process context.
> I guess we could add another qualifier inside of dio_bio_end_aio:
>
> 	bool defer_completion = false;
>
> 	if (dio->result)
> 		defer_completion = dio->defer_completion ||
> 				   (dio->op == REQ_OP_WRITE &&
> 				    dio->inode->i_mapping->nrpages);
>
> 	if (remaining == 0) {
> 		if (defer_completion) {
> 			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
> 			queue_work(dio->inode->i_sb->s_dio_done_wq,
> 				   &dio->complete_work);
> 		...
>
> (I'm not sure whether we also have to take into account exceptional
> entries.)
>
> And then call invalidate_inode_pages2_range from dio_aio_complete_work.
> That at least wouldn't defer /all/ completion processing to a workqueue.
> However, it will slow things down when there is mixed buffered and
> direct I/O.
>
> Christoph or Jan, any thoughts on this?

So our stance has been: do not ever mix buffered and direct IO! Definitely
not on the same file range, and most definitely not at the same time.

The thing we do is a best-effort thing that more or less guarantees that
if you do, say, buffered IO and then direct IO after that, it will work
reasonably. However, if direct and buffered IO can race, bad luck for your
data. I don't think we want to sacrifice any performance of AIO DIO (and
offloading direct IO completion to a workqueue so that we can do the
invalidation costs a noticeable amount of performance) for supporting such
a use case.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: Async direct IO write vs buffered read race
From: Lukas Czerner @ 2017-06-23 10:16 UTC
To: Jan Kara; +Cc: Jeff Moyer, linux-fsdevel, viro, esandeen, Christoph Hellwig

On Fri, Jun 23, 2017 at 09:59:42AM +0200, Jan Kara wrote:
> On Thu 22-06-17 12:55:50, Jeff Moyer wrote:
> > Lukas Czerner <lczerner@redhat.com> writes:
> >
> > > Hello,
> > >
> > > I am dealing with a problem where, if a buffered read happens to
> > > land between direct IO submission and completion, the page cache
> > > will contain stale data while the new data is on disk.
> > >
> > > We try to avoid such problems by calling
> > > invalidate_inode_pages2_range() before and after direct_IO() in
> > > generic_file_direct_write(); however, that does not seem to be
> > > enough, because nothing prevents buffered reads from coming in
> > > afterwards and populating the page cache.
> >
> > Ugh, right.  With aio, we're doing the invalidate after the
> > submission, not the completion.
> >
> > > Aside from the fact that mixing direct and buffered IO is not such a
> > > good idea, we end up with the page cache showing different content
> > > than what's on disk even after the AIO DIO completes, which seems
> > > very strange to me.
> > >
> > > I can reproduce this on ext4 as well as xfs, with kernel versions
> > > going back at least to v3.10, which leads me to believe that this
> > > might actually be known behaviour?
> >
> > At least I didn't know about it. ;-)
>
> I'm actually aware of it :)
>
> > > I was trying to avoid it by moving invalidate_inode_pages2_range()
> > > to after the AIO DIO completion, into dio_complete() (or the file
> > > system's ->end_io callback), but that has its own problems -
> > > sometimes this appears to be called from atomic context and I do not
> > > really see why...
> >
> > Well, I/O completion processing of course happens in atomic context.
> > We do defer some things (like O_(D)SYNC processing) to process
> > context.  I guess we could add another qualifier inside of
> > dio_bio_end_aio:
> >
> > 	bool defer_completion = false;
> >
> > 	if (dio->result)
> > 		defer_completion = dio->defer_completion ||
> > 				   (dio->op == REQ_OP_WRITE &&
> > 				    dio->inode->i_mapping->nrpages);
> >
> > 	if (remaining == 0) {
> > 		if (defer_completion) {
> > 			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
> > 			queue_work(dio->inode->i_sb->s_dio_done_wq,
> > 				   &dio->complete_work);
> > 		...
> >
> > (I'm not sure whether we also have to take into account exceptional
> > entries.)
> >
> > And then call invalidate_inode_pages2_range from dio_aio_complete_work.
> > That at least wouldn't defer /all/ completion processing to a
> > workqueue.  However, it will slow things down when there is mixed
> > buffered and direct I/O.
> >
> > Christoph or Jan, any thoughts on this?
>
> So our stance has been: do not ever mix buffered and direct IO!
> Definitely not on the same file range, and most definitely not at the
> same time.

True, that's a good recommendation.

> The thing we do is a best-effort thing that more or less guarantees that
> if you do, say, buffered IO and then direct IO after that, it will work
> reasonably. However, if direct and buffered IO can race, bad luck for
> your data. I don't think we want to sacrifice any performance of AIO DIO
> (and offloading direct IO completion to a workqueue so that we can do
> the invalidation costs a noticeable amount of performance) for
> supporting such a use case.

What Jeff proposed would sacrifice performance only in the case where an
AIO DIO write does race with buffered IO - the situation we agree is not
ideal and should be avoided anyway. For the rest of AIO DIO this should
have no effect, right? If true, I'd say this is a good effort to make sure
we do not have a disparity between page cache and disk.

-Lukas

> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
* Re: Async direct IO write vs buffered read race
From: Jeff Moyer @ 2017-06-26 15:11 UTC
To: Jan Kara, Lukas Czerner; +Cc: linux-fsdevel, viro, esandeen, Christoph Hellwig

Lukas Czerner <lczerner@redhat.com> writes:

>> The thing we do is a best-effort thing that more or less guarantees
>> that if you do, say, buffered IO and then direct IO after that, it
>> will work reasonably. However, if direct and buffered IO can race, bad
>> luck for your data. I don't think we want to sacrifice any performance
>> of AIO DIO (and offloading direct IO completion to a workqueue so that
>> we can do the invalidation costs a noticeable amount of performance)
>> for supporting such a use case.
>
> What Jeff proposed would sacrifice performance only in the case where an
> AIO DIO write does race with buffered IO - the situation we agree is not
> ideal and should be avoided anyway. For the rest of AIO DIO this should
> have no effect, right? If true, I'd say this is a good effort to make
> sure we do not have a disparity between page cache and disk.

Exactly.  Jan, are you concerned about impacting performance for mixed
buffered I/O and direct writes?  If so, we could look into restricting
the switch to process context further, to just overlapping buffered and
direct I/O (assuming there are no locking issues).

Alternatively, since we already know this is racy, we don't actually have
to defer I/O completion to process context.  We could just complete the
I/O as we normally would, but also queue up an
invalidate_inode_pages2_range work item.  It will be asynchronous, but
this is best effort, anyway.

As Eric mentioned, the thing that bothers me is that we have invalid
data lingering in the page cache indefinitely.

Cheers,
Jeff
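(The asynchronous best-effort variant Jeff mentions could be sketched as
a standalone work item that only drops the written range from the page
cache while completion itself still runs in interrupt context. The
structure and function names below are invented for illustration and are
not proposed code.)

/* Hypothetical sketch: queue a work item from AIO DIO write completion
 * that invalidates the written range in process context. */
struct dio_invalidate_work {
	struct work_struct	work;
	struct address_space	*mapping;
	pgoff_t			start;
	pgoff_t			end;
};

static void dio_invalidate_fn(struct work_struct *w)
{
	struct dio_invalidate_work *iw =
		container_of(w, struct dio_invalidate_work, work);

	/* Best effort: ignore failures caused by racing buffered IO. */
	invalidate_inode_pages2_range(iw->mapping, iw->start, iw->end);
	kfree(iw);
}

/* Called from write completion; GFP_ATOMIC because we may be in
 * interrupt context.  'offset' and 'len' describe the completed write. */
static void dio_queue_invalidate(struct address_space *mapping,
				 loff_t offset, ssize_t len)
{
	struct dio_invalidate_work *iw = kmalloc(sizeof(*iw), GFP_ATOMIC);

	if (!iw)
		return;		/* best effort only */
	INIT_WORK(&iw->work, dio_invalidate_fn);
	iw->mapping = mapping;
	iw->start = offset >> PAGE_SHIFT;
	iw->end = (offset + len - 1) >> PAGE_SHIFT;
	schedule_work(&iw->work);
}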
* Re: Async direct IO write vs buffered read race
From: Rik van Riel @ 2017-06-28 16:57 UTC
To: Jeff Moyer, Jan Kara, Lukas Czerner; +Cc: linux-fsdevel, viro, esandeen, Christoph Hellwig

On Mon, 2017-06-26 at 11:11 -0400, Jeff Moyer wrote:
> Lukas Czerner <lczerner@redhat.com> writes:
>
> > > The thing we do is a best-effort thing that more or less guarantees
> > > that if you do, say, buffered IO and then direct IO after that, it
> > > will work reasonably. However, if direct and buffered IO can race,
> > > bad luck for your data. I don't think we want to sacrifice any
> > > performance of AIO DIO (and offloading direct IO completion to a
> > > workqueue so that we can do the invalidation costs a noticeable
> > > amount of performance) for supporting such a use case.
> >
> > What Jeff proposed would sacrifice performance only in the case where
> > an AIO DIO write does race with buffered IO - the situation we agree
> > is not ideal and should be avoided anyway. For the rest of AIO DIO
> > this should have no effect, right? If true, I'd say this is a good
> > effort to make sure we do not have a disparity between page cache and
> > disk.
>
> Exactly.  Jan, are you concerned about impacting performance for mixed
> buffered I/O and direct writes?  If so, we could look into restricting
> the switch to process context further, to just overlapping buffered and
> direct I/O (assuming there are no locking issues).
>
> Alternatively, since we already know this is racy, we don't actually
> have to defer I/O completion to process context.  We could just
> complete the I/O as we normally would, but also queue up an
> invalidate_inode_pages2_range work item.  It will be asynchronous, but
> this is best effort, anyway.
>
> As Eric mentioned, the thing that bothers me is that we have invalid
> data lingering in the page cache indefinitely.

Given that the requirement is that the page cache gets invalidated after
IO completion, would it be possible to defer only the page cache
invalidation to task context, and handle the rest of the IO completion
in interrupt context?

-- 
All rights reversed
* Re: Async direct IO write vs buffered read race
From: Lukas Czerner @ 2017-06-30 11:16 UTC
To: Rik van Riel; +Cc: Jeff Moyer, Jan Kara, linux-fsdevel, viro, esandeen, Christoph Hellwig

On Wed, Jun 28, 2017 at 12:57:59PM -0400, Rik van Riel wrote:
> On Mon, 2017-06-26 at 11:11 -0400, Jeff Moyer wrote:
> > Lukas Czerner <lczerner@redhat.com> writes:
> >
> > > > The thing we do is a best-effort thing that more or less
> > > > guarantees that if you do, say, buffered IO and then direct IO
> > > > after that, it will work reasonably. However, if direct and
> > > > buffered IO can race, bad luck for your data. I don't think we
> > > > want to sacrifice any performance of AIO DIO (and offloading
> > > > direct IO completion to a workqueue so that we can do the
> > > > invalidation costs a noticeable amount of performance) for
> > > > supporting such a use case.
> > >
> > > What Jeff proposed would sacrifice performance only in the case
> > > where an AIO DIO write does race with buffered IO - the situation
> > > we agree is not ideal and should be avoided anyway. For the rest of
> > > AIO DIO this should have no effect, right? If true, I'd say this is
> > > a good effort to make sure we do not have a disparity between page
> > > cache and disk.
> >
> > Exactly.  Jan, are you concerned about impacting performance for
> > mixed buffered I/O and direct writes?  If so, we could look into
> > restricting the switch to process context further, to just
> > overlapping buffered and direct I/O (assuming there are no locking
> > issues).
> >
> > Alternatively, since we already know this is racy, we don't actually
> > have to defer I/O completion to process context.  We could just
> > complete the I/O as we normally would, but also queue up an
> > invalidate_inode_pages2_range work item.  It will be asynchronous,
> > but this is best effort, anyway.
> >
> > As Eric mentioned, the thing that bothers me is that we have invalid
> > data lingering in the page cache indefinitely.
>
> Given that the requirement is that the page cache gets invalidated
> after IO completion, would it be possible to defer only the page cache
> invalidation to task context, and handle the rest of the IO completion
> in interrupt context?

Hi,

if I am reading it correctly, that's basically how it works now for IO
that has defer_completion set (filesystems set this to do extent
conversion at completion time). We'd use the same path here for the
invalidation.

-Lukas

> -- 
> All rights reversed
* Re: Async direct IO write vs buffered read race
From: Eric Sandeen @ 2017-06-23 18:04 UTC
To: Jan Kara, Jeff Moyer; +Cc: Lukas Czerner, linux-fsdevel, viro, Christoph Hellwig

On 6/23/17 2:59 AM, Jan Kara wrote:
> On Thu 22-06-17 12:55:50, Jeff Moyer wrote:
>> Christoph or Jan, any thoughts on this?
>
> So our stance has been: do not ever mix buffered and direct IO!
> Definitely not on the same file range, and most definitely not at the
> same time.

FWIW, I'd always known that concurrent buffered & direct IO wasn't
particularly deterministic, i.e. the racing buffered read may get either
old or new data at the time.  I was surprised to learn that the stale
file data would linger indefinitely in the page cache, though.  Maybe I
was just blissfully unaware. :)

-Eric

> The thing we do is a best-effort thing that more or less guarantees that
> if you do, say, buffered IO and then direct IO after that, it will work
> reasonably. However, if direct and buffered IO can race, bad luck for
> your data. I don't think we want to sacrifice any performance of AIO DIO
> (and offloading direct IO completion to a workqueue so that we can do
> the invalidation costs a noticeable amount of performance) for
> supporting such a use case.
>
> 								Honza
Thread overview: 8 messages (newest: 2017-06-30 11:16 UTC)
  2017-06-22 15:57 Async direct IO write vs buffered read race  Lukas Czerner
  2017-06-22 16:55 ` Jeff Moyer
  2017-06-23  7:59   ` Jan Kara
  2017-06-23 10:16     ` Lukas Czerner
  2017-06-26 15:11       ` Jeff Moyer
  2017-06-28 16:57         ` Rik van Riel
  2017-06-30 11:16           ` Lukas Czerner
  2017-06-23 18:04   ` Eric Sandeen