* Better read bio error granularity?
From: Qu Wenruo
Date: 2022-03-13 10:24 UTC
To: Linux FS Devel, linux-btrfs@vger.kernel.org, dm-devel@redhat.com

Hi,

If any of the split bios gets an error, the whole parent bio will have bi_status set to some error number.

This is completely fine for a write bio, but I'm wondering whether we can get better granularity for reads by introducing a per-bvec bi_status or by using page status.

One situation is, for an fs like btrfs or a dm target like dm-verity: if a large bio is submitted, say a 128K one, and one of the pages fails checksum/HMAC verification, then the whole 128K is marked as an error, while in fact the remaining 124K is completely fine.

Can this be solved by something like a per-bvec bi_status, or by using the page error status to indicate where exactly the error is?

Or is such a use case too niche (it only makes sense for reads, and only for filesystems with verification)?

Thanks,
Qu
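P.S. For reference, this is roughly why the whole bio carries a single error today: the chained completion path keeps only one status per parent. A minimal sketch, simplified from __bio_chain_endio() in block/bio.c (details may differ between kernel versions):

	static struct bio *__bio_chain_endio(struct bio *bio)
	{
		struct bio *parent = bio->bi_private;

		/* one bi_status per bio: the first error wins */
		if (bio->bi_status && !parent->bi_status)
			parent->bi_status = bio->bi_status;
		bio_put(bio);
		return parent;
	}

So once any split fragment fails, there is simply nowhere to record which part of the 128K was bad.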
* Re: Better read bio error granularity?
From: Matthew Wilcox
Date: 2022-03-13 10:55 UTC
To: Qu Wenruo
Cc: Linux FS Devel, linux-btrfs@vger.kernel.org, dm-devel@redhat.com

On Sun, Mar 13, 2022 at 06:24:32PM +0800, Qu Wenruo wrote:
> If any of the split bios gets an error, the whole parent bio will have
> bi_status set to some error number.
>
> This is completely fine for a write bio, but I'm wondering whether we
> can get better granularity for reads by introducing a per-bvec
> bi_status or by using page status.
>
> One situation is, for an fs like btrfs or a dm target like dm-verity:
> if a large bio is submitted, say a 128K one, and one of the pages
> fails checksum/HMAC verification, then the whole 128K is marked as an
> error, while in fact the remaining 124K is completely fine.
>
> Can this be solved by something like a per-bvec bi_status, or by using
> the page error status to indicate where exactly the error is?

In general, I think we want to keep this simple; the bio has an error. If the user wants finer granularity on the error, they can resubmit a smaller I/O, or hopefully some day we'll get a better method of reporting errors to the admin than "some random program got EIO".

Specifically for the page cache (which I hope is what you meant by "page error status", because we definitely can't use that for DIO), the intent is that ->readahead can just fail and not set any of the pages Uptodate. Then we come along later, notice there's a page in the cache, call ->readpage on it, and get the error for that one page. The other 31 pages should read successfully.

(There's an awkward question to ask about large folios here, and about what we might be able to do around sub-folio or even sub-page or sub-block reads that happen not to touch the bytes in an error region, but let's keep the conversation about pages for now.)
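For concreteness, the fallback I described above is roughly this (pseudocode only; the real logic lives in mm/filemap.c and mm/readahead.c and differs in detail):

	page = find_get_page(mapping, index);
	if (page && !PageUptodate(page)) {
		/* ->readahead() failed earlier and left the page !Uptodate */
		lock_page(page);
		if (!PageUptodate(page))
			error = mapping->a_ops->readpage(file, page);
		/* a one-page read, so the EIO is scoped to this one page */
	}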
* Re: Better read bio error granularity?
From: Qu Wenruo
Date: 2022-03-13 11:03 UTC
To: Matthew Wilcox
Cc: Linux FS Devel, linux-btrfs@vger.kernel.org, dm-devel@redhat.com

On 2022/3/13 18:55, Matthew Wilcox wrote:
> In general, I think we want to keep this simple; the bio has an error.
> If the user wants finer granularity on the error, they can resubmit a
> smaller I/O, or hopefully some day we'll get a better method of
> reporting errors to the admin than "some random program got EIO".

Indeed this looks much simpler.

> Specifically for the page cache (which I hope is what you meant by
> "page error status", because we definitely can't use that for DIO),

What I actually meant is the PageError flag.

For DIO the pages are not mapped to any inode, but that shouldn't prevent us from using the PageError flag, I guess?

> the intent is that ->readahead can just fail and not set any of the
> pages Uptodate. Then we come along later, notice there's a page in
> the cache, call ->readpage on it, and get the error for that one
> page. The other 31 pages should read successfully.

This raises a small question: what prevents the fs from submitting a large bio containing the 32 pages again, rather than reading them page by page?

Just the fact that the page is there but not Uptodate?

> (There's an awkward question to ask about large folios here, and about
> what we might be able to do around sub-folio or even sub-page or
> sub-block reads that happen not to touch the bytes in an error region,
> but let's keep the conversation about pages for now.)

Yeah, that can get crazy pretty quickly.

Like iomap or btrfs: they use page::private to store extra bitmaps for those cases, so it's really impossible to use the PageError flag there. That's why I intentionally skipped them here.

Thank you very much for the quick and helpful reply,
Qu
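P.S. By "extra bitmaps" I mean something like the following. This is only an illustration modeled on iomap's iomap_page and btrfs's btrfs_subpage; the field names here are made up for the example:

	struct subpage_state {
		spinlock_t lock;
		/* one bit per fs block inside the page */
		unsigned long uptodate;
		/* per-block error bits could live here too */
		unsigned long error;
	};

It hangs off page->private, so a single PageError flag is too coarse to describe which blocks failed.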
* Re: Better read bio error granularity?
From: Matthew Wilcox
Date: 2022-03-13 17:44 UTC
To: Qu Wenruo
Cc: Linux FS Devel, linux-btrfs@vger.kernel.org, dm-devel@redhat.com

On Sun, Mar 13, 2022 at 07:03:39PM +0800, Qu Wenruo wrote:
> > Specifically for the page cache (which I hope is what you meant by
> > "page error status", because we definitely can't use that for DIO),
>
> What I actually meant is the PageError flag.
>
> For DIO the pages are not mapped to any inode, but that shouldn't
> prevent us from using the PageError flag, I guess?

For DIO, *it is not your page*. It belongs to somebody else, and you can't even look at the flags, or the ->mapping, or anything, really. You have a refcount on the page to prevent it from going away, but the only things you can do with the page are use its address and put your refcount when you're done. It might be an anonymous page, the ZERO_PAGE, a pagecache page for another filesystem, a pagecache page for your own filesystem, or even a pagecache page for the same file you're writing to. The DIO might only be 512-byte aligned, so different parts of the page can be written by different DIOs, and parts of the page, or the whole page, might be read by multiple DIOs at the same time. It might be registered to an RDMA device. I haven't been keeping up, but I think it's even possible that the page of memory is on a graphics card.

> > the intent is that ->readahead can just fail and not set any of the
> > pages Uptodate. Then we come along later, notice there's a page in
> > the cache, call ->readpage on it, and get the error for that one
> > page. The other 31 pages should read successfully.
>
> This raises a small question: what prevents the fs from submitting a
> large bio containing the 32 pages again, rather than reading them page
> by page?
>
> Just the fact that the page is there but not Uptodate?

This would be a bad idea. Pages can be !Uptodate and Dirty if they've been partially written by a block-aligned write, so mindlessly expanding a read of a single page into a read of multiple pages would result in data loss. Also, we might not have pages in the page cache for the surrounding pages, and we might be low enough on memory that we can't allocate them. And pages have to be locked before we do I/O; they have to be locked in order of file index, so you would have to trylock the pages before this one and cope with those trylocks failing. In short, this would be prohibitively complex.

> > (There's an awkward question to ask about large folios here [...]
> > but let's keep the conversation about pages for now.)
>
> Yeah, that can get crazy pretty quickly.
>
> Like iomap or btrfs: they use page::private to store extra bitmaps for
> those cases, so it's really impossible to use the PageError flag there.
> That's why I intentionally skipped them here.

I'm actually looking to get rid of PageError, but I think the other points stand by themselves even without that intention ;-)

> Thank you very much for the quick and helpful reply,

You're welcome! Thanks for starting the discussion rather than rushing off to implement it ;-)
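P.S. To make "prohibitively complex" concrete, here's a sketch (illustrative only, not proposed code; unlock/put on the failure paths elided) of what expanding a one-page read would have to do for each surrounding page:

	for (i = start; i < index; i++) {
		struct page *p = find_get_page(mapping, i);

		if (!p)
			goto single_page_only; /* not cached; allocating may fail */
		if (!trylock_page(p))
			goto single_page_only; /* lock ordering forbids sleeping here */
		if (PageDirty(p) && !PageUptodate(p))
			goto single_page_only; /* partial write: reading over it loses data */
	}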