* [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
2012-07-09 22:35 [PATCH 0/3] shmem/tmpfs: three late patches Hugh Dickins
@ 2012-07-09 22:41 ` Hugh Dickins
2012-07-11 6:07 ` Cong Wang
2012-07-09 22:44 ` [PATCH 2/3] shmem: fix negative rss in memcg memory.stat Hugh Dickins
` (2 subsequent siblings)
3 siblings, 1 reply; 15+ messages in thread
From: Hugh Dickins @ 2012-07-09 22:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Hellwig, Josef Bacik, Andi Kleen, Andreas Dilger,
Dave Chinner, Marco Stornelli, Jeff liu, Chris Mason, linux-mm,
linux-fsdevel, linux-kernel
Revert 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE").
I believe it's correct, and it's been nice to have from rc1 to rc6;
but as the original commit said:
I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
would be of any use to them on tmpfs. This code adds 92 lines and 752
bytes on x86_64 - is that bloat or worthwhile?
Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs
to the dumb generic support for v3.5. We can always reinstate it later
if useful, and anyone needing it in a hurry can just get it out of git.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Josef Bacik <josef@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andreas Dilger <adilger@dilger.ca>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Marco Stornelli <marco.stornelli@gmail.com>
Cc: Jeff liu <jeff.liu@oracle.com>
Cc: Chris Mason <chris.mason@fusionio.com>
---
But if someone protests at this reversion, of course we can drop it.
mm/shmem.c | 94 ---------------------------------------------------
1 file changed, 1 insertion(+), 93 deletions(-)
--- 3.5-rc6/mm/shmem.c 2012-07-07 18:20:40.635328642 -0700
+++ linux/mm/shmem.c 2012-07-07 19:20:02.986950655 -0700
@@ -1692,98 +1692,6 @@ static ssize_t shmem_file_splice_read(st
return error;
}
-/*
- * llseek SEEK_DATA or SEEK_HOLE through the radix_tree.
- */
-static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
- pgoff_t index, pgoff_t end, int origin)
-{
- struct page *page;
- struct pagevec pvec;
- pgoff_t indices[PAGEVEC_SIZE];
- bool done = false;
- int i;
-
- pagevec_init(&pvec, 0);
- pvec.nr = 1; /* start small: we may be there already */
- while (!done) {
- pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
- pvec.nr, pvec.pages, indices);
- if (!pvec.nr) {
- if (origin == SEEK_DATA)
- index = end;
- break;
- }
- for (i = 0; i < pvec.nr; i++, index++) {
- if (index < indices[i]) {
- if (origin == SEEK_HOLE) {
- done = true;
- break;
- }
- index = indices[i];
- }
- page = pvec.pages[i];
- if (page && !radix_tree_exceptional_entry(page)) {
- if (!PageUptodate(page))
- page = NULL;
- }
- if (index >= end ||
- (page && origin == SEEK_DATA) ||
- (!page && origin == SEEK_HOLE)) {
- done = true;
- break;
- }
- }
- shmem_deswap_pagevec(&pvec);
- pagevec_release(&pvec);
- pvec.nr = PAGEVEC_SIZE;
- cond_resched();
- }
- return index;
-}
-
-static loff_t shmem_file_llseek(struct file *file, loff_t offset, int origin)
-{
- struct address_space *mapping;
- struct inode *inode;
- pgoff_t start, end;
- loff_t new_offset;
-
- if (origin != SEEK_DATA && origin != SEEK_HOLE)
- return generic_file_llseek_size(file, offset, origin,
- MAX_LFS_FILESIZE);
- mapping = file->f_mapping;
- inode = mapping->host;
- mutex_lock(&inode->i_mutex);
- /* We're holding i_mutex so we can access i_size directly */
-
- if (offset < 0)
- offset = -EINVAL;
- else if (offset >= inode->i_size)
- offset = -ENXIO;
- else {
- start = offset >> PAGE_CACHE_SHIFT;
- end = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
- new_offset = shmem_seek_hole_data(mapping, start, end, origin);
- new_offset <<= PAGE_CACHE_SHIFT;
- if (new_offset > offset) {
- if (new_offset < inode->i_size)
- offset = new_offset;
- else if (origin == SEEK_DATA)
- offset = -ENXIO;
- else
- offset = inode->i_size;
- }
- }
-
- if (offset >= 0 && offset != file->f_pos) {
- file->f_pos = offset;
- file->f_version = 0;
- }
- mutex_unlock(&inode->i_mutex);
- return offset;
-}
-
static long shmem_fallocate(struct file *file, int mode, loff_t offset,
loff_t len)
{
@@ -2787,7 +2695,7 @@ static const struct address_space_operat
static const struct file_operations shmem_file_operations = {
.mmap = shmem_mmap,
#ifdef CONFIG_TMPFS
- .llseek = shmem_file_llseek,
+ .llseek = generic_file_llseek,
.read = do_sync_read,
.write = do_sync_write,
.aio_read = shmem_file_aio_read,
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
2012-07-09 22:41 ` [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE Hugh Dickins
@ 2012-07-11 6:07 ` Cong Wang
2012-07-11 18:55 ` Hugh Dickins
0 siblings, 1 reply; 15+ messages in thread
From: Cong Wang @ 2012-07-11 6:07 UTC (permalink / raw)
To: linux-mm; +Cc: linux-fsdevel, linux-kernel
On Mon, 09 Jul 2012 at 22:41 GMT, Hugh Dickins <hughd@google.com> wrote:
> Revert 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE").
> I believe it's correct, and it's been nice to have from rc1 to rc6;
> but as the original commit said:
>
> I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
> would be of any use to them on tmpfs. This code adds 92 lines and 752
> bytes on x86_64 - is that bloat or worthwhile?
I don't think 752 bytes matter much, especially for x86_64.
>
> Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs
> to the dumb generic support for v3.5. We can always reinstate it later
> if useful, and anyone needing it in a hurry can just get it out of git.
>
If you don't have burden to maintain it, I'd prefer to leave as it is,
I don't think 752-bytes is the reason we revert it.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
2012-07-11 6:07 ` Cong Wang
@ 2012-07-11 18:55 ` Hugh Dickins
2012-07-11 23:01 ` Dave Chinner
0 siblings, 1 reply; 15+ messages in thread
From: Hugh Dickins @ 2012-07-11 18:55 UTC (permalink / raw)
To: Cong Wang; +Cc: linux-mm, linux-fsdevel, linux-kernel
On Wed, 11 Jul 2012, Cong Wang wrote:
> On Mon, 09 Jul 2012 at 22:41 GMT, Hugh Dickins <hughd@google.com> wrote:
> > Revert 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE").
> > I believe it's correct, and it's been nice to have from rc1 to rc6;
> > but as the original commit said:
> >
> > I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
> > would be of any use to them on tmpfs. This code adds 92 lines and 752
> > bytes on x86_64 - is that bloat or worthwhile?
>
>
> I don't think 752 bytes matter much, especially for x86_64.
>
> >
> > Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs
> > to the dumb generic support for v3.5. We can always reinstate it later
> > if useful, and anyone needing it in a hurry can just get it out of git.
> >
>
> If you don't have burden to maintain it, I'd prefer to leave as it is,
> I don't think 752-bytes is the reason we revert it.
Thank you, your vote has been counted ;)
and I'll be glad if yours stimulates some agreement or disagreement.
But your vote would count for a lot more if you know of some app which
would really benefit from this functionality in tmpfs: I've heard of none.
Hugh
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
2012-07-11 18:55 ` Hugh Dickins
@ 2012-07-11 23:01 ` Dave Chinner
2012-07-12 2:50 ` Hugh Dickins
2012-07-12 3:21 ` Jeff Liu
0 siblings, 2 replies; 15+ messages in thread
From: Dave Chinner @ 2012-07-11 23:01 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Cong Wang, linux-mm, linux-fsdevel, linux-kernel
On Wed, Jul 11, 2012 at 11:55:34AM -0700, Hugh Dickins wrote:
> On Wed, 11 Jul 2012, Cong Wang wrote:
> > On Mon, 09 Jul 2012 at 22:41 GMT, Hugh Dickins <hughd@google.com> wrote:
> > > Revert 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE").
> > > I believe it's correct, and it's been nice to have from rc1 to rc6;
> > > but as the original commit said:
> > >
> > > I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
> > > would be of any use to them on tmpfs. This code adds 92 lines and 752
> > > bytes on x86_64 - is that bloat or worthwhile?
> >
> >
> > I don't think 752 bytes matter much, especially for x86_64.
> >
> > >
> > > Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs
> > > to the dumb generic support for v3.5. We can always reinstate it later
> > > if useful, and anyone needing it in a hurry can just get it out of git.
> > >
> >
> > If you don't have burden to maintain it, I'd prefer to leave as it is,
> > I don't think 752-bytes is the reason we revert it.
>
> Thank you, your vote has been counted ;)
> and I'll be glad if yours stimulates some agreement or disagreement.
>
> But your vote would count for a lot more if you know of some app which
> would really benefit from this functionality in tmpfs: I've heard of none.
So what? I've heard of no apps that use this functionality on XFS,
either, but I have heard of a lot of people asking for it to be
implemented over the past couple of years so they can use it.
There's been patches written to make coreutils (cp) make use of it
instead of parsing FIEMAP output to find holes, though I don't know
if that's gone beyond more than "here's some patches"....
Besides, given that you can punch holes in tmpfs files, it seems
strange to then say "we don't need a method of skipping holes to
find data quickly"....
Besides, seek-hole/data is still shiny new and lots of developers
aren't even aware of it's presence in recent kernels. Removing new
functionality saying "no-one is using it" is like smashing the egg
before the chicken hatches (or is it cutting of the chickes's head
before it lays the egg?).
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
2012-07-11 23:01 ` Dave Chinner
@ 2012-07-12 2:50 ` Hugh Dickins
2012-07-12 3:21 ` Jeff Liu
1 sibling, 0 replies; 15+ messages in thread
From: Hugh Dickins @ 2012-07-12 2:50 UTC (permalink / raw)
To: Dave Chinner; +Cc: Cong Wang, linux-mm, linux-fsdevel, linux-kernel
On Thu, 12 Jul 2012, Dave Chinner wrote:
> On Wed, Jul 11, 2012 at 11:55:34AM -0700, Hugh Dickins wrote:
> > On Wed, 11 Jul 2012, Cong Wang wrote:
> > >
> > > If you don't have burden to maintain it, I'd prefer to leave as it is,
> > > I don't think 752-bytes is the reason we revert it.
> >
> > Thank you, your vote has been counted ;)
> > and I'll be glad if yours stimulates some agreement or disagreement.
> >
> > But your vote would count for a lot more if you know of some app which
> > would really benefit from this functionality in tmpfs: I've heard of none.
>
> So what? I've heard of no apps that use this functionality on XFS,
> either, but I have heard of a lot of people asking for it to be
> implemented over the past couple of years so they can use it.
I'd certainly not ask you to remove your support for it from XFS:
nobody would call XFS a minimal filesystem.
But tmpfs has a tradition and a duty to keep fairly small:
it needs to be useful, but it shouldn't be carrying unused baggage.
> There's been patches written to make coreutils (cp) make use of it
> instead of parsing FIEMAP output to find holes, though I don't know
> if that's gone beyond more than "here's some patches"....
>
> Besides, given that you can punch holes in tmpfs files, it seems
> strange to then say "we don't need a method of skipping holes to
> find data quickly"....
tmpfs has been punching holes (via MADV_REMOVE) since 2.6.16 (and
that wasn't added on my whim, IBM wanted and did it). But I haven't
heard of anybody asking for a method of skipping them in six years.
>
> Besides, seek-hole/data is still shiny new and lots of developers
> aren't even aware of it's presence in recent kernels. Removing new
> functionality saying "no-one is using it" is like smashing the egg
> before the chicken hatches (or is it cutting of the chickes's head
> before it lays the egg?).
(You remind me of my chicken-and-egg sandwiches - you can't get them,
you see, it's chicken and egg.)
I'm not trying to remove SEEK_HOLE/SEEK_DATA support from the kernel:
I'm just saying that nobody has yet made the case for their usefulness
in tmpfs, so they're better removed from it before v3.5 is released.
Once we see how useful they have become in the grown-up filesystems,
or someone shows how useful they can be on tmpfs, then we reinstate.
Of course, I'm on both sides of this argument: I wrote that code,
I like it, I'll be glad to put it back when it's useful to someone.
Hugh
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
2012-07-11 23:01 ` Dave Chinner
2012-07-12 2:50 ` Hugh Dickins
@ 2012-07-12 3:21 ` Jeff Liu
2012-07-16 9:28 ` Hugh Dickins
1 sibling, 1 reply; 15+ messages in thread
From: Jeff Liu @ 2012-07-12 3:21 UTC (permalink / raw)
To: Dave Chinner
Cc: Hugh Dickins, Cong Wang, linux-mm, linux-fsdevel, linux-kernel
On 07/12/2012 07:01 AM, Dave Chinner wrote:
> On Wed, Jul 11, 2012 at 11:55:34AM -0700, Hugh Dickins wrote:
>> On Wed, 11 Jul 2012, Cong Wang wrote:
>>> On Mon, 09 Jul 2012 at 22:41 GMT, Hugh Dickins <hughd@google.com> wrote:
>>>> Revert 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE").
>>>> I believe it's correct, and it's been nice to have from rc1 to rc6;
>>>> but as the original commit said:
>>>>
>>>> I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
>>>> would be of any use to them on tmpfs. This code adds 92 lines and 752
>>>> bytes on x86_64 - is that bloat or worthwhile?
>>>
>>>
>>> I don't think 752 bytes matter much, especially for x86_64.
>>>
>>>>
>>>> Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs
>>>> to the dumb generic support for v3.5. We can always reinstate it later
>>>> if useful, and anyone needing it in a hurry can just get it out of git.
>>>>
>>>
>>> If you don't have burden to maintain it, I'd prefer to leave as it is,
>>> I don't think 752-bytes is the reason we revert it.
>>
>> Thank you, your vote has been counted ;)
>> and I'll be glad if yours stimulates some agreement or disagreement.
>>
>> But your vote would count for a lot more if you know of some app which
>> would really benefit from this functionality in tmpfs: I've heard of none.
>
> So what? I've heard of no apps that use this functionality on XFS,
> either, but I have heard of a lot of people asking for it to be
> implemented over the past couple of years so they can use it.
> There's been patches written to make coreutils (cp) make use of it
> instead of parsing FIEMAP output to find holes, though I don't know
> if that's gone beyond more than "here's some patches"...
Yes, for apps, cp(1) will make use of it to replace the old FIEMAP for efficient sparse file copy.
I have implemented an extent-scan module to coreutils a few years ago,
http://fossies.org/dox/coreutils-8.17/extent-scan_8c_source.html
It does extent scan through FIEMAP, however, SEEK_DATA/SEEK_HOLE is more convenient and easy to use
considering the call interface. So FIEMAP will be replaced by SEEK_XXX once it got supported by EXT4.
Moreover, I have discussed with Jim who is the coreutils maintainer previously, He would like to post
extent-scan module to Gnulib so that other GNU utilities which are relied on Gnulib might be a potential
user of it, at least, GNU tar will definitely need it for sparse file backup.
>
> Besides, given that you can punch holes in tmpfs files, it seems
> strange to then say "we don't need a method of skipping holes to
> find data quickly"....
So its deserve to keep this feature working on tmpfs considering hole punch. :)
Thanks,
-Jeff
>
> Besides, seek-hole/data is still shiny new and lots of developers
> aren't even aware of it's presence in recent kernels. Removing new
> functionality saying "no-one is using it" is like smashing the egg
> before the chicken hatches (or is it cutting of the chickes's head
> before it lays the egg?).
>
> Cheers,
>
> Dave.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
2012-07-12 3:21 ` Jeff Liu
@ 2012-07-16 9:28 ` Hugh Dickins
2012-07-17 6:15 ` Jeff Liu
0 siblings, 1 reply; 15+ messages in thread
From: Hugh Dickins @ 2012-07-16 9:28 UTC (permalink / raw)
To: Jeff Liu; +Cc: Dave Chinner, Cong Wang, linux-mm, linux-fsdevel, linux-kernel
On Thu, 12 Jul 2012, Jeff Liu wrote:
> On 07/12/2012 07:01 AM, Dave Chinner wrote:
> > On Wed, Jul 11, 2012 at 11:55:34AM -0700, Hugh Dickins wrote:
> >>
> >> But your vote would count for a lot more if you know of some app which
> >> would really benefit from this functionality in tmpfs: I've heard of none.
> >
> > So what? I've heard of no apps that use this functionality on XFS,
> > either, but I have heard of a lot of people asking for it to be
> > implemented over the past couple of years so they can use it.
> > There's been patches written to make coreutils (cp) make use of it
> > instead of parsing FIEMAP output to find holes, though I don't know
> > if that's gone beyond more than "here's some patches"...
>
> Yes, for apps, cp(1) will make use of it to replace the old FIEMAP for efficient sparse file copy.
> I have implemented an extent-scan module to coreutils a few years ago,
> http://fossies.org/dox/coreutils-8.17/extent-scan_8c_source.html
Thanks for confirming Dave's pointer to cp.
Of course, tmpfs has never supported FIBMAP or FIEMAP;
but SEEK_DATA and SEEK_HOLE do fit it much more naturally.
>
> It does extent scan through FIEMAP, however, SEEK_DATA/SEEK_HOLE is more convenient and easy to use
> considering the call interface. So FIEMAP will be replaced by SEEK_XXX once it got supported by EXT4.
>
> Moreover, I have discussed with Jim who is the coreutils maintainer previously, He would like to post
> extent-scan module to Gnulib so that other GNU utilities which are relied on Gnulib might be a potential
> user of it, at least, GNU tar will definitely need it for sparse file backup.
Thanks for the info. I confess I'm not hugely swayed by cp and sparse
file archive arguments - I doubt many people care, and I doubt those who
do care are using tmpfs for them.
But my doubts are just ignorance. I was hoping to hear, not that we have
tools to copy sparse files efficiently (umm, over the network?), but
what apps are actually working live with those sparse files on tmpfs,
and now need to seek around them. Some math or physics applications?
> >
> > Besides, given that you can punch holes in tmpfs files, it seems
> > strange to then say "we don't need a method of skipping holes to
> > find data quickly"....
>
> So its deserve to keep this feature working on tmpfs considering hole punch. :)
Well, thank you, as I said earlier I am on both sides of the argument.
(And feel uncomfortably like a prima donna waiting in the wings until
the audience has shouted long enough for the encore.)
It's now taken out of 3.5, but we can bring it back when there's more
demand. Your extent-scan is itself waiting for ext4 to support it:
maybe get noisy at me when that's imminent.
Hugh
>
> Thanks,
> -Jeff
>
> >
> > Besides, seek-hole/data is still shiny new and lots of developers
> > aren't even aware of it's presence in recent kernels. Removing new
> > functionality saying "no-one is using it" is like smashing the egg
> > before the chicken hatches (or is it cutting of the chickes's head
> > before it lays the egg?).
> >
> > Cheers,
> >
> > Dave.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
2012-07-16 9:28 ` Hugh Dickins
@ 2012-07-17 6:15 ` Jeff Liu
0 siblings, 0 replies; 15+ messages in thread
From: Jeff Liu @ 2012-07-17 6:15 UTC (permalink / raw)
To: Hugh Dickins
Cc: Dave Chinner, Cong Wang, linux-mm, linux-fsdevel, linux-kernel
On 07/16/2012 05:28 PM, Hugh Dickins wrote:
> On Thu, 12 Jul 2012, Jeff Liu wrote:
>> On 07/12/2012 07:01 AM, Dave Chinner wrote:
>>> On Wed, Jul 11, 2012 at 11:55:34AM -0700, Hugh Dickins wrote:
>>>>
>>>> But your vote would count for a lot more if you know of some app which
>>>> would really benefit from this functionality in tmpfs: I've heard of none.
>>>
>>> So what? I've heard of no apps that use this functionality on XFS,
>>> either, but I have heard of a lot of people asking for it to be
>>> implemented over the past couple of years so they can use it.
>>> There's been patches written to make coreutils (cp) make use of it
>>> instead of parsing FIEMAP output to find holes, though I don't know
>>> if that's gone beyond more than "here's some patches"...
>>
>> Yes, for apps, cp(1) will make use of it to replace the old FIEMAP for efficient sparse file copy.
>> I have implemented an extent-scan module to coreutils a few years ago,
>> http://fossies.org/dox/coreutils-8.17/extent-scan_8c_source.html
>
> Thanks for confirming Dave's pointer to cp.
>
> Of course, tmpfs has never supported FIBMAP or FIEMAP;
> but SEEK_DATA and SEEK_HOLE do fit it much more naturally.
>
>>
>> It does extent scan through FIEMAP, however, SEEK_DATA/SEEK_HOLE is more convenient and easy to use
>> considering the call interface. So FIEMAP will be replaced by SEEK_XXX once it got supported by EXT4.
>>
>> Moreover, I have discussed with Jim who is the coreutils maintainer previously, He would like to post
>> extent-scan module to Gnulib so that other GNU utilities which are relied on Gnulib might be a potential
>> user of it, at least, GNU tar will definitely need it for sparse file backup.
>
> Thanks for the info. I confess I'm not hugely swayed by cp and sparse
> file archive arguments - I doubt many people care, and I doubt those who
> do care are using tmpfs for them.
>
> But my doubts are just ignorance. I was hoping to hear, not that we have
> tools to copy sparse files efficiently (umm, over the network?), but
> what apps are actually working live with those sparse files on tmpfs,
> and now need to seek around them. Some math or physics applications?
>
>>>
>>> Besides, given that you can punch holes in tmpfs files, it seems
>>> strange to then say "we don't need a method of skipping holes to
>>> find data quickly"....
>>
>> So its deserve to keep this feature working on tmpfs considering hole punch. :)
>
> Well, thank you, as I said earlier I am on both sides of the argument.
> (And feel uncomfortably like a prima donna waiting in the wings until
> the audience has shouted long enough for the encore.)
Oh, sorry, I missed you response to Dave before.
>
> It's now taken out of 3.5, but we can bring it back when there's more
> demand.
Yep. :)
Thanks,
-Jeff
> Your extent-scan is itself waiting for ext4 to support it:
> maybe get noisy at me when that's imminent.
>
> Hugh
>
>>
>> Thanks,
>> -Jeff
>>
>>>
>>> Besides, seek-hole/data is still shiny new and lots of developers
>>> aren't even aware of it's presence in recent kernels. Removing new
>>> functionality saying "no-one is using it" is like smashing the egg
>>> before the chicken hatches (or is it cutting of the chickes's head
>>> before it lays the egg?).
>>>
>>> Cheers,
>>>
>>> Dave.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 2/3] shmem: fix negative rss in memcg memory.stat
2012-07-09 22:35 [PATCH 0/3] shmem/tmpfs: three late patches Hugh Dickins
2012-07-09 22:41 ` [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE Hugh Dickins
@ 2012-07-09 22:44 ` Hugh Dickins
2012-07-10 12:41 ` Johannes Weiner
2012-07-09 22:46 ` [PATCH 3/3] shmem: cleanup shmem_add_to_page_cache Hugh Dickins
2012-07-09 23:39 ` [PATCH 0/3] shmem/tmpfs: three late patches Andrew Morton
3 siblings, 1 reply; 15+ messages in thread
From: Hugh Dickins @ 2012-07-09 22:44 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko, linux-mm,
linux-kernel
When adding the page_private checks before calling shmem_replace_page(),
I did realize that there is a further race, but thought it too unlikely
to need a hurried fix.
But independently I've been chasing why a mem cgroup's memory.stat
sometimes shows negative rss after all tasks have gone: I expected it
to be a stats gathering bug, but actually it's shmem swapping's fault.
It's an old surprise, that when you lock_page(lookup_swap_cache(swap)),
the page may have been removed from swapcache before getting the lock;
or it may have been freed and reused and be back in swapcache; and it
can even be using the same swap location as before (page_private same).
The swapoff case is already secure against this (swap cannot be reused
until the whole area has been swapped off, and a new swapped on); and
shmem_getpage_gfp() is protected by shmem_add_to_page_cache()'s check
for the expected radix_tree entry - but a little too late.
By that time, we might have already decided to shmem_replace_page():
I don't know of a problem from that, but I'd feel more at ease not to
do so spuriously. And we have already done mem_cgroup_cache_charge(),
on perhaps the wrong mem cgroup: and this charge is not then undone on
the error path, because PageSwapCache ends up preventing that.
It's this last case which causes the occasional negative rss in
memory.stat: the page is charged here as cache, but (sometimes) found
to be anon when eventually it's uncharged - and in between, it's an
undeserved charge on the wrong memcg.
Fix this by adding an earlier check on the radix_tree entry: it's
inelegant to descend the tree twice, but swapping is not the fast path,
and a better solution would need a pair (try+commit) of memcg calls,
and a rework of shmem_replace_page() to keep out of the swapcache.
We can use the added shmem_confirm_swap() function to replace the
find_get_page+page_cache_release we were already doing on the error
path. And add a comment on that -EEXIST: it seems a peculiar errno
to be using, but originates from its use in radix_tree_insert().
[It can be surprising to see positive rss left in a memcg's memory.stat
after all tasks have gone, since it is supposed to count anonymous but
not shmem. Aside from sharing anon pages via fork with a task in some
other memcg, it often happens after swapping: because a swap page can't
be freed while under writeback, nor while locked. So it's not an error,
and these residual pages are easily freed once pressure demands.]
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
---
I'd rather like this to go into v3.5, but it is late, and I don't have
a very strong argument for it: as you prefer. And I've not marked it
for stable, since the patch won't apply to v3.4 as is; but I'd happily
supply a patch for v3.1 onwards if asked.
mm/shmem.c | 41 +++++++++++++++++++++++++++++------------
1 file changed, 29 insertions(+), 12 deletions(-)
--- 3.5-rc6/mm/shmem.c 2012-07-07 19:20:02.986950655 -0700
+++ linux/mm/shmem.c 2012-07-07 19:20:52.026952048 -0700
@@ -264,6 +264,24 @@ static int shmem_radix_tree_replace(stru
}
/*
+ * Sometimes, before we decide whether to proceed or to fail, we must check
+ * that an entry was not already brought back from swap by a racing thread.
+ *
+ * Checking page is not enough: by the time a SwapCache page is locked, it
+ * might be reused, and again be SwapCache, using the same swap as before.
+ */
+static bool shmem_confirm_swap(struct address_space *mapping,
+ pgoff_t index, swp_entry_t swap)
+{
+ void *item;
+
+ rcu_read_lock();
+ item = radix_tree_lookup(&mapping->page_tree, index);
+ rcu_read_unlock();
+ return item == swp_to_radix_entry(swap);
+}
+
+/*
* Like add_to_page_cache_locked, but error if expected item has gone.
*/
static int shmem_add_to_page_cache(struct page *page,
@@ -1124,9 +1142,9 @@ repeat:
/* We have to do this with page locked to prevent races */
lock_page(page);
if (!PageSwapCache(page) || page_private(page) != swap.val ||
- page->mapping) {
+ !shmem_confirm_swap(mapping, index, swap)) {
error = -EEXIST; /* try again */
- goto failed;
+ goto unlock;
}
if (!PageUptodate(page)) {
error = -EIO;
@@ -1142,9 +1160,12 @@ repeat:
error = mem_cgroup_cache_charge(page, current->mm,
gfp & GFP_RECLAIM_MASK);
- if (!error)
+ if (!error) {
error = shmem_add_to_page_cache(page, mapping, index,
gfp, swp_to_radix_entry(swap));
+ /* We already confirmed swap, and make no allocation */
+ VM_BUG_ON(error);
+ }
if (error)
goto failed;
@@ -1245,14 +1266,10 @@ decused:
unacct:
shmem_unacct_blocks(info->flags, 1);
failed:
- if (swap.val && error != -EINVAL) {
- struct page *test = find_get_page(mapping, index);
- if (test && !radix_tree_exceptional_entry(test))
- page_cache_release(test);
- /* Have another try if the entry has changed */
- if (test != swp_to_radix_entry(swap))
- error = -EEXIST;
- }
+ if (swap.val && error != -EINVAL &&
+ !shmem_confirm_swap(mapping, index, swap))
+ error = -EEXIST;
+unlock:
if (page) {
unlock_page(page);
page_cache_release(page);
@@ -1264,7 +1281,7 @@ failed:
spin_unlock(&info->lock);
goto repeat;
}
- if (error == -EEXIST)
+ if (error == -EEXIST) /* from above or from radix_tree_insert */
goto repeat;
return error;
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/3] shmem: fix negative rss in memcg memory.stat
2012-07-09 22:44 ` [PATCH 2/3] shmem: fix negative rss in memcg memory.stat Hugh Dickins
@ 2012-07-10 12:41 ` Johannes Weiner
2012-07-11 18:15 ` Hugh Dickins
0 siblings, 1 reply; 15+ messages in thread
From: Johannes Weiner @ 2012-07-10 12:41 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Michal Hocko, linux-mm,
linux-kernel
On Mon, Jul 09, 2012 at 03:44:24PM -0700, Hugh Dickins wrote:
> When adding the page_private checks before calling shmem_replace_page(),
> I did realize that there is a further race, but thought it too unlikely
> to need a hurried fix.
>
> But independently I've been chasing why a mem cgroup's memory.stat
> sometimes shows negative rss after all tasks have gone: I expected it
> to be a stats gathering bug, but actually it's shmem swapping's fault.
>
> It's an old surprise, that when you lock_page(lookup_swap_cache(swap)),
> the page may have been removed from swapcache before getting the lock;
> or it may have been freed and reused and be back in swapcache; and it
> can even be using the same swap location as before (page_private same).
>
> The swapoff case is already secure against this (swap cannot be reused
> until the whole area has been swapped off, and a new swapped on); and
> shmem_getpage_gfp() is protected by shmem_add_to_page_cache()'s check
> for the expected radix_tree entry - but a little too late.
>
> By that time, we might have already decided to shmem_replace_page():
> I don't know of a problem from that, but I'd feel more at ease not to
> do so spuriously. And we have already done mem_cgroup_cache_charge(),
> on perhaps the wrong mem cgroup: and this charge is not then undone on
> the error path, because PageSwapCache ends up preventing that.
I couldn't see anything wrong with shmem_replace_page(), either, but
maybe the comment in its error path could be updated as the callsite
does not rely on page_private alone anymore to confirm correct swap.
> It's this last case which causes the occasional negative rss in
> memory.stat: the page is charged here as cache, but (sometimes) found
> to be anon when eventually it's uncharged - and in between, it's an
> undeserved charge on the wrong memcg.
>
> Fix this by adding an earlier check on the radix_tree entry: it's
> inelegant to descend the tree twice, but swapping is not the fast path,
> and a better solution would need a pair (try+commit) of memcg calls,
> and a rework of shmem_replace_page() to keep out of the swapcache.
>
> We can use the added shmem_confirm_swap() function to replace the
> find_get_page+page_cache_release we were already doing on the error
> path. And add a comment on that -EEXIST: it seems a peculiar errno
> to be using, but originates from its use in radix_tree_insert().
>
> [It can be surprising to see positive rss left in a memcg's memory.stat
> after all tasks have gone, since it is supposed to count anonymous but
> not shmem. Aside from sharing anon pages via fork with a task in some
> other memcg, it often happens after swapping: because a swap page can't
> be freed while under writeback, nor while locked. So it's not an error,
> and these residual pages are easily freed once pressure demands.]
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/3] shmem: fix negative rss in memcg memory.stat
2012-07-10 12:41 ` Johannes Weiner
@ 2012-07-11 18:15 ` Hugh Dickins
0 siblings, 0 replies; 15+ messages in thread
From: Hugh Dickins @ 2012-07-11 18:15 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Michal Hocko, linux-mm,
linux-kernel
On Tue, 10 Jul 2012, Johannes Weiner wrote:
>
> I couldn't see anything wrong with shmem_replace_page(), either, but
> maybe the comment in its error path could be updated as the callsite
> does not rely on page_private alone anymore to confirm correct swap.
I went in to make an incremental fix to update that comment as you
suggest, but found that actually the comment should stay as is.
We're dealing with two different radix trees here: shmem_confirm_swap()
and shmem_add_to_page_cache() are operating on the tmpfs file radix tree,
but shmem_replace_page() is using shmem_radix_tree_replace() to operate
on the "swapper_space" radix tree, exchanging the page pointer there.
The preliminary page_private test (under page lock) should indeed be
guaranteeing that the page pointer found in the swapper_space radix
tree at that (swp_entry_t) offset is the one we expect there, as before.
Whereas the new shmem_confirm_swap() test doesn't help to guarantee that
part at all: it's for confirming that the swap entry is still being used
for the offset in the file that we're interested in.
The comment I would like to change is the "nice clean interface" one!
While it does make for a nice old-page-in/new-page-out interface to that
function, I've come to feel that it would be much better not to mess with
the swapcache at all there - leave the old page in the swapcache, and
remove it at the same time as inserting the new page into filecache.
But I'm also reluctant to mess with what's working: I'm in no rush
to change that around, I'd be sure to screw it up at first.
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Thanks,
Hugh
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 3/3] shmem: cleanup shmem_add_to_page_cache
2012-07-09 22:35 [PATCH 0/3] shmem/tmpfs: three late patches Hugh Dickins
2012-07-09 22:41 ` [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE Hugh Dickins
2012-07-09 22:44 ` [PATCH 2/3] shmem: fix negative rss in memcg memory.stat Hugh Dickins
@ 2012-07-09 22:46 ` Hugh Dickins
2012-07-10 13:01 ` Johannes Weiner
2012-07-09 23:39 ` [PATCH 0/3] shmem/tmpfs: three late patches Andrew Morton
3 siblings, 1 reply; 15+ messages in thread
From: Hugh Dickins @ 2012-07-09 22:46 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko, linux-mm,
linux-kernel
shmem_add_to_page_cache() has three callsites, but only one of them
wants the radix_tree_preload() (an exceptional entry guarantees that
the radix tree node is present in the other cases), and only that site
can achieve mem_cgroup_uncharge_cache_page() (PageSwapCache makes it a
no-op in the other cases). We did it this way originally to reflect
add_to_page_cache_locked(); but it's confusing now, so move the
radix_tree preloading and mem_cgroup uncharging to that one caller.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
---
This is just a cleanup: I'd prefer it to go in along with the fix 2/3,
but it can be delayed to v3.6 if you prefer.
mm/shmem.c | 58 ++++++++++++++++++++++++---------------------------
1 file changed, 28 insertions(+), 30 deletions(-)
--- 3.5-rc6+/mm/shmem.c 2012-07-07 19:20:52.026952048 -0700
+++ linux/mm/shmem.c 2012-07-07 19:21:44.342952082 -0700
@@ -288,40 +288,31 @@ static int shmem_add_to_page_cache(struc
struct address_space *mapping,
pgoff_t index, gfp_t gfp, void *expected)
{
- int error = 0;
+ int error;
VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(!PageSwapBacked(page));
+ page_cache_get(page);
+ page->mapping = mapping;
+ page->index = index;
+
+ spin_lock_irq(&mapping->tree_lock);
if (!expected)
- error = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
+ error = radix_tree_insert(&mapping->page_tree, index, page);
+ else
+ error = shmem_radix_tree_replace(mapping, index, expected,
+ page);
if (!error) {
- page_cache_get(page);
- page->mapping = mapping;
- page->index = index;
-
- spin_lock_irq(&mapping->tree_lock);
- if (!expected)
- error = radix_tree_insert(&mapping->page_tree,
- index, page);
- else
- error = shmem_radix_tree_replace(mapping, index,
- expected, page);
- if (!error) {
- mapping->nrpages++;
- __inc_zone_page_state(page, NR_FILE_PAGES);
- __inc_zone_page_state(page, NR_SHMEM);
- spin_unlock_irq(&mapping->tree_lock);
- } else {
- page->mapping = NULL;
- spin_unlock_irq(&mapping->tree_lock);
- page_cache_release(page);
- }
- if (!expected)
- radix_tree_preload_end();
+ mapping->nrpages++;
+ __inc_zone_page_state(page, NR_FILE_PAGES);
+ __inc_zone_page_state(page, NR_SHMEM);
+ spin_unlock_irq(&mapping->tree_lock);
+ } else {
+ page->mapping = NULL;
+ spin_unlock_irq(&mapping->tree_lock);
+ page_cache_release(page);
}
- if (error)
- mem_cgroup_uncharge_cache_page(page);
return error;
}
@@ -1202,11 +1193,18 @@ repeat:
__set_page_locked(page);
error = mem_cgroup_cache_charge(page, current->mm,
gfp & GFP_RECLAIM_MASK);
- if (!error)
- error = shmem_add_to_page_cache(page, mapping, index,
- gfp, NULL);
if (error)
goto decused;
+ error = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
+ if (!error) {
+ error = shmem_add_to_page_cache(page, mapping, index,
+ gfp, NULL);
+ radix_tree_preload_end();
+ }
+ if (error) {
+ mem_cgroup_uncharge_cache_page(page);
+ goto decused;
+ }
lru_cache_add_anon(page);
spin_lock(&info->lock);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 3/3] shmem: cleanup shmem_add_to_page_cache
2012-07-09 22:46 ` [PATCH 3/3] shmem: cleanup shmem_add_to_page_cache Hugh Dickins
@ 2012-07-10 13:01 ` Johannes Weiner
0 siblings, 0 replies; 15+ messages in thread
From: Johannes Weiner @ 2012-07-10 13:01 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Michal Hocko, linux-mm,
linux-kernel
On Mon, Jul 09, 2012 at 03:46:53PM -0700, Hugh Dickins wrote:
> shmem_add_to_page_cache() has three callsites, but only one of them
> wants the radix_tree_preload() (an exceptional entry guarantees that
> the radix tree node is present in the other cases), and only that site
> can achieve mem_cgroup_uncharge_cache_page() (PageSwapCache makes it a
> no-op in the other cases). We did it this way originally to reflect
> add_to_page_cache_locked(); but it's confusing now, so move the
> radix_tree preloading and mem_cgroup uncharging to that one caller.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
I'm rebasing my (un)charge series on top of these, thanks. It only
annihilates 3/11 and leaves the rest alone--line numbers aside--since
the rules did not change.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/3] shmem/tmpfs: three late patches
2012-07-09 22:35 [PATCH 0/3] shmem/tmpfs: three late patches Hugh Dickins
` (2 preceding siblings ...)
2012-07-09 22:46 ` [PATCH 3/3] shmem: cleanup shmem_add_to_page_cache Hugh Dickins
@ 2012-07-09 23:39 ` Andrew Morton
3 siblings, 0 replies; 15+ messages in thread
From: Andrew Morton @ 2012-07-09 23:39 UTC (permalink / raw)
To: Hugh Dickins; +Cc: linux-mm, linux-kernel
On Mon, 9 Jul 2012 15:35:26 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:
> Here's three little shmem/tmpfs patches against v3.5-rc6.
> Either the first should go in before v3.5 final, or it should not go
> in at all. The second and third are independent of it: I'd like them
> in v3.5, but don't have a clinching argument: see what you think.
Thanks, I queued all three for 3.5.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 15+ messages in thread