Tmpfs size accounting misses memory allocations

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Tmpfs size accounting misses memory allocations
@ 2020-04-25 12:33 Topi Miettinen
  2020-04-28  1:34 ` Hugh Dickins
  0 siblings, 1 reply; 7+ messages in thread
From: Topi Miettinen @ 2020-04-25 12:33 UTC (permalink / raw)
  To: hughd, linux-mm

Hi,

It seems that tmpfs does not count memory which is allocated for short 
symlinks or xattrs towards size= limit. I guess the fix would be to 
change shmem_sb_info->{used_blocks,max_blocks} to use bytes as units 
(instead of blocks) and then add accounting and checks to 
shmem_symlink() and shmem_initxattrs(). Would a patch for that be 
acceptable?

Another issue is that inodes aren't counted towards size= limit either, 
but perhaps that's intentional because there's nr_inodes= mount option 
for exactly that.

If these are not bugs but intentional features, I think the manual page 
tmpfs(5) should be clearer in this respect.

-Topi

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Tmpfs size accounting misses memory allocations
  2020-04-25 12:33 Tmpfs size accounting misses memory allocations Topi Miettinen
@ 2020-04-28  1:34 ` Hugh Dickins
  2020-04-28 10:51   ` Topi Miettinen
  0 siblings, 1 reply; 7+ messages in thread
From: Hugh Dickins @ 2020-04-28  1:34 UTC (permalink / raw)
  To: Topi Miettinen; +Cc: Hugh Dickins, Helge Deller, Pete Zaitcev, linux-mm

On Sat, 25 Apr 2020, Topi Miettinen wrote:
> Hi,
> 
> It seems that tmpfs does not count memory which is allocated for short
> symlinks or xattrs towards size= limit.

Yes, you are right. And that is why tmpfs does not (so far) support
user xattrs, because the unprivileged user could take up too much
memory that way.

> I guess the fix would be to change
> shmem_sb_info->{used_blocks,max_blocks} to use bytes as units (instead of
> blocks) and then add accounting and checks to shmem_symlink() and
> shmem_initxattrs(). Would a patch for that be acceptable?

Thank you for offering, but I don't think a patch for exactly that would
be acceptable. Because "size=" has just been another way of expressing
"nr_blocks=" ever since it was added in 2.4.4, and tmpfs has never
counted the kernel metadata towards its data blocks limit.

You could certainly argue that it should have done from the start; but
in order to keep the accounting suitably simple (pages rather than bytes)
it never did. And I believe there are many users who expect a tmpfs of a
certain size to be able to accommodate data of that size, who would not
care to have to change their scripts or apps to meet a lower limitation.

> 
> Another issue is that inodes aren't counted towards size= limit either, but
> perhaps that's intentional because there's nr_inodes= mount option for
> exactly that.

Yes, tmpfs lets the nr_inodes limit be used to constrain the kernel
metadata (and tmpfs has a peculiarity, that it actually counts hard
links out of nr_inodes, in order to limit the memory spent on dentries).

I doubt the nr_inodes limit is depended upon so critically as the
nr_blocks, and I think we might extend it (say, consider each 1 of
nr_inodes to grant approximately 1kB of unswappable lowmem metadata)
to enable limited user xattrs: a patch along those lines might well
be acceptable.

> 
> If these are not bugs but intentional features, I think the manual page
> tmpfs(5) should be clearer in this respect.

Nobody has asked for that before, it seems to have met expectations as is.
But a sentence could be added to the manpage: do you have wording in mind?

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Tmpfs size accounting misses memory allocations
  2020-04-28  1:34 ` Hugh Dickins
@ 2020-04-28 10:51   ` Topi Miettinen
  2020-05-01  3:14     ` Hugh Dickins
  0 siblings, 1 reply; 7+ messages in thread
From: Topi Miettinen @ 2020-04-28 10:51 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Helge Deller, Pete Zaitcev, linux-mm

On 28.4.2020 4.34, Hugh Dickins wrote:
> On Sat, 25 Apr 2020, Topi Miettinen wrote:
>> Hi,
>>
>> It seems that tmpfs does not count memory which is allocated for short
>> symlinks or xattrs towards size= limit.
> 
> Yes, you are right. And that is why tmpfs does not (so far) support
> user xattrs, because the unprivileged user could take up too much
> memory that way.
> 
>> I guess the fix would be to change
>> shmem_sb_info->{used_blocks,max_blocks} to use bytes as units (instead of
>> blocks) and then add accounting and checks to shmem_symlink() and
>> shmem_initxattrs(). Would a patch for that be acceptable?
> 
> Thank you for offering, but I don't think a patch for exactly that would
> be acceptable. Because "size=" has just been another way of expressing
> "nr_blocks=" ever since it was added in 2.4.4, and tmpfs has never
> counted the kernel metadata towards its data blocks limit.
> 
> You could certainly argue that it should have done from the start; but
> in order to keep the accounting suitably simple (pages rather than bytes)
> it never did. And I believe there are many users who expect a tmpfs of a
> certain size to be able to accommodate data of that size, who would not
> care to have to change their scripts or apps to meet a lower limitation.
> 
>>
>> Another issue is that inodes aren't counted towards size= limit either, but
>> perhaps that's intentional because there's nr_inodes= mount option for
>> exactly that.
> 
> Yes, tmpfs lets the nr_inodes limit be used to constrain the kernel
> metadata (and tmpfs has a peculiarity, that it actually counts hard
> links out of nr_inodes, in order to limit the memory spent on dentries).
> 
> I doubt the nr_inodes limit is depended upon so critically as the
> nr_blocks, and I think we might extend it (say, consider each 1 of
> nr_inodes to grant approximately 1kB of unswappable lowmem metadata)
> to enable limited user xattrs: a patch along those lines might well
> be acceptable.

I'm interested in restricting the amount of memory allocated to tmpfs 
mounts in the system rather than granting more. I've seen a system lock 
up because tmpfs mounts consumed the entire memory. Possible 
contributing factors could be use of LVM and encryption for the swap.

Perhaps there should be a single system limit (sysctl) for total memory 
consumed by all {dev,}tmpfs mounts? Setting this to for example 75% 
could be life saving. Then also the compatibility issues would not 
matter and all memory allocations could be counted.

>> If these are not bugs but intentional features, I think the manual page
>> tmpfs(5) should be clearer in this respect.
> 
> Nobody has asked for that before, it seems to have met expectations as is.
> But a sentence could be added to the manpage: do you have wording in mind?

For example addition to the size option:

NB: Only the contents (blocks) of regular files are counted towards the 
size limit. To limit RAM consumption also by the inodes of the 
directories, symbolic links or device special files, option nr_inodes 
must be used.

-Topi


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Tmpfs size accounting misses memory allocations
  2020-04-28 10:51   ` Topi Miettinen
@ 2020-05-01  3:14     ` Hugh Dickins
  2020-05-01  7:29       ` Topi Miettinen
                         ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Hugh Dickins @ 2020-05-01  3:14 UTC (permalink / raw)
  To: Topi Miettinen; +Cc: Hugh Dickins, Helge Deller, Pete Zaitcev, linux-mm

On Tue, 28 Apr 2020, Topi Miettinen wrote:
> On 28.4.2020 4.34, Hugh Dickins wrote:
> > On Sat, 25 Apr 2020, Topi Miettinen wrote:
> > > Hi,
> > > 
> > > It seems that tmpfs does not count memory which is allocated for short
> > > symlinks or xattrs towards size= limit.
> > 
> > Yes, you are right. And that is why tmpfs does not (so far) support
> > user xattrs, because the unprivileged user could take up too much
> > memory that way.
> > 
> > > I guess the fix would be to change
> > > shmem_sb_info->{used_blocks,max_blocks} to use bytes as units (instead of
> > > blocks) and then add accounting and checks to shmem_symlink() and
> > > shmem_initxattrs(). Would a patch for that be acceptable?
> > 
> > Thank you for offering, but I don't think a patch for exactly that would
> > be acceptable. Because "size=" has just been another way of expressing
> > "nr_blocks=" ever since it was added in 2.4.4, and tmpfs has never
> > counted the kernel metadata towards its data blocks limit.
> > 
> > You could certainly argue that it should have done from the start; but
> > in order to keep the accounting suitably simple (pages rather than bytes)
> > it never did. And I believe there are many users who expect a tmpfs of a
> > certain size to be able to accommodate data of that size, who would not
> > care to have to change their scripts or apps to meet a lower limitation.
> > 
> > > 
> > > Another issue is that inodes aren't counted towards size= limit either,
> > > but
> > > perhaps that's intentional because there's nr_inodes= mount option for
> > > exactly that.
> > 
> > Yes, tmpfs lets the nr_inodes limit be used to constrain the kernel
> > metadata (and tmpfs has a peculiarity, that it actually counts hard
> > links out of nr_inodes, in order to limit the memory spent on dentries).
> > 
> > I doubt the nr_inodes limit is depended upon so critically as the
> > nr_blocks, and I think we might extend it (say, consider each 1 of
> > nr_inodes to grant approximately 1kB of unswappable lowmem metadata)
> > to enable limited user xattrs: a patch along those lines might well
> > be acceptable.
> 
> I'm interested in restricting the amount of memory allocated to tmpfs mounts
> in the system rather than granting more. I've seen a system lock up because
> tmpfs mounts consumed the entire memory. Possible contributing factors could
> be use of LVM and encryption for the swap.

Yes, it is too easy to get into a terrible state that way.  With OOM
killer doing no good at all, because it's busy killing processes, which
does nothing to free the memory held by tmpfs files.  I've never found
a good answer to that in general, though marking files as suitable for
truncation on OOM has been useful in special cases.

> 
> Perhaps there should be a single system limit (sysctl) for total memory
> consumed by all {dev,}tmpfs mounts? Setting this to for example 75% could be
> life saving. Then also the compatibility issues would not matter and all
> memory allocations could be counted.

It's a good suggestion, though I don't much like it. Why not? Hmm:
I'm having difficulty expressing why not, let me sit on it for a bit,
I may come around to your idea there.

My resistance is partly because we already have several other schemes
for resource limiting: the nr_blocks+nr_inodes, the __vm_enough_memory()
checks (/proc/sys/vm/overcommit*), and nowadays memory cgroups.  Adding
yet another is likely to make some fast paths slower.

I expect you to say that this is a fundamental problem with tmpfs, which
should not have to rely on use of MEMCG to save it: I'll agree with you.
And I'd hoped to sell you /proc/sys/vm/overcommit_memory 2 (never
overcommit), before realizing that offers no protection at all from an
explosion of tmpfs inodes, only an explosion of tmpfs data.  Not sure
why we never realized that back in the day: perhaps that is what should
be fixed.

I see now that whether metadata (nr_inodes space) is included in size
or not is just a detail: you had quite reasonably expected it to be
included, but it happens to be additional. The problem is not that it's
additional to size, but that few of us are conscious of how high its
default is: how much memory could be gobbled up in that way (I roughly
estimate 1kB per inode+dentry). Though I think you need a malicious
explosion of inodes in several tmpfs instances to get into trouble
that way, if data sizes are already properly limited.

> 
> > > If these are not bugs but intentional features, I think the manual page
> > > tmpfs(5) should be clearer in this respect.
> > 
> > Nobody has asked for that before, it seems to have met expectations as is.
> > But a sentence could be added to the manpage: do you have wording in mind?
> 
> For example addition to the size option:
> 
> NB: Only the contents (blocks) of regular files are counted towards the size
> limit. To limit RAM consumption also by the inodes of the directories,
> symbolic links or device special files, option nr_inodes must be used.

Better not get into listing directories, symbolic links or device special
files there: it's equally a weakness for regular files - if so minded,
I think you could exhaust memory with enough 0-length regular files
across enough tmpfs instances.

I'm holding back from sending that on to Michael Kerrisk for the man-page
for now: if we do decide to add a protective sysctl, we shall want to
mention that there too.

But one little change that occurred to me this morning: how about this
patch, to stop hiding the default size and nr_inodes from /proc/mounts
(and "mount" command), to help make people more conscious of these limits,
and encourage them to be scaled down:

--- 5.7-rc3/mm/shmem.c	2020-04-26 16:05:25.061228461 -0700
+++ linux/mm/shmem.c	2020-04-30 18:59:59.253865989 -0700
@@ -3583,11 +3583,8 @@ static int shmem_show_options(struct seq
 {
 	struct shmem_sb_info *sbinfo = SHMEM_SB(root->d_sb);

-	if (sbinfo->max_blocks != shmem_default_max_blocks())
-		seq_printf(seq, ",size=%luk",
-			sbinfo->max_blocks << (PAGE_SHIFT - 10));
-	if (sbinfo->max_inodes != shmem_default_max_inodes())
-		seq_printf(seq, ",nr_inodes=%lu", sbinfo->max_inodes);
+	seq_printf(seq, ",size=%luk", sbinfo->max_blocks << (PAGE_SHIFT - 10));
+	seq_printf(seq, ",nr_inodes=%lu", sbinfo->max_inodes);
 	if (sbinfo->mode != (0777 | S_ISVTX))
 		seq_printf(seq, ",mode=%03ho", sbinfo->mode);
 	if (!uid_eq(sbinfo->uid, GLOBAL_ROOT_UID))

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Tmpfs size accounting misses memory allocations
  2020-05-01  3:14     ` Hugh Dickins
@ 2020-05-01  7:29       ` Topi Miettinen
  2020-05-01 19:05       ` Topi Miettinen
  2020-05-03 19:58       ` Topi Miettinen
  2 siblings, 0 replies; 7+ messages in thread
From: Topi Miettinen @ 2020-05-01  7:29 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Helge Deller, Pete Zaitcev, linux-mm

On 1.5.2020 6.14, Hugh Dickins wrote:
> On Tue, 28 Apr 2020, Topi Miettinen wrote:
>> On 28.4.2020 4.34, Hugh Dickins wrote:
>>> On Sat, 25 Apr 2020, Topi Miettinen wrote:
>>>> Hi,
>>>>
>>>> It seems that tmpfs does not count memory which is allocated for short
>>>> symlinks or xattrs towards size= limit.
>>>
>>> Yes, you are right. And that is why tmpfs does not (so far) support
>>> user xattrs, because the unprivileged user could take up too much
>>> memory that way.
>>>
>>>> I guess the fix would be to change
>>>> shmem_sb_info->{used_blocks,max_blocks} to use bytes as units (instead of
>>>> blocks) and then add accounting and checks to shmem_symlink() and
>>>> shmem_initxattrs(). Would a patch for that be acceptable?
>>>
>>> Thank you for offering, but I don't think a patch for exactly that would
>>> be acceptable. Because "size=" has just been another way of expressing
>>> "nr_blocks=" ever since it was added in 2.4.4, and tmpfs has never
>>> counted the kernel metadata towards its data blocks limit.
>>>
>>> You could certainly argue that it should have done from the start; but
>>> in order to keep the accounting suitably simple (pages rather than bytes)
>>> it never did. And I believe there are many users who expect a tmpfs of a
>>> certain size to be able to accommodate data of that size, who would not
>>> care to have to change their scripts or apps to meet a lower limitation.
>>>
>>>>
>>>> Another issue is that inodes aren't counted towards size= limit either,
>>>> but
>>>> perhaps that's intentional because there's nr_inodes= mount option for
>>>> exactly that.
>>>
>>> Yes, tmpfs lets the nr_inodes limit be used to constrain the kernel
>>> metadata (and tmpfs has a peculiarity, that it actually counts hard
>>> links out of nr_inodes, in order to limit the memory spent on dentries).
>>>
>>> I doubt the nr_inodes limit is depended upon so critically as the
>>> nr_blocks, and I think we might extend it (say, consider each 1 of
>>> nr_inodes to grant approximately 1kB of unswappable lowmem metadata)
>>> to enable limited user xattrs: a patch along those lines might well
>>> be acceptable.
>>
>> I'm interested in restricting the amount of memory allocated to tmpfs mounts
>> in the system rather than granting more. I've seen a system lock up because
>> tmpfs mounts consumed the entire memory. Possible contributing factors could
>> be use of LVM and encryption for the swap.
> 
> Yes, it is too easy to get into a terrible state that way.  With OOM
> killer doing no good at all, because it's busy killing processes, which
> does nothing to free the memory held by tmpfs files.  I've never found
> a good answer to that in general, though marking files as suitable for
> truncation on OOM has been useful in special cases.
> 
>>
>> Perhaps there should be a single system limit (sysctl) for total memory
>> consumed by all {dev,}tmpfs mounts? Setting this to for example 75% could be
>> life saving. Then also the compatibility issues would not matter and all
>> memory allocations could be counted.
> 
> It's a good suggestion, though I don't much like it. Why not? Hmm:
> I'm having difficulty expressing why not, let me sit on it for a bit,
> I may come around to your idea there.
> 
> My resistance is partly because we already have several other schemes
> for resource limiting: the nr_blocks+nr_inodes, the __vm_enough_memory()
> checks (/proc/sys/vm/overcommit*), and nowadays memory cgroups.  Adding
> yet another is likely to make some fast paths slower.
> 
> I expect you to say that this is a fundamental problem with tmpfs, which
> should not have to rely on use of MEMCG to save it: I'll agree with you.
> And I'd hoped to sell you /proc/sys/vm/overcommit_memory 2 (never
> overcommit), before realizing that offers no protection at all from an
> explosion of tmpfs inodes, only an explosion of tmpfs data.  Not sure
> why we never realized that back in the day: perhaps that is what should
> be fixed.

Memory cgroups (or something similar) could be a solution if the total 
could be kept for example at 75%, but wouldn't this leave 25% of memory 
totally unused? A new cgroup to limit only tmpfs would not have this 
problem. Either way, setting reasonable limits for each cgroup could be 
a lot of work. The mounts can be global and each mount namespace can 
have additional tmpfs mounts.

/proc/sys/vm/overcommit_memory or overcommit_ratio could also work, if 
there was a way to limit the memory use to less than 100%. Also address 
space usage by itself is not interesting but physical RAM use (perhaps 
via page tables for address spaces).

Man proc(5) also mentions /proc/sys/fs/inode-max to limit in-memory 
inodes but it no longer exists since v2.2.

Another existing candidate could be /proc/sys/kernel/shmall (system-wide 
limit on the total number of pages of System V shared memory), if SysV 
shm here was somehow reinterpreted as any shared memory including tmpfs. 
Is there any connection between SysV shm and tmpfs anymore? If there was 
a global limit (and/or cgroup limits), should shm also be counted 
towards the limits or should they be separate?

Then there's /proc/sys/vm/admin_reserve_kbytes. I suppose 'admin' means 
UID 0 here, so it would not protect the system from accidental filling 
of tmpfs mounts by UID 0 processes. It would still be something.

> I see now that whether metadata (nr_inodes space) is included in size
> or not is just a detail: you had quite reasonably expected it to be
> included, but it happens to be additional. The problem is not that it's
> additional to size, but that few of us are conscious of how high its
> default is: how much memory could be gobbled up in that way (I roughly
> estimate 1kB per inode+dentry). Though I think you need a malicious
> explosion of inodes in several tmpfs instances to get into trouble
> that way, if data sizes are already properly limited.

Half of physical memory pages allows 1/8 (50% * 1kB / 4kB pagesize) of 
the memory to be used for the inodes, so if inodes are not limited, it 
would take eight tmpfs mounts for inodes only attack. Though data block 
limits are never zero and unfortunately there could be unlimited mounts 
in the systems, so the number of mounts needed for causing trouble is 
probably lower.

>>>> If these are not bugs but intentional features, I think the manual page
>>>> tmpfs(5) should be clearer in this respect.
>>>
>>> Nobody has asked for that before, it seems to have met expectations as is.
>>> But a sentence could be added to the manpage: do you have wording in mind?
>>
>> For example addition to the size option:
>>
>> NB: Only the contents (blocks) of regular files are counted towards the size
>> limit. To limit RAM consumption also by the inodes of the directories,
>> symbolic links or device special files, option nr_inodes must be used.
> 
> Better not get into listing directories, symbolic links or device special
> files there: it's equally a weakness for regular files - if so minded,
> I think you could exhaust memory with enough 0-length regular files
> across enough tmpfs instances.
> 
> I'm holding back from sending that on to Michael Kerrisk for the man-page
> for now: if we do decide to add a protective sysctl, we shall want to
> mention that there too.
> 
> But one little change that occurred to me this morning: how about this
> patch, to stop hiding the default size and nr_inodes from /proc/mounts
> (and "mount" command), to help make people more conscious of these limits,
> and encourage them to be scaled down:
> 
> --- 5.7-rc3/mm/shmem.c	2020-04-26 16:05:25.061228461 -0700
> +++ linux/mm/shmem.c	2020-04-30 18:59:59.253865989 -0700
> @@ -3583,11 +3583,8 @@ static int shmem_show_options(struct seq
>   {
>   	struct shmem_sb_info *sbinfo = SHMEM_SB(root->d_sb);
>   
> -	if (sbinfo->max_blocks != shmem_default_max_blocks())
> -		seq_printf(seq, ",size=%luk",
> -			sbinfo->max_blocks << (PAGE_SHIFT - 10));
> -	if (sbinfo->max_inodes != shmem_default_max_inodes())
> -		seq_printf(seq, ",nr_inodes=%lu", sbinfo->max_inodes);
> +	seq_printf(seq, ",size=%luk", sbinfo->max_blocks << (PAGE_SHIFT - 10));
> +	seq_printf(seq, ",nr_inodes=%lu", sbinfo->max_inodes);
>   	if (sbinfo->mode != (0777 | S_ISVTX))
>   		seq_printf(seq, ",mode=%03ho", sbinfo->mode);
>   	if (!uid_eq(sbinfo->uid, GLOBAL_ROOT_UID))
> 

Good idea. There shouldn't be any compatibility issues since the 
applications should have been prepared to read the size and nr_inodes 
options always.

-Topi


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Tmpfs size accounting misses memory allocations
  2020-05-01  3:14     ` Hugh Dickins
  2020-05-01  7:29       ` Topi Miettinen
@ 2020-05-01 19:05       ` Topi Miettinen
  2020-05-03 19:58       ` Topi Miettinen
  2 siblings, 0 replies; 7+ messages in thread
From: Topi Miettinen @ 2020-05-01 19:05 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Helge Deller, Pete Zaitcev, linux-mm

On 1.5.2020 6.14, Hugh Dickins wrote:
> On Tue, 28 Apr 2020, Topi Miettinen wrote:
>> On 28.4.2020 4.34, Hugh Dickins wrote:
>>> On Sat, 25 Apr 2020, Topi Miettinen wrote:
>>>> Hi,
>>>>
>>>> It seems that tmpfs does not count memory which is allocated for short
>>>> symlinks or xattrs towards size= limit.
>>>
>>> Yes, you are right. And that is why tmpfs does not (so far) support
>>> user xattrs, because the unprivileged user could take up too much
>>> memory that way.
>>>
>>>> I guess the fix would be to change
>>>> shmem_sb_info->{used_blocks,max_blocks} to use bytes as units (instead of
>>>> blocks) and then add accounting and checks to shmem_symlink() and
>>>> shmem_initxattrs(). Would a patch for that be acceptable?
>>>
>>> Thank you for offering, but I don't think a patch for exactly that would
>>> be acceptable. Because "size=" has just been another way of expressing
>>> "nr_blocks=" ever since it was added in 2.4.4, and tmpfs has never
>>> counted the kernel metadata towards its data blocks limit.
>>>
>>> You could certainly argue that it should have done from the start; but
>>> in order to keep the accounting suitably simple (pages rather than bytes)
>>> it never did. And I believe there are many users who expect a tmpfs of a
>>> certain size to be able to accommodate data of that size, who would not
>>> care to have to change their scripts or apps to meet a lower limitation.
>>>
>>>>
>>>> Another issue is that inodes aren't counted towards size= limit either,
>>>> but
>>>> perhaps that's intentional because there's nr_inodes= mount option for
>>>> exactly that.
>>>
>>> Yes, tmpfs lets the nr_inodes limit be used to constrain the kernel
>>> metadata (and tmpfs has a peculiarity, that it actually counts hard
>>> links out of nr_inodes, in order to limit the memory spent on dentries).
>>>
>>> I doubt the nr_inodes limit is depended upon so critically as the
>>> nr_blocks, and I think we might extend it (say, consider each 1 of
>>> nr_inodes to grant approximately 1kB of unswappable lowmem metadata)
>>> to enable limited user xattrs: a patch along those lines might well
>>> be acceptable.
>>
>> I'm interested in restricting the amount of memory allocated to tmpfs mounts
>> in the system rather than granting more. I've seen a system lock up because
>> tmpfs mounts consumed the entire memory. Possible contributing factors could
>> be use of LVM and encryption for the swap.
> 
> Yes, it is too easy to get into a terrible state that way.  With OOM
> killer doing no good at all, because it's busy killing processes, which
> does nothing to free the memory held by tmpfs files.  I've never found
> a good answer to that in general, though marking files as suitable for
> truncation on OOM has been useful in special cases.

It seems that similar state can be reached also when an unprivileged 
process allocates and maps lots of SysV shm (if the limit isn't set, 
which seems to be the case at least for Debian).

-Topi


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Tmpfs size accounting misses memory allocations
  2020-05-01  3:14     ` Hugh Dickins
  2020-05-01  7:29       ` Topi Miettinen
  2020-05-01 19:05       ` Topi Miettinen
@ 2020-05-03 19:58       ` Topi Miettinen
  2 siblings, 0 replies; 7+ messages in thread
From: Topi Miettinen @ 2020-05-03 19:58 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Helge Deller, Pete Zaitcev, linux-mm

I did more testing to find out a more specific test case. It seems that 
indeed encrypted swap on LVM logical volume is needed. Swap on 
unencrypted LVM volume, encrypted swap on raw partition or no swap at 
all is not enough. On those cases, OOM killer starts after both RAM and 
swap (if any) has been exhausted and after that it's possible to recover 
if essential processes did not get killed. The same happens with either 
tmpfs, SysV shm and just malloc().

However, in case swap is on encrypted LVM volume, the system becomes 
very unresponsive after RAM (not even swap yet) is filled with either 
tmpfs or SysV shm. It's possible to use SysRq and switch VTs (but it 
happens slowly). But bash does not respond and the cursor can stop 
blinking for a while. OOM killer is not triggered. Manual invocation of 
OOM killer with SysRq kills the bad process, but the system never 
recovers. Exhausting the RAM+swap with malloc() does not trigger this.

Here's the entry for swap in /etc/crypttab:
cswap   /dev/mapper/levy-swap            /dev/urandom 
cipher=aes-xts-plain64,size=256,hash=sha1,swap

/dev/mapper/levy-swap is a LVM volume on SSD with the same size as RAM 
(16GB).

I tested this with Debian kernel 5.6.0-1-amd64.

-Topi

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-05-03 19:58 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-04-25 12:33 Tmpfs size accounting misses memory allocations Topi Miettinen
2020-04-28  1:34 ` Hugh Dickins
2020-04-28 10:51   ` Topi Miettinen
2020-05-01  3:14     ` Hugh Dickins
2020-05-01  7:29       ` Topi Miettinen
2020-05-01 19:05       ` Topi Miettinen
2020-05-03 19:58       ` Topi Miettinen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).