[systemd-devel] How to correctly use memory controls (MemoryLow) on unified hierarchy system?
Andrei Borzenkov
2021-05-21 12:25:05 UTC
systemd offers MemoryLow= for individual units. It sets the memory.low
cgroup attribute, so that part is fine. The problem is that, according to
the kernel documentation, memory.low is limited by the value set in the
parent cgroup, and all parent cgroups have memory.low=0:

/sys/fs/cgroup/user.slice/user-1001.slice/***@1001.service/gnome-shell-wayland.service/memory.low:536870912
/sys/fs/cgroup/user.slice/user-1001.slice/***@1001.service/memory.low:0
/sys/fs/cgroup/user.slice/user-1001.slice/memory.low:0
/sys/fs/cgroup/user.slice/memory.low:0

which implies that the setting on the leaf cgroup has no effect.

Is it necessary to explicitly set it on every ancestor? There is no
clarification in the systemd documentation, and the value is applied
without any warning.
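
A minimal sketch of how the listing above can be reproduced for any cgroup, assuming a cgroup2 hierarchy mounted at /sys/fs/cgroup. The function names are illustrative and not part of systemd or the kernel. The root cgroup is skipped because it has no memory.low file.

```python
from pathlib import Path, PurePosixPath


def memory_low_files(cgroup_dir, root="/sys/fs/cgroup"):
    """Return the memory.low path of cgroup_dir and of each ancestor below root.

    The list is ordered from the leaf upwards, like the listing in the post.
    """
    root_path = PurePosixPath(root)
    current = PurePosixPath(cgroup_dir)
    if root_path not in current.parents:
        raise ValueError(f"{cgroup_dir} is not below {root}")
    files = []
    while current != root_path:
        files.append(str(current / "memory.low"))
        current = current.parent
    return files


def read_values(files):
    """Read each file and return {path: raw text}; needs a real cgroup2 mount.

    The raw text is kept as a string because memory.low may contain "max".
    """
    return {path: Path(path).read_text().strip() for path in files}
```

Any level that reports 0 in the result is a candidate for the problem described above.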
Benjamin Berg
2021-05-21 14:07:24 UTC
Hi,

On Fri, 2021-05-21 at 15:25 +0300, Andrei Borzenkov wrote:
> systemd offers MemoryLow= for individual units. It sets the
> memory.low cgroup attribute, so that part is fine. The problem is that,
> according to the kernel documentation, memory.low is limited by the
> value set in the parent cgroup, and all parent cgroups have memory.low=0:
>
> /sys/fs/cgroup/user.slice/user-1001.slice/***@1001.service/gnome-shell-wayland.service/memory.low:536870912
> /sys/fs/cgroup/user.slice/user-1001.slice/***@1001.service/memory.low:0
> /sys/fs/cgroup/user.slice/user-1001.slice/memory.low:0
> /sys/fs/cgroup/user.slice/memory.low:0
>
> which implies that the setting on the leaf cgroup has no effect.
>
> Is it necessary to explicitly set it on every ancestor? There is no
> clarification in the systemd documentation, and the value is applied
> without any warning.

Yes, you need to set it on all ancestors, and the documentation
mentions this:

"""
For a protection to be effective, it is generally required to
set a corresponding allocation on all ancestors, which is
then distributed between children (with the exception of the
root slice). Any MemoryMin= or MemoryLow= allocation that is
not explicitly distributed to specific children is used to
create a shared protection for all children. As this is a
shared protection, the children will freely compete for the
memory.
"""

Depending on the kernel version, there may be other caveats:

"""
Units may have their children use a default "memory.min" or
"memory.low" value by specifying DefaultMemoryMin= or
DefaultMemoryLow=, which has the same semantics as MemoryMin=
and MemoryLow=. This setting does not affect "memory.min" or
"memory.low" in the unit itself. Using it to set a default
child allocation is only useful on kernels older than 5.7,
which do not support the "memory_recursiveprot" cgroup2 mount
option.
"""
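
Whether the "memory_recursiveprot" option mentioned in the quote is active can be read from the mount options of the cgroup2 filesystem. The following is only a sketch: it assumes the usual /proc/mounts field layout (device, mount point, filesystem type, options, ...), and the helper names are made up for illustration.

```python
def cgroup2_mount_options(mounts_text):
    """Return the option set of the first cgroup2 mount found, or None."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass.
        if len(fields) >= 4 and fields[2] == "cgroup2":
            return set(fields[3].split(","))
    return None


def has_recursiveprot(mounts_text):
    """True if a cgroup2 mount lists the memory_recursiveprot option."""
    options = cgroup2_mount_options(mounts_text)
    return options is not None and "memory_recursiveprot" in options


# Typical use on a running system:
#     with open("/proc/mounts") as f:
#         print(has_recursiveprot(f.read()))
```

If this returns False, the DefaultMemoryMin=/DefaultMemoryLow= settings described in the quote are the relevant mechanism.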

You need to configure it correctly in various locations. Personally, I
would suggest taking a look at uresourced[1]. It correctly sets a
configurable memory protection, enables some other cgroup features, and
tracks the currently active user. Fedora ships it by default, and
it appears to work well there.

Benjamin

[1] https://gitlab.freedesktop.org/benzea/uresourced and
https://lwn.net/Articles/829567/
Andrei Borzenkov
2021-05-21 17:14:03 UTC
On 21.05.2021 17:07, Benjamin Berg wrote:
> Hi,
>
> On Fri, 2021-05-21 at 15:25 +0300, Andrei Borzenkov wrote:
>> systemd offers MemoryLow= for individual units. It sets the
>> memory.low cgroup attribute, so that part is fine. The problem is that,
>> according to the kernel documentation, memory.low is limited by the
>> value set in the parent cgroup, and all parent cgroups have memory.low=0:
>>
>> /sys/fs/cgroup/user.slice/user-1001.slice/***@1001.service/gnome-shell-wayland.service/memory.low:536870912
>> /sys/fs/cgroup/user.slice/user-1001.slice/***@1001.service/memory.low:0
>> /sys/fs/cgroup/user.slice/user-1001.slice/memory.low:0
>> /sys/fs/cgroup/user.slice/memory.low:0
>>
>> which implies that the setting on the leaf cgroup has no effect.
>>
>> Is it necessary to explicitly set it on every ancestor? There is no
>> clarification in the systemd documentation, and the value is applied
>> without any warning.
>
> Yes, you need to set it on all ancestors, and the documentation
> mentions this:
>
> """
> For a protection to be effective, it is generally required to
> set a corresponding allocation on all ancestors, which is
> then distributed between children (with the exception of the
> root slice). Any MemoryMin= or MemoryLow= allocation that is
> not explicitly distributed to specific children is used to
> create a shared protection for all children. As this is a
> shared protection, the children will freely compete for the
> memory.
> """
>

OK, it is in the upstream documentation now. It was not in my version,
and I did not check the web page. Sorry.

I guess I expected systemd to somehow handle it, given that it knows all
the settings, knows the exact hierarchy, and is the sole master of the
cgroup tree.

> Depending on the kernel versions there may be some other caveats:
>
> """
> Units may have their children use a default "memory.min" or
> "memory.low" value by specifying DefaultMemoryMin= or
> DefaultMemoryLow=, which has the same semantics as MemoryMin=
> and MemoryLow=. This setting does not affect "memory.min" or
> "memory.low" in the unit itself. Using it to set a default
> child allocation is only useful on kernels older than 5.7,
> which do not support the "memory_recursiveprot" cgroup2 mount
> option.
> """
>
> You need to configure it correctly in various locations. Personally, I
> would suggest taking a look at uresourced[1]. It will correctly set a
> configurable memory protection, enables some other cgroup features and
> tracks the currently active user. Fedora is shipping it by default and
> it appears to work well there.
>

That's overkill for my purposes. This is a single-user system, and all I
am trying to do is prevent the Wayland compositor from being swapped out,
so that I do not have to wait several minutes for the screen to unblank.
I am fine with setting the values once.

Thanks for the pointers.

> Benjamin
>
> [1] https://gitlab.freedesktop.org/benzea/uresourced and
> https://lwn.net/Articles/829567/
>
Benjamin Berg
2021-05-22 10:28:01 UTC
On Fri, 2021-05-21 at 20:14 +0300, Andrei Borzenkov wrote:
> On 21.05.2021 17:07, Benjamin Berg wrote:
> > [SNIP]
> > Yes, you need to set it on all ancestors, and the documentation
> > mentions this:
> >
> > """
> > For a protection to be effective, it is generally required to
> > set a corresponding allocation on all ancestors, which is
> > then distributed between children (with the exception of the
> > root slice). Any MemoryMin= or MemoryLow= allocation that is
> > not explicitly distributed to specific children is used to
> > create a shared protection for all children. As this is a
> > shared protection, the children will freely compete for the
> > memory.
> > """
> >
>
> OK, it is in the upstream documentation now. It was not in my version,
> and I did not check the web page. Sorry.

Ah, true, they were updated not too long ago.

> I guess I expected systemd to somehow handle it, given that it knows
> all the settings, knows the exact hierarchy, and is the sole master of
> the cgroup tree.

I think it is a bit of a conundrum. Automatic handling would be neat,
but it also does not make sense for protections further up in the
hierarchy to increase indefinitely.
A somewhat ugly corner case is the automatically created slice units
for template services. Here the user will need to explicitly configure
a sane limit on the parent slice unit.

Benjamin
Michal Koutný
2021-05-26 17:54:26 UTC
On Fri, May 21, 2021 at 08:14:03PM +0300, Andrei Borzenkov <***@gmail.com> wrote:
> That's overkill for my purposes. This is a single-user system, and all I
> am trying to do is prevent the Wayland compositor from being swapped out,
> so that I do not have to wait several minutes for the screen to unblank.
> I am fine with setting the values once.

system.slice:MemoryLow=A
foo.service :MemoryLow=B // e.g. the compositor

A < B
- you get protection of A bytes against global reclaim
- specifically A = 0 turns protection off

A > B
- you get protection of >=B bytes against global reclaim for foo.service
- (A-B) bytes are spread among all children of system.slice (with
  memory_recursiveprot)
- specifically B = 0 means foo.service shares the protection with all
  other services, it's not prioritized

Then there's a third relevant value, C: the typical working-set size of
foo.service. You may get away with B < C.
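
The A/B cases above can be restated as a small calculation. This is only a simplified model of the rules as written in this message, not the kernel's actual reclaim algorithm: it assumes foo.service is the only child of system.slice with an explicit MemoryLow=, and the function name is made up for illustration.

```python
MIB = 1024 * 1024


def split_protection(a, b, recursiveprot=True):
    """Model the A/B cases for a parent slice (A) and one service (B).

    a: MemoryLow= of the parent slice, in bytes.
    b: MemoryLow= of the service, in bytes.
    Returns (dedicated, shared): bytes protected for the service itself,
    and bytes left over that all children of the slice compete for.
    """
    # The service can never be protected beyond what the parent grants.
    dedicated = min(a, b)
    # Unclaimed parent protection is only passed down to the children
    # when the cgroup2 mount uses memory_recursiveprot.
    shared = max(a - b, 0) if recursiveprot else 0
    return dedicated, shared


# A = 0: the 512 MiB set on the service has no effect.
print(split_protection(0, 512 * MIB))
# A = 1 GiB, B = 512 MiB: 512 MiB dedicated, 512 MiB shared by all children.
print(split_protection(1024 * MIB, 512 * MIB))
```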

Certainly, you will need to experiment with this to determine good
values that fit your setup.
(I'm not familiar with Wayland but if it critically depends on some
other services, you may need to protect them too.)


GL;HF,
Michal
Michal Koutný
2021-05-26 17:54:18 UTC
On Fri, May 21, 2021 at 03:25:05PM +0300, Andrei Borzenkov <***@gmail.com> wrote:
> Is it necessary to explicitly set it on every ancestor?
It depends on which kind of reclaim you want to be protected against.

Global memory reclaim (running out of RAM) -> set it on every ancestor.
Cgroup memory reclaim (hitting the memory limit of an ancestor cgroup G) ->
set it only on the ancestors up to and including G's children.

It's explained, with a picture, in a patch that was not merged [1].

The typical case is the former and therefore typically you set
protection on all ancestors.
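
As a rough sketch of the two rules above, the cgroups that need a protection value can be listed from a leaf path given relative to the cgroup root. The function and the unit names in the usage lines are hypothetical and only illustrate the selection logic.

```python
def cgroups_to_protect(leaf, limited_ancestor=None):
    """List the cgroups on which memory.low should be set for `leaf`.

    leaf: cgroup path relative to the cgroup root, e.g. "a.slice/b.service".
    limited_ancestor: None for global reclaim, or the relative path of the
    ancestor G whose memory limit triggers the reclaim.
    The result is ordered from the top of the hierarchy down to the leaf.
    """
    parts = [p for p in leaf.split("/") if p]
    chain = ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    if limited_ancestor is None:
        # Global reclaim: every ancestor and the leaf itself.
        return chain
    g = limited_ancestor.strip("/")
    if g not in chain[:-1]:
        raise ValueError(f"{limited_ancestor} is not an ancestor of {leaf}")
    # Cgroup reclaim at G: only the levels strictly below G.
    return chain[chain.index(g) + 1:]


# Hypothetical hierarchy, for illustration only.
print(cgroups_to_protect("top.slice/mid.slice/app.service"))
print(cgroups_to_protect("top.slice/mid.slice/app.service", "top.slice"))
```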

Michal

[1] https://lore.kernel.org/lkml/20200729140537.13345-2-***@suse.com/