Discussion:
systemd-logind bug in systemd v35
(too old to reply)
Albert Strasheim
2011-10-04 13:40:25 UTC
Permalink
Hello all

I think we've found a bug in systemd-logind with systemd v35
(specifically, systemd-35-1.fc16.x86_64.rpm).

After a few thousand ssh logins, we start seeing stuff like this in
/var/log/secure:

Oct 4 13:34:29 next-10-221-0-2 sshd[568]: pam_systemd(sshd:session):
Failed to create session: Did not receive a reply. Possible causes
include: the remote application did not send a reply, the message bus
security policy blocked the reply, the reply timeout expired, or the
network connection was broken.
Oct 4 13:34:29 next-10-221-0-2 sshd[568]: pam_unix(sshd:session):
session opened for user root by (uid=0)
Oct 4 13:34:30 next-10-221-0-2 sshd[621]: pam_systemd(sshd:session):
Failed to create session: Did not receive a reply. Possible causes
include: the remote application did not send a reply, the message bus
security policy blocked the reply, the reply timeout expired, or the
network connection was broken.
Oct 4 13:34:30 next-10-221-0-2 sshd[621]: pam_unix(sshd:session):
session opened for user root by (uid=0)
Oct 4 13:34:31 next-10-221-0-2 sshd[1451]: Accepted password for root
from 10.221.255.254 port 41899 ssh2

and strace on systemd-logind produces this:

stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
sendmsg(3, {msg_name(0)=NULL, msg_iov(5)=[{"<27>", 4}, {"Oct 4
13:37:35 ", 16}, {"systemd-logind", 14}, {"[2169]: ", 8}, {"Failed to
save user data for root: Too many open files", 54}], msg_controllen=0,
msg_flags=0}, MSG_NOSIGNAL) = 96
sendmsg(8, {msg_name(0)=NULL, msg_iov(2)=[{"l\4\1\1=\0\0\0?\333
\3h\0\0\0\1\1o\0\27\0\0\0/org/freedesktop/login1\0\2\1s\0\36\0\0\0org.freedesktop.login1.Manager\0\0\3\1s\0\n\0\0\0SessionNew\0\0\0\0\0\0\10\1g\0\2so\0",
120}, {"\10\0\0\0c6239507\0\0\0\0(\0\0\0/org/freedesktop/login1/session/c6239507\0",
61}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 181
sendmsg(8, {msg_name(0)=NULL, msg_iov(2)=[{"l\4\1\0019\0\0\0@\333
\3~\0\0\0\1\1o\0\36\0\0\0/org/freedesktop/login1/user/0\0\0\2\1s\0\37\0\0\0org.freedesktop.DBus.Properties\0\3\1s\0\21\0\0\0PropertiesChanged\0\0\0\0\0\0\0\10\1g\0\10sa{sv}as\0\0\0",
144}, {"\33\0\0\0org.freedesktop.login1.User\0\0\0\0\0\0\0\0\0\r\0\0\0\10\0\0\0Sessions\0",
57}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 201
fcntl(1023, F_DUPFD_CLOEXEC, 3) = -1 EMFILE (Too many open files)
close(1023) = 0

Looks like systemd-logind isn't closing all its file descriptors under
some circumstances.

Regards

Albert
Lennart Poettering
2011-10-04 23:40:51 UTC
Permalink
Post by Albert Strasheim
Hello all
I think we've found a bug in systemd-logind with systemd v35
(specifically, systemd-35-1.fc16.x86_64.rpm).
After a few thousand ssh logins, we start seeing stuff like this in
After a few thousand parallel ssh logins or one after the other?

This really looks as if we are leaking file descriptors in some way.

What's the contents of /proc/$(pidof systemd-logind)/fd/ if this
happens?

What is the contents of /run/systemd/sessions/?

Lennart
--
Lennart Poettering - Red Hat, Inc.
Albert Strasheim
2011-10-05 08:56:58 UTC
Permalink
Hello

On Wed, Oct 5, 2011 at 1:40 AM, Lennart Poettering
Post by Lennart Poettering
Post by Albert Strasheim
Hello all
I think we've found a bug in systemd-logind with systemd v35
(specifically, systemd-35-1.fc16.x86_64.rpm).
After a few thousand ssh logins, we start seeing stuff like this in
After a few thousand parallel ssh logins or one after the other?
This really looks as if we are leaking file descriptors in some way.
What's the contents of /proc/$(pidof systemd-logind)/fd/ if this
happens?
There's 1023 entries in that directory.
Post by Lennart Poettering
What is the contents of /run/systemd/sessions/?
There's about 2000 files in that directory. Filenames with numbers and
the same number.ref.

This is even when no-one is logged into the system.

Regards

Albert
Lennart Poettering
2011-10-05 12:48:24 UTC
Permalink
Post by Albert Strasheim
Hello
On Wed, Oct 5, 2011 at 1:40 AM, Lennart Poettering
Post by Lennart Poettering
Post by Albert Strasheim
Hello all
I think we've found a bug in systemd-logind with systemd v35
(specifically, systemd-35-1.fc16.x86_64.rpm).
After a few thousand ssh logins, we start seeing stuff like this in
After a few thousand parallel ssh logins or one after the other?
This really looks as if we are leaking file descriptors in some way.
What's the contents of /proc/$(pidof systemd-logind)/fd/ if this
happens?
There's 1023 entries in that directory.
Post by Lennart Poettering
What is the contents of /run/systemd/sessions/?
There's about 2000 files in that directory. Filenames with numbers and
the same number.ref.
This is even when no-one is logged into the system.
That looks as if somehow the sessions are leaked there. Can you do a
fuser -v on the .ref fifos as root? That should tell us if those fds are
still referenced.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Albert Strasheim
2011-10-05 13:34:20 UTC
Permalink
On Wed, Oct 5, 2011 at 2:48 PM, Lennart Poettering
Post by Lennart Poettering
Post by Albert Strasheim
On Wed, Oct 5, 2011 at 1:40 AM, Lennart Poettering
Post by Lennart Poettering
What is the contents of /run/systemd/sessions/?
There's about 2000 files in that directory. Filenames with numbers and
the same number.ref.
This is even when no-one is logged into the system.
That looks as if somehow the sessions are leaked there. Can you do a
fuser -v on the .ref fifos as root? That should tell us if those fds are
still referenced.
# fuser -v *.ref
USER PID ACCESS COMMAND
1310.ref: root 23311 F.... systemd-logind
1368.ref: root 23311 f.... systemd-logind
1369.ref: root 23311 f.... systemd-logind
1370.ref: root 23311 F.... systemd-logind
1371.ref: root 23311 f.... systemd-logind
1372.ref: root 23311 f.... systemd-logind
1373.ref: root 23311 f.... systemd-logind
1374.ref: root 23311 f.... systemd-logind
1375.ref: root 23311 f.... systemd-logind
1376.ref: root 23311 f.... systemd-logind
1377.ref: root 23311 f.... systemd-logind
1378.ref: root 23311 f.... systemd-logind
1379.ref: root 23311 F.... systemd-logind
1380.ref: root 23311 F.... systemd-logind

etc.

Regards

Albert
Lennart Poettering
2011-10-05 13:40:54 UTC
Permalink
Post by Albert Strasheim
Post by Lennart Poettering
That looks as if somehow the sessions are leaked there. Can you do a
fuser -v on the .ref fifos as root? That should tell us if those fds are
still referenced.
# fuser -v *.ref
USER PID ACCESS COMMAND
1310.ref: root 23311 F.... systemd-logind
1368.ref: root 23311 f.... systemd-logind
1369.ref: root 23311 f.... systemd-logind
1370.ref: root 23311 F.... systemd-logind
1371.ref: root 23311 f.... systemd-logind
1372.ref: root 23311 f.... systemd-logind
1373.ref: root 23311 f.... systemd-logind
1374.ref: root 23311 f.... systemd-logind
1375.ref: root 23311 f.... systemd-logind
1376.ref: root 23311 f.... systemd-logind
1377.ref: root 23311 f.... systemd-logind
1378.ref: root 23311 f.... systemd-logind
1379.ref: root 23311 F.... systemd-logind
1380.ref: root 23311 F.... systemd-logind
etc.
Hmm, that suggests that systemd-logind never received an EOF on those
fds, which is really weird.

Or actually it shows with F that it has those fds open for writing which
should never happen. Weird. I'll try to reproduce this here.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Albert Strasheim
2011-10-05 09:04:32 UTC
Permalink
Hello

On Wed, Oct 5, 2011 at 1:40 AM, Lennart Poettering
Post by Lennart Poettering
Post by Albert Strasheim
Hello all
I think we've found a bug in systemd-logind with systemd v35
(specifically, systemd-35-1.fc16.x86_64.rpm).
After a few thousand ssh logins, we start seeing stuff like this in
After a few thousand parallel ssh logins or one after the other?
Forgot to answer this one: this is only with a few parallel logins,
but many logins over a period of hours.

Which raises the question: what is systemd-logind going to do when
more than 1024 users log in to a machine at the same time?

Which raises the more general question: what will other services like
systemd-stdout-syslog-bridge do when the number of processes that log
go into the hundreds? I haven't thought about it carefully, but there
seems to be some cases where one could run into the default limit of
1024 open files.

Regards

Albert
Lennart Poettering
2011-10-05 12:50:13 UTC
Permalink
Post by Albert Strasheim
Hello
On Wed, Oct 5, 2011 at 1:40 AM, Lennart Poettering
Post by Lennart Poettering
Post by Albert Strasheim
Hello all
I think we've found a bug in systemd-logind with systemd v35
(specifically, systemd-35-1.fc16.x86_64.rpm).
After a few thousand ssh logins, we start seeing stuff like this in
After a few thousand parallel ssh logins or one after the other?
Forgot to answer this one: this is only with a few parallel logins,
but many logins over a period of hours.
Which raises the question: what is systemd-logind going to do when
more than 1024 users log in to a machine at the same time?
The admin should probably increase the resource limit for it. Maybe we
should do that by default.
Post by Albert Strasheim
Which raises the more general question: what will other services like
systemd-stdout-syslog-bridge do when the number of processes that log
go into the hundreds? I haven't thought about it carefully, but there
seems to be some cases where one could run into the default limit of
1024 open files.
Well, I guess every software has that problem. And the answer is always
the same: increase the limit.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Loading...