Discussion:
journald reliability
(too old to reply)
Zbigniew Jędrzejewski-Szmek
2012-10-18 14:10:23 UTC
Permalink
Hi,

I've been playing with journald and journalctl, and have a few
thoughts about making the logs more reliable.

First problem: it would be nice to tie more messages from systemd to
units, i.e. use more structured logging. I'll follow up with a patch
which does that, and I'm wondering if this is the right approach.

Unfortunately, currently not all messages can be usefully attached by
journald to processes, because the system unit information is lost
after the last process exits. This sucks, because sometimes the most
important messages are the last ones. E.g.
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=665782 is about a
process failing to start. 'systemctl status' will probably not be able
to show this message, because by the time journald tries to query
cgroup information, it is too late.

But systemd has/had this information and could be queried for it. So I
propose to add a cache of recently dead PIDs in systemd, that could be
queried by journald. I'm not sure about the details, but probably
keeping the information for a few seconds should be enough. This would
lead to more reliable logging for messages close to the end of the
program.

Second problem: some messages are lost because of the small depth of
the unix datagram queue length. By default net.unix.max_dgram_qlen
appears to be 10, which is really really low. Messages get dropped
silently once this queue is full. I think that the goal should be
almost never loose messages. In extreme cases, if the logging system
if full or broken, messages must be lost, but otherwise this should
not happen. Otherwise things become very confusing for the user:
didn't this service start without giving any reason, or were just the
messages dropped?

Overflowing the queue can also be done on purpose by a rogue program:
send enough messages and break logging for legitimate processes.
Journald does rate-limiting, but only after receiving the messages,
so this does not help.

TODO contains an entry for this, suggesting (a) bumping sysctl,
(b) adding sockopt.

(a) would certainly help, but has the downside that this might
negatively influence other daemons. Also, it does not really solve the
problem, because if journald is not keeping up, the queue will
overflow anyway.

(b) would be nice, but will take time to get the kernel modified. And
again, increasing the queue depth can only be a part of a solution.

As an alternative to in-kernel queuing, I tried to implement a message
queue in the application. Messages are queued when the socket is busy,
and are resent using sendmmsg on next log statement. This adds
queueing that is under control of the application and helps if the
socket is _temporarily_ overloaded. To provide reliable logging
however, the application must block (or wait) at some point, until the
socket is free again. I'll post the patch as RFC, I'm not sure if it
is really useful.

I guess that some applications could simply block, but in case of e.g.
systemd itself it is not possible. journald should process the
messages faster. Should journald run reniced to higher priority? I
think that the problem is worse with journald than with syslog,
because journald tries to do more things: parse the message, query
cgroup information, forward the message to syslog, store it on disk.
syslog only does the last one, so necessarily must be more efficient.

Also, it seems that journald could be more efficient too: running
test-log (which queues sends two messages to joural before quitting)
show the following activity by journald:

epoll_wait(7, {?} 0x7fff4348ad40, 1, -1) = 1
ioctl(4, FIONREAD, [164]) = 0
recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"PRIORITY=6\nSYSLOG_FACILITY=3\nCOD"..., 2315}], msg_controllen=136, {cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */, ...}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 164
open("/etc/localtime", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
sendmsg(5, {msg_name(29)={sa_family=AF_FILE, sun_path="/run/systemd/journal/syslog"}, msg_iov(5)=[{"<30>", 4}, {"Oct 18 11:42:56 ", 16}, {"lt-test-log", 11}, {"[19981]: ", 9}, {"Foobar PID=19981", 16}], msg_controllen=28, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=19981, uid=500, gid=500}}, msg_flags=0}, MSG_NOSIGNAL) = -1 ESRCH (No such process)
sendmsg(5, {msg_name(29)={sa_family=AF_FILE, sun_path="/run/systemd/journal/syslog"}, msg_iov(5)=[{"<30>", 4}, {"Oct 18 11:42:56 ", 16}, {"lt-test-log", 11}, {"[19981]: ", 9}, {"Foobar PID=19981", 16}], msg_controllen=28, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=350, uid=500, gid=500}}, msg_flags=0}, MSG_NOSIGNAL) = 56
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/comm", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
readlink("/proc/19981/exe", 0x1008e80, 99) = -1 ENOENT (No such file or directory)
open("/proc/19981/cmdline", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
capget({_LINUX_CAPABILITY_VERSION_3, 0}, NULL) = 0
capget({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, 0}) = 0
open("/proc/19981/sessionid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
capget({_LINUX_CAPABILITY_VERSION_3, 0}, NULL) = 0
capget({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, 0}) = 0
open("/proc/19981/loginuid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/500/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
uname({sys="Linux", node="fedora-15", ...}) = 0
clock_gettime(CLOCK_REALTIME, {1350560576, 588798333}) = 0
clock_gettime(CLOCK_MONOTONIC, {184373, 816089357}) = 0
ftruncate(11, 29777920) = 0
ioctl(4, FIONREAD, [237]) = 0
recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"PRIORITY=6\nSYSLOG_FACILITY=3\nCOD"..., 2315}], msg_controllen=136, {cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */, ...}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 237
open("/etc/localtime", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
sendmsg(5, {msg_name(29)={sa_family=AF_FILE, sun_path="/run/systemd/journal/syslog"}, msg_iov(5)=[{"<30>", 4}, {"Oct 18 11:42:56 ", 16}, {"lt-test-log", 11}, {"[19981]: ", 9}, {"Foobar PID=19981", 16}], msg_controllen=28, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=19981, uid=500, gid=500}}, msg_flags=0}, MSG_NOSIGNAL) = -1 ESRCH (No such process)
sendmsg(5, {msg_name(29)={sa_family=AF_FILE, sun_path="/run/systemd/journal/syslog"}, msg_iov(5)=[{"<30>", 4}, {"Oct 18 11:42:56 ", 16}, {"lt-test-log", 11}, {"[19981]: ", 9}, {"Foobar PID=19981", 16}], msg_controllen=28, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=350, uid=500, gid=500}}, msg_flags=0}, MSG_NOSIGNAL) = 56
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/comm", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
readlink("/proc/19981/exe", 0x1008e80, 99) = -1 ENOENT (No such file or directory)
open("/proc/19981/cmdline", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
capget({_LINUX_CAPABILITY_VERSION_3, 0}, NULL) = 0
capget({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, 0}) = 0
open("/proc/19981/sessionid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
capget({_LINUX_CAPABILITY_VERSION_3, 0}, NULL) = 0
capget({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, 0}) = 0
open("/proc/19981/loginuid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/500/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
uname({sys="Linux", node="fedora-15", ...}) = 0
clock_gettime(CLOCK_REALTIME, {1350560576, 594694729}) = 0
clock_gettime(CLOCK_MONOTONIC, {184373, 821970285}) = 0
ftruncate(11, 29777920) = 0
ioctl(4, FIONREAD, [0]) = 0
recvmsg(4, 0x7fff4348ab60, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(7,

It seems that since the first sendmsg returning -ESRCH journald can
expect the the process it is trying to query is gone, and there is no
need to check /proc/19981/cgroup, /proc/19981/comm, etc. Would be
nice to short-circuit this. Is there a legitimate reason why some of
those files in /proc would be available, but not others?

Zbyszek
Zbigniew Jędrzejewski-Szmek
2012-10-18 14:12:11 UTC
Permalink
---
In preparation for the following patches...

src/core/execute.c | 42 ++++++++++++------------------------------
1 file changed, 12 insertions(+), 30 deletions(-)

diff --git a/src/core/execute.c b/src/core/execute.c
index cb703cb..5faeee8 100644
--- a/src/core/execute.c
+++ b/src/core/execute.c
@@ -999,7 +999,7 @@ int exec_spawn(ExecCommand *command,
int r;
char *line;
int socket_fd;
- char **files_env = NULL;
+ char _cleanup_strv_free_ **files_env = NULL;

assert(command);
assert(context);
@@ -1020,7 +1020,8 @@ int exec_spawn(ExecCommand *command,
} else
socket_fd = -1;

- if ((r = exec_context_load_environment(context, &files_env)) < 0) {
+ r = exec_context_load_environment(context, &files_env);
+ if (r < 0) {
log_error("Failed to load environment files: %s", strerror(-r));
return r;
}
@@ -1029,24 +1030,21 @@ int exec_spawn(ExecCommand *command,
argv = command->argv;

line = exec_command_line(argv);
- if (!line) {
- r = -ENOMEM;
- goto fail_parent;
- }
+ if (!line)
+ return log_oom();

log_debug("About to execute: %s", line);
free(line);

r = cgroup_bonding_realize_list(cgroup_bondings);
if (r < 0)
- goto fail_parent;
+ return r;

cgroup_attribute_apply_list(cgroup_attributes, cgroup_bondings);

- if ((pid = fork()) < 0) {
- r = -errno;
- goto fail_parent;
- }
+ pid = fork();
+ if (pid < 0)
+ return -errno;

if (pid == 0) {
int i, err;
@@ -1054,7 +1052,8 @@ int exec_spawn(ExecCommand *command,
const char *username = NULL, *home = NULL;
uid_t uid = (uid_t) -1;
gid_t gid = (gid_t) -1;
- char **our_env = NULL, **pam_env = NULL, **final_env = NULL, **final_argv = NULL;
+ char _cleanup_strv_free_ **our_env = NULL, **pam_env = NULL,
+ **final_env = NULL, **final_argv = NULL;
unsigned n_env = 0;
bool set_access = false;

@@ -1331,8 +1330,7 @@ int exec_spawn(ExecCommand *command,
goto fail_child;
}
} else {
-
- char *d;
+ char _cleanup_free_ *d = NULL;

if (asprintf(&d, "%s/%s",
context->root_directory ? context->root_directory : "",
@@ -1344,12 +1342,9 @@ int exec_spawn(ExecCommand *command,

if (chdir(d) < 0) {
err = -errno;
- free(d);
r = EXIT_CHDIR;
goto fail_child;
}
-
- free(d);
}

/* We repeat the fd closing here, to make sure that
@@ -1500,17 +1495,9 @@ int exec_spawn(ExecCommand *command,
command->path, strerror(-err));
}

- strv_free(our_env);
- strv_free(final_env);
- strv_free(pam_env);
- strv_free(files_env);
- strv_free(final_argv);
-
_exit(r);
}

- strv_free(files_env);
-
/* We add the new process to the cgroup both in the child (so
* that we can be sure that no user code is ever executed
* outside of the cgroup) and in the parent (so that we can be
@@ -1525,11 +1512,6 @@ int exec_spawn(ExecCommand *command,

*ret = pid;
return 0;
-
-fail_parent:
- strv_free(files_env);
-
- return r;
}

void exec_context_init(ExecContext *c) {
--
1.7.12.rc1.173.g9b27acc
Zbigniew Jędrzejewski-Szmek
2012-10-18 14:12:12 UTC
Permalink
Information which unit a log entry pertains to enables systemctl
status to display more log messages.
---
src/core/execute.c | 32 +++++++++++++++++++++++++-------
src/core/selinux-access.c | 2 +-
src/core/service.c | 19 ++++++++++++++++---
src/core/unit.c | 5 ++++-
src/systemd/sd-messages.h | 2 ++
5 files changed, 48 insertions(+), 12 deletions(-)

diff --git a/src/core/execute.c b/src/core/execute.c
index 5faeee8..aa3a587 100644
--- a/src/core/execute.c
+++ b/src/core/execute.c
@@ -62,6 +62,8 @@
#include "loopback-setup.h"
#include "path-util.h"
#include "syscall-list.h"
+#include "sd-id128.h"
+#include "sd-messages.h"

#define IDLE_TIMEOUT_USEC (5*USEC_PER_SEC)

@@ -1022,7 +1024,11 @@ int exec_spawn(ExecCommand *command,

r = exec_context_load_environment(context, &files_env);
if (r < 0) {
- log_error("Failed to load environment files: %s", strerror(-r));
+ log_struct(LOG_ERR,
+ "UNIT=%s", unit_id,
+ "MESSAGE=Failed to load environment files: %s", strerror(-r),
+ "ERRNO=%d", errno,
+ NULL);
return r;
}

@@ -1033,7 +1039,10 @@ int exec_spawn(ExecCommand *command,
if (!line)
return log_oom();

- log_debug("About to execute: %s", line);
+ log_struct(LOG_DEBUG,
+ "UNIT=%s", unit_id,
+ "MESSAGE=About to execute %s", line,
+ NULL);
free(line);

r = cgroup_bonding_realize_list(cgroup_bondings);
@@ -1490,14 +1499,25 @@ int exec_spawn(ExecCommand *command,
fail_child:
if (r != 0) {
log_open();
- log_warning("Failed at step %s spawning %s: %s",
- exit_status_to_string(r, EXIT_STATUS_SYSTEMD),
- command->path, strerror(-err));
+ log_struct(LOG_ERR, MESSAGE_ID(SD_MESSAGE_SPAWN_FAILED),
+ "EXECUTABLE=%s", command->path,
+ "MESSAGE=Failed at step %s spawning %s: %s",
+ exit_status_to_string(r, EXIT_STATUS_SYSTEMD),
+ command->path, strerror(-err),
+ "ERRNO=%d", -err,
+ NULL);
+ log_close();
}

_exit(r);
}

+ log_struct(LOG_DEBUG,
+ "UNIT=%s", unit_id,
+ "MESSAGE=Forked %s as %lu",
+ command->path, (unsigned long) pid,
+ NULL);
+
/* We add the new process to the cgroup both in the child (so
* that we can be sure that no user code is ever executed
* outside of the cgroup) and in the parent (so that we can be
@@ -1506,8 +1526,6 @@ int exec_spawn(ExecCommand *command,
if (cgroup_bondings)
cgroup_bonding_install_list(cgroup_bondings, pid, cgroup_suffix);

- log_debug("Forked %s as %lu", command->path, (unsigned long) pid);
-
exec_status_start(&command->exec_status, pid);

*ret = pid;
diff --git a/src/core/selinux-access.c b/src/core/selinux-access.c
index 852a328..6dfe8b4 100644
--- a/src/core/selinux-access.c
+++ b/src/core/selinux-access.c
@@ -314,7 +314,7 @@ static int get_calling_context(
if (r >= 0)
return r;

- log_error("bus_get_selinux_security_context failed %m");
+ log_error("bus_get_selinux_security_context failed: %m");
return r;
}

diff --git a/src/core/service.c b/src/core/service.c
index 5742968..2febdb6 100644
--- a/src/core/service.c
+++ b/src/core/service.c
@@ -1614,7 +1614,12 @@ static void service_set_state(Service *s, ServiceState state) {
cgroup_bonding_trim_list(UNIT(s)->cgroup_bondings, true);

if (old_state != state)
- log_debug("%s changed %s -> %s", UNIT(s)->id, service_state_to_string(old_state), service_state_to_string(state));
+ log_struct(LOG_DEBUG,
+ "UNIT=%s", UNIT(s)->id,
+ "MESSAGE=%s changed %s -> %s", UNIT(s)->id,
+ service_state_to_string(old_state),
+ service_state_to_string(state),
+ NULL);

unit_notify(UNIT(s), table[old_state], table[state], s->reload_result == SERVICE_SUCCESS);
s->reload_result = SERVICE_SUCCESS;
@@ -2948,8 +2953,16 @@ static void service_sigchld_event(Unit *u, pid_t pid, int code, int status) {
f = SERVICE_SUCCESS;
}

- log_full(f == SERVICE_SUCCESS ? LOG_DEBUG : LOG_NOTICE,
- "%s: main process exited, code=%s, status=%i", u->id, sigchld_code_to_string(code), status);
+ log_struct(f == SERVICE_SUCCESS ? LOG_DEBUG : LOG_NOTICE,
+ "MESSAGE=%s: main process exited, code=%s, status=%i/%s",
+ u->id, sigchld_code_to_string(code), status,
+ strna(code == CLD_EXITED
+ ? exit_status_to_string(status, EXIT_STATUS_FULL)
+ : signal_to_string(status)),
+ "UNIT=%s", u->id,
+ "EXIT_CODE=%s", sigchld_code_to_string(code),
+ "EXIT_STATUS=%i", status,
+ NULL);

if (f != SERVICE_SUCCESS)
s->result = f;
diff --git a/src/core/unit.c b/src/core/unit.c
index 6396bfc..99e1c27 100644
--- a/src/core/unit.c
+++ b/src/core/unit.c
@@ -1438,7 +1438,10 @@ void unit_notify(Unit *u, UnitActiveState os, UnitActiveState ns, bool reload_su
check_unneeded_dependencies(u);

if (ns != os && ns == UNIT_FAILED) {
- log_notice("Unit %s entered failed state.", u->id);
+ log_struct(LOG_NOTICE,
+ "MESSAGE=Unit %s entered failed state", u->id,
+ "UNIT=%s", u->id,
+ NULL);
unit_trigger_on_failure(u);
}
}
diff --git a/src/systemd/sd-messages.h b/src/systemd/sd-messages.h
index 698aa60..5099a36 100644
--- a/src/systemd/sd-messages.h
+++ b/src/systemd/sd-messages.h
@@ -58,6 +58,8 @@ extern "C" {
#define SD_MESSAGE_UNIT_RELOADING SD_ID128_MAKE(d3,4d,03,7f,ff,18,47,e6,ae,66,9a,37,0e,69,47,25)
#define SD_MESSAGE_UNIT_RELOADED SD_ID128_MAKE(7b,05,eb,c6,68,38,42,22,ba,a8,88,11,79,cf,da,54)

+#define SD_MESSAGE_SPAWN_FAILED SD_ID128_MAKE(64,12,57,65,1c,1b,4e,c9,a8,62,4d,7a,40,a9,e1,e7)
+
#define SD_MESSAGE_FORWARD_SYSLOG_MISSED SD_ID128_MAKE(00,27,22,9c,a0,64,41,81,a7,6c,4e,92,45,8a,fa,2e)

#ifdef __cplusplus
--
1.7.12.rc1.173.g9b27acc
Marius Tolzmann
2012-10-18 14:44:39 UTC
Permalink
Hi..
Post by Zbigniew Jędrzejewski-Szmek
Information which unit a log entry pertains to enables systemctl
status to display more log messages.
---
@@ -1022,7 +1024,11 @@ int exec_spawn(ExecCommand *command,
r = exec_context_load_environment(context, &files_env);
if (r < 0) {
- log_error("Failed to load environment files: %s", strerror(-r));
+ log_struct(LOG_ERR,
+ "UNIT=%s", unit_id,
+ "MESSAGE=Failed to load environment files: %s", strerror(-r),
+ "ERRNO=%d", errno,
shouldn't that be "ERRNO=%d", (-r), ?

I did not check the rest. It just hit my eyes 8)

Bye marius..
--
Dipl.-Inf. Marius Tolzmann <***@molgen.mpg.de>
----------------------------------.------------------------------
MPI f. molekulare Genetik |
Ihnestrasse 63-73, D-14195 Berlin | ==> MarIuX GNU/Linux <==
Phone: +49 (0)30 8413 1709 |
----------------------------------^------------------------------
God put me on earth to accomplish a certain number of things.
Right now I am so far behind..
..I will never die. <by calvin from calvin&hobbes ;)>
Zbigniew Jędrzejewski-Szmek
2012-10-18 14:47:41 UTC
Permalink
Post by Marius Tolzmann
Hi..
Post by Zbigniew Jędrzejewski-Szmek
Information which unit a log entry pertains to enables systemctl
status to display more log messages.
---
@@ -1022,7 +1024,11 @@ int exec_spawn(ExecCommand *command,
r = exec_context_load_environment(context, &files_env);
if (r < 0) {
- log_error("Failed to load environment files: %s", strerror(-r));
+ log_struct(LOG_ERR,
+ "UNIT=%s", unit_id,
+ "MESSAGE=Failed to load environment files: %s", strerror(-r),
+ "ERRNO=%d", errno,
shouldn't that be "ERRNO=%d", (-r), ?
Yeah, obviously. Thanks for catching that.

Zbyszek
Lennart Poettering
2012-10-18 17:03:58 UTC
Permalink
Post by Zbigniew Jędrzejewski-Szmek
Information which unit a log entry pertains to enables systemctl
status to display more log messages.
The first two patches really look good, please go ahead and commit.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Zbigniew Jędrzejewski-Szmek
2012-10-18 14:12:13 UTC
Permalink
Queue size for unix datagrams is set globally and can be quite small
(e.g. 10). This means that if journald is not keeping up, it is very
easy to fill up the queue and loose messages. To alleviate this
problem, messages are queued on the sender side, and pushed out
repeatedly with sendmmsg. A repeat with a sleep is used to give
journald a chance to process more messages. If log_close() is called,
a bigger delay is used, to try to flush messages.
---
Makefile.am | 7 ++
src/shared/log.c | 261 ++++++++++++++++++++++++++++++++++------------
src/test/test-log-multi.c | 58 +++++++++++
src/test/test-log.c | 4 +-
4 files changed, 264 insertions(+), 66 deletions(-)
create mode 100644 src/test/test-log-multi.c

diff --git a/Makefile.am b/Makefile.am
index 4c65076..b8fc7a1 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -1168,6 +1168,7 @@ noinst_PROGRAMS += \
test-watchdog \
test-unit-name \
test-log \
+ test-log-multi \
test-unit-file \
test-date

@@ -1239,6 +1240,12 @@ test_log_SOURCES = \
test_log_LDADD = \
libsystemd-core.la

+test_log_multi_SOURCES = \
+ src/test/test-log-multi.c
+
+test_log_multi_LDADD = \
+ libsystemd-core.la
+
test_date_SOURCES = \
src/test/test-date.c

diff --git a/src/shared/log.c b/src/shared/log.c
index 6357868..c40d8f0 100644
--- a/src/shared/log.c
+++ b/src/shared/log.c
@@ -37,6 +37,13 @@

#define SNDBUF_SIZE (8*1024*1024)

+typedef struct mmsg_info {
+ struct mmsghdr *vec;
+ int size; /* size (in mmsghdrs) of vec */
+ int offset; /* first used mmsghdr */
+ int used; /* number of available mmsghdrs */
+} mmsg_info;
+
static LogTarget log_target = LOG_TARGET_CONSOLE;
static int log_max_level = LOG_INFO;
static int log_facility = LOG_DAEMON;
@@ -45,6 +52,7 @@ static int console_fd = STDERR_FILENO;
static int syslog_fd = -1;
static int kmsg_fd = -1;
static int journal_fd = -1;
+static mmsg_info journal_mmsg_info;

static bool syslog_is_stream = false;

@@ -55,6 +63,97 @@ static bool show_location = false;
* use here. */
static char *log_abort_msg = NULL;

+#define MSG_QUEUE_LEN 1024
+
+static struct msghdr *msg_new(mmsg_info *info, int max_size) {
+ int pos;
+ struct msghdr *mh;
+
+ assert(max_size >= 1);
+ assert(info);
+
+ if (info->used == info->size) {
+ struct mmsghdr *vec;
+ int size, wrapped;
+
+ size = MAX(info->size*2, 1);
+ if (size > max_size)
+ return NULL;
+
+ vec = realloc(info->vec, sizeof(struct mmsghdr) * size);
+ if (vec == NULL)
+ return NULL;
+
+ /* move wrapped part to the end */
+ wrapped = (info->offset + info->used) - info->size;
+ if (wrapped > 0)
+ memcpy(vec + info->size, vec,
+ sizeof(struct mmsghdr) * wrapped);
+
+ info->vec = vec;
+ info->size = size;
+ }
+
+ assert(info->used < info->size);
+
+ pos = (info->offset + info->used++) % info->size;
+ zero(info->vec[pos].msg_hdr);
+ mh = &info->vec[pos].msg_hdr;
+
+ return mh;
+}
+
+static const char const* newline = "\n";
+
+static void free_iovec(struct iovec *iovec, int iovlen) {
+ int i;
+
+ for (i = 1; i < iovlen; i += 2)
+ if (iovec[i].iov_base != newline)
+ free(iovec[i].iov_base);
+ free(iovec);
+}
+
+static int msg_send(int fd, mmsg_info *info, int tries, useconds_t sleep_usec) {
+ int sent = 0;
+
+ assert(info);
+ assert(info->used <= info->size);
+ assert(info->offset < info->size || info->offset == 0);
+
+ while(info->used) {
+ struct mmsghdr *mmh = info->vec + info->offset;
+ int vlen = MIN(info->size - info->offset, info->used);
+ int j;
+
+ int r = sendmmsg(fd, mmh, vlen, MSG_NOSIGNAL);
+ if (r < 0) {
+ if (errno==EAGAIN || errno==EWOULDBLOCK) {
+ if (--tries) {
+ usleep(sleep_usec);
+ continue;
+ } else
+ break;
+ }
+
+ /* drop message */
+ info->used--;
+ info->offset = (info->offset + 1) % info->size;
+ free_iovec(mmh->msg_hdr.msg_iov,
+ mmh->msg_hdr.msg_iovlen);
+ return -errno;
+ }
+
+ sent += r;
+ info->used -= r;
+ info->offset = (info->offset + r) % info->size;
+ for (j = 0; j < r; j++, mmh++)
+ free_iovec(mmh->msg_hdr.msg_iov, mmh->msg_hdr.msg_iovlen);
+ }
+
+ return sent;
+}
+
void log_close_console(void) {

if (console_fd < 0)
@@ -174,10 +273,20 @@ fail:
}

void log_close_journal(void) {
+ struct mmsghdr *mmh;
+ int i;

if (journal_fd < 0)
return;

+ msg_send(journal_fd, &journal_mmsg_info, 100, 10 * USEC_PER_MSEC);
+
+ mmh = journal_mmsg_info.vec + journal_mmsg_info.offset;
+ for(i = 0; i < journal_mmsg_info.used; i++, mmh++)
+ free_iovec(mmh->msg_hdr.msg_iov, mmh->msg_hdr.msg_iovlen);
+ free(journal_mmsg_info.vec);
+ zero(journal_mmsg_info);
+
close_nointr_nofail(journal_fd);
journal_fd = -1;
}
@@ -199,7 +308,9 @@ static int log_open_journal(void) {
sa.un.sun_family = AF_UNIX;
strncpy(sa.un.sun_path, "/run/systemd/journal/socket", sizeof(sa.un.sun_path));

- if (connect(journal_fd, &sa.sa, offsetof(struct sockaddr_un, sun_path) + strlen(sa.un.sun_path)) < 0) {
+ r = connect(journal_fd, &sa.sa,
+ offsetof(struct sockaddr_un, sun_path) + strlen(sa.un.sun_path));
+ if (r < 0) {
r = -errno;
goto fail;
}
@@ -441,43 +552,55 @@ static int write_to_journal(
const char *func,
const char *buffer) {

- char header[LINE_MAX];
- struct iovec iovec[3];
- struct msghdr mh;
+ int r;
+ char *header = NULL, *buffercopy = NULL;
+ const int iovec_size = 3;
+ struct iovec *iovec = NULL;
+ struct msghdr *mh;

if (journal_fd < 0)
return 0;

- snprintf(header, sizeof(header),
- "PRIORITY=%i\n"
- "SYSLOG_FACILITY=%i\n"
- "CODE_FILE=%s\n"
- "CODE_LINE=%i\n"
- "CODE_FUNCTION=%s\n"
- "SYSLOG_IDENTIFIER=%s\n"
- "MESSAGE=",
- LOG_PRI(level),
- LOG_FAC(level),
- file,
- line,
- func,
- program_invocation_short_name);
-
- char_array_0(header);
+ r = asprintf(&header,
+ "PRIORITY=%i\n"
+ "SYSLOG_FACILITY=%i\n"
+ "CODE_FILE=%s\n"
+ "CODE_LINE=%i\n"
+ "CODE_FUNCTION=%s\n"
+ "SYSLOG_IDENTIFIER=%s\n"
+ "MESSAGE=",
+ LOG_PRI(level),
+ LOG_FAC(level),
+ file,
+ line,
+ func,
+ program_invocation_short_name);
+
+ iovec = calloc(sizeof(struct iovec), iovec_size);
+
+ buffercopy = strdup(buffer);
+
+ if (r < 0 || !iovec || !buffercopy) {
+ free(header);
+ free(iovec);
+ free(buffercopy);
+ return -ENOMEM;
+ }

- zero(iovec);
IOVEC_SET_STRING(iovec[0], header);
- IOVEC_SET_STRING(iovec[1], buffer);
- IOVEC_SET_STRING(iovec[2], "\n");
+ IOVEC_SET_STRING(iovec[1], buffercopy);
+ IOVEC_SET_STRING(iovec[2], newline);

- zero(mh);
- mh.msg_iov = iovec;
- mh.msg_iovlen = ELEMENTSOF(iovec);
+ mh = msg_new(&journal_mmsg_info, MSG_QUEUE_LEN);
+ if (!mh) {
+ free_iovec(iovec, iovec_size);
+ return -ENOMEM;
+ }

- if (sendmsg(journal_fd, &mh, MSG_NOSIGNAL) < 0)
- return -errno;
+ mh->msg_iov = iovec;
+ mh->msg_iovlen = iovec_size;

- return 1;
+ return msg_send(journal_fd, &journal_mmsg_info, 1, 0);
}

static int log_dispatch(
@@ -677,34 +800,41 @@ int log_struct_internal(
log_target == LOG_TARGET_JOURNAL) &&
journal_fd >= 0) {

- char header[LINE_MAX];
- struct iovec iovec[17];
- unsigned n = 0, i;
- struct msghdr mh;
- const char nl = '\n';
+ char *header = NULL;
+ struct iovec *iovec = NULL;
+ unsigned n = 0;
+ struct msghdr *mh;
+ const unsigned iovec_size = 17;

/* If the journal is available do structured logging */

- snprintf(header, sizeof(header),
- "PRIORITY=%i\n"
- "SYSLOG_FACILITY=%i\n"
- "CODE_FILE=%s\n"
- "CODE_LINE=%i\n"
- "CODE_FUNCTION=%s\n"
- "SYSLOG_IDENTIFIER=%s\n",
- LOG_PRI(level),
- LOG_FAC(level),
- file,
- line,
- func,
- program_invocation_short_name);
- char_array_0(header);
-
- zero(iovec);
+ r = asprintf(&header,
+ "PRIORITY=%i\n"
+ "SYSLOG_FACILITY=%i\n"
+ "CODE_FILE=%s\n"
+ "CODE_LINE=%i\n"
+ "CODE_FUNCTION=%s\n"
+ "SYSLOG_IDENTIFIER=%s\n",
+ LOG_PRI(level),
+ LOG_FAC(level),
+ file,
+ line,
+ func,
+ program_invocation_short_name);
+
+ iovec = calloc(sizeof(struct iovec), iovec_size);
+
+ if(r < 0 || !iovec) {
+ free(iovec);
+ free(header);
+ r = -ENOMEM;
+ goto finish;
+ }
+
IOVEC_SET_STRING(iovec[n++], header);

va_start(ap, format);
- while (format && n + 1 < ELEMENTSOF(iovec)) {
+ while (format && n + 1 < iovec_size) {
char *buf;
va_list aq;

@@ -725,27 +855,25 @@ int log_struct_internal(
VA_FORMAT_ADVANCE(format, ap);

IOVEC_SET_STRING(iovec[n++], buf);
-
- iovec[n].iov_base = (char*) &nl;
- iovec[n].iov_len = 1;
- n++;
+ IOVEC_SET_STRING(iovec[n++], newline);

format = va_arg(ap, char *);
}

- zero(mh);
- mh.msg_iov = iovec;
- mh.msg_iovlen = n;
+ mh = msg_new(&journal_mmsg_info, MSG_QUEUE_LEN);
+ if (!mh) {
+ free_iovec(iovec, n);
+ r = -ENOMEM;
+ goto finish;
+ }
+
+ mh->msg_iov = iovec;
+ mh->msg_iovlen = n;

- if (sendmsg(journal_fd, &mh, MSG_NOSIGNAL) < 0)
- r = -errno;
- else
- r = 1;
+ r = msg_send(journal_fd, &journal_mmsg_info, 1, 0);

finish:
va_end(ap);
- for (i = 1; i < n; i += 2)
- free(iovec[i].iov_base);

} else {
char buf[LINE_MAX];
@@ -780,6 +908,11 @@ int log_struct_internal(
}

errno = saved_errno;
+ if(r < 0) {
+ char buf[LINE_MAX];
+ snprintf(buf, sizeof(buf), "dispatch failed %d/%s", r, strerror(-r));
+ log_dispatch(level, file, line, func, buf);
+ }
return r;
}

diff --git a/src/test/test-log-multi.c b/src/test/test-log-multi.c
new file mode 100644
index 0000000..0f02afc
--- /dev/null
+++ b/src/test/test-log-multi.c
@@ -0,0 +1,58 @@
+/*-*- Mode: C; c-basic-offset: 8; indent-tabs-mode: nil -*-*/
+
+/***
+ This file is part of systemd.
+
+ Copyright 2012 Lennart Poettering
+
+ systemd is free software; you can redistribute it and/or modify it
+ under the terms of the GNU Lesser General Public License as published by
+ the Free Software Foundation; either version 2.1 of the License, or
+ (at your option) any later version.
+
+ systemd is distributed in the hope that it will be useful, but
+ WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public License
+ along with systemd; If not, see <http://www.gnu.org/licenses/>.
+***/
+
+#include <stddef.h>
+#include <unistd.h>
+
+#include "log.h"
+#include "util.h"
+
+#define _filler " filler filler filler filler filler filler filler filler filler"
+#define filler _filler _filler _filler _filler _filler _filler _filler _filler
+#define filler2 filler filler filler filler filler filler filler filler filler filler filler filler
+
+int main(int argc, char* argv[]) {
+ unsigned i, limit = 100;
+ int r;
+
+ switch(argc) {
+ case 1: break;
+ case 2:
+ r = safe_atou(argv[1], &limit);
+ if (!r)
+ break;
+ default:
+ log_error("wrong number or bad options");
+ return 1;
+ }
+
+ log_set_target(LOG_TARGET_JOURNAL);
+ log_open();
+
+ for(i = 0; i < limit; i++)
+ log_struct(LOG_NOTICE,
+ "MESSAGE=line %d", i,
+ "FILLER=%s", filler2,
+ NULL);
+
+ log_close();
+ return 0;
+}
diff --git a/src/test/test-log.c b/src/test/test-log.c
index 8dc3d53..fb73fc5 100644
--- a/src/test/test-log.c
+++ b/src/test/test-log.c
@@ -37,12 +37,12 @@ int main(int argc, char* argv[]) {
log_set_target(LOG_TARGET_JOURNAL);
log_open();

- log_struct(LOG_INFO,
+ log_struct(LOG_NOTICE,
"MESSAGE=Foobar PID=%lu", (unsigned long) getpid(),
"SERVICE=foobar",
NULL);

- log_struct(LOG_INFO,
+ log_struct(LOG_WARNING,
"MESSAGE=Foobar PID=%lu", (unsigned long) getpid(),
"FORMAT_STR_TEST=1=%i A=%c 2=%hi 3=%li 4=%lli 1=%p foo=%s 2.5=%g 3.5=%g 4.5=%Lg",
(int) 1, 'A', (short) 2, (long int) 3, (long long int) 4, (void*) 1, "foo", (float) 2.5f, (double) 3.5, (long double) 4.5,
--
1.7.12.rc1.173.g9b27acc
Lennart Poettering
2012-10-18 17:02:28 UTC
Permalink
Post by Zbigniew Jędrzejewski-Szmek
Hi,
I've been playing with journald and journalctl, and have a few
thoughts about making the logs more reliable.
First problem: it would be nice to tie more messages from systemd to
units, i.e. use more structured logging. I'll follow up with a patch
which does that, and I'm wondering if this is the right approach.
Yes, this is definitely desirable.
Post by Zbigniew Jędrzejewski-Szmek
Unfortunately, currently not all messages can be usefully attached by
journald to processes, because the system unit information is lost
after the last process exits. This sucks, because sometimes the most
important messages are the last ones. E.g.
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=665782 is about a
process failing to start. 'systemctl status' will probably not be able
to show this message, because by the time journald tries to query
cgroup information, it is too late.
This is a real problem.
Post by Zbigniew Jędrzejewski-Szmek
But systemd has/had this information and could be queried for it. So I
propose to add a cache of recently dead PIDs in systemd, that could be
queried by journald. I'm not sure about the details, but probably
keeping the information for a few seconds should be enough. This would
lead to more reliable logging for messages close to the end of the
program.
This wouldn't really work I fear, since systemd only really tracks the
main and control PIds of a cgroup, and there might be much more that
log.

I think the only real way to solve this cleanly is to extend the kernel
to provide SCM_CREDENTIAL and SCM_SECURITY-like auxiliary messages that
carries the information we need. More specifically I am hoping that we
can get SCM_AUDIT (for the loginuid + sessionid of a process),
SCM_EXECINFO (for exe, comm, cmdline), SCM_CGROUP (for cgroup
membership). With that we should have all data we need in a safe and
secure way.
Post by Zbigniew Jędrzejewski-Szmek
Second problem: some messages are lost because of the small depth of
the unix datagram queue length. By default net.unix.max_dgram_qlen
appears to be 10, which is really really low. Messages get dropped
silently once this queue is full.
Note that this is only true for systemd's own log.c message. For the
stuff written with sd_printv we actually block on the socket.
Post by Zbigniew Jędrzejewski-Szmek
I think that the goal should be
almost never loose messages. In extreme cases, if the logging system
if full or broken, messages must be lost, but otherwise this should
didn't this service start without giving any reason, or were just the
messages dropped?
TODO contains an entry for this, suggesting (a) bumping sysctl,
(b) adding sockopt.
This (and the SCM_xyz issue mentioned above) have been on our plumbers
wishlist for a while.
Post by Zbigniew Jędrzejewski-Szmek
(a) would certainly help, but has the downside that this might
negatively influence other daemons. Also, it does not really solve the
problem, because if journald is not keeping up, the queue will
overflow anyway.
Yeah, this is not really desirable indeed.
Post by Zbigniew Jędrzejewski-Szmek
(b) would be nice, but will take time to get the kernel modified. And
again, increasing the queue depth can only be a part of a solution.
As an alternative to in-kernel queuing, I tried to implement a message
queue in the application. Messages are queued when the socket is busy,
and are resent using sendmmsg on next log statement. This adds
queueing that is under control of the application and helps if the
socket is _temporarily_ overloaded. To provide reliable logging
however, the application must block (or wait) at some point, until the
socket is free again. I'll post the patch as RFC, I'm not sure if it
is really useful.
Hmm, but how would the app queue be that much better than an in-kernel
socket queue? Either way eventually messages need to be dropped. And the
kernel queue is maintained anyway, so I'd just make use of that.
Post by Zbigniew Jędrzejewski-Szmek
I guess that some applications could simply block, but in case of e.g.
systemd itself it is not possible. journald should process the
messages faster. Should journald run reniced to higher priority? I
think that the problem is worse with journald than with syslog,
because journald tries to do more things: parse the message, query
cgroup information, forward the message to syslog, store it on disk.
syslog only does the last one, so necessarily must be more efficient.
I remember that maemo used to have issues with syslog blocking. They
basically ran into a priority inversion problem, where RT threads ended
up waiting for the non-RT syslog. I think large qlen are a pretty good
solution for this.
Post by Zbigniew Jędrzejewski-Szmek
Also, it seems that journald could be more efficient too: running
test-log (which queues sends two messages to joural before quitting)
epoll_wait(7, {?} 0x7fff4348ad40, 1, -1) = 1
ioctl(4, FIONREAD, [164]) = 0
recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"PRIORITY=6\nSYSLOG_FACILITY=3\nCOD"..., 2315}], msg_controllen=136, {cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */, ...}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 164
open("/etc/localtime", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
sendmsg(5, {msg_name(29)={sa_family=AF_FILE, sun_path="/run/systemd/journal/syslog"}, msg_iov(5)=[{"<30>", 4}, {"Oct 18 11:42:56 ", 16}, {"lt-test-log", 11}, {"[19981]: ", 9}, {"Foobar PID=19981", 16}], msg_controllen=28, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=19981, uid=500, gid=500}}, msg_flags=0}, MSG_NOSIGNAL) = -1 ESRCH (No such process)
sendmsg(5, {msg_name(29)={sa_family=AF_FILE, sun_path="/run/systemd/journal/syslog"}, msg_iov(5)=[{"<30>", 4}, {"Oct 18 11:42:56 ", 16}, {"lt-test-log", 11}, {"[19981]: ", 9}, {"Foobar PID=19981", 16}], msg_controllen=28, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=350, uid=500, gid=500}}, msg_flags=0}, MSG_NOSIGNAL) = 56
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/comm", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
readlink("/proc/19981/exe", 0x1008e80, 99) = -1 ENOENT (No such file or directory)
open("/proc/19981/cmdline", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
capget({_LINUX_CAPABILITY_VERSION_3, 0}, NULL) = 0
capget({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, 0}) = 0
open("/proc/19981/sessionid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
capget({_LINUX_CAPABILITY_VERSION_3, 0}, NULL) = 0
capget({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, 0}) = 0
open("/proc/19981/loginuid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/500/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
uname({sys="Linux", node="fedora-15", ...}) = 0
clock_gettime(CLOCK_REALTIME, {1350560576, 588798333}) = 0
clock_gettime(CLOCK_MONOTONIC, {184373, 816089357}) = 0
ftruncate(11, 29777920) = 0
ioctl(4, FIONREAD, [237]) = 0
recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"PRIORITY=6\nSYSLOG_FACILITY=3\nCOD"..., 2315}], msg_controllen=136, {cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */, ...}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 237
open("/etc/localtime", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
sendmsg(5, {msg_name(29)={sa_family=AF_FILE, sun_path="/run/systemd/journal/syslog"}, msg_iov(5)=[{"<30>", 4}, {"Oct 18 11:42:56 ", 16}, {"lt-test-log", 11}, {"[19981]: ", 9}, {"Foobar PID=19981", 16}], msg_controllen=28, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=19981, uid=500, gid=500}}, msg_flags=0}, MSG_NOSIGNAL) = -1 ESRCH (No such process)
sendmsg(5, {msg_name(29)={sa_family=AF_FILE, sun_path="/run/systemd/journal/syslog"}, msg_iov(5)=[{"<30>", 4}, {"Oct 18 11:42:56 ", 16}, {"lt-test-log", 11}, {"[19981]: ", 9}, {"Foobar PID=19981", 16}], msg_controllen=28, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=350, uid=500, gid=500}}, msg_flags=0}, MSG_NOSIGNAL) = 56
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/comm", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
readlink("/proc/19981/exe", 0x1008e80, 99) = -1 ENOENT (No such file or directory)
open("/proc/19981/cmdline", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
capget({_LINUX_CAPABILITY_VERSION_3, 0}, NULL) = 0
capget({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, 0}) = 0
open("/proc/19981/sessionid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
capget({_LINUX_CAPABILITY_VERSION_3, 0}, NULL) = 0
capget({_LINUX_CAPABILITY_VERSION_3, 0}, {CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, CAP_CHOWN|CAP_DAC_OVERRIDE|CAP_DAC_READ_SEARCH|CAP_FOWNER|CAP_SETGID|CAP_SETUID|CAP_SYS_PTRACE|CAP_SYS_ADMIN|CAP_AUDIT_CONTROL, 0}) = 0
open("/proc/19981/loginuid", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/500/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/19981/cgroup", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
uname({sys="Linux", node="fedora-15", ...}) = 0
clock_gettime(CLOCK_REALTIME, {1350560576, 594694729}) = 0
clock_gettime(CLOCK_MONOTONIC, {184373, 821970285}) = 0
ftruncate(11, 29777920) = 0
ioctl(4, FIONREAD, [0]) = 0
recvmsg(4, 0x7fff4348ab60, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(7,
It seems that since the first sendmsg returning -ESRCH journald can
expect the the process it is trying to query is gone, and there is no
need to check /proc/19981/cgroup, /proc/19981/comm, etc. Would be
nice to short-circuit this. Is there a legitimate reason why some of
those files in /proc would be available, but not others?
This would go away if we had SCM_xyz as mentioned above. The big
advantage would be that we'd always have the data from the precise point
the message was queued.

Optimizing the ENOENTs away feels like optimizing the error path,
i.e. polishing failure. I am pretty sure the right fix is to get the
kernel extended to be nicer to us...

Lennart
--
Lennart Poettering - Red Hat, Inc.
Zbigniew Jędrzejewski-Szmek
2012-10-18 22:26:59 UTC
Permalink
Post by Lennart Poettering
Post by Zbigniew Jędrzejewski-Szmek
But systemd has/had this information and could be queried for it. So I
propose to add a cache of recently dead PIDs in systemd, that could be
queried by journald. I'm not sure about the details, but probably
keeping the information for a few seconds should be enough. This would
lead to more reliable logging for messages close to the end of the
program.
This wouldn't really work I fear, since systemd only really tracks the
main and control PIds of a cgroup, and there might be much more that
log.
I think the only real way to solve this cleanly is to extend the kernel
to provide SCM_CREDENTIAL and SCM_SECURITY-like auxiliary messages that
carries the information we need. More specifically I am hoping that we
can get SCM_AUDIT (for the loginuid + sessionid of a process),
SCM_EXECINFO (for exe, comm, cmdline), SCM_CGROUP (for cgroup
membership). With that we should have all data we need in a safe and
secure way.
So basically it all comes down to fixing the kernel...
Post by Lennart Poettering
Post by Zbigniew Jędrzejewski-Szmek
I guess that some applications could simply block, but in case of e.g.
systemd itself it is not possible. journald should process the
messages faster. Should journald run reniced to higher priority? I
think that the problem is worse with journald than with syslog,
because journald tries to do more things: parse the message, query
cgroup information, forward the message to syslog, store it on disk.
syslog only does the last one, so necessarily must be more efficient.
I remember that maemo used to have issues with syslog blocking. They
basically ran into a priority inversion problem, where RT threads ended
up waiting for the non-RT syslog. I think large qlen are a pretty good
solution for this.
OK, but let's say that we have 1000 processes generating
messages. journald would be overrun anyway, no matter what the queue
size actually is.

Zbyszek
Lennart Poettering
2012-10-23 16:42:27 UTC
Permalink
Post by Zbigniew Jędrzejewski-Szmek
Post by Lennart Poettering
I think the only real way to solve this cleanly is to extend the kernel
to provide SCM_CREDENTIAL and SCM_SECURITY-like auxiliary messages that
carries the information we need. More specifically I am hoping that we
can get SCM_AUDIT (for the loginuid + sessionid of a process),
SCM_EXECINFO (for exe, comm, cmdline), SCM_CGROUP (for cgroup
membership). With that we should have all data we need in a safe and
secure way.
So basically it all comes down to fixing the kernel...
Yeah, unfortunately.
Post by Zbigniew Jędrzejewski-Szmek
Post by Lennart Poettering
I remember that maemo used to have issues with syslog blocking. They
basically ran into a priority inversion problem, where RT threads ended
up waiting for the non-RT syslog. I think large qlen are a pretty good
solution for this.
OK, but let's say that we have 1000 processes generating
messages. journald would be overrun anyway, no matter what the queue
size actually is.
That is a valid concern. But at one point you *must* make a cut, and say
"OK, this is now getting waaaaay too much, now it is better to drop
rather than collecting more and more and more and DoSing the system for
good". Which is what we do, though admittedly much too early, but where
adding a socktopt for the queue size would help.

That said, we probably could improve this worst case scenario a bit. For
example, for the syslog forwarding we recently added a scheme where we'd
count how much we dropped and we would log this counter as soon as the
queue gets unclogged. We did this on request of the HA guys who
basically said "it's OK if you drop, but you need to tell us about
that". So maybe we shoud apply this to log.c and journal-send.c as well,
and instead of queing locally we should just count and flush a messages
about that when things get unclogged?

(added this to the TODO list now)

Lennart
--
Lennart Poettering - Red Hat, Inc.
Continue reading on narkive:
Loading...