Hello! Maxim Cournoyer skribis: > 374 connect(11, {sa_family=AF_UNIX, sun_path="/var/run/dbus/system_bus_socket"}, 34) = 0 [...] > 374 epoll_wait(5, [{events=EPOLLIN|EPOLLOUT|EPOLLHUP, data={u32=24802800, u64=24802800}}], 20, -1) = 1 > 374 sendmsg(11, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\1\0\1\0\0\0\0\1\0\0\0m\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\3\1s\0\5\0\0\0Hello\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0", iov_len=128}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EPIPE (Broken pipe) > 374 gettid() = 374 > 374 epoll_ctl(5, EPOLL_CTL_MOD, 11, {events=0, data={u32=24802800, u64=24802800}}) = 0 > 374 timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0 > 374 epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2 > 374 read(12, "\1\0\0\0\0\0\0\0", 8) = 8 > 374 gettid() = 374 > 374 timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0 > 374 epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2 > 374 read(12, "\1\0\0\0\0\0\0\0", 8) = 8 > 374 gettid() = 374 > 374 timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0 > 374 epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2 > 374 read(12, "\1\0\0\0\0\0\0\0", 8) = 8 > 374 epoll_ctl(5, EPOLL_CTL_DEL, 11, NULL) = 0 > 374 close(11) = 0 > 374 gettid() = 374 > 374 epoll_wait(5, > 391 <... close resumed>) = 0 > 391 madvise(0x7fd6c83dc000, 8368128, MADV_DONTNEED) = 0 > 391 exit(0) = ? > 391 +++ exited with 0 +++ > 374 <... epoll_wait resumed>[{events=EPOLLERR, data={u32=24768000, u64=24768000}}], 17, -1) = 1 > 374 lseek(7, 0, SEEK_SET) = 0 > 374 read(7, "tty7\n", 63) = 5 As you pointed out on IRC, the initially ‘Hello’ method call above leads to EPIPE, and we can see that elogind eventually closes its socket to dbus-daemon *but* keeps doing its thing. Some interesting things to note… First, to my surprise, elogind does not use the client library of the ‘dbus’ package: --8<---------------cut here---------------start------------->8--- $ guix gc --references $(./pre-inst-env guix build elogind)|grep dbus $ echo $? 1 --8<---------------cut here---------------end--------------->8--- (This is already the case in ‘master’ with v243.7.) Instead, it has its own implementation of the DBus protocol, in C, from systemd—we can’t have enough sources of bugs and vulnerabilities. Anyway, the “Hello” message is sent to the system bus asynchronously in ‘sd-bus.c’: --8<---------------cut here---------------start------------->8--- static int bus_send_hello(sd_bus *bus) { _cleanup_(sd_bus_message_unrefp) sd_bus_message *m = NULL; int r; assert(bus); if (!bus->bus_client) return 0; r = sd_bus_message_new_method_call( bus, &m, "org.freedesktop.DBus", "/org/freedesktop/DBus", "org.freedesktop.DBus", "Hello"); if (r < 0) return r; return sd_bus_call_async(bus, NULL, m, hello_callback, NULL, 0); } --8<---------------cut here---------------end--------------->8--- A callback is called when a reply is received or an error arises: --8<---------------cut here---------------start------------->8--- static int hello_callback(sd_bus_message *reply, void *userdata, sd_bus_error *error) { [...] fail: /* When Hello() failed, let's propagate this in two ways: first we return the error immediately here, * which is the propagated up towards the event loop. Let's also invalidate the connection, so that * if the user then calls back into us again we won't wait any longer. */ bus_set_state(bus, BUS_CLOSING); return r; } --8<---------------cut here---------------end--------------->8--- It’s not clear from that whether the authors intended for the thing to keep going in case of failure. In our case it’s not helpful. But why does dbus-daemon drop the connection in the first place? To know that, we could change ‘dbus-root-service-type’ to run dbus-daemon from a ‘--enable-verbose-mode’ build, and with the ‘DBUS_VERBOSE’ environment set to 1. Looking at ‘dbus-server-socket.c’ it would seem that the only sensible way this can happen is if dbus-daemon doesn’t yet have a ‘new_connection_function’ set at the time it accepts the incoming connection: --8<---------------cut here---------------start------------->8--- static dbus_bool_t handle_new_client_fd_and_unlock (DBusServer *server, DBusSocket client_fd) { [...] /* See if someone wants to handle this new connection, self-referencing * for paranoia. */ new_connection_function = server->new_connection_function; new_connection_data = server->new_connection_data; _dbus_server_ref_unlocked (server); SERVER_UNLOCK (server); if (new_connection_function) { (* new_connection_function) (server, connection, new_connection_data); } dbus_server_unref (server); /* If no one grabbed a reference, the connection will die. */ _dbus_connection_close_if_only_one_ref (connection); dbus_connection_unref (connection); --8<---------------cut here---------------end--------------->8--- We know that elogind is started after dbus-daemon has written its PID file (there’s a Shepherd service dependency). Is there a possibility that dbus-daemon writes its PID file before it has set ‘new_connection_function’? It would seem that ‘bus_context_new’ writes the PID file after it has called ‘dbus_server_set_new_connection_function’ via ‘setup_server’ via ‘process_config_first_time_only’. So not sure what happens. That’s it for today. Thoughts? Ludo’.