@@ -135,48 +135,6 @@ entirely read-only. To close this gap it would be great if such
135135propagated mounts could implicitly gain ` MS_RDONLY ` as they are
136136propagated.
137137
138- ### Disabling reception of ` SCM_RIGHTS ` for ` AF_UNIX ` sockets
139-
140- Ability to turn off ` SCM_RIGHTS ` reception for ` AF_UNIX `
141- sockets. Right now reception of file descriptors is always on when
142- a process makes the mistake of invoking ` recvmsg() ` on such a
143- socket. This is problematic since ` SCM_RIGHTS ` installs file
144- descriptors in the recipient process' file descriptor
145- table. Getting rid of these file descriptors is not necessarily
146- easy, as they could refer to "slow-to-close" files (think: dirty
147- file descriptor referring to a file on an unresponsive NFS server,
148- or some device file descriptor), that might cause the recipient to
149- block for a longer time when it tries to them. Programs reading
150- from an ` AF_UNIX ` socket currently have three options:
151-
152- 1 . Never use ` recvmsg() ` , and stick to ` read() ` , ` recv() ` and
153- similar which do not install file descriptors in the recipients
154- file descriptor table.
155-
156- 2 . Ignore the problem, and simply ` close() ` the received file descriptors
157- it didn't expect, thus possibly locking up for a longer time.
158-
159- 3 . Fork off a thread that invokes ` close() ` , which mitigates the
160- risk of blocking, but still means a sender can cause resource
161- exhaustion in a recipient by flooding it with file descriptors,
162- as for each of them a thread needs to be spawned and a file
163- descriptor is taken while it is in the process of being closed.
164-
165- (Another option of course is to never talk ` AF_UNIX ` to peers that
166- are not trusted to not send unexpected file descriptors.)
167-
168- A simple knob that allows turning off ` SCM_RIGHTS ` right reception
169- would be useful to close this weakness, and would allow
170- ` recvmsg() ` to be called without risking file descriptors to be
171- installed in the file descriptor table, and thus risking a
172- blocking ` close() ` or a form of potential resource exhaustion.
173-
174- ** Use-Case:** any program that uses ` AF_UNIX ` sockets and uses (or
175- would like to use) ` recvmsg() ` on it (which is useful to acquire
176- other metadata). Example: logging daemons that want to collect
177- timestamp or ` SCM_CREDS ` auxiliary data, or the D-Bus message
178- broker and suchlike.
179-
180138### Filtering on received file descriptors
181139
182140An alternative to the previous item could be if some form of filtering
@@ -187,26 +145,8 @@ received" may be expressed. (BPF?).
187145
188146** Use-Case:** as above.
189147
190- ### A reliable way to check for PID namespacing
191-
192- A reliable (non-heuristic) way to detect from userspace if the
193- current process is running in a PID namespace that is not the main
194- PID namespace. PID namespaces are probably the primary type of
195- namespace that identify a container environment. While many
196- heuristics exist to determine generically whether one is executed
197- inside a container, it would be good to have a correct,
198- well-defined way to determine this.
199-
200- ** Use-Case:** tools such as ` systemd-detect-virt ` exist to determine
201- container execution, but typically resolve to checking for
202- specific implementations. It would be much nicer and universally
203- applicable if such a check could be done generically. It would
204- probably suffice to provide an ` ioctl() ` call on the ` pidns ` file
205- descriptor that reveals this kind of information in some form.
206-
207148### Excluding processes watched via ` pidfd ` from ` waitid(P_ALL, …) `
208149
209-
210150** Use-Case:** various programs use ` waitid(P_ALL, …) ` to collect exit
211151information of exited child processes. In particular PID 1 and
212152processes using ` PR_SET_CHILD_SUBREAPER ` use this as they may
@@ -540,7 +480,7 @@ does not work anymore. It would be great if there was an API to
540480simply query ` overlayfs ` for the superblock information
541481(i.e. ` .st_dev ` ) of the backing layers.
542482
543- #### Automatic growing of ` btrfs ` filesystems
483+ ### Automatic growing of ` btrfs ` filesystems
544484
545485An * auto-grow* feature in ` btrfs ` would be excellent.
546486
@@ -663,6 +603,32 @@ speaking the 2nd idea makes the 1st idea half-way redundant.
663603so on) needs this, so that it can reasonably handle SELinux AVC errors
664604on received messages.
665605
606+ ### Reasonable EOF on SOCK_SEQPACKET
607+
608+ Zero size datagrams cannot be distinguished from EOF on
609+ ` SOCK_SEQPACKET ` . Both will cause ` recvmsg() ` to return zero.
610+
611+ Idea how to improve things: maybe define a new MSG_XYZ flag for this,
612+ which causes either of the two cases to result in some recognizable
613+ error code being returned rather than a 0.
614+
615+ ** Use-Case:** Any code that wants to use ` SOCK_SEQPACKET ` and cannot
616+ afford to disallow zero-sized datagrams in its protocol.
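
The ambiguity described above can be reproduced with a short sketch. It assumes Linux and an `AF_UNIX` socket pair; the helper name `recv_zero_datagram_and_eof` is made up for this illustration. It sends one empty datagram, reads it, then closes the sender and reads again, returning both results so they can be compared.

```python
import socket


def recv_zero_datagram_and_eof():
    """Return what recv() yields for an empty datagram and then for EOF."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_SEQPACKET)
    try:
        a.send(b"")            # queue a zero-sized datagram
        first = b.recv(16)     # read the empty datagram
        a.close()              # peer goes away
        second = b.recv(16)    # read at end-of-stream
        return first, second
    finally:
        a.close()              # closing twice is harmless in Python
        b.close()


if __name__ == "__main__":
    datagram, eof = recv_zero_datagram_and_eof()
    print("empty datagram:", datagram, "EOF:", eof)
```

If both values come back as empty byte strings, the receiver has no way to tell from the return value alone whether the peer is still there.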
617+
618+ ### Reasonable Handling of SELinux dropping SCM_RIGHTS fds
619+
620+ Currently, if SELinux refuses to let some file descriptor through, it
621+ will just drop it from the ` SCM_RIGHTS ` array. That's a terrible
622+ idea, since applications rely on the precise arrangement of the array
623+ to know which fd is which. By dropping entries silently, these apps
624+ will all break.
625+
626+ Idea how to improve things: leave the elements in the array in place,
627+ but return a marker instead (e.g. a negative integer, maybe ` -EPERM ` ) that
628+ tells userspace that there was an fd, but it was not allowed through.
629+
630+ ** Use-Case:** Any code that wants to use ` SCM_RIGHTS ` properly.
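
To illustrate how a receiver might consume such markers, here is a purely hypothetical sketch. No kernel is known to behave this way today; the function name `split_rights` and the convention that a negative entry carries a negated errno are assumptions taken from the idea above. The input is the raw payload of an `SCM_RIGHTS` control message, i.e. a packed array of native C `int` values.

```python
import array
import errno


def split_rights(cmsg_data):
    """Decode an SCM_RIGHTS payload, keeping array positions stable.

    Returns a list of (index, fd, err) tuples. Under the HYPOTHETICAL
    marker scheme, a negative entry means "an fd was here but was not
    let through", and its absolute value is treated as an errno.
    """
    fds = array.array("i")
    usable = len(cmsg_data) - (len(cmsg_data) % fds.itemsize)
    fds.frombytes(cmsg_data[:usable])  # ignore a truncated trailing entry
    result = []
    for index, value in enumerate(fds):
        if value >= 0:
            result.append((index, value, None))
        else:
            result.append((index, None, -value))
    return result


if __name__ == "__main__":
    payload = array.array("i", [5, -errno.EPERM, 7]).tobytes()
    for index, fd, err in split_rights(payload):
        print(index, fd, errno.errorcode.get(err) if err else None)
```

Because the refused entry stays in place, the third descriptor is still found at index 2, which is the property applications depend on.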
631+
666632---
667633
668634## Finished Items
@@ -726,6 +692,7 @@ https://github.com/systemd/systemd/commit/8939eeae528ef9b9ad2a21995279b76d382d5c
726692to safely and race-freely invoke processes, but the fact that ` comm `
727693is useless after invoking a process that way makes the call
728694unfortunately hard to use for systemd.
695+
729696### Make statx() on a pidfd return additional info
730697
731698Make statx() on a pidfd return additional recognizable identifiers in
@@ -974,3 +941,70 @@ handlers.
974941** 🙇 ` bc70682a497c ("ovl: support idmapped layers") ` 🙇**
975942
976943** Use-Case:** Allow containers to use ` overlayfs ` with idmapped mounts.
944+
945+ ### Disabling reception of ` SCM_RIGHTS ` for ` AF_UNIX ` sockets
946+
947+ [ x] Ability to turn off ` SCM_RIGHTS ` reception for ` AF_UNIX `
948+ sockets.
949+
950+ ** 🙇 ` 77cbe1a6d8730a07f99f9263c2d5f2304cf5e830 ("af_unix: Introduce SO_PASSRIGHTS") ` 🙇**
951+
952+ Right now reception of file descriptors is always on when
953+ a process makes the mistake of invoking ` recvmsg() ` on such a
954+ socket. This is problematic since ` SCM_RIGHTS ` installs file
955+ descriptors in the recipient process' file descriptor
956+ table. Getting rid of these file descriptors is not necessarily
957+ easy, as they could refer to "slow-to-close" files (think: dirty
958+ file descriptor referring to a file on an unresponsive NFS server,
959+ or some device file descriptor), that might cause the recipient to
960+ block for a longer time when it tries to close them. Programs reading
961+ from an ` AF_UNIX ` socket currently have three options:
962+
963+ 1 . Never use ` recvmsg() ` , and stick to ` read() ` , ` recv() ` and
964+ similar which do not install file descriptors in the recipient's
965+ file descriptor table.
966+
967+ 2 . Ignore the problem, and simply ` close() ` the received file descriptors
968+ it didn't expect, thus possibly locking up for a longer time.
969+
970+ 3 . Fork off a thread that invokes ` close() ` , which mitigates the
971+ risk of blocking, but still means a sender can cause resource
972+ exhaustion in a recipient by flooding it with file descriptors,
973+ as for each of them a thread needs to be spawned and a file
974+ descriptor is taken while it is in the process of being closed.
975+
976+ (Another option of course is to never talk ` AF_UNIX ` to peers that
977+ are not trusted to not send unexpected file descriptors.)
978+
979+ A simple knob that allows turning off ` SCM_RIGHTS ` reception
980+ would be useful to close this weakness, and would allow
981+ ` recvmsg() ` to be called without risking file descriptors to be
982+ installed in the file descriptor table, and thus risking a
983+ blocking ` close() ` or a form of potential resource exhaustion.
984+
985+ ** Use-Case:** any program that uses ` AF_UNIX ` sockets and uses (or
986+ would like to use) ` recvmsg() ` on it (which is useful to acquire
987+ other metadata). Example: logging daemons that want to collect
988+ timestamp or ` SCM_CREDS ` auxiliary data, or the D-Bus message
989+ broker and suchlike.
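
A minimal sketch of how a receiver might use the `SO_PASSRIGHTS` option named in the commit above. The numeric fallback `83` is an assumption based on the generic Linux socket header and may differ on some architectures, so it should be checked against the installed kernel headers. The helper name `disable_fd_reception` is made up for this illustration. On kernels without the option the call is expected to fail, and the sketch reports that instead of raising.

```python
import errno
import socket

# ASSUMPTION: 83 is the asm-generic value; verify against kernel headers.
SO_PASSRIGHTS = getattr(socket, "SO_PASSRIGHTS", 83)


def disable_fd_reception(sock):
    """Try to turn off SCM_RIGHTS reception; return True on success."""
    try:
        sock.setsockopt(socket.SOL_SOCKET, SO_PASSRIGHTS, 0)
    except OSError as e:
        # Kernels that predate the option reject the unknown optname.
        if e.errno in (errno.ENOPROTOOPT, errno.EOPNOTSUPP, errno.EINVAL):
            return False
        raise
    return True


if __name__ == "__main__":
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        print("SCM_RIGHTS reception disabled:", disable_fd_reception(b))
    finally:
        a.close()
        b.close()
```

A caller that gets `False` back still has to fall back to one of the three options listed above.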
990+
991+ ### A reliable way to check for PID namespacing
992+
993+ [ x] A reliable (non-heuristic) way to detect from userspace if the
994+ current process is running in a PID namespace that is not the main
995+ PID namespace. PID namespaces are probably the primary type of
996+ namespace that identifies a container environment. While many
997+ heuristics exist to determine generically whether one is executed
998+ inside a container, it would be good to have a correct,
999+ well-defined way to determine this.
1000+
1001+ ** 🙇 The inode number of the root PID namespace is fixed (0xEFFFFFFC)
1002+ and now considered API. It can be used to distinguish the root PID
1003+ namespace from all others. 🙇**
1004+
1005+ ** Use-Case:** tools such as ` systemd-detect-virt ` exist to determine
1006+ container execution, but typically resolve to checking for
1007+ specific implementations. It would be much nicer and universally
1008+ applicable if such a check could be done generically. It would
1009+ probably suffice to provide an ` ioctl() ` call on the ` pidns ` file
1010+ descriptor that reveals this kind of information in some form.
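
The fixed inode number mentioned above suggests a simple userspace check. The following is a sketch that assumes `/proc` is mounted and that `/proc/self/ns/pid` refers to the caller's PID namespace; the function name `in_root_pid_namespace` is made up for this illustration. It returns `None` when the namespace file cannot be inspected, so the caller can distinguish "unknown" from a definite answer.

```python
import os

# Inode number of the root PID namespace, as stated in the item above.
ROOT_PIDNS_INO = 0xEFFFFFFC


def in_root_pid_namespace(path="/proc/self/ns/pid"):
    """Return True/False, or None if the namespace file is unavailable."""
    try:
        return os.stat(path).st_ino == ROOT_PIDNS_INO
    except OSError:
        return None


if __name__ == "__main__":
    print("root PID namespace:", in_root_pid_namespace())
```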