Commit 726fc5b

Merge pull request #35 from poettering/more-is-done
mark more items as done, add two new items
2 parents cfd3acb + 60570f3 commit 726fc5b

1 file changed

+95
-61
lines changed

README.md

Lines changed: 95 additions & 61 deletions
@@ -135,48 +135,6 @@ entirely read-only. To close this gap it would be great if such
135135
propagated mounts could implicitly gain `MS_RDONLY` as they are
136136
propagated.
137137

138-
### Disabling reception of `SCM_RIGHTS` for `AF_UNIX` sockets
139-
140-
Ability to turn off `SCM_RIGHTS` reception for `AF_UNIX`
141-
sockets. Right now reception of file descriptors is always on when
142-
a process makes the mistake of invoking `recvmsg()` on such a
143-
socket. This is problematic since `SCM_RIGHTS` installs file
144-
descriptors in the recipient process' file descriptor
145-
table. Getting rid of these file descriptors is not necessarily
146-
easy, as they could refer to "slow-to-close" files (think: dirty
147-
file descriptor referring to a file on an unresponsive NFS server,
148-
or some device file descriptor), that might cause the recipient to
149-
block for a longer time when it tries to close them. Programs reading
150-
from an `AF_UNIX` socket currently have three options:
151-
152-
1. Never use `recvmsg()`, and stick to `read()`, `recv()` and
153-
similar which do not install file descriptors in the recipient's
154-
file descriptor table.
155-
156-
2. Ignore the problem, and simply `close()` the received file descriptors
157-
it didn't expect, thus possibly locking up for a longer time.
158-
159-
3. Fork off a thread that invokes `close()`, which mitigates the
160-
risk of blocking, but still means a sender can cause resource
161-
exhaustion in a recipient by flooding it with file descriptors,
162-
as for each of them a thread needs to be spawned and a file
163-
descriptor is taken while it is in the process of being closed.
164-
165-
(Another option of course is to never talk `AF_UNIX` to peers that
166-
are not trusted to not send unexpected file descriptors.)
167-
168-
A simple knob that allows turning off `SCM_RIGHTS` reception
169-
would be useful to close this weakness, and would allow
170-
`recvmsg()` to be called without risking file descriptors to be
171-
installed in the file descriptor table, and thus risking a
172-
blocking `close()` or a form of potential resource exhaustion.
173-
174-
**Use-Case:** any program that uses `AF_UNIX` sockets and uses (or
175-
would like to use) `recvmsg()` on it (which is useful to acquire
176-
other metadata). Example: logging daemons that want to collect
177-
timestamp or `SCM_CREDS` auxiliary data, or the D-Bus message
178-
broker and suchlike.
179-
180138
### Filtering on received file descriptors
181139

182140
An alternative to the previous item could be if some form of filtering
@@ -187,26 +145,8 @@ received" may be expressed. (BPF?).
187145

188146
**Use-Case:** as above.
189147

190-
### A reliable way to check for PID namespacing
191-
192-
A reliable (non-heuristic) way to detect from userspace if the
193-
current process is running in a PID namespace that is not the main
194-
PID namespace. PID namespaces are probably the primary type of
195-
namespace that identifies a container environment. While many
196-
heuristics exist to determine generically whether one is executed
197-
inside a container, it would be good to have a correct,
198-
well-defined way to determine this.
199-
200-
**Use-Case:** tools such as `systemd-detect-virt` exist to determine
201-
container execution, but typically resort to checking for
202-
specific implementations. It would be much nicer and universally
203-
applicable if such a check could be done generically. It would
204-
probably suffice to provide an `ioctl()` call on the `pidns` file
205-
descriptor that reveals this kind of information in some form.
206-
207148
### Excluding processes watched via `pidfd` from `waitid(P_ALL, …)`
208149

209-
210150
**Use-Case:** various programs use `waitid(P_ALL, …)` to collect exit
211151
information of exited child processes. In particular PID 1 and
212152
processes using `PR_SET_CHILD_SUBREAPER` use this as they may
@@ -540,7 +480,7 @@ does not work anymore. It would be great if there was an API to
540480
simply query `overlayfs` for the superblock information
541481
(i.e. `.st_dev`) of the backing layers.
542482

543-
#### Automatic growing of `btrfs` filesystems
483+
### Automatic growing of `btrfs` filesystems
544484

545485
An *auto-grow* feature in `btrfs` would be excellent.
546486

@@ -663,6 +603,32 @@ speaking the 2nd idea makes the 1st idea half-way redundant.
663603
so on) needs this, so that it can reasonably handle SELinux AVC errors
664604
on received messages.
665605

606+
### Reasonable EOF on SOCK_SEQPACKET
607+
608+
Zero size datagrams cannot be distinguished from EOF on
609+
`SOCK_SEQPACKET`. Both will cause `recvmsg()` to return zero.
610+
611+
Idea how to improve things: maybe define a new MSG_XYZ flag for this,
612+
which causes either of the two cases to result in some recognizable error
613+
code returned rather than a 0.
614+
615+
**Use-Case:** Any code that wants to use `SOCK_SEQPACKET` and cannot
616+
afford to disallow zero-sized datagrams in its protocol.
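
The ambiguity described above can be reproduced with a few lines of Python. This is only a sketch for Linux, assuming an `AF_UNIX` socket pair; the helper name `recv_results` is made up for illustration.

```python
import socket

def recv_results():
    """Receive once after a zero-size datagram and once after the peer closed."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_SEQPACKET)
    with a, b:
        a.send(b"")            # queue a zero-size datagram
        first = b.recv(1024)   # the empty datagram arrives as an empty result
        a.close()              # peer goes away
        second = b.recv(1024)  # EOF is reported as an empty result, too
    return first, second
```

Both values come back as empty byte strings, so the receiver cannot tell the two cases apart without extra protocol-level framing.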
617+
618+
### Reasonable Handling of SELinux dropping SCM_RIGHTS fds
619+
620+
Currently, if SELinux refuses to let some file descriptors through, it
621+
will just drop them from the `SCM_RIGHTS` array. That's a terrible
622+
idea, since applications rely on the precise arrangement of the array
623+
to know which fd is which. By dropping entries silently, these apps
624+
will all break.
625+
626+
Idea how to improve things: leave the elements in the array in place,
627+
but return a marker instead (e.g. a negative integer, maybe `-EPERM`) that
628+
tells userspace that there was an fd, but it was not allowed through.
629+
630+
**Use-Case:** Any code that wants to use `SCM_RIGHTS` properly.
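
To illustrate why positions matter, here is a rough Python sketch of how a receiver might decode an `SCM_RIGHTS` payload if the proposed marker existed. The negative-marker behaviour is the proposal above, not something current kernels do, and `parse_rights` is a hypothetical helper name.

```python
import array

def parse_rights(cmsg_data):
    """Decode an SCM_RIGHTS payload slot by slot.

    Returns one entry per slot: the fd for a usable descriptor, or None for
    a slot carrying a negative marker (hypothetical, as proposed above).
    """
    fds = array.array("i")
    # Ignore a trailing partial integer, if any.
    usable = len(cmsg_data) - (len(cmsg_data) % fds.itemsize)
    fds.frombytes(cmsg_data[:usable])
    return [fd if fd >= 0 else None for fd in fds]
```

With such a marker, a protocol that expects, say, "slot 0 is the log file, slot 1 is the socket" would still find each descriptor at its expected index even when one of them was refused.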
631+
666632
---
667633

668634
## Finished Items
@@ -726,6 +692,7 @@ https://github.com/systemd/systemd/commit/8939eeae528ef9b9ad2a21995279b76d382d5c
726692
to safely and race-freely invoke processes, but the fact that `comm`
727693
is useless after invoking a process that way makes the call
728694
unfortunately hard to use for systemd.
695+
729696
### Make statx() on a pidfd return additional info
730697

731698
Make statx() on a pidfd return additional recognizable identifiers in
@@ -974,3 +941,70 @@ handlers.
974941
**🙇 `bc70682a497c ("ovl: support idmapped layers")` 🙇**
975942

976943
**Use-Case:** Allow containers to use `overlayfs` with idmapped mounts.
944+
945+
### Disabling reception of `SCM_RIGHTS` for `AF_UNIX` sockets
946+
947+
[x] Ability to turn off `SCM_RIGHTS` reception for `AF_UNIX`
948+
sockets.
949+
950+
**🙇 `77cbe1a6d8730a07f99f9263c2d5f2304cf5e830 ("af_unix: Introduce SO_PASSRIGHTS")` 🙇**
951+
952+
Right now reception of file descriptors is always on when
953+
a process makes the mistake of invoking `recvmsg()` on such a
954+
socket. This is problematic since `SCM_RIGHTS` installs file
955+
descriptors in the recipient process' file descriptor
956+
table. Getting rid of these file descriptors is not necessarily
957+
easy, as they could refer to "slow-to-close" files (think: dirty
958+
file descriptor referring to a file on an unresponsive NFS server,
959+
or some device file descriptor), that might cause the recipient to
960+
block for a longer time when it tries to close them. Programs reading
961+
from an `AF_UNIX` socket currently have three options:
962+
963+
1. Never use `recvmsg()`, and stick to `read()`, `recv()` and
964+
similar which do not install file descriptors in the recipient's
965+
file descriptor table.
966+
967+
2. Ignore the problem, and simply `close()` the received file descriptors
968+
it didn't expect, thus possibly locking up for a longer time.
969+
970+
3. Fork off a thread that invokes `close()`, which mitigates the
971+
risk of blocking, but still means a sender can cause resource
972+
exhaustion in a recipient by flooding it with file descriptors,
973+
as for each of them a thread needs to be spawned and a file
974+
descriptor is taken while it is in the process of being closed.
975+
976+
(Another option of course is to never talk `AF_UNIX` to peers that
977+
are not trusted to not send unexpected file descriptors.)
978+
979+
A simple knob that allows turning off `SCM_RIGHTS` reception
980+
would be useful to close this weakness, and would allow
981+
`recvmsg()` to be called without risking file descriptors to be
982+
installed in the file descriptor table, and thus risking a
983+
blocking `close()` or a form of potential resource exhaustion.
984+
985+
**Use-Case:** any program that uses `AF_UNIX` sockets and uses (or
986+
would like to use) `recvmsg()` on it (which is useful to acquire
987+
other metadata). Example: logging daemons that want to collect
988+
timestamp or `SCM_CREDS` auxiliary data, or the D-Bus message
989+
broker and suchlike.
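
A minimal Python sketch of how a receiver might use the new socket option. The numeric fallback `83` is an assumption taken from the generic socket header of kernels that carry the commit above; some architectures number socket options differently, and Python's `socket` module may not export the name. `disable_fd_reception` is a made-up helper name.

```python
import errno
import socket

# Assumption: generic-architecture value; prefer the module constant if present.
SO_PASSRIGHTS = getattr(socket, "SO_PASSRIGHTS", 83)

def disable_fd_reception(sock):
    """Ask the kernel not to deliver SCM_RIGHTS on this socket.

    Returns True if the option was accepted, False if the running kernel
    does not appear to support it.
    """
    try:
        sock.setsockopt(socket.SOL_SOCKET, SO_PASSRIGHTS, 0)
    except OSError as e:
        if e.errno in (errno.ENOPROTOOPT, errno.EOPNOTSUPP, errno.EINVAL):
            return False
        raise
    return True
```

A daemon would presumably call this right after creating or accepting the socket and, if it returns `False`, fall back to one of the three options listed above.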
990+
991+
### A reliable way to check for PID namespacing
992+
993+
[x] A reliable (non-heuristic) way to detect from userspace if the
994+
current process is running in a PID namespace that is not the main
995+
PID namespace. PID namespaces are probably the primary type of
996+
namespace that identifies a container environment. While many
997+
heuristics exist to determine generically whether one is executed
998+
inside a container, it would be good to have a correct,
999+
well-defined way to determine this.
1000+
1001+
**🙇 The inode number of the root PID namespace is fixed (0xEFFFFFFC)
1002+
and now considered API. It can be used to distinguish the root PID
1003+
namespace from all others. 🙇**
1004+
1005+
**Use-Case:** tools such as `systemd-detect-virt` exist to determine
1006+
container execution, but typically resort to checking for
1007+
specific implementations. It would be much nicer and universally
1008+
applicable if such a check could be done generically. It would
1009+
probably suffice to provide an `ioctl()` call on the `pidns` file
1010+
descriptor that reveals this kind of information in some form.
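
Based on the fixed inode number mentioned above, a check could look roughly like this in Python. The path `/proc/self/ns/pid` and the helper name `in_root_pid_namespace` are assumptions of this sketch; it requires a mounted procfs.

```python
import os

# Inode number of the root PID namespace, as stated above.
PID_NS_ROOT_INO = 0xEFFFFFFC

def in_root_pid_namespace(path="/proc/self/ns/pid"):
    """Return True/False, or None if the namespace link cannot be inspected."""
    try:
        # stat() follows the magic link to the namespace inode itself.
        return os.stat(path).st_ino == PID_NS_ROOT_INO
    except OSError:
        return None
```

A `None` result only means the question could not be answered (for example, procfs is not mounted), so callers should treat it separately from `False`.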
