
[WIP] new network driver: slirpnetstack (experimental) #101

Closed

Conversation

@AkihiroSuda
Member Author

cc @majek

Travis will give us some iperf benchmark data.

@AkihiroSuda AkihiroSuda changed the title from "new network driver: slirpnetstack (experimental)" to "[WIP] new network driver: slirpnetstack (experimental)" Jan 22, 2020
@AkihiroSuda
Member Author

Seems like cloudflare/slirpnetstack#1 needs to be solved.

@AkihiroSuda AkihiroSuda force-pushed the slirpnetstack branch 3 times, most recently from fc17d80 to dc56a71 Compare January 22, 2020 11:47
@AkihiroSuda
Member Author

DNS doesn't work? @majek

(host)$ rootlesskit  --net=slirpnetstack  --copy-up=/etc bash
(rootlesskit) # cat /etc/resolv.conf 
nameserver 8.8.8.8

(rootlesskit) # nslookup www.google.com
Server:         8.8.8.8
Address:        8.8.8.8#53

Non-authoritative answer:
Name:   www.google.com
Address: 172.217.175.100
Name:   www.google.com
Address: 2404:6800:4004:80b::2004

(rootlesskit) # telnet www.google.com 80
telnet: could not resolve www.google.com/80: Name or service not known

@majek

majek commented Jan 22, 2020

@AkihiroSuda yeah, I didn't get that right yet. I just committed a hack that may work around the major problem, but I haven't tested it properly yet. Let me know if it fixes the immediate problem.

@AkihiroSuda
Member Author

thanks, seems fine

@AkihiroSuda
Member Author

AkihiroSuda commented Jan 22, 2020

https://travis-ci.org/rootless-containers/rootlesskit/builds/640423841

+ rootlesskit --net=slirp4netns --mtu=1500 iperf3 -t 60 -c 10.0.2.2
...
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-60.00  sec  5.35 GBytes   766 Mbits/sec    0             sender
[  4]   0.00-60.00  sec  5.35 GBytes   766 Mbits/sec                  receiver
...
+ rootlesskit --net=slirpnetstack --mtu=1500 iperf3 -t 60 -c 172.17.0.2
...
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-60.00  sec  4.26 GBytes   610 Mbits/sec    0             sender
[  4]   0.00-60.00  sec  4.26 GBytes   610 Mbits/sec                  receiver
+ rootlesskit --net=slirp4netns --mtu=65520 iperf3 -t 60 -c 10.0.2.2
...
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-60.00  sec  62.6 GBytes  8.97 Gbits/sec    0             sender
[  4]   0.00-60.00  sec  62.6 GBytes  8.97 Gbits/sec                  receiver
...
+ rootlesskit --net=slirpnetstack --mtu=65520 iperf3 -t 60 -c 172.17.0.2
...
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-60.00  sec  1.24 GBytes   177 Mbits/sec    0             sender
[  4]   0.00-60.00  sec  1.23 GBytes   177 Mbits/sec                  receiver

Note: slirp4netns vs. slirpnetstack cannot be compared fairly in this benchmark, because slirpnetstack lacks the host-loopback address 10.0.2.2. So the slirpnetstack benchmark connects to the host eth0 address 172.17.0.2 instead.

But even considering that, slirpnetstack seems slow?

@tonistiigi
Contributor

But even considering that, slirpnetstack seems slow?

mtu=1500 looks promising though. Maybe there is some configuration issue at high MTU. Did you check the other values in the middle as well? For some workflows, ~1 Gbps is enough (in 2020). @majek any ideas on how to interpret the benchmark?

@AkihiroSuda
Member Author

AkihiroSuda commented Jan 22, 2020

Note: slirp4netns vs. slirpnetstack cannot be compared fairly in this benchmark, because slirpnetstack lacks the host-loopback address 10.0.2.2. So the slirpnetstack benchmark connects to the host eth0 address 172.17.0.2 instead.

p.s. On my laptop with slirp4netns, iperf3 throughput against the host loopback and against the host eth0 was almost the same (with several --mtu values).

Did you check the other values in the middle as well?

+ rootlesskit --net=slirp4netns --mtu=4000 iperf3 -t 60 -c 10.0.2.2
...
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-60.00  sec  13.2 GBytes  1.89 Gbits/sec    0             sender
[  4]   0.00-60.00  sec  13.2 GBytes  1.89 Gbits/sec                  receiver
...
+ rootlesskit --net=slirpnetstack --mtu=4000 iperf3 -t 60 -c 172.17.0.2
...
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-60.00  sec  6.15 GBytes   880 Mbits/sec    0             sender
[  4]   0.00-60.00  sec  6.15 GBytes   880 Mbits/sec                  receiver

@hbhasker

bhasker from the gvisor team here.

So I ran this locally, and I think I understand why this is performing worse. There are a few simple fixes for slirpnetstack:

a) NewForwarder is being called with a window of 30k, which means it's advertising a window scale of 1, resulting in every packet causing a ZeroWindow event/update.
b) RXChecksumOffload should be set to true, as there is no need to do checksum verification when reading packets from the tap. Ideally we should not need to do TX checksumming either, but enabling that offload causes Linux to drop TCP packets w/o valid checksums. Also, our calculateChecksum code is slow and could use some loop unrolling to make it much faster.
c) slirpnetstack is enabling the sniffer on the link endpoint, which means it's trying to log every packet. This slows things down dramatically; it should be made configurable and only enabled when debugging issues.
d) Setting ModerateRecvBuf to true won't really work, as the gonet API is not configured to invoke endpoint.ModerateRecvBuf after Read(). The way auto-tuning works is that gvisor invokes the API after it copies bytes to user space, so setting it to true does nothing unless the API is called. That said, there is no real reason to use auto-tuning, since these are all on-host connections and a buffer of a couple of MB is more than enough to hit 10 Gbits/s. (A sketch of these fixes follows below.)
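
The four fixes map onto gvisor's netstack API roughly as follows. This is a minimal sketch, not slirpnetstack's actual code: the names (fdbased.Options.RXChecksumOffload, sniffer.New, tcp.NewForwarder) are from gvisor's tcpip packages as of early 2020, and the 4 MiB window is an assumed value.

package main

import (
    "gvisor.dev/gvisor/pkg/tcpip/link/fdbased"
    "gvisor.dev/gvisor/pkg/tcpip/link/sniffer"
    "gvisor.dev/gvisor/pkg/tcpip/stack"
    "gvisor.dev/gvisor/pkg/tcpip/transport/tcp"
)

// (a) + (d): use a fixed multi-MB receive window instead of 30k. At this size
// the advertised window scale is meaningful, and on an on-host connection a
// fixed buffer removes any need for ModerateRecvBuf auto-tuning.
const rcvWnd = 4 << 20 // 4 MiB -- an assumed value ("a couple of MB")

func newLinkEndpoint(tapFD int, mtu uint32, debug bool) (stack.LinkEndpoint, error) {
    ep, err := fdbased.New(&fdbased.Options{
        FDs:            []int{tapFD},
        MTU:            mtu,
        EthernetHeader: true,
        // (b): packets read from the tap were already validated by the host
        // kernel, so skip RX checksum verification. TX checksumming stays on,
        // since Linux drops TCP packets without valid checksums.
        RXChecksumOffload: true,
    })
    if err != nil {
        return nil, err
    }
    // (c): only pay for per-packet logging when explicitly debugging.
    if debug {
        ep = sniffer.New(ep)
    }
    return ep, nil
}

func newTCPForwarder(s *stack.Stack, handle func(*tcp.ForwarderRequest)) *tcp.Forwarder {
    // 1024 bounds concurrent half-open connections; an illustrative value.
    return tcp.NewForwarder(s, rcvWnd, 1024, handle)
}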

w/ these fixes I see the following:

iperf3 -c 100.117.29.130 -t 240
Connecting to host 100.117.29.130, port 5201
[  5] local 10.0.2.100 port 60364 connected to 100.117.29.130 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.02 GBytes  8.75 Gbits/sec    0   1.19 MBytes
[  5]   1.00-2.00   sec  1.02 GBytes  8.78 Gbits/sec    0   1.19 MBytes
[  5]   2.00-3.00   sec  1.03 GBytes  8.83 Gbits/sec    0   1.19 MBytes
^C[  5]   3.00-3.78   sec   848 MBytes  9.08 Gbits/sec    0   1.19 MBytes

iperf3 -c 100.117.29.130 -t 240 -R
Connecting to host 100.117.29.130, port 5201
Reverse mode, remote host 100.117.29.130 is sending
[  5] local 10.0.2.100 port 60502 connected to 100.117.29.130 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   405 MBytes  3.40 Gbits/sec
[  5]   1.00-2.00   sec   353 MBytes  2.96 Gbits/sec
[  5]   2.00-3.00   sec   362 MBytes  3.03 Gbits/sec
[  5]   3.00-4.00   sec   365 MBytes  3.06 Gbits/sec
^C[  5]   4.00-4.43   sec   151 MBytes  2.95 Gbits/sec

The latter is slower because our checksum calculation is really slow. It's using 30% of CPU when I run:

sudo perf top -p

majek added a commit to cloudflare/slirpnetstack that referenced this pull request Jan 24, 2020
@majek

majek commented Jan 24, 2020

Thanks @hbhasker; implemented in cloudflare/slirpnetstack@14ee235

@AkihiroSuda
Member Author

The PR should be almost ready to merge, but the slirpnetstack CLI spec seems likely to change? cloudflare/slirpnetstack#4

@elmarco

elmarco commented Feb 13, 2020

@AkihiroSuda it would be nice if we agreed on what the CLI should look like. It seems the requirements for containers and VMs are pretty similar.

I have moved & updated the slirp-helper spec on a wiki: https://gitlab.freedesktop.org/slirp/libslirp/-/wikis/Slirp-Helper. Not sure it's the best way to discuss the spec though; any suggestions? But feel free to edit the wiki in the meantime.

Or maybe we don't need a spec, and just follow whatever slirpnetstack defines?

(note: the slirp-helper spec was written precisely to allow easily interchanging the helper implementation in libvirt...)

@AkihiroSuda
Member Author

Thanks @elmarco, I think we need an equivalent of --ready-fd and --exit-fd.
cc @giuseppe

Or maybe we don't need a spec, and just follow whatever slirpnetstack defines?

Or follow whatever slirp4netns defines? 😛

@elmarco

elmarco commented Feb 14, 2020

Thanks @elmarco, I think we need an equivalent of --ready-fd and --exit-fd.

I don't really get what --ready-fd is for. The network data should be processed only after the helper/configuration is ready.

--exit-fd, why not kill the process?

There is also a --exit-with-parent in the spec which may help.

cc @giuseppe

Or maybe we don't need a spec, and just follow whatever slirpnetstack defines?

Or follow whatever slirp4netns defines? 😛

yeah :) besides the NS-specific options, there are not that many slirp4netns options anyway.

But I wonder about the JSON API: why not use DBus?

@giuseppe
Contributor

--exit-fd, why not kill the process?

It simplifies how the lifecycle is handled in Podman. The other end of the pipe is injected into the conmon process (the shim process for the container). When conmon exits, slirp4netns is terminated as well, without conmon having to know anything about slirp4netns.
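
For context, a minimal parent-side sketch of that fd plumbing with slirp4netns, which writes "1" to --ready-fd once the tap is configured and exits when --exit-fd is closed (error handling trimmed; the fd numbers follow Go's ExtraFiles convention):

package main

import (
    "fmt"
    "os"
    "os/exec"
)

func startSlirp(nsPID int) (*os.File, error) {
    readyR, readyW, err := os.Pipe() // helper writes "1" here when the tap is up
    if err != nil {
        return nil, err
    }
    exitR, exitW, err := os.Pipe() // helper exits once all write ends are closed
    if err != nil {
        return nil, err
    }
    cmd := exec.Command("slirp4netns",
        "--ready-fd=3", "--exit-fd=4", // ExtraFiles[i] becomes fd 3+i in the child
        fmt.Sprintf("%d", nsPID), "tap0")
    cmd.ExtraFiles = []*os.File{readyW, exitR}
    if err := cmd.Start(); err != nil {
        return nil, err
    }
    readyW.Close() // keep only the helper's copies of these ends alive
    exitR.Close()

    buf := make([]byte, 1) // block until readiness is signalled
    if _, err := readyR.Read(buf); err != nil {
        return nil, err
    }

    // Hand exitW to a long-lived process (conmon, in Podman's case): when that
    // process dies, the pipe closes and slirp4netns exits, with no explicit
    // kill or pid tracking needed.
    return exitW, nil
}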

@AkihiroSuda
Member Author

But I wonder about the JSON API: why not use DBus?

slirp4netns adopted a JSON API because we didn't want to introduce extra dependencies.
It also helps testability from the shell. (I assume DBus also has a shell scripting interface, but I'm not sure.)

@elmarco

elmarco commented Feb 25, 2020

slirp4netns adopted a JSON API because we didn't want to introduce extra dependencies.
It also helps testability from the shell. (I assume DBus also has a shell scripting interface, but I'm not sure.)

Fundamentally, DBus, the protocol, doesn't need extra dependencies compared to JSON. For example, for glib/gio apps, DBus facilities are there, while JSON would be an extra library...

In practice though, the most convenient option is to use the bus, which requires a message bus process. But given that a DBus bus is present on 99% of Linux systems, it shouldn't be a problem for slirp. (And it can work on macOS or even Windows.)

Regarding shell scriptability, there are plenty of choices: busctl, gdbus, and dbus-send. The introspection capability makes it very convenient too, with bash completion etc. For other languages, like Python, there are various convenient high-level APIs.

JSON isn't a good machine serialization format and has issues with numbers. It is also pretty limited.

DBus comes with a better machine format, types, security, introspection, and tools for IPC in general. It's unfortunate that QEMU picked JSON. I don't think we should repeat that.

@AkihiroSuda
Member Author

But given that a DBus bus is present on 99% of Linux systems, it shouldn't be a problem for slirp.

How will it work with dind?

@elmarco

elmarco commented Feb 25, 2020

But given that a DBus bus is present on 99% of Linux systems, it shouldn't be a problem for slirp.

How will it work with dind?

What do you mean?

@AkihiroSuda
Member Author

AkihiroSuda commented Feb 25, 2020

Docker-in-Docker and its variants like Podman-in-Podman.

These environments don't have an init system and execute the container engine directly as PID 1.
So if the slirp-helper requires an external dbus-daemon process, it is hard to set up.

@elmarco

elmarco commented Feb 25, 2020

There are several options I can think of:

  • the engine starts a dbus-daemon along with the slirp-helper etc. (it's fairly simple to have a private bus; it doesn't require privileges etc.)
  • use 1-1 communication between PID 1 and the helper instead
  • or pass an existing socket/fd to the helper (either connected to a session bus or a private bus, or 1-1)

How/what do you connect to the JSON socket today?

@AkihiroSuda
Member Author

slirp4netns itself listens on a UNIX socket. The socket path is specified by the caller process, such as RootlessKit (used by Docker) or Podman.
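
Concretely, a minimal Go sketch of talking to that socket, using the add_hostfwd request shape from the slirp4netns API documentation (the socket path is whatever the caller passed via --api-socket; 10.0.2.100 is the default guest address):

package main

import (
    "encoding/json"
    "net"
)

// addHostFwd asks a running slirp4netns to forward hostPort on the host to
// guestPort inside the namespace.
func addHostFwd(socketPath string, hostPort, guestPort int) (map[string]interface{}, error) {
    conn, err := net.Dial("unix", socketPath)
    if err != nil {
        return nil, err
    }
    defer conn.Close()

    req := map[string]interface{}{
        "execute": "add_hostfwd",
        "arguments": map[string]interface{}{
            "proto":      "tcp",
            "host_addr":  "0.0.0.0",
            "host_port":  hostPort,
            "guest_addr": "10.0.2.100",
            "guest_port": guestPort,
        },
    }
    if err := json.NewEncoder(conn).Encode(req); err != nil {
        return nil, err
    }
    var resp map[string]interface{} // {"return": {"id": N}} or {"error": ...}
    if err := json.NewDecoder(conn).Decode(&resp); err != nil {
        return nil, err
    }
    return resp, nil
}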

@elmarco

elmarco commented Feb 25, 2020

slirp4netns itself listens on a UNIX socket. The socket path is specified by the caller process, such as RootlessKit (used by Docker) or Podman.

How do you typically connect to that socket path, and when using dind? From outside the container?

Sounds like it would be fairly easy to start a dbus-daemon and use the bus socket instead for all IPC. You would have a single socket path for all external processes.

dbus-broker is 250 KB, typically 1-2 MB in RAM. Hardly a large dependency compared to multi-MB Go processes.

@elmarco

elmarco commented Feb 26, 2020

I just found out Podman is using varlink. This is perhaps another option...

@AkihiroSuda
Member Author

How do you typically connect to that socket path, and when using dind? From outside the container?

The socket is only connected to by RootlessKit and Podman, which determine the socket path themselves.
Anyway, the latest versions of Docker and Podman no longer use the slirp4netns API, because they use RootlessKit's built-in port forwarder implementation.
So any kind of API framework is fine for Docker and Podman, as long as the use of the API is optional.

I just found out Podman is using varlink. This is perhaps another option...

The varlink API is already being superseded by a Docker-compatible REST API.
So another option is to use a REST API.
For slirp4netns, I avoided a REST API because I didn't want to implement an HTTP server in C or introduce extra dependencies.

@AkihiroSuda AkihiroSuda added this to the v0.10.0 milestone Mar 3, 2020
@elmarco

elmarco commented Mar 11, 2020

The socket is only connected to by RootlessKit and Podman, which determine the socket path themselves.
Anyway, the latest versions of Docker and Podman no longer use the slirp4netns API, because they use RootlessKit's built-in port forwarder implementation.
So any kind of API framework is fine for Docker and Podman, as long as the use of the API is optional.

I see, so you don't need IPC with the helper. There is a "builtin" port redirection that does setup over UNIX sockets and skips the slirp/userspace TCP/IP stack, running a small splice/proxy instead. It is not as generic/powerful as slirp's arbitrary stream redirection, but that's probably not necessary for containers in general.
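
The core of such a builtin forwarder is tiny; a sketch of the idea (RootlessKit's real implementation also passes sockets across the namespace boundary, which is omitted here, and the addresses are illustrative):

package main

import (
    "io"
    "net"
)

// proxyPort accepts connections on the host side and splices bytes to the
// namespace side, bypassing the userspace TCP/IP stack entirely.
func proxyPort(hostAddr, nsAddr string) error {
    ln, err := net.Listen("tcp", hostAddr)
    if err != nil {
        return err
    }
    for {
        hc, err := ln.Accept()
        if err != nil {
            return err
        }
        go func(hc net.Conn) {
            defer hc.Close()
            nc, err := net.Dial("tcp", nsAddr)
            if err != nil {
                return
            }
            defer nc.Close()
            go io.Copy(nc, hc) // host -> namespace
            io.Copy(hc, nc)    // namespace -> host; returns when either side ends
        }(hc)
    }
}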

As long as --dbus remains optional, I suppose you don't care then?

I just found out Podman is using varlink. This is perhaps another option...

The varlink API is already being superseded by a Docker-compatible REST API.
So another option is to use a REST API.
For slirp4netns, I avoided a REST API because I didn't want to implement an HTTP server in C or introduce extra dependencies.

Ok, I think we can stick with DBus then, as I believe it is more convenient to work with from shell, Python, or Go (godbus is pretty nice).
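
For the Go side, a sketch with godbus; the service/interface names here are hypothetical placeholders, not from the slirp-helper spec:

package main

import (
    "fmt"

    "github.com/godbus/dbus/v5"
)

func main() {
    // Connect to the session bus; dbus.Dial can reach a private or p2p bus
    // address instead.
    conn, err := dbus.SessionBus()
    if err != nil {
        panic(err)
    }

    // Hypothetical names -- the real ones would come from the spec wiki above.
    obj := conn.Object("org.example.SlirpHelper", "/org/example/SlirpHelper")
    var info string
    if err := obj.Call("org.example.SlirpHelper.GetInfo", 0).Store(&info); err != nil {
        panic(err)
    }
    fmt.Println(info)
}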

For --exit-fd, I can see that it makes things a bit easier on the management side, and it's easy to add. But it looks more like a hack to me. If I read it right, Podman intentionally "leaks" the fd opened as rootlessSlirpSyncW, which is silently passed to conmon on fork/exec. Right? I see that conmon.pid is written out to some userdata/ dir. Why not also track the other helpers this way?

@AkihiroSuda
Member Author

As long as --dbus remains optional, I suppose you don't care then?

I don't care about it for rootless containers. But if we want the slirp-helper spec to be adopted by virtual machine platforms (QEMU, VirtualBox, Bochs, BasiliskII, SheepShaver, SIMH...) as well, we need to make sure the spec can be easily implemented on non-Linux platforms such as macOS and Windows.

@elmarco

elmarco commented Mar 12, 2020

As long as --dbus remains optional, I suppose you don't care then?

I don't care about it for rootless containers. But if we want the slirp-helper spec to be adopted by virtual machine platforms (QEMU, VirtualBox, Bochs, BasiliskII, SheepShaver, SIMH...) as well, we need to make sure the spec can be easily implemented on non-Linux platforms such as macOS and Windows.

Given that the bus is explicitly optional, and DBus p2p must be supported, a simple stream (TCP or a named pipe or whatever) is enough to communicate with the helper on any platform. But admittedly, choosing DBus is based on a preference for Linux systems.

Also, nothing prevents extending the spec with additional IPC mechanisms, if necessary.

@AkihiroSuda
Member Author

dbus can't cross over the user namespace border? 😭

$ unshare -r dbus-monitor --address unix:path=/run/user/1001/bus
Failed to register connection to bus at unix:path=/run/user/1001/bus: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

@elmarco

elmarco commented Apr 1, 2020

dbus can't cross over the user namespace border? 😭

$ unshare -r dbus-monitor --address unix:path=/run/user/1001/bus
Failed to register connection to bus at unix:path=/run/user/1001/bus: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

It's a plain UNIX socket, but there are some default security/credential restrictions that may apply here. What are you trying to achieve? You want a container to access the host bus? That sounds wrong.

FWIW, I think Flatpak has done a lot of thinking around how the host and container buses can communicate.

@AkihiroSuda AkihiroSuda removed this from the v0.15.0 milestone Jan 22, 2021
@AkihiroSuda AkihiroSuda marked this pull request as draft January 22, 2021 17:06
@AkihiroSuda AkihiroSuda closed this Oct 1, 2021
Development

Successfully merging this pull request may close these issues.

support slirpnetstack (as a binary & as a Go pkg)
6 participants