Route traffic into rootlesskit netns #173

Open
maybe-sybr opened this issue Sep 16, 2020 · 7 comments
Labels
enhancement New feature or request

Comments

@maybe-sybr

I'm using u7s to mock up a bare-metal cluster deployment and it would be nice to be able to easily enter an environment where I'm able to route traffic to IPs inside a rootlesskit netns. In my head, this would probably involve a throwaway netns with a veth to pass traffic to the sibling (target) netns (somehow bridged as well?), and a set of routes from the CLI (e.g. cluster IP ranges from u7s, plus extra ranges specified by a user). It would make sense for this throwaway netns to also be capable of routing out to the host/onward and making use of host DNS by default.

Is there currently a nice way of doing something like this? If not, would it be straightforward to implement?

@maybe-sybr
Author

I attempted to do this myself by binding the rootlesskit netns into a named one for ip netns (just to make my life easier), poking one end of a veth into it and then setting that end's master to cni0 (although IIUC this would expose internal cluster IPs as well). I managed to get traffic into the netns via a gateway IP on the outside, but didn't work out the routing/filtering inside the rootlesskit netns to actually be able to curl a service running on the cluster :(


@AkihiroSuda AkihiroSuda added enhancement New feature or request question Further information is requested labels Sep 16, 2020
@maybe-sybr
Author

maybe-sybr commented Sep 17, 2020

It'd be an alternative to the current helpers we have in rootlessctl.sh. As it stands, we can expose single ports bound by the cluster node in u7s using one of:

  • rootlessctl.sh add-ports,
  • kubectl port-forward (works for non-NodePort services as well IIRC) or
  • socat trickery as documented.
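
For illustration, the first two would look roughly like this (the port numbers and service name are placeholders, not taken from the u7s docs):

# hypothetical examples only
$ rootlessctl.sh add-ports "0.0.0.0:8080:80/tcp"
$ kubectl port-forward svc/some-service 8080:80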

The issue is that we have to deconflict ports both inside the rootlesskit netns and outside in the main netns. We already have IP networking for both the CNI subnet and the ClusterIP subnet available inside the rootlesskit netns, and if I nsenter it, I can curl all of the pods and services using their assigned IP addresses.

It would be nice if we could simply route traffic to/from those ranges without having to enter the main rootlesskit netns. Specifically, it would be useful and roughly representative of a cloud cluster if ClusterIPs and LoadBalancer IPs (e.g. allocated by something like metallb) could be easily routed to by a naive user. I know that ClusterIPs would normally not be accessible by a user, but allowing that for u7s would make it much easier to work with since it would let them avoid running metallb or some similar infrastructural service.

I'd see this working something like:

$ rootlessctl.sh bridge --extra-subnet 10.20.30.0/24 --extra-subnet 10.30.0.0/16
(netns=some_throwaway) $ curl http://10.20.30.1
...
(netns=some_throwaway) $ dig some-service.default.svc.cluster.local @10.0.0.53
NOERROR
...

I would see the actual implementation looking something like the following steps:

  • enter a new throwaway netns
  • create a veth pair
  • poke one end of veth into rootlesskit cluster netns
  • bridge veth end in rootlesskit netns to cni0 probably
  • assign unique /30 IPs to veth ends
  • add routes in throwaway netns to clusterIP range and extra subnets via other end of the veth's IP?
  • add a route back to the user end of the veth in the rootlesskit netns?

I didn't get the routing working when I tried it. It's been a little while since I worked with veths and bridging, so it's probably my fault. I also think it would be sensible to be able to call out of the throwaway netns to the host/internet, but I've not described that in the steps above. It's also worth noting that the throwaway netns is purely so the user can do this rootless as well. Having a rootful version which bridges the main system netns to the cluster might be interesting for convenience.

So does this sound workable? I'm happy to make a PR if I can get it working, but might need some tips on why I can't route like I expected.

Edit: So I've just realised that obviously the throwaway netns won't work nicely as I laid out above because we can't push one of the veth ends into a sibling netns. I suppose we could set up the veths in the rootlesskit namespace, then push one end into a throwaway child netns of the RK one, then push the user into that throwaway netns?

For now I've mocked this up creating a veth in the main system netns with root, then pushing one end into each of the throwaway and rootlesskit namespaces. Setting a /30 on each end of the veth and adding a route to 10.0.0.0/24 via the IP assigned to the veth in the rootlesskit namespace allows me to reach services by their assigned ClusterIP :)

Edit 2: Working flow:

# Terminal 1
(ns/host) $ export RK_PID="$(cat "${XDG_RUNTIME_DIR}/usernetes/rootlesskit/child_pid")"
(ns/host) $ nsenter -U --preserve-credentials -n -t "${RK_PID}"
(ns/cluster) $ ip link add rk-veth type veth peer name usr-veth
(ns/cluster) $ ip link set up rk-veth
(ns/cluster) $ ip link set up usr-veth
(ns/cluster) $ ip addr add 10.20.30.1/30 dev rk-veth
(ns/cluster) $ ip addr add 10.20.30.2/30 dev usr-veth

# Terminal 2
(ns/host) $ nsenter -U --preserve-credentials -n -t "${RK_PID}"
(ns/cluster) $ unshare -U -n --map-root-user
(ns/bridge) $ echo $$
<CHILD_PID>

# Terminal 3 - I think this shouldn't need to be rootful but I couldn't do it in the RK namespace
(ns/host) $ sudo mkdir -p /run/netns; sudo touch /run/netns/usr-bridge
(ns/host) $ sudo mount --bind /proc/<CHILD_PID>/ns/net /run/netns/usr-bridge

# Back to terminal 1
(ns/cluster) $ ip link set usr-veth netns usr-bridge

# Back to terminal 2
(ns/bridge) $ ip route add 10.0.0.0/24 via 10.20.30.1 dev usr-veth
(ns/bridge) $ curl -vv <some_service_ip>
hello world!

We can then enter the bridged namespace from the host at will in other terminals and things work as expected. When I bridged the rk-veth to cni0 last time, I wasn't actually able to do IP routing, which explains why I was getting confused. By just adding a route via the IP on the rk-veth from the usr-veth side, the kernel's IP forwarding stack takes care of it for me.
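
One caveat (my assumption, not something verified above): the route-based approach relies on IP forwarding being enabled inside the rootlesskit netns, which the CNI/kube-proxy setup normally turns on already. If the route is in place but replies never come back, it's worth checking:

# inside the rootlesskit netns; should report 1
(ns/cluster) $ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1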

@AkihiroSuda
Member

Terminal 3 - I think this shouldn't need to be rootful but I couldn't do it in the RK namespace

This could be substituted by running rootlesskit with --copy-up=/run, and then running rm -f /run/netns in the ns.
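
Rough sketch of that idea (untested; assumes --copy-up=/run makes /run writable inside the namespace, with <CHILD_PID> being the PID from terminal 2 above):

(ns/cluster) $ rm -rf /run/netns
(ns/cluster) $ mkdir -p /run/netns && touch /run/netns/usr-bridge
(ns/cluster) $ mount --bind /proc/<CHILD_PID>/ns/net /run/netns/usr-bridge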

@maybe-sybr
Author

maybe-sybr commented Sep 17, 2020

Terminal 3 - I think this shouldn't need to be rootful but I couldn't do it in the RK namespace

This could be substituted by running rootlesskit with --copy-up=/run, and then running rm -f /run/netns in the ns.

I actually just avoided this by realising I could specify the target PID instead of a named netns. Didn't know that was a thing. Here is a bridge.sh script which works as advertised:

#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

SCRIPT_PATH="$(realpath -e "${BASH_SOURCE[0]}")"
SCRIPT_DIR="${SCRIPT_PATH%/*}"

UNSHARE_CMD=("unshare" "-U" "--map-root-user" "-m" "-n")
NSENTER_CMD=(
    "nsenter" "-U" "--preserve-credential" "-m" "-n" "--wd=${PWD}" "-t"
)

if [ -z "${_BR_REEXEC_OUTER:+SET}" ]; then
    _RK_PID="$(cat "${XDG_RUNTIME_DIR}/usernetes/rootlesskit/child_pid")"
    if [ -z "${_RK_PID}" ]; then
        echo "No rootlesskit PID found"
        exit 127
    fi
    "${NSENTER_CMD[@]}" "${_RK_PID}" env _BR_REEXEC_OUTER=true "${SCRIPT_PATH}"
    exit $?
else
    ip link add rk-veth type veth peer name usr-veth
    ip link set up rk-veth
    ip addr add 10.20.30.1/30 dev rk-veth
    "${UNSHARE_CMD[@]}" sleep infinity &
    SLEEP_PID=$!
    # Set up a teardown function
    declare -a TEARDOWN_PIDS=( "${SLEEP_PID}" )
    function stop () {
        if [ "${#TEARDOWN_PIDS[@]}" -gt 0 ]; then
            kill "${TEARDOWN_PIDS[@]}"
        fi
    }
    trap exit SIGINT SIGTERM
    trap stop EXIT ERR
    # Pause to ensure that unshare has re-execed
    while [                                                                 \
        "$(stat -c "%i" /proc/self/ns/net)" ==                              \
        "$(stat -c "%i" "/proc/${SLEEP_PID}/ns/net")"                       \
    ]; do sleep 0.1; done
    # Run a slirp in the bridge netns
    slirp4netns --configure --mtu=65520 --disable-host-loopback             \
        "${SLEEP_PID}" tap0 &
    TEARDOWN_PIDS+=( "$!" )
    # Send the user end of the veth into the bridge netns
    ip link set usr-veth netns "${SLEEP_PID}"
    # Run some network setup commands
    cmds=(
        "ip link set up usr-veth;"
        "ip addr add 10.20.30.2/30 dev usr-veth;"
        "ip route add 10.0.0.0/24 via 10.20.30.1 dev usr-veth;"
    )
    "${NSENTER_CMD[@]}" "${SLEEP_PID}" "${SHELL:-bash}" -c "${cmds[*]}"
    "${NSENTER_CMD[@]}" "${SLEEP_PID}" cat >/etc/bridge.resolv.conf <<EOF
nameserver 10.0.0.53
search cluster.local
EOF
    "${NSENTER_CMD[@]}" "${SLEEP_PID}" mount -o ro,bind /etc/bridge.resolv.conf /etc/resolv.conf
    # Finally run a shell for the user
    "${NSENTER_CMD[@]}" "${SLEEP_PID}" "${SHELL:-bash}"
    exit $?
fi

There are a few things to change in there if it were to be runnable more than once at a time, e.g. dynamic veth names and IPs, but it should be a solid base. I'm going to move on to a few other things for the day now, since I can now see how well this will fit into the rest of my workflow. If you could let me know what you think about getting something like this merged in, I'd appreciate it! I'll keep an eye on this issue :)
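
For the record, one way to make the names/addresses unique per invocation might be something like this (untested sketch; the suffix scheme is made up, and interface names need to stay under 16 characters):

# e.g. near the top of bridge.sh
BR_ID="$(( $$ % 250 ))"        # short, reasonably unique suffix
RK_VETH="rkv${BR_ID}"
USR_VETH="usrv${BR_ID}"
RK_VETH_IP="10.20.${BR_ID}.1/30"
USR_VETH_IP="10.20.${BR_ID}.2/30"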

Edit: Added a slirp and cluster DNS

@AkihiroSuda
Member

Why do you need slirp in slirp?

@AkihiroSuda AkihiroSuda removed the question Further information is requested label Sep 17, 2020
@maybe-sybr
Author

maybe-sybr commented Sep 17, 2020

Hmm, you're right. That second slirp should probably run in the top level namespace, rather than in the cluster namespace. The intention was to avoid routing external traffic (i.e. traffic not destined for the cluster-related routes we explicitly add) from the child namespace via the main rootlesskit ns, but obviously by running the slirp in the rootlesskit ns, that's exactly what we're doing. Oops!

It seems like it might be a pain to get the PID of the sleep anchoring the child namespace up to the top level script though, so a cheaper way would be to add a default route via the veth as well - but I think the second slirp is a better way, if we can manage it.
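
For reference, the cheaper option would just be one extra route in the bridge netns (same addressing as the script above):

(ns/bridge) $ ip route add default via 10.20.30.1 dev usr-veth

...at the cost of sending all external traffic through the rootlesskit ns, which is what I wanted to avoid.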
