Skip to content

Introduce prefix allocation #2611

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

fmoehler
Copy link
Contributor

@fmoehler fmoehler commented Apr 15, 2025

What is this change about?

As described in RFC https://github.com/cloudfoundry/community/blob/main/toc/rfc/rfc-0038-ipv6-dual-stack-for-cf.md bosh shall be enabled to allocate prefix ip addresses (ipv4 and ipv6). Currently bosh only supports attaching ip addresses with a /32 (ipv4) or /128 (ipv6) prefix, which are single ip addresses. To apply these changes a new property called 'prefix' is introduced in the cloud config networks section. Please refer to the example below for a manual network:

- name: diego-cells-ipv6
  subnets:
  - az: z1
    cloud_properties:
      security_groups:
      - sg-0005f94257313417d
      - sg-06acfe8fb0a6247f0
      - sg-05a8b2b2e26ac1d5d
      - sg-064a667fb375e2dac
      - sg-01bb6e1e1f821fe4c
      subnet: subnet-0beae7541e0ebf5f1
    dns:
    - 2600:1f18:7415:8f00:0000:0000:0000:0253
    prefix: 80
    gateway: 2600:1f18:7415:8f00:0000:0000:0000:0001
    range: 2600:1f18:7415:8f00:0000:0000:0000:0000/56
    reserved:
    - 2600:1f18:7415:8f00:0000:0000:0000:0002
    - 2600:1f18:7415:8f00:0000:0000:0000:0003
    - 2600:1f18:7415:8f00:ffff:ffff:ffff:ffff

This example network tells bosh that instead of assigning a single ip address, to assign a cidr range. So to slash the /56 network into multiple /80 networks.

The ip address allocation from the bosh director is adapted to also consider these prefixes (previously the director was just counting up by 1)

One major change is how the ip addresses are stored inside the database. The address_str field in the ip_addresses table will change from storing the ip address as an integer representation of the ip to store the ip address in cidr notation. This change is necessary to not "lose" the prefix information when storing the ip address. Also it has the advantage that you can directly create an IpAddrOrCidr Object out of the string coming from the database.

This PR also changes the RPC Interface for the create_vm method. It will include a separate field called "prefix". We will send the Prefix information in a separate field to not break existing cpis. Older CPIs that do not support prefix allocation will just ignore this field. Below you can find an example network section of the create_vm call:

{"ha_proxy":{"type":"manual","ip":"10.0.72.1","prefix":"32","netmask":"255.255.224.0","cloud_properties":{"security_groups":"<redacted>","subnet":"<redacted>"},"default":["dns","gateway"],"dns":["10.0.0.2"],"gateway":"10.0.64.1"}

The prefix here is 32 indicating a single ip address.

What tests have you run against this PR?

Include a comprehensive list of all tests run successfully.

How should this change be described in bosh release notes?

Something brief that conveys the change and is written with the Operator audience in mind.
See previous release notes for examples.

Does this PR introduce a breaking change?

Does this introduce changes that would require operators to take care in order to upgrade without a failure?

Tag your pair, your PM, and/or team!

It's helpful to tag a few other folks on your team or your team alias in case we need to follow up later.

@peanball
Copy link

peanball commented Apr 20, 2025

Hi @aramprice, I just watched the recording of the foundational infra meeting from April 17th.

To clarify a bit:
Diego is currently using Silk as overlay network that is independent from the IP address(es) of the Diego VM itself.

The idea for IPv6 is to use as much native network routing as possible, i.e. not use an overlay network for IPv6. This means that the IP addresses that would be assigned to the containers will be delegated from an IPv6 prefix that is assigned to the Diego VM.

Your question about whether those IP addresses would move in case of an evacuation: no. These IP addresses are not "sticky" to the app or app instance and would not move.

The goal is to make the networking setup for Diego simpler by using "native" IPv6 addressing. Traffic aimed at a particular container will reach the Diego VM via its CIDR range, and then the Diego VM's kernel can forward the traffic to the container's virtual NIC.

Please also note that the networks are supposed to be dual stack, not pure IPv6. So you would want to assign multiple (at least one IPv4 and one IPv6) networks to the same VM, as @fmoehler mentioned in the call already.

The "prefix" parameter is also the size of the prefix to delegate to each VM from a larger range. Have a look at the discussion while creating the RFC for a more extensive example.

The VM is supposed to self-assign an address to itself. Usually this is the x:y:z::1 address (i.e. the "first" address in the provided CIDR range). Addressing the VM is done via that IP address. The "remaining" addresses are then to do for the VM as it wishes.

@fmoehler fmoehler force-pushed the introduce-prefix-allocation branch 8 times, most recently from b24cf94 to 354d3e8 Compare April 23, 2025 18:57
@fmoehler fmoehler force-pushed the introduce-prefix-allocation branch from 354d3e8 to 7cdb704 Compare April 23, 2025 19:58
@aramprice
Copy link
Member

Hi @peanball,

Thanks for that additional context - the info and the RFC were helpful.

In thinking about the overall changes to Bosh a few goals or principles (maybe too strong a word) came to mind:

  • Make conditional behavior / IP-version specific code and settings as limited as possible
    • I really like that prefix: is used and not ipv6_prefix:
    • Should an IPv4 network allow/require a prefix: (one that selects a single IP) for consistency?
  • Reduce the back and forth conversion / formatting where possible (format_ip(), ip.to_i and similar)
    • Ideally a single IpAddrOrCidr (I apologize for the name) or IpAddress representation would be used, and conversions would be handled only where need by the class itself, ex: #to_db_value - for persistence, or #to_s - for logs
  • Avoid assumption in the code that bosh will always be dual-stack
    • I realize the system will start out dual-stack but this may not be forever

This isn't to imply disagreement about the current state of things, only to captures my thoughts at the moment.

@fmoehler
Copy link
Contributor Author

@aramprice thanks for your feedback!

I just want to elaborate a little on our current idea, but of course this is open for ideas. Regarding your points:

  1. We will introduce a "prefix" property in the subnet section of the network (Maybe we can also put it to the network section itself, so that it applies automatically to all subnets?). This property will be considered for ipv4 and ipv6 networks. However if the property is not defined (e.g. for existing networks), the director will consider the "prefix" as /32 (ipv4) or /128 (ipv6) and maintain this accordingly in the database.
  2. Yes we will try to get rid of as many conversion as possible. Sometimes it can be handy to have the ips in integer representation, but most of the time we will pass them as IpAddrOrCidr Objects
  3. Agree with that. I did not test this (yet), but it might not even be specific for a dual stack setup. Probably it would also need changes in the bosh agent.

@peanball
Copy link

Thanks @aramprice, I fully agree with you there. The logical unit of "IP address", with or without netmask (i.e. prefix) should be supported in either scenario and contain the logic of representation.

As @fmoehler mentioned, omitting the prefix will default to a single address (previous behavior).

And there should not be assumptions about dual stack but dual stack should be possible. So far, BOSH supported either v4 or v6, not both at the same time. The mechanism to support more than one network is a classic n+1 problem and can be solved as such where possible.

We just happen to choose n=2 with one v4 and one v6 network. This should be the mindset.

@peanball
Copy link

@fmoehler I have a general comment on the term prefix in the config. It's a bit terse and does not convey the meaning very clearly.

In the end this is the prefix (size) delegated to each instance attached to this subnet, right? Can we somehow express this better? My suggestion would be

  • delegate_prefix
  • delegate_prefix_size
  • delegate_cidr_size

alternatively s/delegate/assign/g.

@fmoehler fmoehler force-pushed the introduce-prefix-allocation branch 2 times, most recently from 1156d02 to 9f17169 Compare May 2, 2025 19:02
@fmoehler fmoehler force-pushed the introduce-prefix-allocation branch from 9f17169 to 4b6f351 Compare May 6, 2025 13:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

3 participants