Skip to content

Conversation

@haseebsyed12
Copy link
Contributor

@haseebsyed12 haseebsyed12 commented Oct 24, 2025

Automatic synchronization of baremetal server hardware inventory from OpenStack Ironic to Nautobot when servers complete the inspection phase during enrollment.
By leveraging Ironic's existing hardware inspection process, Oslo event notifications, Argo Events and Argo Workflows, we can automatically populate Nautobot.

@haseebsyed12 haseebsyed12 force-pushed the puc-1186-capture-inspect-oslo-event branch 8 times, most recently from 9bfe0b7 to 4cb0fd4 Compare October 31, 2025 09:32
@haseebsyed12 haseebsyed12 changed the title feat: add missing device or port to Nautobot when inspection happens feat: Create/Update Nautobot Device and Network Interfaces using Ironic Inventory data Oct 31, 2025
@haseebsyed12 haseebsyed12 marked this pull request as ready for review October 31, 2025 09:33
@haseebsyed12 haseebsyed12 requested a review from a team October 31, 2025 09:34
@haseebsyed12 haseebsyed12 force-pushed the puc-1186-capture-inspect-oslo-event branch from 4cb0fd4 to b4a9792 Compare October 31, 2025 17:31
Copy link
Contributor

@stevekeay stevekeay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a partial review - I think this has borrowed some parts of the enrol process such as validations that are not really appropriate to perform here, since the node has already enrolled and it's too late to do anything about it.

I think this code should concentrate on doing one thing well which is too keep nautobot in sync as well as it can, and in a timely manner.

Validation and reporting probably belong elsewhere. That said, there is nothing wrong with noticing that there is a problem and logging it, or even updating nautobot to reflect the error state. However we should really have a use-case for doing this and I would prefer to see us use the openstack API to report/audit issues.

#### Device Creation Process

1. **Switch Discovery**
- Identifies switches using LLDP MAC addresses
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ironic has the switch hostname in baremetal port local_link attribute. Should we be using that? It is a real pain to gather the switch MAC address and populate these into nautobot in the first place, and we only ever did that to work around iDRAC's partial LLDP information. If we are using agent-based inspection then we have the full LLDP data available, including the hostname.

1. **Switch Discovery**
- Identifies switches using LLDP MAC addresses
- Queries Nautobot for devices with matching `chassis_mac_address` custom field
- Validates all switches are in the same location/rack
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct today, but I hard a rumor that maybe a server could be connected to switches in a neighboring rack, therefore a server could (in future) legitimately be connected to switches in two different racks.

2. **Device Lookup**
- Searches for existing device by serial number
- If not found, creates new device with:
- Status: "Planned"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Ironic was able to Inspect the server, I would say it has gotten past the planning phase. Can we check the reasoning for this Status value?

Comment on lines 175 to 173
- BMC interfaces: `1000base-t`
- Server interfaces: `25gbase-x-sfp28`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand you are just documenting existing behaviour, and you didn't decide on these values here. But these assumptions that are not true for all our existing hardware. If we don't know the exact interface type it might be better to set this to something generic, or if that is not available, then something obviously incorrect like like "10baseT".


4. **Cable Documentation**
- Creates cables between server and switch interfaces
- Uses LLDP data to identify correct switch ports
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was getting this information from Ironic?

MIN_REQUIRED_NEIGHBOR_COUNT = 3
```

If fewer than 3 neighbors are found, the workflow fails with a detailed error message showing which interfaces have LLDP data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who specified this behaviour? Shouldn't nautobot be updated with whatever ironic says? If we want to report on bad cabling then nautobot might be the way that a DC tech "views" the problem.


### Location Consistency

All connected switches must be in the same location and rack. If switches span multiple locations, the workflow fails to prevent topology errors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the desired behaviour? Shouldn't nautobot be updated with whatever ironic says?


### IP Address Conflicts

The system detects and prevents:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the IP is already listed in ironic then the duplicate has NOT been prevented.

The system can detect the duplicate but what does it DO with this information?

If bNautobot doesn't allow duplicate IPs, we just log the error and fail to create the IPAddress in nautobot. But then we have not documented this IP, so perhaps it should delete the conflicting IP address assignment and replace it with the new information?


### Switch Port Conflicts

When creating cables, the system validates:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again what does it do if it finds something is invalid?

- **Port ID (Type 2)**: Remote switch port name
- **Port Description (Type 4)**: Alternative port identifier

The system requires a minimum of 3 LLDP neighbors to be discovered for validation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the behaviour if we have an incomplete topology is to simply not bother updating nautobot? Wouldn't it be better to log the situation and continue, or maybe even set some kind of flag in nautobot like "NODE_WITH_INADEQUATE_NETWORK_TOPOLOGY"

@stevekeay
Copy link
Contributor

Also, this duplicates some of the functionality in #1347 where we do the LLDP parsing.

cardoe
cardoe previously requested changes Nov 3, 2025
@haseebsyed12 haseebsyed12 force-pushed the puc-1186-capture-inspect-oslo-event branch from b4a9792 to 0894888 Compare November 4, 2025 05:29
@haseebsyed12 haseebsyed12 requested a review from cardoe November 4, 2025 05:33
@haseebsyed12 haseebsyed12 force-pushed the puc-1186-capture-inspect-oslo-event branch 6 times, most recently from 5c7c359 to ca3a190 Compare November 6, 2025 07:05
@haseebsyed12 haseebsyed12 changed the title feat: Create/Update Nautobot Device and Network Interfaces using Ironic Inventory data feat: configure ironic inspection notifications for nautobot integration Nov 6, 2025
@haseebsyed12 haseebsyed12 force-pushed the puc-1186-capture-inspect-oslo-event branch 4 times, most recently from 5987ac7 to 2522238 Compare November 6, 2025 16:30
Copy link
Contributor

@cardoe cardoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 from me but I'll defer to @stevekeay to approve since he had the most review feedback.

@cardoe cardoe dismissed their stale review November 6, 2025 22:55

I'm good with these changes now.

@cardoe
Copy link
Contributor

cardoe commented Nov 6, 2025

@haseebsyed12 you'll want to clean up your commit messages.

@cardoe
Copy link
Contributor

cardoe commented Nov 6, 2025

You also need to rebase to not fail on the python dependencies change.

@haseebsyed12 haseebsyed12 force-pushed the puc-1186-capture-inspect-oslo-event branch from 2522238 to 79afa8e Compare November 7, 2025 01:08
@haseebsyed12 haseebsyed12 requested a review from cardoe November 7, 2025 14:09
@haseebsyed12 haseebsyed12 force-pushed the puc-1186-capture-inspect-oslo-event branch from 79afa8e to ab52bb0 Compare November 7, 2025 17:11
Copy link
Contributor

@cardoe cardoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks okay to me @stevekeay I'll defer to you to approve since you have a few outstanding review items.

return linux_interface_name


def parse_lldp_data(lldp_raw: list[list]) -> dict[str, str | None]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ironic has a built-in post-inspection hook that augments the node inventory with parsed LLDP data.

I was able to enable this in undercloud-deploy/.../helm-configs/ironic.yaml:

    inspector:
      ...
      hooks: "$default_hooks,parse-lldp,resource-class,update-baremetal-port"

I tried it with agent inspection, where it added the following key to the inventory. With redfish inspection we don't get any LLDP data, but Nidhi was going to synthesise some based on what we fetch from redfish, not sure yet how all of this is going to hang together:

  "plugin_data": {
    ...
    "parsed_lldp": {
      "eno3np0": {
        "switch_chassis_id": "c4:7e:e0:e4:55:3f",
        "switch_port_id": "Ethernet1/1",
        "switch_port_description": "Dell-93GSW04_NIC.Integrated.1-1",
        "switch_system_name": "f20-3-1.iad3",
        "switch_port_mtu": 9000,
        "switch_port_vlans": [
          {
            "name": "4010_provisioning",
            "id": 4010
          }
        ],
        "switch_port_link_aggregation_enabled": false,
        "switch_port_link_aggregation_support": true,
        "switch_port_link_aggregation_id": 0
      },

Note that we should only rely on switch_chassis_id, switch_system_name and switch_port_id. Per rackspace security standard, all the other fields ought to be disabled on the switch and are therefore likely to disappear at some point.

Comment on lines 173 to 197
class Inventory:
"""Data source that provides chassis information from Ironic inspection data."""

def __init__(self, inspection_data: dict):
self.inspection_data = inspection_data
self._chassis_info: ChassisInfo | None = None

def get_chassis_info(self) -> ChassisInfo:
if self._chassis_info is None:
self._chassis_info = chassis_info_from_ironic_data(self.inspection_data)
return self._chassis_info

@property
def ip_address(self) -> str:
return self.inspection_data["inventory"]["bmc_address"]

@property
def hostname(self) -> str:
return self.inspection_data["inventory"]["hostname"]


def get_device_info(inspection_data: dict) -> ChassisInfo:
try:
data_source = Inventory(inspection_data)
chassis_info = data_source.get_chassis_info()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What am I missing here? It looks like class Inventory does not actually do anything.

Suggested change
class Inventory:
"""Data source that provides chassis information from Ironic inspection data."""
def __init__(self, inspection_data: dict):
self.inspection_data = inspection_data
self._chassis_info: ChassisInfo | None = None
def get_chassis_info(self) -> ChassisInfo:
if self._chassis_info is None:
self._chassis_info = chassis_info_from_ironic_data(self.inspection_data)
return self._chassis_info
@property
def ip_address(self) -> str:
return self.inspection_data["inventory"]["bmc_address"]
@property
def hostname(self) -> str:
return self.inspection_data["inventory"]["hostname"]
def get_device_info(inspection_data: dict) -> ChassisInfo:
try:
data_source = Inventory(inspection_data)
chassis_info = data_source.get_chassis_info()
def get_device_info(inspection_data: dict) -> ChassisInfo:
try:
chassis_info = chassis_info_from_ironic_data(inspection_data)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the Inventory class and simplified the get_device_info function to directly call chassis_info_from_ironic_data()


except Exception as e:
hostname = inspection_data.get("inventory", {}).get("hostname", "unknown")
logger.error("Failed to enroll server from Ironic data for %s: %s", hostname, e)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this message say enroll? This is not nessecarily happening at enrol time, is it?

node_inventory = ironic_client.get_node_inventory(
node_ident=str(payload_data["uuid"])
)
device_info: ChassisInfo = get_device_info(node_inventory)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
device_info: ChassisInfo = get_device_info(node_inventory)
device_info = get_device_info(node_inventory)

Comment on lines 16 to 30
# Pattern 1: Embedded interfaces (eno8303, eno8403)
# Must check this FIRST before other eno patterns
match = re.match(r"^eno8(\d)03$", linux_interface_name)
if match:
interface_num = int(match.group(1))
nic_num = interface_num - 2
return f"NIC.Embedded.{nic_num}-1-1"

# Pattern 2: Integrated with PHY (eno3np0, eno4np1)
match = re.match(r"^eno(\d+)n?p(\d+)$", linux_interface_name)
if match:
slot_num = int(match.group(1))
port_num = slot_num - 2
return f"NIC.Integrated.1-{port_num}"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These interface names are specific to Dell servers and might be misleading if e.g. a HP server also has an eno8*03 or an eno*p*. I feel like this ought to return linux_interface_name if we're not on a dell server.

Comment on lines 37 to 43
# Pattern 4: Slot interfaces (ens2f0np0, ens2f1np1)
match = re.match(r"^ens\d+f(\d+)n?p\d+$", linux_interface_name)
if match:
func_num = int(match.group(1))
port_num = func_num + 1
return f"NIC.Slot.1-{port_num}"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be tempted to just write this out the dumb way:

INTERFACE_NAME_MAPPING = {
  "eno8303": "NIC.Embedded.1-1-1",
  "eno8403": "NIC.Embedded.2-1-1",
  "ens2f0np0": "NIC.Slot.1-1",
  ...etc...
}

INTERFACE_NAME_MAPPING.get(linux_interface_name, linux_interface_name)

@stevekeay
Copy link
Contributor

It seems like this duplicates some of understack/python/understack-workflows/understack_workflows/oslo_event/ironic_port.py - how should we address that?

Created an inventory parser that transforms Ironic data into Nautobot-compatible format
Supporting multiple handlers per event type so both storage setup and device sync can happen on the same event
@haseebsyed12 haseebsyed12 force-pushed the puc-1186-capture-inspect-oslo-event branch from ab52bb0 to ca36b43 Compare November 12, 2025 16:36
@cardoe
Copy link
Contributor

cardoe commented Nov 13, 2025

So looking at https://docs.openstack.org/ironic/latest/admin/notifications.html wouldn't we really just want to trigger on:

  • baremetal.node.update.end and possibly baremetal.node.create.end though I suspect the node will not be created with enough info in it and we'd either create or update the node inside of Nautobot
  • baremetal.node.delete.end we would remove the node from Nautobot
  • baremetal.port.update.end and possibly bare metal.port.create.end would create/update the port in Nautobot
  • baremetal.port.delete.end would clean up said port
  • baremetal.portgroup.update.end and baremetal.portgroup.create.end same way
  • baremetal.portgroup.delete.end same

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants