-
Notifications
You must be signed in to change notification settings - Fork 7
feat: configure ironic inspection notifications for nautobot integration #1361
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
9bfe0b7 to
4cb0fd4
Compare
4cb0fd4 to
b4a9792
Compare
stevekeay
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a partial review - I think this has borrowed some parts of the enrol process such as validations that are not really appropriate to perform here, since the node has already enrolled and it's too late to do anything about it.
I think this code should concentrate on doing one thing well which is too keep nautobot in sync as well as it can, and in a timely manner.
Validation and reporting probably belong elsewhere. That said, there is nothing wrong with noticing that there is a problem and logging it, or even updating nautobot to reflect the error state. However we should really have a use-case for doing this and I would prefer to see us use the openstack API to report/audit issues.
| #### Device Creation Process | ||
|
|
||
| 1. **Switch Discovery** | ||
| - Identifies switches using LLDP MAC addresses |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ironic has the switch hostname in baremetal port local_link attribute. Should we be using that? It is a real pain to gather the switch MAC address and populate these into nautobot in the first place, and we only ever did that to work around iDRAC's partial LLDP information. If we are using agent-based inspection then we have the full LLDP data available, including the hostname.
| 1. **Switch Discovery** | ||
| - Identifies switches using LLDP MAC addresses | ||
| - Queries Nautobot for devices with matching `chassis_mac_address` custom field | ||
| - Validates all switches are in the same location/rack |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is correct today, but I hard a rumor that maybe a server could be connected to switches in a neighboring rack, therefore a server could (in future) legitimately be connected to switches in two different racks.
| 2. **Device Lookup** | ||
| - Searches for existing device by serial number | ||
| - If not found, creates new device with: | ||
| - Status: "Planned" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If Ironic was able to Inspect the server, I would say it has gotten past the planning phase. Can we check the reasoning for this Status value?
| - BMC interfaces: `1000base-t` | ||
| - Server interfaces: `25gbase-x-sfp28` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand you are just documenting existing behaviour, and you didn't decide on these values here. But these assumptions that are not true for all our existing hardware. If we don't know the exact interface type it might be better to set this to something generic, or if that is not available, then something obviously incorrect like like "10baseT".
|
|
||
| 4. **Cable Documentation** | ||
| - Creates cables between server and switch interfaces | ||
| - Uses LLDP data to identify correct switch ports |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it was getting this information from Ironic?
| MIN_REQUIRED_NEIGHBOR_COUNT = 3 | ||
| ``` | ||
|
|
||
| If fewer than 3 neighbors are found, the workflow fails with a detailed error message showing which interfaces have LLDP data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who specified this behaviour? Shouldn't nautobot be updated with whatever ironic says? If we want to report on bad cabling then nautobot might be the way that a DC tech "views" the problem.
|
|
||
| ### Location Consistency | ||
|
|
||
| All connected switches must be in the same location and rack. If switches span multiple locations, the workflow fails to prevent topology errors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the desired behaviour? Shouldn't nautobot be updated with whatever ironic says?
|
|
||
| ### IP Address Conflicts | ||
|
|
||
| The system detects and prevents: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the IP is already listed in ironic then the duplicate has NOT been prevented.
The system can detect the duplicate but what does it DO with this information?
If bNautobot doesn't allow duplicate IPs, we just log the error and fail to create the IPAddress in nautobot. But then we have not documented this IP, so perhaps it should delete the conflicting IP address assignment and replace it with the new information?
|
|
||
| ### Switch Port Conflicts | ||
|
|
||
| When creating cables, the system validates: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again what does it do if it finds something is invalid?
| - **Port ID (Type 2)**: Remote switch port name | ||
| - **Port Description (Type 4)**: Alternative port identifier | ||
|
|
||
| The system requires a minimum of 3 LLDP neighbors to be discovered for validation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the behaviour if we have an incomplete topology is to simply not bother updating nautobot? Wouldn't it be better to log the situation and continue, or maybe even set some kind of flag in nautobot like "NODE_WITH_INADEQUATE_NETWORK_TOPOLOGY"
|
Also, this duplicates some of the functionality in #1347 where we do the LLDP parsing. |
components/site-workflows/sensors/sensor-ironic-oslo-inspecting-event.yaml
Show resolved
Hide resolved
python/understack-workflows/understack_workflows/main/update_nautobot.py
Outdated
Show resolved
Hide resolved
workflows/argo-events/workflowtemplates/update-nautobot-on-openstack-oslo-event.yaml
Outdated
Show resolved
Hide resolved
b4a9792 to
0894888
Compare
5c7c359 to
ca3a190
Compare
5987ac7 to
2522238
Compare
cardoe
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 from me but I'll defer to @stevekeay to approve since he had the most review feedback.
|
@haseebsyed12 you'll want to clean up your commit messages. |
|
You also need to rebase to not fail on the python dependencies change. |
2522238 to
79afa8e
Compare
79afa8e to
ab52bb0
Compare
cardoe
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks okay to me @stevekeay I'll defer to you to approve since you have a few outstanding review items.
| return linux_interface_name | ||
|
|
||
|
|
||
| def parse_lldp_data(lldp_raw: list[list]) -> dict[str, str | None]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ironic has a built-in post-inspection hook that augments the node inventory with parsed LLDP data.
I was able to enable this in undercloud-deploy/.../helm-configs/ironic.yaml:
inspector:
...
hooks: "$default_hooks,parse-lldp,resource-class,update-baremetal-port"I tried it with agent inspection, where it added the following key to the inventory. With redfish inspection we don't get any LLDP data, but Nidhi was going to synthesise some based on what we fetch from redfish, not sure yet how all of this is going to hang together:
"plugin_data": {
...
"parsed_lldp": {
"eno3np0": {
"switch_chassis_id": "c4:7e:e0:e4:55:3f",
"switch_port_id": "Ethernet1/1",
"switch_port_description": "Dell-93GSW04_NIC.Integrated.1-1",
"switch_system_name": "f20-3-1.iad3",
"switch_port_mtu": 9000,
"switch_port_vlans": [
{
"name": "4010_provisioning",
"id": 4010
}
],
"switch_port_link_aggregation_enabled": false,
"switch_port_link_aggregation_support": true,
"switch_port_link_aggregation_id": 0
},Note that we should only rely on switch_chassis_id, switch_system_name and switch_port_id. Per rackspace security standard, all the other fields ought to be disabled on the switch and are therefore likely to disappear at some point.
| class Inventory: | ||
| """Data source that provides chassis information from Ironic inspection data.""" | ||
|
|
||
| def __init__(self, inspection_data: dict): | ||
| self.inspection_data = inspection_data | ||
| self._chassis_info: ChassisInfo | None = None | ||
|
|
||
| def get_chassis_info(self) -> ChassisInfo: | ||
| if self._chassis_info is None: | ||
| self._chassis_info = chassis_info_from_ironic_data(self.inspection_data) | ||
| return self._chassis_info | ||
|
|
||
| @property | ||
| def ip_address(self) -> str: | ||
| return self.inspection_data["inventory"]["bmc_address"] | ||
|
|
||
| @property | ||
| def hostname(self) -> str: | ||
| return self.inspection_data["inventory"]["hostname"] | ||
|
|
||
|
|
||
| def get_device_info(inspection_data: dict) -> ChassisInfo: | ||
| try: | ||
| data_source = Inventory(inspection_data) | ||
| chassis_info = data_source.get_chassis_info() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What am I missing here? It looks like class Inventory does not actually do anything.
| class Inventory: | |
| """Data source that provides chassis information from Ironic inspection data.""" | |
| def __init__(self, inspection_data: dict): | |
| self.inspection_data = inspection_data | |
| self._chassis_info: ChassisInfo | None = None | |
| def get_chassis_info(self) -> ChassisInfo: | |
| if self._chassis_info is None: | |
| self._chassis_info = chassis_info_from_ironic_data(self.inspection_data) | |
| return self._chassis_info | |
| @property | |
| def ip_address(self) -> str: | |
| return self.inspection_data["inventory"]["bmc_address"] | |
| @property | |
| def hostname(self) -> str: | |
| return self.inspection_data["inventory"]["hostname"] | |
| def get_device_info(inspection_data: dict) -> ChassisInfo: | |
| try: | |
| data_source = Inventory(inspection_data) | |
| chassis_info = data_source.get_chassis_info() | |
| def get_device_info(inspection_data: dict) -> ChassisInfo: | |
| try: | |
| chassis_info = chassis_info_from_ironic_data(inspection_data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed the Inventory class and simplified the get_device_info function to directly call chassis_info_from_ironic_data()
|
|
||
| except Exception as e: | ||
| hostname = inspection_data.get("inventory", {}).get("hostname", "unknown") | ||
| logger.error("Failed to enroll server from Ironic data for %s: %s", hostname, e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this message say enroll? This is not nessecarily happening at enrol time, is it?
| node_inventory = ironic_client.get_node_inventory( | ||
| node_ident=str(payload_data["uuid"]) | ||
| ) | ||
| device_info: ChassisInfo = get_device_info(node_inventory) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| device_info: ChassisInfo = get_device_info(node_inventory) | |
| device_info = get_device_info(node_inventory) |
| # Pattern 1: Embedded interfaces (eno8303, eno8403) | ||
| # Must check this FIRST before other eno patterns | ||
| match = re.match(r"^eno8(\d)03$", linux_interface_name) | ||
| if match: | ||
| interface_num = int(match.group(1)) | ||
| nic_num = interface_num - 2 | ||
| return f"NIC.Embedded.{nic_num}-1-1" | ||
|
|
||
| # Pattern 2: Integrated with PHY (eno3np0, eno4np1) | ||
| match = re.match(r"^eno(\d+)n?p(\d+)$", linux_interface_name) | ||
| if match: | ||
| slot_num = int(match.group(1)) | ||
| port_num = slot_num - 2 | ||
| return f"NIC.Integrated.1-{port_num}" | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These interface names are specific to Dell servers and might be misleading if e.g. a HP server also has an eno8*03 or an eno*p*. I feel like this ought to return linux_interface_name if we're not on a dell server.
| # Pattern 4: Slot interfaces (ens2f0np0, ens2f1np1) | ||
| match = re.match(r"^ens\d+f(\d+)n?p\d+$", linux_interface_name) | ||
| if match: | ||
| func_num = int(match.group(1)) | ||
| port_num = func_num + 1 | ||
| return f"NIC.Slot.1-{port_num}" | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be tempted to just write this out the dumb way:
INTERFACE_NAME_MAPPING = {
"eno8303": "NIC.Embedded.1-1-1",
"eno8403": "NIC.Embedded.2-1-1",
"ens2f0np0": "NIC.Slot.1-1",
...etc...
}
INTERFACE_NAME_MAPPING.get(linux_interface_name, linux_interface_name)
|
It seems like this duplicates some of understack/python/understack-workflows/understack_workflows/oslo_event/ironic_port.py - how should we address that? |
Created an inventory parser that transforms Ironic data into Nautobot-compatible format Supporting multiple handlers per event type so both storage setup and device sync can happen on the same event
ab52bb0 to
ca36b43
Compare
|
So looking at https://docs.openstack.org/ironic/latest/admin/notifications.html wouldn't we really just want to trigger on:
|
Automatic synchronization of baremetal server hardware inventory from OpenStack Ironic to Nautobot when servers complete the inspection phase during enrollment.
By leveraging Ironic's existing hardware inspection process, Oslo event notifications, Argo Events and Argo Workflows, we can automatically populate Nautobot.