GCE datasource crashes if no NIC is able to obtain a DHCP lease #5997
bryanfraschetti added the labels bug (Something isn't working correctly) and new (An issue that still needs triage) on Jan 31, 2025
bryanfraschetti added a commit to bryanfraschetti/cloud-init that referenced this issue on Jan 31, 2025
This commit addresses issue canonical#5997, which reported crashes in init-local when cloud-init examined GCELocal as a potential datasource. When every NIC failed DHCP discovery, cloud-init attempted to log the failure by dereferencing a variable that was never assigned. The commit modifies the _get_data function in DataSourceGCE.py: it initializes ret to an empty dictionary at the top of the function and adds debug logging for each candidate NIC that fails to obtain a DHCP lease. It also replaces direct key access on ret with the safe lookup method get(), and adds a unit test that mocks the observed situation.
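The shape of the fix described in this commit message can be sketched as follows. This is a simplified illustration, not the actual patch: `NoDHCPLeaseError`, `network_context`, `read_md`, and `get_data_fixed` are hypothetical stand-ins defined here so the example runs without cloud-init, and the `nics_with_lease` parameter exists only to simulate which NICs get a lease.

```python
import logging
from contextlib import contextmanager

LOG = logging.getLogger(__name__)


class NoDHCPLeaseError(Exception):
    """Hypothetical stand-in for the exception raised when DHCP fails."""


@contextmanager
def network_context(nic, nics_with_lease):
    # Hypothetical stand-in: setup fails unless the NIC "has" a lease.
    if nic not in nics_with_lease:
        raise NoDHCPLeaseError(nic)
    yield


def read_md():
    # Hypothetical stand-in for the metadata read; always succeeds here.
    return {"success": True, "reason": None}


def get_data_fixed(nics, nics_with_lease=()):
    ret = {}  # defined up front, so it exists even if every NIC fails
    for nic in nics:
        try:
            with network_context(nic, nics_with_lease):
                ret = read_md()
        except NoDHCPLeaseError:
            LOG.debug("Unable to obtain a DHCP lease for %s", nic)
            continue
        if ret.get("success"):
            break
    # .get() returns None on an empty dict instead of raising KeyError
    if not ret.get("success"):
        LOG.debug("GCE datasource not usable: %s", ret.get("reason"))
        return False
    return True
```

With no leases available, `get_data_fixed(["enp1s0", "enp2s0"])` returns `False` instead of raising; if any NIC obtains a lease, the function returns `True`.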
Bug report
During the init-local stage, when cloud-init checks DataSourceGCELocal, each available NIC attempts in turn to obtain a DHCP lease. The existing code at [1] crashes if all of them fail. The function read_md populates the ret dictionary, which records whether GCE was successfully examined as a datasource, the reason, and any user-data or meta-data obtained. Note that read_md executes within the network_context: if the network context fails to set up and raises NoDHCPLeaseError for every NIC, all NICs are tested without ret ever being assigned or populated. Before exiting, the code nevertheless dereferences ret for logging, so init-local crashes with a reference-before-assignment error.
[1] https://github.com/canonical/cloud-init/blob/main/cloudinit/sources/DataSourceGCE.py#L99
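The failure mode described above can be illustrated with a minimal, self-contained sketch. It does not use cloud-init itself: `NoDHCPLeaseError`, `failing_network_context`, `read_md`, and `get_data_buggy` are hypothetical stand-ins that only mirror the control flow in the report, with every NIC failing to obtain a lease.

```python
from contextlib import contextmanager


class NoDHCPLeaseError(Exception):
    """Hypothetical stand-in for the exception raised when DHCP fails."""


@contextmanager
def failing_network_context(nic):
    # Simulates a NIC that never obtains a lease: setup raises on entry.
    raise NoDHCPLeaseError(f"no lease on {nic}")
    yield  # unreachable; only present to make this a generator


def read_md():
    # Hypothetical stand-in; never reached in this scenario.
    return {"success": True, "reason": None}


def get_data_buggy(nics):
    for nic in nics:
        try:
            with failing_network_context(nic):
                ret = read_md()  # skipped: the context raised on entry
        except NoDHCPLeaseError:
            continue
        if ret["success"]:
            break
    # ret was never assigned, so this lookup raises UnboundLocalError
    if not ret["success"]:
        return False
    return True
```

Calling `get_data_buggy(["enp1s0", "enp2s0"])` raises `UnboundLocalError`, which is Python's form of the reference-before-assignment error reported here.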
Steps to reproduce the problem
The exact steps to reproduce this are not entirely clear: the crash was observed in a customer environment and discovered incidentally, as a by-product of a configuration error that caused cloud-init not to detect locally seeded data files. I suspect it could be reproduced by using GCELocal as a datasource while disabling connectivity to the DHCP server.
Environment details
Jammy, locally seeded files
cloud-init logs
The logs show cloud-init attempting to use DataSourceGCELocal with two NICs, enp1s0 and enp2s0. cloud-init first attempts DHCP discovery on enp1s0 for five minutes, then fails and moves on to enp2s0, where DHCP discovery is again attempted for five minutes. At this point no NIC has succeeded at DHCP discovery, so read_md never executed. cloud-init still tries to log that checking GCE as a datasource failed by reading from ret, which has no value, and a reference-before-assignment error occurs.