Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCE datasource crashes if no NIC is able to obtain a DHCP lease #5997

Open
bryanfraschetti opened this issue Jan 31, 2025 · 0 comments · May be fixed by #5998
Open

GCE datasource crashes if no NIC is able to obtain a DHCP lease #5997

bryanfraschetti opened this issue Jan 31, 2025 · 0 comments · May be fixed by #5998
Labels
bug Something isn't working correctly new An issue that still needs triage

Comments

@bryanfraschetti
Copy link
Contributor

bryanfraschetti commented Jan 31, 2025

Bug report

During the init-local stage, when checking DataSourceGCELocal, each available NIC sequentially attempts to obtain a DHCP lease, however, the existing code at [1] crashes if all fail to do so. The function, read_md, populates the ret dictionary, which contains information about whether GCE has been successfully examined as a data source, the reason, and any obtained user-data or meta-data. A note is that this function executes within the network_context, which means if the network context fails to setup and throws NoDHCPLeaseError for each NIC, then we can enter a situation where all NICs have been tested and we have not assigned or populated the ret dictionary. Yet, in this situation before exiting we try to dereference the ret dict for logging. The result is that init-local crashes due to a reference before assignment error.

[1] https://github.com/canonical/cloud-init/blob/main/cloudinit/sources/DataSourceGCE.py#L99

Steps to reproduce the problem

The exact steps to reproduce this aren't entirely clear as this was observed in a customer environment and this was incidentally discovered as a by-product of a configuration error that resulted in cloud-init not detecting locally seeded data files. I suspect it could be reproduced by using GCELocal as a datasource while also disabling connectivity to the DHCP server

Environment details

Jammy, locally seeded files

cloud-init logs

From the logs we can see cloud-init attempt to use DataSourceGCELocal with two NICs, enp1s0 and enp2s0. First cloud-init attempts DHCP discovery on enp1s0 for five minutes, before failing and moving onto the next NIC. Again DHCP discovery is attempted for five minutes. At this point no NIC succeeded at DHCP discovery so read_md never executed, but we try to write to the logs that we failed at checking GCE as a datasource by reading from ret, which has no value and a reference before assignment error occurs

Jan 27 09:33:57 ubuntu cloud-init[570]: 2025-01-27 09:33:57,567 - DataSourceGCE.py[DEBUG]: Looking for the primary NIC in: ['enp1s0', 'enp2s0']
Jan 27 09:33:57 ubuntu cloud-init[570]: 2025-01-27 09:33:57,570 - subp.py[DEBUG]: Running command ['ip', '--json', 'addr'] with allowed return codes [0] (shell=False, capture=True)
Jan 27 09:33:58 ubuntu cloud-init[570]: 2025-01-27 09:33:57,693 - distros[DEBUG]: Using configured dhcp client priority list: ['dhclient', 'dhcpcd', 'udhcpc']
Jan 27 09:33:58 ubuntu cloud-init[570]: 2025-01-27 09:33:57,693 - distros[DEBUG]: DHCP client selected: dhclient
Jan 27 09:33:58 ubuntu cloud-init[570]: 2025-01-27 09:33:57,694 - dhcp.py[DEBUG]: Performing a dhcp discovery on enp1s0
Jan 27 09:33:58 ubuntu cloud-init[570]: 2025-01-27 09:33:57,694 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'dev', 'enp1s0', 'up'] with allowed return codes [0] (shell=False, capture=True)
Jan 27 09:33:58 ubuntu cloud-init[570]: 2025-01-27 09:33:57,700 - util.py[DEBUG]: Reading from /sys/class/net/enp1s0/type (quiet=False)
Jan 27 09:33:58 ubuntu cloud-init[570]: 2025-01-27 09:33:57,700 - subp.py[DEBUG]: Running command ['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient.pid', '-sf', '/bin/true', 'enp1s0'] with allowed return codes [0] (shell=False, capture=True)
Jan 27 09:33:57 ubuntu dhclient[1068]: Internet Systems Consortium DHCP Client 4.4.1
Jan 27 09:33:57 ubuntu dhclient[1068]: Copyright 2004-2018 Internet Systems Consortium.
Jan 27 09:33:57 ubuntu dhclient[1068]: All rights reserved.
Jan 27 09:33:57 ubuntu dhclient[1068]: For info, please visit https://www.isc.org/software/dhcp/
Jan 27 09:33:57 ubuntu dhclient[1068]: 
Jan 27 09:33:57 ubuntu dhclient[1068]: Listening on LPF/enp1s0/7e:aa:fe:d4:cc:0c
Jan 27 09:33:57 ubuntu dhclient[1068]: Sending on   LPF/enp1s0/7e:aa:fe:d4:cc:0c
Jan 27 09:33:57 ubuntu dhclient[1068]: Sending on   Socket/fallback
Jan 27 09:33:57 ubuntu dhclient[1068]: DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 3 (xid=0x98345b63)
Jan 27 09:34:00 ubuntu dhclient[1068]: DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 5 (xid=0x98345b63)
Jan 27 09:34:05 ubuntu dhclient[1068]: DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 8 (xid=0x98345b63)
Jan 27 09:34:13 ubuntu dhclient[1068]: DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 12 (xid=0x98345b63)
Jan 27 09:34:24 ubuntu systemd[1]: systemd-fsckd.service: Deactivated successfully.
Jan 27 09:34:24 ubuntu audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='unit=systemd-fsckd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jan 27 09:34:25 ubuntu dhclient[1068]: DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 16 (xid=0x98345b63)
Jan 27 09:34:41 ubuntu dhclient[1068]: DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 13 (xid=0x98345b63)
... Truncated for brevity, but it is five minutes worth of enp1s0 attempting DHCP discovery
Jan 27 09:38:49 ubuntu dhclient[1068]: DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 8 (xid=0x98345b63)
Jan 27 09:38:57 ubuntu dhclient[1068]: DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 interval 1 (xid=0x98345b63)
Jan 27 09:38:58 ubuntu dhclient[1068]: No DHCPOFFERS received.
Jan 27 09:38:58 ubuntu dhclient[1068]: No working leases in persistent database - sleeping.
Jan 27 09:38:58 ubuntu audit[1074]: SYSCALL arch=c000003e syscall=59 success=no exit=-13 a0=7ffc053a2f1b a1=7ff5ade52d30 a2=7ff5a80012b0 a3=7ff5ae7e70c8 items=0 ppid=1068 pid=1074 auid=4294967295 uid=0 gid=0 eu
id=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="isc-worker0000" exe="/usr/sbin/dhclient" subj=/{,usr/}sbin/dhclient key=(null)
Jan 27 09:38:58 ubuntu audit: PROCTITLE proctitle=2F7573722F7362696E2F6468636C69656E74002D31002D76002D6C66002F72756E2F6468636C69656E742E6C65617365002D7066002F72756E2F6468636C69656E742E706964002D7366002F62696E2F
7472756500656E70317330
Jan 27 09:38:58 ubuntu cloud-init[570]: 2025-01-27 09:38:58,902 - performance.py[DEBUG]: Running ['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient.pid', '-sf', '/bin/true', 
'enp1s0'] took 301.200 seconds
Jan 27 09:38:58 ubuntu cloud-init[570]: 2025-01-27 09:38:58,904 - util.py[DEBUG]: All files appeared after 0 seconds: ['/run/dhclient.pid', '/run/dhclient.lease']
Jan 27 09:38:58 ubuntu cloud-init[570]: 2025-01-27 09:38:58,904 - util.py[DEBUG]: Reading from /run/dhclient.pid (quiet=False)
Jan 27 09:38:58 ubuntu cloud-init[570]: 2025-01-27 09:38:58,904 - util.py[DEBUG]: Reading 5 bytes from /run/dhclient.pid
Jan 27 09:38:58 ubuntu cloud-init[570]: 2025-01-27 09:38:58,905 - util.py[DEBUG]: Reading from /proc/1068/stat (quiet=True)
Jan 27 09:38:58 ubuntu cloud-init[570]: 2025-01-27 09:38:58,905 - util.py[DEBUG]: Reading 310 bytes from /proc/1068/stat
Jan 27 09:38:58 ubuntu cloud-init[570]: 2025-01-27 09:38:58,907 - dhcp.py[DEBUG]: killing dhclient with pid=1068
Jan 27 09:38:58 ubuntu cloud-init[570]: 2025-01-27 09:38:58,909 - util.py[DEBUG]: Reading from /run/dhclient.lease (quiet=False)
Jan 27 09:38:58 ubuntu cloud-init[570]: 2025-01-27 09:38:58,909 - util.py[DEBUG]: Reading 0 bytes from /run/dhclient.lease
Jan 27 09:38:58 ubuntu cloud-init[570]: 2025-01-27 09:38:58,910 - subp.py[DEBUG]: Running command ['ip', '--json', 'addr'] with allowed return codes [0] (shell=False, capture=True)
Jan 27 09:38:58 ubuntu cloud-init[570]: 2025-01-27 09:38:58,918 - ephemeral.py[DEBUG]: No connectivity URLs provided. Skipping connectivity check before ephemeral network setup.
Jan 27 09:38:58 ubuntu cloud-init[570]: 2025-01-27 09:38:58,918 - dhcp.py[DEBUG]: Performing a dhcp discovery on enp2s0
Jan 27 09:38:58 ubuntu cloud-init[570]: 2025-01-27 09:38:58,919 - subp.py[DEBUG]: Running command ['ip', 'link', 'set', 'dev', 'enp2s0', 'up'] with allowed return codes [0] (shell=False, capture=True)
Jan 27 09:38:58 ubuntu cloud-init[570]: 2025-01-27 09:38:58,927 - util.py[DEBUG]: Reading from /sys/class/net/enp2s0/type (quiet=False)
Jan 27 09:38:58 ubuntu cloud-init[570]: 2025-01-27 09:38:58,928 - util.py[DEBUG]: Reading 2 bytes from /sys/class/net/enp2s0/type
Jan 27 09:38:58 ubuntu cloud-init[570]: 2025-01-27 09:38:58,928 - subp.py[DEBUG]: Running command ['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient.pid', '-sf', '/bin/true', 'enp2s0'] with allowed return codes [0] (shell=False, capture=True)
Jan 27 09:38:58 ubuntu audit[1082]: SYSCALL arch=c000003e syscall=59 success=no exit=-13 a0=7fff13296f1b a1=7fff13295660 a2=5647c53ffd80 a3=7f51b79720c8 items=0 ppid=1078 pid=1082 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="dhclient" exe="/usr/sbin/dhclient" subj=/{,usr/}sbin/dhclient key=(null)
Jan 27 09:38:58 ubuntu audit: PROCTITLE proctitle=2F7573722F7362696E2F6468636C69656E74002D31002D76002D6C66002F72756E2F6468636C69656E742E6C65617365002D7066002F72756E2F6468636C69656E742E706964002D7366002F62696E2F7472756500656E70327330
Jan 27 09:38:58 ubuntu dhclient[1078]: Internet Systems Consortium DHCP Client 4.4.1
Jan 27 09:38:58 ubuntu dhclient[1078]: Copyright 2004-2018 Internet Systems Consortium.
Jan 27 09:38:58 ubuntu dhclient[1078]: All rights reserved.
Jan 27 09:38:58 ubuntu dhclient[1078]: For info, please visit https://www.isc.org/software/dhcp/
Jan 27 09:38:58 ubuntu dhclient[1078]: 
Jan 27 09:38:58 ubuntu dhclient[1078]: Listening on LPF/enp2s0/46:02:4f:95:75:a2
Jan 27 09:38:58 ubuntu dhclient[1078]: Sending on   LPF/enp2s0/46:02:4f:95:75:a2
Jan 27 09:38:58 ubuntu dhclient[1078]: Sending on   Socket/fallback
Jan 27 09:38:58 ubuntu dhclient[1078]: DHCPDISCOVER on enp2s0 to 255.255.255.255 port 67 interval 3 (xid=0xf9656921)
Jan 27 09:39:01 ubuntu dhclient[1078]: DHCPDISCOVER on enp2s0 to 255.255.255.255 port 67 interval 4 (xid=0xf9656921)
Jan 27 09:39:05 ubuntu dhclient[1078]: DHCPDISCOVER on enp2s0 to 255.255.255.255 port 67 interval 6 (xid=0xf9656921)
Jan 27 09:39:11 ubuntu dhclient[1078]: DHCPDISCOVER on enp2s0 to 255.255.255.255 port 67 interval 13 (xid=0xf9656921)
Jan 27 09:39:24 ubuntu dhclient[1078]: DHCPDISCOVER on enp2s0 to 255.255.255.255 port 67 interval 20 (xid=0xf9656921)
... Again truncated five minutes of enp2s0 attempting to obtain a lease
Jan 27 09:43:59 ubuntu dhclient[1078]: No working leases in persistent database - sleeping.
Jan 27 09:43:59 ubuntu cloud-init[570]: 2025-01-27 09:43:59,195 - performance.py[DEBUG]: Running ['/usr/sbin/dhclient', '-1', '-v', '-lf', '/run/dhclient.lease', '-pf', '/run/dhclient.pid', '-sf', '/bin/true', 'enp2s0'] took 300.266 seconds
Jan 27 09:43:59 ubuntu cloud-init[570]: 2025-01-27 09:43:59,207 - performance.py[DEBUG]: Getting metadata took 601.646 seconds
Jan 27 09:43:59 ubuntu cloud-init[570]: 2025-01-27 09:43:59,208 - handlers.py[DEBUG]: finish: init-local/search-GCELocal: FAIL: no local data found from DataSourceGCELocal
Jan 27 09:43:59 ubuntu cloud-init[570]: 2025-01-27 09:43:59,208 - log_util.py[WARNING]: Getting data from <class 'cloudinit.sources.DataSourceGCE.DataSourceGCELocal'> failed
Jan 27 09:43:59 ubuntu cloud-init[570]: 2025-01-27 09:43:59,208 - log_util.py[DEBUG]: Getting data from <class 'cloudinit.sources.DataSourceGCE.DataSourceGCELocal'> failed
Jan 27 09:43:59 ubuntu cloud-init[570]: Traceback (most recent call last):
Jan 27 09:43:59 ubuntu cloud-init[570]:   File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 1061, in find_source
Jan 27 09:43:59 ubuntu cloud-init[570]:     if s.update_metadata_if_supported(
Jan 27 09:43:59 ubuntu cloud-init[570]:   File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 937, in update_metadata_if_supported
Jan 27 09:43:59 ubuntu cloud-init[570]:     result = self.get_data()
Jan 27 09:43:59 ubuntu cloud-init[570]:   File "/usr/lib/python3/dist-packages/cloudinit/performance.py", line 103, in decorator
Jan 27 09:43:59 ubuntu cloud-init[570]:     return func(*args, **kwargs)
Jan 27 09:43:59 ubuntu cloud-init[570]:   File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 493, in get_data
Jan 27 09:43:59 ubuntu cloud-init[570]:     return_value = self._check_and_get_data()
Jan 27 09:43:59 ubuntu cloud-init[570]:   File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 424, in _check_and_get_data
Jan 27 09:43:59 ubuntu cloud-init[570]:     return self._get_data()
Jan 27 09:43:59 ubuntu cloud-init[570]:   File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceGCE.py", line 131, in _get_data
Jan 27 09:43:59 ubuntu cloud-init[570]:     if not ret["success"]:
Jan 27 09:43:59 ubuntu cloud-init[570]: UnboundLocalError: local variable 'ret' referenced before assignment
@bryanfraschetti bryanfraschetti added bug Something isn't working correctly new An issue that still needs triage labels Jan 31, 2025
bryanfraschetti added a commit to bryanfraschetti/cloud-init that referenced this issue Jan 31, 2025
This commit addresses issue canonical#5997 which reported crashes in init-local
when cloud-init was examining GCELocal as a potential datasource. When
all NICs failed at DHCP discovery cloud-init attempts to log the events
by dereferencing a value that was never assigned.

This commit modifies the _get_data function of DataSourceGCE.py by
adding an empty dictionary definition for the ret variable at the
top level of the function and some debugging logs when a candidate NIC
fails to obtain a DHCP lease. At the same time, the commit replaces the
direct key access operator on ret with the safe lookup method get(). This
commit also adds a unit test that mocks the observed situation
bryanfraschetti added a commit to bryanfraschetti/cloud-init that referenced this issue Jan 31, 2025
This commit addresses issue canonical#5997 which reported crashes in init-local
when cloud-init was examining GCELocal as a potential datasource. When
all NICs failed at DHCP discovery cloud-init attempts to log the events
by dereferencing a value that was never assigned.

This commit modifies the _get_data function of DataSourceGCE.py by
adding an empty dictionary definition for the ret variable at the
top level of the function and some debugging logs when a candidate NIC
fails to obtain a DHCP lease. At the same time, the commit replaces the
direct key access operator on ret with the safe lookup method get(). This
commit also adds a unit test that mocks the observed situation
bryanfraschetti added a commit to bryanfraschetti/cloud-init that referenced this issue Jan 31, 2025
This commit addresses issue canonical#5997 which reported crashes in init-local
when cloud-init was examining GCELocal as a potential datasource. When
all NICs failed at DHCP discovery cloud-init attempts to log the events
by dereferencing a value that was never assigned.

This commit modifies the _get_data function of DataSourceGCE.py by
adding an empty dictionary definition for the ret variable at the
top level of the function and some debugging logs when a candidate NIC
fails to obtain a DHCP lease. At the same time, the commit replaces the
direct key access operator on ret with the safe lookup method get(). This
commit also adds a unit test that mocks the observed situation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working correctly new An issue that still needs triage
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant