diff --git a/doc/dev/design/index.rst b/doc/dev/design/index.rst
index a77b15dbda..51a95b1711 100644
--- a/doc/dev/design/index.rst
+++ b/doc/dev/design/index.rst
@@ -47,6 +47,7 @@ WIP
    :maxdepth: 1
 
    NAT-address-discovery
+   metadata-discovery
 
 .. _design-docs-postponed:
 
diff --git a/doc/dev/design/metadata-discovery.rst b/doc/dev/design/metadata-discovery.rst
new file mode 100644
index 0000000000..541d172252
--- /dev/null
+++ b/doc/dev/design/metadata-discovery.rst
@@ -0,0 +1,210 @@
+**************************
+Segment Metadata Discovery
+**************************
+
+- Author(s): Jordi Subirà-Nieto, Tilmann Zäschke
+- Last updated: 2025-05-02
+- Discussion at: :issue:`4761`, previously: :issue:`4742`
+- Status: **WIP**
+
+
+Abstract
+========
+We propose to implement a mechanism or tool that automatically populates,
+or helps with manually populating, metadata for path segments,
+such as latency, bandwidth, internal hop count, geo location, or general notes.
+
+We believe that this is an important next step because metadata is essential
+for informed path selection, which in turn is one of the main features of SCION.
+Unfortunately, path segment metadata is currently (almost) non-existent
+in the production network.
+This could be remedied by simplifying or even automating metadata discovery.
+
+
+Background
+==========
+
+Path segments can contain metadata, such as latency, bandwidth,
+internal hop count, geo location, or general notes.
+This path metadata can be declared in a ``staticInfoConfig.json`` file that
+is read on startup of the control service.
+
+This approach works but has disadvantages:
+
+* All data must be measured manually and added to the file before the control service starts.
+* Editing the file by hand is error-prone and tedious.
+* Updates only take effect after the control service is restarted.
+* Administrators have to log into the machine and inspect the file
+  in order to see current metadata settings (assuming the file was parsed
+  recently and correctly).
+
+Automating metadata collection and supporting admins with metadata editing
+and monitoring would likely increase the presence of metadata in the production
+network (and other networks).
+
+
+Proposal
+========
+
+The main motivation for this proposal is to improve the availability of metadata.
+While this section proposes details on how this can be achieved, these details
+should be read as suggestions rather than as definitive instructions.
+
+
+Control Service
+---------------
+The control service (CS) needs a mechanism that detects updates to the
+``staticInfoConfig.json`` file and automatically reads the new file version.
+The ``staticInfoConfig.json`` may remain the primary way to store and exchange
+metadata info. However, even if it were replaced by APIs (e.g. gRPC calls),
+auto-reading the file would still be useful when the file is edited manually.
+
+
+Metadata Service
+----------------
+
+We propose several tools/mechanisms. These can be combined but may also be
+helpful on their own. The central component is a "metadata service" (MS).
+The MS is responsible for the following:
+
+* Initialize an empty or non-existent ``staticInfoConfig.json`` file.
+  Bandwidth can be initialized with ``0``, latency with ``-1``, hop count with ``0``,
+  geo location with ``0, 0, "unknown"`` and notes with
+  ``" : All data autogenerated by Software ABC v1.42"``.
+* Trigger collection of metadata or directly collect it.
+* Update the ``staticInfoConfig.json`` file with recent metadata.
+  The file is necessary for non-measurable data (notes, addresses, ...) and to have
+  metadata available immediately after a system restart.
+* Communicate metadata to the control service (CS). This can be done by writing it to the
+  ``staticInfoConfig.json`` file and possibly also via an API.
+* Detect changes to (or, more generally, inconsistencies with) the topology file (new links,
+  border routers, ...). If a change is detected, administrators could be notified and/or
+  the metadata could be adapted automatically (add detected data or remove obsolete data).
+
+The MS can be implemented in many different ways: as a stand-alone process,
+integrated into the CS, or as an Ansible playbook.
+See also `alternative-metadata-service`_.
+
+
+Metadata Collection
+-------------------
+
+We need to collect different types of metadata. There is probably not one tool
+to do it all.
+
+* Latency: latency could be measured automatically at regular intervals,
+  for example, on the border routers (machines) or even in the border routers
+  (router processes) by sending ICMP or SCMP echo messages to other border routers.
+* Internal hop count: similar to latency, this could be done by the border
+  routers (on the machines or even in the router processes), potentially
+  using ``traceroute`` as a first approximation.
+* Bandwidth: this could potentially be extracted automatically via provider API calls,
+  e.g. the AWS or Equinix APIs. This could be done by the metadata service (MS).
+* Geolocation: we could use an IP geolocator for border router IPs as a first
+  approximation. This could be done by the MS.
+
+The measurements and data collection may be executed at configurable
+intervals (once a day, once per hour, ...) or could be triggered manually.
+All data would be reported back to the metadata service.
+
+All data should be stored locally by the MS so that it is immediately available
+when the MS or CS is restarted. The easiest way may be to store the data directly
+in the ``staticInfoConfig.json``.
+
+
+File Format
+-----------
+Optionally: It may be useful to allow an additional attribute for each value,
+for example: ``override=true``.
+This attribute would indicate that a value was manually overridden and should not be
+modified by measurements.
+
+The use case for this attribute is not yet clear. Geolocation should normally
+only run autodetection if no value is available. Maybe it is useful for bandwidth
+when the autodetection gets incorrect values from the provider's service API?
+
+
+Management API
+--------------
+
+It may be useful to have a metadata management API that can be accessed remotely
+by administrators to monitor metadata and edit non-measurable metadata
+(notes, addresses, more accurate geolocation, ...). However, this is optional: the same
+can be achieved by monitoring or manually editing the ``staticInfoConfig.json`` file.
+
+If we decide to have a remote monitoring API, in order to avoid concurrency issues
+we should probably remove the runtime reparsing of the file. Reparsing of the
+file would thus be an interim solution until the management API is available.
+At that point, the file should only be parsed at startup of the metadata service.
+
+
+Rationale
+=========
+
+We believe that it is important to simplify metadata collection, configuration
+and management. Metadata is necessary for enabling one of the core features:
+informed path selection.
+
+
+Auto Detection
+--------------
+
+Correctness: The automatic detection of metadata may result in imprecise data
+(especially geo location).
+However, since most of the data is not verifiable anyway, one can argue that
+automatically detected data is at least better than no data at all.
+
+In the future, we may want to qualify the data origin or quality.
+This could be done with an extra field that specifies the origin or data quality:
+``GENERATED_DEFAULT``, ``MEASURED``, or ``MANUAL``.
+However, this is probably out of scope for an initial implementation.
+
+.. _alternative-metadata-service:
+
+Alternative: Integrate Metadata Service into the Control Service?
+-----------------------------------------------------------------
+
+There are many ways to implement the metadata service. One idea is to
+integrate it into the control service process.
+
+Advantages:
+
+* No administrative overhead for an additional service. No additional
+  config file entries (e.g. predefined port/IP to make it remotely reachable).
+* When a remote monitoring API is implemented, it can monitor directly
+  what metadata the control service is using. If the metadata service
+  is a separate process, it could only report what was communicated to the CS, not
+  what the CS is actually using.
+
+Disadvantages:
+
+* Feature overload of the control service.
+* Implementation may be simpler as a separate process or as an Ansible playbook.
+
+Compatibility
+=============
+
+Some parts of the proposal require changes to the control service and
+(possibly) the border routers. These changes are fully backwards compatible and
+do not affect existing functionality.
+
+The changes can be deployed incrementally. The new APIs do no harm if they are not
+used.
+The metadata service must be able to handle border routers that are not yet prepared
+for metadata collection.
+
+Implementation
+==============
+
+The implementation can easily be done in multiple steps. These steps can be
+released and deployed independently.
+
+Proposed order of implementation:
+
+1. Control service: detect updates to ``staticInfoConfig.json`` and reload the file.
+2. Metadata service: collect metadata and write it to the ``staticInfoConfig.json`` file.
+3. Implement latency and hop count measurements on/in border routers and send
+   results to the metadata service. Implement triggering of metadata collection
+   on/in border routers.
+4. In the metadata service, implement an API for remote administration and monitoring
+   of metadata.
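Step 1 of the proposed order (detecting updates to ``staticInfoConfig.json`` and reloading the file) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the control service's actual implementation: the proposal does not specify how changes are detected, so the sketch assumes periodic polling with a content hash, and the names ``StaticInfoReloader`` and ``poll`` are hypothetical.

```python
import hashlib
import json
from pathlib import Path


class StaticInfoReloader:
    """Hypothetical helper: reloads a JSON file when its content changes."""

    def __init__(self, path):
        self.path = Path(path)
        self.digest = None  # hash of the last successfully parsed content
        self.config = None  # last successfully parsed metadata

    def poll(self):
        """Return True if a new, valid file version was loaded."""
        try:
            raw = self.path.read_bytes()
        except OSError:
            return False  # file missing or unreadable: keep previous config
        digest = hashlib.sha256(raw).hexdigest()
        if digest == self.digest:
            return False  # unchanged since the last successful load
        try:
            parsed = json.loads(raw)
        except ValueError:
            return False  # half-written or invalid file: keep previous config
        self.digest, self.config = digest, parsed
        return True
```

A caller would invoke ``poll()`` on a timer and hand ``config`` to the beaconing logic whenever it returns ``True``. Keeping the last valid configuration when parsing fails is a design choice of this sketch; it avoids dropping metadata while an administrator is in the middle of editing the file.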