1 change: 1 addition & 0 deletions doc/dev/design/index.rst
@@ -47,6 +47,7 @@ WIP
:maxdepth: 1

NAT-address-discovery
metadata-discovery

.. _design-docs-postponed:

210 changes: 210 additions & 0 deletions doc/dev/design/metadata-discovery.rst
@@ -0,0 +1,210 @@
**************************
Segment Metadata Discovery
**************************

- Author(s): Jordi Subirà-Nieto, Tilmann Zäschke
- Last updated: 2025-05-02
- Discussion at: :issue:`4761`, previously: :issue:`4742`
- Status: **WIP**


Abstract
========
We propose to implement a mechanism or tool that automatically populates,
or helps with manually populating, metadata for path segments,
such as latency, bandwidth, internal hop count, geo location, or general notes.

We believe that this is an important next step because metadata is essential
for informed path selection, which in turn is one of the main features of SCION.
Unfortunately, path segment metadata is currently (almost) non-existent
in the production network.
This could be remedied by simplifying or even automating metadata discovery.


Background
==========

Path segments can contain metadata, such as latency, bandwidth,
internal hop count, geo location, or general notes.
This path metadata can be declared in a ``staticInfoConfig.json`` file that
is read when the control service starts.
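
To make the discussion concrete, the following sketch shows roughly what such
a file could contain for an AS with two interfaces. The key names, nesting,
units, and all values are assumptions made for illustration. They are not
taken from this document, and the control service documentation remains the
authoritative source for the schema.

```python
import json

# Illustrative only: key names, nesting, units and values are assumptions,
# not the authoritative staticInfoConfig.json schema.
EXAMPLE_STATIC_INFO = {
    # Latency per interface: to the neighbor AS and to other local interfaces.
    "Latency": {"1": {"Inter": "30ms", "Intra": {"2": "10ms"}}},
    # Bandwidth per interface, same structure as latency.
    "Bandwidth": {"1": {"Inter": 400000000, "Intra": {"2": 100000000}}},
    # Internal hop count between local interfaces.
    "Hops": {"1": {"Intra": {"2": 2}}},
    # Geo location of the border router that owns the interface.
    "Geo": {"1": {"Latitude": 47.4, "Longitude": 8.5, "Address": "example address"}},
    # Free-form note.
    "Note": "example AS",
}


def render(info: dict) -> str:
    """Serialize the metadata in a stable, human-editable form."""
    return json.dumps(info, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render(EXAMPLE_STATIC_INFO))
```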

This approach works but has disadvantages:

* All data must be measured and added to the file manually, which is tedious
  and error-prone, and this must happen before the control service starts.
* Updates require restarting the control service to be applied.
* Administrators have to log into the machine and inspect the file
  in order to see current metadata settings (and even then the file only
  reflects the active settings if the control service parsed it successfully
  and it has not changed since).

Automating metadata collection and supporting admins with metadata editing
and monitoring would likely increase the presence of metadata in the production
network (and other networks).


Proposal
========

The main motivation for this proposal is to improve the availability of metadata.
While this section proposes details on how this can be achieved, these details
should be seen as suggestions rather than as definitive instructions.


Control Service
---------------
The control service (CS) needs a mechanism that **detects updates** to the
``staticInfoConfig.json`` file and automatically reads the new file version.
The ``staticInfoConfig.json`` file may remain the primary way to store and exchange
metadata. However, even if it were replaced by APIs (e.g. gRPC calls),
auto-reading the file would remain useful when the file is edited manually.
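
One way to detect updates is to poll the file and compare a content digest,
as in the minimal sketch below. The class name ``StaticInfoWatcher`` and the
callback interface are hypothetical. The actual control service is free to
use a different mechanism, such as file system notifications.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Optional


class StaticInfoWatcher:
    """Polls a JSON file and calls `on_change` whenever its content changes.

    Hypothetical helper for illustration, not part of the control service.
    """

    def __init__(self, path: Path, on_change: Callable[[dict], None]) -> None:
        self._path = path
        self._on_change = on_change
        self._last_digest: Optional[str] = None

    def poll(self) -> bool:
        """Check the file once; return True if a new version was loaded."""
        try:
            raw = self._path.read_bytes()
        except FileNotFoundError:
            return False  # keep the previously loaded metadata
        digest = hashlib.sha256(raw).hexdigest()
        if digest == self._last_digest:
            return False  # content unchanged since the last successful load
        try:
            parsed = json.loads(raw)
        except ValueError:
            # Malformed or partially written file: keep the old metadata
            # and try again on the next poll.
            return False
        self._last_digest = digest
        self._on_change(parsed)
        return True
```

Calling ``poll()`` from a timer keeps the last valid metadata active while an
administrator is in the middle of editing the file.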


Metadata Service
----------------

We propose several tools/mechanisms. These can be combined but may also be
helpful on their own. The central component is a "metadata service" (MS).
The MS is responsible for the following:

* Initialize an empty or non-existent ``staticInfoConfig.json`` file.
  Bandwidth can be initialized with ``0``, latency with ``-1``, hop count with ``0``,
  geo location with ``0, 0, "unknown"`` and notes with
  ``"<ISD-AS> : All data autogenerated by Software ABC v1.42"``.
* Trigger collection of metadata or directly collect it.
* Update the ``staticInfoConfig.json`` file with recent metadata.
  The file is necessary for non-measurable data (notes, addresses, ...) and to have
  metadata available immediately after a system restart.
* Communicate metadata to the control service (CS). This can be done by writing it to the
  ``staticInfoConfig.json`` file, possibly complemented by an API.
* Detect changes to (or, more generally, inconsistencies with) the topology file (new links,
  border routers, ...). If a change is detected, administrators could be notified and/or
  the metadata could be adapted automatically (add detected data or remove obsolete data).
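
The initialization and topology reconciliation steps above can be sketched as
follows. The dictionary layout (``note``, ``interfaces``, and the per-interface
keys) and both function names are hypothetical and do not reflect the actual
file schema. Only the default values are taken from the list above.

```python
from typing import Dict, Iterable, List, Tuple


def _default_interface() -> dict:
    """Placeholder values for an interface without any collected data."""
    return {
        "bandwidth": 0,
        "latency": -1,
        "hops": 0,
        "geo": {"latitude": 0, "longitude": 0, "address": "unknown"},
    }


def initial_static_info(isd_as: str, interface_ids: Iterable[int],
                        generator: str) -> dict:
    """Build initial metadata for all interfaces found in the topology."""
    return {
        "note": f"{isd_as} : All data autogenerated by {generator}",
        "interfaces": {str(i): _default_interface() for i in interface_ids},
    }


def reconcile(info: dict, interface_ids: Iterable[int]) -> Tuple[List[str], List[str]]:
    """Align the metadata with the topology, modifying `info` in place.

    Returns the sorted lists of added and removed interface IDs so that the
    caller can notify administrators.
    """
    wanted = {str(i) for i in interface_ids}
    interfaces: Dict[str, dict] = info["interfaces"]
    added = sorted(wanted - interfaces.keys())
    removed = sorted(interfaces.keys() - wanted)
    for ifid in added:
        interfaces[ifid] = _default_interface()
    for ifid in removed:
        del interfaces[ifid]
    return added, removed
```

Returning the added and removed IDs leaves it to the caller to decide whether
to only notify administrators or to also persist the adapted metadata.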

The MS can be implemented in many different ways: as a stand-alone process,
as part of the CS, or as an Ansible playbook.
See also `alternative-metadata-service`_.


Metadata Collection
-------------------

We need to collect different types of metadata. There is probably not one tool
to do it all.

* Latency: latency could be measured automatically at regular intervals,
  for example, on the border routers (machines) or even in the border routers
  (router processes) by sending ICMP or SCMP echo messages to other border routers.
* Internal hop count: similar to latency, this could be done by the border
  routers (on the machines or even in the router processes), potentially
  using ``traceroute`` as a first approximation.
* Bandwidth: this could potentially be extracted automatically via API calls,
  e.g. for AWS or Equinix APIs. This could be done by the metadata service (MS).
* Geolocation: we could use an IP geolocator for border router IPs as a first
  approximation. This could be done by the MS.

The measurements and data collection may be executed in configurable
intervals (once a day, once per hour, ...) or could be triggered manually.
All data would be reported back to the metadata service.
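
As an illustration of the latency measurement, the sketch below aggregates
several echo round-trip times into a single value. The ``probe`` callable is a
hypothetical stand-in for whatever sends one ICMP or SCMP echo and returns the
round-trip time in milliseconds, or ``None`` on timeout. Using the median is a
design choice of this sketch, not something the proposal prescribes.

```python
import statistics
from typing import Callable, List, Optional


def measure_latency_ms(probe: Callable[[], Optional[float]],
                       samples: int = 5) -> float:
    """Return the median round-trip time in ms over `samples` probes.

    Returns -1 (the "unknown" placeholder) if no probe succeeded.
    """
    rtts: List[float] = []
    for _ in range(samples):
        rtt = probe()
        if rtt is not None:  # None means the echo timed out
            rtts.append(rtt)
    return statistics.median(rtts) if rtts else -1
```

The median keeps a single delayed or lost echo from distorting the reported
value, which matters when measurements run only once per hour or day.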

All data should be stored locally to the MS so that it is immediately available
when the MS or CS is restarted. The easiest way may be to store the data directly
in the ``staticInfoConfig.json``.
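
If the CS re-reads the file whenever it changes, the MS should avoid exposing
a half-written file. A common technique, sketched below, is to write to a
temporary file in the same directory and rename it over the target. The
function name ``write_static_info`` is hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def write_static_info(path: Path, info: dict) -> None:
    """Replace `path` with the serialized metadata in one atomic step."""
    # The temporary file must be in the same directory (same file system)
    # so that the final rename is atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(path.parent), prefix=path.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(info, tmp, indent=2, sort_keys=True)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the data is on disk
        os.replace(tmp_name, path)  # readers see the old or the new file
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```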


File Format
-----------
Optionally: It may be useful to allow an additional attribute for each value,
for example: ``override=true``.
This attribute should indicate that a value was manually overridden and should not be
modified by measurements.

The use case for this attribute is not yet clear. Geolocation should normally
only run autodetection if no value is available. The attribute may be useful for
bandwidth when the autodetection gets incorrect values from the provider's service API.
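
The intended effect of such an attribute can be sketched as follows. The
``{"value": ..., "override": ...}`` layout and the function name are
hypothetical, since the proposal does not fix a concrete file format.

```python
from typing import Any, Dict


def apply_measurement(current: Dict[str, Dict[str, Any]], field: str,
                      measured: Any) -> bool:
    """Store a measured value unless the field was manually overridden.

    Returns True if the value was stored, False if it was skipped.
    """
    entry = current.get(field)
    if entry is not None and entry.get("override", False):
        return False  # manual value wins over the measurement
    current[field] = {"value": measured, "override": False}
    return True
```

For example, a manually corrected bandwidth value would survive subsequent
automatic collection runs, while fields without the attribute keep being
updated.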


Management API
--------------

It may be useful to have a metadata management API that can be accessed remotely
by administrators to monitor metadata and edit non-measurable metadata
(notes, addresses, more accurate geolocation, ...). However, this is optional
and can be done by monitoring or manually editing the ``staticInfoConfig.json`` file.

If we decide to have a remote monitoring API, in order to avoid concurrency issues
we should probably remove the runtime reparsing of the file. Reparsing of the
file would thus be an interim solution until the management API is available.
At that point, the file should only be parsed at startup of the metadata service.


Rationale
=========

We believe that it is important to simplify metadata collection, configuration
and management. Metadata is necessary for enabling one of the core features:
informed path selection.


Auto Detection
--------------

Correctness: The automatic detection of metadata may result in imprecise data
(especially geo location).
However, since most of the data is not verifiable anyway, one can argue that
automatically detected data is at least better than no data at all.

In the future, we may want to qualify the data origin or quality.
This could be done with an extra field that specifies the origin or data quality:
GENERATED_DEFAULT, MEASURED, MANUAL.
However, this is probably out of scope for an initial implementation.
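
For illustration, such an origin field could be modeled as below. The three
constants are taken from the text above. The type names and the JSON layout
are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any


class Origin(Enum):
    """Where a metadata value came from."""
    GENERATED_DEFAULT = "GENERATED_DEFAULT"
    MEASURED = "MEASURED"
    MANUAL = "MANUAL"


@dataclass(frozen=True)
class Qualified:
    """A metadata value together with its origin."""
    value: Any
    origin: Origin


def to_json(q: Qualified) -> dict:
    """Convert to a JSON-compatible dictionary."""
    return {"value": q.value, "origin": q.origin.value}


def from_json(d: dict) -> Qualified:
    """Parse the dictionary form; raises ValueError on an unknown origin."""
    return Qualified(d["value"], Origin(d["origin"]))
```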

.. _alternative-metadata-service:

Alternative: Integrate Metadata Service into the Control Service?
-----------------------------------------------------------------

There are many ways to implement the metadata service. One idea is to
integrate it into the control service process.

Advantages:

* No administrative overhead for an additional service. No additional
  config file entries (e.g. predefined port/IP to make it remotely reachable).
* When a remote monitoring API is implemented, it can monitor directly
  what metadata the control service is using. If the metadata service
  is a separate process, it could only report what was communicated to the CS, not
  what the CS is actually using.

Disadvantages:

* Feature overload of the control service.
* Implementation may be simpler as a separate process or as an Ansible playbook.

Compatibility
=============

Some parts of the proposal require changes to the control service and
(possibly) the border routers. These changes are fully backwards compatible and
do not affect existing functionality.

The changes can be deployed incrementally. The new APIs do no harm if they are not
used.
The metadata service must be able to handle border routers that are not yet prepared
for metadata collection.

Implementation
==============

The implementation can easily be done in multiple steps. These steps can be
released and deployed independently.

Proposed order of implementation:

1. Control service to detect updates to ``staticInfoConfig.json`` and reload the file.
2. Metadata service to collect metadata and write it to the ``staticInfoConfig.json`` file.
3. Implement latency and hop count measurements on/in border routers and send
   results to the metadata service. Implement triggering of metadata collection
   on/in border routers.
4. In the metadata service, implement API for remote administration and monitoring
   of metadata.