diff --git a/README.md b/README.md
deleted file mode 100644
index c2b38131b..000000000
--- a/README.md
+++ /dev/null
@@ -1,189 +0,0 @@
-# gProfiler
-gProfiler combines multiple sampling profilers to produce unified visualization of
-what your CPU is spending time on, displaying stack traces of your processes
-across native programs[1](#perf-native) (includes Golang), Java and Python runtimes, and kernel routines.
-
-gProfiler can upload its results to the [Granulate Performance Studio](https://profiler.granulate.io/), which aggregates the results from different instances over different periods of time and can give you a holistic view of what is happening on your entire cluster.
-To upload results, you will have to register and generate a token on the website.
-
-gProfiler runs on Linux.
-
-![Granulate Performance Studio example view](https://user-images.githubusercontent.com/58514213/124375504-36b0b200-dcab-11eb-8d64-caf20687a29f.gif)
-
-# Running
-
-This section describes the possible options to control gProfiler's output, and the various execution modes (as a container, as an executable, etc.)
-
-## Output options
-
-gProfiler can produce output in two ways:
-
-* Create an aggregated, collapsed stack samples file (`profile_<timestamp>.col`)
-  and a flamegraph file (`profile_<timestamp>.html`). Two symbolic links (`last_profile.col` and `last_flamegraph.html`) always point to the last output files.
-
-  Use the `--output-dir`/`-o` option to specify the output directory.
-
-  If `--rotating-output` is given, only the last results are kept (available via `last_profile.col` and `last_flamegraph.html`). This can be used to avoid increasing gProfiler's disk usage over time. Useful in conjunction with `--upload-results` (explained ahead) - historical results are available in the Granulate Performance Studio, and the very latest results are available locally.
-
-  `--no-flamegraph` can be given to avoid generation of the `profile_<timestamp>.html` file - only the collapsed stack samples file will be created.
-
-* Send the results to the Granulate Performance Studio for viewing online with
-  filtering, insights, and more.
-
-  Use the `--upload-results`/`-u` flag. Pass the `--token` option to specify the token
-  provided by Granulate Performance Studio, and the `--service-name` option to specify an identifier
-  for the collected profiles, as will be viewed in the [Granulate Performance Studio](https://profiler.granulate.io/). *Profiles sent from numerous
-  gProfilers using the same service name will be aggregated together.*
-
-Note: both flags can be used simultaneously, in which case gProfiler will create the local files *and* upload
-the results.
-
-## Profiling options
-* `--profiling-frequency`: The sampling frequency of the profiling, in *hertz*.
-* `--profiling-duration`: The duration of each profiling session, in *seconds*.
-
-The default profiling frequency is *11 hertz*. Using a higher frequency will lead to more accurate results, but will create greater overhead on the profiled system & programs.
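The frequency/duration trade-off described above is easy to quantify. A back-of-the-envelope sketch (the 11 Hz default comes from the text; the session duration and CPU count are illustrative assumptions):

```python
# Rough sample volume for one profiling session (illustrative numbers).
frequency_hz = 11   # --profiling-frequency default, per the text above
duration_s = 60     # --profiling-duration; assumed here for illustration
cpus = 8            # perf samples every CPU in system-wide mode; assumed machine size

samples = frequency_hz * duration_s * cpus
print(samples)  # 5280 -- raising the frequency scales both detail and overhead linearly
```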
-
-For each profiling session (each profiling duration), gProfiler produces outputs (writing local files and/or uploading the results to the Granulate Performance Studio).
-
-### Java profiling options
-
-* `--no-java` or `--java-mode disabled`: Disable profilers for Java.
-
-### Python profiling options
-* `--no-python`: Alias of `--python-mode disabled`.
-* `--python-mode`: Controls which profiler is used for Python.
-  * `auto` - (default) try with PyPerf (eBPF), fall back to py-spy.
-  * `pyperf` - Use PyPerf with no py-spy fallback.
-  * `pyspy` - Use py-spy.
-  * `disabled` - Disable profilers for Python.
-
-Profiling using eBPF incurs lower overhead & provides kernel stacks. This (currently) requires kernel headers to be installed.
-
-### PHP profiling options
-* `--no-php` or `--php-mode disabled`: Disable profilers for PHP.
-* `--php-proc-filter`: Process filter (`pgrep`) to select PHP processes for profiling (this is phpspy's `-P` option)
-
-### Ruby profiling options
-* `--no-ruby` or `--ruby-mode disabled`: Disable profilers for Ruby.
-
-### NodeJS profiling options
-* `--nodejs-mode`: Controls which profiler is used for NodeJS.
-  * `none` - (default) no profiler is used.
-  * `perf` - augment the system profiler (`perf`) results with jitdump files generated by NodeJS. This requires running your `node` processes with `--perf-prof` (and for Node >= 10, with `--interpreted-frames-native-stack`). See this [NodeJS page](https://nodejs.org/en/docs/guides/diagnostics-flamegraph/) for more information.
-
-### System profiling options
-
-* `--perf-mode`: Controls the global perf strategy. Must be one of the following options:
-  * `fp` - Use Frame Pointers for the call graph
-  * `dwarf` - Use DWARF for the call graph (adds the `--call-graph dwarf` argument to the `perf` command)
-  * `smart` - Run both `fp` and `dwarf`, then choose the result with the highest average of stack frames count, per process.
-  * `disabled` - Avoids running `perf` at all. See [perf-less mode](#perf-less-mode).
-
-## Other options
-### Sending logs to server
-**By default, gProfiler sends logs to Granulate Performance Studio** (when using the `--upload-results`/`-u` flag).
-This behavior can be disabled by passing `--dont-send-logs` or by setting the environment variable `GPROFILER_DONT_SEND_LOGS=1`.
-
-### Continuous mode
-gProfiler can be run in a continuous mode, profiling periodically, using the `--continuous`/`-c` flag.
-Note that when using `--continuous` with `--output-dir`, a new file will be created during *each* sampling interval.
-Aggregations are only available when uploading to the Granulate Performance Studio.
-
-## Running as a Docker container
-Run the following to have gProfiler running continuously, uploading to Granulate Performance Studio:
-```bash
-docker pull granulate/gprofiler:latest
-docker run --name gprofiler -d --restart=always \
-    --pid=host --userns=host --privileged \
-    -v /lib/modules:/lib/modules:ro -v /usr/src:/usr/src:ro \
-    -v /var/run/docker.sock:/var/run/docker.sock \
-    granulate/gprofiler:latest -cu --token <token> --service-name <service name> [options]
-```
-
-For profiling with eBPF, kernel headers must be accessible from within the container at
-`/lib/modules/$(uname -r)/build`. On Ubuntu, this directory is a symlink pointing to `/usr/src`.
-The command above mounts both of these directories.
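Since eBPF profiling silently depends on those headers being present, a quick preflight check can save a debugging round-trip. A minimal sketch (the path is the documented requirement above; the fallback behavior is the `--python-mode auto` default):

```python
import os

# PyPerf (eBPF) needs kernel headers at this path; on Ubuntu it is a symlink into /usr/src.
headers = f"/lib/modules/{os.uname().release}/build"

if os.path.isdir(headers):
    print(f"{headers} exists - PyPerf (eBPF) can be attempted")
else:
    print(f"{headers} missing - gProfiler's auto mode will fall back to py-spy")
```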
-
-## Running as an executable
-Run the following to have gProfiler running continuously, uploading to Granulate Performance Studio:
-```bash
-wget https://github.com/Granulate/gprofiler/releases/latest/download/gprofiler
-sudo chmod +x gprofiler
-sudo ./gprofiler -cu --token <token> --service-name <service name> [options]
-```
-
-gProfiler unpacks executables to `/tmp` by default; if your `/tmp` is marked with `noexec`,
-you can add `TMPDIR=/proc/self/cwd` to have everything unpacked in your current working directory.
-```bash
-sudo TMPDIR=/proc/self/cwd ./gprofiler -cu --token <token> --service-name <service name> [options]
-```
-
-#### Executable known issues
-The following platforms are currently not supported with the gProfiler executable:
-+ Alpine
-
-**Remark:** container-based execution works and can be used in those cases.
-
-## Running as a Kubernetes DaemonSet
-See [gprofiler.yaml](deploy/k8s/gprofiler.yaml) for a basic template of a DaemonSet running gProfiler.
-Make sure to insert the `GPROFILER_TOKEN` and `GPROFILER_SERVICE` variables in the appropriate location!
-
-## Running from source
-gProfiler requires Python 3.6+ to run.
-
-```bash
-pip3 install -r requirements.txt
-./scripts/copy_resources_from_image.sh
-```
-
-Then, run the following **as root**:
-```bash
-python3 -m gprofiler [options]
-```
-
-# Theory of operation
-gProfiler invokes `perf` in system wide mode, collecting profiling data for all running processes.
-Alongside `perf`, gProfiler invokes runtime-specific profilers for processes based on these environments:
-* Java runtimes (version 7+) based on the HotSpot JVM, including the Oracle JDK and other builds of OpenJDK like AdoptOpenJDK and Azul Zulu.
-  * Uses async-profiler.
-* The CPython interpreter, versions 2.7 and 3.5-3.9.
-  * eBPF profiling (based on PyPerf) requires Linux 4.14 or higher; see [Python profiling options](#python-profiling-options) for more info.
-  * If eBPF is not available for whatever reason, py-spy is used.
-* PHP (Zend Engine), versions 7.0-8.0.
-  * Uses [Granulate's fork](https://github.com/Granulate/phpspy/) of the phpspy project.
-* Ruby (versions 1.9.1 to 3.0.1)
-  * Uses [Granulate's fork](https://github.com/Granulate/rbspy) of the [rbspy](https://github.com/rbspy/rbspy) profiler.
-* NodeJS (version >= 10 for functioning `--perf-prof`):
-  * Uses `perf inject --jit` and NodeJS's ability to generate jitdump files. See [NodeJS profiling options](#nodejs-profiling-options).
-
-The runtime-specific profilers produce stack traces that include runtime information (i.e., stacks of Java/Python functions), unlike `perf` which produces native stacks of the JVM / CPython interpreter.
-The runtime stacks are then merged into the data collected by `perf`, substituting the *native* stacks `perf` has collected for those processes.
-
-## perf-less mode
-
-It is possible to run gProfiler without using `perf` - this is useful where `perf` can't be used, for whatever reason (e.g. permissions). This mode is enabled by `--perf-mode disabled`.
-
-In this mode, gProfiler uses runtime-specific profilers only, and their results are concatenated (instead of merged into the results collected by `perf`). This means that, although the results from different profilers are viewed on the same graph, they are not necessarily of the same scale: so you can compare the samples count of Java to Java, but not Java to Python.
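The scale caveat is easier to see on concrete data. A hypothetical sketch of perf-less concatenation (stack names and counts are invented for illustration; the `frame1;frame2 count` collapsed format matches what gProfiler emits):

```python
from collections import Counter

# Collapsed-stack profiles: one "frame1;frame2 <count>" line per unique stack.
java_profile = Counter({"Main.run;Handler.process": 700})  # async-profiler samples
python_profile = Counter({"app.py;serve": 90})             # py-spy samples

# perf-less mode concatenates the profiles as-is, so 700 vs. 90 reflects each
# profiler's own sampling, not the relative CPU usage of Java vs. Python.
for profile in (java_profile, python_profile):
    for stack, count in profile.items():
        print(f"{stack} {count}")
```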
-
-# Contribute
-We welcome all feedback and suggestions through GitHub Issues:
-* [Submit bugs and feature requests](https://github.com/granulate/gprofiler/issues)
-* Upvote [popular feature requests](https://github.com/granulate/gprofiler/issues?q=is%3Aopen+is%3Aissue+label%3Aenhancement+sort%3Areactions-%2B1-desc+)
-
-## Releasing a new version
-1. Update `__version__` in `__init__.py`.
-2. Create a tag with the same version (after merging the `__version__` update) and push it.
-
-We recommend going through our [contribution guide](https://github.com/granulate/gprofiler/blob/master/CONTRIBUTING.md) for more details.
-
-# Credits
-* [async-profiler](https://github.com/jvm-profiling-tools/async-profiler) by [Andrei Pangin](https://github.com/apangin). See [our fork](https://github.com/Granulate/async-profiler).
-* [py-spy](https://github.com/benfred/py-spy) by [Ben Frederickson](https://github.com/benfred). See [our fork](https://github.com/Granulate/py-spy).
-* [bcc](https://github.com/iovisor/bcc) (for PyPerf) by the IO Visor project. See [our fork](https://github.com/Granulate/bcc).
-* [phpspy](https://github.com/adsr/phpspy) by [Adam Saponara](https://github.com/adsr). See [our fork](https://github.com/Granulate/phpspy).
-* [rbspy](https://github.com/rbspy/rbspy) by the rbspy project. See [our fork](https://github.com/Granulate/rbspy).
-
-# Footnotes
-
-<a name="perf-native">1</a>: To profile native programs that were compiled without frame pointers, make sure you use `--perf-mode smart` (which is the default). Read more about it in the [Profiling options](#profiling-options) section[↩](#a1)
diff --git a/dev-requirements.txt b/dev-requirements.txt
index 64322fde8..3fb1c4e55 100644
--- a/dev-requirements.txt
+++ b/dev-requirements.txt
@@ -4,3 +4,4 @@ black==21.5b2
 mypy==0.901
 isort==5.8.0
 types-requests==0.1.9
+types-dataclasses==0.1.5;python_version < '3.7'
diff --git a/gprofiler/client.py b/gprofiler/client.py
index 78521294c..51e494244 100644
--- a/gprofiler/client.py
+++ b/gprofiler/client.py
@@ -6,7 +6,7 @@
 import gzip
 import json
 from io import BytesIO
-from typing import Dict, List, Optional, Tuple
+from typing import TYPE_CHECKING, Dict, List, Optional, Tuple
 
 import requests
 from requests import Session
@@ -14,7 +14,10 @@
 from gprofiler import __version__
 from gprofiler.exceptions import APIError
 from gprofiler.log import get_logger_adapter
-from gprofiler.utils import get_iso8601_format_time
+from gprofiler.utils import get_iso8601_format_time, get_iso8601_format_time_from_epoch_time
+
+if TYPE_CHECKING:
+    from gprofiler.system_metrics import Metrics
 
 logger = get_logger_adapter(__name__)
@@ -78,7 +81,13 @@ def _send_request(
             opts["headers"]["Content-type"] = "application/json"
             buffer = BytesIO()
             with gzip.open(buffer, mode="wt", encoding="utf-8") as gzip_file:
-                json.dump(data, gzip_file, ensure_ascii=False)  # type: ignore
+                try:
+                    json.dump(data, gzip_file, ensure_ascii=False)  # type: ignore
+                except TypeError:
+                    # This should only happen during development; log the data to get a more indicative error.
+ bad_json = str(data) + logger.exception("Given data is not a valid JSON!", bad_json=bad_json) + raise opts["data"] = buffer.getvalue() opts["params"] = self._get_query_params() + [(k, v) for k, v in params.items()] @@ -119,6 +128,8 @@ def submit_profile( profile: str, total_samples: int, profile_api_version: Optional[str], + spawn_time: float, + metrics: 'Metrics', ) -> Dict: return self.post( "profiles", @@ -127,6 +138,9 @@ def submit_profile( "end_time": get_iso8601_format_time(end_time), "hostname": self._hostname, "profile": profile, + "cpu_avg": metrics.cpu_avg, + "mem_avg": metrics.mem_avg, + "spawn_time": get_iso8601_format_time_from_epoch_time(spawn_time), }, timeout=self._upload_timeout, api_version="v2" if profile_api_version is None else profile_api_version, diff --git a/gprofiler/exceptions.py b/gprofiler/exceptions.py index 5e58be124..a16e063ca 100644 --- a/gprofiler/exceptions.py +++ b/gprofiler/exceptions.py @@ -46,3 +46,12 @@ class UninitializedStateException(Exception): class StateAlreadyInitializedException(Exception): pass + + +class BadResponseCode(Exception): + def __init__(self, response_code: int): + super().__init__(f"Got a bad HTTP response code {response_code}") + + +class ThreadStopTimeoutError(Exception): + pass diff --git a/gprofiler/main.py b/gprofiler/main.py index 310a79e70..bb6fd272f 100644 --- a/gprofiler/main.py +++ b/gprofiler/main.py @@ -13,7 +13,7 @@ import time from pathlib import Path from threading import Event -from typing import Callable, Dict, Optional +from typing import Callable, Dict, Optional, Union import configargparse from requests import RequestException, Timeout @@ -23,6 +23,9 @@ from gprofiler.docker_client import DockerClient from gprofiler.log import RemoteLogsHandler, initial_root_logger_setup from gprofiler.merge import ProcessToStackSampleCounters +from gprofiler.metadata.metadata_collector import get_current_metadata, get_static_metadata +from gprofiler.metadata.metadata_type import Metadata +from gprofiler.metadata.system_metadata import get_hostname, get_run_mode, get_static_system_info from gprofiler.profilers.java import JavaProfiler from gprofiler.profilers.perf import SystemProfiler from gprofiler.profilers.php import DEFAULT_PROCESS_FILTER, PHPSpyProfiler @@ -31,19 +34,17 @@ from gprofiler.profilers.registry import get_profilers_registry from gprofiler.profilers.ruby import RbSpyProfiler from gprofiler.state import State, init_state +from gprofiler.system_metrics import NoopSystemMetricsMonitor, SystemMetricsMonitor, SystemMetricsMonitorBase from gprofiler.types import positive_integer from gprofiler.utils import ( TEMPORARY_STORAGE_PATH, CpuUsageLogger, TemporaryDirectoryWithMode, atomically_symlink, - get_hostname, get_iso8601_format_time, - get_run_mode, grab_gprofiler_mutex, is_root, is_running_in_init_pid, - log_system_info, reset_umask, resource_path, run_process, @@ -101,8 +102,11 @@ def __init__( pyperf_user_stacks_pages: Optional[int], runtimes: Dict[str, bool], client: APIClient, + collect_metrics: bool, + collect_metadata: bool, state: State, cpu_usage_logger: CpuUsageLogger, + run_args: Dict[str, Union[bool, str, int]], include_container_names=True, profile_api_version: Optional[str] = None, remote_logs_handler: Optional[RemoteLogsHandler] = None, @@ -118,7 +122,13 @@ def __init__( self._state = state self._remote_logs_handler = remote_logs_handler self._profile_api_version = profile_api_version + self._collect_metrics = collect_metrics + self._collect_metadata = collect_metadata self._stop_event = Event() + 
self._static_metadata: Optional[Metadata] = None
+        self._spawn_time = time.time()
+        if collect_metadata and self._client is not None:
+            self._static_metadata = get_static_metadata(spawn_time=self._spawn_time, run_args=run_args)
         self._executor = concurrent.futures.ThreadPoolExecutor(max_workers=10)
         # TODO: we actually need 2 types of temporary directories.
         # 1. accessible by everyone - for profilers that run code in target processes, like async-profiler
@@ -173,6 +183,10 @@ def __init__(
         else:
             self._docker_client = None
         self._cpu_usage_logger = cpu_usage_logger
+        if collect_metrics:
+            self._system_metrics_monitor: SystemMetricsMonitorBase = SystemMetricsMonitor(self._stop_event)
+        else:
+            self._system_metrics_monitor = NoopSystemMetricsMonitor()
 
     def __enter__(self):
         self.start()
@@ -243,6 +257,7 @@ def _strip_container_data(collapsed_data):
 
     def start(self):
         self._stop_event.clear()
+        self._system_metrics_monitor.start()
 
         for prof in (
             self.python_profiler,
@@ -256,6 +271,7 @@ def start(self):
     def stop(self):
         logger.info("Stopping ...")
         self._stop_event.set()
+        self._system_metrics_monitor.stop()
 
         for prof in (
             self.python_profiler,
@@ -298,13 +314,18 @@ def _snapshot(self):
                     "Running perf failed; consider running gProfiler with '--perf-mode disabled' to avoid using perf"
                 )
             raise
-
+        metadata = (
+            get_current_metadata(self._static_metadata)
+            if self._collect_metadata and self._client
+            else {"hostname": get_hostname()}
+        )
         if self._runtimes["perf"]:
             merged_result, total_samples = merge.merge_profiles(
                 system_result,
                 process_profiles,
                 self._docker_client,
                 self._profile_api_version != "v1",
+                metadata,
             )
         else:
             assert system_result == {}, system_result  # should be empty!
             merged_result, total_samples = merge.concatenate_profiles(
                 process_profiles,
                 self._docker_client,
                 self._profile_api_version != "v1",
+                metadata,
             )
 
         if self._output_dir:
             self._generate_output_files(merged_result, local_start_time, local_end_time)
 
         if self._client:
+            metrics = self._system_metrics_monitor.get_metrics()
             try:
                 self._client.submit_profile(
-                    local_start_time, local_end_time, merged_result, total_samples, self._profile_api_version
+                    local_start_time,
+                    local_end_time,
+                    merged_result,
+                    total_samples,
+                    self._profile_api_version,
+                    self._spawn_time,
+                    metrics,
                 )
             except Timeout:
                 logger.error("Upload of profile to server timed out.")
@@ -512,6 +541,22 @@ def parse_cmd_args():
         help="Disable host PID NS check on startup",
     )
 
+    parser.add_argument(
+        "--disable-metrics-collection",
+        action="store_false",
+        default=True,
+        dest="collect_metrics",
+        help="Disable sending system metrics to the Performance Studio",
+    )
+
+    parser.add_argument(
+        "--disable-metadata-collection",
+        action="store_false",
+        default=True,
+        dest="collect_metadata",
+        help="Disable sending system and cloud metadata to the Performance Studio",
+    )
+
     args = parser.parse_args()
 
     if args.upload_results:
@@ -597,6 +642,19 @@ def setup_signals() -> None:
     signal.signal(signal.SIGTERM, sigint_handler)
 
 
+def log_system_info() -> None:
+    system_info = get_static_system_info()
+    logger.info(f"gProfiler Python version: {system_info.python_version}")
+    logger.info(f"gProfiler deployment mode: {system_info.run_mode}")
+    logger.info(f"Kernel uname release: {system_info.kernel_release}")
+    logger.info(f"Kernel uname version: {system_info.kernel_version}")
+    logger.info(f"Total CPUs: {system_info.processors}")
+    logger.info(f"Total RAM: {system_info.memory_capacity_mb / (1 << 10):.2f} GB")
+    logger.info(f"Linux distribution: {system_info.os_name} | {system_info.os_release} | 
{system_info.os_codename}") + logger.info(f"libc version: {system_info.libc_type}-{system_info.libc_version}") + logger.info(f"Hostname: {system_info.hostname}") + + def main(): args = parse_cmd_args() verify_preconditions(args) @@ -673,15 +731,17 @@ def main(): args.pyperf_user_stacks_pages, runtimes, client, + args.collect_metrics, + args.collect_metadata, state, cpu_usage_logger, + args.__dict__ if args.collect_metadata else None, not args.disable_container_names, args.profile_api_version, remote_logs_handler, args.php_process_filter, ) logger.info("gProfiler initialized and ready to start profiling") - if args.continuous: gprofiler.run_continuous() else: diff --git a/gprofiler/merge.py b/gprofiler/merge.py index 73e4e36a2..32d9a9287 100644 --- a/gprofiler/merge.py +++ b/gprofiler/merge.py @@ -12,8 +12,8 @@ from gprofiler.docker_client import DockerClient from gprofiler.log import get_logger_adapter +from gprofiler.metadata.metadata_type import Metadata from gprofiler.types import ProcessToStackSampleCounters, StackToSampleCount -from gprofiler.utils import get_hostname logger = get_logger_adapter(__name__) @@ -209,7 +209,7 @@ def _parse_perf_script(script: Optional[str]) -> ProcessToStackSampleCounters: return pid_to_collapsed_stacks_counters -def _make_profile_metadata(docker_client: Optional[DockerClient], add_container_names: bool) -> str: +def _make_profile_metadata(docker_client: Optional[DockerClient], add_container_names: bool, metadata: Metadata) -> str: if docker_client is not None and add_container_names: container_names = docker_client.container_names docker_client.reset_cache() @@ -218,13 +218,12 @@ def _make_profile_metadata(docker_client: Optional[DockerClient], add_container_ container_names = [] enabled = False - return "# " + json.dumps( - { - 'containers': container_names, - 'hostname': get_hostname(), - 'container_names_enabled': enabled, - } - ) + profile_metadata = { + 'containers': container_names, + 'container_names_enabled': enabled, + } + profile_metadata["metadata"] = metadata + return "# " + json.dumps(profile_metadata) def _get_container_name(pid: int, docker_client: Optional[DockerClient], add_container_names: bool): @@ -235,6 +234,7 @@ def concatenate_profiles( process_profiles: ProcessToStackSampleCounters, docker_client: Optional[DockerClient], add_container_names: bool, + metadata: Metadata, ) -> Tuple[str, int]: """ Concatenate all stacks from all stack mappings in process_profiles. @@ -251,7 +251,7 @@ def concatenate_profiles( total_samples += count lines.append(f"{container_name + ';' if add_container_names else ''}{stack} {count}") - lines.insert(0, _make_profile_metadata(docker_client, add_container_names)) + lines.insert(0, _make_profile_metadata(docker_client, add_container_names, metadata)) return "\n".join(lines), total_samples @@ -260,6 +260,7 @@ def merge_profiles( process_profiles: ProcessToStackSampleCounters, docker_client: Optional[DockerClient], add_container_names: bool, + metadata: Metadata, ) -> Tuple[str, int]: # merge process profiles into the global perf results. for pid, stacks in process_profiles.items(): @@ -285,4 +286,4 @@ def merge_profiles( # swap them: use the samples from the runtime profiler. 
perf_pid_to_stacks_counter[pid] = stacks - return concatenate_profiles(perf_pid_to_stacks_counter, docker_client, add_container_names) + return concatenate_profiles(perf_pid_to_stacks_counter, docker_client, add_container_names, metadata) diff --git a/gprofiler/metadata/__init__.py b/gprofiler/metadata/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/gprofiler/metadata/cloud_metadata.py b/gprofiler/metadata/cloud_metadata.py new file mode 100644 index 000000000..23ba23b2e --- /dev/null +++ b/gprofiler/metadata/cloud_metadata.py @@ -0,0 +1,159 @@ +from dataclasses import dataclass +from http.client import NOT_FOUND +from typing import Dict, List, Optional + +import requests +from requests import Response + +from gprofiler.exceptions import BadResponseCode +from gprofiler.log import get_logger_adapter +from gprofiler.metadata.metadata_type import Metadata + +METADATA_REQUEST_TIMEOUT = 5 + +logger = get_logger_adapter(__name__) + + +@dataclass +class InstanceMetadataBase: + provider: str + + +@dataclass +class AwsInstanceMetadata(InstanceMetadataBase): + region: str + zone: str + instance_type: str + life_cycle: str + account_id: str + image_id: str + instance_id: str + + +@dataclass +class GcpInstanceMetadata(InstanceMetadataBase): + provider: str + zone: str + instance_type: str + preempted: bool + preemptible: bool + instance_id: str + image_id: str + name: str + + +@dataclass +class AzureInstanceMetadata(InstanceMetadataBase): + provider: str + instance_type: str + zone: str + region: str + subscription_id: str + resource_group_name: str + resource_id: str + instance_id: str + name: str + image_info: Optional[Dict[str, str]] + + +def get_aws_metadata() -> Optional[AwsInstanceMetadata]: + # Documentation: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-categories.html + metadata_response = send_request("http://169.254.169.254/latest/dynamic/instance-identity/document") + life_cycle_response = send_request("http://169.254.169.254/latest/meta-data/instance-life-cycle") + if life_cycle_response is None or metadata_response is None: + return None + instance = metadata_response.json() + return AwsInstanceMetadata( + provider="aws", + region=instance["region"], + zone=instance["availabilityZone"], + instance_type=instance["instanceType"], + life_cycle=life_cycle_response.text, + account_id=instance["accountId"], + image_id=instance["imageId"], + instance_id=instance["instanceId"], + ) + + +def get_gcp_metadata() -> Optional[GcpInstanceMetadata]: + # Documentation: https://cloud.google.com/compute/docs/storing-retrieving-metadata + response = send_request( + "http://metadata.google.internal/computeMetadata/v1/instance/?recursive=true", + headers={"Metadata-Flavor": "Google"}, + ) + if response is None: + return None + instance = response.json() + return GcpInstanceMetadata( + provider="gcp", + zone=instance["zone"], + instance_type=instance["machineType"], + preemptible=instance["scheduling"]["preemptible"] == "TRUE", + preempted=instance["preempted"] == "TRUE", + instance_id=str(instance["id"]), + image_id=instance["image"], + name=instance["name"], + ) + + +def get_azure_metadata() -> Optional[AzureInstanceMetadata]: + # Documentation: https://docs.microsoft.com/en-us/azure/virtual-machines/linux/instance-metadata-service?tabs=linux + response = send_request( + "http://169.254.169.254/metadata/instance/compute/?api-version=2019-08-15", headers={"Metadata": "true"} + ) + if response is None: + return None + instance = response.json() + image_info = 
None
+    storage_profile = instance.get("storageProfile")
+    if isinstance(storage_profile, dict):
+        image_reference = storage_profile.get("imageReference")
+        if isinstance(image_reference, dict):
+            image_info = {
+                "image_id": image_reference["id"],
+                "image_offer": image_reference["offer"],
+                "image_publisher": image_reference["publisher"],
+                "image_sku": image_reference["sku"],
+                "image_version": image_reference["version"],
+            }
+
+    return AzureInstanceMetadata(
+        provider="azure",
+        instance_type=instance["vmSize"],
+        zone=instance["zone"],
+        region=instance["location"],
+        subscription_id=instance["subscriptionId"],
+        resource_group_name=instance["resourceGroupName"],
+        resource_id=instance["resourceId"],
+        instance_id=instance["vmId"],
+        name=instance["name"],
+        image_info=image_info,
+    )
+
+
+def send_request(url: str, headers: Optional[Dict[str, str]] = None) -> Optional[Response]:
+    response = requests.get(url, headers=headers or {}, timeout=METADATA_REQUEST_TIMEOUT)
+    if response.status_code == NOT_FOUND:
+        # It's most likely the wrong cloud provider
+        return None
+    elif not response.ok:
+        raise BadResponseCode(response.status_code)
+    return response
+
+
+def get_static_cloud_instance_metadata() -> Optional[Metadata]:
+    cloud_metadata_fetchers = [get_aws_metadata, get_gcp_metadata, get_azure_metadata]
+    raised_exceptions: List[Exception] = []
+    for fetcher in cloud_metadata_fetchers:
+        try:
+            response = fetcher()
+            if response is not None:
+                return response.__dict__
+        except Exception as exception:
+            raised_exceptions.append(exception)
+    formatted_exceptions = ', '.join([repr(exception) for exception in raised_exceptions])
+    logger.debug(
+        f"Could not get any cloud instance metadata because of the following exceptions: {formatted_exceptions}."
+        " The most likely reason is that gProfiler is not installed on an AWS, GCP or Azure instance."
+    )
+    return None
diff --git a/gprofiler/metadata/metadata_collector.py b/gprofiler/metadata/metadata_collector.py
new file mode 100644
index 000000000..4c27cc9e2
--- /dev/null
+++ b/gprofiler/metadata/metadata_collector.py
@@ -0,0 +1,32 @@
+import datetime
+from typing import Dict, Optional, Union
+
+from gprofiler import __version__
+from gprofiler.metadata.cloud_metadata import get_static_cloud_instance_metadata
+from gprofiler.metadata.metadata_type import Metadata
+from gprofiler.metadata.system_metadata import get_static_system_info
+
+
+def get_static_metadata(spawn_time: float, run_args: Optional[Dict[str, Union[bool, str, int]]] = None) -> Metadata:
+    formatted_spawn_time = datetime.datetime.utcfromtimestamp(spawn_time).replace(microsecond=0).isoformat()
+    static_system_metadata = get_static_system_info()
+    cloud_metadata = get_static_cloud_instance_metadata()
+
+    metadata_dict: Metadata = {
+        "cloud_provider": cloud_metadata.pop("provider") if cloud_metadata is not None else "unknown",
+        "agent_version": __version__,
+        "spawn_time": formatted_spawn_time,
+    }
+    metadata_dict.update(static_system_metadata.__dict__)
+    if cloud_metadata is not None:
+        metadata_dict["cloud_info"] = cloud_metadata
+    if run_args is not None:
+        metadata_dict["run_arguments"] = run_args
+    return metadata_dict
+
+
+def get_current_metadata(static_metadata: Metadata) -> Metadata:
+    current_time = datetime.datetime.utcnow().replace(microsecond=0).isoformat()
+    # Copy, so repeated snapshots don't mutate the cached static metadata.
+    dynamic_metadata = static_metadata.copy()
+    dynamic_metadata.update({"current_time": current_time})
+    return dynamic_metadata
diff --git a/gprofiler/metadata/metadata_type.py b/gprofiler/metadata/metadata_type.py
new file mode 100644
index 000000000..1e2db3fce
--- /dev/null
+++ b/gprofiler/metadata/metadata_type.py
@@ -0,0 +1,3 @@
+from typing import Dict, Union
+
+Metadata = Dict[str, Union[str, int, bool, Dict]]
diff --git a/gprofiler/metadata/system_metadata.py b/gprofiler/metadata/system_metadata.py
new file mode 100644
index 000000000..629d76976
--- /dev/null
+++ b/gprofiler/metadata/system_metadata.py
@@ -0,0 +1,236 @@
+import array
+import fcntl
+import os
+import platform
+import re
+import socket
+import struct
+import subprocess
+import sys
+import time
+from dataclasses import dataclass
+from typing import Dict, Optional, Tuple
+
+import distro  # type: ignore
+import psutil
+
+from gprofiler.log import get_logger_adapter
+from gprofiler.utils import run_in_ns, run_process
+
+logger = get_logger_adapter(__name__)
+hostname: Optional[str] = None
+RUN_MODE_TO_DEPLOYMENT_TYPE: Dict[str, str] = {
+    "k8s": "k8s",
+    "container": "containers",
+    "standalone_executable": "instances",
+    "local_python": "instances",
+}
+
+
+def get_libc_version() -> Tuple[str, str]:
+    # platform.libc_ver fails for musl, sadly (produces empty results).
+    # so we'll run "ldd --version" and extract the version string from it.
+    # not passing "encoding"/"text" - this runs in a different mount namespace, and Python fails to
+    # load the files it needs for those encodings (getting LookupError: unknown encoding: ascii)
+    def decode_libc_version(version: bytes) -> str:
+        return version.decode("utf-8", errors="replace")
+
+    ldd_version = run_process(
+        ["ldd", "--version"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, suppress_log=True, check=False
+    ).stdout
+    # catches GLIBC & EGLIBC
+    m = re.search(br"GLIBC (.*?)\)", ldd_version)
+    if m is not None:
+        return "glibc", decode_libc_version(m.group(1))
+    # catches GNU libc
+    m = re.search(br"\(GNU libc\) (.*?)\n", ldd_version)
+    if m is not None:
+        return "glibc", decode_libc_version(m.group(1))
+    # musl
+    m = re.search(br"musl libc.*?\nVersion (.*?)\n", ldd_version, re.M)
+    if m is not None:
+        return "musl", decode_libc_version(m.group(1))
+
+    return "unknown", decode_libc_version(ldd_version)
+
+
+def get_run_mode() -> str:
+    if os.getenv("GPROFILER_IN_K8S") is not None:  # set in k8s/gprofiler.yaml
+        return "k8s"
+    elif os.getenv("GPROFILER_IN_CONTAINER") is not None:  # set by our Dockerfile
+        return "container"
+    elif os.getenv("STATICX_BUNDLE_DIR") is not None:  # set by staticx
+        return "standalone_executable"
+    else:
+        return "local_python"
+
+
+def get_deployment_type(run_mode: str) -> str:
+    return RUN_MODE_TO_DEPLOYMENT_TYPE.get(run_mode, "unknown")
+
+
+def get_private_ip() -> str:
+    # Fetches the local IP. Attempts to get it locally. If it fails, it will attempt to fetch it by connecting to
+    # Google's DNS servers (8.8.8.8) and getting the local IP address of the interface it connected with.
+    try:
+        private_ips = [ip for ip in socket.gethostbyname_ex(socket.gethostname())[2] if not ip.startswith("127.")]
+    except socket.error:
+        # Could happen when a network is unavailable
+        private_ips = []
+    if not private_ips:
+        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
+        try:
+            s.connect(("8.8.8.8", 53))
+            private_ips.append(s.getsockname()[0])
+        finally:
+            s.close()
+    return private_ips[0] if private_ips else "unknown"
+
+
+def get_mac_address() -> str:
+    """
+    Gets the MAC address of the first non-loopback interface.
+    """
+
+    assert sys.maxsize > 2 ** 32, "expected to run on 64-bit!"
+    SIZE_OF_STRUCT_ifreq = 40  # correct for 64-bit
+
+    IFNAMSIZ = 16
+    IFF_LOOPBACK = 8
+    MAC_BYTES_LEN = 6
+    SIZE_OF_SHORT = struct.calcsize('H')
+    MAX_BYTE_COUNT = 4096
+
+    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_IP)
+
+    # run SIOCGIFCONF to get all interface names
+    buf = array.array('B', b'\0' * MAX_BYTE_COUNT)
+    ifconf = struct.pack('iL', MAX_BYTE_COUNT, buf.buffer_info()[0])
+    outbytes = struct.unpack('iL', fcntl.ioctl(s.fileno(), 0x8912, ifconf))[0]  # SIOCGIFCONF
+    data = buf.tobytes()[:outbytes]
+    for index in range(0, len(data), SIZE_OF_STRUCT_ifreq):
+        iface = data[index : index + SIZE_OF_STRUCT_ifreq]
+
+        # iface is now a struct ifreq which starts with the interface name.
+        # we can use it for further calls.
+        res = fcntl.ioctl(s.fileno(), 0x8913, iface)  # SIOCGIFFLAGS
+        ifr_flags = struct.unpack(f'{IFNAMSIZ}sH', res[: IFNAMSIZ + SIZE_OF_SHORT])[1]
+        if ifr_flags & IFF_LOOPBACK:
+            continue
+
+        # okay, not loopback, get its MAC address.
+        res = fcntl.ioctl(s.fileno(), 0x8927, iface)  # SIOCGIFHWADDR
+        address = struct.unpack(f'{IFNAMSIZ}sH{MAC_BYTES_LEN}s', res[: IFNAMSIZ + SIZE_OF_SHORT + MAC_BYTES_LEN])[2]
+        mac = struct.unpack(f'{MAC_BYTES_LEN}B', address)
+        address = ":".join(['%02X' % i for i in mac])
+        return address
+
+    return "unknown"
+
+
+@dataclass
+class SystemInfo:
+    python_version: str
+    run_mode: str
+    deployment_type: str
+    kernel_release: str
+    kernel_version: str
+    system_name: str
+    processors: int
+    memory_capacity_mb: int
+    hostname: str
+    os_name: str
+    os_release: str
+    os_codename: str
+    libc_type: str
+    libc_version: str
+    hardware_type: str
+    pid: int
+    mac_address: str
+    private_ip: str
+    spawn_uptime_ms: float
+
+
+def get_static_system_info() -> SystemInfo:
+    hostname, distribution, libc_tuple, mac_address, private_ip = _initialize_system_info()
+    clock = getattr(time, "CLOCK_BOOTTIME", time.CLOCK_MONOTONIC)
+    spawn_uptime_ms = time.clock_gettime(clock) * 1000  # clock_gettime() returns seconds; convert to ms
+    libc_type, libc_version = libc_tuple
+    os_name, os_release, os_codename = distribution
+    uname = platform.uname()
+    cpu_count = os.cpu_count() or 0
+    run_mode = get_run_mode()
+    deployment_type = get_deployment_type(run_mode)
+    return SystemInfo(
+        python_version=sys.version,
+        run_mode=run_mode,
+        deployment_type=deployment_type,
+        kernel_release=uname.release,
+        kernel_version=uname.version,
+        system_name=uname.system,
+        processors=cpu_count,
+        memory_capacity_mb=round(psutil.virtual_memory().total / 1024 / 1024),
+        hostname=hostname,
+        os_name=os_name,
+        os_release=os_release,
+        os_codename=os_codename,
+        libc_type=libc_type,
+        libc_version=libc_version,
+        hardware_type=uname.machine,
+        pid=os.getpid(),
+        mac_address=mac_address,
+        private_ip=private_ip,
+        spawn_uptime_ms=spawn_uptime_ms,
+    )
+
+
+def get_hostname() -> str:
+    assert hostname is not None, "hostname not initialized!"
+    return hostname
+
+
+def _initialize_system_info():
+    # initialized first
+    global hostname
+    hostname = ""
+    distribution = ("unknown", "unknown", "unknown")
+    libc_version = ("unknown", "unknown")
+    mac_address = "unknown"
+    private_ip = "unknown"
+
+    # move to host mount NS for distro & ldd.
+    # now, distro will read the files on host.
+    # also move to host UTS NS for the hostname.
+ def get_infos(): + nonlocal distribution, libc_version, mac_address, private_ip + global hostname + + try: + distribution = distro.linux_distribution() + except Exception: + logger.exception("Failed to get distribution") + + try: + libc_version = get_libc_version() + except Exception: + logger.exception("Failed to get libc version") + + try: + hostname = socket.gethostname() + except Exception: + logger.exception("Failed to get hostname") + + try: + + mac_address = get_mac_address() + except Exception: + logger.exception("Failed to get MAC address") + + try: + private_ip = get_private_ip() + except Exception: + logger.exception("Failed to get the local IP") + + run_in_ns(["mnt", "uts", "net"], get_infos) + + return hostname, distribution, libc_version, mac_address, private_ip diff --git a/gprofiler/system_metrics.py b/gprofiler/system_metrics.py new file mode 100644 index 000000000..41bda6897 --- /dev/null +++ b/gprofiler/system_metrics.py @@ -0,0 +1,108 @@ +import statistics +from abc import ABCMeta, abstractmethod +from dataclasses import dataclass +from threading import Event, RLock, Thread +from typing import List, Optional + +import psutil + +from gprofiler.exceptions import ThreadStopTimeoutError + +DEFAULT_POLLING_INTERVAL_SECONDS = 5 +STOP_TIMEOUT_SECONDS = 30 + + +@dataclass +class Metrics: + # The average CPU usage between gProfiler cycles + cpu_avg: Optional[float] + # The average RAM usage between gProfiler cycles + mem_avg: Optional[float] + + +class SystemMetricsMonitorBase(metaclass=ABCMeta): + @abstractmethod + def start(self): + pass + + @abstractmethod + def stop(self): + pass + + @abstractmethod + def _get_average_memory_utilization(self) -> Optional[float]: + raise NotImplementedError + + @abstractmethod + def _get_cpu_utilization(self) -> Optional[float]: + """ + Returns the CPU utilization percentage since the last time this method was called. 
+        """
+        raise NotImplementedError
+
+    def get_metrics(self) -> Metrics:
+        return Metrics(self._get_cpu_utilization(), self._get_average_memory_utilization())
+
+
+class SystemMetricsMonitor(SystemMetricsMonitorBase):
+    def __init__(self, stop_event: Event, polling_rate_seconds: int = DEFAULT_POLLING_INTERVAL_SECONDS):
+        self._polling_rate_seconds = polling_rate_seconds
+        self._mem_percentages: List[float] = []
+        self._stop_event = stop_event
+        self._thread = None
+        self._lock = RLock()
+
+        self._get_cpu_utilization()  # Call this once to set the necessary data
+
+    def start(self):
+        assert self._thread is None, "SystemMetricsMonitor is already running"
+        assert (
+            not self._stop_event.is_set()
+        ), "Stop condition is already set (perhaps the gProfiler was already stopped?)"
+        self._thread = Thread(target=self._continuously_poll_memory, args=(self._polling_rate_seconds,))
+        self._thread.start()
+
+    def stop(self):
+        self._stop_event.set()
+        assert self._thread is not None, "SystemMetricsMonitor is not running"
+        self._thread.join(STOP_TIMEOUT_SECONDS)
+        if self._thread.is_alive():
+            raise ThreadStopTimeoutError("Timed out while waiting for the SystemMetricsMonitor internal thread to stop")
+        self._thread = None
+
+    def _continuously_poll_memory(self, polling_rate_seconds: int):
+        while not self._stop_event.is_set():
+            current_ram_percent = psutil.virtual_memory().percent
+            self._mem_percentages.append(current_ram_percent)
+            self._stop_event.wait(timeout=polling_rate_seconds)
+
+    def _get_average_memory_utilization(self) -> Optional[float]:
+        # Make sure there's only one thread that takes out the values
+        # NOTE - Since there's currently only a single consumer, this is not necessary but is done to support multiple
+        # consumers.
+        with self._lock:
+            current_length = len(self._mem_percentages)
+            if current_length == 0:
+                return None
+            average_memory = statistics.mean(self._mem_percentages[:current_length])
+            self._mem_percentages[:current_length] = []
+            return average_memory
+
+    def _get_cpu_utilization(self) -> float:
+        # Non-blocking call. Must be called at least once before attempting to get a meaningful value.
+        # See `psutil.cpu_percent` documentation.
+ return psutil.cpu_percent(interval=None) + + +class NoopSystemMetricsMonitor(SystemMetricsMonitorBase): + def start(self): + pass + + def stop(self): + pass + + def _get_average_memory_utilization(self) -> Optional[float]: + return None + + def _get_cpu_utilization(self) -> Optional[float]: + return None diff --git a/gprofiler/utils.py b/gprofiler/utils.py index 851eceb07..d68f47f66 100644 --- a/gprofiler/utils.py +++ b/gprofiler/utils.py @@ -9,23 +9,20 @@ import glob import logging import os -import platform import random import re import shutil import socket import string import subprocess -import sys import time from functools import lru_cache from pathlib import Path from subprocess import CompletedProcess, Popen, TimeoutExpired from tempfile import TemporaryDirectory from threading import Event, Thread -from typing import Callable, List, Optional, Tuple, Union +from typing import Callable, List, Optional, Union -import distro # type: ignore import importlib_resources import psutil from psutil import Process @@ -43,7 +40,6 @@ TEMPORARY_STORAGE_PATH = "/tmp/gprofiler_tmp" gprofiler_mutex: Optional[socket.socket] -hostname: Optional[str] = None @lru_cache(maxsize=None) @@ -233,6 +229,10 @@ def pgrep_maps(match: str) -> List[Process]: return processes +def get_iso8601_format_time_from_epoch_time(time: float) -> str: + return get_iso8601_format_time(datetime.datetime.utcfromtimestamp(time)) + + def get_iso8601_format_time(time: datetime.datetime) -> str: return time.replace(microsecond=0).isoformat() @@ -304,41 +304,6 @@ def assert_program_installed(program: str): raise ProgramMissingException(program) -def get_libc_version() -> Tuple[str, bytes]: - # platform.libc_ver fails for musl, sadly (produces empty results). - # so we'll run "ldd --version" and extract the version string from it. - # not passing "encoding"/"text" - this runs in a different mount namespace, and Python fails to - # load the files it needs for those encodings (getting LookupError: unknown encoding: ascii) - ldd_version = run_process( - ["ldd", "--version"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, suppress_log=True, check=False - ).stdout - # catches GLIBC & EGLIBC - m = re.search(br"GLIBC (.*?)\)", ldd_version) - if m is not None: - return ("glibc", m.group(1)) - # catches GNU libc - m = re.search(br"\(GNU libc\) (.*?)\n", ldd_version) - if m is not None: - return ("glibc", m.group(1)) - # musl - m = re.search(br"musl libc.*?\nVersion (.*?)\n", ldd_version, re.M) - if m is not None: - return ("musl", m.group(1)) - - return ("unknown", ldd_version) - - -def get_run_mode() -> str: - if os.getenv("GPROFILER_IN_K8S") is not None: # set in k8s/gprofiler.yaml - return "k8s" - elif os.getenv("GPROFILER_IN_CONTAINER") is not None: # set by our Dockerfile - return "container" - elif os.getenv("STATICX_BUNDLE_DIR") is not None: # set by staticx - return "standalone_executable" - else: - return "local_python" - - def run_in_ns(nstypes: List[str], callback: Callable[[], None], target_pid: int = 1) -> None: """ Runs a callback in a new thread, switching to a set of the namespaces of a target process before @@ -380,54 +345,6 @@ def _switch_and_run(): t.join() -def _initialize_system_info(): - # initialized first - global hostname - hostname = "" - distribution = "unknown" - libc_version = "unknown" - - # move to host mount NS for distro & ldd. - # now, distro will read the files on host. - # also move to host UTS NS for the hostname. 
-    def get_infos():
-        nonlocal distribution, libc_version
-        global hostname
-
-        try:
-            distribution = distro.linux_distribution()
-        except Exception:
-            logger.exception("Failed to get distribution")
-
-        try:
-            libc_version = get_libc_version()
-        except Exception:
-            logger.exception("Failed to get libc version")
-
-        try:
-            hostname = socket.gethostname()
-        except Exception:
-            logger.exception("Failed to get hostname")
-
-    run_in_ns(["mnt", "uts"], get_infos)
-
-    return hostname, distribution, libc_version
-
-
-def log_system_info() -> None:
-    uname = platform.uname()
-    logger.info(f"gProfiler Python version: {sys.version}")
-    logger.info(f"gProfiler run mode: {get_run_mode()}")
-    logger.info(f"Kernel uname release: {uname.release}")
-    logger.info(f"Kernel uname version: {uname.version}")
-    logger.info(f"Total CPUs: {os.cpu_count()}")
-    logger.info(f"Total RAM: {psutil.virtual_memory().total / (1 << 30):.2f} GB")
-    hostname, distribution, libc_version = _initialize_system_info()
-    logger.info(f"Linux distribution: {distribution}")
-    logger.info(f"libc version: {libc_version}")
-    logger.info(f"Hostname: {hostname}")
-
-
 def grab_gprofiler_mutex() -> bool:
     """
     Implements a basic, system-wide mutex for gProfiler, to make sure we don't run 2 instances simultaneously.
@@ -517,11 +434,6 @@ def limit_frequency(limit: Optional[int], requested: int, msg_header: str, runti
         return requested
 
 
-def get_hostname() -> str:
-    assert hostname is not None, "hostname not initialized!"
-    return hostname
-
-
 def random_prefix() -> str:
     return ''.join(random.choice(string.ascii_letters) for _ in range(16))
diff --git a/mypy.ini b/mypy.ini
index ab0eab4e8..b8b8f0c9d 100644
--- a/mypy.ini
+++ b/mypy.ini
@@ -13,4 +13,3 @@ ignore_missing_imports = True
 ignore_missing_imports = True
 [mypy-docker.*]
 ignore_missing_imports = True
-
diff --git a/requirements.txt b/requirements.txt
index 3ca10b6f8..65f490514 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -5,3 +5,4 @@ ConfigArgParse==1.3
 distro==1.5.0
 docker==5.0.0
 six==1.16.0
+dataclasses==0.8; python_version < '3.7'
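For reviewers, a minimal sketch of how the pieces introduced in this diff fit together when used standalone (gProfiler wires them up in `main.py`; the 60-second sleep stands in for a profiling session):

```python
import time
from threading import Event

from gprofiler.metadata.metadata_collector import get_current_metadata, get_static_metadata
from gprofiler.system_metrics import SystemMetricsMonitor

stop_event = Event()
monitor = SystemMetricsMonitor(stop_event)  # primes psutil.cpu_percent() on construction
monitor.start()  # background thread samples RAM usage every 5 seconds

static_metadata = get_static_metadata(spawn_time=time.time())  # system + cloud metadata, collected once

time.sleep(60)  # a profiling session elapses

metadata = get_current_metadata(static_metadata)  # static metadata plus "current_time"
metrics = monitor.get_metrics()  # Metrics(cpu_avg=..., mem_avg=...) averaged since the last call
monitor.stop()  # sets stop_event and joins the polling thread

print(metadata["cloud_provider"], metrics.cpu_avg, metrics.mem_avg)
```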