Docs: rework observability arch & translate (kubevela#1046)
* Docs: rework observability

Signed-off-by: Yin Da <[email protected]>

* Docs: update o11y v1.6

Signed-off-by: Yin Da <[email protected]>

Signed-off-by: Yin Da <[email protected]>
Somefive authored Nov 2, 2022
1 parent 92258e4 commit 2e750f6
Showing 46 changed files with 1,666 additions and 2,194 deletions.


91 changes: 91 additions & 0 deletions docs/platform-engineers/operations/o11y/installation.md
@@ -0,0 +1,91 @@
---
title: Installation
---

:::tip
Before installing the observability addons, we recommend you start with the [introduction to the observability feature](../observability).
:::

## Quick Start

To enable the observability addons, simply run the `vela addon enable` commands below.

1. Install the kube-state-metrics addon

```shell
vela addon enable kube-state-metrics
```

2. Install the node-exporter addon

```shell
vela addon enable node-exporter
```

3. Install the prometheus-server addon

```shell
vela addon enable prometheus-server
```

4. Install the loki addon

```shell
vela addon enable loki
```

5. Install the grafana addon

```shell
vela addon enable grafana
```

6. Access your Grafana through port-forwarding.

```shell
kubectl port-forward svc/grafana -n o11y-system 8080:3000
```

Now you can access your Grafana by visiting `http://localhost:8080` in your browser. The default username and password are `admin` and `kubevela` respectively.

> You can change them by adding `adminUser=super-user adminPassword=PASSWORD` to the command in step 5.
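
For example, a minimal sketch of step 5 with custom credentials (here `super-user` and `PASSWORD` are placeholders you would replace with your own values):

```shell
vela addon enable grafana adminUser=super-user adminPassword=PASSWORD
```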

You will see several pre-installed dashboards that you can use to view your system and applications. For more details on those pre-installed dashboards, see the [Out-of-the-Box](./out-of-the-box) section.

![kubevela-application-dashboard](../../../resources/kubevela-application-dashboard.png)

:::caution
**Resource**: The observability suite includes several addons which require some computation resources to work properly. The recommended installation resources for your cluster are 2 cores + 4 Gi memory.

**Version**: We recommend using KubeVela v1.6.0 or later for the observability addons. Logging is not supported in v1.5.0.
:::

:::tip
**Addon Suite**: If you want to enable these addons in one command, you can use [WorkflowRun](https://github.com/kubevela/workflow) to orchestrate the install process. It allows you to manage the addon enabling process as code and make it reusable across different systems. A minimal sketch follows this tip.
:::
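
As an illustration only, the following is a hedged sketch of such a WorkflowRun. It assumes the `addon-operation` workflow step type with `addonName`, `operation` and `args` properties; step type and field names may differ across versions, so check the [kubevela/workflow](https://github.com/kubevela/workflow) documentation for the exact schema.

```yaml
# Sketch: enable several observability addons from one WorkflowRun.
# The addon-operation step type and its properties are assumptions here.
apiVersion: core.oam.dev/v1alpha1
kind: WorkflowRun
metadata:
  name: observability
  namespace: vela-system
spec:
  workflowSpec:
    steps:
      - name: enable-prometheus-server
        type: addon-operation
        properties:
          addonName: prometheus-server
          operation: enable
          args:
            - serviceType=LoadBalancer
      - name: enable-grafana
        type: addon-operation
        properties:
          addonName: grafana
          operation: enable
```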

## Multi-cluster Installation

If you want to install the observability addons in a multi-cluster scenario, make sure your Kubernetes clusters support LoadBalancer services and are mutually accessible.

By default, the installation processes for `kube-state-metrics`, `node-exporter` and `prometheus-server` natively support multi-cluster (they will be automatically installed to all clusters). But to let the `grafana` on the control plane access the prometheus-server in managed clusters, you need to use the following command to enable `prometheus-server`.

```shell
vela addon enable prometheus-server thanos=true serviceType=LoadBalancer
```

This will install the [thanos](https://github.com/thanos-io/thanos) sidecar & query components along with prometheus-server. Then enable grafana, and you will be able to see the aggregated prometheus metrics.

You can also choose which clusters to install the addons on by using commands like the one below:

```shell
vela addon enable kube-state-metrics clusters=\{local,c2\}
```

For the `loki` addon, the storage is hosted on the hub control plane by default, and the agent ([promtail](https://grafana.com/docs/loki/latest/clients/promtail/) or [vector](https://vector.dev/)) installation supports multi-cluster. You can run the following command to let the multi-cluster agents send logs to the loki service on the `local` cluster.

```shell
vela addon enable loki agent=vector serviceType=LoadBalancer
```

> If you add new clusters to your control plane after the addons have been installed, you need to re-enable the addon for it to take effect, as sketched below.
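
For example, a hedged sketch of joining a new cluster and then re-enabling the `loki` addon (the kubeconfig path and cluster name are hypothetical):

```shell
# Join the new managed cluster to the control plane.
vela cluster join new-cluster.kubeconfig --name my-new-cluster
# Re-enable the addon so the agent gets installed to the new cluster as well.
vela addon enable loki agent=vector serviceType=LoadBalancer
```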
4 changes: 0 additions & 4 deletions docs/platform-engineers/operations/o11y/integration.md
@@ -94,10 +94,6 @@ For more details, you can refer to [vela-prism](https://github.com/kubevela/pris

It is also possible to make integrations through KubeVela's configuration management system, whether you are using the CLI or VelaUX.

### Prometheus

You can read the Configuration Management documentation for more details.

## Integrate Other Tools or Systems

There is a wide range of community tools and ecosystems that users can leverage to build their observability system, such as prometheus-operator or DataDog. So far, KubeVela does not have established best practices for those integrations. We may integrate with those popular projects through KubeVela addons in the future. Community contributions for broader explorations and more connections are also welcome.
2 changes: 1 addition & 1 deletion docs/platform-engineers/operations/o11y/logging.md
@@ -227,7 +227,7 @@ spec:

In this example, we transform nginx `combined` format logs into JSON format and add a `new_field` key to each log; its value is `new value`. Please refer to the [Vector VRL documentation](https://vector.dev/docs/reference/vrl/) for how to write Vector VRL.

If you have a special log analysis dashboard for this processing method, you can refer to [document](./visualization#dashboard-customization) to import it into grafana.
If you have a special log analysis dashboard for this processing method, you can refer to [document](./dashboard) to import it into grafana.

## Collecting file log

83 changes: 41 additions & 42 deletions docs/platform-engineers/operations/o11y/metrics.md
@@ -2,6 +2,47 @@
title: Metrics
---

## Exposing Metrics in your Application

In your application, if you want to expose the metrics of your component (like webservice) to Prometheus, you just need to add the `prometheus-scrape` trait as follows.

```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
name: my-app
spec:
components:
- name: my-app
type: webservice
properties:
image: somefive/prometheus-client-example:new
traits:
- type: prometheus-scrape
```

You can also explicitly specify which port and which path to expose metrics.

```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
name: my-app
spec:
components:
- name: my-app
type: webservice
properties:
image: somefive/prometheus-client-example:new
traits:
- type: prometheus-scrape
properties:
port: 8080
path: /metrics
```

This will let your application be scraped by the prometheus server. If you want to see those metrics on Grafana, you need to further create a Grafana dashboard. Go to [Dashboard](./dashboard) to learn the following steps.
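
To sanity-check that metrics are being exposed before building dashboards, a hedged sketch (assuming the port and path configured above, and that the component runs as a Deployment named `my-app`):

```shell
# Forward the metrics port of the example component locally.
kubectl port-forward deploy/my-app 8080:8080
# In another terminal, fetch the metrics endpoint.
curl http://localhost:8080/metrics
```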

## Customized Prometheus Installation

If you want to customize your prometheus-server installation, you can put your configuration into an individual ConfigMap, like `my-prom` in the `o11y-system` namespace. To distribute your custom config to all clusters, you can also use a KubeVela Application to do the job, as sketched below.
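
The following is a hedged sketch of such a distribution Application, assuming the `k8s-objects` component type and the `topology` policy; the recording-rule content inside the ConfigMap is purely hypothetical and should be replaced with your own configuration.

```yaml
# Sketch: wrap the custom ConfigMap in an Application and deliver it to all clusters.
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: my-prom-config
  namespace: o11y-system
spec:
  components:
    - name: my-prom
      type: k8s-objects
      properties:
        objects:
          - apiVersion: v1
            kind: ConfigMap
            metadata:
              name: my-prom
              namespace: o11y-system
            data:
              # Hypothetical recording rule; replace with your real custom config.
              my-rules.yaml: |
                groups:
                  - name: example
                    rules:
                      - record: apiserver_request:rate5m
                        expr: sum(rate(apiserver_request_total[5m]))
  policies:
    - name: distribute-to-all-clusters
      type: topology
      properties:
        clusterLabelSelector: {}
```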
@@ -78,45 +119,3 @@ vela addon enable prometheus-server storage=1G
```

This will create PersistentVolumeClaims and let the addon use the provided storage. The storage will not be automatically recycled even if the addon is disabled. You need to clean up the storage manually.
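
A hedged sketch of the manual cleanup (assuming the PersistentVolumeClaims are created in the `o11y-system` namespace; check the actual names and namespace in your cluster first):

```shell
# List the PVCs left behind by the addon.
kubectl get pvc -n o11y-system
# Delete the ones you no longer need (replace <pvc-name> with a real name).
kubectl delete pvc <pvc-name> -n o11y-system
```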


94 changes: 94 additions & 0 deletions docs/platform-engineers/operations/o11y/out-of-the-box.md
@@ -0,0 +1,94 @@
---
title: Out of the Box
---

By default, a series of dashboards are pre-installed with the `grafana` addon and provide basic panels for viewing observability data. If you follow the [installation guide](./installation), you should be able to use these dashboards without further configuration.

## Dashboards

### KubeVela Application

This dashboard shows the basic information for one application.

URL: http://localhost:8080/d/application-overview/kubevela-applications

![kubevela-application-dashboard](../../../resources/kubevela-application-dashboard.png)

:::info
The KubeVela Application dashboard shows an overview of the metadata for the application. It directly accesses the Kubernetes API to retrieve the runtime application information, so you can use it as an entry point. You can navigate to detailed information for application resources by clicking the `Details` link in the *Managed Resources* panel.

The **Basic Information** section extracts key information into panels and gives you the most straightforward view of the current application.

The **Related Resources** section shows those resources that work together with the application itself, including the managed resources, the recorded ResourceTrackers and the revisions.

:::

### Kubernetes Deployment

This dashboard shows the overview of native deployments. You can navigate deployments across clusters.

URL: http://localhost:8080/d/kubernetes-deployment/kubernetes-deployment

![kubernetes-deployment-dashboard](../../../resources/kubernetes-deployment-dashboard.jpg)

:::info
The Kubernetes Deployment dashboard gives you the detailed running status of the deployment.

The **Pods** panel shows the pods that the deployment itself is currently managing.

The **Replicas** panel shows how the number of replicas changes, which can be used to diagnose when and how your deployment shifted to an undesired state.

The **Resource** section includes the details of the resource usage (including CPU / Memory / Network / Storage), which can be used to identify whether the pods of the deployment are facing resource pressure or sending/receiving unexpected traffic.

There is a list of dashboards for various types of Kubernetes resources, such as DaemonSet and StatefulSet. You can navigate to those dashboards depending on your workload type.

:::

### KubeVela System

This dashboard shows the overview of the KubeVela system. It can be used to see whether the KubeVela controller is healthy.

URL: http://localhost:8080/d/kubevela-system/kubevela-system

![kubevela-system](../../../resources/kubevela-system.jpg)

:::info
The KubeVela System dashboard gives you the running details of the KubeVela core modules, including the controller and the cluster-gateway. Other modules like velaux or prism are expected to be added in the future.

The **Computation Resource** section shows the usage of the core modules. It can be used to track whether there is a memory leak (the memory usage is continuously increasing) or high pressure (the CPU usage is always very high). If the memory usage hits the resource limit, the corresponding module will be killed and restarted, which indicates a lack of computation resources. You should add more CPU/Memory for them.

The **Controller** section includes a wide range of panels that can help you diagnose the bottleneck of the KubeVela controller in your scenario.

The **Controller Queue** and **Controller Queue Add Rate** panels show you how the controller working queue changes. If the controller queue keeps increasing, it means there are too many applications or application changes in the system and the controller is unable to handle them in time, which indicates performance issues in the KubeVela controller. A temporary increase of the controller queue is tolerable, but if it persists for a long time, memory usage will grow and eventually cause Out-Of-Memory problems.

The **Reconcile Rate** and **Average Reconcile Time** panels give you an overview of the controller status. If the reconcile rate is steady and the average reconcile time is reasonable (like under 500ms, depending on your scenario), your KubeVela controller is healthy. If the controller queue add rate is increasing but the reconcile rate does not go up, the controller queue will gradually grow and cause trouble. There are various cases in which your controller can be unhealthy:

1. Reconciles are healthy but there are too many applications; you will find everything is okay except that the controller queue metrics keep increasing. Check the CPU/Memory usage of the controller. You might need to add more computation resources.
2. Reconciles are not healthy due to too many errors. You will find lots of errors in the **Reconcile Rate** panel. This means your system is continuously facing processing errors for applications. It could be caused by invalid application configurations or unexpected errors while running workflows. Check application details and see which applications are causing errors.
3. Reconciles are not healthy due to long reconcile times. You need to check the **ApplicationController Reconcile Time** panel and see whether it is a common case (the average reconcile time is high) or only part of your applications have problems (the p95 reconcile time is high). For the former case, it is usually caused by either insufficient CPU (CPU usage is high) or too many requests being rate-limited by kube-apiserver (check the **ApplicationController Client Request Throughput** and **ApplicationController Client Request Average Time** panels and see which resource requests are slow or excessive). For the latter case, you need to check which application is large and uses lots of time for reconciliations.

Sometimes you might need to refer to the **ApplicationController Reconcile Stage Time** panel and see whether some particular reconcile stage is abnormal. For example, GCResourceTrackers using lots of time means there might be blocking in recycling resources in the KubeVela system.

The **Application** section shows the overview of the applications in your whole KubeVela system. It can be used to see the changes in the number of applications and the workflow steps being used. The **Workflow Initialize Rate** is an auxiliary panel which can be used to see how frequently new workflow executions are launched. The **Workflow Average Complete Time** can further show how long it takes to finish the whole workflow.

:::

### Kubernetes APIServer

This dashboard shows the running status of all Kubernetes apiservers.

URL: http://localhost:8080/d/kubernetes-apiserver/kubernetes-apiserver

![kubernetes-apiserver](../../../resources/kubernetes-apiserver.jpg)

:::info
The Kubernetes APIServer dashboard helps you see the most fundamental part of your Kubernetes system. If your Kubernetes APIServer is not running healthily, all controllers and modules in your Kubernetes system will be abnormal and unable to handle requests successfully. So it is important to make sure everything is fine in this dashboard.

The **Requests** section includes a series of panels showing the QPS and latency for various kinds of requests. Usually, your APIServer could fail to respond if it is flooded by too many requests. At that point, you can see which type of request is causing trouble.

The **WorkQueue** section shows the processing status of the Kubernetes APIServer. If the **Queue Size** is large, it means the number of requests exceeds the processing capability of your Kubernetes APIServer.

The **Watches** section shows the number of watches in your Kubernetes APIServer. Compared to other types of requests, WATCH requests continuously consume computation resources in the Kubernetes APIServer, so it is helpful to keep the number of watches limited.

:::

