Commit e151870

Availability modes (#1095)
* Introduce availability modes
* Address current review comments
* Auto-enable kernel session persistence if availability mode is set
* Incorporate existing kernel persistence docs
* Rename availability modes per review
* apply renaming to cli options
1 parent 538d2d4 commit e151870

10 files changed

+222
-57
lines changed
Lines changed: 149 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,149 @@
1+
# Availability modes
2+
3+
Enterprise Gateway can be optionally configured in one of two "availability modes": _standalone_ or _replication_. When configured, Enterprise Gateway can recover from failures and reconnect to any active remote kernels that were previously managed by the terminated EG instance. As such, both modes require that kernel session persistence also be enabled via `KernelSessionManager.enable_persistence=True`.
4+
5+
```{note}
6+
Kernel session persistence will be automatically enabled whenever availability mode is configured.
7+
```
8+
9+
```{caution}
10+
**Availability modes and kernel session persistence should be considered experimental!**
11+
12+
Known issues include:
13+
1. Culling configurations do not account for different nodes and therefore could result in the incorrect culling of kernels.
14+
2. Each "node switch" requires a manual reconnect to the kernel.
15+
16+
We hope to address these in future releases (depending on demand).
17+
```
18+
19+
## Standalone availability
20+
21+
_Standalone availability_ assumes that, upon failure of the original EG instance, another EG instance will be started. Upon startup of the second instance (following the termination of the first), EG will attempt to load and reconnect to all kernels that were deemed active when the previous instance terminated. This mode is somewhat analogous to the classic HA/DR mode of _active-passive_ and is typically used when node resources are at a premium or the number of replicas (in the Kubernetes sense) must remain at 1.
22+
23+
To enable Enterprise Gateway for 'standalone' availability, configure `EnterpriseGatewayApp.availability_mode=standalone` or set env `EG_AVAILABILITY_MODE=standalone`.
24+
25+
Here's an example for starting Enterprise Gateway with standalone availability:
26+
27+
```bash
28+
#!/bin/bash
29+
30+
LOG=/var/log/enterprise_gateway.log
31+
PIDFILE=/var/run/enterprise_gateway.pid
32+
33+
jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0 --log-level=DEBUG \
34+
--EnterpriseGatewayApp.availability_mode=standalone > $LOG 2>&1 &
35+
36+
if [ "$?" -eq 0 ]; then
37+
echo $! > $PIDFILE
38+
else
39+
exit 1
40+
fi
41+
```
42+
43+
## Replication availability
44+
45+
With _replication availability_, multiple EG instances (or replicas) operate at the same time, fronted by some kind of reverse proxy or load balancer. Because state still resides within each `KernelManager` instance executing within a given EG instance, we strongly suggest configuring some form of _client affinity_ (a.k.a. "sticky sessions") to avoid node switches wherever possible, since each node switch currently requires manual reconnection of the front-end.
46+
47+
```{tip}
48+
Configuring client affinity is **strongly recommended**, otherwise functionality that relies on state within the servicing node (e.g., culling) can be affected upon node switches, resulting in incorrect behavior.
49+
```
50+
51+
In this mode, when one node goes down, the subsequent request will be routed to a different node that doesn't know about the kernel. Prior to returning a `404` (not found) status code, EG will check its persisted store to determine if the kernel was managed and, if so, attempt to "hydrate" a `KernelManager` instance associated with the remote kernel. (Of course, if the kernel was running local to the downed server, chances are it cannot be _revived_.) Upon successful "hydration" the request continues as if on the originating node. Because _client affinity_ is in place, subsequent requests should continue to be routed to the "servicing node".
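The lookup order described above (local state first, then the persisted store, then `404`) can be sketched roughly as follows. This is an illustration only, not Enterprise Gateway's implementation: `get_or_hydrate`, `KernelNotFound`, and the dictionary-based stores are hypothetical names introduced for this example.

```python
from typing import Any, Callable, Dict


class KernelNotFound(Exception):
    """Hypothetical stand-in for the 404 (not found) response."""


def get_or_hydrate(
    kernel_id: str,
    local_managers: Dict[str, Any],
    persisted_sessions: Dict[str, dict],
    hydrate: Callable[[str, dict], Any],
) -> Any:
    """Return a manager for kernel_id, hydrating it from persisted state if needed."""
    # 1. The servicing node already knows about this kernel.
    manager = local_managers.get(kernel_id)
    if manager is not None:
        return manager

    # 2. Unknown locally: consult the persisted store before giving up.
    session = persisted_sessions.get(kernel_id)
    if session is None:
        raise KernelNotFound(kernel_id)

    # 3. Rebuild a manager from the persisted session and remember it, so that
    #    later requests routed here (client affinity) take the fast path.
    manager = hydrate(kernel_id, session)
    local_managers[kernel_id] = manager
    return manager
```

Caching the hydrated manager in step 3 is what lets this node act as the new "servicing node" for subsequent requests.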
52+
53+
To enable Enterprise Gateway for 'replication' availability, configure `EnterpriseGatewayApp.availability_mode=replication` or set env `EG_AVAILABILITY_MODE=replication`.
54+
55+
```{attention}
56+
To preserve backwards compatibility, if only kernel session persistence is enabled via `KernelSessionManager.enable_persistence=True`, the availability mode will be automatically configured to 'replication' if `EnterpriseGatewayApp.availability_mode` is not configured.
57+
```
58+
59+
Here's an example for starting Enterprise Gateway with replication availability:
60+
61+
```bash
62+
#!/bin/bash
63+
64+
LOG=/var/log/enterprise_gateway.log
65+
PIDFILE=/var/run/enterprise_gateway.pid
66+
67+
jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0 --log-level=DEBUG \
68+
--EnterpriseGatewayApp.availability_mode=replication > $LOG 2>&1 &
69+
70+
if [ "$?" -eq 0 ]; then
71+
echo $! > $PIDFILE
72+
else
73+
exit 1
74+
fi
75+
```
76+
77+
# Kernel Session Persistence
78+
79+
Enabling kernel session persistence allows Jupyter Notebooks to reconnect to kernels when Enterprise Gateway is restarted and forms the basis for the _availability modes_ described above. Enterprise Gateway provides two ways of persisting kernel sessions: _File Kernel Session Persistence_ and _Webhook Kernel Session Persistence_, although others can be provided by subclassing `KernelSessionManager` (see below).
80+
81+
```{attention}
82+
Due to its experimental nature, kernel session persistence is disabled by default. To enable this functionality, you must configure `KernelSessionManager.enable_persistence=True` or configure `EnterpriseGatewayApp.availability_mode` to either `standalone` or `replication`.
83+
```
84+
85+
As noted above, the availability modes rely on the persisted information relative to the kernel. This information consists of the arguments and options used to launch the kernel, along with its connection information. In essence, it consists of any information necessary to re-establish communication with the kernel.
86+
87+
## File Kernel Session Persistence
88+
89+
File Kernel Session Persistence stores kernel sessions as files in a specified directory. To enable this form of persistence, set the environment variable `EG_KERNEL_SESSION_PERSISTENCE=True` or configure `FileKernelSessionManager.enable_persistence=True`. To change the directory in which the kernel session file is being saved, either set the environment variable `EG_PERSISTENCE_ROOT` or configure `FileKernelSessionManager.persistence_root` to the directory. By default, the directory used to store a given kernel's session information is the `JUPYTER_DATA_DIR`.
90+
91+
```{note}
92+
Because `FileKernelSessionManager` is the default class for kernel session persistence, configuring `EnterpriseGatewayApp.kernel_session_manager_class` to `enterprise_gateway.services.sessions.kernelsessionmanager.FileKernelSessionManager` is not necessary.
93+
```
94+
95+
## Webhook Kernel Session Persistence
96+
97+
Webhook Kernel Session Persistence stores kernel sessions in a database of your choice by way of a web API that you provide. The API must include four endpoints:
98+
99+
- A `GET` that will retrieve a list of all kernel sessions from a database
100+
- A `GET` that will take the kernel id as a path variable and retrieve that information from a database
101+
- A `DELETE` that will delete all kernel sessions, where the body of the request is a list of kernel ids
102+
- A `POST` that will take kernel id as a path variable and kernel session in the body of the request and save it to a database where the object being saved is:
103+
104+
```
105+
{
106+
kernel_id: UUID string,
107+
kernel_session: JSON
108+
}
109+
```
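As a rough sketch of what the four endpoints above are expected to do, the following in-memory store models the operations behind them. It assumes nothing about URL layout, response format, or the database; `InMemorySessionStore` and its method names are hypothetical and exist only for this example.

```python
from typing import Dict, List, Optional


class InMemorySessionStore:
    """Hypothetical stand-in for the database behind the webhook API."""

    def __init__(self) -> None:
        self._sessions: Dict[str, dict] = {}

    def list_sessions(self) -> List[dict]:
        """GET (collection): return every stored kernel session."""
        return [
            {"kernel_id": kernel_id, "kernel_session": session}
            for kernel_id, session in self._sessions.items()
        ]

    def get_session(self, kernel_id: str) -> Optional[dict]:
        """GET (by kernel id): return one stored object, or None if absent."""
        session = self._sessions.get(kernel_id)
        if session is None:
            return None
        return {"kernel_id": kernel_id, "kernel_session": session}

    def save_session(self, kernel_id: str, kernel_session: dict) -> None:
        """POST (by kernel id): create or overwrite the stored session."""
        self._sessions[kernel_id] = kernel_session

    def delete_sessions(self, kernel_ids: List[str]) -> None:
        """DELETE: remove each listed kernel id, ignoring unknown ids."""
        for kernel_id in kernel_ids:
            self._sessions.pop(kernel_id, None)
```

A real service would wrap these operations in HTTP handlers and back them with durable storage; the shape of the `GET` responses here is an assumption, since only the saved object is specified.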
110+
111+
To enable the webhook kernel session persistence, set the environment variable `EG_KERNEL_SESSION_PERSISTENCE=True` or configure `WebhookKernelSessionManager.enable_persistence=True`. To connect the API, set the environment variable `EG_WEBHOOK_URL` or configure `WebhookKernelSessionManager.webhook_url` to the API endpoint.
112+
113+
Because `WebhookKernelSessionManager` is not the default kernel session persistence class, an additional configuration step must be taken to instruct EG to use this class: `EnterpriseGatewayApp.kernel_session_manager_class = enterprise_gateway.services.sessions.kernelsessionmanager.WebhookKernelSessionManager`.
114+
115+
### Enabling Authentication
116+
117+
Enabling authentication is an option if the API requires it for requests. Set the environment variable `EG_AUTH_TYPE` or configure `WebhookKernelSessionManager.auth_type` to be either `Basic` or `Digest`. If it is set to an empty string, authentication won't be enabled.
118+
119+
Then set the environment variables `EG_WEBHOOK_USERNAME` and `EG_WEBHOOK_PASSWORD` or configure `WebhookKernelSessionManager.webhook_username` and `WebhookKernelSessionManager.webhook_password` to provide the username and password for authentication.
120+
121+
## Bring Your Own Kernel Session Persistence
122+
123+
To introduce a different implementation, you must configure the kernel session manager class. Here's an example for starting Enterprise Gateway using a custom `KernelSessionManager` and 'standalone' availability. Note that setting `--MyCustomKernelSessionManager.enable_persistence=True` is not necessary because an availability mode is specified, but it is shown here for completeness:
124+
125+
```bash
126+
#!/bin/bash
127+
128+
LOG=/var/log/enterprise_gateway.log
129+
PIDFILE=/var/run/enterprise_gateway.pid
130+
131+
jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0 --log-level=DEBUG \
132+
--EnterpriseGatewayApp.kernel_session_manager_class=custom.package.MyCustomKernelSessionManager \
133+
--MyCustomKernelSessionManager.enable_persistence=True \
134+
--EnterpriseGatewayApp.availability_mode=standalone > $LOG 2>&1 &
135+
136+
if [ "$?" -eq 0 ]; then
137+
echo $! > $PIDFILE
138+
else
139+
exit 1
140+
fi
141+
```
142+
143+
Alternative persistence implementations using SQL and NoSQL databases would be ideal and, as always, contributions are welcome!
144+
145+
## Testing Kernel Session Persistence
146+
147+
Once kernel session persistence has been enabled and configured, create a kernel by opening up a Jupyter Notebook. Save some variable in that notebook and shut down Enterprise Gateway using `kill -9 PID`, where `PID` is the PID of the gateway. Restart Enterprise Gateway and refresh your notebook tab. If all worked correctly, the variable should be loaded without the need to rerun the cell.
148+
149+
If you are using Docker, ensure the container isn't tied to the PID of Enterprise Gateway. The container should still run after killing that PID.

docs/source/operators/config-cli.md

Lines changed: 7 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -106,6 +106,11 @@ EnterpriseGatewayApp(EnterpriseGatewayConfigMixin, JupyterApp) options
106106
will be raised on a failed match. This option requires TLS to be enabled.
107107
It does not support IP addresses. (EG_AUTHORIZED_ORIGIN env var)
108108
Default: ''
109+
--EnterpriseGatewayApp.availability_mode=<CaselessStrEnum>
110+
Specifies the type of availability. Values must be one of "standalone"
111+
or "replication". (EG_AVAILABILITY_MODE env var)
112+
Choices: any of ['standalone', 'replication'] (case-insensitive) or None
113+
Default: None
109114
--EnterpriseGatewayApp.base_url=<Unicode>
110115
The base path for mounting all API resources (EG_BASE_URL env var)
111116
Default: '/'
@@ -242,7 +247,7 @@ EnterpriseGatewayApp(EnterpriseGatewayConfigMixin, JupyterApp) options
242247
Default: None
243248
--EnterpriseGatewayApp.trust_xheaders=<CBool>
244249
Use x-* header values for overriding the remote-ip, useful when application
245-
is behing a proxy. (EG_TRUST_XHEADERS env var)
250+
is behind a proxy. (EG_TRUST_XHEADERS env var)
246251
Default: False
247252
--EnterpriseGatewayApp.unauthorized_users=<set-item-1>...
248253
Comma-separated list of user names (e.g., ['root','admin']) against which
@@ -252,7 +257,7 @@ EnterpriseGatewayApp(EnterpriseGatewayConfigMixin, JupyterApp) options
252257
Default: {'root'}
253258
--EnterpriseGatewayApp.ws_ping_interval=<Int>
254259
Specifies the ping interval(in seconds) that should be used by zmq port
255-
associated withspawned kernels.Set this variable to 0 to disable ping mechanism.
260+
associated with spawned kernels. Set this variable to 0 to disable ping mechanism.
256261
(EG_WS_PING_INTERVAL_SECS env var)
257262
Default: 30
258263
--EnterpriseGatewayApp.yarn_endpoint=<Unicode>

docs/source/operators/config-kernel-persistence.md

Lines changed: 0 additions & 39 deletions
This file was deleted.

docs/source/operators/config-security.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
# Configuring Security
1+
# Configuring security
22

33
Jupyter Enterprise Gateway does not currently perform user _authentication_ but, instead, assumes that all users
44
issuing requests have been previously authenticated. Recommended applications for this are

docs/source/operators/deploy-kubernetes.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
# Deploying Enterprise Gateway on Kubernetes
1+
# Kubernetes deployments
22

33
## Overview
44

docs/source/operators/index.rst

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -65,5 +65,5 @@ Jupyter Enterprise Gateway adheres to
6565
config-kernel-override
6666
config-dynamic
6767
config-culling
68-
config-kernel-persistence
68+
config-availability
6969
config-security

enterprise_gateway/enterprisegatewayapp.py

Lines changed: 24 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -141,9 +141,28 @@ def init_configurables(self):
141141
config=self.config, # required to get command-line options visible
142142
)
143143

144-
# Attempt to start persisted sessions
145-
# Commented as part of https://github.com/jupyter-server/enterprise_gateway/pull/737#issuecomment-567598751
146-
# self.kernel_session_manager.start_sessions()
144+
# For B/C purposes, check if session persistence is enabled. If so, and availability
145+
# mode is not enabled, go ahead and default availability mode to 'replication'.
146+
if self.kernel_session_manager.enable_persistence:
147+
if self.availability_mode is None:
148+
self.availability_mode = EnterpriseGatewayConfigMixin.AVAILABILITY_REPLICATION
149+
self.log.info(
150+
f"Kernel session persistence is enabled but availability mode is not. "
151+
f"Setting EnterpriseGatewayApp.availability_mode to '{self.availability_mode}'."
152+
)
153+
else:
154+
# Persistence is not enabled, check if availability_mode is configured and, if so,
155+
# auto-enable persistence
156+
if self.availability_mode is not None:
157+
self.kernel_session_manager.enable_persistence = True
158+
self.log.info(
159+
f"Availability mode is set to '{self.availability_mode}' yet kernel session "
160+
"persistence is not enabled. Enabling kernel session persistence."
161+
)
162+
163+
# If we're using standalone availability, attempt to start persisted sessions
164+
if self.availability_mode == EnterpriseGatewayConfigMixin.AVAILABILITY_STANDALONE:
165+
self.kernel_session_manager.start_sessions()
147166

148167
self.contents_manager = None # Gateways don't use contents manager
149168

@@ -253,11 +272,11 @@ def _build_ssl_options(self) -> Optional[ssl.SSLContext]:
253272
return ssl_context
254273

255274
def init_http_server(self):
256-
"""Initializes a HTTP server for the Tornado web application on the
275+
"""Initializes an HTTP server for the Tornado web application on the
257276
configured interface and port.
258277
259278
Tries to find an open port if the one configured is not available using
260-
the same logic as the Jupyer Notebook server.
279+
the same logic as the Jupyter Notebook server.
261280
"""
262281
ssl_options = self._build_ssl_options()
263282
self.http_server = httpserver.HTTPServer(
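
The interaction between kernel session persistence and availability mode introduced in `init_configurables` can be summarized as a small pure function. This is a simplified restatement for illustration, not code from the commit: `reconcile` is a hypothetical name, and the module-level constants stand in for the ones defined on `EnterpriseGatewayConfigMixin`.

```python
from typing import Optional, Tuple

# Stand-ins for EnterpriseGatewayConfigMixin.AVAILABILITY_* in the diff.
AVAILABILITY_STANDALONE = "standalone"
AVAILABILITY_REPLICATION = "replication"


def reconcile(
    enable_persistence: bool, availability_mode: Optional[str]
) -> Tuple[bool, Optional[str], bool]:
    """Return (enable_persistence, availability_mode, start_sessions)."""
    if enable_persistence:
        # Backwards compatibility: persistence alone implies 'replication'.
        if availability_mode is None:
            availability_mode = AVAILABILITY_REPLICATION
    elif availability_mode is not None:
        # An availability mode without persistence auto-enables persistence.
        enable_persistence = True

    # Only standalone mode reloads persisted sessions at startup.
    start_sessions = availability_mode == AVAILABILITY_STANDALONE
    return enable_persistence, availability_mode, start_sessions
```

In this sketch, persistence and an availability mode always end up enabled together, and with neither configured nothing changes.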

enterprise_gateway/mixins.py

Lines changed: 20 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,7 @@
1313
from tornado.log import LogFormatter
1414
from traitlets import (
1515
Bool,
16+
CaselessStrEnum,
1617
CBool,
1718
Instance,
1819
Integer,
@@ -269,7 +270,7 @@ def expose_headers_default(self):
269270
False,
270271
config=True,
271272
help="""Use x-* header values for overriding the remote-ip, useful when
272-
application is behing a proxy. (EG_TRUST_XHEADERS env var)""",
273+
application is behind a proxy. (EG_TRUST_XHEADERS env var)""",
273274
)
274275

275276
@default("trust_xheaders")
@@ -633,7 +634,7 @@ def max_kernels_per_user_default(self):
633634
ws_ping_interval_default_value,
634635
config=True,
635636
help="""Specifies the ping interval(in seconds) that should be used by zmq port
636-
associated withspawned kernels.Set this variable to 0 to disable ping mechanism.
637+
associated with spawned kernels. Set this variable to 0 to disable ping mechanism.
637638
(EG_WS_PING_INTERVAL_SECS env var)""",
638639
)
639640

@@ -680,6 +681,23 @@ def dynamic_config_interval_changed(self, event):
680681

681682
dynamic_config_poller = None
682683

684+
# Availability Mode
685+
AVAILABILITY_STANDALONE = "standalone"
686+
AVAILABILITY_REPLICATION = "replication"
687+
availability_mode_env = "EG_AVAILABILITY_MODE"
688+
availability_mode_default_value = None
689+
availability_mode = CaselessStrEnum(
690+
allow_none=True,
691+
values=[AVAILABILITY_REPLICATION, AVAILABILITY_STANDALONE],
692+
config=True,
693+
help="""Specifies the type of availability. Values must be one of "standalone" or "replication".
694+
(EG_AVAILABILITY_MODE env var)""",
695+
)
696+
697+
@default("availability_mode")
698+
def availability_mode_env_default(self):
699+
return os.getenv(self.availability_mode_env, self.availability_mode_default_value)
700+
683701
kernel_spec_manager = Instance("jupyter_client.kernelspec.KernelSpecManager", allow_none=True)
684702

685703
kernel_spec_manager_class = Type(
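
The `availability_mode` trait above takes its default from `EG_AVAILABILITY_MODE` and accepts values case-insensitively. A stdlib-only approximation of that behavior, assuming a mixed-case value should normalize to the canonical lowercase name, might look like the following. `resolve_availability_mode` is a hypothetical helper, not part of Enterprise Gateway or traitlets.

```python
import os
from typing import Mapping, Optional

VALID_MODES = ("replication", "standalone")


def resolve_availability_mode(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Read EG_AVAILABILITY_MODE from env; None means no mode is configured."""
    raw = env.get("EG_AVAILABILITY_MODE")
    if raw is None:
        return None
    # Case-insensitive match that returns the canonical spelling.
    for mode in VALID_MODES:
        if mode == raw.lower():
            return mode
    raise ValueError(
        f"EG_AVAILABILITY_MODE must be one of {VALID_MODES}, got {raw!r}"
    )
```

Passing the environment as a parameter keeps the function easy to exercise without mutating `os.environ`.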
