|
| 1 | + |
| 2 | +## Roadmap, planning |
| 3 | + |
| 4 | +Here, we will maintain a list of features. |
| 5 | +These may be converted into issues later. |
| 6 | +The main goal is to not forget about them. |
| 7 | + |
| 8 | +I will break this down into 2 categories. |
| 9 | +The first category is pre-alpha. |
| 10 | +I will assume that all commits will be squashed and this will be |
| 11 | +moved to a new repo. At that point it will become public. |
| 12 | +Everything to do before it becomes public is pre-alpha. |
| 13 | + |
| 14 | +### Pre-alpha |
| 15 | + |
| 16 | +#### Track more data with new types |
| 17 | + |
| 18 | +The AWX dispatcher had a model where a task could have certain parameters. |
| 19 | +For example, imagine that we add a generalized task timeout. |
| 20 | + |
| 21 | +The design of this was a little choppy, because the `@task` decorator |
| 22 | +would declare these parameters. |
| 23 | + |
| 24 | +Yet these parameters would be sent in the pg_notify JSON data. |
| 25 | + |
| 26 | +This doesn't really fit the model that we want. |
| 27 | +We want it to be _impossible_ to run a task with parameters |
| 28 | +other than those declared on `@task`. |
| 29 | + |
| 30 | +Think of it this way, there are |
| 31 | + - runtime arguments (or parameters) |
| 32 | + - configuration parameters |
| 33 | + |
| 34 | +Something like a task timeout is a configuration parameter. |
| 35 | +This implies that when we register a method with `@task` |
| 36 | +we have to **save it in a registry**. |
| 37 | + |
| 38 | +When the dispatcher gets a message saying to run a task, |
| 39 | +the most correct thing to do is look that task up in the registry. |
| 40 | +This reduces the JSON data passed, and makes a more consistent |
| 41 | +source-of-truth. |
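As a rough illustration of the registry idea, here is a minimal sketch. The names `Task`, `REGISTRY`, and `resolve`, and the message keys `task`, `args`, and `kwargs`, are all assumptions made for this example and not an existing API. The point is only that configuration such as `timeout` is stored at registration time, and a message that tries to carry anything beyond runtime arguments is rejected.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass(frozen=True)
class Task:
    """A registered callable plus its configuration parameters (hypothetical type)."""
    fn: Callable[..., Any]
    name: str
    timeout: Optional[float] = None  # configuration parameter, never sent in a message


# Hypothetical module-level registry: task name -> Task
REGISTRY: Dict[str, Task] = {}


def task(timeout: Optional[float] = None) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Register the decorated function along with its configuration."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        name = f"{fn.__module__}.{fn.__qualname__}"
        REGISTRY[name] = Task(fn=fn, name=name, timeout=timeout)
        return fn
    return decorator


def resolve(message: dict) -> Task:
    """Look up the task named in a message; only runtime arguments may be present."""
    extra = set(message) - {"task", "args", "kwargs"}
    if extra:
        raise ValueError(f"message carries undeclared parameters: {sorted(extra)}")
    return REGISTRY[message["task"]]  # KeyError if the task was never registered


@task(timeout=5)
def add(a, b):
    return a + b
```

With this shape, the message only needs the task name and runtime arguments, and the dispatcher reads `timeout` from the registry entry.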
| 42 | + |
| 43 | +But this means that we need a registry, and way before that, |
| 44 | +we need to introduce a type for methods that will be called. |
| 45 | + |
| 46 | +Additionally, the next big objective is that we want |
| 47 | +detailed timing data tracked every time a task is called. |
| 48 | +This goes with the call, not the task. |
| 49 | + |
| 50 | +So the new types we probably want are: |
| 51 | + - `Task` |
| 52 | + - `Call` |
| 53 | + |
| 54 | +This is in addition to the `PoolWorker` which is the worker that |
| 55 | +runs the task. |
| 56 | + |
| 57 | +The `Call` will track the lifecycle of the call. |
| 58 | +The `PoolWorker` will reference the `Call` it is running. |
| 59 | +The `Call` will reference the `Task` it is a call of. |
| 60 | + |
| 61 | +We don't need/want to pass any of this through the IPC queue |
| 62 | +to the worker, we only want it for the main dispatcher. |
| 63 | +This will mainly be useful as we ask the dispatcher to respond |
| 64 | +with stats about what it has been running. |
| 65 | + |
| 66 | +Also... I see this as a mechanism to write integration tests. |
| 67 | +We can submit tasks, wait for a signal they finished, |
| 68 | +and then get the work history from the dispatcher as it ran those. |
| 69 | + |
| 70 | +We should look at the `Call` class as corresponding to a log record. |
| 71 | +This should have the call details, identifiers, and mostly be a |
| 72 | +record of the call lifecycle. This is mostly log-like, and should have |
| 73 | +mostly scalar type data of floats, strings, ids, etc. |
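A minimal sketch of what a log-like `Call` record could look like follows. Every field and method name here is an assumption for illustration. To keep the record scalar, this sketch stores the task name as a string instead of a direct reference to a `Task` object, which is a simplification of the design described above.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Call:
    """Log-like record of one invocation of a task (hypothetical fields)."""
    task_name: str
    call_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    received_at: float = field(default_factory=time.monotonic)
    started_at: Optional[float] = None
    finished_at: Optional[float] = None
    worker_id: Optional[int] = None

    def mark_started(self, worker_id: int) -> None:
        """Record which worker picked up the call and when."""
        self.worker_id = worker_id
        self.started_at = time.monotonic()

    def mark_finished(self) -> None:
        """Record when the worker reported the call as done."""
        self.finished_at = time.monotonic()

    @property
    def queue_seconds(self) -> Optional[float]:
        """Time spent waiting before a worker started the call."""
        if self.started_at is None:
            return None
        return self.started_at - self.received_at

    @property
    def run_seconds(self) -> Optional[float]:
        """Time spent running, once the call has finished."""
        if self.started_at is None or self.finished_at is None:
            return None
        return self.finished_at - self.started_at
```

A list of these records kept by the main dispatcher would be one possible form for the work history that an integration test could ask for.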
| 74 | + |
| 75 | +#### Finish integrating publisher logic |
| 76 | + |
| 77 | +The content existing in `dispatcher.publish` is mostly not connected. |
| 78 | + |
| 79 | +What's interesting here is that `dispatcher.publish` should import |
| 80 | +from the broker module. |
| 81 | +That gets hard to manage with multiple connections (à la Django). |
| 82 | +But we should do some version of it... |
| 83 | + |
| 84 | +#### Finish integrating the worker loop |
| 85 | + |
| 86 | +Overlapping with the publisher stuff, the `dispatcher.publish` should |
| 87 | +get the method name, and the args, import the method, and run it. |
| 88 | + |
| 89 | +This requires moving more code in from AWX: |
| 90 | + |
| 91 | +https://github.com/ansible/awx/blob/devel/awx/main/dispatch/worker/task.py |
| 92 | + |
| 93 | +That has |
| 94 | + - importing logic |
| 95 | + - calling logic |
| 96 | + - supporting stuff to include timings |
| 97 | + - signal handling |
| 98 | + - exception handling |
| 99 | + |
| 100 | +### Post-alpha |
| 101 | + |
| 102 | +#### Conditional skipping logic on publishing |
| 103 | + |
| 104 | +AWX uses sqlite3 for unit tests, which would error on async tasks. |
| 105 | +Because of this, it did not publish a message if `is_testing` was True. |
| 106 | +It's not reasonable for us to implement that same thing here, and |
| 107 | +we will likely need some callback approach. |
| 108 | + |
| 109 | +So the ask here is that we have some app-wide configuration, |
| 110 | +which can inspect a message _before publishing_ and take some action, |
| 111 | +or possibly cancel the NOTIFY. |
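One possible shape for such a callback is sketched below. The `Publisher` class, the `pre_publish` argument, and the convention that returning `None` cancels the publish are all hypothetical, chosen only to show the idea. The `send` callable stands in for whatever actually issues the NOTIFY.

```python
import json
from typing import Callable, Optional

# Hypothetical hook signature: take the message, return it (possibly changed),
# or return None to cancel publishing.
PrePublishHook = Callable[[dict], Optional[dict]]


class Publisher:
    """Publishes messages, letting an app-wide hook inspect them first."""

    def __init__(self, send: Callable[[str, str], None],
                 pre_publish: Optional[PrePublishHook] = None) -> None:
        self._send = send  # stand-in for the code that issues the NOTIFY
        self._pre_publish = pre_publish

    def publish(self, channel: str, message: dict) -> bool:
        """Return True if the message was sent, False if the hook cancelled it."""
        if self._pre_publish is not None:
            hooked = self._pre_publish(message)
            if hooked is None:
                return False
            message = hooked
        self._send(channel, json.dumps(message))
        return True
```

An application like AWX could then pass a hook that returns `None` whenever its own testing flag is set, without the library knowing anything about that flag.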
| 112 | + |
| 113 | +Probably not good coding practice generally, but probably useful. |
| 114 | + |
| 115 | +#### Feature branch to integrate with AWX |
| 116 | + |
| 117 | +Make AWX run using this library. This should be an early goal in this stage. |
| 118 | + |
| 119 | +#### Worker and Broker Self-Checks |
| 120 | + |
| 121 | +A moderate version of this was proposed in: |
| 122 | + |
| 123 | +https://github.com/ansible/awx/pull/14749 |
| 124 | + |
| 125 | +The grand conclusion is that there is no way to ensure that the LISTEN connection |
| 126 | +is not dropped. |
| 127 | +Worse, when it is dropped, we may get no notification. |
| 128 | +Astonishingly, there appears to be no way around this. |
| 129 | + |
| 130 | +Because of this, the ultimate option of last resort must be taken. |
| 131 | +That means that we can only ensure the health of a connection or a worker |
| 132 | +by empirical means. |
| 133 | + |
| 134 | +To know if a connection works, you must publish a control message and receive it. |
| 135 | +To know if a worker is alive, you must send a message and receive a reply. |
| 136 | + |
| 137 | +Because of this knowledge, the new dispatcher library must jump straight to this eventuality. |
| 138 | +Implement checks for brokers and workers based on send-and-receive. |
| 139 | +This can be done fully with asyncio patterns. |
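A minimal sketch of a send-and-receive check with asyncio follows. `RoundTripChecker`, the control message layout, and the `on_message` entry point are assumptions for illustration. The `send` coroutine stands in for publishing over the broker or messaging a worker, and `on_message` is expected to be called by whatever code receives messages.

```python
import asyncio
import uuid
from typing import Awaitable, Callable, Dict


class RoundTripChecker:
    """Health check by send-and-receive (hypothetical helper)."""

    def __init__(self, send: Callable[[dict], Awaitable[None]]) -> None:
        self._send = send
        self._pending: Dict[str, asyncio.Future] = {}

    async def check(self, timeout: float = 5.0) -> bool:
        """Send a control message and wait for it to come back."""
        check_id = str(uuid.uuid4())
        future = asyncio.get_running_loop().create_future()
        self._pending[check_id] = future
        try:
            await self._send({"control": "check", "id": check_id})
            await asyncio.wait_for(future, timeout)
            return True
        except asyncio.TimeoutError:
            # Nothing came back in time: treat the connection or worker as unhealthy
            return False
        finally:
            self._pending.pop(check_id, None)

    def on_message(self, message: dict) -> None:
        """Called by the receiving side for every control message or reply."""
        future = self._pending.get(message.get("id"))
        if future is not None and not future.done():
            future.set_result(None)
```

A `False` result from `check` would be the signal to recycle the connection or replace the worker.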
| 140 | + |
| 141 | +For the issues related to AWX 14749, we also need means to recycle connections |
| 142 | +in cases where we fail to receive check messages. |
| 143 | + |
| 144 | +#### Worker Allocation Cookbook |
| 145 | + |
| 146 | +Several very practical problems are not intended to ever be solved by the dispatcher. |
| 147 | +However, for someone using postgres or any other modern database, |
| 148 | +combined with the dispatcher, they have the ability to solve these problems. |
| 149 | + |
| 150 | +https://github.com/ansible/awx/issues/11997 |
| 151 | + |
| 152 | +Breakdown of those problems: |
| 153 | +1. Have a node in the cluster, any node, process a task |
| 154 | +2. Have a periodic task run, anywhere in the cluster, at a certain frequency |
| 155 | + |
| 156 | +The solution for (1) is to add an entry to a table when submitting the task. |
| 157 | +Then depending on the use case, there are 2 decent options: |
| 158 | + - broadcast a message asking any willing node to run the task; each node tries to get a lock and bails if the lock is taken |
| 159 | + - run a periodic task that will use `select_for_update` to get entries and mark as received |
| 160 | + |
| 161 | +The solution for (2) in AWX uses the Solo model to track a `datetime`. |
| 162 | +This is obviously needed for the feature of _user_ schedules. |
| 163 | + |
| 164 | +#### Task Timeout |
| 165 | + |
| 166 | +When using the `@task()` decorator, we add `timeout=5` to time out after 5 seconds. |
| 167 | + |
| 168 | +A solution was drafted in the branch: |
| 169 | + |
| 170 | +https://github.com/ansible/awx/compare/devel...AlanCoding:awx:dispatcher_timeout |
| 171 | + |
| 172 | +#### Singleton Tasks |
| 173 | + |
| 174 | +AWX commonly used pg locks to prevent multiple workers running the same task, |
| 175 | +but a more efficient alternative is to never start those tasks. |
| 176 | + |
| 177 | +This proposes another argument to the `@task()` decorator that makes the task exclusive. |
| 178 | +When another version of the task is already running, there are 2 sub-options we could do: |
| 179 | + - wait for the existing task to finish before running the new task |
| 180 | + - discard the new task |
| 181 | + |
| 182 | +The use cases for AWX mainly want the 2nd one. |
| 183 | +Idempotent tasks are used extremely heavily on schedules, meaning that |
| 184 | +when the dispatcher receives too many it should simply discard extras. |
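The discard option could be sketched as a small gate kept by the main dispatcher. `SingletonGate` and its method names are hypothetical. The sketch assumes all calls happen in the single dispatcher process, so no locking is shown.

```python
from typing import Set


class SingletonGate:
    """Tracks which exclusive tasks are running so extras can be discarded."""

    def __init__(self) -> None:
        self._running: Set[str] = set()

    def try_start(self, task_name: str) -> bool:
        """Return True if the task may start, False if a copy is already running."""
        if task_name in self._running:
            return False
        self._running.add(task_name)
        return True

    def finish(self, task_name: str) -> None:
        """Mark the task as no longer running so the next message can start it."""
        self._running.discard(task_name)
```

When `try_start` returns `False`, the dispatcher would drop the message instead of handing it to a worker.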
| 185 | + |
| 186 | +#### Triggering Tasks from Tasks |
| 187 | + |
| 188 | +For the solution to (2) in the cookbook to be fully functional, |
| 189 | +it is best that tasks can directly start other tasks via messaging |
| 190 | +internal to the worker pool. |
| 191 | + |
| 192 | +This means passing some kind of object into the task being called |
| 193 | +where this object contains callbacks that can be used to |
| 194 | +trigger methods in the worker pool's finished watcher. |