RFC 0012: timers #12
- Feature Name: `timers`
- Start Date: 2024-11-22
- RFC PR: [crystal-lang/rfcs#12](https://github.com/crystal-lang/rfcs/pull/12)
- Issue: ...

# Summary
Determine a general interface and internal data structure to handle and store
timers in the Crystal runtime.

# Motivation
With the event loop overhaul made possible by [RFC 7] and achieved in [RFC 9],
which removed the libevent dependency (one we already didn't use on Windows),
we need to handle the correct execution of timers ourselves.

We must handle timers: store them into efficient data structure(s) and support
the following operations:
- create a timer;
- cancel a timer;
- execute expired timers;
- determine the next timer to expire, so we can decide for how long a process
  or thread can be suspended (usually when there is nothing to do).
|
---

Comment on lines +25 to +26:

**Contributor:** The need for this depends on the event loop. Is there any need
at all for all this complexity if the event loop is driven by io_uring? Just
emit a timeout event for each fiber that is waiting, and that's it. I'm all for
shared code between underlying event loops that are limited in what they can
do, but is there any reason to lock event loops into more structure? It also
supports timing out IO operations.

**Author:** Each event loop can use whatever it pleases, yet... there are still
"select action timeouts" that can be cancelled. Even with events to notify the
blocking waits (which we use for epoll, kqueue and IOCP), we still need to
rearm the timer after it triggered (for example) and need to know when the next
timer is expiring. I don't think ... So far, my naive vision is for ...

**Contributor:** Yes. For example using the uring op TIMEOUT_REMOVE. That said,
it was when considering what actually goes on in a select action loop that I
really came to dislike it in general. So much pointless teardown and
rearming...

No, that does not follow. It may be an issue if we are not OK waiting for the
response to the timeout removal, I guess, and it also needs to handle the race
condition where the timer is already triggering and executes before the actual
timeout removal. But it is definitely doable without.

FWIW, the uring timeout op also takes timespec structs as arguments, with the
same precision as timerfd. What uring doesn't seem to support is the periodic
part of the argument, but instead there is a MULTISHOT argument if you want
repeating triggers.

I'd suggest not using epoll at all and instead using the uring POLL op, which
does more or less the same but is a lot simpler. But in any case, I guess it
doesn't matter too much, as it doesn't really impact the public interfaces, so
in the end it can be changed when the need arises...

**Author:** Thanks for all the information! I'll have to dig much deeper into
io_uring. AFAIK all timeouts in the Linux kernel go into the timer wheel (tick
based, low precision, now loses precision over time) while timers go into
hrtimer (high precision, no ticks, nanosecond clock). I'd expect io_uring
timeouts to end up in the timer wheel, which is fine for timeouts, but I'd like
to keep ...

---
The IOCP event loop currently uses an unordered `Deque`: inserting a timer is a
simple O(1) operation, but deleting a timer needs a linear scan, and deciding
the next expiring timer or dequeuing the expired timers needs a full scan.

The Polling event loop (which wraps `epoll` and `kqueue`) uses an ordered
`Deque` and needs a linear scan for insert and delete, but getting the next
expiring timer and dequeuing the expired timers is O(1).

This is far from efficient. We can do better.

# Guide-level explanation
First, we emphasize that Crystal cannot be a realtime language (at least not
without dropping the whole stdlib) because it relies on a GC that can stop the
world at any time and for a long time; the fiber schedulers also only reach for
the event loop when there is nothing left to do. These necessarily **introduce
latencies to the execution of expired timers**.

We can sort timers into two categories, shamelessly taken from the [Hrtimers
and Beyond: Transforming the Linux Time
Subsystems](https://www.kernel.org/doc/ols/2006/ols2006v1-pages-333-346.pdf)
paper about the introduction of high resolution timers in the Linux kernel:
1. **Timeouts**: Timeouts are used primarily to detect when an event (an I/O
   completion, for example) does not occur as expected. They have low
   resolution requirements, and they are almost always removed before they
   actually expire.

   In Crystal such a `timeout` may be created before every blocking read or
   write IO operation (when configured on the IO object) or to handle the
   timeout action of a `select` statement. They're usually cancelled once the
   IO or channel operation becomes ready; when they do expire, they raise an
   `IO::TimeoutError` exception or execute the timeout branch of the `select`
   statement.

   The resolution can be low because timeouts are mostly about bookkeeping,
   for example to eventually close a connection after some time has passed: a
   10s timeout running after 11s won't create issues.
2. **Timers**: Timers are used to schedule ongoing events. They can have high
   resolution requirements, and usually expire.

   In Crystal such a `timer` is created when we call `sleep(Time::Span)` or
   `Fiber.yield`, which behaves as `sleep(0.seconds)`. There is no public API
   to cancel a sleep, and they always expire.

   The resolution must be high because timers are expected to run at the
   scheduled time. As explained above this may be hard to achieve, but we can
   still try to avoid too much latency.
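To make both categories concrete, here is how they surface in today's Crystal
code. This is an illustrative sketch: the host, port and channel are
placeholders.

```crystal
require "socket"

# A timeout, configured on the IO object: it is usually cancelled because the
# operation completes in time; if it ever expires, the read raises
# IO::TimeoutError.
socket = TCPSocket.new("example.com", 80)
socket.read_timeout = 10.seconds

# A timeout again: the timeout branch of a select statement, cancelled as soon
# as the channel operation becomes ready.
channel = Channel(String).new

select
when message = channel.receive
  puts message
when timeout(5.seconds)
  puts "no message after 5 seconds"
end

# Timers: they always expire, and there is no public API to cancel them.
sleep(50.milliseconds)
Fiber.yield # behaves as sleep(0.seconds)
```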
Both categories share common traits:

- fast `insert` operation (lower impact on performance, especially with
  parallelism);
- fast `get-min` operation (same as `insert`, but called less frequently);
- reasonably fast `delete-min` operation (only needed when processing expired
  timers).
However, they differ in these other traits:

1. Timeouts:

   - low precision (some milliseconds of drift is acceptable);
   - fast `delete` operation (likely to be cancelled);
   - must accommodate many timeouts at any given time (e.g. the c10k problem).
2. Timers (sleeps):

   - high precision (sub-millisecond and below is desirable);
   - no need for `delete` (never cancelled);
   - a more reasonable number of timers (**BOLD CLAIM TO BE VERIFIED**).
These requirements can help shape which data structure(s) to choose.

## Relative clock
Timers need a reference clock to compare the current time against. For example,
`libevent` uses the monotonic clock, and the other event loop implementations
followed suit (AFAIK).

This hasn't been an issue for the current usages in Crystal, which always
consider an interval from now (be it a timeout or a sleep).
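Concretely, every deadline is a `Time::Span` offset from the monotonic clock; a
minimal sketch:

```crystal
# A wake-up time is an interval added to the monotonic clock, never an
# absolute wall-clock time.
wake_at = Time.monotonic + 5.seconds

# ...later, the event loop checks expiration against the same clock:
expired = Time.monotonic >= wake_at
```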
# Reference-level explanation
> [!CAUTION]
> This is a rough draft, asking more questions than providing answers!
>
> The technical definition will come and evolve as we experiment and refactor
> the different event loops.
>
> For example, the technical details of abstracting the interface to be usable
> from different event loops lead to technical issues, notably around how to
> define the individual `Timer` interface, its relationship with the event
> loop's actual `Event` object (e.g. a struct pointer in the polling event
> loop), ...

**TBD**: the general internal interface, for example (loosely shaped from the
polling event loop, with different wording):
```crystal
# The type `T` must implement `#wake_at : Time::Span` and return the absolute
# time at which the timer expires (monotonic clock).
abstract class Crystal::Timers(T)
  # Schedules a timer. Returns true if it is the next timer to expire.
  abstract def schedule(timer : T) : Bool

  # Cancels a previously scheduled timer. Returns a
  # {deleted, was_next_timer_to_expire} tuple.
  abstract def cancel(timer : T) : {Bool, Bool}

  # Yields and dequeues expired timers to be executed (cancel timeout, resume
  # fiber, ...).
  abstract def dequeue_expired(& : T ->) : Nil

  # Returns the absolute time at which the next expiring timer is scheduled.
  # Returns nil if there are no timers.
  abstract def next_expiring_at? : Time::Span?
end
```
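For illustration, a hypothetical event loop tick could consume this interface
as follows to decide how long to block; `Timer`, `resume` and `system_wait` are
placeholders, not actual runtime API:

```crystal
def run_once(timers : Crystal::Timers(Timer), blocking : Bool) : Nil
  # Execute whatever already expired (resume fibers, raise timeouts, ...).
  timers.dequeue_expired { |timer| resume(timer) }

  wait =
    if wake_at = timers.next_expiring_at?
      wake_at - Time.monotonic # remaining time until the next timer expires
    elsif blocking
      nil # no timers: block until an IO event arrives
    else
      Time::Span.zero
    end

  # epoll_wait / kevent / GetQueuedCompletionStatusEx, nil meaning "block".
  system_wait(wait)
end
```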
|
|
||
| ## Data structure: min pairing heap | ||
|
|
||
A min-heap is a simple, fast and efficient tree data structure that keeps the
smallest value at the head of the tree (the rest isn't ordered). This is enough
for timers in general, as we only really need to know about the next expiring
timer; we don't need the list to be fully ordered.
From the [Wikipedia page](https://en.wikipedia.org/wiki/Pairing_heap): in
practice a d-ary heap is always faster, unless the `decrease-key` operation is
needed, in which case the pairing heap often becomes faster than supposedly
more efficient algorithms (like the Fibonacci heap).
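To give an idea of how simple the structure is, here is a minimal min pairing
heap sketch (twopass merge, non-intrusive nodes, no arbitrary `delete`); names
and details are illustrative and differ from the actual intrusive
implementation:

```crystal
# One heap node; the node with the smallest key is always the root.
class PairingHeapNode(T)
  getter value : T
  getter key : Time::Span
  property child : PairingHeapNode(T)?
  property sibling : PairingHeapNode(T)?

  def initialize(@value : T, @key : Time::Span)
  end
end

class PairingHeap(T)
  @root : PairingHeapNode(T)?

  # O(1): meld a one-node heap into the root.
  def insert(value : T, key : Time::Span) : Nil
    @root = meld(@root, PairingHeapNode(T).new(value, key))
  end

  # O(1): the minimum is always the root.
  def min? : T?
    @root.try(&.value)
  end

  # Amortized O(log n): remove the root, then rebuild the tree by merging its
  # children (the "twopass" algorithm).
  def delete_min? : T?
    if root = @root
      @root = merge_pairs(root.child)
      root.value
    end
  end

  # Melds two heaps: the root with the larger key becomes the leftmost child
  # of the other root.
  private def meld(a : PairingHeapNode(T)?, b : PairingHeapNode(T)?) : PairingHeapNode(T)?
    return b unless a
    return a unless b

    if a.key <= b.key
      b.sibling = a.child
      a.child = b
      a
    else
      a.sibling = b.child
      b.child = a
      b
    end
  end

  # First pass melds siblings two by two (left to right); second pass melds
  # the intermediate heaps into one (right to left, via recursion).
  private def merge_pairs(node : PairingHeapNode(T)?) : PairingHeapNode(T)?
    return unless node

    if second = node.sibling
      rest = second.sibling
      node.sibling = nil
      second.sibling = nil
      meld(meld(node, second), merge_pairs(rest))
    else
      node
    end
  end
end
```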
An initial implementation (twopass algorithm, no auxiliary insert, intrusive
nodes) led to slightly faster `insert` times than a d-ary heap (which needs
more swaps), especially when timers come out of order, but a noticeably slower
`delete-min`, since it must rebalance the tree. The `delete` operation however
quickly outperforms the 4-heap, even at low occupancy (a hundred timers), and
never balloons.

Despite the drawback on the `delete-min` operation, a benchmark using mixed
operations (insert + delete-min, insert + delete) showed the pairing heap to
have the best overall performance. See the [benchmark
results](https://gist.github.com/ysbaddaden/a5d98c88105ea58ba85f4db1ed814d70)
for more details.
Since it performs well for timers (add / delete-min) as well as timeouts (add /
delete and sometimes delete-min), I propose to use it to store both categories
in a single data structure.

Reference:

- [Pairing Heaps: Experiments and Analysis](https://dl.acm.org/doi/pdf/10.1145/214748.214759)
# Drawbacks

TBD.
# Rationale

This is an initial proposal for long term work to internally handle timers in
the Crystal runtime. It aims to forge the path forward as we refactor the
different event loops (`IOCP`, `Polling`), introduce new ones (`io_uring`), and
evolve the public API.

# Alternatives
## Deque

We could treat `Fiber.yield` and `sleep(0.seconds)`, and by extension any
already expired timer, specially, with a push to a mere `Deque`: there is no
need to keep these in an ordered data structure.
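A sketch of that fast path, with hypothetical names (`@ready`, `@heap` and
`wake_at` are assumptions, not actual runtime fields):

```crystal
# Hypothetical fast path: a timer that is already expired (Fiber.yield,
# sleep(0.seconds)) never enters the ordered structure.
def schedule(timer : T) : Nil
  if timer.wake_at <= Time.monotonic
    @ready.push(timer) # plain Deque(T), processed on the next tick
  else
    @heap.insert(timer)
  end
end
```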
## 4-heap (D-ary HEAP)

A [D-ary HEAP] can be implemented as a flat array to take full advantage of CPU
caches, and can be binary or higher. Even at large occupancy (a million timers)
the overall performance is excellent... except for the `delete` operation,
which cannot benefit from the tree structure and requires a linear scan. Its
performance quickly plummets at low to moderate occupancy (a thousand timers)
and becomes unbearable at higher occupancies.

Aside from timeouts, timers (sleeps) could take advantage of this data
structure, since we can't cancel a sleep (so far).
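For reference, the flat array layout relies on pure index arithmetic, which is
what makes it cache friendly (a sketch for a 4-heap):

```crystal
# In a 4-heap stored in a flat array, the children of the node at index i
# live at indexes 4i + 1 .. 4i + 4, and its parent at (i - 1) / 4.
def parent_index(index : Int32) : Int32
  (index - 1) // 4
end

def child_index(index : Int32, nth : Int32) : Int32
  4 * index + nth + 1 # nth in 0..3
end
```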
## Skip list

An alternative to heaps is the [skip list](https://en.wikipedia.org/wiki/Skip_list)
data structure. It's a simple doubly linked list, but with multiple levels: the
lowest level is the whole list, while the higher levels skip over more and more
entries, leading to quick arbitrary lookups (from the highest level down to the
lowest).

While `delete-min` has excellent performance, the increased cost of keeping the
whole list ordered on every add/remove, and of creating and deleting multiple
links, reduces the overall performance compared to the pairing heap.
## Non-cascading timer wheel

> [!NOTE]
> The concept is a total rip-off from the Linux kernel!
> - [documentation](https://www.kernel.org/doc/html/latest/timers/highres.html)
> - [LWN article](https://lwn.net/Articles/646950/) that explains the core idea;
> - [implementation](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) (warning: GPL license!)
The idea derives from the "hierarchical timing wheels" design. This is a ring
(circular array) of N slots, sub-divided into M slots, where each individual
slot represents a jiffy (or moment) with a specific precision (1ms, 4ms or
10ms, for example). Each slot is a doubly linked list of the events scheduled
for the specified jiffy. Each group of M slots represents a wheel, with less
precision the higher we climb up the wheels. When we process timers, we process
the expired timers from the "last" processed slot up to the "current" slot.

The usual disadvantage of hierarchical timer wheels is that whenever we loop
around the initial wheel we must cascade the timers from the upper wheel down
into the lower wheel. This can lead to multiple cascades in a row.
The trick is to skip the cascade altogether. This means losing precision (the
farther in the future, the larger the delta), which is unacceptable for timers,
but for timeouts? They're usually cancelled, and we don't need to run them
precisely at the scheduled time, we just need them to run.

Example table from the current Linux kernel (jiffies at 10ms precision, aka
100HZ). The ring has 512 slots in total and can accommodate timers up to 15
days from now:
```
Level Offset Granularity        Range
0     0      10 ms              0 ms - 630 ms
1     64     80 ms              640 ms - 5110 ms (640ms - ~5s)
2     128    640 ms             5120 ms - 40950 ms (~5s - ~40s)
3     192    5120 ms (~5s)      40960 ms - 327670 ms (~40s - ~5m)
4     256    40960 ms (~40s)    327680 ms - 2621430 ms (~5m - ~43m)
5     320    327680 ms (~5m)    2621440 ms - 20971510 ms (~43m - ~5h)
6     384    2621440 ms (~43m)  20971520 ms - 167772150 ms (~5h - ~1d)
7     448    20971520 ms (~5h)  167772160 ms - 1342177270 ms (~1d - ~15d)
```
The technical operations are:

- `insert`: determine the slot (relative to the current slot), then append (or
  prepend) to its linked list;
- `delete`: remove the timer from whatever linked list it is in (no need to
  look the timer up);
- `get-min`: the delta between the current slot and the first non-empty slot
  (can be sped up with a bitmap of (non-)empty slots);
- `delete-min`: process the linked list(s) as we advance the slot(s).

Aside from deciding the slot, all these operations involve mere doubly linked
list operations; see the slot calculation sketch below.
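A rough sketch of how the slot could be decided, assuming the shape of the
Linux table above (8 levels of 64 slots, each level 8x coarser than the
previous one); the real implementation also offsets by the current jiffy, which
is omitted here:

```crystal
LVL_BITS  = 6 # 64 slots per level
LVL_SIZE  = 1 << LVL_BITS
LVL_SHIFT = 3 # each level is 8x coarser than the previous one

# Maps a delta (in jiffies of 10ms) to the {level, slot} a timer goes into.
# Dividing the delta by the level granularity is what loses precision: the
# farther in the future, the coarser the slot.
def level_and_slot(delta : Int64) : {Int32, Int64}
  level = 0
  while delta >= LVL_SIZE && level < 7
    delta >>= LVL_SHIFT # round down to the coarser granularity
    level += 1
  end
  {level, delta % LVL_SIZE}
end

level_and_slot(50_i64)    # => {0, 50} (500ms from now, 10ms granularity)
level_and_slot(1_000_i64) # => {2, 15} (10s from now, 640ms granularity)
```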
**NOTE**: I didn't test this solution, and it currently sounds overkill; yet
the overall simplicity makes it a good contender to the pairing heap for
storing timeouts. In that case, maybe a dual structure (a D-ary heap for timers
and a timing wheel for timeouts) would be a better choice than a single pairing
heap?

# Prior art
- `libevent` stores events with a timer into a min-heap, but it also keeps a
  list of "common timeouts"... I didn't investigate what they mean by it
  exactly.

- Go stores all timers into a min-heap (4-ary), but allocates timers in the GC
  HEAP and merely marks cancelled timers on delete. I didn't investigate how it
  deals with the tombstones.
---

**Member:** This sounds worth investigating further to me. I actually wrote a
message above which described doing exactly this, but I deleted it when I read
this part. Tombstones can probably be kept around until dequeue, though there
may be other opportune times to delete them if scanning/moving entries anyway.

**Author:** It could be interesting to understand, but I wonder about the
benefit. Keeping the tombstones means they might stay for seconds, minutes or
hours despite having been cancelled. The occupancy would no longer be how many
active timers there are now, but the total number of timers created in the last
N seconds/minutes/hours. They also increase the cost of `delete-min`: it must
be repeated multiple times until we reach a non-cancelled timer (not cool).
We'd have to allocate the event in the GC HEAP (we currently allocate events on
the stack) and they'd stay allocated until they finally leave the 4-heap. We
can probably clear the tombstones we meet as we swap items (cool), but that
means dereferencing each pointer, which reduces the CPU cache benefit of the
flat array...

**Member:** Yes, the practicalities of the solution might outweigh any
performance benefit, but Go going that way provides signal to me it's worth
doing the benchmarking.

**Author:** It could be a cleanup during a GC collection 🤔

---
- The Linux kernel keeps timeouts in a non-cascading timing wheel, and timers
  in a red-black tree. See the [hrtimers] page.
# Unresolved questions

TBD.
# Future possibilities

The monotonic relative clock can be an issue for timers that need to execute at
a specific realtime, that is, at a time relative to the realtime clock. We
might want to introduce an explicit `Timer` type that could expire once or at a
defined interval, using different clocks (realtime, monotonic, boottime), and
be either absolute or relative to the current time.

These would fall into the *timers* category, and would change their
requirements from "never cancelled" to "sometimes cancelled", though in
practice they should probably be implemented using system timers, for example
`timerfd` on Linux, `EVFILT_TIMER` on BSD, or something else on Windows.
[RFC 7]: https://github.com/crystal-lang/rfcs/pull/7
[RFC 9]: https://github.com/crystal-lang/rfcs/pull/9
[D-ary HEAP]: https://en.wikipedia.org/wiki/D-ary_heap
[hrtimers]: https://www.kernel.org/doc/html/latest/timers/hrtimers.html