Store timelines in internal catalogs #848
Conversation
Usually that involves creating a new C module that will include all the needed headers. Here, that would be something like a “pgsql_timeline.c” module I suppose.
At the moment we fetch the timeline mostly for DEBUG purposes. Supporting a TLI change in Logical Decoding is still an open topic in pgcopydb. We need to have enough information to decide if we can follow a TLI change, depending on what we replayed already in the past etc. So for the moment, we just store the information in a way that's easy to review/debug, but we do not use the information anywhere yet.
See #834
Looks good! I think there is a major aspect that needs a revisit, about the allocation strategy for the timeline history bits. We receive them in a single chunk of memory, but having a static size won't be good enough.
Other than that, the usual amount of nitpicking.
Thanks for the idea of a new module. Here is some information on how I created the new module:
I only store the current timeline details in memory now. After moving some portion of code to a new module, some of those changes may escape the attention of the reader/reviewer.
The function catalog_lookup_timeline is renamed to catalog_lookup_timeline_history for consistency with the other functions in the catalog module. The memory representation is called a TimelineHistoryEntry, so the function name should reflect that.
Another round of review, the PR is now taking a good shape! thanks
src/bin/pgcopydb/catalog.c
Outdated
bool
catalog_add_timeline_history(void *ctx, TimelineHistoryEntry *entry)
{
	DatabaseCatalog *catalog = (DatabaseCatalog *) ctx;
Why use a void * context that actually is a DatabaseCatalog * argument? It would be better to skip the cast and have the compiler validation for us... is there a reason to do this that I can't see and that is better than having the compiler checks?
I could not figure out the circular dependency in our headers. In the end I ended up writing a forward declaration for DatabaseCatalog to circumvent that.
We might need a new header pgsql_timeline.h that could use both the catalogs and the lower-level pgsql header?
I created a new header in a new commit, and removed the forward declaration for DatabaseCatalog. It contains the declaration of a function defined in pgsql.c, just because I can not include the file that defines the struct DatabaseCatalog in pgsql.h.
/* pgsql.c */
bool pgsql_start_replication(LogicalStreamClient *client, DatabaseCatalog *catalog,
char *cdcPathDir);
src/bin/pgcopydb/pgsql_timeline.c
Outdated
bool
parseTimelineHistory(const char *filename, const char *content,
					 IdentifySystem *system, void *context)
{
Here too I believe using DatabaseCatalog * rather than void * would make sense. What do you think?
Also is the filename necessary here at all?
I also think that it makes sense to use DatabaseCatalog * here. However, it is not trivial to solve the circular dependency. We have similar issues in other occurrences of void pointers in this branch. Sharing an example:
- parseTimelineHistory declaration is at pgsql.h, implementation is at pgsql_timeline.c
- DatabaseCatalog definition is in schema.h
- schema.h includes pgsql.h
- If I want to add an include directive for schema.h in pgsql.h, we get a circular dependency.
Shall I move things to a separate module and try to come up with a better hierarchy?
graph TD;
parseTimelineHistory --> pgsql.h
parseTimelineHistory --> pgsql_timeline.c
DatabaseCatalog --> schema.h
schema.h --> pgsql.h
pgsql.h ==does not work==> schema.h
I can also use forward declarations for problematic structs. I never liked them, and I avoid them usually, but there may be no other option.
I guess there is no other way around it. I'll add the following to pgsql.h:
/* forward declaration */
typedef struct DatabaseCatalog DatabaseCatalog;
See my other comment about pgsql_timeline.h maybe?
src/bin/pgcopydb/pgsql_timeline.c
Outdated
log_trace("parseTimelineHistory line %lld is \"%s\"",
		  (long long) lineNumber,
		  lbuf.lines[lineNumber]);
If we get to store the history file as-is on-disk then we should be able to remove that line entirely.
It is doable but not so easy right now. Currently I have the following lines of code on my branch:
if (!pgsql_start_replication(&stream, specs->sourceDB))
{
/* errors have already been logged */
return false;
}
/* write the wal_segment_size and timeline history files */
if (!stream_write_context(specs, &stream))
{
/* errors have already been logged */
return false;
}
The first function call, reads the contents of the file, parses them and stores records in internal catalog. We do not store the file contents in memory like we used to.
The second function writes the information in memory to files on disk. There is no elegant way I can think of that can continue to uphold separation of concerns here. I will add a separate commit that will change the first function to write the files to disk. In case you object, I can revert that and come up with another method. We can eventually move away from writing timeline history to disk if we can not come up with an elegant solution.
Yeah that's why the placement of my comment about storing to disk was at this seemingly random place. I would like to store the file exactly as Postgres makes it available to us, only for debugging purposes, in case our parsing of the file were to fail.
I see now what you've done, and I think that's good. When implementing debug traces that are not used in the normal flow of operations, or even in the code, I don't think separation of concerns is as imperative. Here, we want to log-to-disk the actual result from the Postgres protocol, before any kind of processing...
After reading the following sentence in the last comment, I realized that I did not print a debug message when writing the tli history file.
When implementing debug traces that are not used in the normal flow of operations, or even in the code, I don't think separation of concerns is as imperative.
I pushed a new commit 2eacb24 that adds a debug message similar to what we had prior to this PR.
Only thinking about it now, sorry: what could make sense is to store the timeline data received from the protocol directly to disk, no parsing, and then have a function that reads the file one line at a time and parses the content etc.
This would be better at memory management, and also separation of concerns and internal API/headers.
What do you think?
I actually write contents to disk before parsing it.
Currently the pgsql_identify_system function does the following calls:
- parseTimelineHistoryResult parses PGResult into a TimelineHistoryResult
- writeTimelineHistoryFile writes the TimelineHistoryResult content that holds raw data to disk
- parseTimelineHistoryfile parses the TimelineHistoryResult and stores it in internal catalog
See the other comments I have added, with more details. I think we have an opportunity to implement things in a way that start_logical_decoding does not need access to our SQLite catalogs.
src/bin/pgcopydb/pgsql_timeline.c
Outdated
log_debug("Wrote tli %s timeline history file \"%s/%s\"",
		  system->currentTimeline.tli, cdcPathDir, hContext.filename);
This should be done in writeTimelineHistoryFile and before the call to write_file, so that if writing to the file fails or takes a long time, we would already have printed the log information about it.
Ok. I'll move this before the call to write_file and change the wording slightly (e.g. Wrote -> Writing).
In my defense, in stream_write_context we print the logs only after successfully writing the files to disk.
src/bin/pgcopydb/pgsql_timeline.c
Outdated
context->content = strdup(value);
I was thinking we might want to write to the file at this place in the code, then skip the strdup entirely, and keep only the filename in our internal structure.
Done. After implementing the iterator API from scratch, I needed to move the filename to another structure. However, I think it is fine as I did not duplicate the file content and store only the file name.
src/bin/pgcopydb/pgsql_timeline.c
Outdated
LinesBuffer lbuf = { 0 };

if (!splitLines(&lbuf, (char *) content))
{
	/* errors have already been logged */
	return false;
}
If we write the file to disk as soon as we receive it, and without duplicating the contents in memory, then we don't need splitLines here; instead of looping through the array of lines we could read the file line-by-line using the iterator and a callback function.
I implemented the iterator API, and removed the splitLines here in favor of our file iterators.
If I can share some context, I did not really write this piece of code myself. I just moved it from one file to another, as it made sense to group all timeline related functions in the same module. We had a function that was good at parsing the complete file by itself, and now we have timeline_iter_history[|_init|_next|_finish] functions that handle portions of it.
src/bin/pgcopydb/ld_stream.c
Outdated
- if (!pgsql_start_replication(&stream))
+ if (!pgsql_start_replication(&stream, specs->sourceDB, specs->paths.dir))
The whole idea behind the latest opened discussion in this review is to allow skipping the sourceDB when doing pgsql_start_replication(), getting back to your point about separation of concerns. At replication start, only store the timeline history to a file.
In stream_write_context() (or maybe in a new function, if needed?) we can then parse the history file with a file iterator and store our own format in the SQLite catalogs.
Quick review from the metro, hope it's useful. I am surprised we need to define our own iterator. Expected that a callback to the file line iterator would do.
src/bin/pgcopydb/pgsql.c
Outdated
@@ -4182,7 +3816,7 @@ pgsql_timestamptz_to_string(TimestampTz ts, char *str, size_t size)
  * Send the START_REPLICATION logical replication command.
  */
 bool
-pgsql_start_replication(LogicalStreamClient *client)
+pgsql_start_replication(LogicalStreamClient *client, char *cdcPathDir)
Can we add the cdcPathDir to the LogicalStreamClient structure?
src/bin/pgcopydb/pgsql_timeline.h
Outdated
bool timeline_iter_history_next(TimelineHistoryIterator *iter);
bool timeline_iter_history_finish(TimelineHistoryIterator *iter);

bool timeline_history_add_hook(void *context, TimelineHistoryEntry *entry);
I'm not fond of the name. Not a showstopper. Will think. The idea is to describe what the hook does, not where we call it...
I ended up removing this completely. Relevant discussion at other comment: #848 (comment)
src/bin/pgcopydb/pgsql_timeline.c
Outdated
}

iter->filename = filename;
iter->currentTimeline = context->currentTimeline;
The iterator should never look into the context.
This is a great point. Applying a fix for this allowed me to make the following improvements:
- I passed the value as an argument and had a slimmer context.
- Now that the context had a single attribute, I deleted it completely.
- A function with a bad name, timeline_history_add_hook, that was a wrapper for catalog_add_timeline_history was no longer needed, as I did not need the logic to access a subset of values in a context.

When the number of parameters gets large, I start to think of context structs as a means to group them. That was the wrong way to think, and I see that now.
I had 2 main issues that were not easy to overcome if I used the current file line iterators and the file_iter_lines function.

Problem 1: When parsing a line, I needed to store some values in the iterator, and that was not possible with the current file iterator. I believe we need to be able to pass custom iterators for that to be useful. I considered updating the current implementation of file_iter_lines to accept an iterator instead of a fixed set of arguments that are used to construct the iterator in the function body. That would have worked out for me.

Problem 2: I need to be able to call the callback function after all the lines are processed. This way, we are able to add a final timeline entry for the current timeline. This was not easy to accomplish in the current implementation of file_iter_lines.

Below is a comment from @dimitri as he mistakenly edited the original comment instead of quoting it in a new comment:
It seems to me that there is still confusion about the responsibilities of the iterator parts of the code, which should only know how to read a line at a time, and the callback/context side of things, where I expected we would deal with the rest. It might be my confusion though, so I will have another look later today and maybe try to actually write it the way I thought. Sometimes it's the only way to understand the difference between ideas and reality...
Thanks for pushing all the work Hanefi, I like what we ended up with!
This is a WIP PR that still has some rough edges.
TODO:
- pgsql.c. I'll try to find a workaround