Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add byte-level progress to pgcopydb list progress #839

Closed

Conversation

rimbi
Copy link
Contributor

@rimbi rimbi commented Jul 9, 2024

Fixes: #388

Allows the user to see the progress of the copy operation by executing the pgcopydb list progress command. The user will see that the duration and the bytes transferred will be increasing.

@rimbi rimbi linked an issue Jul 9, 2024 that may be closed by this pull request
@rimbi rimbi requested a review from hanefi July 9, 2024 13:59
@hanefi hanefi changed the title [WIP] Fixes #388: Add byte-level progress to pgcopydb list progress Add byte-level progress to pgcopydb list progress Jul 9, 2024
@hanefi hanefi marked this pull request as draft July 9, 2024 15:06
@rimbi rimbi marked this pull request as ready for review July 10, 2024 05:40
@rimbi rimbi requested a review from dimitri July 10, 2024 05:40
@rimbi rimbi self-assigned this Jul 10, 2024
@rimbi rimbi force-pushed the 388-add-byte-level-progress-to-pgcopydb-list-progress branch from b3f436b to f82a0be Compare July 10, 2024 09:57
Copy link
Owner

@dimitri dimitri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR misses the mark on applying the “Separation of Concerns” principle, as detailed in the review. The callback context should handle its own specific data in a way that the CopyArgs does not have to know about.

Also the PR comment invokes the “Single Source of Truth” principle, but I read the code as a violation of that principle, wherein it was previously made in a way that the single source of truth would be the SQLite catalogs, so that external processes can have a look at the same information as the internal code. Maybe the code evolved in a way that makes this not true anymore, but I think I would like to get back in that direction. For instance, we could build a pgcopydb watch command that acts like a htop command for pgcopydb: this would be an external process that would ideally just use the SQLite contents.

src/bin/pgcopydb/copydb.h Outdated Show resolved Hide resolved
src/bin/pgcopydb/extensions.c Outdated Show resolved Hide resolved
src/bin/pgcopydb/pgsql.c Outdated Show resolved Hide resolved
src/bin/pgcopydb/pgsql.c Outdated Show resolved Hide resolved
src/bin/pgcopydb/pgsql.h Outdated Show resolved Hide resolved
Comment on lines -1869 to -1874
if (!summary_pretty_print_timing(catalog, timing))
{
/* errors have already been logged */
return false;
}

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does that work now with respect to the duration_pretty column that still exists in the SQLite catalogs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the existing sources, there seems to be ne need to keep the field in the table.

Comment on lines 2325 to 2328
/*
* Skip reading `ppDuration` from DB and create it from `durationMs`
* in accordance with the single source of truth principle.
*/
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This principle was already implemented before, where the decision was made to compute the data before SQLite storage and then use what is in the SQLite database. This allows any external tooling to have access to the same data as our internal code. Including commands that run alongside the main pgcopydb process rather than as a child process.

IOW I believe you introduced a regression here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem was that the data in these 2 fields were inconsistent. What are the external tools which use these fields?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Is there a way to fix the bug wrt inconsistencies here, and keeping the data all in the SQLite database file?
  • Could the inconsistency be an explanation for End report after a db clone shows higher numbers than expected #697?
  • I mean “external” as in “not running within the process-tree with access to all the memory etc”, so that means the command pgcopydb list progress is already an “external” tool. All it has access to is the SQLite database file contents.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Probably there is. But is it a good idea to keep the 2 versions of the same data in the DB?
  • I don't know, maybe.
  • If we are talking about the pgcopydb then that means we have no problem here.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we use our own version of SQLite and we know we have JSON support in SQL, we could rewrite all the SQLite to JSON transformation code and use a SQL query instead. It was my impression that having the data in two forms in the SQLite database was useful for that.

Another use-case is debugging from the SQLite data file when something strange happens. Then we might appreciate having the data available in pretty printed format for debugging.

The use-case I mentioned before still do exists, it's nice to be able to re-use the data from the SQLite data file either in numeric format or in pretty-printed format without having to re-compute the pretty-printed format every time. It's always the same with caching: do you want to use more CPU or more memory?

Comment on lines 2339 to 2342
/*
* Skip reading `ppBytes` from DB and create it from `bytes`
* in accordance with the single source of truth principle.
*/
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as before. What about external processes that might want to access to the telemetry?

Comment on lines 1252 to 1297
/*
* copydb_mark_table_as_done creates the table doneFile with the expected
* summary content. To create a doneFile we must acquire the synchronisation
* semaphore first. The lockFile is also removed here.
* copydb_save_copy_progress saves the duration and the number of
* bytes transmitted about the copy progress into the catalog.
*/
bool
copydb_mark_table_as_done(CopyDataSpec *specs,
copydb_save_copy_progress(CopyDataSpec *specs,
CopyTableDataSpec *tableSpecs)
{
DatabaseCatalog *sourceDB = &(specs->catalogs.source);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand this change. The summary_finish_table function is responsible to maintain the done_time_epoch field in the SQLite catalogs, and the rest of the code uses that information to determine if the table COPY is finished or not. Now we will update that done_time_epoch field multiple times during the COPY. What am I missing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pushed a change for this.

@rimbi
Copy link
Contributor Author

rimbi commented Jul 10, 2024

This PR misses the mark on applying the “Separation of Concerns” principle, as detailed in the review. The callback context should handle its own specific data in a way that the CopyArgs does not have to know about.

Also the PR comment invokes the “Single Source of Truth” principle, but I read the code as a violation of that principle, wherein it was previously made in a way that the single source of truth would be the SQLite catalogs, so that external processes can have a look at the same information as the internal code. Maybe the code evolved in a way that makes this not true anymore, but I think I would like to get back in that direction. For instance, we could build a pgcopydb watch command that acts like a htop command for pgcopydb: this would be an external process that would ideally just use the SQLite contents.

This explanation sounds misleading. If we have the 2 versions of the same data present, that is the symptom of the violation of single source of truth. It caused problems and I spent a lot of time to find this problem. Any application (not only pgcopydb) can make use of the data in the db represented as integer and display it however it wants.

@dimitri
Copy link
Owner

dimitri commented Jul 10, 2024

This explanation sounds misleading. If we have the 2 versions of the same data present, that is the symptom of the violation of single source of truth. It caused problems and I spent a lot of time to find this problem. Any application (not only pgcopydb) can make use of the data in the db represented as integer and display it however it wants.

I think we're just using a different source of truth and then the same argument. Now that you know where the problem comes from, do you see any difficulty or architectural problem in using our SQLite database file as the source of truth?

@rimbi
Copy link
Contributor Author

rimbi commented Jul 10, 2024

This explanation sounds misleading. If we have the 2 versions of the same data present, that is the symptom of the violation of single source of truth. It caused problems and I spent a lot of time to find this problem. Any application (not only pgcopydb) can make use of the data in the db represented as integer and display it however it wants.

I think we're just using a different source of truth and then the same argument. Now that you know where the problem comes from, do you see any difficulty or architectural problem in using our SQLite database file as the source of truth?

Yes, there seems to be a misunderstanding. I'm not objecting to using the sqlite as the source of truth, but to using 2 different fields for the same data but with different representations (int and string).

Comment on lines 2325 to 2328
/*
* Skip reading `ppDuration` from DB and create it from `durationMs`
* in accordance with the single source of truth principle.
*/
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we use our own version of SQLite and we know we have JSON support in SQL, we could rewrite all the SQLite to JSON transformation code and use a SQL query instead. It was my impression that having the data in two forms in the SQLite database was useful for that.

Another use-case is debugging from the SQLite data file when something strange happens. Then we might appreciate having the data available in pretty printed format for debugging.

The use-case I mentioned before still do exists, it's nice to be able to re-use the data from the SQLite data file either in numeric format or in pretty-printed format without having to re-compute the pretty-printed format every time. It's always the same with caching: do you want to use more CPU or more memory?

Comment on lines 863 to 877
ctx->bytesTransmittedBeforeSavingProgress += bytesTransmitted;
ctx->bytesTransmitted += bytesTransmitted;
Copy link
Owner

@dimitri dimitri Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When are the two values different? There is no comment about that, and the naming of the variable does not explain that either. Why do we need the two slots? Can we have a better name for bytesTransmittedBeforeSavingProgress? I currently have two problems with it: I don't understand the name in relation with the code, and I would like a shorter name if that's possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just because you don't understand it doesn't mean others don't, right? So the variable keeps the number of bytes transmitted before saving the progress to the db and reset to zero on each save. If you have a better name, please feel free to suggest one.
Assuming that you mean the vars starting with bytesTransmitted... one keeps the total as before (no change here) and the other as explained above.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that attitude necessary @rimbi ? I can tell you it's not welcome.

Two things for you to consider:

  1. I am asking questions on aspects of your code that is not obvious to me when reviewing the PR. When things are not obvious, we need to either improve the code, or add good comments to help the reader understand what's happening with less efforts.
  2. I have read a lot of code, including some great code in many different programming languages. The great code that I have read always have these two properties: what/how it does is obvious, and it is elegant. Elegance has something personal to it, so I will merge code that I do not find elegant. I will not merge code that I do not find obvious, though. That's not personal.

src/bin/pgcopydb/table-data.c Show resolved Hide resolved
@dimitri
Copy link
Owner

dimitri commented Jul 10, 2024

Yes, there seems to be a misunderstanding. I'm not objecting to using the sqlite as the source of truth, but to using 2 different fields for the same data but with different representations (int and string).

I think that's useful (see other comments). I would like to keep that.

Copy link
Owner

@dimitri dimitri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is still a minor update to make to distribute responsibilities more cleanly (see new comment), and a major blocker to resolve.

You are stating that the human readable pretty-printed versions of the duration and bytes should not be maintained anymore, which I disagree with. But even if we agreed, the proper way to change that would be do open a separate PR and properly remove the two fields from the SQLite database and all the code that uses them, not just the code that maintains the values at UPDATE time!

In the context of this PR, please refrain from making a controversial half-baked change.

.bytesTransmitted = 0,
.lastSavingTimeMs = 0
};
INSTR_TIME_SET_CURRENT(context.lastSavingTimeMs);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we deal with that responsibility from within the hook function, allowing/expecting the context value to be set to zero by the caller?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to understand your reasoning: This pattern is quite common in the codebase. What is it that sets this piece of code apart form others?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which pattern do you mean? I can't remember of another place where we use INSTR TIME portability macros for high-precision time measurement from libpq and also a callback function and a context...

Please use the INSTR TIME macros only in the callback function, so that the caller does not need to know why your callback implementation decided to use INSTR TIME macros instead of time(NULL) for a 5 seconds duration where accuracy is not a concern, allowing us to change our mind later about how the callback function is implemented without having to also change the code from the caller that has to prepare the context.

A good init value would be zero, because in C name = { .specs = specs } will leave the other fields properly initialized to zero. It is very easy in the callback function to initialize the lastSavingTimeMs to now when called the first time.

@rimbi rimbi requested a review from dimitri July 18, 2024 07:08
@marikkan-microsoft marikkan-microsoft force-pushed the 388-add-byte-level-progress-to-pgcopydb-list-progress branch from 65c9b56 to a3591b5 Compare August 1, 2024 10:05
@dimitri
Copy link
Owner

dimitri commented Aug 6, 2024

Closing for triage purposes. A part of this PR has been superseded in #858, other aspects need to be addressed in their own PR.

@dimitri dimitri closed this Aug 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add byte-level progress to pgcopydb list progress
2 participants