Add byte-level progress to pgcopydb list progress #839

rimbi · 2024-07-09T13:59:21Z

Fixes: #388

Allows the user to see the progress of the copy operation by executing the pgcopydb list progress command. The user will see that the duration and the bytes transferred will be increasing.

dimitri

This PR misses the mark on applying the “Separation of Concerns” principle, as detailed in the review. The callback context should handle its own specific data in a way that the CopyArgs does not have to know about.

Also the PR comment invokes the “Single Source of Truth” principle, but I read the code as a violation of that principle, wherein it was previously made in a way that the single source of truth would be the SQLite catalogs, so that external processes can have a look at the same information as the internal code. Maybe the code evolved in a way that makes this not true anymore, but I think I would like to get back in that direction. For instance, we could build a pgcopydb watch command that acts like a htop command for pgcopydb: this would be an external process that would ideally just use the SQLite contents.

src/bin/pgcopydb/copydb.h

src/bin/pgcopydb/extensions.c

src/bin/pgcopydb/pgsql.c

src/bin/pgcopydb/pgsql.h

dimitri · 2024-07-10T10:18:32Z

src/bin/pgcopydb/summary.c

-		if (!summary_pretty_print_timing(catalog, timing))
-		{
-			/* errors have already been logged */
-			return false;
-		}
-


How does that work now with respect to the duration_pretty column that still exists in the SQLite catalogs?

Looking at the existing sources, there seems to be ne need to keep the field in the table.

dimitri · 2024-07-10T10:20:20Z

src/bin/pgcopydb/summary.c

+	/*
+	 * Skip reading `ppDuration` from DB and create it from `durationMs`
+	 * in accordance with the single source of truth principle.
+	 */


This principle was already implemented before, where the decision was made to compute the data before SQLite storage and then use what is in the SQLite database. This allows any external tooling to have access to the same data as our internal code. Including commands that run alongside the main pgcopydb process rather than as a child process.

IOW I believe you introduced a regression here.

The problem was that the data in these 2 fields were inconsistent. What are the external tools which use these fields?

Is there a way to fix the bug wrt inconsistencies here, and keeping the data all in the SQLite database file?

Could the inconsistency be an explanation for End report after a db clone shows higher numbers than expected #697?

I mean “external” as in “not running within the process-tree with access to all the memory etc”, so that means the command pgcopydb list progress is already an “external” tool. All it has access to is the SQLite database file contents.

Probably there is. But is it a good idea to keep the 2 versions of the same data in the DB?

I don't know, maybe.

If we are talking about the pgcopydb then that means we have no problem here.

Now that we use our own version of SQLite and we know we have JSON support in SQL, we could rewrite all the SQLite to JSON transformation code and use a SQL query instead. It was my impression that having the data in two forms in the SQLite database was useful for that.

Another use-case is debugging from the SQLite data file when something strange happens. Then we might appreciate having the data available in pretty printed format for debugging.

The use-case I mentioned before still do exists, it's nice to be able to re-use the data from the SQLite data file either in numeric format or in pretty-printed format without having to re-compute the pretty-printed format every time. It's always the same with caching: do you want to use more CPU or more memory?

dimitri · 2024-07-10T10:20:59Z

src/bin/pgcopydb/summary.c

+	/*
+	 * Skip reading `ppBytes` from DB and create it from `bytes`
+	 * in accordance with the single source of truth principle.
+	 */


Same as before. What about external processes that might want to access to the telemetry?

dimitri · 2024-07-10T10:24:27Z

src/bin/pgcopydb/table-data.c

 /*
- * copydb_mark_table_as_done creates the table doneFile with the expected
- * summary content. To create a doneFile we must acquire the synchronisation
- * semaphore first. The lockFile is also removed here.
+ * copydb_save_copy_progress saves the duration and the number of
+ * bytes transmitted about the copy progress into the catalog.
 */
 bool
-copydb_mark_table_as_done(CopyDataSpec *specs,
+copydb_save_copy_progress(CopyDataSpec *specs,
 						  CopyTableDataSpec *tableSpecs)
 {
 	DatabaseCatalog *sourceDB = &(specs->catalogs.source);


I do not understand this change. The summary_finish_table function is responsible to maintain the done_time_epoch field in the SQLite catalogs, and the rest of the code uses that information to determine if the table COPY is finished or not. Now we will update that done_time_epoch field multiple times during the COPY. What am I missing?

pushed a change for this.

rimbi · 2024-07-10T12:27:35Z

This PR misses the mark on applying the “Separation of Concerns” principle, as detailed in the review. The callback context should handle its own specific data in a way that the CopyArgs does not have to know about.

Also the PR comment invokes the “Single Source of Truth” principle, but I read the code as a violation of that principle, wherein it was previously made in a way that the single source of truth would be the SQLite catalogs, so that external processes can have a look at the same information as the internal code. Maybe the code evolved in a way that makes this not true anymore, but I think I would like to get back in that direction. For instance, we could build a pgcopydb watch command that acts like a htop command for pgcopydb: this would be an external process that would ideally just use the SQLite contents.

This explanation sounds misleading. If we have the 2 versions of the same data present, that is the symptom of the violation of single source of truth. It caused problems and I spent a lot of time to find this problem. Any application (not only pgcopydb) can make use of the data in the db represented as integer and display it however it wants.

dimitri · 2024-07-10T12:38:05Z

This explanation sounds misleading. If we have the 2 versions of the same data present, that is the symptom of the violation of single source of truth. It caused problems and I spent a lot of time to find this problem. Any application (not only pgcopydb) can make use of the data in the db represented as integer and display it however it wants.

I think we're just using a different source of truth and then the same argument. Now that you know where the problem comes from, do you see any difficulty or architectural problem in using our SQLite database file as the source of truth?

rimbi · 2024-07-10T13:00:33Z

This explanation sounds misleading. If we have the 2 versions of the same data present, that is the symptom of the violation of single source of truth. It caused problems and I spent a lot of time to find this problem. Any application (not only pgcopydb) can make use of the data in the db represented as integer and display it however it wants.

I think we're just using a different source of truth and then the same argument. Now that you know where the problem comes from, do you see any difficulty or architectural problem in using our SQLite database file as the source of truth?

Yes, there seems to be a misunderstanding. I'm not objecting to using the sqlite as the source of truth, but to using 2 different fields for the same data but with different representations (int and string).

dimitri · 2024-07-10T14:26:47Z

src/bin/pgcopydb/summary.c

+	/*
+	 * Skip reading `ppDuration` from DB and create it from `durationMs`
+	 * in accordance with the single source of truth principle.
+	 */


Now that we use our own version of SQLite and we know we have JSON support in SQL, we could rewrite all the SQLite to JSON transformation code and use a SQL query instead. It was my impression that having the data in two forms in the SQLite database was useful for that.

Another use-case is debugging from the SQLite data file when something strange happens. Then we might appreciate having the data available in pretty printed format for debugging.

The use-case I mentioned before still do exists, it's nice to be able to re-use the data from the SQLite data file either in numeric format or in pretty-printed format without having to re-compute the pretty-printed format every time. It's always the same with caching: do you want to use more CPU or more memory?

dimitri · 2024-07-10T14:30:08Z

src/bin/pgcopydb/table-data.c

+	ctx->bytesTransmittedBeforeSavingProgress += bytesTransmitted;
+	ctx->bytesTransmitted += bytesTransmitted;


When are the two values different? There is no comment about that, and the naming of the variable does not explain that either. Why do we need the two slots? Can we have a better name for bytesTransmittedBeforeSavingProgress? I currently have two problems with it: I don't understand the name in relation with the code, and I would like a shorter name if that's possible.

Just because you don't understand it doesn't mean others don't, right? So the variable keeps the number of bytes transmitted before saving the progress to the db and reset to zero on each save. If you have a better name, please feel free to suggest one.
Assuming that you mean the vars starting with bytesTransmitted... one keeps the total as before (no change here) and the other as explained above.

Is that attitude necessary @rimbi ? I can tell you it's not welcome.

Two things for you to consider:

I am asking questions on aspects of your code that is not obvious to me when reviewing the PR. When things are not obvious, we need to either improve the code, or add good comments to help the reader understand what's happening with less efforts.

I have read a lot of code, including some great code in many different programming languages. The great code that I have read always have these two properties: what/how it does is obvious, and it is elegant. Elegance has something personal to it, so I will merge code that I do not find elegant. I will not merge code that I do not find obvious, though. That's not personal.

src/bin/pgcopydb/table-data.c

dimitri · 2024-07-10T14:37:19Z

Yes, there seems to be a misunderstanding. I'm not objecting to using the sqlite as the source of truth, but to using 2 different fields for the same data but with different representations (int and string).

I think that's useful (see other comments). I would like to keep that.

dimitri

I think there is still a minor update to make to distribute responsibilities more cleanly (see new comment), and a major blocker to resolve.

You are stating that the human readable pretty-printed versions of the duration and bytes should not be maintained anymore, which I disagree with. But even if we agreed, the proper way to change that would be do open a separate PR and properly remove the two fields from the SQLite database and all the code that uses them, not just the code that maintains the values at UPDATE time!

In the context of this PR, please refrain from making a controversial half-baked change.

dimitri · 2024-07-15T12:54:39Z

src/bin/pgcopydb/table-data.c

+			.bytesTransmitted = 0,
+			.lastSavingTimeMs = 0
+		};
+		INSTR_TIME_SET_CURRENT(context.lastSavingTimeMs);


Can we deal with that responsibility from within the hook function, allowing/expecting the context value to be set to zero by the caller?

Just to understand your reasoning: This pattern is quite common in the codebase. What is it that sets this piece of code apart form others?

Which pattern do you mean? I can't remember of another place where we use INSTR TIME portability macros for high-precision time measurement from libpq and also a callback function and a context...

Please use the INSTR TIME macros only in the callback function, so that the caller does not need to know why your callback implementation decided to use INSTR TIME macros instead of time(NULL) for a 5 seconds duration where accuracy is not a concern, allowing us to change our mind later about how the callback function is implemented without having to also change the code from the caller that has to prepare the context.

A good init value would be zero, because in C name = { .specs = specs } will leave the other fields properly initialized to zero. It is very easy in the callback function to initialize the lastSavingTimeMs to now when called the first time.

- Replaces size-based saving check with interval-based saving check

…e as well

dimitri · 2024-08-06T13:01:35Z

Closing for triage purposes. A part of this PR has been superseded in #858, other aspects need to be addressed in their own PR.

rimbi linked an issue Jul 9, 2024 that may be closed by this pull request

Add byte-level progress to pgcopydb list progress #388

Closed

rimbi requested a review from hanefi July 9, 2024 13:59

hanefi changed the title ~~[WIP] Fixes #388: Add byte-level progress to pgcopydb list progress~~ Add byte-level progress to pgcopydb list progress Jul 9, 2024

hanefi marked this pull request as draft July 9, 2024 15:06

rimbi marked this pull request as ready for review July 10, 2024 05:40

rimbi requested a review from dimitri July 10, 2024 05:40

rimbi self-assigned this Jul 10, 2024

rimbi force-pushed the 388-add-byte-level-progress-to-pgcopydb-list-progress branch from b3f436b to f82a0be Compare July 10, 2024 09:57

dimitri requested changes Jul 10, 2024

View reviewed changes

dimitri reviewed Jul 10, 2024

View reviewed changes

dimitri requested changes Jul 15, 2024

View reviewed changes

rimbi requested a review from dimitri July 18, 2024 07:08

rimbi added 12 commits August 1, 2024 13:05

Fixes #388: Add byte-level progress to pgcopydb list progress

cb38b90

- Removes redundant summary_pretty_print_timing function

e09c667

- Replaces size-based saving check with interval-based saving check

Add comment

80970de

Update the doc

1582c96

Apply changes in accordance with the review comments

9214c22

Apply changes in accordance with the review comments

cdd1b8c

Move bytesTransmitted fıeld from CopyArgs to CopyProgressContext

9ef07ac

Remove the redundant field from the context

9e8aad3

Apply changes in accordance with the review comments

0b11d4e

Make pgcopy list command display the copy progress in non-summary mod…

9acf51a

…e as well

Add function doc

ac8e6b8

Remove duration_pretty and bytes_pretty fields from the timings table

a3591b5

marikkan-microsoft force-pushed the 388-add-byte-level-progress-to-pgcopydb-list-progress branch from 65c9b56 to a3591b5 Compare August 1, 2024 10:05

dimitri mentioned this pull request Aug 2, 2024

Update COPY statistics during the copy operation. #858

Merged

dimitri closed this Aug 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add byte-level progress to pgcopydb list progress #839

Add byte-level progress to pgcopydb list progress #839

rimbi commented Jul 9, 2024 •

edited by hanefi

Loading

dimitri left a comment

dimitri Jul 10, 2024

rimbi Jul 10, 2024

dimitri Jul 10, 2024

rimbi Jul 10, 2024

dimitri Jul 10, 2024

rimbi Jul 10, 2024

dimitri Jul 10, 2024

dimitri Jul 10, 2024

dimitri Jul 10, 2024

rimbi Jul 10, 2024

rimbi commented Jul 10, 2024

dimitri commented Jul 10, 2024

rimbi commented Jul 10, 2024

dimitri Jul 10, 2024

dimitri Jul 10, 2024 •

edited

Loading

rimbi Jul 10, 2024

dimitri Jul 10, 2024

dimitri commented Jul 10, 2024

dimitri left a comment •

edited

Loading

dimitri Jul 15, 2024

rimbi Jul 16, 2024

dimitri Jul 16, 2024

dimitri commented Aug 6, 2024

		ctx->bytesTransmittedBeforeSavingProgress += bytesTransmitted;
		ctx->bytesTransmitted += bytesTransmitted;

Add byte-level progress to pgcopydb list progress #839

Add byte-level progress to pgcopydb list progress #839

Conversation

rimbi commented Jul 9, 2024 • edited by hanefi Loading

dimitri left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rimbi commented Jul 10, 2024

dimitri commented Jul 10, 2024

rimbi commented Jul 10, 2024

Choose a reason for hiding this comment

dimitri Jul 10, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dimitri commented Jul 10, 2024

dimitri left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dimitri commented Aug 6, 2024

rimbi commented Jul 9, 2024 •

edited by hanefi

Loading

dimitri Jul 10, 2024 •

edited

Loading

dimitri left a comment •

edited

Loading