Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pgcopydb seems to get stuck after setting endpos #427

Closed
marcocitus opened this issue Aug 15, 2023 · 8 comments · Fixed by #563
Closed

pgcopydb seems to get stuck after setting endpos #427

marcocitus opened this issue Aug 15, 2023 · 8 comments · Fixed by #563
Assignees
Labels
bug Something isn't working
Milestone

Comments

@marcocitus
Copy link

marcocitus commented Aug 15, 2023

Not yet entirely sure whether this is a bug or I'm doing something wrong, but while doing a pgcopydb clone --follow and using pgcopydb stream sentinel set endpos --current, I regularly see pgcopydb getting (permanently) stuck.

Log suggests pgcopydb does exit the loop startLogicalStreaming as expected:

2023-08-15 16:17:04 23482 INFO pgsql.c:4298 Reported write_lsn 4/DD152450, flush_lsn 4/DD152450, replay_lsn 4/DD151210
2023-08-15 16:17:04 23482 INFO pgsql.c:4298 Reported write_lsn 4/DD152450, flush_lsn 4/DD152450, replay_lsn 4/DD151210
2023-08-15 16:17:04 23482 INFO ld_stream.c:465 Streamed up to write_lsn 4/DD152450, flush_lsn 4/DD152450, stopping

Some stacktraces from different pgcopydb processes:

(gdb) bt
#0  0x00007fdbe7ce674d in __GI___select (nfds=nfds@entry=0, readfds=readfds@entry=0x0, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, 
    timeout=timeout@entry=0x7ffc685f26b0) at ../sysdeps/unix/sysv/linux/select.c:69
#1  0x000055cfb81fdccb in pg_usleep (microsec=microsec@entry=150000) at pgsleep.c:56
#2  0x000055cfb81a78a6 in cli_clone_follow_wait_subprocess (name=name@entry=0x55cfb820077b "follow", pid=24049) at cli_clone_follow.c:730
#3  0x000055cfb81a7b79 in clone_and_follow (copySpecs=copySpecs@entry=0x7ffc686fdf10) at cli_clone_follow.c:308
#4  0x000055cfb81a7c61 in cli_clone (argc=<optimized out>, argv=<optimized out>) at cli_clone_follow.c:142
#5  0x000055cfb81f3993 in commandline_run (command=0x55cfb82350e0 <clone_command>, argc=0, argc@entry=14, argv=0x7ffc688110a0, argv@entry=0x7ffc68811030)
    at /home/marco/pgcopydb/src/bin/pgcopydb/../lib/subcommands.c/commandline.c:71
#6  0x000055cfb81f3a75 in commandline_run (command=command@entry=0x7ffc68810ab0, argc=14, argc@entry=15, argv=0x7ffc68811030, argv@entry=0x7ffc68811028)
    at /home/marco/pgcopydb/src/bin/pgcopydb/../lib/subcommands.c/commandline.c:94
#7  0x000055cfb81d57b5 in main (argc=15, argv=0x7ffc68811028) at main.c:167
#0  0x00007f5361c3774d in __GI___select (nfds=nfds@entry=0, readfds=readfds@entry=0x0, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0,                                 
    timeout=timeout@entry=0x7ffd48643800) at ../sysdeps/unix/sysv/linux/select.c:69                                                                                                
#1  0x00005621a6082ccb in pg_usleep (microsec=microsec@entry=150000) at pgsleep.c:56     
#2  0x00005621a60497f1 in follow_wait_subprocesses (specs=specs@entry=0x7ffd48643d90) at follow.c:936                                                                              
#3  0x00005621a604996a in followDB (copySpecs=copySpecs@entry=0x7ffd4874f570, streamSpecs=streamSpecs@entry=0x7ffd48643d90) at follow.c:571                                        
#4  0x00005621a6049cce in follow_main_loop (copySpecs=copySpecs@entry=0x7ffd4874f570, streamSpecs=streamSpecs@entry=0x7ffd48643d90) at follow.c:286                                
#5  0x00005621a602c1f1 in start_follow_process (copySpecs=copySpecs@entry=0x7ffd4874f570, streamSpecs=streamSpecs@entry=0x7ffd48643d90, pid=pid@entry=0x7ffd48643d8c)              
    at cli_clone_follow.c:674                                                                                                                                                      
#6  0x00005621a602ca68 in clone_and_follow (copySpecs=copySpecs@entry=0x7ffd4874f570) at cli_clone_follow.c:269                                                                    
#7  0x00005621a602cc61 in cli_clone (argc=<optimized out>, argv=<optimized out>) at cli_clone_follow.c:142                                                                         
#8  0x00005621a6078993 in commandline_run (command=0x5621a60ba0e0 <clone_command>, argc=0, argc@entry=14, argv=0x7ffd48862700, argv@entry=0x7ffd48862690)                          
    at /home/marco/pgcopydb/src/bin/pgcopydb/../lib/subcommands.c/commandline.c:71                                                                                                 
#9  0x00005621a6078a75 in commandline_run (command=command@entry=0x7ffd48862110, argc=14, argc@entry=15, argv=0x7ffd48862690, argv@entry=0x7ffd48862688)                           
    at /home/marco/pgcopydb/src/bin/pgcopydb/../lib/subcommands.c/commandline.c:94       
#10 0x00005621a605a7b5 in main (argc=15, argv=0x7ffd48862688) at main.c:167  
#0  0x00007f5361c3774d in __GI___select (nfds=nfds@entry=20, readfds=readfds@entry=0x7ffd48641b30, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, 
    timeout=timeout@entry=0x7ffd48641b20) at ../sysdeps/unix/sysv/linux/select.c:69                                                                                                
#1  0x00005621a604681d in read_from_stream (stream=0x5621a7915860, context=context@entry=0x7ffd48643c10) at file_utils.c:445                                                       
#2  0x00005621a6058e9a in stream_transform_stream (specs=specs@entry=0x7ffd48643d90) at ld_transform.c:98                                                                          
#3  0x00005621a6049148 in follow_start_transform (specs=0x7ffd48643d90) at follow.c:667                                                                                            
#4  0x00005621a60492ae in follow_start_subprocess (specs=specs@entry=0x7ffd48643d90, subprocess=subprocess@entry=0x7ffd4874cad0) at follow.c:769                                   
#5  0x00005621a6049a8e in followDB (copySpecs=copySpecs@entry=0x7ffd4874f570, streamSpecs=streamSpecs@entry=0x7ffd48643d90) at follow.c:542                                        
#6  0x00005621a6049cce in follow_main_loop (copySpecs=copySpecs@entry=0x7ffd4874f570, streamSpecs=streamSpecs@entry=0x7ffd48643d90) at follow.c:286                                
#7  0x00005621a602c1f1 in start_follow_process (copySpecs=copySpecs@entry=0x7ffd4874f570, streamSpecs=streamSpecs@entry=0x7ffd48643d90, pid=pid@entry=0x7ffd48643d8c)
    at cli_clone_follow.c:674                                                                                                                                                      
#8  0x00005621a602ca68 in clone_and_follow (copySpecs=copySpecs@entry=0x7ffd4874f570) at cli_clone_follow.c:269                                                                    
#9  0x00005621a602cc61 in cli_clone (argc=<optimized out>, argv=<optimized out>) at cli_clone_follow.c:142                                                                         
#10 0x00005621a6078993 in commandline_run (command=0x5621a60ba0e0 <clone_command>, argc=0, argc@entry=14, argv=0x7ffd48862700, argv@entry=0x7ffd48862690)                          
    at /home/marco/pgcopydb/src/bin/pgcopydb/../lib/subcommands.c/commandline.c:71                                                                                                 
#11 0x00005621a6078a75 in commandline_run (command=command@entry=0x7ffd48862110, argc=14, argc@entry=15, argv=0x7ffd48862690, argv@entry=0x7ffd48862688)                           
    at /home/marco/pgcopydb/src/bin/pgcopydb/../lib/subcommands.c/commandline.c:94                                                                                                 
#12 0x00005621a605a7b5 in main (argc=15, argv=0x7ffd48862688) at main.c:167  

It happens both with writes and no writes during the clone.

Version:

pgcopydb version 0.13.14.gf348a87
compiled with PostgreSQL 15.3 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0, 64-bit
compatible with Postgres 10, 11, 12, 13, 14, and 15
@arajkumar
Copy link
Contributor

arajkumar commented Oct 10, 2023

I see same issue.

After setting the endpos, apply process exits with the following log,

09:25:39.155 422954 INFO   Replay reached end position 119/BECB8000 at 119/BEE10FF8
09:25:39.216 422954 INFO   Replayed reached endpos 119/BECB8000 at replay_lsn 119/BED25928, stopping

But receive and transform seem to be blocked in file write,

pgcopydb,421092 clone --filter /home/ubuntu/dimitri/filter.ini --follow --fail-fast --skip-extensions --plugin wal2json --snapshot 00000020-00000D35-1
  `-pgcopydb,421095
      |-pgcopydb,421125
      `-pgcopydb,421126
$ sudo gdb attach 421092

#0  0x00007f629f51b78d in __GI___select (nfds=0, readfds=0x0, writefds=0x0,
    exceptfds=0x0, timeout=0x7ffc5b05a900) at ../sysdeps/unix/sysv/linux/select.c:69
#1  0x000055c5b9f13ef1 in pg_usleep ()
#2  0x000055c5b9ebd545 in cli_clone_follow_wait_subprocess (
    name=name@entry=0x55c5b9f151d2 "follow", pid=pid@entry=421095)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/cli_clone_follow.c:747
#3  0x000055c5b9ebec17 in clone_and_follow (copySpecs=0x7ffc5b166690)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/cli_clone_follow.c:315
#4  cli_clone (argc=<optimized out>, argv=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/cli_clone_follow.c:143
#5  0x000055c5b9ebcd81 in commandline_run (argv=0x7ffc5b38d3a0, argc=0,
    command=0x7ffc5b38cc90)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/../lib/subcommands.c/commandline.c:71
#6  main (argc=<optimized out>, argv=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/main.c:167

$ sudo gdb attach 421095

#0  0x00007f629f51b78d in __GI___select (nfds=0, readfds=0x0, writefds=0x0,
    exceptfds=0x0, timeout=0x7ffc5b05a3b0) at ../sysdeps/unix/sysv/linux/select.c:69
#1  0x000055c5b9f13ef1 in pg_usleep ()
#2  0x000055c5b9edcdc6 in follow_wait_subprocesses (
    specs=specs@entry=0x7ffc5b05aef0)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:943
#3  0x000055c5b9edd0f2 in followDB (streamSpecs=0x7ffc5b05aef0,
    copySpecs=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:571
#4  0x000055c5b9f0fba0 in followDB (copySpecs=<optimized out>,
    streamSpecs=0x7ffc5b05aef0)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:493
#5  follow_main_loop.part.0.isra.0 (streamSpecs=0x7ffc5b05aef0,
    copySpecs=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:286
#6  0x000055c5b9ebebf3 in follow_main_loop (copySpecs=<optimized out>,
    streamSpecs=0x7ffc5b05aef0)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:262
#7  start_follow_process (copySpecs=0x7ffc5b166690, pid=<synthetic pointer>,
    streamSpecs=0x7ffc5b05aef0)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/cli_clone_follow.c:691
#8  clone_and_follow (copySpecs=0x7ffc5b166690)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/cli_clone_follow.c:276
#9  cli_clone (argc=<optimized out>, argv=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/cli_clone_follow.c:143
#10 0x000055c5b9ebcd81 in commandline_run (argv=0x7ffc5b38d3a0, argc=0,
    command=0x7ffc5b38cc90)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/../lib/subcommands.c/commandline.c:71
#11 main (argc=<optimized out>, argv=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/main.c:167
$ sudo gdb attach 421125

#0  0x00007f629f514a77 in __GI___libc_write (fd=6, buf=0x55c5badb7540, nbytes=575)
    at ../sysdeps/unix/sysv/linux/write.c:26
#1  0x00007f629f48aeed in _IO_new_file_write (f=0x55c5bacba030,
    data=0x55c5badb7540, n=575) at ./libio/fileops.c:1180
#2  0x00007f629f48c9e1 in new_do_write (to_do=575,
    data=0x55c5badb7540 "{\"action\":\"I\",\"xid\":\"6286140\",\"lsn\":\"116/347FFF68\",\"timestamp\":\"2023-10-10 07:42:05.187393+0000\",\"message\":{\"action\":\"I\",\"xid\":6286140,\"schema\":\"_timescaledb_internal\",\"table\":\"_hyper_37_4151_chunk\",\""..., fp=0x55c5bacba030) at ./libio/libioP.h:947
#3  _IO_new_do_write (to_do=575,
    data=0x55c5badb7540 "{\"action\":\"I\",\"xid\":\"6286140\",\"lsn\":\"116/347FFF68\",\"timestamp\":\"2023-10-10 07:42:05.187393+0000\",\"message\":{\"action\":\"I\",\"xid\":6286140,\"schema\":\"_timescaledb_internal\",\"table\":\"_hyper_37_4151_chunk\",\""..., fp=0x55c5bacba030) at ./libio/fileops.c:425
#4  _IO_new_do_write (fp=0x55c5bacba030,
    data=0x55c5badb7540 "{\"action\":\"I\",\"xid\":\"6286140\",\"lsn\":\"116/347FFF68\",\"timestamp\":\"2023-10-10 07:42:05.187393+0000\",\"message\":{\"action\":\"I\",\"xid\":6286140,\"schema\":\"_timescaledb_internal\",\"table\":\"_hyper_37_4151_chunk\",\""..., to_do=575) at ./libio/fileops.c:422
#5  0x00007f629f48b6d5 in _IO_new_file_xsputn (n=575, data=<optimized out>,
    f=<optimized out>) at ./libio/libioP.h:947
#6  _IO_new_file_xsputn (f=0x55c5bacba030, data=<optimized out>, n=575)
    at ./libio/fileops.c:1196
#7  0x00007f629f47ffd7 in __GI__IO_fwrite (buf=buf@entry=0x55c5bad9edd0,
    size=size@entry=1, count=count@entry=575, fp=fp@entry=0x55c5bacba030)
    at ./libio/libioP.h:947
#8  0x000055c5b9ee5b49 in write_to_stream (size=<optimized out>,
    buffer=0x55c5bad9edd0 "{\"action\":\"I\",\"xid\":\"6286140\",\"lsn\":\"116/347FFF68\",\"timestamp\":\"2023-10-10 07:42:05.187393+0000\",\"message\":{\"action\":\"I\",\"xid\":6286140,\"schema\":\"_timescaledb_internal\",\"table\":\"_hyper_37_4151_chunk\",\""..., stream=0x55c5bacba030)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/file_utils.c:392
#9  stream_write_json (context=context@entry=0x7ffc5ae528a0,
    previous=previous@entry=false)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/ld_stream.c:796
#10 0x000055c5b9ee61e2 in streamWrite (context=0x7ffc5ae528a0)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/ld_stream.c:686
#11 0x000055c5b9ee27f9 in pgsql_stream_logical (context=0x7ffc5ae528a0,
    client=0x7ffc5af53530)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/pgsql.c:4158
#12 startLogicalStreaming (specs=specs@entry=0x7ffc5b05aef0)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/ld_stream.c:446
#13 0x000055c5b9ee3920 in follow_start_prefetch (specs=0x7ffc5b05aef0)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:614
#14 follow_start_prefetch (specs=0x7ffc5b05aef0)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:594
#15 0x000055c5b9edcbea in follow_start_subprocess (
    specs=specs@entry=0x7ffc5b05aef0, subprocess=subprocess@entry=0x7ffc5b163c10)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:776
#16 0x000055c5b9edd188 in followDB (streamSpecs=0x7ffc5b05aef0,
    copySpecs=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:536
#17 0x000055c5b9f0fba0 in followDB (copySpecs=<optimized out>,
    streamSpecs=0x7ffc5b05aef0)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:493
#18 follow_main_loop.part.0.isra.0 (streamSpecs=0x7ffc5b05aef0,
    copySpecs=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:286
#19 0x000055c5b9ebebf3 in follow_main_loop (copySpecs=<optimized out>,
    streamSpecs=0x7ffc5b05aef0)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:262
#20 start_follow_process (copySpecs=0x7ffc5b166690, pid=<synthetic pointer>,
    streamSpecs=0x7ffc5b05aef0)
   at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/cli_clone_follow.c:691
#21 clone_and_follow (copySpecs=0x7ffc5b166690)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/cli_clone_follow.c:276
#22 cli_clone (argc=<optimized out>, argv=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/cli_clone_follow.c:143
#23 0x000055c5b9ebcd81 in commandline_run (argv=0x7ffc5b38d3a0, argc=0,
    command=0x7ffc5b38cc90)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/../lib/subcommands.c/commandline.c:71
#24 main (argc=<optimized out>, argv=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/main.c:167
$ sudo gdb attach 421126

#0  0x00007f629f514a77 in __GI___libc_write (fd=8, buf=0x55c5badab350, nbytes=4096)
    at ../sysdeps/unix/sysv/linux/write.c:26
#1  0x00007f629f48aeed in _IO_new_file_write (f=0x55c5bad5adb0,
    data=0x55c5badab350, n=4096) at ./libio/fileops.c:1180
#2  0x00007f629f48c9e1 in new_do_write (to_do=4096,
    data=0x55c5badab350 "0000\",\"4504\",\"1\",null,\"2016-01-24 08:44:50+00\",\"3756\",\"0.400000\",\"4504\",\"3\",null,\"2016-01-24 08:44:50+00\",\"2051\",\"0.400000\",\"4504\",\"2\",null,\"2016-01-24 08:44:50+00\",\"2052\",\"0.400000\",\"4504\",\"0\",null,\""..., fp=0x55c5bad5adb0) at ./libio/libioP.h:947
#3  _IO_new_do_write (to_do=4096,
    data=0x55c5badab350 "0000\",\"4504\",\"1\",null,\"2016-01-24 08:44:50+00\",\"3756\",\"0.400000\",\"4504\",\"3\",null,\"2016-01-24 08:44:50+00\",\"2051\",\"0.400000\",\"4504\",\"2\",null,\"2016-01-24 08:44:50+00\",\"2052\",\"0.400000\",\"4504\",\"0\",null,\""..., fp=0x55c5bad5adb0) at ./libio/fileops.c:425
#4  _IO_new_do_write (fp=0x55c5bad5adb0,
    data=0x55c5badab350 "0000\",\"4504\",\"1\",null,\"2016-01-24 08:44:50+00\",\"3756\",\"0.400000\",\"4504\",\"3\",null,\"2016-01-24 08:44:50+00\",\"2051\",\"0.400000\",\"4504\",\"2\",null,\"2016-01-24 08:44:50+00\",\"2052\",\"0.400000\",\"4504\",\"0\",null,\""..., to_do=4096) at ./libio/fileops.c:422
#5  0x00007f629f48b6d5 in _IO_new_file_xsputn (n=1024, data=<optimized out>,
    f=<optimized out>) at ./libio/libioP.h:947
#6  _IO_new_file_xsputn (f=0x55c5bad5adb0, data=<optimized out>, n=1024)
    at ./libio/fileops.c:1196
#7  0x00007f629f47ffd7 in __GI__IO_fwrite (buf=0x7ffc5b057b90, size=size@entry=1,
    count=count@entry=1024, fp=0x55c5bad5adb0) at ./libio/libioP.h:947
#8  0x000055c5b9eff421 in flushbuffer (target=target@entry=0x7ffc5b057b60)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/../lib/pg/snprintf.c:311
#9  0x000055c5b9eff6d8 in dostr (target=<optimized out>, slen=28733,
    str=0x55c5bbbf8280 "6-01-24 08:45:40+00\",\"2057\",\"0.400000\",\"3339\",\"5\",null,\"2016-01-24 08:45:40+00\",\"2075\",\"0.400000\",\"3339\",\"1\",null,\"2016-01-24 08:45:40+00\",\"2059\",\"0.400000\",\"3339\",\"0\",null,\"2016-01-24 08:45:40+00\",\"2"...) at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/../lib/pg/snprintf.c:1389
#10 dostr (
    str=str@entry=0x55c5bbbf2e90 "[\"2016-01-24 08:42:10+00\",\"2103\",\"0.400000\",\"2746\",\"3\",null,\"2016-01-24 08:42:10+00\",\"4966\",\"0.400000\",\"2746\",\"0\",null,\"2016-01-24 08:42:10+00\",\"2584\",\"0.400000\",\"2746\",\"4\",null,\"2016-01-24 08:42:10+0"..., slen=<optimized out>, target=target@entry=0x7ffc5b057b60)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/../lib/pg/snprintf.c:1364
#11 0x000055c5b9f01ed6 in dopr (target=target@entry=0x7ffc5b057b60,
    format=0x55c5b9f2687d ";\n", args=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/../lib/pg/snprintf.c:427
#12 0x000055c5b9f0334c in pg_vfprintf (stream=<optimized out>, fmt=<optimized out>,
    args=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/../lib/pg/snprintf.c:258
#13 0x000055c5b9edba3f in fformat (stream=<optimized out>, fmt=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/file_utils.c:1186
#14 0x000055c5b9ee7676 in stream_write_insert (insert=0x55c5badb5218,
    out=0x55c5bad5adb0)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/ld_transform.c:2074
#15 stream_write_transaction (txn=0x7ffc5b1658d0, out=0x55c5bad5adb0)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/ld_transform.c:1773
#16 stream_write_message (out=0x55c5bad5adb0, msg=msg@entry=0x7ffc5b1658c8)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/ld_transform.c:1630
#17 0x000055c5b9ee8add in stream_transform_write_message (
    privateContext=privateContext@entry=0x7ffc5b163c88,
    currentMsgIndex=currentMsgIndex@entry=0x7ffc5b05a398)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/ld_transform.c:289
#18 0x000055c5b9eef6f6 in stream_transform_line (ctx=0x7ffc5b05a390,
    line=0x55c5badaade0 "{\"action\":\"C\",\"xid\":\"6286137\",\"lsn\":\"116/347ED480\",\"timestamp\":\"2023-10-10 07:42:05.180187+0000\",\"message\":{\"action\":\"C\",\"xid\":6286137}}", stop=0x7ffc5b0582a7)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/ld_transform.c:220
#19 0x000055c5b9ed945b in read_from_stream (stream=<optimized out>,
    context=0x7ffc5b05a3a0)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/file_utils.c:642
#20 0x000055c5b9eef1a5 in stream_transform_stream (specs=0x7ffc5b05aef0)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/ld_transform.c:103
#21 0x000055c5b9eda1d0 in follow_start_transform (specs=0x7ffc5b05aef0)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:667
#22 follow_start_transform (specs=0x7ffc5b05aef0)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:636
#23 0x000055c5b9edcbea in follow_start_subprocess (
    specs=specs@entry=0x7ffc5b05aef0, subprocess=subprocess@entry=0x7ffc5b163c30)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:776
#24 0x000055c5b9edd1a3 in followDB (streamSpecs=0x7ffc5b05aef0,
    copySpecs=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:542
#25 0x000055c5b9f0fba0 in followDB (copySpecs=<optimized out>,
    streamSpecs=0x7ffc5b05aef0)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:493
#26 follow_main_loop.part.0.isra.0 (streamSpecs=0x7ffc5b05aef0,
    copySpecs=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:286
#27 0x000055c5b9ebebf3 in follow_main_loop (copySpecs=<optimized out>,
    streamSpecs=0x7ffc5b05aef0)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:262
#28 start_follow_process (copySpecs=0x7ffc5b166690, pid=<synthetic pointer>,
    streamSpecs=0x7ffc5b05aef0)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/cli_clone_follow.c:691
#29 clone_and_follow (copySpecs=0x7ffc5b166690)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/cli_clone_follow.c:276
#30 cli_clone (argc=<optimized out>, argv=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/cli_clone_follow.c:143
#31 0x000055c5b9ebcd81 in commandline_run (argv=0x7ffc5b38d3a0, argc=0,
    command=0x7ffc5b38cc90)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/../lib/subcommands.c/commandline.c:71
#32 main (argc=<optimized out>, argv=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/main.c:167

@arajkumar
Copy link
Contributor

Few a times the apply process blocks forever when the follower switches to live replay mode,

10:44:53.806 425422 INFO   Apply reached 119/E70047D0 in "/home/ubuntu/.local/share/pgcopydb/0000000400000119000000E6.sql"
10:44:53.806 425422 INFO   Apply new filename: "/home/ubuntu/.local/share/pgcopydb/0000000400000119000000E7.sql"
10:44:53.807 425422 INFO   Replaying changes from file "/home/ubuntu/.local/share/pgcopydb/0000000400000119000000E7.sql"
10:44:54.711 425422 INFO   Apply reached 119/E789A830 in "/home/ubuntu/.local/share/pgcopydb/0000000400000119000000E7.sql"
10:44:54.711 425422 INFO   Apply new filename: "/home/ubuntu/.local/share/pgcopydb/0000000400000119000000E7.sql"
10:44:54.711 425422 INFO   Reached end of file "/home/ubuntu/.local/share/pgcopydb/0000000400000119000000E7.sql" at 119/E789A830.
10:44:54.748 425422 INFO   Current sentinel replay_lsn is 119/E7844F18
10:44:54.748 425422 INFO   Restarting logical decoding follower in replay mode
10:44:54.750 425452 INFO   Resuming streaming from latest file "/home/ubuntu/.local/share/pgcopydb/0000000400000119000000E7.json"
10:44:54.800 425454 WARN   Failed to connect to "postgres://txxx:36898/tsdb?sslmode=require&keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60", retrying until the server is ready
10:44:54.851 425454 INFO   Successfully connected to "postgres://txxxx:36898/tsdb?sslmode=require&keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60" after 2 attempts in 72 ms.
10:44:54.862 425454 INFO   Replaying changes from LSN 119/E789A830
10:44:55.123 425452 INFO   Resuming streaming at LSN 119/E78ABDB0 from first message with that LSN read in JSON file "/home/ubuntu/.local/share/pgcopydb/0000000400000119000000E7.json", line 37963
10:44:55.207 425452 INFO   Reported write_lsn 119/E78ABDB0, flush_lsn 119/E78ABDB0, replay_lsn 119/E6F42518
#0  0x00007f3eb0f1b78d in __GI___select (nfds=0, readfds=0x0, writefds=0x0,
    exceptfds=0x0, timeout=0x7fff3faebb30) at ../sysdeps/unix/sysv/linux/select.c:69
#1  0x000055b64e5d1f11 in pg_usleep ()
#2  0x000055b64e59d636 in parseTxnMetadataFile (metadata=0x7fff3faebc10,
    filename=0x7fff3faebc70 "/home/ubuntu/.local/share/pgcopydb/6360655.json")
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/ld_apply.c:1546
#3  readTxnCommitLSN (metadata=0x7fff3faec100, context=0x7fff3faee300)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/ld_apply.c:1483
#4  readTxnCommitLSN (metadata=0x7fff3faec100, context=0x7fff3faee300)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/ld_apply.c:1458
#5  stream_apply_sql (context=0x7fff3faee300, metadata=0x7fff3faec100,
    sql=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/ld_apply.c:657
#6  0x000055b64e59f3ee in stream_replay_line (ctx=0x7fff3faee300,
    line=0x55b650070490 "BEGIN; -- {\"xid\":6360655,\"lsn\":\"119/E7877998\",\"timestamp\":\"2023-10-10 10:44:50.920671+0000\"}", stop=0x7fff3faec1e7)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/ld_replay.c:149
#7  0x000055b64e59745b in read_from_stream (stream=<optimized out>,
    context=0x7fff3faee2d0)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/file_utils.c:642
#8  0x000055b64e5a2ea3 in stream_apply_replay (specs=specs@entry=0x7fff3fbf8590)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/ld_replay.c:79
#9  0x000055b64e5a30dd in follow_start_catchup (specs=0x7fff3fbf8590)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:715
#10 follow_start_catchup (specs=0x7fff3fbf8590)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:699
#11 0x000055b64e59abfa in follow_start_subprocess (
    specs=specs@entry=0x7fff3fbf8590, subprocess=subprocess@entry=0x7fff3fd012f0)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:779
#12 0x000055b64e59b1dd in followDB (streamSpecs=0x7fff3fbf8590,
    copySpecs=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:556
#13 0x000055b64e5cdbc0 in followDB (copySpecs=<optimized out>,
    streamSpecs=0x7fff3fbf8590)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:493
#14 follow_main_loop.part.0.isra.0 (streamSpecs=0x7fff3fbf8590,
    copySpecs=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:286
#15 0x000055b64e57cbf3 in follow_main_loop (copySpecs=<optimized out>,
    streamSpecs=0x7fff3fbf8590)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/follow.c:262
#16 start_follow_process (copySpecs=0x7fff3fd03d30, pid=<synthetic pointer>,
    streamSpecs=0x7fff3fbf8590)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/cli_clone_follow.c:691
#17 clone_and_follow (copySpecs=0x7fff3fd03d30)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/cli_clone_follow.c:276
#18 cli_clone (argc=<optimized out>, argv=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/cli_clone_follow.c:143
#19 0x000055b64e57ad81 in commandline_run (argv=0x7fff3ff2aa40, argc=0,
    command=0x7fff3ff2a330)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/../lib/subcommands.c/commandline.c:71
#20 main (argc=<optimized out>, argv=<optimized out>)
    at /home/ubuntu/dimitri/pgcopydb/src/bin/pgcopydb/main.c:167

IIUC, the apply process waits for the transaction metadata file 6360655.json which will be created by transform process when it encounters a commit for the txid 6360655.

$ rg 6360655 -l ~/.local/share/pgcopydb
/home/ubuntu/.local/share/pgcopydb/0000000400000119000000E7.sql
/home/ubuntu/.local/share/pgcopydb/0000000400000119000000E7.json
/home/ubuntu/.local/share/pgcopydb/0000000400000119000000E7.json.partial

I could see that transaction commit at line # 38046 in 0000000400000119000000E7.json, but not in the transformed sql file. Refer the attached files below,

files-0000000400000119000000E7.zip

Also attached a full log.
clone.log

@dimitri
Copy link
Owner

dimitri commented Oct 10, 2023

Hi,

It's unclear to me if you're hitting twice the same problem or two different problems.

  • Seems like there is something going wrong with the PIPEs mechanism wherein the termination of the reading-end of the pipe (the apply process) seems to not be acknowledged from the writing-end of the pipe (the transform process).
  • And also something wrong with the transform process missing data, here it failed to process the COMMIT part of a transaction that span over multiple files, in a way that the metadata JSON file is needed.

The second problem seems related maybe to your previous area of work with resuming logical decoding. The first problem is strange because we use select(1) in our implementation of read_from_stream in order to avoid that exact scenario.

@arajkumar
Copy link
Contributor

@dimitri Both of the issues mentioned above [1] and [2] occur after pgcopydb stream sentinel set endpos --current. Sometimes, I encounter [1], while at other times, it's [2].

The first problem is strange because we use select(1) in our implementation of read_from_stream in order to avoid that exact scenario.

It seems the write of receive and transform is blocking after termination of apply.

@arajkumar
Copy link
Contributor

arajkumar commented Oct 11, 2023

I see that transform process blocked in read_from_stream and we never check endpos being reached from transform.

I think the easiest approach would be sending SIGTERM from follow once we know endpos <= flush_lsn ?

EDIT: I see that there is a mechanism in place for transform, but it never gets called because the input stream for transform didn't had any content and it just loops around in select with 0 FD ready loop.

@arajkumar
Copy link
Contributor

Few a times the apply process blocks forever when the follower switches to live replay mode,

IIUC, the apply process waits for the transaction metadata file 6360655.json which will be created by transform process when it encounters a commit for the txid 6360655.

$ rg 6360655 -l ~/.local/share/pgcopydb
/home/ubuntu/.local/share/pgcopydb/0000000400000119000000E7.sql
/home/ubuntu/.local/share/pgcopydb/0000000400000119000000E7.json
/home/ubuntu/.local/share/pgcopydb/0000000400000119000000E7.json.partial

I could see that transaction commit at line # 38046 in 0000000400000119000000E7.json, but not in the transformed sql file. Refer the attached files below,

@dimitri IIUC, this is caused by a deadlock during replay mode.

The transform process writes the transaction statements to stdout once it encounters COMMIT or KEEPALIVE or ENDPOS or SWITCH and followed by writeTxnMetadataFile if the message is COMMIT. In the problematic case, the write to stdout is triggered by KEEPALIVE and the apply process receives the BEGIN and gets blocked in readTxnCommitLSN. However, transform also blocked in writing to stdout because the pipe buffer is full due to larger transaction, this leads to dead lock.

@arajkumar
Copy link
Contributor

Let me create a separate issue to track deadlock while doing pgcopydb clone --follow.

@dimitri
Copy link
Owner

dimitri commented Nov 6, 2023

I believe #502 and #521 are contributing to fix that issue, closing now.

@dimitri dimitri closed this as completed Nov 6, 2023
@dimitri dimitri added this to the v0.14 milestone Nov 6, 2023
@dimitri dimitri added the bug Something isn't working label Nov 6, 2023
@dimitri dimitri self-assigned this Nov 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants