@deardeng
Contributor

@deardeng deardeng commented Oct 27, 2025

What problem does this PR solve?

Currently, tablet report logic uses a ForkJoinPool to process tablet information, but often encounters unexplained hangs in the ForkJoinPool. The printed stack trace doesn't reveal where the hang occurs, making it difficult to troubleshoot the issue.

With the ForkJoinPool, the stack of the stuck report thread looks like this:

"report-thread" #187 daemon prio=5 os_prio=0 cpu=97864.95ms elapsed=2469428.50s tid=0x00007ff1cb5c8530 nid=0xef2 waiting on condition  [0x00007fef462e1000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        - parking to wait for  <0x00000006dc3fae00> (a java.util.concurrent.ForkJoinTask$AdaptedRunnableAction)
        at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:341)
        at java.util.concurrent.ForkJoinTask.awaitDone([email protected]/ForkJoinTask.java:468)
        at java.util.concurrent.ForkJoinTask.join([email protected]/ForkJoinTask.java:670)
        at org.apache.doris.catalog.TabletInvertedIndex.tabletReport(TabletInvertedIndex.java:370)
        at org.apache.doris.master.ReportHandler.tabletReport(ReportHandler.java:509)
        at org.apache.doris.master.ReportHandler$ReportTask.exec(ReportHandler.java:339)
        at org.apache.doris.master.ReportHandler.runOneCycle(ReportHandler.java:1466)
        at org.apache.doris.common.util.Daemon.run(Daemon.java:119)

   Locked ownable synchronizers:
        - None

This stack does not show where the problem is.

While the tablet report is stuck, it keeps holding the TabletInvertedIndex read lock, which leads to a deadlock.
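
The effect of a stuck reader can be reproduced in isolation. The sketch below is only an illustration: it uses `java.util.concurrent.locks.ReentrantReadWriteLock` as a stand-in for whatever lock TabletInvertedIndex actually uses, and the class and method names are hypothetical. A reader takes the read lock and then waits on a latch (standing in for the `join()` that never returns), while a writer tries to take the write lock with a timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch; not taken from the Doris code base. */
public class StuckReaderSketch {

    /**
     * Returns true if a writer could take the write lock within waitMillis
     * while another thread is stuck holding the read lock.
     */
    public static boolean writerAcquiresWhileReaderStuck(long waitMillis)
            throws InterruptedException {
        // Assumption: a plain read-write lock stands in for the real index lock.
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch readerHoldsLock = new CountDownLatch(1);
        CountDownLatch releaseReader = new CountDownLatch(1);

        Thread reader = new Thread(() -> {
            lock.readLock().lock();
            try {
                readerHoldsLock.countDown();
                // Stands in for the join() that never returns in the report thread.
                releaseReader.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.readLock().unlock();
            }
        }, "stuck-reader");
        reader.setDaemon(true);
        reader.start();
        readerHoldsLock.await();

        // The writer waits for a bounded time so the demo itself cannot hang.
        boolean acquired = lock.writeLock().tryLock(waitMillis, TimeUnit.MILLISECONDS);
        if (acquired) {
            lock.writeLock().unlock();
        }

        // Let the reader finish so the demo cleans up after itself.
        releaseReader.countDown();
        reader.join();
        return acquired;
    }
}
```

Calling `StuckReaderSketch.writerAcquiresWhileReaderStuck(200)` returns `false`: as long as the reader does not release the read lock, no writer can make progress.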

This PR replaces the ForkJoinPool with a regular thread pool.
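
As a rough illustration of that kind of replacement, the sketch below splits a list of ids into chunks, runs each chunk on a fixed-size `ExecutorService`, and waits for the results with a deadline. It is not the code from this PR: the class name, the method name, the chunking scheme, and the timeout are all assumptions, and the per-chunk work (a simple sum) is a placeholder for the real tablet processing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch; names are not from the Doris code base. */
public class PartitionedReportSketch {

    /**
     * Processes ids in chunks on a fixed-size pool and returns their sum.
     * Waiting uses a deadline, so a stalled chunk surfaces as a
     * TimeoutException instead of parking the caller forever.
     */
    public static long sumInChunks(List<Long> ids, int chunkSize, int threads,
                                   long timeoutSeconds) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong total = new AtomicLong();
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int start = 0; start < ids.size(); start += chunkSize) {
                int end = Math.min(start + chunkSize, ids.size());
                List<Long> chunk = ids.subList(start, end);
                futures.add(pool.submit(() -> {
                    // Placeholder work: the real code would inspect tablets here.
                    long local = 0;
                    for (long id : chunk) {
                        local += id;
                    }
                    total.addAndGet(local);
                }));
            }
            // One shared deadline for all chunks (an assumption of this sketch).
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(timeoutSeconds);
            for (Future<?> future : futures) {
                long remaining = Math.max(deadline - System.nanoTime(), 0L);
                future.get(remaining, TimeUnit.NANOSECONDS);
            }
            return total.get();
        } finally {
            // Interrupt anything still running so the pool does not leak threads.
            pool.shutdownNow();
        }
    }
}
```

For example, `PartitionedReportSketch.sumInChunks(Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L, 7L), 3, 2, 10)` runs three chunks on two threads and returns 28. A long-lived service would normally create the pool once and reuse it rather than building one per call, as done here for brevity.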

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Oct 27, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@deardeng
Contributor Author

run buildall

@doris-robot

TPC-DS: Total hot run time: 190209 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit cb5a5c8466a7a8eaa3f0e765e8af3cedd354b8e4, data reload: false

query	cold (ms)	hot1 (ms)	hot2 (ms)	best hot (ms)
query1	1095	431	405	405
query2	6553	1704	1732	1704
query3	6748	219	217	217
query4	26439	23346	23043	23043
query5	4372	665	500	500
query6	360	253	230	230
query7	4662	525	302	302
query8	310	273	282	273
query9	8744	2600	2595	2595
query10	503	348	290	290
query11	15881	15038	14878	14878
query12	189	123	120	120
query13	1691	578	450	450
query14	12738	9366	9496	9366
query15	263	196	177	177
query16	7803	685	511	511
query17	1617	764	614	614
query18	2073	464	356	356
query19	271	229	219	219
query20	145	145	134	134
query21	213	143	127	127
query22	4835	4848	5036	4848
query23	35120	33639	33763	33639
query24	8710	2505	2555	2505
query25	632	525	485	485
query26	1483	270	165	165
query27	2751	514	365	365
query28	5148	2264	2204	2204
query29	789	634	534	534
query30	302	251	207	207
query31	1007	1036	761	761
query32	86	70	74	70
query33	580	382	338	338
query34	830	877	538	538
query35	856	914	855	855
query36	987	1049	892	892
query37	131	115	91	91
query38	3581	3627	3464	3464
query39	1456	1409	1399	1399
query40	221	131	114	114
query41	70	66	74	66
query42	125	120	112	112
query43	484	495	469	469
query44	1263	750	755	750
query45	184	190	183	183
query46	899	996	642	642
query47	1775	1812	1715	1715
query48	395	446	327	327
query49	796	501	450	450
query50	650	701	401	401
query51	3978	3862	3843	3843
query52	116	114	102	102
query53	241	274	207	207
query54	603	593	521	521
query55	95	86	86	86
query56	327	349	307	307
query57	1169	1200	1125	1125
query58	285	289	280	280
query59	2678	2594	2506	2506
query60	346	343	330	330
query61	161	161	156	156
query62	827	745	691	691
query63	233	201	194	194
query64	4361	1188	916	916
query65	4053	3942	3950	3942
query66	1086	443	344	344
query67	15468	15280	15039	15039
query68	8373	885	615	615
query69	477	320	295	295
query70	1388	1318	1333	1318
query71	503	373	328	328
query72	5903	4960	5031	4960
query73	736	649	360	360
query74	8974	9140	8716	8716
query75	4042	3369	2801	2801
query76	3700	1152	757	757
query77	817	419	326	326
query78	9643	9684	8925	8925
query79	2385	850	583	583
query80	765	563	530	530
query81	514	261	233	233
query82	490	161	131	131
query83	275	283	245	245
query84	255	112	88	88
query85	947	468	417	417
query86	391	321	322	321
query87	3714	3713	3655	3655
query88	3760	2194	2206	2194
query89	392	327	307	307
query90	1991	220	213	213
query91	160	166	135	135
query92	88	71	66	66
query93	2022	980	635	635
query94	694	447	311	311
query95	428	327	313	313
query96	495	583	281	281
query97	2985	3026	2929	2929
query98	252	218	219	218
query99	1788	1439	1314	1314
Total cold run time: 284190 ms
Total hot run time: 190209 ms

@doris-robot

ClickBench: Total hot run time: 27.76 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit cb5a5c8466a7a8eaa3f0e765e8af3cedd354b8e4, data reload: false

query	cold (s)	hot1 (s)	hot2 (s)
query1	0.05	0.05	0.05
query2	0.10	0.05	0.06
query3	0.25	0.08	0.08
query4	1.62	0.12	0.11
query5	0.27	0.27	0.26
query6	1.19	0.66	0.64
query7	0.03	0.03	0.03
query8	0.05	0.04	0.04
query9	0.62	0.52	0.53
query10	0.57	0.58	0.58
query11	0.16	0.11	0.12
query12	0.14	0.12	0.12
query13	0.61	0.61	0.60
query14	1.02	1.01	1.01
query15	0.85	0.83	0.86
query16	0.39	0.39	0.39
query17	1.02	1.03	1.02
query18	0.21	0.22	0.20
query19	1.86	1.77	1.81
query20	0.02	0.02	0.01
query21	15.45	0.19	0.12
query22	5.07	0.07	0.05
query23	15.67	0.27	0.10
query24	2.52	0.65	0.82
query25	0.07	0.07	0.06
query26	0.14	0.15	0.13
query27	0.06	0.05	0.06
query28	4.80	1.13	0.93
query29	12.69	3.94	3.28
query30	0.28	0.14	0.11
query31	2.81	0.58	0.38
query32	3.24	0.54	0.48
query33	3.03	3.15	3.03
query34	15.87	5.13	4.56
query35	4.58	4.58	4.57
query36	0.71	0.50	0.48
query37	0.10	0.07	0.07
query38	0.07	0.05	0.04
query39	0.03	0.03	0.03
query40	0.17	0.14	0.14
query41	0.08	0.03	0.04
query42	0.04	0.02	0.03
query43	0.05	0.03	0.03
Total cold run time: 98.56 s
Total hot run time: 27.76 s

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 61.07% (91/149) 🎉
Increment coverage report
Complete coverage report


4 participants