Any changes on repo level concatenation? #22

IQ179 · 2024-06-25T04:56:52Z

For DeepSeek Coder V1, I found a detailed explanation of repo-level concatenation in an issue.
Has anything changed in the method between V1 and V2?

guoday · 2024-07-03T05:19:42Z

V2 switched from topological sorting concatenation to random concatenation mainly because random concatenation is more language-friendly.

IQ179 · 2024-07-04T02:41:12Z

Thanks for the answer. Can I ask a few more questions?

Does random concatenation mean you literally concatenate the files in no particular order? So it does not have to consider the dependencies between functions or classes?
I'm not sure what "language-friendly" means. Does "more language-friendly" mean "more natural", or "more similar to human-written code"?

guoday · 2024-07-04T03:17:00Z

Yes, because we believe that randomly concatenating files is more reflective of real-world scenarios. Programmers may not always write dependent functions or classes first; they might complete the logic of main.py first and then implement the code for the dependent files.
Language-friendly refers to being easier to handle. Topological sorting concatenation needs to consider the dependency parsing for each programming language, making it challenging to extend to over 300 languages.
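
To make the two strategies concrete, here is a minimal sketch, not the actual DeepSeek pipeline. The toy repo, the `deps` mapping, and the function names `topo_concat` and `random_concat` are hypothetical. It assumes the dependency graph has already been extracted, which is exactly the per-language step that is hard to scale.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy repo: file name -> file content.
files = {
    "main.py": "from utils import func_a\nprint(func_a(1, 2))\n",
    "utils.py": "def func_a(a, b):\n    return a, b\n",
}

# Hypothetical dependency map: file -> set of files it depends on.
# Building this map requires a parser for each programming language.
deps = {"main.py": {"utils.py"}, "utils.py": set()}


def topo_concat(files, deps):
    """Concatenate files so that dependencies come before dependents."""
    order = list(TopologicalSorter(deps).static_order())
    return order, "\n".join(files[name] for name in order)


def random_concat(files, seed=None):
    """Concatenate files in a shuffled order; no dependency parsing needed."""
    order = sorted(files)
    random.Random(seed).shuffle(order)
    return order, "\n".join(files[name] for name in order)


topo_order, topo_text = topo_concat(files, deps)
rand_order, rand_text = random_concat(files, seed=0)
```

Note that `random_concat` only needs the file list, while `topo_concat` needs `deps`, so the random variant works for any language without extra tooling.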

IQ179 · 2024-07-04T07:09:14Z

Thanks a lot!
So you mean that concatenating code in dependency order is not that important, right?
I thought it was the key to improving performance on very long code.

yiyepiaoling0715 · 2024-07-04T14:12:54Z

Regarding "complete the logic of main.py":

V1 mentions that only 4 languages were handled with repo-level dependency parsing. Why doesn't V2 keep that for those 4 languages?
I thought repo-level dependency parsing was a key point for the improvement.
e.g.
file1:
def func_a(a, b):
    return a, b
file2:
def main():
    c = func_a(1, 2)
If the model learns file2 before file1, it does not know what func_a is, so it will just memorize func_a(1, 2).
If the model learns file1 before file2, then when it reaches c = func_a(1, 2) it has already seen func_a(a, b), which may help its reasoning ability.

yiyepiaoling0715 · 2024-07-04T14:30:28Z

Can you give me an explanation? Am I right in this thinking?

guoday · 2024-07-08T04:44:01Z

Dependencies between files are important. When randomly concatenating files, some cases may satisfy these dependencies and can help improve the performance in very long codes. Other cases may contribute to the robustness of the model because not all application scenarios provide complete dependencies.
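
As a rough illustration of "some cases may satisfy these dependencies", the sketch below estimates how often a random file order happens to place a dependency before the file that uses it. The edge list and the name `satisfied_fraction` are hypothetical, and this is not the experiment referred to in this thread. Under a uniform shuffle, any single edge is satisfied with probability one half by symmetry.

```python
import random

# Hypothetical dependency edges: (dependent, dependency).
edges = [
    ("main.py", "utils.py"),
    ("main.py", "config.py"),
    ("utils.py", "config.py"),
]
names = sorted({name for edge in edges for name in edge})


def satisfied_fraction(order, edges):
    """Fraction of edges whose dependency appears before its dependent."""
    pos = {name: i for i, name in enumerate(order)}
    ok = sum(pos[dep] < pos[user] for user, dep in edges)
    return ok / len(edges)


rng = random.Random(0)
trials = 2000
total = 0.0
for _ in range(trials):
    order = names[:]
    rng.shuffle(order)
    total += satisfied_fraction(order, edges)

average = total / trials
print(f"average satisfied fraction: {average:.2f}")
```

So a randomly ordered sample still contains a mix of dependency-first and dependency-later pairs, which matches the two cases described above.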

yiyepiaoling0715 · 2024-07-08T10:13:08Z

So you removed the file topological graph because it is not beneficial at all? Is this confirmed by experiments?
The key point of my question is: do shuffled files lead to more hallucination cases than dependency-ordered files? I think the biggest advantage of dependency-ordered files is in reducing hallucination and improving reasoning.

guoday · 2024-07-08T10:21:11Z

In our experiment, we did not observe significant differences between random ordering and topological sorting on the benchmarks.

yiyepiaoling0715 · 2024-07-08T16:41:35Z

Why did DeepSeek Coder V1 mention the topological graph as a good point?

guoday · 2024-07-09T01:47:11Z

The Topological Graph is a good way to organize repo-level data, as you believe that "the biggest advantage of dependent files is alleviating hallucination and reasoning." However, based on our current experiments, the improvement of organizing repo-level data using a topological graph compared to random ordering on benchmarks is marginal, with no significant differences. Given the resources required to use topological sorting to parse hundreds of programming languages, random ordering is more efficient.
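
To give a sense of why dependency parsing costs effort per language, here is a deliberately simplified sketch. The patterns and the name `extract_deps` are hypothetical and cover only plain Python imports and quoted C includes. A real parser would also need to resolve names to files and handle relative imports, packages, and build systems for every language.

```python
import re

# Hypothetical, deliberately simplified patterns: one entry per language.
IMPORT_PATTERNS = {
    ".py": re.compile(r"^\s*(?:from|import)\s+([A-Za-z_][\w.]*)", re.MULTILINE),
    ".c": re.compile(r'^\s*#include\s+"([^"]+)"', re.MULTILINE),
}


def extract_deps(filename, source):
    """Return raw dependency names found in source, or None when the
    file's language has no pattern yet."""
    for ext, pattern in IMPORT_PATTERNS.items():
        if filename.endswith(ext):
            return pattern.findall(source)
    return None


py_deps = extract_deps("main.py", "from utils import func_a\nimport os\n")
c_deps = extract_deps("main.c", '#include "util.h"\n#include <stdio.h>\n')
unknown = extract_deps("Main.kt", "import foo.Bar\n")  # no pattern for this language
```

Every language without an entry yields no dependency information, so covering hundreds of languages means writing and maintaining hundreds of such rules, while random ordering needs none.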

yiyepiaoling0715 · 2024-07-10T11:54:45Z

Thanks, I see.

yiyepiaoling0715 · 2024-07-23T11:47:20Z

thanks
