why training hangs #1

Open
figurine2018 opened this issue Apr 1, 2018 · 3 comments

Comments

@figurine2018

@Aetf
I created the relevant environment and ran embedding.py on my own computer according to your documentation. The program hung after printing between 1 and 25 log lines (the position of the stall was different on each run), and it never exited.

2018-04-01 06:01:12.024821: myglobal 1 epoch 1 step 1 loss = 21.25 (0.9 samples/sec; 1.175 sec/batch)
2018-04-01 06:01:12.354372: myglobal 2 epoch 1 step 2 loss = 17.27 (3.2 samples/sec; 0.312 sec/batch)
2018-04-01 06:01:12.787619: myglobal 3 epoch 1 step 3 loss = 10.45 (2.9 samples/sec; 0.346 sec/batch)
2018-04-01 06:01:13.477380: myglobal 4 epoch 1 step 4 loss = 17.19 (1.5 samples/sec; 0.678 sec/batch)
2018-04-01 06:01:14.020272: myglobal 5 epoch 1 step 5 loss = 17.10 (1.9 samples/sec; 0.518 sec/batch)
2018-04-01 06:01:14.258575: myglobal 6 epoch 1 step 6 loss = 10.39 (4.4 samples/sec; 0.228 sec/batch)
2018-04-01 06:01:14.698754: myglobal 7 epoch 1 step 7 loss = 26.52 (2.5 samples/sec; 0.407 sec/batch)
2018-04-01 06:01:14.965694: myglobal 8 epoch 1 step 8 loss = 15.85 (4.1 samples/sec; 0.246 sec/batch)
2018-04-01 06:01:15.259785: myglobal 9 epoch 1 step 9 loss = 17.02 (3.6 samples/sec; 0.274 sec/batch)
<------ it hangs here and does nothing forever; the position is different on the next rerun

Ctrl+C does not work, but Ctrl+Z gets me back to the shell.
Using the "top" command, I saw that the host's CPU and memory were idle, so the process was no longer doing any work.

My system is Ubuntu 16.04 LTS, tensorflow=1.0.0, tensorflow_fold=0.0.1, python=3.5, CPU only.

Linux ubuntu 4.13.0-37-generic #42~16.04.1-Ubuntu SMP Wed Mar 7 16:03:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

How do I solve this problem?
Thanks very much!

@Aetf
Owner

Aetf commented Apr 1, 2018

Hmm, I can't think of any particular reason that could cause this problem. What was the exact command you used to run it? Also, could you run pytest in the top level folder and see if all the tests pass?
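
To see where the process is stuck when it hangs, one option is Python's standard faulthandler module, which can write the stack trace of every thread on request. The following is a minimal sketch, not code from this repository: the helper names install_hang_diagnostics and remove_hang_diagnostics are hypothetical, the 60-second default is arbitrary, and SIGUSR1 is assumed to be available (Linux or another Unix). If the hang is inside native TensorFlow code, the dump only shows the Python call that entered it.

```python
import faulthandler
import signal
import sys


def install_hang_diagnostics(timeout_sec=60, file=None):
    """Arrange for stack traces of all threads to be dumped.

    Hypothetical helper for debugging a hang; not part of this repository.
    """
    out = sys.stderr if file is None else file
    # Dump every thread's traceback when the process receives SIGUSR1
    # (Unix only), for example via `kill -USR1 <pid>` from another terminal.
    faulthandler.register(signal.SIGUSR1, file=out, all_threads=True)
    # Also dump tracebacks every `timeout_sec` seconds until cancelled,
    # so a trace is captured even if nobody sends the signal.
    faulthandler.dump_traceback_later(timeout_sec, repeat=True, file=out)


def remove_hang_diagnostics():
    """Undo install_hang_diagnostics()."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.unregister(signal.SIGUSR1)
```

Calling install_hang_diagnostics() near the top of embedding.py (an assumption about where it would fit) and then sending SIGUSR1 once the output stalls should show which Python line each thread is waiting on.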

@figurine2018
Author

The command I used for the result above was python embedding.py. Does this code never get stuck on your computer?

I found that both test_embedding.py and test_tbcnn.py are written against unittest (for example, unittest.main() and class TestEmbedding(unittest.TestCase):). When I run the pytest command in the root directory, it produces a series of errors.

@shiyy123

shiyy123 commented Sep 9, 2018

I changed the default value of the word_dim argument in tbcnn/config.py from 100 to 40, and then it ran.

# parser.add_argument('--word_dim', help='dimension of node feature', type=int, default=100)
parser.add_argument('--word_dim', help='dimension of node feature', type=int, default=40)
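
Because word_dim is declared as an argparse option, the same value can in principle be supplied on the command line without editing the default. This is a sketch only: build_parser is a hypothetical stand-in for the parser in tbcnn/config.py, and it is an assumption that embedding.py reads its arguments from the command line in this way.

```python
import argparse


def build_parser():
    """Hypothetical stand-in for the parser defined in tbcnn/config.py."""
    parser = argparse.ArgumentParser()
    # Same option as quoted above, with the original default left in place.
    parser.add_argument('--word_dim', help='dimension of node feature',
                        type=int, default=100)
    return parser


# Equivalent to passing `--word_dim 40` on the command line.
args = build_parser().parse_args(['--word_dim', '40'])
print(args.word_dim)  # prints 40
```

With no arguments, parse_args([]) falls back to the default of 100.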

hmstepanek added a commit to hmstepanek/tensorflow-tbcnn that referenced this issue Nov 4, 2021
hmstepanek added a commit to hmstepanek/tensorflow-tbcnn that referenced this issue Nov 8, 2021