why training hangs #1

figurine2018 · 2018-04-01T13:17:55Z

@Aetf
I created the relevant environment and run embedding.py on my own computer according to your documentation. The program hung after it run and printed 1-25 pieces of information (the position of the stall was different each time the program was run), but it did not exit.

2018-04-01 06:01:12.024821: myglobal 1 epoch 1 step 1 loss = 21.25 (0.9 samples/sec; 1.175 sec/batch)
2018-04-01 06:01:12.354372: myglobal 2 epoch 1 step 2 loss = 17.27 (3.2 samples/sec; 0.312 sec/batch)
2018-04-01 06:01:12.787619: myglobal 3 epoch 1 step 3 loss = 10.45 (2.9 samples/sec; 0.346 sec/batch)
2018-04-01 06:01:13.477380: myglobal 4 epoch 1 step 4 loss = 17.19 (1.5 samples/sec; 0.678 sec/batch)
2018-04-01 06:01:14.020272: myglobal 5 epoch 1 step 5 loss = 17.10 (1.9 samples/sec; 0.518 sec/batch)
2018-04-01 06:01:14.258575: myglobal 6 epoch 1 step 6 loss = 10.39 (4.4 samples/sec; 0.228 sec/batch)
2018-04-01 06:01:14.698754: myglobal 7 epoch 1 step 7 loss = 26.52 (2.5 samples/sec; 0.407 sec/batch)
2018-04-01 06:01:14.965694: myglobal 8 epoch 1 step 8 loss = 15.85 (4.1 samples/sec; 0.246 sec/batch)
2018-04-01 06:01:15.259785: myglobal 9 epoch 1 step 9 loss = 17.02 (3.6 samples/sec; 0.274 sec/batch)
<------it hangs and do nothing forever and different position in next rerunning

Ctrl+c does not work, and ctrl+z can exit.
I used the "top" command to see that the host's CPU and memory were idle and not busy running any more.

my system is Ubuntu16.04 LTS, tensorflow=1.0.0, tensorflow_fold_fold=0.0.1 python=3.5, CPU only

Linux ubuntu 4.13.0-37-generic #42~16.04.1-Ubuntu SMP Wed Mar 7 16:03:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

How do i solve this problem？
Thanks very much!

Aetf · 2018-04-01T17:48:02Z

Hmm, I can't think of any particular reason that could cause this problem. What was the exact command you used to run it? Also, could you run pytest in the top level folder and see if all the tests pass?

figurine2018 · 2018-04-03T16:24:27Z

The command which I used is python embedding.py for the result above. Is these codes never stuck on your computer?

I found that both the test_embedding.py file and the test_tbcnn.py file wrote test code according to unittest (for example, unittest.main() and class TestEmbedding(unittest.TestCase):). If I use the pytest command in the root directory, a series of errors may be generated (this is indeed the case).

shiyy123 · 2018-09-09T08:49:17Z

I modify the default value of argument word_dim in tbcnn/config.py from 100->400, then it can run.

# parser.add_argument('--word_dim', help='dimension of node feature', type=int, default=100)
parser.add_argument('--word_dim', help='dimension of node feature', type=int, default=40)

hmstepanek added a commit to hmstepanek/tensorflow-tbcnn that referenced this issue Nov 4, 2021

Increase word_dim as described in Aetf#1

567217c

hmstepanek added a commit to hmstepanek/tensorflow-tbcnn that referenced this issue Nov 8, 2021

Set word_dimm to 400 as recommended in Aetf#1

6f362d4

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

why training hangs #1

why training hangs #1

figurine2018 commented Apr 1, 2018

Aetf commented Apr 1, 2018

figurine2018 commented Apr 3, 2018

shiyy123 commented Sep 9, 2018

why training hangs #1

why training hangs #1

Comments

figurine2018 commented Apr 1, 2018

Aetf commented Apr 1, 2018

figurine2018 commented Apr 3, 2018

shiyy123 commented Sep 9, 2018