python/selfplay/synchronous_loop.sh
+1 -1 (lines changed: 1 addition & 1 deletion)
Original file line number
Diff line number
Diff line change
@@ -102,7 +102,7 @@ do
102
102
(
103
103
# Skip validate since peeling off 5% of data is actually a bit too chunky and discrete when running at a small scale, and validation data
104
104
# doesn't actually add much to debugging a fast-changing RL training.
105
-
time SKIP_VALIDATE=1 ./shuffle.sh "$BASEDIR" "$SCRATCHDIR" "$NUM_THREADS_FOR_SHUFFLING" "$BATCHSIZE" -min-rows "$SHUFFLE_MINROWS" -keep-target-rows "$SHUFFLE_KEEPROWS" -taper-window-scale "$TAPER_WINDOW_SCALE" | tee -a "$BASEDIR"/logs/outshuffle.txt
105
+
time SKIP_VALIDATE=1 ./shuffle.sh "$BASEDIR" "$SCRATCHDIR" "$NUM_THREADS_FOR_SHUFFLING" -min-rows "$SHUFFLE_MINROWS" -keep-target-rows "$SHUFFLE_KEEPROWS" -taper-window-scale "$TAPER_WINDOW_SCALE" | tee -a "$BASEDIR"/logs/outshuffle.txt
If you want to control the "scale" of the power law differently than the min rows, you can specify -taper-window-scale as well.
738
722
There is also a bit of a hack to cap the number of random rows (rows generated by random play without a neural net), since random row generation at the start of a run can be very fast due to not hitting the GPU, and overpopulate the run.
739
723
740
-
Additionally, NOT all of the shuffled window is output, only a random shuffled 20M rows will be kept. Adjust this using -keep-target-rows. The intention is that this script will be repeatedly run as new data comes in, such that well before train.py would need more than 20M rows, the data would have been shuffled again and a new random 20M rows chosen.
724
+
Additionally, NOT all of the shuffled window need be output: -keep-target-rows controls how many rows are randomly sampled and kept (pass 'all' to keep the whole window). For ongoing self-play training the intention is that this script is rerun as new data comes in, such that well before train.py would need more than -keep-target-rows rows, the data would have been reshuffled and a fresh random sample chosen.
741
725
742
-
If you are NOT doing ongoing self-play training, but simply want to shuffle an entire dataset (not just a window of it) and want to output all of it (not just 20M of it) then you can use arguments like:
743
-
-taper-window-exponent 1.0 \\
744
-
-expand-window-per-row 1.0 \\
745
-
-keep-target-rows SOME_VERY_LARGE_NUMBER
726
+
If you are NOT doing ongoing self-play training, but simply want to shuffle an entire dataset (not just a window of it) and output all of it, the default window args already select the whole dataset, so you just need:
727
+
-keep-target-rows all
746
728
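Reviewer note: the claim that the default window args already select the whole dataset can be sanity-checked with a small sketch. The function name `desired_window_rows` and the exact formula below are assumptions for illustration, not the code in shuffle.py. The sketch only tries to match the properties stated in the help text: the window starts at -min-rows, initially grows by -expand-window-per-row per extra row, grows asymptotically as the -taper-window-exponent power of the data rows, and uses -taper-window-scale (defaulting to -min-rows) as the power-law scale.

```python
def desired_window_rows(num_rows, min_rows=250_000, expand_window_per_row=1.0,
                        taper_window_exponent=1.0, taper_window_scale=None):
    """Hypothetical power-law window size; NOT the actual shuffle.py formula."""
    # Assumption: the scale defaults to min_rows, as the help text states.
    scale = min_rows if taper_window_scale is None else taper_window_scale
    # Assumption: with fewer rows than min_rows, the window is all available rows.
    if num_rows <= min_rows:
        return num_rows
    extra = num_rows - min_rows
    # Power law anchored so that it is 0 at extra == 0 with initial slope 1.
    grown = scale * ((1.0 + extra / scale) ** taper_window_exponent - 1.0) / taper_window_exponent
    window = min_rows + expand_window_per_row * grown
    # The window can never exceed the rows that actually exist.
    return min(num_rows, int(round(window)))


# With the new defaults (exponent 1.0, expand 1.0) the window is the whole dataset.
print(desired_window_rows(1_000_000))
# With a tapering exponent and a smaller slope the window is a strict subset.
print(desired_window_rows(1_000_000, expand_window_per_row=0.4, taper_window_exponent=0.65))
```

Under these assumptions, an exponent of 1.0 makes the power law linear, so the window is `min_rows + (num_rows - min_rows)`, which is every row. That is consistent with only needing `-keep-target-rows all` for a whole-dataset shuffle.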
747
729
If you ARE doing ongoing self-play training, but want a fixed window size, then you can use arguments like:
optional_args.add_argument('-min-rows', type=int, required=False, help='Minimum training rows to use, default 250k')
781
-
optional_args.add_argument('-max-rows', type=int, required=False, help='Maximum training rows to use, default unbounded')
782
-
optional_args.add_argument('-keep-target-rows', type=int, required=False, help='Target number of rows to actually keep in the final data set, default 20M')
783
-
required_args.add_argument('-expand-window-per-row', type=float, required=True, help='Beyond min rows, initially expand the window by this much every post-random data row')
784
-
required_args.add_argument('-taper-window-exponent', type=float, required=True, help='Make the window size asymtotically grow as this power of the data rows')
761
+
optional_args.add_argument('-min-rows', type=int, required=False, help='Minimum size of the desired training window, default 250k')
762
+
optional_args.add_argument('-max-rows', type=int, required=False, help='Maximum size of the desired training window, default unbounded')
763
+
required_args.add_argument('-keep-target-rows', required=True, help="Target number of rows to actually sample and keep in the final output shuffle, or 'all' to keep the whole window")
764
+
optional_args.add_argument('-expand-window-per-row', type=float, required=False, default=1.0, help='Beyond min rows, initially expand the window by this much every post-random data row (default 1.0)')
765
+
optional_args.add_argument('-taper-window-exponent', type=float, required=False, default=1.0, help='Make the window size asymtotically grow as this power of the data rows (default 1.0)')
785
766
optional_args.add_argument('-taper-window-scale', type=float, required=False, help='The scale at which the power law applies, defaults to -min-rows')
786
767
optional_args.add_argument('-add-to-data-rows', type=float, required=False, help='Compute the window size as if the number of data rows were this much larger/smaller')
787
-
optional_args.add_argument('-add-to-window-size', type=float, required=False, help='DEPRECATED due to being misnamed name, use -add-to-data-rows')
788
768
optional_args.add_argument('-summary-file', required=False, help='Summary json file for directory contents')
789
-
required_args.add_argument('-out-dir', required=True, help='Dir to output training files')
790
-
required_args.add_argument('-out-tmp-dir', required=True, help='Dir to use as scratch space')
769
+
optional_args.add_argument('-out-dir', required=False, help='Dir to output training files (not required in --dry-run-print-resource-cost mode)')
770
+
optional_args.add_argument('-out-tmp-dir', required=False, help='Dir to use as scratch space (not required in --dry-run-print-resource-cost mode)')
791
771
optional_args.add_argument('-approx-rows-per-out-file', type=int, required=False, default=70000, help='Number of rows per output file, default 70k')
792
772
optional_args.add_argument('-approx-rows-per-bucket', type=int, required=False, help='Each merge worker takes one whole bucket in RAM and splits it equally into output files. Bigger buckets means shard files. Must be a multiple of -approx-rows-per-out-file. Default: equal to -approx-rows-per-out-file.')
793
773
optional_args.add_argument('-num-waves', type=int, required=False, default=1, help='If > 1, shuffle in this many waves to bound peak intermediate shard count and temp disk usage for very large (whole-dataset) shuffles. Default 1 (no waves).')
794
774
optional_args.add_argument('--dry-run-print-resource-cost', type=int, required=False, metavar='NUM_DATASET_ROWS', help='Do not actually shuffle (or even scan the dataset). Assume the dataset has this many total rows, run the window-size / keep / md5-filter math, and print rough estimates of output files, peak intermediate shard count, peak temp disk usage, and peak memory. Assumes 19x19 data and typical measured per-row sizes.')
795
775
required_args.add_argument('-num-processes', type=int, required=True, help='Number of multiprocessing processes for shuffling in parallel')
796
-
required_args.add_argument('-batch-size', type=int, required=True, help='Batch size to write training examples in')
797
-
optional_args.add_argument('-ensure-batch-multiple', type=int, required=False, help='Ensure each file is a multiple of this many batches')
798
776
optional_args.add_argument('-worker-group-size', type=int, required=False, default=80000, help='Internally, target having many rows per parallel sharding worker (doesnt affect merge)')
799
777
optional_args.add_argument('-exclude', required=False, help='Text file with npzs to ignore, one per line')
800
778
optional_args.add_argument('-exclude-prefix', required=False, help='Prefix to concat to lines in exclude to produce the full file path')
optional_args.add_argument('-only-include-md5-path-prop-ubound', type=float, required=False, help='Just before sharding, include only filepaths hashing to float < this')
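Reviewer note: since `-keep-target-rows` no longer has `type=int`, its value arrives as a string and has to be interpreted after parsing. The sketch below shows one way that could look. The helper `parse_keep_target_rows` and the choice of `None` to mean "keep the whole window" are hypothetical, not taken from the commit; only the argument name and its help text come from the diff.

```python
import argparse


def parse_keep_target_rows(value):
    """Hypothetical helper: None means keep the whole window, else a positive int."""
    if value == "all":
        return None
    rows = int(value)  # raises ValueError for anything that is not an integer
    if rows <= 0:
        raise ValueError("-keep-target-rows must be a positive integer or 'all'")
    return rows


def build_parser():
    # Mirrors only the single argument under discussion, not the full shuffle.py CLI.
    parser = argparse.ArgumentParser(add_help=False)
    required_args = parser.add_argument_group("required arguments")
    required_args.add_argument(
        "-keep-target-rows", required=True,
        help="Target number of rows to actually sample and keep in the final "
             "output shuffle, or 'all' to keep the whole window")
    return parser


args = build_parser().parse_args(["-keep-target-rows", "all"])
keep_target_rows = parse_keep_target_rows(args.keep_target_rows)
print(keep_target_rows)
```

Validating after parsing, rather than through a custom `type=` callable, keeps the error message under the script's control. Either approach would work with argparse.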