## Make sure related data is put into the same bucket

- Assign related items to the same data partition, e.g. chunks of the same file, audio spoken by the same person, etc.

This makes sure your network has _not_ been contaminated by seeing test samples during the training phase. For example, the following function from the TensorFlow examples stably assigns files to partitions, ignoring the part of the file name matched by a regex:

```python
import hashlib
import os
import re

from tensorflow.python.util import compat

validation_percentage = 10.0
testing_percentage = 10.0
MAX_NUM_WAVS_PER_CLASS = 2**27 - 1  # ~134M

def which_set(filename, validation_percentage, testing_percentage):
  base_name = os.path.basename(filename)
  # We want to ignore anything after '_nohash_' in the file name when
  # deciding which set to put a wav in, so the data set creator has a way of
  # grouping wavs that are close variations of each other.
  hash_name = re.sub(r'_nohash_.*$', '', base_name)
  # This looks a bit magical, but we need to decide whether this file should
  # go into the training, testing, or validation sets, and we want to keep
  # existing files in the same set even if more files are subsequently
  # added.
  # To do that, we need a stable way of deciding based on just the file name
  # itself, so we do a hash of that and then use that to generate a
  # probability value that we use to assign it.
  hash_name_hashed = hashlib.sha1(compat.as_bytes(hash_name)).hexdigest()
  percentage_hash = ((int(hash_name_hashed, 16) %
                      (MAX_NUM_WAVS_PER_CLASS + 1)) *
                     (100.0 / MAX_NUM_WAVS_PER_CLASS))
  if percentage_hash < validation_percentage:
    result = 'validation'
  elif percentage_hash < (testing_percentage + validation_percentage):
    result = 'testing'
  else:
    result = 'training'
  return result
```
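The same idea can be shown without a TensorFlow dependency. Below is a minimal, self-contained sketch (the function name `assign_split` and the `_nohash_` grouping convention are taken from the example above; the split percentages are illustrative) that demonstrates the key property: all variants of the same underlying recording hash to the same group key, so they always land in the same partition, no matter how many files are added later.

```python
import hashlib
import os
import re

def assign_split(filename, validation_percentage=10.0, testing_percentage=10.0):
    """Stably assign a file to 'training', 'validation', or 'testing'.

    Files that differ only after '_nohash_' share a group key, so close
    variations of the same recording never straddle a split boundary.
    """
    base_name = os.path.basename(filename)
    # Strip everything after '_nohash_' so variants share one hash input.
    group_key = re.sub(r'_nohash_.*$', '', base_name)
    max_items = 2 ** 27 - 1
    # SHA-1 of the group key gives a stable pseudo-random value in [0, 100).
    hashed = int(hashlib.sha1(group_key.encode('utf-8')).hexdigest(), 16)
    percentage_hash = (hashed % (max_items + 1)) * (100.0 / max_items)
    if percentage_hash < validation_percentage:
        return 'validation'
    elif percentage_hash < testing_percentage + validation_percentage:
        return 'testing'
    return 'training'

# Variants of the same utterance always end up in the same split:
print(assign_split('speaker1_nohash_0.wav'))
print(assign_split('speaker1_nohash_1.wav'))
```

Because the decision depends only on the file name's group key, re-running the split on a grown dataset never moves existing files between partitions.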