## Make sure related data is put into the same bucket

- Assign related items to the same data partition, e.g. chunks of the same file, audio spoken by the same person, etc.

This makes sure your network has _not_ been contaminated by seeing test samples during the training phase. For example, the following function from the TensorFlow examples stably assigns files to partitions, ignoring the part of the file name matched by a regex:

```python
import hashlib
import os
import re

from tensorflow.python.util import compat

validation_percentage = 10.0
testing_percentage = 10.0
MAX_NUM_WAVS_PER_CLASS = 2**27 - 1  # ~134M

def which_set(filename, validation_percentage, testing_percentage):
  base_name = os.path.basename(filename)
  # We want to ignore anything after '_nohash_' in the file name when
  # deciding which set to put a wav in, so the data set creator has a way of
  # grouping wavs that are close variations of each other.
  hash_name = re.sub(r'_nohash_.*$', '', base_name)
  # This looks a bit magical, but we need to decide whether this file should
  # go into the training, testing, or validation sets, and we want to keep
  # existing files in the same set even if more files are subsequently
  # added.
  # To do that, we need a stable way of deciding based on just the file name
  # itself, so we do a hash of that and then use that to generate a
  # probability value that we use to assign it.
  hash_name_hashed = hashlib.sha1(compat.as_bytes(hash_name)).hexdigest()
  percentage_hash = ((int(hash_name_hashed, 16) %
                      (MAX_NUM_WAVS_PER_CLASS + 1)) *
                     (100.0 / MAX_NUM_WAVS_PER_CLASS))
  if percentage_hash < validation_percentage:
    result = 'validation'
  elif percentage_hash < (testing_percentage + validation_percentage):
    result = 'testing'
  else:
    result = 'training'
  return result
```
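The same idea can be shown without a TensorFlow dependency. Below is a minimal, self-contained sketch (the function name `assign_split` and the `_nohash_` grouping convention are taken from the example above; the split percentages are illustrative) that demonstrates the key property: all variants of the same underlying recording hash to the same group key, so they always land in the same partition, no matter how many files are added later.

```python
import hashlib
import os
import re

def assign_split(filename, validation_percentage=10.0, testing_percentage=10.0):
    """Stably assign a file to 'training', 'validation', or 'testing'.

    Files that differ only after '_nohash_' share a group key, so close
    variations of the same recording never straddle a split boundary.
    """
    base_name = os.path.basename(filename)
    # Strip everything after '_nohash_' so variants share one hash input.
    group_key = re.sub(r'_nohash_.*$', '', base_name)
    max_items = 2 ** 27 - 1
    # SHA-1 of the group key gives a stable pseudo-random value in [0, 100).
    hashed = int(hashlib.sha1(group_key.encode('utf-8')).hexdigest(), 16)
    percentage_hash = (hashed % (max_items + 1)) * (100.0 / max_items)
    if percentage_hash < validation_percentage:
        return 'validation'
    elif percentage_hash < testing_percentage + validation_percentage:
        return 'testing'
    return 'training'

# Variants of the same utterance always end up in the same split:
print(assign_split('speaker1_nohash_0.wav'))
print(assign_split('speaker1_nohash_1.wav'))
```

Because the decision depends only on the file name's group key, re-running the split on a grown dataset never moves existing files between partitions.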