@bilal2vec
Last active November 27, 2019 03:11
Try more architectures
Basic architectures are sometimes better
Try other forms of ensembling than cv
Blend with linear regression
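A minimal sketch of linear-regression blending, using ordinary least squares to fit blend weights over hypothetical out-of-fold predictions (all names and numbers here are illustrative):

```python
import numpy as np

# Hypothetical out-of-fold predictions from three models (rows = samples)
oof_preds = np.array([
    [0.2, 0.3, 0.1],
    [0.8, 0.7, 0.9],
    [0.4, 0.5, 0.6],
    [0.9, 0.8, 0.7],
])
y = np.array([0.0, 1.0, 0.0, 1.0])

# Fit blend weights (plus an intercept) with ordinary least squares
X = np.column_stack([oof_preds, np.ones(len(y))])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

# The blended prediction is a learned linear combination of the models
blend = X @ coefs
```

In practice the weights are fit on OOF predictions and then applied to each model's test-set predictions.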
Rely more on shakeup predictions
Make sure copied code is correct
Pay more attention to correlations between folds
Try not to extensively tune hyperparameters
Optimizing thresholds can lead to "brittle" models
Random initializations between folds might help diversity
Something that works for someone might not help you
Look into label smoothing?
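Binary label smoothing is a one-liner; a sketch with a hypothetical `smooth_labels` helper:

```python
import numpy as np

def smooth_labels(y, eps=0.1):
    # Pull hard 0/1 targets eps/2 away from the extremes so the model
    # is never pushed to output exactly 0 or 1
    return y * (1.0 - eps) + 0.5 * eps
```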
Use multiprocessing
Train a model and throw away very confident predictions
Consider using bce + soft f1 loss
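A numpy sketch of the BCE + soft-F1 combination (the helper names and the 50/50 weighting are assumptions, not from the source) — the soft F1 uses probabilities in place of hard 0/1 predictions so the metric becomes differentiable:

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-8):
    # Standard binary cross-entropy on probabilities
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def soft_f1_loss(y_true, y_pred, eps=1e-8):
    # "Soft" F1: probabilities stand in for hard predictions in the
    # TP/FP/FN counts; minimize 1 - F1
    tp = np.sum(y_pred * y_true)
    fp = np.sum(y_pred * (1 - y_true))
    fn = np.sum((1 - y_pred) * y_true)
    return 1.0 - (2 * tp) / (2 * tp + fp + fn + eps)

def combined_loss(y_true, y_pred, alpha=0.5):
    # Assumed equal weighting between the two terms
    return alpha * bce_loss(y_true, y_pred) + (1 - alpha) * soft_f1_loss(y_true, y_pred)
```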
Bucket sentences in batches with similar lengths
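Length bucketing can be as simple as sorting indices by length before slicing into batches, so each batch needs far less padding; a minimal sketch (the `bucket_batches` helper is hypothetical):

```python
import numpy as np

def bucket_batches(lengths, batch_size):
    # Sort sample indices by sequence length, then slice consecutive
    # batches, so sentences of similar length end up together
    order = np.argsort(lengths)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
```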
Mask before softmax
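Masking must happen before the softmax (by setting padded positions to a large negative score), not after — zeroing weights after the softmax leaves the remaining weights unnormalized. A numpy sketch:

```python
import numpy as np

def masked_softmax(scores, mask):
    # Padded positions get a huge negative score BEFORE the softmax,
    # so they receive (near-)zero attention weight
    scores = np.where(mask, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)
```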
What boosted my model most was unfreezing embeddings towards the end of each run and updating unknown words so that subsequent models/folds could use more words for training.
https://www.kaggle.com/c/quora-insincere-questions-classification/discussion/80542

Look at hill climbing to get best coefs
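Hill climbing over blend coefficients can be a few lines of numpy; this sketch (function name, step count, and step size are assumptions) starts from equal weights and keeps any random perturbation that lowers the held-out squared error:

```python
import numpy as np

def hill_climb_blend(preds, y, steps=200, step_size=0.05, seed=0):
    # preds: (n_models, n_samples) predictions; y: held-out labels.
    # Greedily accept random weight perturbations that reduce error.
    rng = np.random.default_rng(seed)
    n_models = preds.shape[0]
    w = np.ones(n_models) / n_models
    best = np.mean((w @ preds - y) ** 2)
    for _ in range(steps):
        cand = np.clip(w + rng.normal(scale=step_size, size=n_models), 0, None)
        s = cand.sum()
        if s == 0:
            continue
        cand /= s  # keep weights normalized to sum to 1
        err = np.mean((cand @ preds - y) ** 2)
        if err < best:
            w, best = cand, err
    return w, best
```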
Make sure cyclic lr ends training with the lr at lowest point
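One way to guarantee this is to compute the LR as a pure function of the step, with the cosine phase pinned to lr_min on the final step; a sketch (the warmup fraction and LR bounds are illustrative, not from the source):

```python
import math

def one_cycle_lr(step, total_steps, lr_max=1e-3, lr_min=1e-6, warmup_frac=0.3):
    # Linear warmup to lr_max, then cosine decay that lands exactly on
    # lr_min at the last step, so training ends at the cycle's low point
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return lr_min + (lr_max - lr_min) * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - 1 - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```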
Concentrate on the embedding layer and get the lowest number of OOV words
If you have continuous features, bin them and add them as categorical auxiliary features/targets
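Equal-frequency (quantile) binning is one simple way to do this; a numpy sketch with a hypothetical helper:

```python
import numpy as np

def quantile_bin(x, n_bins=4):
    # Split a continuous feature at its quantiles and return integer
    # bin ids usable as a categorical auxiliary feature/target
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(x, edges)
```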

Building the vocabulary on train, val, and test data between folds can leak information and artificially increase the CV score
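The safe pattern is to fit the vocabulary on the training fold only and map everything else through an unknown token; a minimal sketch (the helper name and special tokens are assumptions):

```python
from collections import Counter

def build_vocab(train_texts, min_freq=1):
    # Count tokens in the TRAINING fold only; val/test tokens absent
    # from this vocab fall back to <unk> instead of leaking information
    counts = Counter(tok for text in train_texts for tok in text.split())
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, c in counts.most_common():
        if c >= min_freq:
            vocab[tok] = len(vocab)
    return vocab
```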

Discussion post ideas

  • leaderboard trends
  • pytorch_zoo
  • cv scores
  • shakeup vis

Quora:

Try using checkpoint ensembling
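Checkpoint ensembling in its simplest form just averages the predictions from snapshots of a single training run; a sketch with made-up sigmoid outputs:

```python
import numpy as np

# Hypothetical sigmoid outputs saved at three checkpoints of one run
checkpoint_preds = [
    np.array([0.2, 0.9, 0.4]),
    np.array([0.3, 0.8, 0.5]),
    np.array([0.1, 0.95, 0.45]),
]

# Average the snapshots' predictions for a cheap ensemble
ensembled = np.mean(checkpoint_preds, axis=0)
```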
Initialize OOV embeddings by sampling from the distribution of the pretrained embeddings
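Instead of leaving OOV rows at zero, sample them from the empirical mean/std of the pretrained vectors; a numpy sketch (the helper name is an assumption):

```python
import numpy as np

def init_oov(embedding_matrix, oov_rows, seed=0):
    # Sample OOV rows from the mean/std of the known pretrained
    # vectors rather than leaving them as zeros
    rng = np.random.default_rng(seed)
    known = np.delete(embedding_matrix, oov_rows, axis=0)
    mean, std = known.mean(), known.std()
    out = embedding_matrix.copy()
    out[oov_rows] = rng.normal(mean, std, size=(len(oov_rows), embedding_matrix.shape[1]))
    return out
```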
Weighted average of embeddings
Spatial dropout after embeddings
Gaussian noise after embeddings
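Spatial dropout zeroes whole embedding channels per sample (rather than independent elements), and Gaussian noise is simple additive regularization; numpy sketches of both (the function names are assumptions, mirroring the behavior of Keras's SpatialDropout1D and GaussianNoise layers):

```python
import numpy as np

def spatial_dropout(x, p=0.2, rng=None):
    # x: (batch, seq_len, emb_dim); drop whole embedding channels per
    # sample, so a dropped channel is zero across the entire sequence
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random((x.shape[0], 1, x.shape[2])) >= p
    return x * mask / (1.0 - p)  # inverted-dropout scaling

def gaussian_noise(x, std=0.1, rng=None):
    # Additive zero-mean noise, applied only at training time
    if rng is None:
        rng = np.random.default_rng(0)
    return x + rng.normal(0.0, std, size=x.shape)
```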
Use convolutions on the outputs of RNNs
Use batch norm after dense layers
Try one-cycle
Use more of spaCy's features to get fewer OOV words

Look at pos tagging
https://www.kaggle.com/ryches/22nd-place-solution-6-models-pos-tagging
https://www.kaggle.com/ryches/parts-of-speech-disambiguation-error-analysis

reinit embedding matrix between runs

Jigsaw

  • optimize for the metric

  • can bias different models towards different subgroups

  • drop 50% of negative samples (concentrate only on important samples)

  • use batch random samplers, gradient accumulation, mixed precision

  • if you don't want to use loss weighting, remove negative samples from the dataset (https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/discussion/97484)

  • sort data by length in torch dataset and don't use a random sampler

  • manually init layers
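The negative-sample dropping mentioned above can be a small preprocessing step; a numpy sketch (the helper name follows the note's 50% figure, the rest is illustrative):

```python
import numpy as np

def drop_negatives(X, y, frac=0.5, seed=0):
    # Randomly drop a fraction of negative (label 0) samples so
    # training concentrates on the important samples
    rng = np.random.default_rng(seed)
    neg = np.where(y == 0)[0]
    pos = np.where(y == 1)[0]
    keep_neg = rng.choice(neg, size=int(len(neg) * (1 - frac)), replace=False)
    keep = np.sort(np.concatenate([pos, keep_neg]))
    return X[keep], y[keep]
```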
