{ "cells": [ { "metadata": {}, "cell_type": "markdown", "source": "# Using BERT to solve 'open cloze' exercises\n\n@Data_sigh\n\nBERT (Bidirectional Encoder Representations from Transformers) with pretrained weights is loaded from the library of state-of-the-art pretrained models from HuggingFace\n\nBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\nhttps://arxiv.org/abs/1810.04805\n@article{devlin2018bert,\n title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},\n author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},\n journal={arXiv preprint arXiv:1810.04805},\n year={2018}\n}\nhttps://github.com/google-research/bert\n\nhttps://github.com/huggingface/pytorch-transformers\n\nhttps://colab.research.google.com/github/pytorch/pytorch.github.io/blob/master/assets/hub/huggingface_pytorch-pretrained-bert_bert.ipynb\n\nBertTokenizer: to perform end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization\nText normalization: Convert all whitespace characters to spaces, and (for the Uncased model) lowercase the input and strip out accent markers. E.g., John Johanson's, → john johanson's,.\nPunctuation splitting: Split all punctuation characters on both sides (i.e., add whitespace around all punctuation characters). Punctuation characters are defined as (a) Anything with a P* Unicode class, (b) any non-letter/number/space ASCII character (e.g., characters like $ which are technically not punctuation). E.g., john johanson's, → john johanson ' s ,\nWordPiece tokenization: Apply whitespace tokenization to the output of the above procedure, and apply WordPiece tokenization to each token separately. (Our implementation is directly based on the one from tensor2tensor, which is linked). 
E.g., john johanson ' s , → john johan ##son ' s ,\n\nBertForMaskedLM: BERT Transformer with the pre-trained masked language modelling head on top (fully pre-trained)" }, { "metadata": { "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5", "trusted": false }, "cell_type": "code", "source": "from fastai.text import * \nfrom pytorch_transformers import BertTokenizer, BertForMaskedLM\nimport glob", "execution_count": 1, "outputs": [] }, { "metadata": { "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0", "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a", "trusted": false }, "cell_type": "code", "source": "bert_model_name=\"bert-base-uncased\" # Pretrained weights shortcut\ntokenizer = BertTokenizer.from_pretrained(bert_model_name, do_lower_case=True) # wordpiece tokenizer\nmaskedLM_model = BertForMaskedLM.from_pretrained(bert_model_name)\nmaskedLM_model.eval();", "execution_count": 2, "outputs": [] }, { "metadata": { "trusted": false }, "cell_type": "code", "source": "fnames = glob.glob(\"C:/Users/aliso/.fastai/data/cambridge_nlp/test/open_cloze/*.xlsx\") # test data", "execution_count": 3, "outputs": [] }, { "metadata": { "trusted": false }, "cell_type": "code", "source": "def exam_test_open_cloze(df_test, n=1):\n score = 0\n txt_before_gap = tokenizer.tokenize(df_test.text[0])\n txt_before_gap = ' '.join(txt_before_gap[-6:]) # no more than 6 tokens\n for i in range(len(df_test)-1): \n txt_after_gap = df_test.text[i+1]\n txt = '[CLS] ' + txt_before_gap + ' [MASK] ' + txt_after_gap + ' [SEP]'\n # Tokenized input\n tokens_txt = tokenizer.tokenize(txt)\n idx_tokens = [tokenizer.convert_tokens_to_ids(tokens_txt)]\n masked_idx = tokens_txt.index('[MASK]')\n segments_ids = [0] * masked_idx + [1] * (len(tokens_txt)-masked_idx)\n # Convert inputs to PyTorch tensors\n segments_tensors = torch.tensor(segments_ids)\n tokens_tensor = torch.tensor(idx_tokens)\n # Predict the missing token (indicated with [MASK]) with 
`BertForMaskedLM`\n with torch.no_grad(): preds = maskedLM_model(tokens_tensor, token_type_ids=segments_tensors) # in pytorch_transformers the second positional argument is attention_mask, so pass segments by keyword\n preds_idx = [torch.argmax(preds[0][0, masked_idx, :]).item()]\n pred_token = tokenizer.convert_ids_to_tokens(preds_idx)[0]\n \n if pred_token in [',', '.']: # Take the second-highest prediction if the first is punctuation\n preds[0][0, masked_idx, preds_idx] = float('-inf') # suppress this token (logits can be negative, so setting it to 0 would not be enough)\n preds_idx = [torch.argmax(preds[0][0, masked_idx, :]).item()]\n pred_token = tokenizer.convert_ids_to_tokens(preds_idx)[0]\n \n if n==2:\n # Make two suggestions for each gap\n preds[0][0, masked_idx, preds_idx] = float('-inf')\n preds_idx = [torch.argmax(preds[0][0, masked_idx, :]).item()]\n pred_token2 = tokenizer.convert_ids_to_tokens(preds_idx)[0]\n pred_token = [pred_token, pred_token2] # propose 2 words\n\n print(df_test.text[i], \"[\", pred_token, \":\", df_test.answer[i], \"]\")\n \n # Answers with alternatives are stored as strings like \"['which','that']\"; splitting on the apostrophe extracts the candidate words\n actual_answer = list(df_test.answer[i].lower().split(\"'\"))\n if n==2:\n if any(x in pred_token for x in actual_answer):\n score += 1\n else:\n if pred_token in actual_answer:\n score += 1\n \n # The text after this gap becomes the context before the next one\n txt_before_gap = tokenizer.tokenize(txt_after_gap)\n txt_before_gap = ' '.join(txt_before_gap[-7:]) # keep no more than the last 7 tokens\n \n print(df_test.text[len(df_test)-1])\n print(\"SCORE\", score, '/', len(df_test)-1)\n return score", "execution_count": 4, "outputs": [] }, { "metadata": { "trusted": false }, "cell_type": "code", "source": "pd.set_option('display.max_colwidth', 150)\ndf_test = pd.read_excel(fnames[0])\ndf_test", "execution_count": 5, "outputs": [ { "data": { "text/html": "
|   | question | answer | text |
|---|---|---|---|
| 0 | 0.0 | as | I work |
| 1 | 9.0 | where | a motorbike stunt rider - that is, I do tricks on my motorbike at shows. The Le Mans racetrack in France was |
| 2 | 10.0 | so | I first saw some guys doing motorbike stunts. I'd never seen anyone riding a motorbike using just the back wheel before and I was |
| 3 | 11.0 | myself | impressed I went straight home and taught |
| 4 | 12.0 | in | to do the same. It wasn't very long before I began to earn my living at shows performing my own motorbike stunts. I have a degree |
| 5 | 13.0 | ['which','that'] | mechanical engineering; this helps me to look at the physics |
| 6 | 14.0 | ['out','on','at'] | lies behind each stunt. In addition to being responsible for design changes to the motorbike, I have to work |
| 7 | 15.0 | from | every stunt I do. People often think that my work is very dangerous, but, apart |
| 8 | 16.0 | any | some minor mechanical problems happening occasionally during a stunt, nothing ever goes wrong. I never feel in |
| 9 | NaN | NaN | kind of danger because I'm very experienced. |