{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using Keras and Spacy for NLP tasks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extracting data from the corpus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will have a look at the subtitle corpus that will be used in the assignment."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset is structured with one line per utterance, except a special line starting with ### to denote the start of a new subtitle:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### {\"file\": \"OpenSubtitles/raw/no/0/1115475/4558788.xml\", \"genre\": \"Documentary\", \"duration2\": 1399.78, \"tokens\": \"2365\", \"sentences\": \"255\", \"language\": \"Norwegian\", \"rating\": \"4.0\", \"blocks\": \"257\"}\n",
      "For å forstå hvordan en storby virker må man løfte på huden og blottlegge den skjulte livsnerven.\n",
      "Et ufattelig komplisert system som er nødvendig for alle, men begripelig for få.\n",
      "Her begynner vår oppdagelsesferd under overflaten i verdens storbyer.\n",
      "London, en gang imperiets hovedstad og fortsatt et av verdens knutepunkt.\n",
      "London ble bygget ved en elv.\n",
      "Byen overlever ved hjelp av tre transportårer:\n",
      "Langs vann, på land og i luften.\n",
      "Alle må holdes åpne for at London skal klare seg.\n",
      "For å unngå katastrofer overvåkes London dag og natt.\n",
      "Og dette er vaktene:\n",
      "Kameraer.\n",
      "Sensorer.\n",
      "Radarer.\n",
      "London er jordens mest overvåkede by.\n",
      "Storebror følger med døgnet rundt:\n",
      "På land, langs elva og i lufta.\n",
      "Heathrow - porten til London.\n",
      "Ingen annen flyplass gjør så mye med så lite.\n",
      "En halv million flygninger per år med bare to rullebaner.\n"
     ]
    }
   ],
   "source": [
    "!head -n 20 /nr/samba/user/plison/code/grounding/outputs/no-all.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We start with extracting 1000 subtitles from the text data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "fd = open(\"/nr/samba/user/plison/code/grounding/outputs/no-all.txt\")\n",
    "nb_subtitles =1000\n",
    "dialogues = []\n",
    "for line in fd:\n",
    "    if line.startswith(\"###\"):\n",
    "        dialogues.append([])\n",
    "        if len(dialogues) >= nb_subtitles:\n",
    "            break\n",
    "    else:\n",
    "        dialogues[-1].append(line.rstrip(\"\\n\"))\n",
    "fd.close()\n",
    "        "
   ]
  },
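  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check, the number of extracted dialogues should match `nb_subtitles`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(dialogues)"
   ]
  },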
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each dialogue is a list of utterances:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Folk frykter at Grace plutselig skal gjøre noe som gjør stor skade.',\n",
       " 'Hvorfor vil du ikke snakke om det?',\n",
       " 'Du har ikke snakket med meg!',\n",
       " 'Lensmann Jansen og onkelen min, de tjenestegjorde sammen.',\n",
       " '-Hei, er det Johansson, journalisten?',\n",
       " '-Ja, det er meg.',\n",
       " '-Jeg heter Elise og...',\n",
       " '-Beklager, jeg kan ikke.',\n",
       " 'Bjørn sier han kan hjelpe, men...',\n",
       " 'Ja det nytter iallfall ikke å gjøre avtaler med det onde.',\n",
       " 'Faen!',\n",
       " '-Hvem var han?',\n",
       " '-De aner ikke.',\n",
       " 'Vi mistenker at han kom over på russisk side og ledet oppdrag derfra.',\n",
       " 'Mia Holt og Thomas Lønnhøiden, de skal stanses.',\n",
       " '--==DBRETAiL==-- Released on Danishbits.org',\n",
       " '-Dette er risikabelt.',\n",
       " '-Slapp av nå.',\n",
       " '-Ikke så lenge vi ikke har en plan.',\n",
       " '-Tror du ikke jeg har det?']"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dialogues[10][:20]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now have to tokenise and lemmatise the dialogues. The easiest is to use Spacy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": [
    "import spacy\n",
    "nlp = spacy.load(\"nb_core_news_sm\")  # This is the standard Spacy model for Norwegian Bokmål"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Pierre with POS tag: PROPN___ and dependency relation: nsubj with ga as head\n",
      "ga with POS tag: VERB__Mood=Ind|Tense=Past|VerbForm=Fin and dependency relation: ROOT with ga as head\n",
      "boken with POS tag: NOUN__Definite=Def|Gender=Masc|Number=Sing and dependency relation: dobj with ga as head\n",
      "til with POS tag: ADP___ and dependency relation: case with Jan as head\n",
      "Jan with POS tag: PROPN__Gender=Masc and dependency relation: nmod with ga as head\n",
      "Tore with POS tag: PROPN__Gender=Masc and dependency relation: name with Jan as head\n",
      "mens with POS tag: SCONJ___ and dependency relation: mark with Universitetet as head\n",
      "de with POS tag: PRON__Case=Nom|Number=Plur|Person=3|PronType=Prs and dependency relation: nsubj with Universitetet as head\n",
      "var with POS tag: VERB__Mood=Ind|Tense=Past|VerbForm=Fin and dependency relation: cop with Universitetet as head\n",
      "på with POS tag: ADP___ and dependency relation: case with Universitetet as head\n",
      "Universitetet with POS tag: PROPN___ and dependency relation: advcl with ga as head\n",
      ". with POS tag: PUNCT___ and dependency relation: punct with ga as head\n",
      "Pierre PER\n",
      "Jan Tore PER\n",
      "Universitetet ORG\n"
     ]
    }
   ],
   "source": [
    "doc = nlp(\"Pierre ga boken til Jan Tore mens de var på Universitetet.\")\n",
    "for tok in doc:\n",
    "    print(tok, \"with POS tag:\", tok.tag_, \"and dependency relation:\", tok.dep_, \"with\", tok.head, \"as head\")\n",
    "for ent in doc.ents:\n",
    "    print(ent, ent.label_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We run the tokenisation on the texts:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of tokenised subtitles: 0\n",
      "Number of tokenised subtitles: 100\n",
      "Number of tokenised subtitles: 200\n",
      "Number of tokenised subtitles: 300\n",
      "Number of tokenised subtitles: 400\n",
      "Number of tokenised subtitles: 500\n",
      "Number of tokenised subtitles: 600\n",
      "Number of tokenised subtitles: 700\n",
      "Number of tokenised subtitles: 800\n",
      "Number of tokenised subtitles: 900\n"
     ]
    }
   ],
   "source": [
    "nlp = spacy.load(\"nb_core_news_sm\", disable=[\"tagger\", \"parser\", \"ner\"])  # This is the standard Spacy model for Norwegian Bokmål\n",
    "for i, dialogue in enumerate(dialogues):\n",
    "    for j, utterance in enumerate(nlp.pipe(dialogue)):\n",
    "        dialogues[i][j] = [tok.lower_ for tok in utterance]\n",
    "    if i % 100 == 0:\n",
    "        print(\"Number of tokenised subtitles:\", i)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['for',\n",
       " 'å',\n",
       " 'forstå',\n",
       " 'hvordan',\n",
       " 'en',\n",
       " 'storby',\n",
       " 'virker',\n",
       " 'må',\n",
       " 'man',\n",
       " 'løfte',\n",
       " 'på',\n",
       " 'huden',\n",
       " 'og',\n",
       " 'blottlegge',\n",
       " 'den',\n",
       " 'skjulte',\n",
       " 'livsnerven',\n",
       " '.']"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dialogues[0][0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Building a simple neural network for NLP"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start with a simple toy example: we wish to predict will be a _clarification ellipsis_, such as \"Mark killed everyone.\" --> \"Mark?\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [],
   "source": [
    "import keras\n",
    "\n",
    "max_utterance_length = 32\n",
    "utterance_input = keras.layers.Input((max_utterance_length,), dtype=np.int32)\n",
    "\n",
    "vocab_size = 10000\n",
    "embedding = keras.layers.Embedding(input_dim=vocab_size, output_dim=100)\n",
    "utterance_word_embeddings = embedding(utterance_input)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A simple approach is then to perform a max pooling of all the embeddings, followed by a dense layer for the final prediction:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model: \"model_3\"\n",
      "_________________________________________________________________\n",
      "Layer (type)                 Output Shape              Param #   \n",
      "=================================================================\n",
      "input_2 (InputLayer)         (None, 32)                0         \n",
      "_________________________________________________________________\n",
      "embedding_2 (Embedding)      (None, 32, 100)           1000000   \n",
      "_________________________________________________________________\n",
      "global_max_pooling1d_2 (Glob (None, 100)               0         \n",
      "_________________________________________________________________\n",
      "dense_3 (Dense)              (None, 1)                 101       \n",
      "=================================================================\n",
      "Total params: 1,000,101\n",
      "Trainable params: 1,000,101\n",
      "Non-trainable params: 0\n",
      "_________________________________________________________________\n"
     ]
    }
   ],
   "source": [
    "pooling = keras.layers.GlobalMaxPooling1D()\n",
    "utterance_embedding = pooling(utterance_word_embeddings)\n",
    "\n",
    "output = keras.layers.Dense(1, activation=\"sigmoid\")\n",
    "prediction = output(utterance_embedding)\n",
    "\n",
    "model = keras.models.Model(utterance_input, prediction)\n",
    "model.compile(optimizer=\"adam\", loss=\"binary_crossentropy\", metrics=[\"accuracy\"])\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preparing the data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We build a vocabulary (based on the most common words):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [],
   "source": [
    "counts = {}\n",
    "for dialogue in dialogues:\n",
    "    for utterance in dialogue:\n",
    "        for tok in utterance:\n",
    "            counts[tok] = counts.get(tok, 0) + 1\n",
    "\n",
    "sorted_toks = sorted(counts.keys(), key=lambda x: counts[x], reverse=True)\n",
    "vocab_mapping = {tok:(i+2) for i, tok in enumerate(sorted_toks) if i < vocab_size-2}\n"
   ]
  },
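  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Tokens outside the vocabulary will later be mapped to the reserved index 1. A quick illustration (`qwertyuiop` is just a stand-in for an unseen token; exact indices depend on the corpus frequencies):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Vocabulary size (incl. the two reserved indices):\", len(vocab_mapping) + 2)\n",
    "print(\"Index for an unseen token:\", vocab_mapping.get(\"qwertyuiop\", 1))"
   ]
  },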
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can then map tokens to indices:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of indexed dialogues: 0\n",
      "Number of indexed dialogues: 100\n",
      "Number of indexed dialogues: 200\n",
      "Number of indexed dialogues: 300\n",
      "Number of indexed dialogues: 400\n",
      "Number of indexed dialogues: 500\n",
      "Number of indexed dialogues: 600\n",
      "Number of indexed dialogues: 700\n",
      "Number of indexed dialogues: 800\n",
      "Number of indexed dialogues: 900\n"
     ]
    }
   ],
   "source": [
    "input_data = []\n",
    "target_data = []\n",
    "for i, dialogue in enumerate(dialogues):\n",
    "    for j, utterance in enumerate(dialogue):\n",
    "        token_indices = [vocab_mapping.get(tok, 1) for tok in utterance]\n",
    "\n",
    "        input_data.append(token_indices)\n",
    "        \n",
    "        ce_next_utt = (j < len(dialogues[i])-1 and dialogues[i][j+1][-1]==\"?\" and \n",
    "                       set(dialogues[i][j+1][:-1]) <= set(dialogues[i][j]))\n",
    "  #      if ce_next_utt:\n",
    "  #          print(dialogues[i][j], dialogues[i][j+1])\n",
    "        target_data.append(ce_next_utt)\n",
    "    if i % 100 == 0:\n",
    "        print(\"Number of indexed dialogues:\", i)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5182/1162770 (0.45 %) utterances are followed by a clarification ellipsis\n"
     ]
    }
   ],
   "source": [
    "print(\"%i/%i (%.2f %%) utterances are followed by a clarification ellipsis\"%(sum(target_data), len(target_data), \n",
    "                                                                             100*sum(target_data)/len(target_data)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But the input data is not yet in the proper format for Keras: we need to \"pad\" the utterances to the maximum utterance length, in order to have a single X matrix as input."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_data2 = np.zeros((len(input_data), max_utterance_length), dtype=np.int32)\n",
    "for i, utterance in enumerate(input_data):\n",
    "    if len(utterance) <= max_utterance_length:\n",
    "        input_data2[i,:len(utterance)] = utterance\n",
    "    else:\n",
    "        input_data2[i,:] = utterance[:max_utterance_length]\n",
    "input_data = input_data2\n",
    "target_data = np.array(target_data, dtype=np.float32)"
   ]
  },
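  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a side note, Keras provides a utility that performs the same padding and truncation in one call; the sketch below should reproduce the matrix built manually above (`padding=\"post\"` and `truncating=\"post\"` mimic our loop):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from keras.preprocessing.sequence import pad_sequences\n",
    "\n",
    "# Zero-pad at the end (\"post\") and truncate long utterances at the end (\"post\"),\n",
    "# exactly as in the manual loop above\n",
    "padded = pad_sequences(input_data, maxlen=max_utterance_length,\n",
    "                       padding=\"post\", truncating=\"post\")\n",
    "assert (padded == input_data).all()"
   ]
  },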
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is an example of a data point + target:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[   1  815   39 1178   34  941   27   19  106   15  121  113    2    0\n",
      "    0    0    0    0    0    0    0    0    0    0    0    0    0    0\n",
      "    0    0    0    0] --> 0.0\n"
     ]
    }
   ],
   "source": [
    "print(input_data[100], \"-->\", target_data[100])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we need to split the data into a training, development and test set (we use 1000 utterances for development, 1000 for test, and the rest for training):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, y_train = input_data[:-2000], target_data[:-2000]\n",
    "X_dev, y_dev = input_data[-2000:-1000], target_data[-2000:-1000]\n",
    "X_test, y_test = input_data[-1000:], target_data[-1000:]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And we can now fit the model:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train on 1160770 samples, validate on 1000 samples\n",
      "Epoch 1/1\n",
      "1160770/1160770 [==============================] - 137s 118us/step - loss: 0.0288 - accuracy: 0.9955 - val_loss: 0.0587 - val_accuracy: 0.9900\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<keras.callbacks.callbacks.History at 0x7fc744226890>"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.fit(X_train, y_train, validation_data=(X_dev, y_dev))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, the model does not seem to improve upon a majority baseline in this case."
   ]
  },
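  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can verify this by computing the accuracy of the majority baseline (always predicting \"no clarification ellipsis\") on the development set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fraction of negative examples = accuracy of always predicting the majority class\n",
    "print(\"Majority baseline accuracy: %.4f\" % (1 - y_dev.mean()))"
   ]
  },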
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model with a recurrent layer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can try using a recurrent layer instead of a max pooling operation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model: \"model_4\"\n",
      "_________________________________________________________________\n",
      "Layer (type)                 Output Shape              Param #   \n",
      "=================================================================\n",
      "input_2 (InputLayer)         (None, 32)                0         \n",
      "_________________________________________________________________\n",
      "embedding_2 (Embedding)      (None, 32, 100)           1000000   \n",
      "_________________________________________________________________\n",
      "gru_2 (GRU)                  (None, 100)               60300     \n",
      "_________________________________________________________________\n",
      "dense_4 (Dense)              (None, 1)                 101       \n",
      "=================================================================\n",
      "Total params: 1,060,401\n",
      "Trainable params: 1,060,401\n",
      "Non-trainable params: 0\n",
      "_________________________________________________________________\n"
     ]
    }
   ],
   "source": [
    "gru = keras.layers.GRU(100)\n",
    "utterance_embedding = gru(utterance_word_embeddings)\n",
    "\n",
    "output = keras.layers.Dense(1, activation=\"sigmoid\")\n",
    "prediction = output(utterance_embedding)\n",
    "\n",
    "model = keras.models.Model(utterance_input, prediction)\n",
    "model.compile(optimizer=\"adam\", loss=\"binary_crossentropy\", metrics=[\"accuracy\"])\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/nr/samba/user/plison/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.\n",
      "  \"Converting sparse IndexedSlices to a dense Tensor of unknown shape. \"\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train on 1160770 samples, validate on 1000 samples\n",
      "Epoch 1/1\n",
      " 536128/1160770 [============>.................] - ETA: 12:51 - loss: 0.0275 - accuracy: 0.9956"
     ]
    }
   ],
   "source": [
    "model.fit(X_train, y_train, validation_data=(X_dev, y_dev))"
   ]
  }
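  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the training above has finished, we can evaluate the model on the held-out test set; a sketch of this final step:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Loss and accuracy on the held-out test set\n",
    "loss, accuracy = model.evaluate(X_test, y_test)\n",
    "print(\"Test loss: %.4f, test accuracy: %.4f\" % (loss, accuracy))"
   ]
  }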
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
