{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# An Inefficient Vector Space Model" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from collections import defaultdict\n", "from math import log, sqrt\n", "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset is the TIME dataset, available at http://ir.dcs.gla.ac.uk/resources/test_collections/time/" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def import_dataset():\n", "    \"\"\"\n", "    This function imports all the articles in the TIME corpus,\n", "    returning a list of lists where each sub-list contains all the\n", "    terms present in the document as strings.\n", "    \"\"\"\n", "    articles = []\n", "    with open('TIME.ALL', 'r') as f:\n", "        tmp = []\n", "        for row in f:\n", "            if row.startswith(\"*TEXT\"):\n", "                if tmp != []:\n", "                    articles.append(tmp)\n", "                tmp = []\n", "            else:\n", "                row = re.sub(r'[^a-zA-Z\\s]+', '', row)\n", "                tmp += row.split()\n", "    return articles" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def make_inverted_index(articles):\n", "    \"\"\"\n", "    This function builds an inverted index as a hash table (dictionary)\n", "    where the keys are the terms and the values are sets of\n", "    docIDs containing the term.\n", "    \"\"\"\n", "    index = defaultdict(set)\n", "    for docid, article in enumerate(articles):\n", "        for term in article:\n", "            index[term].add(docid)\n", "    return index" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def make_positional_index(articles):\n", "    \"\"\"\n", "    A more advanced version of make_inverted_index. 
Here each posting is\n", "    not only a document id, but a list of positions where the term is\n", "    contained in the article.\n", "    \"\"\"\n", "    index = defaultdict(dict)\n", "    for docid, article in enumerate(articles):\n", "        for pos, term in enumerate(article):\n", "            try:\n", "                index[term][docid].append(pos)\n", "            except KeyError:\n", "                index[term][docid] = [pos]\n", "    return index" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def documents_as_vectors(articles):\n", "    \"\"\"\n", "    Here we generate a list of dictionaries. Each element of the list\n", "    represents a document, and each document has an associated dict that\n", "    maps each term to its tf-idf weight. Since this function\n", "    creates a structure of size O(#documents * #terms), it can\n", "    be used only for small collections.\n", "    \"\"\"\n", "    p_index = make_positional_index(articles)\n", "    vectors = []\n", "    n = len(articles)\n", "    idf = {}\n", "    for term in p_index.keys():\n", "        idf[term] = log(n/len(p_index[term]))\n", "    for docid in range(0, len(articles)):\n", "        # We create a dictionary with a key for each dimension (term)\n", "        v = {}\n", "        for term in p_index.keys():\n", "            try:\n", "                tfidf = len(p_index[term][docid]) * idf[term]\n", "            except KeyError:\n", "                tfidf = 0\n", "            v[term] = tfidf\n", "        vectors.append(v)\n", "    return vectors" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def show_document_vector(v, docid):\n", "    \"\"\"\n", "    This function prints, for a document represented as a vector in v, all the\n", "    non-zero weights (both normalized and not) and the corresponding terms.\n", "    \"\"\"\n", "    non_zero_terms = [x for x in v[docid].keys() if v[docid][x] > 0]\n", "    vector = [(x, v[docid][x]) for x in non_zero_terms]\n", "    vector.sort(key=lambda x: x[1], reverse=True)\n", "    length = sqrt(sum([x[1]**2 for x in vector]))\n", "    normalized = {k: tfidf/length for k, tfidf in 
vector}\n", "    for (term, tfidf) in vector:\n", "        print(f\"{term}:\\t{tfidf}\\t(normalized: {normalized[term]})\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Example of usage\n", "articles = import_dataset()\n", "vectors = documents_as_vectors(articles)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'BERLIN ONE LAST RUN HANS WEIDNER HAD BEEN HOPING FOR MONTHS TO ESCAPE DRAB EAST GERMANY AND MAKE HIS WAY TO THE WEST THE ODDS WERE AGAINST HIM FOR WEIDNER WAS A CRIPPLE ON CRUTCHES WHO LIVED IN THE VILLAGE OF NEUGERSDORF MILES SOUTHEAST OF THE FRONTIER OF FREEDOM BUT HANS WEIDNER DID HAVE ONE MAJOR ASSET THE BUS THAT HE OPERATED FOR THE LOCAL COMMUNIST REGIME IT WAS AN UGLY THING AND ANCIENT ITS CHASSIS CREAKED AND THE ENGINE COUGHED A CREAMCOLORED COAT OF PAINT COULD NOT DISGUISE THE WELTS AND BRUISES OF TWO DECADES OF CHUGGING SERVICE IN FACT THE BUS WAS READY FOR THE JUNK PILE WHEN WEIDNER DECIDED TO PRESS IT INTO SERVICE FOR ONE LAST RUN SHARP BLADES THE HAZARDS WOULD BE GREAT ON THE JOURNEY TO THE BORDER SO WEIDNER SIGNED UP A FELLOW VILLAGER JURGEN WAGNER TO TAKE THE WHEEL EIGHT DAYS BEFORE CHRISTMAS THE PAIR BEGAN THE FEVERISH PREPARATIONS IN WEIDNERS GARAGE FIRST WEIDNER AND WAGNER ATTACHED A HEAVY SNOWPLOW TO THE FRONT OF THE BUS NOT TO PLOW SNOW BUT TO SCOOP AWAY THE HEAVY OBSTACLES THEY KNEW AWAITED THEM AT ROADBLOCKS AHEAD TO ALL SIX LUGS ON EACH FRONT WHEEL THEY BOLTED SHARP BLADES OF THE TOUGHEST STEEL AFFIXED SO THAT THE WHIRLING EDGES WOULD CHOP BARBED WIRE TO BITS THEN THEY WEDGED ONEQUARTERINCH SECTIONS OF STEEL PLATE INSIDE THE BUS TO STOP BULLETS AT LAST ALL WAS READY ON CHRISTMAS EVE WEIDNER AND WAGNER PILED THEIR WIVES AND FOUR CHILDREN ABOARD NOT FORGETTING THREE TONS OF HOUSEHOLD BELONGINGS FOR ADDED PROTECTION THE PLOTTERS SHOVELED A TON OF COAL AND POTATOES INTO THE BACK OF THE BUS THEN THEY CHUGGED OFF NORTH TOWARD BERLIN ALONG 
BACK ROADS TO ESCAPE COMMUNIST PATROLS JUST BEFORE THEY REACHED THE WALL THEY PLANNED TO SWING WEST IN ORDER TO ENTER THE EASTWEST AUTOBAHN LEADING TO THE US SECTOR OF THE CITY EN ROUTE THE RADIATOR FROZE IN THE SUBZERO WEATHER THAT FIXED THEY WERE ONLY A FEW MILES FARTHER WHEN A TIRE BLEW OUT THE KIDS WERE CRYING AND THE WIVES SHIVERING WITH COLD AND PANIC WHEN AT LAST THEY ARRIVED AT DREWITZ THE MOST HEAVILY GUARDED CHECKPOINT ON THE ENTIRE AUTOBAHN TO BERLIN IT WAS NO TIME TO STOP AND RECONSIDER FLYING POTATOES WAHAH WAHAH SHRIEKED THE POLICETYPE KLAXONS THAT WEIDNER HAD THOUGHTFULLY INSTALLED IN ADVANCE THE COMMUNIST GUARDS OBEDIENTLY RAISED THE FIRST OF THREE BARRIERS BUT WHAT WAS A BUS DOING ON EMERGENCY DUTY SUDDENLY THE SHOOTING BEGAN TOO LATE WAGNER AT MPH WAS ALREADY CRASHING THROUGH THE SECOND BARRIER YARDS AHEAD THEN THE THIRD ONLY YARDS AWAY ITS WINDSHIELD SMASHED ITS PASSENGERS SHAKEN ITS CARGO OF COAL AND POTATOES IN EVERY CORNER OF THE CAB THE OLD BUS FINALLY LURCHED TO A STOP A FEW MILES DOWN THE ROAD WHERE THE COMMUNISTS NO LONGER MATTERED AT THE US CHECKPOINT A FOOT OR TWO INSIDE WEST BERLIN'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\" \".join(articles[2])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WEIDNER:\t48.360042512288096\t(normalized: 0.534471813494657)\n", "BUS:\t25.52977028866349\t(normalized: 0.2821532388193592)\n", "WAGNER:\t24.180021256144048\t(normalized: 0.2672359067473285)\n", "POTATOES:\t13.306702204805735\t(normalized: 0.14706474373401815)\n", "WAHAH:\t12.090010628072024\t(normalized: 0.13361795337366425)\n", "BERLIN:\t11.815851442710784\t(normalized: 0.1305879651980127)\n", "BLADES:\t10.703716266952133\t(normalized: 0.1182967248814251)\n", "AUTOBAHN:\t9.892786050735804\t(normalized: 0.10933438074848466)\n", "HANS:\t8.871134803203823\t(normalized: 0.09804316248934543)\n", 
"WHEEL:\t8.871134803203823\t(normalized: 0.09804316248934543)\n", "CHECKPOINT:\t8.871134803203823\t(normalized: 0.09804316248934543)\n", "CHRISTMAS:\t7.931127544712352\t(normalized: 0.0876542678969468)\n", "STOP:\t7.833054328652597\t(normalized: 0.08657036956022819)\n", "YARDS:\t7.695561473399585\t(normalized: 0.08505080812330507)\n", "STEEL:\t6.811895968841506\t(normalized: 0.07528459866176831)\n", "SHARP:\t6.673910225867603\t(normalized: 0.07375958986416584)\n", "WIVES:\t6.544833183592461\t(normalized: 0.07233303940470764)\n", "COAL:\t6.423583939959592\t(normalized: 0.07099300122933076)\n", "ESCAPE:\t6.309267112279694\t(normalized: 0.06972957963106592)\n", "CRUTCHES:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "NEUGERSDORF:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "CREAKED:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "COUGHED:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "WELTS:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "CHUGGING:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "HAZARDS:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "VILLAGER:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "JURGEN:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "WEIDNERS:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "SNOWPLOW:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "SCOOP:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "LUGS:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "CHOP:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "ONEQUARTERINCH:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "FORGETTING:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "SHOVELED:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "RADIATOR:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "SUBZERO:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", 
"TIRE:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "DREWITZ:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "POLICETYPE:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "KLAXONS:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "THOUGHTFULLY:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "CAB:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "LURCHED:\t6.045005314036012\t(normalized: 0.06680897668683212)\n", "MILES:\t5.432696428316257\t(normalized: 0.06004178163140234)\n", "PLOW:\t5.351858133476067\t(normalized: 0.05914836244071255)\n", "CRIPPLE:\t5.351858133476067\t(normalized: 0.05914836244071255)\n", "CHASSIS:\t5.351858133476067\t(normalized: 0.05914836244071255)\n", "CREAMCOLORED:\t5.351858133476067\t(normalized: 0.05914836244071255)\n", "JUNK:\t5.351858133476067\t(normalized: 0.05914836244071255)\n", "FEVERISH:\t5.351858133476067\t(normalized: 0.05914836244071255)\n", "PREPARATIONS:\t5.351858133476067\t(normalized: 0.05914836244071255)\n", "BOLTED:\t5.351858133476067\t(normalized: 0.05914836244071255)\n", "TOUGHEST:\t5.351858133476067\t(normalized: 0.05914836244071255)\n", "AFFIXED:\t5.351858133476067\t(normalized: 0.05914836244071255)\n", "WHIRLING:\t5.351858133476067\t(normalized: 0.05914836244071255)\n", "EDGES:\t5.351858133476067\t(normalized: 0.05914836244071255)\n", "BITS:\t5.351858133476067\t(normalized: 0.05914836244071255)\n", "SECTIONS:\t5.351858133476067\t(normalized: 0.05914836244071255)\n", "PLATE:\t5.351858133476067\t(normalized: 0.05914836244071255)\n", "BELONGINGS:\t5.351858133476067\t(normalized: 0.05914836244071255)\n", "KIDS:\t5.351858133476067\t(normalized: 0.05914836244071255)\n", "SHIVERING:\t5.351858133476067\t(normalized: 0.05914836244071255)\n", "OBEDIENTLY:\t5.351858133476067\t(normalized: 0.05914836244071255)\n", "WINDSHIELD:\t5.351858133476067\t(normalized: 0.05914836244071255)\n", "AHEAD:\t5.287615864747713\t(normalized: 0.05843836137192669)\n", 
"DRAB:\t4.946393025367902\t(normalized: 0.05466719037424233)\n", "ASSET:\t4.946393025367902\t(normalized: 0.05466719037424233)\n", "PAINT:\t4.946393025367902\t(normalized: 0.05466719037424233)\n", "DISGUISE:\t4.946393025367902\t(normalized: 0.05466719037424233)\n", "BRUISES:\t4.946393025367902\t(normalized: 0.05466719037424233)\n", "WEDGED:\t4.946393025367902\t(normalized: 0.05466719037424233)\n", "CHUGGED:\t4.946393025367902\t(normalized: 0.05466719037424233)\n", "FROZE:\t4.946393025367902\t(normalized: 0.05466719037424233)\n", "MPH:\t4.946393025367902\t(normalized: 0.05466719037424233)\n", "CRASHING:\t4.946393025367902\t(normalized: 0.05466719037424233)\n", "MATTERED:\t4.946393025367902\t(normalized: 0.05466719037424233)\n", "SERVICE:\t4.922972751159803\t(normalized: 0.05440835113882677)\n", "READY:\t4.922972751159803\t(normalized: 0.05440835113882677)\n", "HEAVY:\t4.814838308619252\t(normalized: 0.05321325682948663)\n", "ENGINE:\t4.658710952916121\t(normalized: 0.051487748194592974)\n", "PILE:\t4.658710952916121\t(normalized: 0.051487748194592974)\n", "EN:\t4.658710952916121\t(normalized: 0.051487748194592974)\n", "CRYING:\t4.658710952916121\t(normalized: 0.051487748194592974)\n", "PANIC:\t4.658710952916121\t(normalized: 0.051487748194592974)\n", "FRONT:\t4.614671391505287\t(normalized: 0.05100102603658873)\n", "AWAITED:\t4.4355674016019115\t(normalized: 0.049021581244672714)\n", "ROADBLOCKS:\t4.4355674016019115\t(normalized: 0.049021581244672714)\n", "PILED:\t4.4355674016019115\t(normalized: 0.049021581244672714)\n", "HOUSEHOLD:\t4.4355674016019115\t(normalized: 0.049021581244672714)\n", "PLOTTERS:\t4.4355674016019115\t(normalized: 0.049021581244672714)\n", "PATROLS:\t4.4355674016019115\t(normalized: 0.049021581244672714)\n", "SECTOR:\t4.4355674016019115\t(normalized: 0.049021581244672714)\n", "RECONSIDER:\t4.4355674016019115\t(normalized: 0.049021581244672714)\n", "SHRIEKED:\t4.4355674016019115\t(normalized: 0.049021581244672714)\n", 
"SHAKEN:\t4.4355674016019115\t(normalized: 0.049021581244672714)\n", "INSIDE:\t4.432727835093834\t(normalized: 0.04899019855387534)\n", "RUN:\t4.265964617215731\t(normalized: 0.047147143112787455)\n", "OPERATED:\t4.253245844807957\t(normalized: 0.04700657612812275)\n", "COAT:\t4.253245844807957\t(normalized: 0.04700657612812275)\n", "GARAGE:\t4.253245844807957\t(normalized: 0.04700657612812275)\n", "ATTACHED:\t4.253245844807957\t(normalized: 0.04700657612812275)\n", "BARBED:\t4.253245844807957\t(normalized: 0.04700657612812275)\n", "BLEW:\t4.253245844807957\t(normalized: 0.04700657612812275)\n", "BARRIERS:\t4.253245844807957\t(normalized: 0.04700657612812275)\n", "BARRIER:\t4.253245844807957\t(normalized: 0.04700657612812275)\n", "CARGO:\t4.253245844807957\t(normalized: 0.04700657612812275)\n", "DECADES:\t4.099095164980699\t(normalized: 0.04530291357700374)\n", "OBSTACLES:\t4.099095164980699\t(normalized: 0.04530291357700374)\n", "WIRE:\t4.099095164980699\t(normalized: 0.04530291357700374)\n", "PROTECTION:\t4.099095164980699\t(normalized: 0.04530291357700374)\n", "FIXED:\t4.099095164980699\t(normalized: 0.04530291357700374)\n", "JOURNEY:\t3.965563772356176\t(normalized: 0.0438271339484734)\n", "TON:\t3.965563772356176\t(normalized: 0.0438271339484734)\n", "SWING:\t3.8477807366997925\t(normalized: 0.04252540406165253)\n", "SMASHED:\t3.8477807366997925\t(normalized: 0.04252540406165253)\n", "FARTHER:\t3.742420221041966\t(normalized: 0.04136096699855314)\n", "ODDS:\t3.742420221041966\t(normalized: 0.04136096699855314)\n", "SNOW:\t3.742420221041966\t(normalized: 0.04136096699855314)\n", "THEY:\t3.6620532450860988\t(normalized: 0.040472756791266835)\n", "PAIR:\t3.6471100412376414\t(normalized: 0.040307605545622745)\n", "EASTWEST:\t3.5600986642480112\t(normalized: 0.03934596188200318)\n", "GUARDED:\t3.5600986642480112\t(normalized: 0.03934596188200318)\n", "PASSENGERS:\t3.5600986642480112\t(normalized: 0.03934596188200318)\n", "WEST:\t3.4866101743489226\t(normalized: 
0.038533772222381406)\n", "HOPING:\t3.480055956574475\t(normalized: 0.03846133546513377)\n", "ENTER:\t3.480055956574475\t(normalized: 0.03846133546513377)\n", "INSTALLED:\t3.480055956574475\t(normalized: 0.03846133546513377)\n", "GUARDS:\t3.480055956574475\t(normalized: 0.03846133546513377)\n", "CORNER:\t3.480055956574475\t(normalized: 0.03846133546513377)\n", "BULLETS:\t3.405947984420753\t(normalized: 0.03764229933088416)\n", "AWAY:\t3.402399784364656\t(normalized: 0.03760308487158871)\n", "EVE:\t3.3369551129338015\t(normalized: 0.03687979493208292)\n", "ABOARD:\t3.3369551129338015\t(normalized: 0.03687979493208292)\n", "ADVANCE:\t3.3369551129338015\t(normalized: 0.03687979493208292)\n", "TONS:\t3.2724165917962305\t(normalized: 0.03616651970235382)\n", "EMERGENCY:\t3.2724165917962305\t(normalized: 0.03616651970235382)\n", "SHOOTING:\t3.2724165917962305\t(normalized: 0.03616651970235382)\n", "WEATHER:\t3.211791969979796\t(normalized: 0.03549650061466538)\n", "COMMUNIST:\t3.1637181817718267\t(normalized: 0.03496519246375666)\n", "ROUTE:\t3.154633556139847\t(normalized: 0.03486478981553296)\n", "FOOT:\t3.154633556139847\t(normalized: 0.03486478981553296)\n", "WALL:\t3.1005663348695713\t(normalized: 0.034267242660862895)\n", "FEW:\t3.0903912874114936\t(normalized: 0.0341547887467471)\n", "ROADS:\t3.0492730404820207\t(normalized: 0.03370035275243356)\n", "FLYING:\t3.0492730404820207\t(normalized: 0.03370035275243356)\n", "UGLY:\t3.0004828763125886\t(normalized: 0.033161127264413934)\n", "REACHED:\t3.0004828763125886\t(normalized: 0.033161127264413934)\n", "THEN:\t2.9662785260631113\t(normalized: 0.032783103173500465)\n", "SOUTHEAST:\t2.953962860677696\t(normalized: 0.03264699129950317)\n", "FELLOW:\t2.953962860677696\t(normalized: 0.03264699129950317)\n", "HEAVILY:\t2.953962860677696\t(normalized: 0.03264699129950317)\n", "DUTY:\t2.953962860677696\t(normalized: 0.03264699129950317)\n", "LIVED:\t2.8669514836880663\t(normalized: 0.031685347635883605)\n", 
"DOING:\t2.8669514836880663\t(normalized: 0.031685347635883605)\n", "BEGAN:\t2.840065001503482\t(normalized: 0.031388200111912026)\n", "FRONTIER:\t2.826129489167811\t(normalized: 0.031234185802513303)\n", "VILLAGE:\t2.826129489167811\t(normalized: 0.031234185802513303)\n", "PLANNED:\t2.826129489167811\t(normalized: 0.031234185802513303)\n", "ANCIENT:\t2.749168448031683\t(normalized: 0.03038361774906274)\n", "SIGNED:\t2.712800803860808\t(normalized: 0.029981685084764585)\n", "KNEW:\t2.6438079323738566\t(normalized: 0.029219180685963346)\n", "COLD:\t2.6110181095508658\t(normalized: 0.028856789853409396)\n", "ENTIRE:\t2.6110181095508658\t(normalized: 0.028856789853409396)\n", "EIGHT:\t2.579269411236285\t(normalized: 0.02850590545623425)\n", "RAISED:\t2.579269411236285\t(normalized: 0.02850590545623425)\n", "LEADING:\t2.5186447894198505\t(normalized: 0.02783588636854581)\n", "ARRIVED:\t2.489657252546598\t(normalized: 0.027515518134844322)\n", "THING:\t2.4614863755799017\t(normalized: 0.027204175569413385)\n", "ROAD:\t2.4614863755799017\t(normalized: 0.027204175569413385)\n", "FREEDOM:\t2.3561258599220753\t(normalized: 0.026039738506313987)\n", "LATE:\t2.3561258599220753\t(normalized: 0.026039738506313987)\n", "LONGER:\t2.3561258599220753\t(normalized: 0.026039738506313987)\n", "DECIDED:\t2.3073356957526436\t(normalized: 0.025500513018294365)\n", "CHILDREN:\t2.3073356957526436\t(normalized: 0.025500513018294365)\n", "SECOND:\t2.283805198342449\t(normalized: 0.02524045560374469)\n", "THIRD:\t2.2608156801177506\t(normalized: 0.024986377053383597)\n", "SUDDENLY:\t2.216363917546917\t(normalized: 0.02449509927693767)\n", "BORDER:\t2.153185015925385\t(normalized: 0.023796850467175345)\n", "LOCAL:\t2.0937615954545845\t(normalized: 0.023140106972894623)\n", "BEFORE:\t2.0821180161811053\t(normalized: 0.02301142294768344)\n", "ADDED:\t2.07471340048389\t(normalized: 0.022929587651965573)\n", "BACK:\t2.055450954442175\t(normalized: 0.022716700443157965)\n", 
"ORDER:\t2.0196536233008624\t(normalized: 0.02232107083864501)\n", "ITS:\t1.9527090093458963\t(normalized: 0.021581203639084154)\n", "MAJOR:\t1.9506607518139112\t(normalized: 0.02155856643984377)\n", "NORTH:\t1.9341314498627005\t(normalized: 0.021375885748704603)\n", "CITY:\t1.9178709289909204\t(normalized: 0.021196175607289824)\n", "GERMANY:\t1.8861222306763399\t(normalized: 0.020845291210114678)\n", "TOWARD:\t1.8706180441403748\t(normalized: 0.02067394002297436)\n", "SIX:\t1.825497608859905\t(normalized: 0.020175272122426233)\n", "REGIME:\t1.8108988094387524\t(normalized: 0.020013927210467446)\n", "PRESS:\t1.7965100719866527\t(normalized: 0.019854903888724747)\n", "SO:\t1.7152390163905136\t(normalized: 0.018956701856374252)\n", "THREE:\t1.7152390163905136\t(normalized: 0.018956701856374252)\n", "ALONG:\t1.6882964873464201\t(normalized: 0.018658934906424403)\n", "FACT:\t1.6882964873464201\t(normalized: 0.018658934906424403)\n", "ALREADY:\t1.6755574615689903\t(normalized: 0.0185181440829323)\n", "EAST:\t1.6629786793621302\t(normalized: 0.018379124260194415)\n", "COMMUNISTS:\t1.6629786793621302\t(normalized: 0.018379124260194415)\n", "FIRST:\t1.6601391128540526\t(normalized: 0.018347741569397044)\n", "EACH:\t1.6505561593635731\t(normalized: 0.018241831436472945)\n", "FINALLY:\t1.6505561593635731\t(normalized: 0.018241831436472945)\n", "GREAT:\t1.556368944303872\t(normalized: 0.01720088091149815)\n", "FOUR:\t1.556368944303872\t(normalized: 0.01720088091149815)\n", "US:\t1.5134765666829533\t(normalized: 0.016726837348647165)\n", "TAKE:\t1.470294335532629\t(normalized: 0.016249590344825884)\n", "EVERY:\t1.4600378353654397\t(normalized: 0.01613623622105577)\n", "TOO:\t1.449885463901422\t(normalized: 0.016024032920443156)\n", "WHEN:\t1.3620550005754655\t(normalized: 0.015053336771820625)\n", "MAKE:\t1.3536574318068681\t(normalized: 0.01496052742808281)\n", "WAY:\t1.3354751127236777\t(normalized: 0.014759577707009633)\n", "DAYS:\t1.291415122929647\t(normalized: 
0.01427262977594055)\n", "DID:\t1.2658818209244824\t(normalized: 0.013990437504836993)\n", "JUST:\t1.2328209586635943\t(normalized: 0.013625051163338554)\n", "WOULD:\t1.231319370163142\t(normalized: 0.013608455712067898)\n", "TWO:\t1.213852010225633\t(normalized: 0.013415407669556517)\n", "OLD:\t1.2087234070845339\t(normalized: 0.01335872670570457)\n", "MONTHS:\t1.2087234070845339\t(normalized: 0.01335872670570457)\n", "NO:\t1.2051752070284367\t(normalized: 0.013319512246409126)\n", "INTO:\t1.145469280729074\t(normalized: 0.012659646517434835)\n", "WHERE:\t1.0545727272572756\t(normalized: 0.01165506415458555)\n", "ONE:\t1.0538195256999363\t(normalized: 0.011646739823560081)\n", "DOWN:\t1.0410590080905526\t(normalized: 0.01150571147384172)\n", "THROUGH:\t1.0277254772210875\t(normalized: 0.011358350221578982)\n", "WHAT:\t0.982410281009045\t(normalized: 0.010857529836812726)\n", "ONLY:\t0.9457025637164944\t(normalized: 0.010451838708116685)\n", "AT:\t0.9367641342187849\t(normalized: 0.010353051809361974)\n", "AGAINST:\t0.9330175256794686\t(normalized: 0.010311644553362267)\n", "HIM:\t0.8802193401124978\t(normalized: 0.009728122692684913)\n", "ALL:\t0.878406495480029\t(normalized: 0.00970808726037417)\n", "WERE:\t0.867789301347299\t(normalized: 0.009590746772079457)\n", "THEM:\t0.8576195081952568\t(normalized: 0.009478350928187126)\n", "COULD:\t0.8465082827701861\t(normalized: 0.009355550440541226)\n", "OFF:\t0.8355191611945908\t(normalized: 0.009234099436114807)\n", "NOT:\t0.8206925727179877\t(normalized: 0.009070237015418724)\n", "TIME:\t0.7367376166348071\t(normalized: 0.008142372702267217)\n", "OR:\t0.7268853201917955\t(normalized: 0.00803348580983644)\n", "MOST:\t0.7122865207706428\t(normalized: 0.007872140897877654)\n", "WAS:\t0.6787922739875605\t(normalized: 0.007501964820897514)\n", "IT:\t0.6393685102574851\t(normalized: 0.007066256136600054)\n", "HAD:\t0.5785262008981994\t(normalized: 0.006393831181386304)\n", "ON:\t0.5661765686897093\t(normalized: 
0.006257343908431849)\n", "BEEN:\t0.4728512818582472\t(normalized: 0.005225919354058342)\n", "BUT:\t0.4435543391978137\t(normalized: 0.004902131589198703)\n", "OUT:\t0.4173842003453748\t(normalized: 0.004612901041721085)\n", "UP:\t0.40309824309789827\t(normalized: 0.004455013639624079)\n", "LAST:\t0.3878812994214631\t(normalized: 0.004286837040512865)\n", "HAVE:\t0.3682515117677301\t(normalized: 0.004069889997855906)\n", "WHO:\t0.3412228393798108\t(normalized: 0.0037711709976843595)\n", "THEIR:\t0.31166403713826635\t(normalized: 0.0034444891790165223)\n", "BE:\t0.28610354015873113\t(normalized: 0.003161996350954893)\n", "AN:\t0.26426179824368246\t(normalized: 0.0029206029442337953)\n", "THAT:\t0.23421323500629654\t(normalized: 0.002588508321233542)\n", "HE:\t0.2308747822109453\t(normalized: 0.0025516119740199455)\n", "HIS:\t0.17853725710271504\t(normalized: 0.0019731813005706787)\n", "FOR:\t0.1003607628318815\t(normalized: 0.0011091801439345956)\n", "AND:\t0.08561272936648934\t(normalized: 0.0009461859077379432)\n", "WITH:\t0.06865440473807798\t(normalized: 0.0007587636879234121)\n", "TO:\t0.04270464636534712\t(normalized: 0.0004719687701212316)\n", "IN:\t0.01660736247541277\t(normalized: 0.0001835434106027012)\n" ] } ], "source": [ "show_document_vector(vectors, 2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3.11.0 ('ir')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" }, "vscode": { "interpreter": { "hash": "139e7c84632f54486abb9d698f2a5412a324e85ce1b1331ea63d3255168fb27f" } } }, "nbformat": 4, "nbformat_minor": 4 }