{ "cells": [ { "cell_type": "markdown", "id": "44056652", "metadata": {}, "source": [ "# Week V - Vectorization, profiling and non-uniform random distributions" ] }, { "cell_type": "markdown", "id": "49ca7c09", "metadata": {}, "source": [ "## Vectorization\n", "\n", "#### Definition \n", "\n", "* In the context of Python programming, the term **vectorization** is used in a broad sense and typically referes to the practice of avoiding for loops in favour of vector products and linear algebra operations, often through calls to dedicated numerical libraries.\n", "* More strictly speaking, **vectorization** is the redesign of an algorithm to take advantage of the vector processing capability of the hardware.\n", "\n", "#### Faster code through vectorization\n", "\n", "In practice, avoiding for loops will make your Python code faster because:\n", "* you remove/lower the overhead of the loop due to the interpreter\n", "* you use dedicated libraries which are very fast, owing to:\n", " * efficient algorithms\n", " * efficient implementations of those algorithms \n", " * written in a low-level language and compiled (implementations are written in one or more of Fortran, C, assembly, etc.). \n", " * exploit vector processing capabilities such as SIMD (Single Instructions Multiple Data)\n", " * might be optimized for specific hardware\n", "(different CPUs architectures, GPUs, ...)\n", "\n", "For instance, NumPy wraps BLAS and LAPACK.\n", "\n", "Note that:\n", "* Vectorization also applies very much to compiled languages, but it becomes often unavoidable for interpreted languages.\n", "* Some degree of auto-vectorization can be introduced by the compiler with appropriate flags (check compiler documentations, e.g. -O3 flag), but often effective vectorization requires substantial expertise in algorithms and low-level languages -> *use existing and established libraries as much as possible*.\n", "\n", "#### Vectorization and SIMD\n", "The concept of Single Instruction Multiple Data (SIMD) became popular in the 70s, with the first vector supercomputers. \\\n", "*Writing a computer program so that the compiler can generate effective SIMD vector instructions is called vectorization* (Hager & Wellein, Introduction to HPC for Scientist and Engineers, CRC Press, 2011).\n", "\n", "SIMD allows the concurrent execution of arithmetic operations on a wide register that can hold, e.g., two DP or four SP floating-point words (Hager & Wellein, 2011). For instance, a single instruction can initiate four additions of single-precision floating-point values, each at once.\n", "\n", "For example, Intel's AVX instructions are based on up to 512-bit wide SIMD, that is, eight floating point numbers can be processed simultaneously. \n", "\n", "> Modern supercomputers are *not* SIMD-based (instead, thery are MIMD), but most modern CPUs includes strong SIMD components. General purpose GPUs are essentially SIMD-based.\n", "\n", "> **Processing instructions is actually expensive compared to a floating point operation**. *The use of SIMD is mostly motivated by power considerations. 
{ "cell_type": "markdown", "id": "695aef0a", "metadata": {}, "source": [ "----------------------------------------" ] },
{ "cell_type": "markdown", "id": "d250a8ee", "metadata": {}, "source": [ "## Profiling and timing code\n", "\n",
"In the process of developing code, there are typically several possible implementations to get the job done.\n",
"Typically the right choice is a trade-off between development time/effort, computational efficiency and code readability. As [Donald Knuth](https://en.wikipedia.org/wiki/Donald_Knuth) famously put it:\n", "\n",
"> *Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered: **premature optimization is the root of all evil***.\n", "\n",
"An effective approach is to\n",
" 1. first design the code\n",
" 2. then implement the code according to the design\n",
" 3. profile/benchmark the code to understand\n",
"    * where the program spends most of its time\n",
"    * which parts can be optimised\n", " \n",
">*In software engineering, it is very typical that a program spends most of its time in only a small part of the source code, as exemplified by the common 90/10 rule: **90% of time is spent in 10% of the source code***. \n", "\n",
"The two main ways to analyze performance are the application's own timers, which measure the time spent in specific regions of the program, and dedicated performance analysis software. \n", "\n",
"> Profiling is concerned with where the time is spent in the code. Often memory, not just compute time, is key.\n", "\n",
"IPython provides very useful *profiling* and *timing* tools:\n", "\n",
"* `%time`, `%timeit` : timing single statements (see last week's lecture)\n",
"* `%prun` : run the code with the profiler\n",
"* `%lprun` : line-by-line profiler\n",
"* `%memit` : memory usage of a single statement\n",
"* `%mprun` : line-by-line memory profiler\n", "\n",
"> For timing, you can also insert multiple time checks inside the code with the module ``time``, for instance:" ] },
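{ "cell_type": "code", "execution_count": null, "id": "c3d5e7f9", "metadata": {}, "outputs": [], "source": [
"# A minimal sketch of manual timing: bracket the regions of interest\n",
"# with time.perf_counter() calls and print the elapsed intervals.\n",
"import time\n",
"import numpy as np\n", "\n",
"t0 = time.perf_counter()\n",
"a = np.random.rand(1000000)   # region 1: generate the data\n",
"t1 = time.perf_counter()\n",
"s = np.sum(a**2)              # region 2: compute\n",
"t2 = time.perf_counter()\n", "\n",
"print('generation :', t1 - t0, 's')\n",
"print('computation:', t2 - t1, 's')" ] },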
{ "cell_type": "markdown", "id": "16da2ad2", "metadata": {}, "source": [ "Let's practice with last week's exercise on correlations." ] },
{ "cell_type": "code", "execution_count": null, "id": "fb88156a", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n",
"def corr_for_loop(N,k_list=np.arange(100)):\n",
"    '''\n",
"    Calculates the correlation of random numbers\n",
"    with two nested loops\n",
"    '''\n",
"    np.random.seed(42)\n",
"    a = np.random.rand(N)\n",
"    c_list = []\n",
"    for k in k_list:\n",
"        s = 0.0  # reset the accumulator for each lag k\n",
"        for i in range(len(a)-k):\n",
"            s = s + a[i]*a[i+k]\n",
"        s = s/float(len(a)-k)\n",
"        c_list.append(s)\n",
"    return k_list,np.array(c_list)\n", "\n",
"def corr(N,k_list=np.arange(100)):\n",
"    '''\n",
"    Calculates the correlation of random numbers\n",
"    with some vectorization\n",
"    '''\n",
"    np.random.seed(42)\n",
"    a = np.random.rand(N)\n",
"    def corr_k(k):\n",
"        if k==0:\n",
"            res = np.average(a**2)\n",
"        else:\n",
"            res = np.average(a[:-k]*a[k:])\n",
"        return res\n", "\n",
"    return k_list,np.array([corr_k(k) for k in k_list])" ] },
{ "cell_type": "code", "execution_count": null, "id": "25f623a9", "metadata": {}, "outputs": [], "source": [ "%timeit corr_for_loop(1000)\n", "%timeit corr(1000)" ] },
{ "cell_type": "code", "execution_count": null, "id": "ef6f9b75", "metadata": {}, "outputs": [], "source": [ "# Load the profiling extensions (install them first with: pip3 install line_profiler memory_profiler)\n", "%load_ext line_profiler\n", "%load_ext memory_profiler" ] },
{ "cell_type": "code", "execution_count": null, "id": "169f99f3", "metadata": {}, "outputs": [], "source": [ "# Memory usage\n", "print('loop-based')\n", "%memit corr_for_loop(1000000)" ] },
{ "cell_type": "code", "execution_count": null, "id": "227a9c33", "metadata": {}, "outputs": [], "source": [ "print('vectorized')\n", "%memit corr(1000000)" ] },
{ "cell_type": "code", "execution_count": null, "id": "ad39156e", "metadata": {}, "outputs": [], "source": [ "# Inspect the source of the %memit magic\n", "%memit??" ] },
{ "cell_type": "code", "execution_count": null, "id": "b9eb2b84", "metadata": {}, "outputs": [], "source": [ "# Line-by-line profiler\n", "%lprun -f corr_for_loop corr_for_loop(1000)" ] },
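{ "cell_type": "markdown", "id": "d4e6f8a0", "metadata": {}, "source": [ "The function-level profiler `%prun` listed above can be used in the same way; a minimal sketch on the same function:" ] },
{ "cell_type": "code", "execution_count": null, "id": "e5f7a9b1", "metadata": {}, "outputs": [], "source": [ "# Function-level profiler: reports how much time is spent in each function call\n", "%prun corr_for_loop(1000)" ] },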
{ "cell_type": "markdown", "id": "ea80d5ef", "metadata": {}, "source": [ "----------------------------------------" ] },
{ "cell_type": "markdown", "id": "9b494a8e", "metadata": {}, "source": [ "## Exercise - Random numbers with non-uniform distributions\n", "\n",
"While for simple and common distributions we can use the random generators built into NumPy, there are non-uniform distributions that need to be constructed by hand. In this exercise, we will sample the following distribution between -1 and +1:\n", "\n",
"$$p(x) = \\frac{1}{\\pi\\sqrt{1-x^2}}.$$\n", "\n",
"We will use the inverse transformation method: we generate a random number $U$ with uniform distribution between 0 and 1, then we transform it according to\n", "\n",
"$$x = \\cos{(\\pi U)}.$$\n", "\n",
"This works because the cumulative distribution function of $p(x)$ is $F(x) = 1 - \\arccos(x)/\\pi$, so $x = F^{-1}(U) = \\cos{(\\pi(1-U))}$; and since $1-U$ is itself uniform between 0 and 1, we can use $\\cos{(\\pi U)}$ directly.\n" ] },
{ "cell_type": "code", "execution_count": null, "id": "a6e91457", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n",
"def nunf_dist(N):\n",
"    '''\n",
"    Generate a list of N random variables\n",
"    sampling the distribution\n",
"    1.0/(np.pi*np.sqrt(1.0-x**2))\n",
"    '''\n",
"    # a fully vectorized alternative would be: np.cos(np.pi*np.random.rand(N))\n",
"    return list(map(lambda x: np.cos(np.pi*x),np.random.rand(N)))" ] },
{ "cell_type": "code", "execution_count": null, "id": "df575bd0", "metadata": {}, "outputs": [], "source": [ "# Let's sample!\n", "import time\n", "t0 = time.process_time()\n", "res = nunf_dist(int(1e5))\n", "t1 = time.process_time()\n", "print(\"Time spent\", t1 - t0)" ] },
{ "cell_type": "code", "execution_count": null, "id": "1aa3e26c", "metadata": {}, "outputs": [], "source": [ "# Now we plot the results and compare them with the target distribution\n",
"import matplotlib.pyplot as plt\n",
"n_bins = 100\n",
"# the weights normalize the histogram to a density (bin width = 2/n_bins)\n",
"plt.hist(res,weights = [n_bins/(2*len(res))]*len(res),bins=n_bins,edgecolor='black',label='random numbers')\n",
"x = np.linspace(-0.99,0.99,1000)\n",
"y = 1.0/(np.sqrt(1.0-x**2)*np.pi)  # vectorized evaluation of the target density\n",
"plt.xlabel('x')\n",
"plt.ylabel('distribution')\n",
"plt.plot(x,y,'-',lw=3,label=r'$\\frac{1}{\\pi\\sqrt{1-x^2}}$')\n",
"plt.legend()\n",
"plt.show()" ] },
{ "cell_type": "code", "execution_count": null, "id": "7f34f1bd", "metadata": {}, "outputs": [], "source": [ "# Save data to file\n", "np.savetxt('./rand_numbers',res)" ] },
{ "cell_type": "code", "execution_count": null, "id": "c99cebaf", "metadata": {}, "outputs": [], "source": [ "# Load data from file\n", "data = np.loadtxt('./rand_numbers')" ] },
{ "cell_type": "code", "execution_count": null, "id": "5ff6cfad", "metadata": {}, "outputs": [], "source": [ "data" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.1" } }, "nbformat": 4, "nbformat_minor": 5 }