Last active
December 20, 2015 14:09
-
-
Save mehitabel/6144403 to your computer and use it in GitHub Desktop.
probabilistic counting basics
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| { | |
| "metadata": { | |
| "name": "Intro to Probabilistic Counting" | |
| }, | |
| "nbformat": 3, | |
| "nbformat_minor": 0, | |
| "worksheets": [ | |
| { | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": "Here's a quick implementation of probabilistic counting. The current implementation counts exactly to $256$ and then increments in powers of $2$ with very low probability. The counter will increment roughly $\\log N + 256$ times. Since each counter is extremely inaccurate, it's valuable to average over a few counters. This implementation is terrible but gives the idea, we can go to the literature and to several active open source projects that do this well. google \"hyper log log counting\"" | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": "import numpy as np\n\nclass SimpleProbabilisticCounter(object):\n \"\"\" produces exact counts up to threshold, probabilistic after. \"\"\" \n def __init__(self, thresh=8):\n self.count = 0\n self.prob = 1.0\n self.thresh = thresh\n \n def update_prob(self):\n \"\"\" counters become probabilistic at high load. \"\"\"\n if self.count < 2**self.thresh:\n self.prob = 1.0\n else:\n self.prob = 1.0 / self.count\n \n def tick(self): \n \"\"\" tick and update if necessary \"\"\"\n if self.count < 2**self.thresh:\n self.count += 1\n elif np.random.random() < self.prob:\n self.count *= 2\n self.update_prob()\n \n def multitick(self, n):\n \"\"\" tick n times \"\"\"\n for i in xrange(n):\n self.tick()\n \n ", | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [], | |
| "prompt_number": 37 | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": "Let's have a look at how to use one of these things." | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": "counters = [SimpleProbabilisticCounter() for i in xrange(8)]\nfor i in xrange(1000):\n for c in counters:\n c.tick()\n \n2**np.mean([np.log2(c.count) for c in counters])", | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "output_type": "pyout", | |
| "prompt_number": 55, | |
| "text": "861.07792921980365" | |
| } | |
| ], | |
| "prompt_number": 55 | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": "Formalizing this approach to testing, let's try counting to $1000$, we'll do it $100$ times, averaging over eight counters each time." | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": "def test_this_approach(trials=100):\n results = []\n for trial_num in xrange(trials):\n counters = [SimpleProbabilisticCounter() for i in xrange(8)]\n for c in counters:\n c.multitick(1000)\n results.append(2**np.mean([np.log2(c.count) for c in counters]))\n return results", | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [], | |
| "prompt_number": 56 | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": "import pandas as pd\n\ndata = test_this_approach()\nhist(data)", | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "output_type": "pyout", | |
| "prompt_number": 57, | |
| "text": "(array([ 8, 21, 17, 24, 12, 8, 3, 2, 1, 4]),\n array([ 558.33995912, 647.321432 , 736.30290487, 825.28437775,\n 914.26585062, 1003.2473235 , 1092.22879637, 1181.21026925,\n 1270.19174212, 1359.173215 , 1448.15468787]),\n <a list of 10 Patch objects>)" | |
| }, | |
| { | |
| "output_type": "display_data", | |
| "png": "iVBORw0KGgoAAAANSUhEUgAAAWwAAAD9CAYAAACY0k3rAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAEbtJREFUeJzt3VtsVFX/xvFnQ8tLXmmhKEyBgbThIC0tbeVkjOgQnCIm\nVEBCACMNbYjBaCQQJIagxQQoGv8a8IYLlEYMyg2HC214uRg8REuwJSGeQNKGAu1EKIVChXJY/4vq\nYAWmZY5dM99PMsmwO7N/v70cn07XrD3bMcYYAQB6vT7xbgAA0DMENgBYgsAGAEsQ2ABgCQIbACxB\nYAOAJYIGdmNjo2bMmKEJEyYoLy9PW7dulSRVVFTI7XarqKhIRUVFqq6ujkmzAJDMnGDrsJubm9Xc\n3KzCwkJduXJFkyZN0r59+7Rnzx6lpaVp1apVsewVAJJaSrAfZmZmKjMzU5I0YMAA5eTk6OzZs5Ik\nzrcBgNjq8Rx2Q0OD6urq9Pjjj0uStm3bpoKCApWXl6u1tTVqDQIA/mJ6oK2tzUyaNMns3bvXGGOM\n3+83t2/fNrdv3zbr1q0zZWVldz1HEjdu3LhxC+F2P90GdkdHhykuLjYffPDBPX9eX19v8vLy7hnY\nvcHbb78d7xZ6DcbiDsbiDsbijt4wFsGyM+iUiDFG5eXlys3N1cqVKwPbm5qaAvf37t2r/Pz8YLsB\nAERA0A8dv/vuO+3atUsTJ05UUVGRJGnTpk3avXu3jh07JsdxlJ2dre3bt8ekWQBIZkED+8knn9Tt\n27fv2j579uyoNRRpHo8n3i30GozFHYzFHYzFHb19LIKuww5rx47D0j8AeEDBspNT0wHAEgQ2AFiC\nwAYASwT90BHJJT19sNraLsa8blpahi5fbol5XcA2fOiIAMdx1HmiVcwr81oB/sKHjgCQAAhsALAE\ngQ0AliCwAcASBDYAWILABgBLENgAYAkCGwAsQWADgCUIbACwBIENAJYgsAHAEgQ2AFiCwAYASxDY\nAGAJAhsALEFgA4AlCGwAsASBDQCWILABwBIENgBYgsAGAEsQ2ABgCQIbACyREu8Gerv09MFqa7sY\n87ppaRm6fLkl5nUB9F6OMcZEZceOoyjtOqYcx5EUj+OI/fgl07ECvVWw7GRKBAAsQWADgCUIbACw\nBIENAJYIGtiNjY2aMWOGJkyYoLy8PG3dulWS1NLSIq/Xq3Hjxqm4uFitra0xaRYAklnQVSLNzc1q\nbm5WYWGhrly5okmTJmnfvn365JNP9Mgjj+iNN97Qli1bdPHiRVVWVnbdMatEwq3MKhEgCYW8SiQz\nM1OFhYWSpAEDBignJ0dnz57VgQMHVFpaKkkqLS3Vvn37ItwyAODfenziTENDg+rq6jRt2jT5/X65\nXC5Jksvlkt/vv+dzKioqAvc9Ho88Hk9YzQJAovH5fPL5fD16bI9OnLly5YqefvpprV+/XnPnzlVG\nRoYuXrxz9t/gwYPV0tL1rDymRMKuzJQIkITCOnHmxo0beuGFF/TSSy9p7ty5kjrfVTc3N0uSmpqa\nNHTo0Ai2CwC4l6CBbYxReXm5cnNztXLlysD2kpISVVVVSZKqqqoCQQ4AiJ6gUyLffvutnnrqKU2c\nOPGvP5elzZs3a+rUqVq4cKFOnz6trKws7dmzR4MGDeq6Y6ZEwq3MlAiQhIJlJ1/+1I1kCrFkOlag\nt+LLnwAgARDYAGAJAhsALEFgA4AlCGwAsASBDQCWILABwBIENgBYgsAGAEsQ2ABgCQIbACxBYAOA\nJQhsALAEgQ0AliCwAcASBDYAWILABgBLENgAYAkCGwAsQWADgCUIbACwREq8G8D9pPx1FXMA6ERg\n91o3Jd37UvfRwy8IoDdjSgQALEFgA4AlCGwAsASBDQCWILABwBIENgBYgsAGAEsQ2ABgCQIbACxB\nYAOAJQhsALAEgQ0Alug2sMvKyuRyuZSfnx/YVlFRIbfbraKiIhUVFam6ujqqTQIAehDYy5YtuyuQ\nHcfRqlWrVFdXp7q6Oj377LNRaxAA0KnbwJ4+fboyMjLu2m5MrL/6EwCSW8hz2Nu2bVNBQYHKy8vV\n2toayZ4AAPcQ0gUMVqxYobfeekuStH79eq1evVo7duy463EVFRWB+x6PRx6PJ6QmASBR+Xw++Xy+\nHj3WMT2Y22hoaNCcOXN0/PjxHv/McZyEmDbpvExXPI4jHnXjd6yJ8FoBIiFYdoY0JdLU1BS4v3fv\n3i4rSAAA0dHtlMjixYt1+PBhnT9/XiNHjtSGDRvk8/l07NgxOY6j7Oxsbd++PRa9AkBS69GUSEg7\nZkok3MpxqMuUCBBvEZ8SAQDEHoENAJYgsAHAEgQ2AFiCwAYASxDYAGAJAhsALEFgA4AlCGwAsASB\nDQCWILABwBIENgBYgsAGAEsQ2ABgCQIbACxBYAOAJQhsALAEgQ0AliCwAcASBDYAWILABgBLENgA\nYImUeDcASClyHCfmVdPSMnT5ckvM6wKhcowxJio7dhxFadcx1Rkk8TiOeNRNpmPtrJsIr1EklmDZ\nyZQIAFiCwAYASxDYAGAJAhsALEFgA4AlCGwAsASBDQCWILABwBIENgBYgsAGAEsQ2ABgCQIbACzR\nbWCXlZXJ5XIpPz8/sK2lpUVer1fjxo1TcXGxWltbo9okAKAHgb1s2TJVV1d32VZZWSmv16sTJ05o\n5syZqqysjFqDAIBOPfp61YaGBs2ZM0fHjx+XJI0fP16HDx+Wy+VSc3OzPB6Pfv3116475utVw60c\nh7rJdKyddRPhNYrEEvGvV/X7/XK5XJIkl8slv98fencAgB4J+4ozjuPc92ohFRUVgfsej0cejyfc\ncgCQUHw+n3w+X48eG/KUiM/nU2ZmppqamjRjxgymRCJfOQ51k+lYO+smwmsUiSXiUyIlJSWqqqqS\nJFVVVWnu3LmhdwcA6JFu32EvXrxYhw8f1vnz5+VyufTOO+/o+eef18KFC3X69GllZWVpz549GjRo\nUNcd8w473MpxqJtMx9pZNxFeo0gswbKTi/B2g8BO7LqJ8BpFYuEivACQAAhsALAEgQ0AliCwAcAS\nBDYAWILABgBLENgAYAkCGwAsQWADgCUIbACwBIENAJYI+/uwY6Gjo0Pnzp2Led0+ffh9BqD3sCKw\nN22q1KZN/6d+/QZ1/+AIun69Oab1EGsp9734RjSlpWXo8uWWmNeF/awI7OvXr+vGjTW6cWNdTOsO\nHFisS5f+F9OaiKWbise3BLa1xf6XBBIDf/MDgCUIbACwBIENAJYgsAHAEgQ2AFiCwAYASxDYAGAJ\nAhsALEFgA4AlCGwAsASBDQCWILABwBIENgBYgsAGAEsQ2ABgCQIbACxBYAOAJQhsALAEgQ0AliCw\nAcASBDYAWILABgBLpITz5KysLKWnp6tv375KTU3VkSNHItUXAOBfwgpsx3Hk8/k0ePDgSPUDALiP\nsKdEjDGR6AMA0I2w32E/88wz6tu3r15++WUtX768y88rKioC9z0ejzweTzjlACDh+Hw++Xy+Hj3W\nMWG8RW5qatKwYcP0xx9/yOv1atu2bZo+fXrnjh0nYu++33xznSor/ytpXUT211MDBxbr0qX/SYrH\nXxFOHOrGo2Zy1uUvU9xPsOwMa0pk2LBhkqQhQ4Zo3rx5fOgIAFEUcmC3t7erra1NknT16lUdPHhQ\n+fn5EWsMANBVyHPYfr9f8+bNkyTdvHlTL774ooqLiyPWGACgq5ADOzs7W8eOHYtkLwCAIDjTEQAs\nQWADgCUIbACwRFgnzgBAb5SePlhtbRdjXjctLUOXL7dEbf8ENoCE0xnWsT85qa3Nier+mRIBAEsQ\n2ABgCQIbACxBYAOAJQhsALAEgQ0AliCwAcASrMMGYi5FjhPd9br3Eu2TOhB9BDYQczeViCd1IPqY\nEgEASxDYAGAJAhsALEFgA4AlCGwAsASBDQCWILABwBKswwaSRuxP2OFkncgisIGkEfsTdjhZJ7KY\nEgEASxDYAGAJAhsALEFgA4AlCGwAsASBDQCWILABwBKswwYQRfG5uk6iIrABRFF8rq4jJeYvCaZE\nAMASBDYAWILABgBLhBzY1dXVGj9+vMaOHastW7ZEsqcI88W7gV7EF+8GehFfvBvoRXzxbqAX8cW7\ngaBCCuxbt27p1VdfVXV1tX7++Wft3r1bv/zyS6R7ixBfvBvoRXzxbqAX8cW7gV7EF+8GehFfvBsI\nKqTAPnLkiMaMGaOsrCylpqZq0aJF2r9/f6R7AwD8Q0jL+s6ePauRI0cG/u12u1VTUxOxpv6tT58+\n+s9/PlX//j888HOvXTuh/v1/DKnutWt1IT0PAKIhpMDu6UL4SC+Yv379txCfdyLMyvFa0xmNuhvi\nULMnqBvfut29LqJRM9pCrRveWETzRKGQAnvEiBFqbGwM/LuxsVFut7vLY4yJx2J5AEhcIc1hT548\nWSdPnlRDQ4M6Ojr0xRdfqKSkJNK9AQD+IaR32CkpKfroo480a9Ys3bp1S+Xl5crJyYl0bwCAfwh5\nHfbs2bP122+/6ffff9ebb74ZyZ4eSGtrqxYsWKCcnBzl5uaqpqZGLS0t8nq9GjdunIqLi9Xa2hp4\n/ObNmzV27FiNHz9eBw8ejFvf0bB582ZNmDBB+fn5WrJkia5fv540Y1FWViaXy6X8/PzAtlCO/ccf\nf1R+fr7Gjh2r119/PabHECn3Gos1a9YoJydHBQUFmj9/vi5duhT4WbKNxd/ef/999enTRy0td67q\n3uvHwlhu6dKlZseOHcYYY27cuGFaW1vNmjVrzJYtW4wxxlRWVpq1a9caY4z56aefTEFBgeno6DD1\n9fVm9OjR5tatW3HrPZLq6+tNdna2uXbtmjHGmIULF5qdO3cmzVh8/fXXpra21uTl5QW2Pcix3759\n2xhjzJQpU0xNTY0xxpjZs2ebr776KsZHEr57jcXBgwcD/33Xrl2b1GNhjDGnT582s2bNMllZWebC\nhQvGGDvGwupT0y9duqRvvvlGZWVlkjqnagYOHKgDBw6otLRUklRaWqp9+/ZJkvbv36/FixcrNTVV\nWVlZGjNmjI4cORK3/iMpPT1dqampam9v182bN9Xe3q7hw4cnzVhMnz5dGRkZXbY9yLHX1NSoqalJ\nbW1tmjp1qiRp6dKlgefY5F5j4fV61adP5//u06ZN05kzZyQl51hI0qpVq/Tuu+922WbDWFgd2PX1\n9RoyZIiWLVumxx57TMuXL9fVq1fl9/vlcrkkSS6XS36/X5J07ty5LqtZ3G63zp49G5feI23w4MFa\nvXq1Ro0apeHDh2vQoEHyer1JORZ/e9Bj//f2ESNGJNyYSNLHH3+s5557TlJyjsX+/fvldrs1ceLE\nLtttGAurA/vmzZuqra3VK6+8otraWj300EOqrKzs8hjHcYKui0yUL1c/deqUPvzwQzU0NOjcuXO6\ncuWKdu3a1eUxyTIW99LdsSeLjRs3ql+/flqyZEm8W4mL9vZ2bdq0SRs23FlrbSxagmx1YLvdbrnd\nbk2ZMkWStGDBAtXW1iozM1PNzc2SpKamJg0dOlTS3evHz5w5oxEjRsS+8Sg4evSonnjiCT388MNK\nSUnR/Pnz9f333yflWPzN5XL1+NjdbrdGjBgRmCr4e3sijcnOnTv15Zdf6rPPPgtsS7axOHXqlBoa\nGlRQUKDs7GydOXNGkyZNkt/vt2IsrA7szMxMjRw5UidOdJ7JeOjQIU2YMEFz5sxRVVWVJKmqqkpz\n586VJJWUlOjzzz9XR0eH6uvrdfLkycC8lO3Gjx+vH374QX/++aeMMTp06JByc3OTciz+VlJS8kDH\nnpmZqfT0dNXU1MgYo08//TTwHNtVV1frvffe0/79+9W/f//A9mQbi/z8fPn9ftXX16u+vl5ut1u1\ntbVyuVx2jEVcPuqMoGPHjpnJkyebiRMnmnnz5pnW1lZz4cIFM3PmTDN27Fjj9XrNxYsXA4/fuHGj\nGT16tHn00UdNdXV1HDuPvC1btpjc3FyTl5dnli5dajo6OpJmLBYtWmSGDRtmUlNTjdvtNh9//HFI\nx3706FGTl5dnRo8ebV577bV4HErY/j0WO3bsMGPGjDGjRo0yhYWFprCw0KxYsSLw+GQYi379+gVe\nF/+UnZ0dWCViTO8fC8cYiyZwACCJWT0lAgDJhMAGAEsQ2ABgCQIbACxBYAOAJQhsALDE/wPNRBb9\nU8vqkwAAAABJRU5ErkJggg==\n" | |
| } | |
| ], | |
| "prompt_number": 57 | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": "df = pd.DataFrame(data)\ndf.describe()", | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>0</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>count</th>\n <td> 100.000000</td>\n </tr>\n <tr>\n <th>mean</th>\n <td> 860.896357</td>\n </tr>\n <tr>\n <th>std</th>\n <td> 188.416866</td>\n </tr>\n <tr>\n <th>min</th>\n <td> 558.339959</td>\n </tr>\n <tr>\n <th>25%</th>\n <td> 724.077344</td>\n </tr>\n <tr>\n <th>50%</th>\n <td> 861.077929</td>\n </tr>\n <tr>\n <th>75%</th>\n <td> 939.012140</td>\n </tr>\n <tr>\n <th>max</th>\n <td> 1448.154688</td>\n </tr>\n </tbody>\n</table>\n</div>", | |
| "output_type": "pyout", | |
| "prompt_number": 58, | |
| "text": " 0\ncount 100.000000\nmean 860.896357\nstd 188.416866\nmin 558.339959\n25% 724.077344\n50% 861.077929\n75% 939.012140\nmax 1448.154688" | |
| } | |
| ], | |
| "prompt_number": 58 | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": "", | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [] | |
| } | |
| ], | |
| "metadata": {} | |
| } | |
| ] | |
| } |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment