Skip to content

Instantly share code, notes, and snippets.

@mehitabel
Last active December 20, 2015 14:09
Show Gist options
  • Select an option

  • Save mehitabel/6144403 to your computer and use it in GitHub Desktop.

Select an option

Save mehitabel/6144403 to your computer and use it in GitHub Desktop.
probabilistic counting basics
{
"metadata": {
"name": "Intro to Probabilistic Counting"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "Here's a quick implementation of probabilistic counting. The current implementation counts exactly to $256$ and then increments in powers of $2$ with very low probability. The counter will increment roughly $\\log N + 256$ times. Since each counter is extremely inaccurate, it's valuable to average over a few counters. This implementation is terrible but gives the idea, we can go to the literature and to several active open source projects that do this well. google \"hyper log log counting\""
},
{
"cell_type": "code",
"collapsed": false,
"input": "import numpy as np\n\nclass SimpleProbabilisticCounter(object):\n \"\"\" produces exact counts up to threshold, probabilistic after. \"\"\" \n def __init__(self, thresh=8):\n self.count = 0\n self.prob = 1.0\n self.thresh = thresh\n \n def update_prob(self):\n \"\"\" counters become probabilistic at high load. \"\"\"\n if self.count < 2**self.thresh:\n self.prob = 1.0\n else:\n self.prob = 1.0 / self.count\n \n def tick(self): \n \"\"\" tick and update if necessary \"\"\"\n if self.count < 2**self.thresh:\n self.count += 1\n elif np.random.random() < self.prob:\n self.count *= 2\n self.update_prob()\n \n def multitick(self, n):\n \"\"\" tick n times \"\"\"\n for i in xrange(n):\n self.tick()\n \n ",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 37
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Let's have a look at how to use one of these things."
},
{
"cell_type": "code",
"collapsed": false,
"input": "counters = [SimpleProbabilisticCounter() for i in xrange(8)]\nfor i in xrange(1000):\n for c in counters:\n c.tick()\n \n2**np.mean([np.log2(c.count) for c in counters])",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 55,
"text": "861.07792921980365"
}
],
"prompt_number": 55
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Formalizing this approach to testing, let's try counting to $1000$, we'll do it $100$ times, averaging over eight counters each time."
},
{
"cell_type": "code",
"collapsed": false,
"input": "def test_this_approach(trials=100):\n results = []\n for trial_num in xrange(trials):\n counters = [SimpleProbabilisticCounter() for i in xrange(8)]\n for c in counters:\n c.multitick(1000)\n results.append(2**np.mean([np.log2(c.count) for c in counters]))\n return results",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 56
},
{
"cell_type": "code",
"collapsed": false,
"input": "import pandas as pd\n\ndata = test_this_approach()\nhist(data)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 57,
"text": "(array([ 8, 21, 17, 24, 12, 8, 3, 2, 1, 4]),\n array([ 558.33995912, 647.321432 , 736.30290487, 825.28437775,\n 914.26585062, 1003.2473235 , 1092.22879637, 1181.21026925,\n 1270.19174212, 1359.173215 , 1448.15468787]),\n <a list of 10 Patch objects>)"
},
{
"output_type": "display_data",
"png": "iVBORw0KGgoAAAANSUhEUgAAAWwAAAD9CAYAAACY0k3rAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAEbtJREFUeJzt3VtsVFX/xvFnQ8tLXmmhKEyBgbThIC0tbeVkjOgQnCIm\nVEBCACMNbYjBaCQQJIagxQQoGv8a8IYLlEYMyg2HC214uRg8REuwJSGeQNKGAu1EKIVChXJY/4vq\nYAWmZY5dM99PMsmwO7N/v70cn07XrD3bMcYYAQB6vT7xbgAA0DMENgBYgsAGAEsQ2ABgCQIbACxB\nYAOAJYIGdmNjo2bMmKEJEyYoLy9PW7dulSRVVFTI7XarqKhIRUVFqq6ujkmzAJDMnGDrsJubm9Xc\n3KzCwkJduXJFkyZN0r59+7Rnzx6lpaVp1apVsewVAJJaSrAfZmZmKjMzU5I0YMAA5eTk6OzZs5Ik\nzrcBgNjq8Rx2Q0OD6urq9Pjjj0uStm3bpoKCApWXl6u1tTVqDQIA/mJ6oK2tzUyaNMns3bvXGGOM\n3+83t2/fNrdv3zbr1q0zZWVldz1HEjdu3LhxC+F2P90GdkdHhykuLjYffPDBPX9eX19v8vLy7hnY\nvcHbb78d7xZ6DcbiDsbiDsbijt4wFsGyM+iUiDFG5eXlys3N1cqVKwPbm5qaAvf37t2r/Pz8YLsB\nAERA0A8dv/vuO+3atUsTJ05UUVGRJGnTpk3avXu3jh07JsdxlJ2dre3bt8ekWQBIZkED+8knn9Tt\n27fv2j579uyoNRRpHo8n3i30GozFHYzFHYzFHb19LIKuww5rx47D0j8AeEDBspNT0wHAEgQ2AFiC\nwAYASwT90BHJJT19sNraLsa8blpahi5fbol5XcA2fOiIAMdx1HmiVcwr81oB/sKHjgCQAAhsALAE\ngQ0AliCwAcASBDYAWILABgBLENgAYAkCGwAsQWADgCUIbACwBIENAJYgsAHAEgQ2AFiCwAYASxDY\nAGAJAhsALEFgA4AlCGwAsASBDQCWILABwBIENgBYgsAGAEsQ2ABgCQIbACyREu8Gerv09MFqa7sY\n87ppaRm6fLkl5nUB9F6OMcZEZceOoyjtOqYcx5EUj+OI/fgl07ECvVWw7GRKBAAsQWADgCUIbACw\nBIENAJYIGtiNjY2aMWOGJkyYoLy8PG3dulWS1NLSIq/Xq3Hjxqm4uFitra0xaRYAklnQVSLNzc1q\nbm5WYWGhrly5okmTJmnfvn365JNP9Mgjj+iNN97Qli1bdPHiRVVWVnbdMatEwq3MKhEgCYW8SiQz\nM1OFhYWSpAEDBignJ0dnz57VgQMHVFpaKkkqLS3Vvn37ItwyAODfenziTENDg+rq6jRt2jT5/X65\nXC5Jksvlkt/vv+dzKioqAvc9Ho88Hk9YzQJAovH5fPL5fD16bI9OnLly5YqefvpprV+/XnPnzlVG\nRoYuXrxz9t/gwYPV0tL1rDymRMKuzJQIkITCOnHmxo0beuGFF/TSSy9p7ty5kjrfVTc3N0uSmpqa\nNHTo0Ai2CwC4l6CBbYxReXm5cnNztXLlysD2kpISVVVVSZKqqqoCQQ4AiJ6gUyLffvutnnrqKU2c\nOPGvP5elzZs3a+rUqVq4cKFOnz6trKws7dmzR4MGDeq6Y6ZEwq3MlAiQhIJlJ1/+1I1kCrFkOlag\nt+LLnwAgARDYAGAJAhsALEFgA4AlCGwAsASBDQCWILABwBIENgBYgsAGAEsQ2ABgCQIbACxBYAOA\nJQhsALAEgQ0AliCwAcASBDYAWILABgBLENgAYAkCGwAsQWADgCUIbACwREq8G8D9pPx1FXMA6ERg\n91o3Jd37UvfRwy8IoDdjSgQALEFgA4AlCGwAsASBDQCWILABwBIENgBYgsAGAEsQ2ABgCQIbACxB\nYAOAJQhsALAEgQ0Alug2sMvKyuRyuZSfnx/YVlFRIbfbraKiIhUVFam6ujqqTQIAehDYy5YtuyuQ\nHcfRqlWrVFdXp7q6Oj377LNRaxAA0KnbwJ4+fboyMjLu2m5MrL/6EwCSW8hz2Nu2bVNBQYHKy8vV\n2toayZ4AAPcQ0gUMVqxYobfeekuStH79eq1evVo7duy463EVFRWB+x6PRx6PJ6QmASBR+Xw++Xy+\nHj3WMT2Y22hoaNCcOXN0/PjxHv/McZyEmDbpvExXPI4jHnXjd6yJ8FoBIiFYdoY0JdLU1BS4v3fv\n3i4rSAAA0dHtlMjixYt1+PBhnT9/XiNHjtSGDRvk8/l07NgxOY6j7Oxsbd++PRa9AkBS69GUSEg7\nZkok3MpxqMuUCBBvEZ8SAQDEHoENAJYgsAHAEgQ2AFiCwAYASxDYAGAJAhsALEFgA4AlCGwAsASB\nDQCWILABwBIENgBYgsAGAEsQ2ABgCQIbACxBYAOAJQhsALAEgQ0AliCwAcASBDYAWILABgBLENgA\nYImUeDcASClyHCfmVdPSMnT5ckvM6wKhcowxJio7dhxFadcx1Rkk8TiOeNRNpmPtrJsIr1EklmDZ\nyZQIAFiCwAYASxDYAGAJAhsALEFgA4AlCGwAsASBDQCWILABwBIENgBYgsAGAEsQ2ABgCQIbACzR\nbWCXlZXJ5XIpPz8/sK2lpUVer1fjxo1TcXGxWltbo9okAKAHgb1s2TJVV1d32VZZWSmv16sTJ05o\n5syZqqysjFqDAIBOPfp61YaGBs2ZM0fHjx+XJI0fP16HDx+Wy+VSc3OzPB6Pfv3116475utVw60c\nh7rJdKyddRPhNYrEEvGvV/X7/XK5XJIkl8slv98fencAgB4J+4ozjuPc92ohFRUVgfsej0cejyfc\ncgCQUHw+n3w+X48eG/KUiM/nU2ZmppqamjRjxgymRCJfOQ51k+lYO+smwmsUiSXiUyIlJSWqqqqS\nJFVVVWnu3LmhdwcA6JFu32EvXrxYhw8f1vnz5+VyufTOO+/o+eef18KFC3X69GllZWVpz549GjRo\nUNcd8w473MpxqJtMx9pZNxFeo0gswbKTi/B2g8BO7LqJ8BpFYuEivACQAAhsALAEgQ0AliCwAcAS\nBDYAWILABgBLENgAYAkCGwAsQWADgCUIbACwBIENAJYI+/uwY6Gjo0Pnzp2Led0+ffh9BqD3sCKw\nN22q1KZN/6d+/QZ1/+AIun69Oab1EGsp9734RjSlpWXo8uWWmNeF/awI7OvXr+vGjTW6cWNdTOsO\nHFisS5f+F9OaiKWbise3BLa1xf6XBBIDf/MDgCUIbACwBIENAJYgsAHAEgQ2AFiCwAYASxDYAGAJ\nAhsALEFgA4AlCGwAsASBDQCWILABwBIENgBYgsAGAEsQ2ABgCQIbACxBYAOAJQhsALAEgQ0AliCw\nAcASBDYAWILABgBLpITz5KysLKWnp6tv375KTU3VkSNHItUXAOBfwgpsx3Hk8/k0ePDgSPUDALiP\nsKdEjDGR6AMA0I2w32E/88wz6tu3r15++WUtX768y88rKioC9z0ejzweTzjlACDh+Hw++Xy+Hj3W\nMWG8RW5qatKwYcP0xx9/yOv1atu2bZo+fXrnjh0nYu++33xznSor/ytpXUT211MDBxbr0qX/SYrH\nXxFOHOrGo2Zy1uUvU9xPsOwMa0pk2LBhkqQhQ4Zo3rx5fOgIAFEUcmC3t7erra1NknT16lUdPHhQ\n+fn5EWsMANBVyHPYfr9f8+bNkyTdvHlTL774ooqLiyPWGACgq5ADOzs7W8eOHYtkLwCAIDjTEQAs\nQWADgCUIbACwRFgnzgBAb5SePlhtbRdjXjctLUOXL7dEbf8ENoCE0xnWsT85qa3Nier+mRIBAEsQ\n2ABgCQIbACxBYAOAJQhsALAEgQ0AliCwAcASrMMGYi5FjhPd9br3Eu2TOhB9BDYQczeViCd1IPqY\nEgEASxDYAGAJAhsALEFgA4AlCGwAsASBDQCWILABwBKswwaSRuxP2OFkncgisIGkEfsTdjhZJ7KY\nEgEASxDYAGAJAhsALEFgA4AlCGwAsASBDQCWILABwBKswwYQRfG5uk6iIrABRFF8rq4jJeYvCaZE\nAMASBDYAWILABgBLhBzY1dXVGj9+vMaOHastW7ZEsqcI88W7gV7EF+8GehFfvBvoRXzxbqAX8cW7\ngaBCCuxbt27p1VdfVXV1tX7++Wft3r1bv/zyS6R7ixBfvBvoRXzxbqAX8cW7gV7EF+8GehFfvBsI\nKqTAPnLkiMaMGaOsrCylpqZq0aJF2r9/f6R7AwD8Q0jL+s6ePauRI0cG/u12u1VTUxOxpv6tT58+\n+s9/PlX//j888HOvXTuh/v1/DKnutWt1IT0PAKIhpMDu6UL4SC+Yv379txCfdyLMyvFa0xmNuhvi\nULMnqBvfut29LqJRM9pCrRveWETzRKGQAnvEiBFqbGwM/LuxsVFut7vLY4yJx2J5AEhcIc1hT548\nWSdPnlRDQ4M6Ojr0xRdfqKSkJNK9AQD+IaR32CkpKfroo480a9Ys3bp1S+Xl5crJyYl0bwCAfwh5\nHfbs2bP122+/6ffff9ebb74ZyZ4eSGtrqxYsWKCcnBzl5uaqpqZGLS0t8nq9GjdunIqLi9Xa2hp4\n/ObNmzV27FiNHz9eBw8ejFvf0bB582ZNmDBB+fn5WrJkia5fv540Y1FWViaXy6X8/PzAtlCO/ccf\nf1R+fr7Gjh2r119/PabHECn3Gos1a9YoJydHBQUFmj9/vi5duhT4WbKNxd/ef/999enTRy0td67q\n3uvHwlhu6dKlZseOHcYYY27cuGFaW1vNmjVrzJYtW4wxxlRWVpq1a9caY4z56aefTEFBgeno6DD1\n9fVm9OjR5tatW3HrPZLq6+tNdna2uXbtmjHGmIULF5qdO3cmzVh8/fXXpra21uTl5QW2Pcix3759\n2xhjzJQpU0xNTY0xxpjZs2ebr776KsZHEr57jcXBgwcD/33Xrl2b1GNhjDGnT582s2bNMllZWebC\nhQvGGDvGwupT0y9duqRvvvlGZWVlkjqnagYOHKgDBw6otLRUklRaWqp9+/ZJkvbv36/FixcrNTVV\nWVlZGjNmjI4cORK3/iMpPT1dqampam9v182bN9Xe3q7hw4cnzVhMnz5dGRkZXbY9yLHX1NSoqalJ\nbW1tmjp1qiRp6dKlgefY5F5j4fV61adP5//u06ZN05kzZyQl51hI0qpVq/Tuu+922WbDWFgd2PX1\n9RoyZIiWLVumxx57TMuXL9fVq1fl9/vlcrkkSS6XS36/X5J07ty5LqtZ3G63zp49G5feI23w4MFa\nvXq1Ro0apeHDh2vQoEHyer1JORZ/e9Bj//f2ESNGJNyYSNLHH3+s5557TlJyjsX+/fvldrs1ceLE\nLtttGAurA/vmzZuqra3VK6+8otraWj300EOqrKzs8hjHcYKui0yUL1c/deqUPvzwQzU0NOjcuXO6\ncuWKdu3a1eUxyTIW99LdsSeLjRs3ql+/flqyZEm8W4mL9vZ2bdq0SRs23FlrbSxagmx1YLvdbrnd\nbk2ZMkWStGDBAtXW1iozM1PNzc2SpKamJg0dOlTS3evHz5w5oxEjRsS+8Sg4evSonnjiCT388MNK\nSUnR/Pnz9f333yflWPzN5XL1+NjdbrdGjBgRmCr4e3sijcnOnTv15Zdf6rPPPgtsS7axOHXqlBoa\nGlRQUKDs7GydOXNGkyZNkt/vt2IsrA7szMxMjRw5UidOdJ7JeOjQIU2YMEFz5sxRVVWVJKmqqkpz\n586VJJWUlOjzzz9XR0eH6uvrdfLkycC8lO3Gjx+vH374QX/++aeMMTp06JByc3OTciz+VlJS8kDH\nnpmZqfT0dNXU1MgYo08//TTwHNtVV1frvffe0/79+9W/f//A9mQbi/z8fPn9ftXX16u+vl5ut1u1\ntbVyuVx2jEVcPuqMoGPHjpnJkyebiRMnmnnz5pnW1lZz4cIFM3PmTDN27Fjj9XrNxYsXA4/fuHGj\nGT16tHn00UdNdXV1HDuPvC1btpjc3FyTl5dnli5dajo6OpJmLBYtWmSGDRtmUlNTjdvtNh9//HFI\nx3706FGTl5dnRo8ebV577bV4HErY/j0WO3bsMGPGjDGjRo0yhYWFprCw0KxYsSLw+GQYi379+gVe\nF/+UnZ0dWCViTO8fC8cYiyZwACCJWT0lAgDJhMAGAEsQ2ABgCQIbACxBYAOAJQhsALDE/wPNRBb9\nU8vqkwAAAABJRU5ErkJggg==\n"
}
],
"prompt_number": 57
},
{
"cell_type": "code",
"collapsed": false,
"input": "df = pd.DataFrame(data)\ndf.describe()",
"language": "python",
"metadata": {},
"outputs": [
{
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>0</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>count</th>\n <td> 100.000000</td>\n </tr>\n <tr>\n <th>mean</th>\n <td> 860.896357</td>\n </tr>\n <tr>\n <th>std</th>\n <td> 188.416866</td>\n </tr>\n <tr>\n <th>min</th>\n <td> 558.339959</td>\n </tr>\n <tr>\n <th>25%</th>\n <td> 724.077344</td>\n </tr>\n <tr>\n <th>50%</th>\n <td> 861.077929</td>\n </tr>\n <tr>\n <th>75%</th>\n <td> 939.012140</td>\n </tr>\n <tr>\n <th>max</th>\n <td> 1448.154688</td>\n </tr>\n </tbody>\n</table>\n</div>",
"output_type": "pyout",
"prompt_number": 58,
"text": " 0\ncount 100.000000\nmean 860.896357\nstd 188.416866\nmin 558.339959\n25% 724.077344\n50% 861.077929\n75% 939.012140\nmax 1448.154688"
}
],
"prompt_number": 58
},
{
"cell_type": "code",
"collapsed": false,
"input": "",
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment