Created
October 24, 2013 20:15
A quick run through weighted reservoir sampling.
| { | |
| "metadata": { | |
| "name": "Weighted Reservoir Sampling" | |
| }, | |
| "nbformat": 3, | |
| "nbformat_minor": 0, | |
| "worksheets": [ | |
| { | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": "Suppose we would like to collect a sample of $1000$ individuals uniformly at random from a population of unknown size. How might we proceed? The obvious method is to assign a number to each member of the population and then draw some random numbers, but that requires knowing the size of the population up front. We can do better, using a simple observation." | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": "If we randomly *order* the population, we can take the first $1000$ individuals under that random ordering. So let's try the following method. For each member of the population, generate an independent random number, uniform on $[0,1).$ We walk through the population, keeping track of the $1000$ largest numbers seen so far; this set is what we'll call the reservoir. Each time we see a new number, we either evict an element of the reservoir (if the new number beats the reservoir's minimum) or do nothing. When we reach the end, the reservoir holds the desired sample." | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": "Here's what it looks like in code:" | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": "import numpy as np\nimport heapq as hq\n\ndef with_random(iterable):\n    \"\"\" generates your sequence accompanied by independent uniform\n    random numbers. For example:\n    a, b, c, ... ~> (0.4, a), (0.7, b), (0.3, c), ...\n    \"\"\"\n    for x in iterable:\n        yield np.random.random(), x\n\nclass BoundedHeap(object):\n    \"\"\" A min-heap holding at most max_size items;\n    it retains the largest items pushed into it.\n    \"\"\"\n    def __init__(self, max_size=100):\n        self.heap = []\n        self.max_size = max_size\n\n    def __len__(self):\n        return len(self.heap)\n\n    def __iter__(self):\n        for x in self.heap:\n            yield x\n\n    def min(self):\n        return self.heap[0][0]\n\n    def add(self, x):\n        if len(self) < self.max_size:\n            hq.heappush(self.heap, x)\n        elif x > self.heap[0]:\n            hq.heapreplace(self.heap, x)\n\n    def consume(self, iterable):\n        for x in iterable:\n            self.add(x)\n\ndef simple_reservoir_sample(iterable, sample_size=100):\n    \"\"\" collects a sample without replacement from our iterable in a single pass. \"\"\"\n    reservoir = BoundedHeap(sample_size)\n    reservoir.consume(with_random(iterable))\n    for rnd, val in reservoir:\n        yield val", | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [], | |
| "prompt_number": 67 | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": "That's a lot of words and code, so let's try an example. Here are ten random words:" | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": "with open('/usr/share/dict/words', 'r') as f:\n for word in simple_reservoir_sample(f, 10):\n print word", | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "stream": "stdout", | |
| "text": "Scythize\n\nchapper\n\ncheckstone\n\nundishonored\n\nfarthest\n\nunretrievable\n\nparacusia\n\nplump\n\npodarthral\n\nunforeseeable\n\n" | |
| } | |
| ], | |
| "prompt_number": 68 | |
| }, | |
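As a side note for readers on Python 3 (the notebook above is Python 2), the same single-pass idea can be condensed using `heapq.nlargest`, which consumes the stream while holding only $k$ tagged items in memory. This is a standalone sketch, not the gist's code, and `reservoir_sample` is a made-up name:

```python
import heapq
import random

def reservoir_sample(iterable, k, rng=random):
    # Tag each item with an independent Uniform[0, 1) key and keep the
    # k items with the largest keys; heapq.nlargest streams through the
    # iterable while holding only k tagged items at a time.
    tagged = ((rng.random(), x) for x in iterable)
    return [x for _, x in heapq.nlargest(k, tagged)]

rng = random.Random(1)
sample = reservoir_sample(range(1000), 10, rng)
print(sorted(sample))
```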
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": "Aside: if you really want to use this to generate passwords, you should restrict to a smaller set of words that are easier to remember. This doesn't cost too much entropy and may save you some headaches." | |
| }, | |
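To put a rough number on "doesn't cost too much entropy" (a back-of-the-envelope sketch with assumed word counts, not from the gist): shrinking the word list from $N$ words to $M$ words costs $\log_2(N/M)$ bits per sampled word.

```python
import math

# Hypothetical counts: a stock /usr/share/dict/words has roughly
# 235,886 entries; a curated easy-to-remember list might have 8,192.
full_list, curated_list = 235886, 8192
bits_lost_per_word = math.log2(full_list) - math.log2(curated_list)
print(round(bits_lost_per_word, 2))
```

About five bits per word, so one extra word in the passphrase more than makes up the difference.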
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": "Now a few words on how to extend this to the more interesting weighted case. Suppose we would like to sample each word with probability inversely proportional to its length. This is a little harder to picture. As before, we could normalize by the sum of the weights, but computing that sum requires knowing the whole population in advance." | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": "The insight is as follows. Suppose I hand you $100$ independent Poisson processes with rates $a_0, \\ldots, a_{99}.$ This is equivalent in law to a single Poisson process of rate $\\sum a_i$ whose points are independently labelled $i$ with probability proportional to $a_i.$ Practically, this means that to sample elements according to fixed weights, we can race Poisson processes: the process whose first arrival comes earliest wins with probability proportional to its rate. And since the first arrival time of a rate-$a$ process is an exponential random variable with rate $a,$ racing Poisson processes just means sampling exponential random variables and keeping the smallest." | |
| }, | |
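A quick empirical check of the racing claim (a standalone sketch, not part of the notebook; `race_winner` is a made-up helper): with rates $1, 2, 3$, process $i$ should win the race with probability $a_i / 6$.

```python
import random
from collections import Counter

def race_winner(rates, rng):
    # The first arrival of a rate-a Poisson process is Exponential(a),
    # so the earliest first arrival identifies the winning process.
    arrivals = [rng.expovariate(a) for a in rates]
    return min(range(len(rates)), key=arrivals.__getitem__)

rng = random.Random(0)
rates = [1.0, 2.0, 3.0]
n = 60000
counts = Counter(race_winner(rates, rng) for _ in range(n))
for i, a in enumerate(rates):
    print(i, counts[i] / n, a / sum(rates))
```

The empirical win frequencies should land close to 1/6, 1/3, and 1/2.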
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": "def weighted_reservoir_sample(iterable, sample_size):\n    \"\"\"\n    eats an iterable of (item, weight) pairs and\n    returns a weighted sample without replacement.\n    \"\"\"\n    def clean(iterable):\n        \"\"\" negate the arrival times: BoundedHeap retains the largest\n        items, but the smallest exponential arrivals should win the race.\n        \"\"\"\n        for item, weight in iterable:\n            yield -np.random.exponential(scale=1.0/weight), item\n\n    reservoir = BoundedHeap(sample_size)\n    reservoir.consume(clean(iterable))\n    for rnd, val in reservoir:\n        yield val\n\ndef generate_weights(iterable, weight_fn):\n    \"\"\" eats an iterable and a weight function, produces\n    a, b, c, ... ~> (a, weight_fn(a)), (b, weight_fn(b)), ...\n    \"\"\"\n    for x in iterable:\n        yield x, weight_fn(x)", | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [], | |
| "prompt_number": 77 | |
| }, | |
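The same sampler condenses nicely in Python 3 (`weighted_sample` here is a hypothetical stand-in, not the gist's code), and we can sanity-check it at sample size $1$, where the inclusion probability of each item is exactly its weight over the total weight:

```python
import heapq
import random
from collections import Counter

def weighted_sample(pairs, k, rng=random):
    # Tag each item with an Exponential(weight) first-arrival time and
    # keep the k earliest arrivals: the k winners of the Poisson race.
    tagged = ((rng.expovariate(w), x) for x, w in pairs)
    return [x for _, x in heapq.nsmallest(k, tagged)]

rng = random.Random(0)
population = [("a", 1.0), ("b", 2.0), ("c", 3.0)]
n = 60000
counts = Counter(weighted_sample(population, 1, rng)[0] for _ in range(n))
print({word: counts[word] / n for word, _ in population})
```

The frequencies should come out near 1/6, 1/3, and 1/2, matching the weights.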
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": "Now let's sample long words (weight proportional to squared length):" | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": "with open('/usr/share/dict/words', 'r') as f:\n for word in weighted_reservoir_sample(generate_weights(f, lambda w: len(w) * len(w)), 10):\n print word", | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "stream": "stdout", | |
| "text": "misthrift\n\nscribble\n\nclamorist\n\nPinkertonism\n\nnotommatid\n\npoetastress\n\nhighman\n\nstereobatic\n\nrulemonger\n\ncarpocephala\n\n" | |
| } | |
| ], | |
| "prompt_number": 86 | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": "And short words:" | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": "with open('/usr/share/dict/words', 'r') as f:\n for word in weighted_reservoir_sample(generate_weights(f, lambda w: 1.0 / (len(w) * len(w))), 10):\n print word", | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "stream": "stdout", | |
| "text": "resaca\n\nfollis\n\ngeologize\n\nNepali\n\nhogframe\n\nbitonality\n\nnithing\n\nignitability\n\nintercadence\n\nbaya\n\n" | |
| } | |
| ], | |
| "prompt_number": 88 | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": "And extraordinarily long words:" | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": "with open('/usr/share/dict/words', 'r') as f:\n for word in weighted_reservoir_sample(generate_weights(f, lambda w: np.exp(len(w))), 10):\n print word", | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "stream": "stdout", | |
| "text": "physicogeographical\n\nparadichlorobenzene\n\ncounternecromancy\n\nantitintinnabularian\n\ninterdestructiveness\n\npaleodendrologically\n\ntheologicometaphysical\n\nbronchoaspergillosis\n\nelectrotelethermometer\n\nscientificophilosophical\n\n" | |
| } | |
| ], | |
| "prompt_number": 91 | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": "" | |
| } | |
| ], | |
| "metadata": {} | |
| } | |
| ] | |
| } |