@ewencp
Created October 16, 2013 16:13

Revisions

  1. ewencp created this gist Oct 16, 2013.
    24 changes: 24 additions & 0 deletions mrjob_join.py
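    from mrjob.job import MRJob
    from mrjob.protocol import JSONProtocol


    def is_customer_record(record):
        # Assumption (not in the original gist): only order records carry
        # an 'orderId' field. Adapt this test to your own schema.
        return 'orderId' not in record


    class JoinExample(MRJob):
        # Assumption: each input line is "<json id>\t<json record>", so the
        # mapper receives a decoded dict rather than a raw line of text.
        INPUT_PROTOCOL = JSONProtocol

        def mapper(self, id, record):
            # Use both large files as input. If you have orders and
            # customers, each input pair is either
            #   order_id, order_data
            # or
            #   customer_id, customer_data
            # Both record types are assumed to carry a customerId field to
            # join on, and to be distinguishable in the reducer.
            yield record['customerId'], record

        def reducer(self, customerID, records):
            customer = None
            orders = []
            for record in records:
                if is_customer_record(record):
                    # Keep the customer info for this key
                    customer = record
                else:
                    # Buffer the order info until the whole group is read
                    orders.append(record)
            # One possible output: a joined row per order (an inner join)
            if customer is not None:
                for order in orders:
                    yield customerID, {'customer': customer, 'order': order}


    if __name__ == '__main__':
        JoinExample.run()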
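
To see what the mapper and reducer above do without a Hadoop or mrjob setup, the same reduce-side join can be sketched in plain Python. This is only an illustration under assumptions: `map_record`, `reduce_group`, and `run_join` are hypothetical names, the `orderId` field used to tell the two record types apart is made up, and the in-memory grouping merely stands in for the shuffle that MapReduce performs between the map and reduce phases.

```python
from collections import defaultdict


def is_customer_record(record):
    # Hypothetical rule: only order records carry an 'orderId' field.
    return 'orderId' not in record


def map_record(record):
    # Same idea as the mapper: key every record by its join field.
    return record['customerId'], record


def reduce_group(customer_id, records):
    # Same idea as the reducer: split the group, then pair them up.
    customer = None
    orders = []
    for record in records:
        if is_customer_record(record):
            customer = record
        else:
            orders.append(record)
    if customer is not None:
        for order in orders:
            yield customer_id, {'customer': customer, 'order': order}


def run_join(records):
    # Stand-in for the shuffle: group mapped records by key.
    groups = defaultdict(list)
    for record in records:
        key, value = map_record(record)
        groups[key].append(value)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_group(key, groups[key]))
    return joined


customers = [
    {'customerId': 1, 'name': 'Ada'},
    {'customerId': 2, 'name': 'Grace'},
]
orders = [
    {'customerId': 1, 'orderId': 10, 'total': 25.0},
    {'customerId': 1, 'orderId': 11, 'total': 7.5},
    {'customerId': 3, 'orderId': 12, 'total': 3.0},
]

# Both "files" are fed in together, in any order.
rows = run_join(orders + customers)
for key, row in rows:
    print(key, row['customer']['name'], row['order']['orderId'])
# prints:
# 1 Ada 10
# 1 Ada 11
```

Customer 2 has no orders and order 12 has no matching customer, so neither appears in the output. Emitting those unmatched records as well would turn this inner join into an outer join.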