@worthyl
Forked from postlmc/Notes.md
Created April 11, 2018 17:06
AnacondaCon 18

My brain-dump from attending April 8th - 12th, 2018.

Many (all?) sessions will be available in the Anaconda channel on YouTube.

Sunday Tutorials

Tutorial 1: Introduction to Machine Learning with scikit-learn

Presenter: David Mertz

The notebooks will be interesting to play with later when I want flashbacks to my Regression Analysis courses.

Tutorial 2: Practical Data Science with ML and GPUs

Presenter: Stan Seibert
Slides and Notebooks: https://github.com/ContinuumIO/ac2018-dl-gpu

Critical Questions for ML:

Part I

  • What is the specific purpose of the model I am trying to create?
  • What specific quantitative model output will help me achieve that purpose?
  • Are there publications, blog posts, or conference talks describing a similar use case?
  • What relevant data is available for the model? Can more be gathered?
  • Is there additional domain knowledge that would be helpful for selecting training data?
  • Are there experts in the data I can talk to?

Part II

  • Are there security or ethical concerns associated with the training data and the future uses of the model?
  • What accuracy has already been achieved in our organization, and how much improvement is required for success?
  • Can the business benefits of the improved accuracy be quantified?
  • How will I know the model is working once deployed?

Today's GPUs (rented from GCE): K80, 4992 CUDA cores, 4.29 TFLOPS (single precision), 480 GB/sec memory bandwidth
Today's exercise: classifying the CIFAR10 image set

Most important question in ML: How do I know when I'm done? (Pseudo-answer: "When accuracy stops improving.")

"Overfitting does not mean your model is bad. Most models will overfit after enough training epochs."
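The pseudo-answer above ("when accuracy stops improving") is usually implemented as patience-based early stopping. A minimal sketch in plain Python; the loss values and the `patience` threshold here are made-up illustrations, not anything from the talk:

```python
def early_stop(val_losses, patience=3):
    """Return the epoch index at which training should stop: the first
    epoch where the validation loss has failed to improve for
    `patience` consecutive epochs. Returns None if we never stop."""
    best = float("inf")
    since_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            since_improvement = 0
        else:
            since_improvement += 1
            if since_improvement >= patience:
                return epoch
    return None

# Loss improves, then plateaus: stop after 3 epochs without improvement.
losses = [0.9, 0.7, 0.5, 0.51, 0.52, 0.53, 0.49]
print(early_stop(losses, patience=3))  # -> 5
```

Frameworks like Keras ship the same idea as a callback, but the logic is just this counter.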

  • Keras has a (new-ish) multi-GPU model that will automagically split the batches, recombine, then update the weights
  • Since keras runs on top of TF, one can use TF functions to pin work to specific GPUs
  • conda can be used to package a trained model!
  • TensorRT: "Compiles" a trained model for production deployment

Day Two

11:00 Deep Learning with Just a Little Bit of Data

Presenter: Mike Bernico, @mikebernico

  • All about transfer learning
  • Universal truths: 1) the data is never good and 2) never trust a wine pairing
  • His new book

13:00 Deploying Python and R to Spark and Hadoop

  • Conda runtime distribution is available for both Cloudera and Hortonworks environments! (AE only?)
  • Integrates via Livy (Livy supports Kerberos with impersonation -- are we using that?)
  • For Cloudera AE gets wired up as a parcel repo

14:00 Accelerating Scientific Workloads with Numba

Presenter: Siu Kwan Lam, @sklam (primary developer of Numba!)

  • Basic API: Just use the @numba.jit decorator!
```python
import numba
import numpy as np

@numba.jit
def mvmul(A, x):
    return A @ x  # matrix-vector product; plain `A * x` would broadcast elementwise

A = np.random.rand(1000, 10000)
x = np.random.random(10000)
mvmul(A, x)
```
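A matrix-vector product like `mvmul` above is just a dot product per row, which is exactly the kind of loop Numba compiles to machine code. A pure-Python sketch of the same computation (no NumPy or Numba required), useful for sanity-checking results on tiny inputs:

```python
def mvmul_py(A, x):
    """Matrix-vector product over plain nested lists: y[i] = sum_j A[i][j] * x[j]."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

A = [[1, 2], [3, 4]]
x = [10, 1]
print(mvmul_py(A, x))  # -> [12, 34]
```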

He then turned to a list of projects using or extending Numba.

15:10 Production-Grade Packaging with Anaconda

Presenter: Mahmoud Hashemi

  • Packaging is: turning code into a deployable archive
  • The 90/90 rule: "The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time." -- Tom Cargill, Bell Labs
  • Python: .py (module) < dir w/ __init__.py (package) < 0+ packages (distribution)
  • sdist == source distribution, bdist == binary distribution
  • python setup.py ... bdist_upload (missed it)
  • "Don't pip install to production!"
  • Holy crap: pipsi
  • His other talk
  • conda-pack
  • Research: Flatpak vs. Snappy (.deb vs. .rpm, round 2)
  • OpenSky
  • Useful conda flags to save space and time, and get what you want: nomkl, clean, --no-channel-priority
  • Freezer to native packaging: constructor
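The module < package < distribution hierarchy from the bullets above can be demonstrated with the standard library alone; the `demo_pkg` name is made up for illustration:

```python
import sys
import tempfile
from pathlib import Path

# A package is just a directory containing an __init__.py.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
pkg.mkdir()
(pkg / "__init__.py").write_text("VERSION = '0.1'\n")
# A module is a single .py file -- here, one living inside the package.
(pkg / "core.py").write_text("def greet():\n    return 'hello'\n")

sys.path.insert(0, str(root))
import demo_pkg
from demo_pkg import core

print(demo_pkg.VERSION, core.greet())  # -> 0.1 hello
```

A distribution (sdist/bdist) is then just an archive bundling zero or more such packages plus metadata.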

16:10 Data Engineering for Data Science

  • Slides can be found here
  • Checkout Hydrogen for Atom (sigh)
  • from sklearn.preprocessing import CategoricalImputer, LabelBinarizer
  • pip install sklearn-pandas for categorical predictors using DataFrameMapper
  • pip install cerberus for data validation
  • For CLIs: conda install -c conda-forge fire, then Fire() after import
  • import mummify, then mummify.log(...) and use the mummify command line tool to manage history (This is sweet, but I need to see what it actually does to git under the covers. Hint: it's using git --git-dir=.mummify voodoo to do the magical history work.)

Day Three

10:00 Building Better Bad-ass Cars

Presenter: Peter Bushbacher

Uh-oh. He said "Tableau", and I tuned out...

10:00 conda: Tips and Tricks

Presenter: Kale Franz

11:00 GPU-Accelerating UDFs in PySpark with Numba and PyGDF

Presenters: Joshua Patterson, @datametrician & Keith Kraus, @keithjkraus

Joshua:

Keith:

13:00 Real-Time Processing with Dask

Presenter: Matt Rocklin
Slides: http://matthewrocklin.com/slides/anacondacon-2018.html#/

14:00 Model Management with Anaconda Enterprise

Presenter: Michael Grant

15:10 Jumpstart Writing Continuous Application with Structured Streaming using Python APIS in Apache Spark

Presenter: Jules Damji, @2twitme
Slides: https://dbricks.co/anacondaConf2018

  • Structured Streaming can operate in continuous mode instead of micro-batching
  • .trigger() is checkpoint interval, not batch
  • Sample notebook: https://dbricks.co/ss_mllib_py

16:10 Apache Arrow: A Cross-Language Development Platform for In-Memory Data

Presenter: Wes McKinney

A good, if a bit dry, talk. (Though, he did warn us.) You had to be there.

Day Four

10:00 Accelerating Deep Learning with GPUs

Presenter: Stan Seibert

  • One good note: if installing Keras via conda install keras-gpu, CUDA toolkit is not needed to take advantage of the GPU as long as a true Nvidia driver is being used. (The OSS clone included with Ubuntu will not work.) conda will install "enough of CUDA" with Keras.

Otherwise, this was mostly a showcase of ML/GPU projects...

(Well, that, and I'm running down, so note-taking will slow accordingly from here.)

11:00 conda: Deep Dive

Presenter: Kale Franz

  • conda Gross Anatomy: package/channel/prefix
    • package == .tar.bz2 w/ info/ directory containing:
      • recipe for building the package
      • index.json - package-global metadata, used for search and to build the channel's repodata.json (source of truth)
      • paths.json - per-path metadata for individual files in the package
      • link.json - (optional) package install-time metadata
      • repodata_record.json - written after extracting the tarball, contains package hash, source URL, etc.
    • channel - collection of packages at a URL (ex. https://repo.anaconda.com/pkgs/main)
      • contains subdirs by arch, or noarch, each contains a repodata.json file
      • channeldata.json - channel-level channel metadata, update info, package descriptions, icons, etc.
      • URL format: scheme, auth*, location, token*, name, subdir*, package_filename
        • * optional
        • File URLs work as well; use conda index to create a local channel
    • environment
      • conda-meta dir - complete history of environment

(Wait, what happened to prefix?)
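The channel URL format bullet above (scheme, auth, location, token, name, subdir, package_filename, with the starred parts optional) can be sketched as a plain string join. `join_channel_url` is my own illustrative helper, not a conda API:

```python
def join_channel_url(scheme, location, name, package_filename,
                     auth=None, token=None, subdir=None):
    """Compose a conda channel URL of the form
    scheme://[auth@]location[/t/token]/name[/subdir]/package_filename
    where auth, token, and subdir are the optional parts."""
    host = f"{auth}@{location}" if auth else location
    parts = [f"{scheme}://{host}"]
    if token:
        parts.append(f"t/{token}")
    parts.append(name)
    if subdir:
        parts.append(subdir)
    parts.append(package_filename)
    return "/".join(parts)

print(join_channel_url("https", "repo.anaconda.com", "pkgs/main",
                       "numpy-1.14.2-py36_0.tar.bz2", subdir="linux-64"))
# -> https://repo.anaconda.com/pkgs/main/linux-64/numpy-1.14.2-py36_0.tar.bz2
```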

"If I have Docker, why do I need conda?"

Answer: conda is a universal package management abstraction layer, so quit writing:

```
if [distro]
    yum install ...
else if [other distro]
    apt-get install ...
else ...
```

conda accommodates cattle and pets. You can have your immutable infrastructure, but still capture state and re-use/re-create later.

defaults metadata is your transaction log (so, rollback!), and is actively managed by conda.

Tricks:

  • You can import conda:
```python
import conda.cli.python_api
import conda.api
import conda.exports
```

see: conda/conda#7059

  • conda search --envs ...

    • Works per-user or searches all envs if sudo/root
  • Research: conda 4.4's path stacking

13:00 conda, Docker & Kubernetes: The Cloud-Native Future of Data Science

Presenter: Matthew Lodge

14:00 ?

Presenter:
