@worthyl
Forked from postlmc/Notes.md
Created April 11, 2018 17:06
AnacondaCon 18

My brain-dump from attending April 8th - 12th, 2018.

Many (all?) sessions will be available in the Anaconda channel on YouTube.

Sunday Tutorials

Tutorial 1: Introduction to Machine Learning with scikit-learn

Presenter: David Mertz

The notebooks will be interesting to play with later when I want flashbacks to my Regression Analysis courses.

Tutorial 2: Practical Data Science with ML and GPUs

Presenter: Stan Seibert
Slides and Notebooks: https://github.com/ContinuumIO/ac2018-dl-gpu

Critical Questions for ML:

Part I

  • What is the specific purpose of the model I am trying to create?
  • What specific quantitative model output will help me achieve that purpose?
  • Are there publications, blog posts, or conference talks describing a similar use case?
  • What relevant data is available for the model? Can more be gathered?
  • Is there additional domain knowledge that would be helpful for selecting training data?
  • Are there experts in the data I can talk to?

Part II

  • Are there security or ethical concerns associated with the training data and the future uses of the model?
  • What accuracy has already been achieved in our organization, and how much improvement is required for success?
  • Can the business benefits of the improved accuracy be quantified?
  • How will I know the model is working once deployed?

Today's GPUs (rented from GCE): K80, 4992 CUDA cores, 4.29 TFLOPS (single precision), 480 GB/sec memory bandwidth
Today's exercise: classifying the CIFAR10 image set

Most important question in ML: How do I know when I'm done? (Pseudo-answer: "When accuracy stops improving.")

"Overfitting does not mean your model is bad. Most models will overfit after enough training epochs."
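The pseudo-answer above ("when accuracy stops improving") is usually implemented as patience-based early stopping. A minimal sketch in plain Python; the loss values and the `patience` threshold here are made-up illustrations, not anything from the talk:

```python
def early_stop(val_losses, patience=3):
    """Return the epoch index at which training should stop: the first
    epoch where the validation loss has failed to improve for
    `patience` consecutive epochs. Returns None if we never stop."""
    best = float("inf")
    since_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            since_improvement = 0
        else:
            since_improvement += 1
            if since_improvement >= patience:
                return epoch
    return None

# Loss improves, then plateaus: stop after 3 epochs without improvement.
losses = [0.9, 0.7, 0.5, 0.51, 0.52, 0.53, 0.49]
print(early_stop(losses, patience=3))  # -> 5
```

Frameworks like Keras ship the same idea as a callback, but the logic is just this counter.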

  • Keras has a (new-ish) multi-GPU model that will automagically split the batches, recombine, then update the weights
  • Since keras runs on top of TF, one can use TF functions to pin work to specific GPUs
  • conda can be used to package a trained model!
  • TensorRT: "Compiles" a trained model for production deployment

Day Two

11:00 Deep Learning with Just a Little Bit of Data

Presenter: Mike Bernico, @mikebernico

  • All about transfer learning
  • Universal truths: 1) the data is never good and 2) never trust a wine pairing
  • His new book

13:00 Deploying Python and R to Spark and Hadoop

  • Conda runtime distribution is available for both Cloudera and Hortonworks environments! (AE only?)
  • Integrates via Livy (Livy supports Kerberos with impersonation -- are we using that?)
  • For Cloudera AE gets wired up as a parcel repo

14:00 Accelerating Scientific Workloads with Numba

Presenter: Siu Kwan Lam, @sklam (primary developer of Numba!)

  • Basic API: Just use the @numba.jit decorator!
```python
import numba
import numpy as np

@numba.jit
def mvmul(A, x):
    return A @ x  # matrix-vector product; plain `A * x` would broadcast elementwise

A = np.random.rand(1000, 10000)
x = np.random.random(10000)
mvmul(A, x)
```
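A matrix-vector product like `mvmul` above is just a dot product per row, which is exactly the kind of loop Numba compiles to machine code. A pure-Python sketch of the same computation (no NumPy or Numba required), useful for sanity-checking results on tiny inputs:

```python
def mvmul_py(A, x):
    """Matrix-vector product over plain nested lists: y[i] = sum_j A[i][j] * x[j]."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

A = [[1, 2], [3, 4]]
x = [10, 1]
print(mvmul_py(A, x))  # -> [12, 34]
```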

He then turned to a list of projects using or extending Numba.

15:10 Production-Grade Packaging with Anaconda

Presenter: Mahmoud Hashemi

  • Packaging is: turning code into a deployable archive
  • The 90/90 rule: "The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time." -- Tom Cargill, Bell Labs
  • Python: .py (module) < dir w/ __init__.py (package) < 0+ packages (distribution)
  • sdist == source distribution, bdist == binary distribution
  • python setup.py ... bdist_upload (missed it)
  • "Don't pip install to production!"
  • Holy crap: pipsi
  • His other talk
  • conda-pack
  • Research: Flatpak vs. Snappy (.deb vs. .rpm, round 2)
  • OpenSky
  • Useful conda flags to save space and time, and get what you want: nomkl, clean, --no-channel-priority
  • Freezer to native packaging: constructor
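The module < package < distribution hierarchy from the bullets above can be demonstrated with the standard library alone; the `demo_pkg` name is made up for illustration:

```python
import sys
import tempfile
from pathlib import Path

# A package is just a directory containing an __init__.py.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
pkg.mkdir()
(pkg / "__init__.py").write_text("VERSION = '0.1'\n")
# A module is a single .py file -- here, one living inside the package.
(pkg / "core.py").write_text("def greet():\n    return 'hello'\n")

sys.path.insert(0, str(root))
import demo_pkg
from demo_pkg import core

print(demo_pkg.VERSION, core.greet())  # -> 0.1 hello
```

A distribution (sdist/bdist) is then just an archive bundling zero or more such packages plus metadata.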

16:10 Data Engineering for Data Science

  • Slides can be found here
  • Checkout Hydrogen for Atom (sigh)
  • from sklearn.preprocessing import CategoricalImputer, LabelBinarizer
  • pip install sklearn-pandas for categorical predictors using DataFrameMapper
  • pip install cerberus for data validation
  • For CLIs: conda install -c conda-forge fire, then Fire() after import
  • import mummify, then mummify.log(...) and use the mummify command line tool to manage history (This is sweet, but I need to see what it actually does to git under the covers. Hint: it's using git --git-dir=.mummify voodoo to do the magical history work.)

Day Three

10:00 Building Better Bad-ass Cars

Presenter: Peter Bushbacher

Uh-oh. He said "Tableau", and I tuned out...

10:00 conda: Tips and Tricks

Presenter: Kale Franz

11:00 GPU-Accelerating UDFs in PySpark with Numba and PyGDF

Presenters: Joshua Patterson, @datametrician & Keith Kraus, @keithjkraus

Joshua:

Keith:

13:00 Real-Time Processing with Dask

Presenter: Matt Rocklin
Slides: http://matthewrocklin.com/slides/anacondacon-2018.html#/

14:00 Model Management with Anaconda Enterprise

Presenter: Michael Grant

15:10 Jumpstart Writing Continuous Application with Structured Streaming using Python APIS in Apache Spark

Presenter: Jules Damji, @2twitme
Slides: https://dbricks.co/anacondaConf2018

  • Structured Streaming can operate in continuous mode instead of micro-batching
  • .trigger() is checkpoint interval, not batch
  • Sample notebook: https://dbricks.co/ss_mllib_py

16:10 Apache Arrow: A Cross-Language Development Platform for In-Memory Data

Presenter: Wes McKinney

A good, if a bit dry, talk. (Though, he did warn us.) You had to be there.

Day Four

10:00 Accelerating Deep Learning with GPUs

Presenter: Stan Seibert

  • One good note: if installing Keras via conda install keras-gpu, CUDA toolkit is not needed to take advantage of the GPU as long as a true Nvidia driver is being used. (The OSS clone included with Ubuntu will not work.) conda will install "enough of CUDA" with Keras.

Otherwise, this was mostly a showcase of ML/GPU projects...

(Well, that, and I'm running down, so note-taking will slow accordingly from here.)

11:00 conda: Deep Dive

Presenter: Kale Franz

  • conda Gross Anatomy: package/channel/prefix
    • package == .tar.bz2 w/ info/ directory containing:
      • recipe for building the package
      • index.json - package-global metadata, used for search and to build the channel's repodata.json (source of truth)
      • paths.json - per-path metadata for individual files in the package
      • link.json - (optional) package install-time metadata
      • repodata_record.json - written after extracting the tarball, contains package hash, source URL, etc.
    • channel - collection of packages at a URL (ex. https://repo.anaconda.com/pkgs/main)
      • contains subdirs by arch, or noarch, each contains a repodata.json file
      • channeldata.json - channel-level channel metadata, update info, package descriptions, icons, etc.
      • URL format: scheme, auth*, location, token*, name, subdir*, package_filename
        • * optional
        • File URLs work as well; use conda index to create a local channel
    • environment
      • conda-meta dir - complete history of environment

(Wait, what happened to prefix?)
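The channel URL format bullet above (scheme, auth, location, token, name, subdir, package_filename, with the starred parts optional) can be sketched as a plain string join. `join_channel_url` is my own illustrative helper, not a conda API:

```python
def join_channel_url(scheme, location, name, package_filename,
                     auth=None, token=None, subdir=None):
    """Compose a conda channel URL of the form
    scheme://[auth@]location[/t/token]/name[/subdir]/package_filename
    where auth, token, and subdir are the optional parts."""
    host = f"{auth}@{location}" if auth else location
    parts = [f"{scheme}://{host}"]
    if token:
        parts.append(f"t/{token}")
    parts.append(name)
    if subdir:
        parts.append(subdir)
    parts.append(package_filename)
    return "/".join(parts)

print(join_channel_url("https", "repo.anaconda.com", "pkgs/main",
                       "numpy-1.14.2-py36_0.tar.bz2", subdir="linux-64"))
# -> https://repo.anaconda.com/pkgs/main/linux-64/numpy-1.14.2-py36_0.tar.bz2
```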

"If I have Docker, why do I need conda?"

Answer: conda is a universal package management abstraction layer, so quit writing:

```
if [distro]
    yum install ...
else if [other distro]
    apt-get install ...
else ...
```

conda accommodates cattle and pets. You can have your immutable infrastructure, but still capture state and re-use/re-create later.

defaults metadata is your transaction log (so, rollback!), and is actively managed by conda.

Tricks:

  • You can import conda:
```python
import conda.cli.python_api
import conda.api
import conda.exports
```

see: conda/conda#7059

  • conda search --envs ...

    • Works per-user or searches all envs if sudo/root
  • Research: conda 4.4's path stacking

13:00 conda, Docker & Kubernetes: The Cloud-Native Future of Data Science

Presenter: Matthew Lodge

14:00 ?

Presenter:
