My brain-dump from attending April 8th - 12th, 2018.
Many (all?) sessions will be available in the Anaconda channel on YouTube.
Presenter: David Mertz
The notebooks will be interesting to play with later when I want flashbacks to my Regression Analysis courses.
Presenter: Stan Seibert
Slides and Notebooks: https://github.com/ContinuumIO/ac2018-dl-gpu
- GitHub Repo
- Notebook (Sessions will evaporate after the talk!)
- Style Transfer
Critical Questions for ML:
Part I
- What is the specific purpose of the model I am trying to create?
- What specific quantitative model output will help me achieve that purpose?
- Are there publications, blog posts, or conference talks describing a similar use case?
- What relevant data is available for the model? Can more be gathered?
- Is there additional domain knowledge that would be helpful for selecting training data?
- Are there experts in the data I can talk to?
Part II
- Are there security or ethical concerns associated with the training data and the future uses of the model?
- What accuracy has already been achieved in our organization, and how much improvement is required for success?
- Can the business benefits of the improved accuracy be quantified?
- How will I know the model is working once deployed?
Today's (rented from GCE) GPU: K80: 4992 CUDA cores, 4.29 TFlops (single precision), 480 GB/sec
Today's exercise: classifying the CIFAR10 image set
"One Hot Encoding" - encoding of categorical predictors. (Keras and Scikit-Learn have functions to automate this, e.g. `keras.utils.to_categorical`.)
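As a reminder of what one-hot encoding actually produces, here is a minimal NumPy sketch that mimics what `keras.utils.to_categorical` does (the labels and class count are made up for illustration):

```python
import numpy as np

def to_one_hot(labels, num_classes):
    """Encode integer class labels as one-hot rows (like keras.utils.to_categorical)."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes))
    # Set a single 1.0 in each row, at the column matching the label
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

to_one_hot([0, 2, 1], 3)
```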
Most important question in ML: How do I know when I'm done? (Pseudo-answer: "When accuracy stops improving.")
"Overfitting does not mean your model is bad. Most models will overfit after enough training epochs."
- Keras has a (new-ish) multi-GPU model that will automagically split the batches, recombine, then update the weights
- Since keras runs on top of TF, one can use TF functions to pin work to specific GPUs
- conda can be used to package a trained model!
- TensorRT: "Compiles" a trained model for production deployment
Presenter: Mike Bernico, @mikebernico
- All about transfer learning
- Universal truths: 1) the data is never good and 2) never trust a wine pairing
- His new book
- Conda runtime distribution is available for both Cloudera and Hortonworks environments! (AE only?)
- Integrates via Livy (Livy supports Kerberos with impersonation -- are we using that?)
- For Cloudera AE gets wired up as a parcel repo
Presenter: Siu Kwan Lam, @sklam (primary developer of Numba!)
- Basic API: just use the `@numba.jit` decorator!

```python
import numba
import numpy as np

@numba.jit
def mvmul(A, x):
    return A * x

A = np.random.rand(1000, 10000)
x = np.random.random(10000)
mvmul(A, x)
```

- Then he turned to a list of projects using or extending Numba:
- LibROSA
- Datashader
- Trippy
- `nb.jit(..., parallel=True)` with `for i in prange(...)` to easily parallelize
- No try/catch in numba code! (but you can catch in a normal Python wrapper)
- FPBIC
- FastParquet
- OAMap
- PyGDF
Presenter: Mahmoud Hashemi
- Packaging is: turning code into a deployable archive
- The 90/90 rule: "The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time." -- Tom Cargill, Bell Labs
- Python: `.py` (module) < dir w/ `__init__.py` (package) < 0+ packages (distribution)
- sdist == source distribution, bdist == binary distribution
- `python setup.py ... bdist_upload` (missed it)
- "Don't `pip install` to production!"
- Holy crap: pipsi
- His other talk
- conda-pack
- Research: Flatpak vs. Snappy (.deb vs. .rpm, round 2)
- OpenSky
- Useful `conda` flags to save space, time, and get what you want: `nomkl`, `clean`, `--no-channel-priority`
- Freezer-to-native packaging: constructor
- Slides can be found here
- Check out Hydrogen for Atom (sigh)
- `from sklearn.preprocessing import CategoricalImputer, LabelBinarizer`
- `pip install sklearn-pandas` for categorical predictors using `DataFrameMapper`
- `pip install cerberus` for data validation
- For CLIs: `conda install -c conda-forge fire`, then `Fire()` after import
- `import mummify`, then `mummify.log(...)` and use the mummify command-line tool to manage history (This is sweet, but I need to see what it actually does to git under the covers. Hint: it's using `git --git-dir=.mummify` voodoo to do the magical history work.)
Presenter: Peter Bushbacher
Uh-oh. He said "Tableau", and I tuned out...
Presenter: Kale Franz
- Find packages with `conda search ...` vs. http://www.anaconda.org/
- Show/use other channels with `conda config --show-channels`, `--add channels ...`, `--append channels ...`, or directly with `conda-forge::package-M.m.b`
- Learn: `conda skeleton`, `--noarch-python`, the `local::` meta-channel, `non_admin_enabled` (in `.condarc`)
- Thought: maybe I shouldn't have `base` in my `PATH`, and instead only activate to use it?
- Read: https://www.anaconda.com/blog/developer-blog/conda-configuration-engine-power-users/?lang=en-us
- Configuration cascades up: env > user `.condarc` > system `.condarc`, unless the `!final` flag is set.
Presenters: Joshua Patterson, @datametrician & Keith Kraus, @keithjkraus
Joshua:
- Search: "Mark Litwintschik's Blog" and "Billion Taxi Ride Benchmark"
- Apache Arrow as a common data layer FTW!
- This talk is all related to the GOAI: https://devblogs.nvidia.com/goai-open-gpu-accelerated-data-analytics/
- Repos: https://github.com/gpuopenanalytics
Keith:
- Watch: https://databricks.com/session/improving-python-and-spark-performance-and-interoperability-with-apache-arrow
- Spark 2.3 adds Arrow support for PySpark (and K8s?)
Presenter: Matt Rocklin
Slides: http://matthewrocklin.com/slides/anacondacon-2018.html#/
- Live Examples
- He strongly recommends we go back and view the other Dask talk -- "the main one" -- from earlier today when it hits YouTube!
- https://github.com/eriknw/dask-patternsearch
Presenter: Michael Grant
- From a "blank" state, a modeler should be able to: retrieve, install, trust or verify, and use a model
- Python API for bulk data: https://github.com/ContinuumIO?q=intake
15:10 Jumpstart: Writing Continuous Applications with Structured Streaming using Python APIs in Apache Spark
Presenter: Jules Damji, @2twitme
Slides: https://dbricks.co/anacondaConf2018
- Structured Streaming can operate in continuous mode instead of micro-batching
- `.trigger()` is checkpoint interval, not batch
- Sample notebook: https://dbricks.co/ss_mllib_py
Presenter: Wes McKinney
A good, if a bit dry, talk. (Though, he did warn us.) You had to be there.
- Projects powered by Arrow: https://arrow.apache.org/powered_by/
Presenter: Stan Seibert
- One good note: if installing Keras via `conda install keras-gpu`, the CUDA toolkit is not needed to take advantage of the GPU as long as a true Nvidia driver is being used. (The OSS clone included with Ubuntu will not work.) `conda` will install "enough of CUDA" with Keras.
Otherwise, this was mostly a showcase of ML/GPU projects...
(Well, that, and I'm running down, so note-taking will slow accordingly from here.)
Presenter: Kale Franz
- conda Gross Anatomy: package / channel / prefix
- package == `.tar.bz2` w/ an `info/` directory containing:
  - recipe for building the package
  - `index.json` - package-global metadata; used for search and to build the channel's repodata.json (source of truth)
  - `paths.json` - per-path metadata for individual files in the package
  - `link.json` - (optional) package install-time metadata
  - `repodata_record.json` - written after extracting the tarball; contains package hash, source URL, etc.
- channel - a collection of packages at a URL (ex. https://repo.anaconda.com/pkgs/main)
  - contains subdirs by arch, or `noarch`, each containing a `repodata.json` file
  - `channeldata.json` - channel-level metadata, update info, package descriptions, icons, etc.
  - URL format: scheme, auth*, location, token*, name, subdir*, package_filename (* = optional)
  - File URLs work as well; use `conda index` to create a local channel
- environment
  - `conda-meta` dir - complete history of the environment
(Wait, what happened to prefix?)
"If I have Docker, why do I need conda?"
Answer: conda is a universal package management abstraction layer, so quit writing:
```
if [distro]
    yum install ...
else if [other distro]
    apt-get install ...
else ...
```
conda accommodates cattle and pets. You can have your immutable infrastructure, but still capture state and re-use/re-create later.
The `conda-meta` metadata is your transaction log (so, rollback!), and is actively managed by conda.
Tricks:
- You can import conda:
  - `import conda.cli.python_api`
  - `import conda.api`
  - `import conda` exports - see: conda/conda#7059
- `conda search --envs ...` - works per-user, or searches all envs if sudo/root
- Research: conda 4.4's path stacking
Presenter: Matthew Lodge