
Misusing Machine Learning for Fun

You can use a Decision Tree to compress an image. There are libraries to do this kind of thing properly, but I wanted to see what a Decision Tree “sees” when it looks at an image.

The training input is each pixel’s x and y position plus the channel number. The target is the value for that pixel and channel. I am not doing a test/train split because I’m not looking to build a model to predict anything; I’m looking to compress something.

And it worked. Here’s what it looks like for the black and white image:

[Animated GIF: the output of different decision tree depths, going from completely blank to a detailed representation of the Van Gogh self portrait.]
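A minimal sketch of the setup described above, assuming scikit-learn's DecisionTreeRegressor and a small synthetic grayscale image in place of the Van Gogh portrait. The post does not say which library or settings were used, and the helper names image_to_training_data and compress are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def image_to_training_data(image):
    """Turn a (height, width, channels) array into (x, y, channel) rows and pixel-value targets."""
    height, width, channels = image.shape
    ys, xs, cs = np.meshgrid(
        np.arange(height), np.arange(width), np.arange(channels), indexing="ij"
    )
    features = np.column_stack([xs.ravel(), ys.ravel(), cs.ravel()])
    targets = image.ravel().astype(float)
    return features, targets


def compress(image, max_depth):
    """Fit a tree on every pixel (no test/train split) and return its reconstruction."""
    features, targets = image_to_training_data(image)
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    tree.fit(features, targets)
    return tree.predict(features).reshape(image.shape)


# Synthetic stand-in image (an assumption): a horizontal gradient with a bright square.
image = np.tile(np.linspace(0.0, 200.0, 32), (32, 1))[:, :, np.newaxis]
image[8:16, 8:16, 0] = 255.0

# Mean absolute reconstruction error at a few depths; None lets the tree grow fully.
errors = {
    depth: float(np.abs(compress(image, depth) - image).mean())
    for depth in (1, 4, 8, None)
}
for depth, error in errors.items():
    print(f"max_depth={depth}: mean absolute error {error:.2f}")
```

Because every (x, y, channel) row is unique, a fully grown tree reproduces the input exactly; the interesting pictures come from capping the depth, which forces the tree to approximate the image with a limited number of rectangular regions.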

And if you want to see a video where I go through a bit more of an explanation check it out here:

{ 0 comments }

So you wanna Data Science…

Plenty of blog posts describe what their authors think is useful to know about getting into Data Science. This one is mine.

This is aimed at people who are outside of the tech industry, so tailor my advice to your own situation. For my own history, I went through the Insight Data Science Fellowship, which I highly recommend, but it isn’t feasible for many people for different reasons (including the limited number of slots available).

I tried to limit the number of links to resources to keep my suggestions manageable. If you feel that I have missed an incredible resource that is worthy of dislodging a link on this post, please let me know (both your link and what you think it should replace). Almost everything that I link below has a free option (for example, you can audit the online courses). Finally, this blog post is organized into two overarching sections: things to focus on and things to avoid.

Things to focus on

Each of the following subsections is worth enough of your time to get reasonably good at it. Think of your overall Data Scientist score as being roughly a multiplication of each of these skills: a zero anywhere will kill you, and it doesn’t take years of effort to get to some minimal level in each of them.
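To make the multiplication idea concrete, here is a tiny sketch with made-up skill levels on a 0 to 10 scale. The numbers, the scale, and the overall_score helper are assumptions for illustration, not a real scoring system.

```python
import math

# Hypothetical 0-10 skill levels, for illustration only.
balanced = {"blog": 4, "coding": 4, "machine learning": 4, "statistics": 4, "sql": 4}
lopsided = {"blog": 0, "coding": 10, "machine learning": 10, "statistics": 0, "sql": 0}


def overall_score(skills):
    """Multiply the skill levels together, so a zero anywhere zeroes the total."""
    return math.prod(skills.values())


print(overall_score(balanced))  # 1024
print(overall_score(lopsided))  # 0
```

Both profiles add up to the same total effort (20 points), but only the balanced one survives the multiplication.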

A Data Science Blog

I don’t have a Data Science blog, so don’t look at this website as an example. However, if you are not currently working in Data Science, I recommend starting a Data Science blog that chronicles your journey as you learn about the many aspects of this field. I found an article that argues this point very nicely, and I agree with most everything David writes in that post.

My short version: grab a data set, do some analysis, and post it on Github pages in a transparent and reproducible manner. Here are two links that look like they have good advice about how to technically set up just such a thing: JupyterSite and DataQuest’s how to set up a data science blog.

Aim for posting a new analysis once a month. After 6 months, you will look up and have 6 posts to show for your work. Ask for critical feedback on your analysis (feel free to ping me). This will tighten the feedback loop and help you get better, faster. The exercise also gives you practice in many of the skills that Data Science requires.

Coding/Computer Science

Many interviewers revert to what they know to interview you. For example, a statistician or a computer scientist might ask you questions that they find trivial or foundational to their work while being almost impossible for you to know. It’s likely, in a tech environment, you’ll get someone at some point in your interview process who is (or was) a software engineer.

The long and the short of it: you need to know how to program, which means writing and reading code that other people use and consume.

What programming language should I learn? Python or R or something else?

The short answer: if you already know either Python or R pretty well, it’s probably better to strengthen the one you already know. Don’t pick a language other than Python or R. If you’re a C/C++/Java/Matlab programmer, you’ll probably have a better time of it with Python. If you still haven’t decided between them, I’d suggest Python. For computer science/programming interviews, I definitely recommend Python over R. That being said, there are times when R is the correct choice for an analysis. Again, it’s more important to pick one and get strong at it than to pick the “right one”. Finally, don’t try to level up on both at the same time; it’s not a good use of your time at this stage (or the next stage).

The following courses are what I recommend going through:

  1. Jake VanderPlas – Python Data Science Handbook — Has excellent explanatory Jupyter Notebooks. Work through the whole thing.
  2. Peter Norvig’s course on Design of Computer Programs — what to say except: reading Norvig’s treatment of almost any problem is a worthwhile way to spend your time. He has many Notebooks on many different kinds of problems that he solves.
  3. Jupyter Notebooks in general. Might not be super-useful on its own, but check out the notebooks in my free Github repo here: Jupyter Notebook Tips and Tricks. Or my older and somewhat dated O’Reilly course (not free) here: Jupyter Notebook for Data Science Teams.
  4. Algorithms: Khan Academy’s course, and Coursera’s course.
  5. Learn to use git for version control semi-competently.

Machine Learning

A field of study that has many useful techniques for solving Data Science problems. It’s very useful to know the breadth of what’s available so you know what to reach for when confronted with a problem.

  1. Machine Learning course – Andrew Ng at Coursera. He teaches the class in Octave; feel free to follow along in Octave, or challenge yourself to do as much of it in Python as you can.
  2. Introduction to Machine Learning with Python by Andreas Mueller and Sarah Guido.
  3. I’ve heard good things about Machine Learning – Paisley but haven’t taken it myself.

Statistics

A field of study that is foundational to many aspects of Data Science. Unfortunately, I don’t feel that I have an extensive list of statistics resources to choose from, but these two are good and useful:

SQL

SQL is simply required. If you have spent a fair bit of time programming in python, the first time you encounter SQL it feels like a bit of a mind-bending language; it certainly was for me. I recommend the following tutorials:

Interview Preparation

Practice on a physical whiteboard. Seriously: practice answering questions on a whiteboard, out loud, to someone else. Have a friend quiz you with the SQL Lagunita exercises and have them type what you’ve written into the answer box to check your answers. Explaining aloud while writing code on a whiteboard is a weird but quickly learnable skill.

What to look for in your first Data Science job

Look for an experienced data science team to make you a much better data scientist. That is, make sure that they will provide feedback and mentorship as a key part of how they work. You want to optimize your learning rate in the first couple of years.

A final recommendation: the book Lean Analytics will help you get your head around the kinds of questions you should be asking in business.

Things to avoid working on

Here are a few things that you should either avoid entirely or severely limit your time on. Each of the things that I list here is a very valuable subject, and that’s what makes them dangerous, tempting, and hard to recognize as things to avoid. The long and the short of it: these are very tempting ways to spend far too much of your time, which can be better spent on the above. Once you have your first Data Science job, feel free to indulge.

Don’t spend time on Deep Learning

There are awesome tutorials for deep learning. I’m not going to put links here because this should wait in my opinion until you have the breadth of things above covered. It’s a tool to add to your toolbelt once you start working at your first Data Science job.

Don’t spend much time on Kaggle Competitions

Spend at most 20% of your preparation time on them. You can learn a lot about one aspect of Data Science through Kaggle competitions, but you need much more than Kaggle experience.

Don’t learn hard (not Python/R based) visualization libraries (D3, etc.)

I made this mistake myself when I was a postdoc preparing for my future Data Science career. I had heard about, and seen with my own eyes, the amazing visualizations that D3 lets you create. I spent a couple of months trying to get up to speed with it, and I think it was pretty much entirely a waste. The only exception to this rule is if you are more interested in becoming a Data Visualization Engineer than a Data Scientist.

Hope that helps! If you find your experience is different or you find other resources that you like better please let me know! Best of luck.

{ 0 comments }

2016 Year in Review

As I look back on 2016, I’m amazed at both how quickly the year went and yet how much happened.

Some of my major life highlights:

Personal

  • Julija and I celebrated our marriage with a reception in beautiful Vilnius, Lithuania.
  • Thanksgiving with my parents + Jax/Derek in Monterey, CA.
  • Spent Christmas holidays with Julija’s family in Lithuania.
  • Spent New Year’s in New York City.

Professional

Creative

Recommended Books Read This Year

Fiction

Non-fiction

Places that I spent time this year

[Image: flight animation]

  • Dallas, TX, USA — (family/friends)
  • Mountain View, CA, USA — where I live and work
  • San Diego, CA, USA — SVDS client that took me there throughout the year
  • Berkeley, CA, USA — AstroHackWeek
  • Maui, HI, USA & Kauai, HI, USA — Honeymoon!
  • Salt Lake City, UT, USA — First time to visit! Met up with Lauri and Vaiva and their brand new kid WHATSHISNAME.
  • Copenhagen, DNK — Day trip on the way to Vilnius.
  • Vilnius, LTU — Our wedding reception.
  • Riga, LVA — Xmas eve and day.
  • NYC, USA for New Year’s Eve, Rockettes Christmas Spectacular, The Present
  • Jacksonville, FL, USA — Katie/Alex Mendez wedding.
  • Monterey, CA, USA — Thanksgiving w/ my parents, and Jax & Derek.

Looking forward to in the new year

[Image: concentric circles]

{ 0 comments }

I gave a talk this month at PyData San Francisco 2016. The title of my talk was Mental Models to Use and Avoid as a Data Scientist.

A highlight of my week was Travis Oliphant (creator of scipy/author of numpy/CEO of Continuum) tweeting about it:

@teoliphant

My talk

Extra Materials

The source material and some code links can be found here, and the slides can be found here.

{ 1 comment }

I am now an O’Reilly Author!

My O’Reilly screencast course is now available: Jupyter Notebook for Data Science Teams! I’ve been working on this project for many months, and I’m so happy that it is finally complete.

Jupyter notebook data science usage tips

You might have seen me give an interactive talk on tips and tricks for using the Jupyter notebook (at OSCON 2015, at UC Berkeley’s Master of Information and Data Science program, at Insight Data Science, or elsewhere). Those talks focused on exploratory data science and general Jupyter notebook usage patterns. This screencast course covers some of that ground: how to use the Jupyter notebook efficiently, plus tips for a bunch of useful extensions. However, there’s also a lot about collaborating on Jupyter notebooks with other data scientists under version control, and some focus on sharing notebooks with project managers or others who might not have Python installed.

Please let me know if you end up getting Jupyter Notebook for Data Science Teams or recommending it to your companies for internal training! I’d love to hear what you think of it!

{ 3 comments }

2015 Year in Review

Another post to reflect on all that’s happened in the past year! I’m starting to write this post while on a flight to Maui for my honeymoon with my lovely wife sleeping next to me. What a difference a year makes!

Some of my major life events:

Books read that I’d recommend — with Amazon affiliate links

Fiction

  • The Martian by Andy Weir
    • A well-paced and excellent story + lots of science. The audiobook edition is great.
  • The City & The City by China Mieville
    • Recommended to me by Eric White. A very different but imaginative detective story.

Non-fiction

  • Diplomacy by Henry Kissinger
    • A serious read. Such an interesting take on the history of diplomacy in Europe right up through the early 90s. The twists and turns of Kissinger’s read on Americans from before WWI through today are very thought provoking.
  • Islam and the Future of Tolerance by Sam Harris and Maajid Nawaz
    • An important discussion that needs to be had more frequently and more publicly.
  • Superforecasting by Philip Tetlock and Dan Gardner
    • Fascinating read on how everyday people can become much better at predicting future events (think questions like: will Assad give up power by April 3, 2016). I wish it was a bit more how-to in the end.
  • The $12 Million Stuffed Shark by Don Thompson
    • A good read, together with the next book, about the high-end art world. It’s almost too incredible to believe.
  • Seven Days in the Art World by Sarah Thornton
    • Same as above. If you’re interested in how the art world works, in broad strokes, read these two books together.

Places that I visited

  • Dallas, TX, USA — (family/friends)
  • Mountain View, CA, USA — where I live and work
  • Portland, OR, USA — first consulting trip
  • Amsterdam, NLD — (week in Amsterdam to watch Julija get her PhD cum laude and meet her family!)
  • Austin, TX, USA — SciPy 2015 conference (poster)
  • Portland, OR, USA — OSCON 2015 conference (talk)
  • San Francisco, CA, USA — Got married!
  • Yosemite, CA, USA — Upper Yosemite falls is quite the hike.
  • Maui, HI, USA & Kauai, HI, USA — Honeymoon!

Side projects

Looking forward to what the next year will have in store! The initial plans call for my first trip to Lithuania!

{ 0 comments }

I just wanted to post the guts of a script that Colin Higgins (fellow Data Scientist at SVDS) wrote.

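# step1 -- download the Miniconda installer for macOS
wget https://repo.continuum.io/miniconda/Miniconda-latest-MacOSX-x86_64.sh

# step2 -- make the installer executable
chmod +x Miniconda-latest-MacOSX-x86_64.sh

# step3 -- run the installer; have to type spacebar and "yes"
./Miniconda-latest-MacOSX-x86_64.sh

# step4 -- reload the shell configuration so that conda is on the PATH
source ~/.bashrc

# step5 -- update conda itself
conda update conda -y

# step6 -- create the anaconda_r environment with R, the R kernel, and Anaconda
conda create -y -n anaconda_r -c r r-irkernel r-recommended r-essentials anaconda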

Now, switch into the anaconda_r environment (which will prepend to your PATH in that one terminal ONLY) with:

source activate anaconda_r

and install extra packages like so:

conda install -c r rpy2 -y

This made it so that the R kernel, the Python kernel, and the rpy2 package were all working in the same environment (my previous blog post was a temporary stop-gap that couldn’t get there).

{ 1 comment }

Here is an excellent talk by Michael Manapat at the PyData Seattle 2015 conference. I hope that this style of talk, one that really digs deep with specific examples, becomes more common!

Michael Manapat: Counterfactual evaluation of machine learning models

The slides can be found here, and the paper that it’s partially based on is here.

{ 0 comments }

Jupyter Notebook Best Practices for Data Science

I gave a talk on Friday (July 24) at the 2015 OSCON in Portland, OR. My topic was on the IPython (Jupyter) Notebook for Data Science, and it highlighted a number of challenges that come from needing to organize a data science workflow — especially in the context of working on a team of data scientists.

The video of my talk (not available just yet) is below:

I had a great time and I hope people find it useful. The github repository for my talk.

{ 3 comments }

2014 in Review

Thanks to everyone who helped make this past year great. I’ve been incredibly fortunate to have people who have helped support me in all of my adventures. Below are a few highlights from 2014!

Some of my major life events:

  • Finished my 3 year postdoc under Michael Murphy at Swinburne University of Technology.
  • Published a paper with Michael that was the culmination of years of work. The pdf is here if curious.
  • Moved from Australia to the San Francisco Bay Area.
  • Brought Julija home to meet the parents over Thanksgiving.
  • Completed the Insight Data Science program.
  • Started at SVDS as a Data Scientist!

Places that I visited (and spent at least two nights this year)

  • Dallas, TX, USA (family/friends)
  • Washington, D.C., USA (AAS)
  • Phoenix, AZ, USA (visit Stephanie/Kelsey)
  • San Diego, CA, USA (talk at UCSD)
  • Melbourne, AUS (postdoc life, Marc visited!)
  • Hobart, AUS (Dave!)
  • Paris, FRA (week in Paris)
  • Amsterdam, NLD (week in Amsterdam –Julija!)
  • Zurich, CHE (Kern!)
  • Glasgow, GBR (IMAX Glasgow)
  • Cambridge, GBR (talk at Cambridge)
  • Berlin, DEU (photos)
  • Potsdam, DEU (visit at Potsdam University)
  • Sydney, AUS (Harley Wood Winter School)
  • Palo Alto, CA, USA (Insight Data Science)
  • Dallas, TX, USA (family/friends)
  • Mountain View, CA, USA (started work at SVDS)
  • Dallas, TX, USA (family/friends)

Fun final list:

Seasons experienced this year (in order)

  • Winter
  • Summer
  • Autumn
  • Spring
  • Summer
  • Winter
  • Summer
  • Autumn
  • Winter
{ 0 comments }