Software design patterns for Machine Learning R&D

glimcat 13 years ago

This is all very good advice which will save you much stress and many wasted hours.

I would add:

Document any reference material you use, including the source and why you're including it. Cache any digital content, either in the project path or using a management tool like Zotero.

Keep a research log. Minimally, annotate your trials. Coming back even a week later and trying to figure out which trial was done on what hunch with what results is extremely time-consuming without this information.
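A minimal sketch of such a log, assuming one JSON object per line in a plain text file. The function names (`log_trial`, `read_log`) and the field names are made up for illustration and are not from the original post:

```python
import json
import time
from pathlib import Path


def log_trial(log_path, hunch, params, results):
    """Append one annotated trial to a JSON Lines research log."""
    entry = {
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "hunch": hunch,      # why this trial was run
        "params": params,    # settings used for the trial
        "results": results,  # whatever metrics came out
    }
    # Append mode, so earlier trials are never overwritten.
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def read_log(log_path):
    """Return all logged trials as a list of dicts, oldest first."""
    with Path(log_path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Calling `log_trial` once at the end of every run is enough. A week later, `read_log` (or just opening the file) shows which hunch went with which parameters and results.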

ballooney 13 years ago

The use of 'Patterns' in the title is getting some heat, but I think it's because his title 'Patterns for Research in Machine Learning' is a little play on 'Pattern Recognition and Machine Learning' (usually known as 'PRML' or 'Bishop'), a textbook by Chris Bishop that is arguably the bible of the field.

This is good advice, especially saving intermediate calculations to file, which can make iteration much faster. I have witnessed a lot of research students set a job running which will take about an hour, look at the results, say 'd'oh!', change one line of code in one of their functions and set the whole monolith running again, needlessly repeating about 55 minutes' worth of the hour's computations.
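A minimal sketch of saving an intermediate result to file, assuming the result can be pickled. The helper name `cached` and the example path are hypothetical:

```python
import pickle
from pathlib import Path


def cached(path, compute):
    """Load a result from `path` if it exists; otherwise compute and save it."""
    path = Path(path)
    if path.exists():
        # Reuse the result of an earlier run instead of recomputing it.
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

With something like `features = cached("cache/features.pkl", extract_features)`, where `extract_features` stands in for the slow step, only the stage whose code changed needs to run again. The cache file has to be deleted by hand when the code that produces it changes, since this sketch does no invalidation.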

bravura 13 years ago

I have a handful of these too. (Like ecesena, I think they're "best practices", not design patterns.)

It might be useful to put up a wiki so people can discuss. Even something simple and ugly like c2.

For example, handling hyperparameters is actually a topic in itself.

textminer 13 years ago

Please, please, please spend a day abstracting out commonly written functions to one place. Your quickly-written prototype code should not be slowed down by the 100th slightly different implementation of a text tokenizer or kNN classifier.
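As a rough illustration, the shared place can be as small as one module that every prototype imports. The module name `mlutils.py` and the tokenization rule below are assumptions chosen for the example, not a recommendation of a particular tokenizer:

```python
# mlutils.py (hypothetical shared module, imported by every experiment)
import re

# Runs of letters/digits, optionally followed by an apostrophe suffix ("don't").
_TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")


def tokenize(text, lowercase=True):
    """Split text into word tokens; the single tokenizer all prototypes share."""
    if lowercase:
        text = text.lower()
    return _TOKEN_RE.findall(text)
```

Prototypes then do `from mlutils import tokenize` instead of pasting in yet another variant, and a fix to the tokenizer reaches every experiment at once.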

smalieslami 13 years ago

This is very good advice.

It almost always makes sense to include Tom Minka's Lightspeed toolbox (http://research.microsoft.com/en-us/um/people/minka/software...) right from the beginning.

Also perhaps Netlab (http://www1.aston.ac.uk/eas/research/groups/ncrg/resources/n...) although it is beginning to get rather dated.

textminer 13 years ago

Hardest part of implementing a high-functioning production machine learning stack for me isn't the idea-articulating, prototyping, iterating, then polished refactoring. It's knowing when to go from a quick-producing language like Python/MATLAB/Julia to something painfully-written but smooth like C++ or into a scalable Elastic MapReduce or Mahout process (the former of which, sure, is language-agnostic).

You can only spend so much time optimizing memory and CPU time with smart data chunking, low-dimensional representations, or approximations. EC2 time and space are relatively cheap, but Python on a single machine with the multiprocessing module can only speed things up by a factor of < [# of cores]...
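That bound can be made concrete with Amdahl's law, which the comment does not name but which is the standard model for it: if a fraction p of the runtime can be parallelized over n workers, the speedup is 1 / ((1 - p) + p / n). A small sketch, with a hypothetical function name:

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Ideal speedup under Amdahl's law for a given parallelizable fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unaffected; the parallel part is divided among workers.
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)
```

The speedup equals the number of workers only when the whole job is parallelizable. Any serial portion (loading data, merging results) pulls it below the core count, and this idealized model ignores process start-up and data transfer overhead, which lower it further in practice.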

beambot 13 years ago

You should check out PiCloud [1] or MrJob [2]. Both of these seek to make MapReduce dead simple using Python. More importantly, you can do all your testing on a desktop PC. Then when you need real horsepower, you just tell it to "spin up on EC2."

Disclaimer: I have yet to use either, but I've heard good things.

[1] http://www.picloud.com/

[2] http://musicmachinery.com/2011/09/04/how-to-process-a-millio...

ecesena 13 years ago

Misleading title: best practices, not design patterns.

xaa 13 years ago

Great post! I have also found useful:

- Record the (Git) revision number of my code for each run.

- Use GNU make to manage the pipeline of downloading, training, evaluating, etc.
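For the first point, a minimal sketch of grabbing the revision from inside a run, assuming the `git` command-line tool is installed and the code lives in a Git checkout. The function name is hypothetical, and the fallback value "unknown" is a choice made for this example:

```python
import subprocess


def git_revision(repo_dir="."):
    """Return the current commit hash, or "unknown" if it cannot be determined."""
    try:
        completed = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or repo_dir is not inside a Git checkout.
        return "unknown"
    return completed.stdout.strip()
```

Writing the returned string next to each run's results ties every number back to the code that produced it. This does not detect uncommitted changes, so it is only reliable if you commit before running.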

karavelov 13 years ago

This is very sound advice. I have arrived at a very similar architecture in an LSI environment.

From my experience, here are some advantages of this architecture:

- stages can be rewritten independently, so you can prototype in a language that is fast to write (Perl in my case) and later rewrite whole stages or parts of them in a language that is fast to execute if you need extra performance (C, C++ here);

- you can easily integrate third-party software into your workflow - most of the existing tools in the field work with input and output files;

- you can reuse already written stages for different purposes - just pass them different options for input/output and parameters.
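A stage in this style can be sketched as a small program that reads one file, writes another, and takes its parameters as options. The filtering task, the option names, and the function names below are invented for illustration:

```python
import argparse


def run_stage(input_path, output_path, min_length=1):
    """Copy lines from input to output, dropping lines shorter than min_length."""
    kept = 0
    with open(input_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            if len(line.rstrip("\n")) >= min_length:
                dst.write(line)
                kept += 1
    return kept


def main(argv=None):
    """Parse command-line options and run the stage; returns lines kept."""
    parser = argparse.ArgumentParser(description="One file-in, file-out stage.")
    parser.add_argument("--input", required=True, help="file from the previous stage")
    parser.add_argument("--output", required=True, help="file for the next stage")
    parser.add_argument("--min-length", type=int, default=1)
    args = parser.parse_args(argv)
    return run_stage(args.input, args.output, args.min_length)
```

Because the only contract between stages is the file format, this stage could later be replaced by a C or C++ program that accepts the same options, or swapped for a third-party tool, without touching its neighbours.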

paulbunn 13 years ago

Was a bit disappointed - this is generally just good advice/best practice for any sort of programming task, not just machine learning R&D.

It is however very good advice taken in the correct context.