Sunday, October 14, 2012

Pickles in Python

In the last project for Software Engineering, Netflix, we were asked to predict the ratings different users would give to different movies, based on a large dataset of users, movies, and ratings. Sadly, I was forced to work alone on this project, and I knew I needed to be as efficient as possible to finish in time. The project called for several caches built from the data: average ratings per user, average ratings per movie, the standard deviation of each user's ratings from their mean rating, and each user's average rating per decade (the decade the movie was from). Needless to say, parsing all this data could have been a real chore if I were not careful, and not just parsing it, but also writing it out (caching it) in a format my top-level application could read later.

Recently at my job, the application I am building called for object serialization, which is basically a way for object instances in a running application to be "serialized" and written out through some kind of data stream, usually a file. I figured something had to exist in Python for object serialization, and that is how I found out about pickle, which is exactly that.

Now to the fun stuff...

Pickle basically allows us to write just about any built-in data structure to a file, like so:

import pickle
my_list = ['f', 'o', 'o', 'e', 'y']
# Open in binary mode; the with block closes the file for us.
with open('my_list.p', 'wb') as f:
    pickle.dump(my_list, f)

That's it! And it is an awfully nice way to store the caches for the Netflix project. We can simply read these serialized objects back in like so:

import pickle
# Open in binary mode; the with block closes the file for us.
with open('my_list.p', 'rb') as f:
    my_list = pickle.load(f)

Again, that's it! Immediately our list (or any other data structure we saved) is loaded right back into memory, ready to be used by the application. There is no ugly parsing of my own hacked-together data format. It is just the beauty of serialized objects, and me saving a ton of time.
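To tie this back to the Netflix caches, here is a minimal sketch of building one cache (average rating per movie) and round-tripping it through pickle. The ratings list, the average_by_movie helper, and the file name are all made up for illustration; the real project data looks different.

```python
import os
import pickle
import tempfile

# Hypothetical (user_id, movie_id, rating) rows standing in for the real dataset.
ratings = [(1, 10, 4), (2, 10, 2), (1, 20, 5), (3, 20, 3), (2, 20, 4)]

def average_by_movie(rows):
    """Return {movie_id: mean rating} for the given rows."""
    totals = {}
    for _user, movie, rating in rows:
        total, count = totals.get(movie, (0, 0))
        totals[movie] = (total + rating, count + 1)
    return {movie: total / float(count) for movie, (total, count) in totals.items()}

movie_averages = average_by_movie(ratings)

# Write the cache to a throwaway directory (hypothetical file name).
cache_path = os.path.join(tempfile.mkdtemp(), 'movie_averages.p')
with open(cache_path, 'wb') as f:
    pickle.dump(movie_averages, f)

# Later, the top-level application only needs to load it back.
with open(cache_path, 'rb') as f:
    restored = pickle.load(f)

print(restored == movie_averages)
```

The same pattern should work for the other caches (per-user averages, per-decade averages), since they are just dictionaries of numbers as well.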

Speaking of saving time, we can actually save even more time in the code by changing the import statement from:

import pickle

to:

import cPickle as pickle

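This imports the cPickle module instead of the standard pickle module. The difference is that cPickle is written in C and, according to the Python documentation, can be "up to 1000 times faster" than the pure-Python pickle. You lose out on some of the subclassability of the normal pickle (in cPickle, Pickler and Unpickler are functions rather than classes, so you cannot derive your own subclasses from them), but hey, you can't argue with 1000 times faster.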
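If the same code might run on an interpreter where cPickle is not available (Python 3 dropped the separate module and its pickle uses a C implementation automatically when it can), a fallback import is a common way to handle it. This is just a sketch; the data dictionary is a made-up example, and the in-memory buffer stands in for a real file.

```python
import io

try:
    import cPickle as pickle  # Python 2: the C implementation
except ImportError:
    import pickle  # Python 3: no cPickle module, plain pickle is already accelerated

# Hypothetical cache contents, for illustration only.
data = {'user_avg': {1: 3.5, 2: 4.0}}

# Dump to an in-memory binary buffer using the most compact protocol available.
buf = io.BytesIO()
pickle.dump(data, buf, pickle.HIGHEST_PROTOCOL)

# Rewind and load it back.
buf.seek(0)
roundtrip = pickle.load(buf)

print(roundtrip == data)
```

The rest of the code keeps calling pickle.dump and pickle.load either way, so nothing else has to change.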

Object serialization really is a beautiful thing!

Cheers
