Don’t use pickle. Don’t use pickle. Don’t use pickle.
The problems with Python’s pickle module are extensively documented (and repeated). It’s unsafe by default: untrusted pickles can execute arbitrary Python code. Its automatic, magical behavior shackles you to the internals of your classes in non-obvious ways. You can’t even easily tell which classes are baked forever into your pickles. Once a pickle breaks, figuring out why and where and how to fix it is an utter nightmare.
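If you’ve never seen the “arbitrary code” part demonstrated, here’s a minimal sketch of it; nothing here is specific to any real codebase:

import os
import pickle

class Innocuous(object):
    # pickle asks objects how to reconstruct themselves; whatever callable
    # __reduce__ returns gets called at load time, no questions asked
    def __reduce__(self):
        return (os.system, ("echo hello from inside pickle.loads",))

payload = pickle.dumps(Innocuous())
pickle.loads(payload)  # runs the shell command: loading is executing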
Don’t use pickle.
So we keep saying. But people keep using pickle. Because we don’t offer any real alternatives. Oops.
You can fix pickle, of course, by writing a bunch of __setstate__ and __reduce_ex__ methods, and maybe using the copyreg module that you didn’t know existed, and oops that didn’t work, and it’s trial and error figuring out which types you actually need to write this code for, and all you have to do is overlook one type and all your rigor was for nothing.
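For the record, a single class’s worth of that code looks roughly like this, using __getstate__ and __setstate__ (a sketch with a made-up Widget class), and you get to repeat it for every class that ever ends up inside a pickle:

class Widget(object):
    def __init__(self, size):
        self.size = size

    def __getstate__(self):
        # pin down exactly what goes into the pickle, with a version stamp
        return {"version": 1, "size": self.size}

    def __setstate__(self, state):
        # and reconstruct from it; overlook one class and none of this helps
        assert state["version"] == 1
        self.size = state["size"]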
What about PyYAML? Oops, same problems: it’s dangerous by default, it shackles you to your class internals, it’s possible to be rigorous but hard to enforce it.
Okay, how about that thing Alex Gaynor told me to do at PyCon, where I write custom load and dump methods on my classes that just spit out JSON? Sure, you can do that. But if you want to serialize a nested object, then you have to manually call dump on it, and it has to not do the JSON dumping itself. There’s also the slight disadvantage that all the knowledge about what the data means is locked in your application, in code — if all you have to look at is the JSON itself, there’s no metadata besides “version”. You can’t even tell if your codebase can still load a document without, well, just trying to load it. We’re really talking about rolling ad-hoc data formats here, so I think that’s a shame.
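To make the nesting complaint concrete, here’s a sketch with made-up Wheel and Car classes; the names and methods are illustrative, not from the talk:

import json

class Wheel(object):
    def __init__(self, radius):
        self.radius = radius

    def dump(self):
        return json.dumps({"version": 1, "radius": self.radius})

class Car(object):
    def __init__(self, wheels):
        self.wheels = wheels

    def dump(self):
        return json.dumps({
            "version": 1,
            # Wheel.dump() hands back a JSON *string*, so embedding it directly
            # would give us JSON stuffed inside JSON; either we re-parse it here,
            # or Wheel has to grow a separate "give me a plain dict" method
            "wheels": [json.loads(w.dump()) for w in self.wheels],
        })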
But I have good news: I have solved all of your problems.
YAML
YAML has earned itself something of a bad rap, which is also a shame. YAML is actually a pretty great format, but it’s fighting an uphill battle. The YAML specification is clearly intended for implementors and is horrible as a reference, yet there is no reference guide for someone seeking to use YAML rather than implement it. The language bindings tend to be atrocious — PyYAML’s documentation is a single massive page on a Trac wiki, and both it and the Ruby implementation (and probably others) allow load to do arbitrary bad things. And JSON came along at just the right time to eat YAML’s lunch, even though JSON is utterly hostile to human beings.
But YAML has one particularly appealing feature that few data formats have: metadata. Every value in a YAML document has a type, and you can explicitly indicate those types within YAML. That is, when you see this:
- 1
- 2
- apple
It actually, canonically, means this:
!!seq [
  !!int "1",
  !!int "2",
  !!str "apple",
]
An identifier beginning with ! is called a tag, and it declares the type of the following value. Tags that begin with !! are used for YAML’s own native types, and occasionally co-opted by libraries like PyYAML, ahem.
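You can check this with PyYAML itself; the plain form and the fully-tagged form load to the same Python values (the only assumption here is that PyYAML is installed):

import yaml  # PyYAML

plain = yaml.safe_load("- 1\n- 2\n- apple")
tagged = yaml.safe_load('!!seq [!!int "1", !!int "2", !!str "apple"]')
assert plain == tagged == [1, 2, "apple"]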
Perhaps you already see where I’m going with this. Consider the Table example from Alex Gaynor’s talk. It has some serialization logic baked in, and looks like this:
import json

# v1: tables are always square
class Table(object):
    def __init__(self, size):
        self.size = size

    def dump(self):
        return json.dumps({
            "version": 1,
            "size": self.size,
        })

    @classmethod
    def load(cls, data):
        assert data["version"] == 1
        return cls(data["size"])
Which produces this JSON:
{
    "version": 1,
    "size": 25
}
That’s not much to go on to tell a casual reader that this is intended to be a table. But what if you could use YAML’s tagging and serialize it like this, instead:
!table
size: 25
Well now you can!
Introducing Camel
I’ve spent long enough telling people to roll their own serialization rather than use pickle, so I’ve rolled it for you. camel is a tiny library that wraps PyYAML, hides all its bad design decisions from you, and lets you register your own types in a useful way. Let’s try that table class again.
# v1: tables are always square
class Table(object):
    def __init__(self, size):
        self.size = size


from camel import CamelRegistry
my_types = CamelRegistry()


@my_types.dumper(Table, 'table', version=1)
def _dump_table(table):
    return dict(
        size=table.size,
    )

@my_types.loader('table', version=1)
def _load_table(data, version):
    return Table(data["size"])
Rather than being global state that’s entangled with the library itself, your serialization code is registered in an object that you can scope however you want. Then when the time comes to use it:
from camel import Camel
table = Table(25)
print(Camel([my_types]).dump(table))
!table;1
size: 25
Amazing. And you can, of course, nest custom objects arbitrarily deeply in collections or even each other, and all the right things should happen. You can even return a list or dict containing other custom types, as long as the return value itself is something YAML understands natively.
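For instance, here’s a quick sketch of the round trip with a plain list in the mix (this assumes Camel’s load method as the counterpart to dump):

from camel import Camel

camel = Camel([my_types])
doc = camel.dump([Table(25), Table(9)])   # custom objects inside a plain list
print(doc)
# expected to look something like:
# - !table;1
#   size: 25
# - !table;1
#   size: 9

tables = camel.load(doc)                  # the loaders registered above do the rest
assert all(isinstance(t, Table) for t in tables)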
Now if we change our class a bit:
# v2: tables can be rectangles
class Table(object):
    def __init__(self, height, width):
        self.height = height
        self.width = width
All we need to do is change our functions:
@my_types.dumper(Table, 'table', version=2)
def _dump_table(table):
    return dict(
        height=table.height,
        width=table.width,
    )


@my_types.loader('table', version=1)
def _load_table_v1(data, version):
    edge = data["size"] ** 0.5
    return Table(edge, edge)


@my_types.loader('table', version=2)
def _load_table_v2(data, version):
    return Table(data["height"], data["width"])
And use it the same way as before:
from camel import Camel
table = Table(5, 7)
print(Camel([my_types]).dump(table))
!table;2
height: 5
width: 7
But old data continues to work, too.
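Concretely, a document dumped by the v1 code still loads with the v2 registry; a sketch (again assuming load as the counterpart to dump):

old_doc = "!table;1\nsize: 25\n"
table = Camel([my_types]).load(old_doc)   # dispatches to _load_table_v1
assert (table.height, table.width) == (5.0, 5.0)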
I wrote some more extensive documentation on this already, so I’ll direct you that way rather than copy/paste all the examples.
Look at all these amazing benefits
It’s not a gigantic fucking security hole, because it doesn’t call arbitrary functions with arbitrary arguments or execute arbitrary code. It will only call the functions you give it. The only types recognized by default are dead simple Python types that map to built-in YAML types.
Your serialization code is explicit and versioned. It includes the names of your types, so anyone doing a major refactoring (and, hopefully, grepping around for where types are used) will easily find them. You can always tell exactly which types are sitting in data somewhere, so there are no pickle time bombs. And if you ever truly want to throw a type away, all you have to do is write a trivial loader that “loads” into a dummy object.
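That last trick might look something like this (a sketch, with a made-up DeadTable stand-in; only the loader decorator shown above is used):

class DeadTable(object):
    # stand-in for a type we no longer ship; old documents still load
    def __init__(self, **ignored):
        pass

@my_types.loader('table', version=1)
def _load_dead_table(data, version):
    return DeadTable(**data)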
You have to pick your own YAML tags, rather than having module paths baked into your data. That paves the way to sharing serialized objects with other languages, or even standardizing a small format.
This is all just functions working with Python objects, so you can write all the tests you want without having to care about YAML at all.
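For example, the v2 dumper and loader from above can be exercised directly, no YAML in sight (a minimal sketch):

def test_table_round_trip():
    dumped = _dump_table(Table(3, 4))
    assert dumped == {"height": 3, "width": 4}

    loaded = _load_table_v2(dumped, 2)
    assert (loaded.height, loaded.width) == (3, 4)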
And best of all, you won’t have to pay someone like me to spend the better part of two weeks fixing pickles in five-year-old database tables when you try to upgrade SQLAlchemy and discover that third party libraries don’t go out of their way to preserve pickle compatibility across four major versions!
Caveats
Python 2 str is serialized exactly the same way as Python 3 bytes. It’s not pretty. Sorry. Actually I’m not sorry, it’s 2015, what are you doing, use Unicode strings already.
Camel is backed by PyYAML, which is kinda slow and kinda memory-hungry. On the other hand, manual control over serialization means you’re much less likely to accidentally pickle a hundred kilobytes of configuration that some lazy-loaded property happened to point to, so maybe it all evens out.
You have to write some code. The horror. Trust me, it’s way better than the code you have to write to fix pickles after-the-fact.
This hasn’t actually been used in production yet — I haven’t had a need for this myself since leaving Yelp. But the entire library is a single file, less than 400 lines long. What could possibly go wrong?
Go use it already
It’s on PyPI, GitHub, and ReadTheDocs.
Also, I wrote a condensed guide to all of YAML’s syntax that’s hopefully much easier to digest than the spec. I’ve often wished such a thing existed, and now it does.
Enjoy! Let me know how it works for you.