[pycrypto] the sad state of pycrypto

Sun Nov 9 18:57:16 CST 2008

On Sun, Nov 9, 2008 at 4:01 PM, Dwayne C. Litzenberger <dlitz at dlitz.net> wrote:
> MD5 was _never_ collision-resistant; We just thought it was.  It's possible
> that MD5 is not safe for any purpose, and that we just currently think it
> is.  Maybe it's safe, and maybe not, but it's not a conservative choice for
> new applications.

Indeed. Those platitudes are true for all hash functions.

> Also, I'm not sure what security proof you're referring to, but see "Forgery
> and Partial Key-Recovery Attacks on HMAC and NMAC Using Hash Collisions":
> http://eprint.iacr.org/2006/319

I'm referring to <http://eprint.iacr.org/2006/043>, which I consider
to be more definitive. YMMV.

>> A hashed signature algorithm can use MD5 with no problems.
>
> I'm sure you don't mean that.

I'm sure I do.

> Any time someone signs a message provided
> by a third party (such as when certifying a computer program or when adding
> a digital timestamp to a document), the hash function they use needs to
> be collision-resistant.

...unless the signer can add as much randomness to the signature as
they want, which is exactly what a hashed signature algorithm does.
The advantage of a hashed signature algorithm is that it not only
compensates for less-than-expected collision resistance, it also makes
the signature as strong as the preimage resistance of the hash rather
than its collision resistance.
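
A minimal sketch of the idea, assuming the "randomness" is a fresh salt
that the signer prepends to the message before hashing. The function
names and the 16-octet salt are illustrative assumptions, not any
library's API, and real randomized-hashing schemes mix the salt in more
carefully than this. Only the digest step is shown; the signature itself
would be computed over the digest, and the salt sent along with it.

```python
import hashlib
import hmac
import os

SALT_LEN = 16  # assumed salt size in octets; not taken from any standard


def randomized_digest(message, salt=None):
    """Return (salt, MD5(salt || message)); the signer picks the salt."""
    if salt is None:
        salt = os.urandom(SALT_LEN)
    return salt, hashlib.md5(salt + message).digest()


def check_digest(message, salt, digest):
    """Recompute the salted digest on the verifier's side and compare."""
    expected = hashlib.md5(salt + message).digest()
    return hmac.compare_digest(expected, digest)


salt, digest = randomized_digest(b"message supplied by a third party")
assert check_digest(b"message supplied by a third party", salt, digest)
```

Because the third party cannot predict the salt, a colliding message
pair prepared in advance no longer collides once the salt is prepended.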

> No, RandomPool was safe if you used it correctly, which meant you had to
> feed it entropy from somewhere, and you had to monitor the entropy estimate.

So, in order to use it safely, you had to know more about randomness
than nearly any programmer does. I would not call that safe, but I
hear that you might. Compare that to MD5, which can be used safely if
you understand that 64 bits of collision resistance is not enough for
most applications, and that you will easily get less now that there is
a proven trivial attack.
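
For reference, the "64 bits" figure is just the birthday bound for a
128-bit output. A rough sketch of the arithmetic, assuming an ideal hash
(the helper name is made up for illustration, and the known attacks on
MD5 do far better than this generic bound):

```python
import math


def collision_probability(num_hashes, output_bits):
    """Birthday-bound approximation: 1 - exp(-k^2 / 2^(n+1))."""
    exponent = num_hashes * num_hashes / 2 ** (output_bits + 1)
    return -math.expm1(-exponent)


# About 2^64 hashes of an ideal 128-bit function give roughly a 39% chance
# of at least one collision; 2^32 hashes give a negligible chance.
p_64 = collision_probability(2 ** 64, 128)
p_32 = collision_probability(2 ** 32, 128)
```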

> I still think you're being overly optimistic.  Smart developers still make
> fatal mistakes with crypto, and I have empirical evidence to back that up:

Of course. Removing a hash function will not prevent that. Someone who
misuses MD5 is just as likely to misuse SHA-1, such as using it where
collision resistance needs to be more than 2^64. Or were you
intending to remove SHA-1 as well?

>    1. Zooko said:
>
>        "I happen to know a somewhat famous developer who once looked
>  through the Crypto++ API and chose DES-XEX without (I think)
>  realizing that it was DES-X and not Triple-DES."

If the Crypto++ API allows a user to give 192 bits of key and then
only use 64 bits, then the problem is mostly with the API. Also, note
the "I think" there.

>    2. RandomPool was misused---twice---in Paramiko.  See
> http://lists.dlitz.net/pipermail/pycrypto/2008q3/000000.html

You have already documented why this function was difficult to use.
Why blame the users for that?

>    3. A Google Code Search for RandomPool turned up a bunch of uses, none
>     of which were correct.

Ditto.

> Developers of crypto libraries are in a position to reduce the number of
> mistakes their downstream users accidentally make.  I intend to make full
> use of this ability. (But see below.)

That is a good thing, and I fully support you in that. Removing hash
algorithms that are in wide safe use (as well as wide unsafe use) may
not be the best way to do that, but it's your library.

>> If you really want the library to be in nanny mode, simply rename the
>> function from "MD5" to something like "idontwantyoutouseMD5". This is
>> a serious suggestion. Self-documenting function names are surprisingly
>> useful.
>
> Aside from the maintainability benefits, I don't want to drop algorithms
> that people need for legacy reasons, even if they would be well-advised not
> to use them in new applications.  That's why I like the policy idea instead
> of dropping or renaming modules.  That way, developers can make less
> conservative choices if they need to, but they'll be less likely to do so
> accidentally, and reviewers will have an easier time checking for these
> mistakes.

Sure. However, now you become the enforcer of policy. For hashes,
that's a very tricky position. Do you call SHA-1 not conservative,
even though every conservative CA in the world uses it for all their
certificates? Even if you know that there is an attack that reduces
its collision resistance to where MD5 was a few years ago? Even though
there is a preimage attack on it? (The last one is a bit of a red
herring, but it shows the difficulty of being the arbiter of
conservativeness.) Again, it's your library, so you get to make the
judgement calls, but what you think is obvious can be far from it, as
you have already discovered.

> On the other hand, I don't mind dropping algorithms that nobody actually
> uses.  It's not just about "nanny mode": Code no longer present is code I
> don't have to spend my limited time maintaining.

Fully agree here.

>  That's why I asked about
> MD2.  Do you know of anyone who uses PyCrypto who needs MD2 support?

I'm the wrong person to ask about that. In fact, probably everyone on
this list is the wrong person to ask about that. You have to ask every
user "do you use this", which is of course impossible. As library
maintainer, you can rip it out and see who screams.

> My policy is that if I think an algorithm is patent-encumbered, then it's
> not getting included into PyCrypto; If it's already included, then it gets
> dropped.

And now you get to define "encumbered"! :-)

> I agree that it's goofy.  I really don't see why a person couldn't just
> truncate an ordinary SHA-256/512 hash if they want "matched impedance",
> rather than also mucking about with the initial values.

...answered by your next sentence...

> If we want to avoid
> allowing someone to truncate an SHA-256 hash to make a valid 224-bit hash,
> then we can define separate hash functions like so:
>
>   H_256(m) := SHA-256("SHA-256" || m)
>   H_224(m) := SHA-256("SHA-224" || m)[:224]

...at a cost of one extra round of hashing for all messages that are 7
octets or less short of a block size boundary. Using a different IV
prevents that performance hit in those cases. It's a design tradeoff.