My self is steam

Insights into computer security, programming and math

December 11, 2018
Finding Duplicate RSA Moduli in the Wild

Or, programming in the way of Diogenes

Let's imagine one is assigned the problem of finding duplicate RSA moduli in all publicly available SSL certificates on the Internet, the reason being a census of public hosts sharing the same modulus.

Most of the time, if not always, having the same modulus equates to saying that the certificates share the same public key, the latter being the tuple composed of the public exponent and the modulus. Indeed, the public exponent is often chosen between the values 3, 17 and 2^16 + 1, as this leads to fast exponentiation operations. Therefore, under this conditions, it is the modulus to be responsible for the uniqueness of the public key.

In this setting, the certificates fall into three distinct sets that, from a security standpoint, delineate a simple yet interesting threat model.

The first one is the set of all the certificates for which no duplicates occur; the second is the set of all the certificates with shared moduli/public keys belonging to the same organization. The third and most interesting one is the set of all those duplicates that appear to belong to different, unrelated actors.

The assumptions under which the latter two sets are meant, which hence form the basis of the model, are those where an entry in the third set would represent a concrete threat to those parties whose modulus is non-unique; while an entry in the second set, although not representative of best-practices, could be more easily justified as the result of key or certificate reusage, a common custom on the Internet.

Said concrete threat is grounded essentially on one single theorem by Dan Boneh, from his famous paper Twenty Years of Attacks on the RSA Cryptosystem:

Theorem 1: Given the private key d, one can efficiently factor the modulus N=p*q

A complete proof can be found here.

A good counter argument to the efficacy of the attack arises when one realizes that, generally, the compromission of a private key in a standard SSL setting implies the compromission of the modulus as well, provided that the two lie on the same machine intruded by the attacker. Indeed, a private key as generated by OpenSSL, for example, does not refer to the sole private exponent d but contains as well the two factors of the modulus, P and Q.

Theorem 1 is, in fact, more useful when seen in relation to the following corollary:

Attack 1.1: To avoid generating a different modulus N for each user, a trusted authority may wish to fix N and provide user i with a unique pair ei, di from which user i forms public key (N, ei) and private key (N, di). [...] By Theorem 1 Bob can use his own exponents eb, db to factor the modulus N. Once N is factored, Bob can recover Alice's private key from her public key ea. This observation shows that an RSA modulus should never be used by more than one entity.

The last passage is essential to our threat model: we will define one single entity to be the organization to which a certificate has been issued. If evidence shows that two duplicate moduli belong to the same organization, although on different hosts/domains, the assumption still applies. In other words, we would consider as relevant only those duplicates where evidence suggests that the moduli belong to two or more distinct organizations, since one of them could be interested in obtaining the private key of the others.

Of course, the chance of such an event occurring is very small and relies only on the hope in some implementation or deployment malpractice, as exemplified in Corollary 1.1, since the probability that two distinct organizations might end up with the same modulus when generated indipendently and correctly is practically zero, assuming a key length of 2048 bits. As it will turn out indeed, the aforementioned third set is empty. The result, although originating from a smaller sample, matches one of those from Lenstra et al., where a strategy for discerning the entities behind the certificates is not explored.

This notwithstanding, the exercise remains worth pursuing, if only for the technique adopted and some curious tangential results.

It may be useful to recall at which stage the RSA algorithm enters the TLS playground. As per RFC 5246, the algorithm is involved in the handshaking sequence, more specifically in the Client Key Exchange Message in order to encrypt the premaster secret directly with the server's public key, in case the selected key exchange method is RSA, or in the Server Key Exchange Message to sign the Diffie-Hellman public key parameters, in case the method is a non-anonymous ephemeral Diffie-Hellman (DHE_RSA), [1], [2]. The former, pure RSA method (i.e. without any DH or DHE prefix), although elegant, does not guarantee forward secrecy and faces concerns related to Bleichenbacher's original attack, as warned against in the RFC 5246 itself, and to its most recent variant ROBOT.

In these two settings then, the server's certificate contains an RSA public key, whose modulus is the subject of the present discussion. Rather than scanning the whole Internet directly, we download the set of certificates presented by all the hosts responding on port 443, as harvested by Project Sonar, which amounts to almost one million certificates.

We are then interested in finding duplicate moduli, which equates to finding duplicates in a given set, for which the two most common strategies are sorting, and searching by hash table. The former's time complexity is assumed to be O(n*logn), while the latter's, in the best case scenario with no collisions, is n*O(1), where O(1) is the table access time complexity.

We want to solve the problem by relying on the Unix toolset only, without resorting to any full-fledged programming language, as adherents to Doug McIlroy's school of thought. The Diogenic programmer is more interested in the "how", rather than the "why".

One may think of extracting the moduli, printing them on the stdout, piping them through the powerful `sort' command, and finally ask `uniq' to report duplicates. This may solve the problem elegantly and succinctly. There are minor concerns though. OpenSSL would output the moduli as hexadecimal strings, and Unix's `sort' does not interpret such values numerically, meaning that hex digits would be compared alphabetically. A naive solution would be to convert the values to decimal notation by means of `bc', for infinite precision. At this point the dilemma would be whether `sort' is able to handle big numbers at all, keeping in mind that we are dealing with 2048-bit long numbers.

There are better workarounds, however. According to the Posix C character collation sequence, the alphabetical order of strings representing hexadecimal values is the same as when the strings are interpreted numerically, provided they are of the same case and of the same length. The latter property can be easily ensured by padding the arguments to a certain boundary, like 2048 bits (256 bytes, 512 hex digits) for example, assuming the length is homogeneous across the set of candidates:

ftp -M -V -o - | \
gunzip -c | cut -d, -f2- | \
while read cert; do echo $cert | base64 -d | \
openssl x509 -inform der -noout -modulus | \
cut -d"=" -f 2 | awk '{ printf("%0512s\n", $1); }'; done | \
LC_COLLATE=C sort -k1 | uniq -d

The one-liner would represent a satisfying solution, except for the fact that, as per its man page, `sort' will resort to using temporary files on disk once the specified memory buffer is full (-S option), or eat up to the 90% of available memory, in case no buffer size is specified, and hence swapping, which would involve the disk anyway. Although irrelevant as it might sound to the modernist, this condition is reached very quickly on a very old x61s with only 1GB of RAM, my humble setup.

The disk being the ultimate commodity on which to rely, one may want to exploit it to its full potential, rather than taking its presence in the background for granted. What if there were a way to take control of that process that inevitably resorts to disk as a fallback and make a brand new paradigm out of it, with its own dedicated interface? This is where we start exploring the hash-table solution.

Hash tables have an intrinsic limitation that we will instead see as an interesting property. Hash functions are not injective, meaning that collisions are a very likely, and expected, possibility. To have a tangible view of how considerable the phenomenon is, one might want to refer to comparative tests. In our case, we want to find duplicates in a given set by means of a hash table, and use the moduli as keys; having two or more of them colliding would imply either a matching pair of duplicates or a spurious collision.

Since we cannot avoid collisions, we could think of exploiting them in order to reduce the set of samples to a smaller one: keys hashed to the same buckets would contain either real duplicates or just colliding entries. In either cases, we would still need to perform a linear search in the bucket to verify the "duplicate moduli" claim. The size of the buckets really depends on the choice of the hash function, but at least in theory one should be able to search candidates in smaller sets than the one containing all the samples. In other words, the hash table maps duplicates to collisions, but not all colliding entries are duplicate moduli.

One might wonder where the improvement over the `sort' solution is, and where the link with the disk interface lie. The fact that we are writing a shell script should not lead one to think about Bash's built-in associative arrays (we are using OpenBSD's pdksh, indeed). So, how would one emulate a hash table in such an environment? It is important to remember that one of the pillars of the Unix design is the transparent and homogeneous interface to the disk by means of the filesystem; by creating as many files as the number of items in the set, assigning each a filename representing the key, and whose file content is the value of the item, we can easily map a set to the filesystem.

We can think of a directory on the filesystem as a lookup table, after all. The interface would be represented by the general shell facilities like `test -e', for querying an entry, and so on. And indeed, BSDs' UFS filesystem performs directory lookups with the help of an in-memory hash-table, as compared to linear searches required in other filesystems. This feature is called dirhash and is enacted anytime certain conditions hold, one of them being the number of entries in the directory. An equivalent exists for Linux as well. Dirhash basically hashes filenames to inode numbers via the FNV hash function, whence our strategy of mapping the keys to filenames. By default the in-memory dirhash is 2 MB, although it can be increased via `sysctl'.

Stated differently, this means that whenever the circumstances require a full fledged hash table, one might simply want to resort to the filesystem interface. This fits the needs of those who, like me, prefer to see and use Unix more like a programming language. Clearly, the advantage resides in the fact that the dirhash table values are just inode numbers, rather than the full content of the entries to be hashed, allowing one to save memory and keep the content on disk, rather than copying it to memory. This scheme captures a common practical programming pattern where big files need to be indexed: in this case a common choice for the table values is the file path, rather than its whole content.

There is only one problem with this approach as it applies to our particular case though: the maximum filename lengths on UFS, which cannot be increased. Indeed:

$ touch $(python -c 'print "a"*256')
touch: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: 
File name too long

We can only hash 255-byte long keys, one byte short the number of bytes required to represent 2048-bit long moduli. That's more a limitation of UFS than of the general technique but it still obstructs the way. However, we can still shrink the filenames with the help of another hash function like SHA-2, whose collision rate is practically close to zero.

To recap our tour de force: we have a set of one million x509 SSL certificates of the majority of internet hosts serving on port 443; we extract all the moduli for those certificates whose key pair is of type RSA; we apply SHA-2 to each modulus, and use the result as the filename of a file in a directory. We preserve additional information for each certificate, like subject, common name, subject alternative name, as the content of each file, in order to keep track of the findings. Then, we simply use the filesystem interface of the shell to check for duplicates: a duplicate is simply a file that already exists. The file look-up is implicitly carried through a hash table, rather than through a linear search.

The following script puts it all together:


[ $(id -u) != "0" ] && { echo "You must be root"; exit 1; }

TABNAME=$([ "$1" ] && echo "$1" || echo "hashtable")
MAXMEM=$([ "$2" ] && echo "$2" || echo "67108864")
OLDMAXMEM=$(sysctl vfs.ffs.dirhash_maxmem | cut -d= -f2)

function clean {
  sysctl vfs.ffs.dirhash_maxmem="$OLDMAXMEM" || \
  { echo "Error setting hashtable maxmem to its previous value. Exiting."; exit 1; }
  exit 0
trap clean INT TERM EXIT

sysctl vfs.ffs.dirhash_maxmem="$MAXMEM" || \
{ echo "Error setting hashtable maxmem. Exiting."; exit 1; }
! [ -d $TABNAME ] && mkdir $TABNAME

while read cert; do
  T=$(echo $cert | base64 -d | openssl x509 -inform der -noout -modulus -subject |\
      sed 'N;s/\n/#/;s/Modulus=\([^#]*\)#subject=\(.*\)$/\1#\2/');
  echo $MODULUS | grep -q 'Wrong Algorithm type' && continue  # non-RSA key
  SAN=$(echo $cert | base64 -d | openssl x509 -inform der -noout -text |\
        sed -n '/Subject Alternative Name/{n;s/^[[:space:]]*//p;}')
  NAME=$(echo -n $MODULUS | sha256)
  [ -e $NAME ] && echo "Collision found: $NAME#$SUBJECT#$SAN#$I"

  echo "$T#$SAN#$I" >> $NAME

cd ..

We keep modulus, subject, subject alternative name and the file index in each file. Log messages are reported to the stdout. Let's see the script in action:

sbudella@misery$ wc -l LOG                                                                
    1265 LOG

We captured the log messages in a file for further reference, we found 1265 collision messages in total. With `sed' we extracts from the log the sha256 hashes, that is, the filenames of the duplicate files. A duplicate entry can occur twice or more. Let's see how many unique duplicates we found:

sbudella@misery$ sed -e 's/^Collision found: \([^#]*\).*/\1/' LOG | sort -u | wc -l

According to our threat model, we look for those duplicates whose subject mismatches, assuming that a different subject would be indicative of a different entity and hence, of a different actor:

for file in $(sed -e 's/^Collision found: \([^#]*\).*/\1/' LOG | sort -u); do
  while IFS=# read -r m sub san i; do
    echo $file#$sub; done<hashtable/$file; done | sort | uniq -u

Note indeed the -u flag given to the command `uniq', which reports only lines which are unique. This filters out all those duplicates sharing the same subject. The rationale behind this filter is that, although interesting from a security point of view, the practice of using the same certificate (wildcard certificates, and multiple domains in the SAN), or the same public key in multiple certificate signing requests for multiple domains, seems to be quite common and accepted on the internet, regardless of the advices against it.

The filter gives 54 results, but inspecting them further shows that the mismatch in the subject does not originate from different organizations, but rather from different pieces of information like serial numbers encoded in the CN, like for example (the first field is the sha256 hash of the modulus):

.../O=Xirrus, Inc. WiFi Array/CN=X2187498B747F
.../O=Xirrus, Inc. WiFi Array/CN=X2188038BB568
.../O=Xirrus, Inc. WiFi Array/CN=X2188038BB58F
.../O=Xirrus, Inc. WiFi Array/CN=X2188058BC03F

Even for those duplicates where the domains encoded in the CN differ, evidence still points to the same owner or organization (namely, the Organization field in the subject):

 /C=US/ST=Texas/L=Dallas/O=ZixCorp Systems, Inc./
 /C=US/ST=Texas/L=Dallas/O=ZixCorp Systems, Inc./
 /C=US/ST=Texas/L=Dallas/O=ZixCorp Systems, Inc./
 /C=US/ST=Texas/L=Dallas/O=ZixCorp Systems, Inc./
 /C=US/ST=Texas/L=Dallas/O=ZixCorp Systems, Inc./
 /C=US/ST=Texas/L=Dallas/O=ZixCorp Systems, Inc./
 /C=US/ST=Texas/L=Dallas/O=ZixCorp Systems, Inc./
 / Organization/C=JP/ST=Tokyo/L=Chiyoda-ku/O=Hatsumei-Tsushin Co., Ltd./OU=Operation/
 / Organization/C=JP/ST=Tokyo/L=Chiyoda-ku/O=Hatsumei-Tsushin Co., Ltd./OU=Operation/

In other words, at least in the given dataset, no more than one single organization shares the same modulus among different hosts. This means that, under the assumptions of our threat model, third parties cannot factor the modulus of another party. Multiple DNS, wildcard certificates or, simply, deployments sharing the same certificate accounts for all the duplicated moduli/pubkeys in the dataset, the security consequences of which have been widely discussed elsewhere (basically, the risk of key compromise increases when the same key is replicated on different hosts; if multiple domains are hosted on the same machine, the risk is the same whether or not the key pair is the same). It might be worth widening the set of samples by adding all the publicly available SSL hosts on the Internet, not only those answering on port 443.

There are some nice pearls though worth immortalizing, which emerge if we remove the filter and hence inspects all the 414 unique duplicates. Can you find them?