Or, programming in the way of Diogenes
Let's imagine one is assigned the problem of finding duplicate
RSA moduli in all publicly available SSL certificates on the Internet,
the reason being a census of public hosts sharing the
same modulus.
Most of the time, if not always, having the same modulus equates
to saying that the certificates share the same public key, the
latter being the tuple composed of the public exponent and the
modulus. Indeed, the public exponent is often chosen between the
values 3, 17 and 2^16 + 1, as this leads to fast exponentiation
operations. Therefore, under this conditions, it is the modulus to
be responsible for the uniqueness of the public key.
In this setting, the certificates fall into three distinct sets
that, from a security standpoint, delineate a simple yet interesting
threat model.
The first one is the set of all the certificates for which no
duplicates occur; the second is the set of all the certificates
with shared moduli/public keys belonging to the same organization.
The third and most interesting one is the set of all those duplicates
that appear to belong to different, unrelated actors.
The assumptions under which the latter two sets are meant, which
hence form the basis of the model, are those where an entry in the
third set would represent a concrete threat to
those parties whose modulus is non-unique; while an entry
in the second set, although not representative of best-practices,
could be more easily justified as the result of key or certificate
reusage, a common custom on the Internet.
Said concrete threat is grounded essentially on one single
theorem by Dan Boneh, from his famous paper
Twenty Years of Attacks on the RSA Cryptosystem:
Theorem 1:
Given the private key d, one can efficiently
factor the modulus N=p*q
A complete proof can be found here.
A good counter argument to the efficacy of the attack arises when one
realizes that, generally, the compromission of a private key in a standard
SSL setting implies the compromission of the modulus as well, provided
that the two lie on the same machine intruded by the attacker. Indeed,
a private key as generated by OpenSSL, for example, does not refer to
the sole private exponent d but contains as well the two
factors of the modulus, P and Q.
Theorem 1 is, in fact, more useful when seen in relation
to the following corollary:
Attack 1.1:
To avoid generating a different modulus N for each user,
a trusted authority may wish to fix N and provide user i
with a unique pair ei, di from which user i forms
public key (N, ei) and private key (N, di).
[...]
By Theorem 1 Bob can use his own exponents eb, db to
factor the modulus N. Once N is factored, Bob can recover Alice's
private key from her public key ea. This observation
shows that an RSA modulus should never be used by more than one
entity.
The last passage is essential to our threat model: we will
define one single entity to be the organization to which a certificate
has been issued. If evidence shows that two duplicate moduli belong
to the same organization, although on different hosts/domains, the
assumption still applies. In other words, we would consider as
relevant only those duplicates where evidence suggests that the
moduli belong to two or more distinct organizations, since one of
them could be interested in obtaining the private key of the
others.
Of course, the chance of such an event occurring is very small
and relies only on the hope in some implementation or deployment
malpractice, as exemplified in Corollary 1.1, since the
probability that two distinct organizations might end up with the
same modulus when generated indipendently and correctly
is practically zero, assuming a key length of 2048 bits. As it will
turn out indeed, the aforementioned third set is empty.
The result, although originating from a smaller sample, matches
one of those from
Lenstra et al., where a strategy for discerning the entities
behind the certificates is not explored.
This notwithstanding, the exercise remains worth pursuing, if only
for the technique adopted and some curious tangential results.
It may be useful to recall at which stage the RSA algorithm
enters the TLS playground. As per RFC 5246, the algorithm
is involved in the handshaking sequence, more specifically in the
Client Key Exchange Message in order to encrypt the premaster secret
directly with the server's public key, in case the selected key
exchange method is RSA, or in the Server Key Exchange Message to
sign the Diffie-Hellman public key parameters, in case the method
is a non-anonymous ephemeral Diffie-Hellman (DHE_RSA), [1],
[2].
The former, pure RSA method (i.e. without any DH or DHE prefix),
although elegant, does not guarantee forward secrecy and faces
concerns related to Bleichenbacher's original attack,
as warned against in the RFC 5246 itself, and to its
most recent variant ROBOT.
In these two settings then, the server's certificate contains an
RSA public key, whose modulus is the subject of the present discussion.
Rather than scanning the whole Internet directly, we download the
set of certificates presented by all the hosts responding on port 443,
as harvested by Project Sonar,
which amounts to almost one million certificates.
We are then interested in finding duplicate moduli, which equates to
finding duplicates in a given set, for which the two most common
strategies are sorting, and searching by hash table. The former's time
complexity is assumed to be O(n*logn), while the latter's, in the best
case scenario with no collisions, is n*O(1), where O(1) is the table
access time complexity.
We want to solve the problem by relying on the Unix toolset
only, without resorting to any full-fledged programming language,
as adherents to Doug
McIlroy's school of thought. The Diogenic programmer is more
interested in the "how",
rather than the "why".
One may think of extracting the moduli, printing them on the stdout,
piping them through the powerful `sort' command, and finally ask
`uniq' to report duplicates. This may solve the problem elegantly
and succinctly. There are minor concerns though. OpenSSL would
output the moduli as hexadecimal strings, and Unix's `sort' does
not interpret such values numerically, meaning that hex digits would
be compared alphabetically. A naive solution would be to convert the
values to decimal notation by means of `bc', for infinite precision.
At this point the dilemma would be whether `sort' is able to
handle big numbers at all, keeping in mind that we are dealing with
2048-bit long numbers.
There are better workarounds, however. According to the
Posix C character collation sequence, the alphabetical order
of strings representing hexadecimal values is the same
as when the strings are interpreted numerically, provided they are
of the same case and of the same length. The latter property can
be easily ensured by padding the arguments to a certain boundary,
like 2048 bits (256 bytes, 512 hex digits) for example, assuming
the length is homogeneous across the set of candidates:
ftp -M -V -o - https://opendata.rapid7.com/sonar.ssl/20181023/2018-10-23-1540260352-https_get_443_certs.gz | \
gunzip -c | cut -d, -f2- | \
while read cert; do echo $cert | base64 -d | \
openssl x509 -inform der -noout -modulus | \
cut -d"=" -f 2 | awk '{ printf("%0512s\n", $1); }'; done | \
LC_COLLATE=C sort -k1 | uniq -d
The one-liner would represent a satisfying solution, except for the
fact that, as per its man page, `sort' will resort to using temporary
files on disk once the specified memory buffer is full (-S
option), or eat up to the 90% of available memory, in case no buffer
size is specified, and hence swapping, which would involve the disk
anyway. Although irrelevant as it might sound to the modernist,
this condition is reached very quickly on a very old x61s with only
1GB of RAM, my humble setup.
The disk being the ultimate commodity on which to rely, one may
want to exploit it to its full potential, rather than taking its
presence in the background for granted. What if there were a way
to take control of that process that inevitably resorts to disk as a
fallback and make a brand new paradigm out of it, with its own dedicated
interface? This is where we start exploring the hash-table solution.
Hash tables have an intrinsic limitation that we will instead see
as an interesting property. Hash functions are not injective, meaning
that collisions are a very likely, and expected, possibility. To
have a tangible view of how considerable the phenomenon is, one might
want to refer to comparative
tests. In our case, we want to find duplicates in a given set
by means of a hash table, and use the moduli as keys; having two
or more of them colliding would imply either a matching pair of
duplicates or a spurious collision.
Since
we cannot avoid collisions, we could think of exploiting them
in order to reduce the set of samples to a smaller one: keys hashed
to the same buckets would contain either real duplicates or just
colliding entries. In either cases, we would still need to perform
a linear search in the bucket to verify the "duplicate moduli"
claim. The size of the buckets really depends on the choice of the
hash function, but at least in theory one should be able to search
candidates in smaller sets than the one containing all the samples.
In other words, the hash table maps duplicates to collisions, but
not all colliding entries are duplicate moduli.
One might wonder where the improvement over the `sort' solution
is, and where the link with the disk interface lie. The fact that
we are writing a shell script should not lead one to think about
Bash's built-in associative arrays (we are using OpenBSD's pdksh,
indeed). So, how would one emulate a hash table in such an
environment? It is important to remember that one of the pillars
of the Unix design is the transparent and homogeneous interface to
the disk by means of the filesystem; by creating as many files as
the number of items in the set, assigning each a filename representing
the key, and whose file content is the value of the item, we can
easily map a set to the filesystem.
We can think of a directory on the filesystem as a lookup table,
after all. The interface would be represented by the general shell
facilities like `test -e', for querying an entry, and so on. And
indeed, BSDs' UFS filesystem performs directory lookups with the
help of an in-memory hash-table, as compared to linear searches
required in other filesystems. This feature is called dirhash and is
enacted anytime certain conditions hold, one of them being the
number of entries in the directory. An equivalent exists for Linux
as well. Dirhash basically hashes filenames to inode numbers via
the FNV
hash function, whence our strategy of mapping the keys to filenames.
By default the in-memory dirhash is 2 MB, although it can be increased
via `sysctl'.
Stated differently, this means that whenever the circumstances
require a full fledged hash table, one might simply want to resort
to the filesystem interface. This fits the needs of those who,
like me, prefer to see and use Unix more like a programming language.
Clearly, the advantage resides in the fact that the dirhash table
values are just inode numbers, rather than the full content of the
entries to be hashed, allowing one to save memory and keep the
content on disk, rather than copying it to memory. This scheme
captures a common practical programming pattern where big files
need to be indexed: in this case a common choice for the table
values is the file path, rather than its whole content.
There is only one problem with this approach as it applies to
our particular case though: the maximum filename lengths on UFS,
which cannot be increased. Indeed:
$ touch $(python -c 'print "a"*256')
touch: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa:
File name too long
We can only hash 255-byte long keys, one byte short the number
of bytes required to represent 2048-bit long moduli. That's more
a limitation of UFS than of the general technique but it still
obstructs the way. However, we can still shrink the filenames with
the help of another hash function like SHA-2, whose collision rate
is practically close to zero.
To recap our tour de force: we have a set of one million x509
SSL certificates of the majority of internet hosts serving on port
443; we extract all the moduli for those certificates whose key
pair is of type RSA; we apply SHA-2 to each modulus, and use the
result as the filename of a file in a directory. We preserve
additional information for each certificate, like subject, common
name, subject alternative name, as the content of each file, in
order to keep track of the findings. Then, we simply use the
filesystem interface of the shell to check for duplicates: a duplicate
is simply a file that already exists. The file look-up is implicitly
carried through a hash table, rather than through a linear search.
The following script puts it all together:
#!/bin/sh
[ $(id -u) != "0" ] && { echo "You must be root"; exit 1; }
TABNAME=$([ "$1" ] && echo "$1" || echo "hashtable")
MAXMEM=$([ "$2" ] && echo "$2" || echo "67108864")
OLDMAXMEM=$(sysctl vfs.ffs.dirhash_maxmem | cut -d= -f2)
function clean {
sysctl vfs.ffs.dirhash_maxmem="$OLDMAXMEM" || \
{ echo "Error setting hashtable maxmem to its previous value. Exiting."; exit 1; }
exit 0
}
trap clean INT TERM EXIT
sysctl vfs.ffs.dirhash_maxmem="$MAXMEM" || \
{ echo "Error setting hashtable maxmem. Exiting."; exit 1; }
! [ -d $TABNAME ] && mkdir $TABNAME
cd $TABNAME
I=0
while read cert; do
I=$((I+1))
T=$(echo $cert | base64 -d | openssl x509 -inform der -noout -modulus -subject |\
sed 'N;s/\n/#/;s/Modulus=\([^#]*\)#subject=\(.*\)$/\1#\2/');
MODULUS=${T%,*}
SUBJECT=${T#*,}
echo $MODULUS | grep -q 'Wrong Algorithm type' && continue # non-RSA key
SAN=$(echo $cert | base64 -d | openssl x509 -inform der -noout -text |\
sed -n '/Subject Alternative Name/{n;s/^[[:space:]]*//p;}')
NAME=$(echo -n $MODULUS | sha256)
[ -e $NAME ] && echo "Collision found: $NAME#$SUBJECT#$SAN#$I"
echo "$T#$SAN#$I" >> $NAME
done<../certs.dat
cd ..
We keep modulus, subject, subject alternative name and the file
index in each file. Log messages are reported to the stdout. Let's
see the script in action:
sbudella@misery$ wc -l LOG
1265 LOG
We captured the log messages in a file for further reference, we found
1265 collision messages in total. With `sed' we extracts from the log
the sha256 hashes, that is, the filenames of the duplicate files. A
duplicate entry can occur twice or more. Let's see how many unique
duplicates we found:
sbudella@misery$ sed -e 's/^Collision found: \([^#]*\).*/\1/' LOG | sort -u | wc -l
414
According to our threat model, we look for those duplicates
whose subject mismatches, assuming that a different subject would
be indicative of a different entity and hence, of a different actor:
for file in $(sed -e 's/^Collision found: \([^#]*\).*/\1/' LOG | sort -u); do
while IFS=# read -r m sub san i; do
echo $file#$sub; done<hashtable/$file; done | sort | uniq -u
Note indeed the -u flag given to the command `uniq', which
reports only lines which are unique. This filters out all those
duplicates sharing the same subject. The rationale behind
this filter is that, although interesting from a security point of
view, the practice of using the same certificate (wildcard certificates,
and multiple domains in the SAN), or the same public key in multiple
certificate signing requests for multiple domains, seems to be quite
common and accepted on the internet, regardless of the advices
against it.
The filter gives 54 results, but inspecting them further shows
that the mismatch in the subject does not originate from different
organizations, but rather from different pieces of information like
serial numbers encoded in the CN, like for example (the first field
is the sha256 hash of the modulus):
0764d24e02fbf2f197eda4bda2555cc4f425547409db55b96ed4e33693da4bd8
.../O=Xirrus, Inc. WiFi Array/CN=X2187498B747F
0764d24e02fbf2f197eda4bda2555cc4f425547409db55b96ed4e33693da4bd8
.../O=Xirrus, Inc. WiFi Array/CN=X2188038BB568
0764d24e02fbf2f197eda4bda2555cc4f425547409db55b96ed4e33693da4bd8
.../O=Xirrus, Inc. WiFi Array/CN=X2188038BB58F
0764d24e02fbf2f197eda4bda2555cc4f425547409db55b96ed4e33693da4bd8
.../O=Xirrus, Inc. WiFi Array/CN=X2188058BC03F
[...]
Even for those duplicates where the domains encoded in the CN
differ, evidence still points to the same owner or organization (namely,
the Organization field in the subject):
[...]
15e404a4d003f26e4ebba031844d118ef040060e8592e32044fe9c20fc84193b
/C=US/ST=Texas/L=Dallas/O=ZixCorp Systems, Inc./CN=secure-bremer.com
15e404a4d003f26e4ebba031844d118ef040060e8592e32044fe9c20fc84193b
/C=US/ST=Texas/L=Dallas/O=ZixCorp Systems, Inc./CN=secure-thechristhospital.com
15e404a4d003f26e4ebba031844d118ef040060e8592e32044fe9c20fc84193b
/C=US/ST=Texas/L=Dallas/O=ZixCorp Systems, Inc./CN=securemail-arrowbank.com
15e404a4d003f26e4ebba031844d118ef040060e8592e32044fe9c20fc84193b
/C=US/ST=Texas/L=Dallas/O=ZixCorp Systems, Inc./CN=securemail-avmed.org
15e404a4d003f26e4ebba031844d118ef040060e8592e32044fe9c20fc84193b
/C=US/ST=Texas/L=Dallas/O=ZixCorp Systems, Inc./CN=securemail-dbindt.com
15e404a4d003f26e4ebba031844d118ef040060e8592e32044fe9c20fc84193b
/C=US/ST=Texas/L=Dallas/O=ZixCorp Systems, Inc./CN=securemail-ironcore-inc.com
15e404a4d003f26e4ebba031844d118ef040060e8592e32044fe9c20fc84193b
/C=US/ST=Texas/L=Dallas/O=ZixCorp Systems, Inc./CN=securemail-jvhl.org
[...]
64f5a4acb35d62edebeb2f032ca2116f3d6fe897e83bafc947352863260c515c
/1.3.6.1.4.1.311.60.2.1.3=JP/serialNumber=0100-01-026834/businessCategory=Private Organization/C=JP/ST=Tokyo/L=Chiyoda-ku/O=Hatsumei-Tsushin Co., Ltd./OU=Operation/CN=app01.hypatweb.jp
64f5a4acb35d62edebeb2f032ca2116f3d6fe897e83bafc947352863260c515c
/1.3.6.1.4.1.311.60.2.1.3=JP/serialNumber=0100-01-026834/businessCategory=Private Organization/C=JP/ST=Tokyo/L=Chiyoda-ku/O=Hatsumei-Tsushin Co., Ltd./OU=Operation/CN=base01.hypatweb.jp
[...]
In other words, at least in the given dataset, no more than one
single organization shares the same modulus among different hosts.
This means that, under the assumptions of our threat model, third
parties cannot factor the modulus of another party. Multiple DNS,
wildcard certificates or, simply, deployments sharing the same
certificate accounts for all the duplicated moduli/pubkeys in the
dataset, the security consequences of which have been widely discussed
elsewhere (basically, the risk of key compromise increases when the
same key is replicated on different hosts; if multiple domains are
hosted on the same machine, the risk is the same whether or not the
key pair is the same). It might be worth widening the set of samples
by adding all the publicly available SSL hosts on the Internet,
not only those answering on port 443.
There are some nice pearls though worth immortalizing, which emerge
if we remove the filter and hence inspects all the 414 unique
duplicates. Can you find them?