My self is steam

Insights into computer security, programming and math


May 03, 2020
Mindreading your Ps and Qs

Consider the following:

Problem 1: Given a set of n randomly selected RSA numbers, estimate the portion k of numbers whose composing primes both belong to the residue class 3 (mod 4).

Attempting to solve this apparently innocent problem will allow us to range over a few interesting ideas from different disciplines.

It is important to first identify the constraints of the problem, since it clearly inquires about mathematical objects whose scope intersects with the domain of security: we shall define an RSA number to be any odd semiprime of which we observe neither its composing primes nor any kind of property about them. This definition easily applies to the public key moduli of RSA certificates, for instance.

Under this setting, the problem basically asks whether it is possible to infer some kind of information about objects that are kept secret; in other words, although it is generally not possible to reconstruct the underlying factors, could we at least classify them into well-defined categories by interacting with the observables only? That is, given a set of N=pq, could one state anything at all about both p and q by observing only the Ns?

The motivation for inquiring specifically about the residue class of the primes is a way of restricting the domain of this hidden information that we want to reconstruct. Also, as it will be shown, it allows us to exploit some results from number theory to our advantage.

Let P2 be the set of all odd semiprimes below a given x:

P2=pqx: p,qP, and p,q>2.

We define the following function on P2:

pq(pq mod 4, p mod 4, q mod 4)

This function can be also seen as a binary relation that partitions P2 into the following 23 residue classes: (1,1,1), (1,1,3), (1,3,1), (1,3,3), (3,1,1), (3,1,3), (3,3,1), (3,3,3). Indeed, since we ruled out all semiprimes multiple of 2, the only possible values of the remainders mod 4 are 1 and 3.

This means that each semiprime and its respective composing primes are congruent to any of the eight classes above, allowing us to deal with a much smaller set than the whole P2. But we can do even better by ruling out the residues (1,1,3), (1,3,1), (3,1,1), (3,3,3) due to the following observation:

(pq mod n)(p mod n · q mod n) mod n

The proof is self-evident if we remember that:

 rqq mod nrpp mod nrp·rqp·q mod n

So basically the residue classes where the modulo of the semiprime is not congruent to the product of the modulo of its composing primes are necessarily empty.

Our residue system of the semiprimes is then: R2={(1,1,1), (1,3,3), (3,1,3), (3,3,1)}. In the limit, one would expect roughly equal portions of integers in the two residue classes 1 mod 4 and 3 mod 4, that is:

#{s1 mod 4, sP2}=#{s3 mod 4, sP2}

The two quantities differ according to a phenomenon known as Chebyshev's bias (Kevin Ford, Jason Sneed, Chebyshev's bias for products of two primes). The bias generally arises when the two residue classes in the comparison are both neither quadratic residues nor quadratic nonresidues; indeed, 1 is a quadratic residue mod 4 but 3 is not. However, the bias is much more pronounced for the residue system of the primes, while for the semiprimes the bias is so small that the previous identity can be replaced in practice with:

#{s1 mod 4, sP2}#{s3 mod 4, sP2}

The latter espression essentially says that the semiprimes are equally partitioned between the two residue classes, meaning that if we pick a semiprime less than a given value, it will be congruent to either 1 or 3 mod 4 with the same probability. This would preclude any possibility of estimating which residue classes its generating primes belong to. Indeed, if we picked for example an s1 mod 4, it would either mean p1 mod 4  q1 mod 4, or p3 mod 4  q3 mod 4 but none of the two hypothesis would have even the smallest "advantage", as the information about the residue classes of the generating primes is lost forever due to multiplication, which could be seen as an entropy black hole where even this one-bit of information about p and q is destroyed.

There is a subtle mistake in this line of reasoning, though. The previous result states something about a residue system of only two classes, while the system of our investigation has four. We are implicitly assuming that, since the semiprimes are equally partitioned between the two classes, the composing primes are further as equally partitioned among the four residue classes. This assumption is indeed wrong, and by computing the actual distribution among the residue classes we convice ourselves by experiencing at the same time a far larger phenomenon.

Let's first compute the number of odd semiprimes less than 10t for each tN; this sequence is known as A066265 when each of its elements is π2(x), the count of all semiprimes (both even and odd) less than x. Although an exact formula for the calculation of π2(x) exists, it still requires iterating through all the primes up to the square root of x: this can require substantial time and computational resources, and it's the main reason why A066265 stops at 1021 and will probably not grow much during this lifetime.

Our needs are different though: not only do we need the exact count of the odd semiprimes at regular intervals, but we do also need to progressively count the congruence of each (pq mod 4, p mod 4, q mod 4) to any of the four residue classes, up to x rather than its square root. This makes the computation even more demanding than the one involving the closed formula.

My modest means indeed allows me to obtain only the counts up to 1012, but this is enough to notice the major pattern underpinning the discussion:

103;(1, 3, 3): 35.294%;(3, 1, 3): 29.412%;(3, 3, 1): 18.137%;(1, 1, 1): 17.157%;204
104;(1, 3, 3): 31.748%;(3, 1, 3): 30.010%;(3, 3, 1): 19.632%;(1, 1, 1): 18.609%;1956
105;(1, 3, 3): 30.375%;(3, 1, 3): 29.668%;(3, 3, 1): 20.186%;(1, 1, 1): 19.770%;18245
106;(1, 3, 3): 29.612%;(3, 1, 3): 29.220%;(3, 3, 1): 20.686%;(1, 1, 1): 20.482%;168497
107;(1, 3, 3): 29.081%;(3, 1, 3): 28.917%;(3, 3, 1): 21.067%;(1, 1, 1): 20.935%;1555811
108;(1, 3, 3): 28.732%;(3, 1, 3): 28.656%;(3, 3, 1): 21.340%;(1, 1, 1): 21.272%;14426124
109;(1, 3, 3): 28.480%;(3, 1, 3): 28.446%;(3, 3, 1): 21.554%;(1, 1, 1): 21.520%;134432669
1010;(1, 3, 3): 28.289%;(3, 1, 3): 28.272%;(3, 3, 1): 21.727%;(1, 1, 1): 21.712%;1258822220
1011;(1, 3, 3): 28.136%;(3, 1, 3): 28.128%;(3, 3, 1): 21.872%;(1, 1, 1): 21.864%;11840335764
1012;(1, 3, 3): 28.011%;(3, 1, 3): 28.008%;(3, 3, 1): 21.992%;(1, 1, 1): 21.989%;111817881036

The last column of each row shows the count of all odd semiprimes up to the power of 10 in the first column. The other columns are the percentage of each residue class. As we can see, the classes are not at all equally apportioned! In particular, each of the residue classes of pq (the first number of each triple) is always split into one dominant class and a minor one. Therefore, if we pick an odd semiprime less than, say, 1012, whose remainder mod 4 is 1, its generating primes will have a higher probability to be both congruent to 3 mod 4 than to 1 mod 4.

These calculations seems to confirm another notable result in number theory (David Dummit, Andrew Granville, and Hershy Kisilevsky, Big biases amongst products of two primes); therein, an asymptotic formula for the ratio between the odd semiprimes congruent to (1,3,3) and one quarter of the count of all odd semiprimes up to a given x guarantees the ratio to be always slightly greater than and slowly converging to 1:

r(x)=#{pqx : pq3 mod 4}14{pqx}1+1+o(1)3(loglogx-1)

The attentive reader may notice that the latter observation would very well apply to the two residue classes (3,1,3) and (3,3,1). However, these two classes are, in a sense, just one single opaque instance: if one picked a semiprime that is congruent to 3 there would be no way to decide which among p and q is congruent to 1 and which is congruent to 3, despite the counts of the two classes differ considerably; this is due to the commutative property of multiplication. Therefore, although the bias affects all the four residue classes, under the setting of the RSA numbers it gives an advantage whenever the semiprime is congruent to 1 mod 4.

An interesting application worth exploring would be the "probabilistic" optimization of naive factorization algorithms like Fermat's, by testing for candidate factors both congruent to 3 mod 4 first.

But let's go back to our former Problem 1. We notice that it possesses all the features of a binomial experiment: herein the random variable Y models the number of semiprimes congruent to (1,3,3) among a larger set of n semiprimes randomly extracted from the set of all semiprimes below a given x. There are indeed only two possible outcomes for each extraction: either the semiprime is congruent to that class or it is congruent to any of the remaining three. We will denote the probability of the former outcome with θ, its converse is therefore (1-θ). (Technically, since we do not reintroduce each of the extracted semiprime back in the set, further selections are not independent since there would be fewer semiprimes to select from; a more "accurate" distribution than the binomial would then be the hypergeometric. However, for large populations like that of semiprimes, we can consider the events as independent).

There a few caveats, though. One might argue that the number of the occurrences of the congruences of interest changes as the value of x increases; this might suggest that the Poisson-Binomial distribution is a better choice to model the problem, since we know both empirically and from the expression of r(x) that θ slowly decreases. However, we will see that θ does not change much even for large values of x, therefore we keep the calculations simple by adopting the binomial distribution, as they could get quite involved with the Poisson-Binomial distribution.

The probability that k out of n odd semiprimes are congruent to (1,3,3) is therefore:

P(Y=k)=nkθk(1-θ)n-k

Under the assumption of the binomial model, the probability θ that each semiprime is congruent to (1,3,3) never changes. We might be tempted to use directly one of the values of θ that we obtained empirically, however, since the size of the n semiprimes can span several different order of magnitudes, this approach represents too much of an approximation. Indeed, if the assumption holds that the semiprimes are randomly selected from a larger population, they could very well lie below, say, 1012 as well as below 1061722048. As already mentioned, empirical calculations above 1024 are out of question.

The solution is to estimate the parameter θ of the distribution. Among the several methods to estimate a distribution parameter, we opt for the Maximum Likelihood Estimation: a function of θ called the likelihood is generally differentiated in order to obtain its maxima. For the binomial distribution this method yields θ^=kn as the most likely estimator, that is, the proportion of positive trials. By taking into account the expression of Dammit et al. for r(x) we obtain:

θ^=kn14+112(loglogx-1)

This represents a good (under)estimation of θ for large values of x. Let's test its performance with a concrete example. Let's collect all those numbers from the RSA factoring challenge that have been factored so far; they amount to n=23, and the biggest of them is in the order of 10250. How many of them could we expect to be congruent to (1,3,3)? We must interpret the problem in the sense that the n=23 numbers have been randomly picked from the set of all semiprimes below 10250 where each of them have the same probability θ of being congruent to (1,3,3). By using our estimator θ^ we get a value of θ equal to 0.2656. We plug it into the equation of the binomial distribution for different values of k; we compare the results with the same distribution under the assumption that the residue class has the same probability than the others:

For θ=0.2656 the probability is maximized at k=6, while the unbiased θ "predicts" k=5, 6 with equal probability. Of course, it is more revealing to calculate the probability that the set contains at least 6 numbers that are congruent to the residue class: we obtain Pθ^(Y6)=0.41 and Pθ=0.25(Y6)=0.34 under the two hypothesis, respectively. Since these 23 RSA numbers have been factored, we can verify both claims. There are indeed exactly 7 RSA numbers in the set that are congruent to (1,3,3) and only 4 are congruent to (1,3,3). Not bad.

Let's finally consider the whole set of the numbers of the RSA challenge. We have a total of n=54 semiprimes, the majority of which have not been factored. For such large values of n, we calculate the binomial coefficient with Stirling's approximation. The estimator obtained for values of x10617 is θ^=0.2633, and the probability is maximed as per Fig 2:

We therefore estimate that the probability of having at least k=14 semiprimes whose composing primes are both congruent to 3 mod 4 is Pθ^(Y14)=0.46. As the factorization of the biggest RSA numbers will probably never happen in this lifetime of ours, knowing this tiny piece of information about their secret primes is already a victory, although it might not help us become eligible for any cash prize.