ChatGPT解决这个技术问题 Extra ChatGPT

How to avoid overflow in expr. A * B - C * D

I need to compute an expression which looks like: A*B - C*D, where their types are: signed long long int A, B, C, D; Each number can be really big (not overflowing its type). While A*B could cause overflow, at same time expression A*B - C*D can be really small. How can I compute it correctly?

For example: MAX * MAX - (MAX - 1) * (MAX + 1) == 1, where MAX = LLONG_MAX - n and n - some natural number.

How important is accuracy?
@Cthulhu, great question. He could try to make an equivalent function using smaller number by dividing all of them by 10 or something, then multiplying the result.
Vars A,B,C,D are signed. This implies A - C could overflow. It that an issue to consider or do you know that this is not going to happen with your data?
@MooingDuck but you can check beforehand if the operation will overflow stackoverflow.com/a/3224630/158285
@Chris: No, I'm saying that there's no portable way to check if a signed overflow has happened. (Brad is correct that you can portably detect that it will happen). Using inline assembly is one of many non-portable ways to check.

A
Anirudh Ramanathan

This seems too trivial I guess. But A*B is the one that could overflow.

You could do the following, without losing precision

A*B - C*D = A(D+E) - (A+F)D
          = AD + AE - AD - DF
          = AE - DF
             ^smaller quantities E & F

E = B - D (hence, far smaller than B)
F = C - A (hence, far smaller than C)

This decomposition can be done further. As @Gian pointed out, care might need to be taken during the subtraction operation if the type is unsigned long long.

For example, with the case you have in the question, it takes just one iteration,

 MAX * MAX - (MAX - 1) * (MAX + 1)
  A     B       C           D

E = B - D = -1
F = C - A = -1

AE - DF = {MAX * -1} - {(MAX + 1) * -1} = -MAX + MAX + 1 = 1

@Caleb, just apply the same algorithm to C*D
I think you should explain what E represents.
Both long long and double are 64 bits. Since double has to allocate some bits for the exponent, it has a smaller range of possible values without loss of precision.
@Cthulhu - seems to me like this would only work if all the number are very large...for example, you would still have overflow with {A,B,C,D} = {MAX,MAX,MAX,2}. The OP says "Each number can be really big", but it isn't clear from the problem statement that each number must be really big.
What if any of A,B,C,D are negative though? Won't E or F be even bigger then?
O
Ofir

The simplest and most general solution is to use a representation that can't overflow, either by using a long integer library (e.g. http://gmplib.org/) or representing using a struct or array and implementing a kind of long multiplication (i.e. separating each number to two 32bit halves and performing the multiplication as below:

(R1 + R2 * 2^32 + R3 * 2^64 + R4 * 2^96) = R = A*B = (A1 + A2 * 2^32) * (B1 + B2 * 2^32) 
R1 = (A1*B1) % 2^32
R2 = ((A1*B1) / 2^32 + (A1*B2) % 2^32 + (A2*B1) % 2^32) % 2^32
R3 = (((A1*B1) / 2^32 + (A1*B2) % 2^32 + (A2*B1) % 2^32) / 2^32 + (A1*B2) / 2^32 + (A2*B1) / 2^32 + (A2*B2) % 2^32) %2^32
R4 = ((((A1*B1) / 2^32 + (A1*B2) % 2^32 + (A2*B1) % 2^32) / 2^32 + (A1*B2) / 2^32 + (A2*B1) / 2^32 + (A2*B2) % 2^32) / 2^32) + (A2*B2) / 2^32

Assuming the end result fits in 64 bits you actually don't really need most bits of R3 and none of R4


The calculation above is really not as complicated as it looks, it really is simple long multiplication in base 2^32, and the code in C should look simpler. Also, it will be a good idea to create generic functions to do this work in your program.
M
Mysticial

Note that this is not standard since it relies on wrap-around signed-overflow. (GCC has compiler flags which enable this.)

But if you just do all the calculations in long long, the result of applying the formula directly:
(A * B - C * D) will be accurate as long as the correct result fits into a long long.

Here's a work-around that only relies on implementation-defined behavior of casting unsigned integer to signed integer. But this can be expected to work on almost every system today.

(long long)((unsigned long long)A * B - (unsigned long long)C * D)

This casts the inputs to unsigned long long where the overflow behavior is guaranteed to be wrap-around by the standard. Casting back to a signed integer at the end is the implementation-defined part, but will work on nearly all environments today.

If you need more pedantic solution, I think you have to use "long arithmetic"


+1 You're the only one to notice this. The only tricky part is setting the compiler to do wrap-around overflow and checking if the correct result actually does fit into a long long.
Even the naive version without any tricks at all will do the right thing on most implementations; it's not guaranteed to by the standard, but you would have to find a 1's-complement machine or some other quite weird device to make it fail.
I think this is an important answer. I agree it may not be correct programming to assume implementation specific behavior, but every engineer should understand modulo arithmetic and how to get the right compiler flags to ensure consistent behavior if performance is essential. DSP engineers rely on this behavior for fixed point filter implementations, for which the accepted answer will have unacceptable performance.
p
paquetp

This should work ( I think ):

signed long long int a = 0x7ffffffffffffffd;
signed long long int b = 0x7ffffffffffffffd;
signed long long int c = 0x7ffffffffffffffc;
signed long long int d = 0x7ffffffffffffffe;
signed long long int bd = b / d;
signed long long int bdmod = b % d;
signed long long int ca = c / a;
signed long long int camod = c % a;
signed long long int x = (bd - ca) * a * d - (camod * d - bdmod * a);

Here's my derivation:

x = a * b - c * d
x / (a * d) = (a * b - c * d) / (a * d)
x / (a * d) = b / d - c / a

now, the integer/mod stuff:
x / (a * d) = (b / d + ( b % d ) / d) - (c / a + ( c % a ) / a )
x / (a * d) = (b / d - c / a) - ( ( c % a ) / a - ( b % d ) / d)
x = (b / d - c / a) * a * d - ( ( c % a ) * d - ( b % d ) * a)

Thanks @bradgonesurfing - could you provide such an input? I've updated my answer, executed it and it works (bd and ca are 0)...
Hmmm. Now I think about it maybe not. Degenerate case with d = 1 and a = 1 and b = maxint and c = maxint it still works. Cool :)
@paquetp: a=1, b=0x7fffffffffffffff, c=-0x7fffffffffffffff, d=1 (note c is negative). Clever though, I'm certain your code handles all positive numbers correctly.
@MooingDuck but the final answer for your set is overflowed as well so it is not a valid setup. It only works if each side is of the same sign so the resulting subtraction is within range.
There is something strange with StackOverflow when this answer which is the simplest and the best has got such a low score compared to the top scored answer.
A
Anirudh Ramanathan
E = max(A,B,C,D)
A1 = A -E;
B1 = B -E;
C1 = C -E;
D1 = D -E;

then

A*B - C*D = (A1+E)*(B1+E)-(C1+E)(D1+E) = (A1+B1-C1-D1)*E + A1*B1 -C1*D1

G
Gian

You could consider computing a greatest common factor for all your values, and then dividing them by that factor before doing your arithmetic operations, then multiplying again. This assumes that such a factor exists, however (for example, if A, B, C and D happen to be relatively prime, they won't have a common factor).

Similarly, you could consider working on log-scales, but this is going to be a little scary, subject to numerical precision.


Logarithming seems good if long double is available. In that case, an acceptable level of precision can be achieved (and the result cen be rounded).
E
Esteban Crespi

If the result fits in a long long int then the expression A*B-C*D is okay as it performs the arithmetic mod 2^64, and will give the correct result. The problem is to know if the result fits in a long long int. To detect this, you can use the following trick using doubles:

if( abs( (double)A*B - (double)C*D ) > MAX_LLONG ) 
    Overflow
else 
    return A*B-C*D;

The problem with this approach is that you are limited by the precision of the mantissa of the doubles (54bits?) so you need to limit the products A*B and C*D to 63+54 bits (or probably a little less).


This is the most practical example. Clear and gives the right answer (or throws an Exception when the inputs are bad).
Nice and elegant! You did not fall for the trap the others fell for. Just one more thing: I would bet there are some examples where the double calculation is below MAX_LLONG just due to rounding errors. My mathematical instinct tells me you should calculate the difference of the double and the long result instead, and compare that to MAX_LLONG/2 or something. This difference are the rounding errors of the double calculation and plus the overflow and should normally be relatively low, but in the case I mentioned it will be large. But right now I am too lazy to find out for sure. :-)
P
Pierre Arnaud

You could write each number in an array, each element being a digit and do the calculations as polynomials. Take the resulting polynomial, which is an array, and compute the result by multiplying each element of the array with 10 to the power of the position in the array (the first position being the largest and the last being zero).

The number 123 can be expressed as:

123 = 100 * 1 + 10 * 2 + 3

for which you just create an array [1 2 3].

You do this for all numbers A, B, C and D, and then you multiply them as polynomials. Once you have the resulting polynomial, you just reconstruct the number from it.


don't know what that is but I'll have to find. put :) . this is a solution of the top of my head while I'm shopping with my girlfriend :)
you're implementing bignums in a base10 array. GMP is a quality bignum library that uses base 4294967296. MUCH faster. No downvote though, because the answer is correct, and useful.
thanks :) . it`s usefully to know that this one way of doing it but there are better ways so don't do it like this. at least not in this situation :)
anyway ... using this solution you could computer much bigger number than any primitive type could bold (like 100 digit number s) and keep the result as an array. this deserves an up votez :p
I'm not certain it gets an upvote, since this method (though effective and relatively easy to understand) is memory hungry and slow.
d
dronus

While a signed long long int will not hold A*B, two of them will. So A*B could be decomposed to tree terms of different exponent, any of them fitting one signed long long int.

A1=A>>32;
A0=A & 0xffffffff;
B1=B>>32;
B0=B & 0xffffffff;

AB_0=A0*B0;
AB_1=A0*B1+A1*B0;
AB_2=A1*B1;

Same for C*D.

Folowing the straight way, the subraction could be done to every pair of AB_i and CD_i likewise, using an additional carry bit (accurately a 1-bit integer) for each. So if we say E=A*B-C*D you get something like:

E_00=AB_0-CD_0 
E_01=(AB_0 > CD_0) == (AB_0 - CD_0 < 0) ? 0 : 1  // carry bit if overflow
E_10=AB_1-CD_1 
...

We continue by transferring the upper-half of E_10 to E_20 (shift by 32 and add, then erase upper half of E_10).

Now you can get rid of the carry bit E_11 by adding it with the right sign (obtained from the non-carry part) to E_20. If this triggers an overflow, the result wouldn't fit either.

E_10 now has enough 'space' to take the upper half from E_00 (shift, add, erase) and the carry bit E_01.

E_10 may be larger now again, so we repeat the transfer to E_20.

At this point, E_20 must become zero, otherwise the result won't fit. The upper half of E_10 is empty as result of the transfer too.

The final step is to transfer the lower half of E_20 into E_10 again.

If the expectation that E=A*B+C*D would fit the signed long long int holds, we now have

E_20=0
E_10=0
E_00=E

This is actually the simplified formula that one would get if using Ofir's multiplication formula and removing every useless temporary result.
E
Eric Postpischil

If you know the final result is representable in your integer type, you can perform this calculation quickly using the code below. Because the C standard specifies that unsigned arithmetic is modulo arithmetic and does not overflow, you can use an unsigned type to perform the calculation.

The following code assumes there is an unsigned type of the same width and that the signed type uses all bit patterns to represent values (no trap representations, the minimum of the signed type is the negative of half the modulus of the unsigned type). If this does not hold in a C implementation, simple adjustments can be made to the ConvertToSigned routine for that.

The following uses signed char and unsigned char to demonstrate the code. For your implementation, change the definition of Signed to typedef signed long long int Signed; and the definition of Unsigned to typedef unsigned long long int Unsigned;.

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>


//  Define the signed and unsigned types we wish to use.
typedef signed char   Signed;
typedef unsigned char Unsigned;

//  uHalfModulus is half the modulus of the unsigned type.
static const Unsigned uHalfModulus = UCHAR_MAX/2+1;

//  sHalfModulus is the negation of half the modulus of the unsigned type.
static const Signed   sHalfModulus = -1 - (Signed) (UCHAR_MAX/2);


/*  Map the unsigned value to the signed value that is the same modulo the
    modulus of the unsigned type.  If the input x maps to a positive value, we
    simply return x.  If it maps to a negative value, we return x minus the
    modulus of the unsigned type.

    In most C implementations, this routine could simply be "return x;".
    However, this version uses several steps to convert x to a negative value
    so that overflow is avoided.
*/
static Signed ConvertToSigned(Unsigned x)
{
    /*  If x is representable in the signed type, return it.  (In some
        implementations, 
    */
    if (x < uHalfModulus)
        return x;

    /*  Otherwise, return x minus the modulus of the unsigned type, taking
        care not to overflow the signed type.
    */
    return (Signed) (x - uHalfModulus) - sHalfModulus;
}


/*  Calculate A*B - C*D given that the result is representable as a Signed
    value.
*/
static signed char Calculate(Signed A, Signed B, Signed C, Signed D)
{
    /*  Map signed values to unsigned values.  Positive values are unaltered.
        Negative values have the modulus of the unsigned type added.  Because
        we do modulo arithmetic below, adding the modulus does not change the
        final result.
    */
    Unsigned a = A;
    Unsigned b = B;
    Unsigned c = C;
    Unsigned d = D;

    //  Calculate with modulo arithmetic.
    Unsigned t = a*b - c*d;

    //  Map the unsigned value to the corresponding signed value.
    return ConvertToSigned(t);
}


int main()
{
    //  Test every combination of inputs for signed char.
    for (int A = SCHAR_MIN; A <= SCHAR_MAX; ++A)
    for (int B = SCHAR_MIN; B <= SCHAR_MAX; ++B)
    for (int C = SCHAR_MIN; C <= SCHAR_MAX; ++C)
    for (int D = SCHAR_MIN; D <= SCHAR_MAX; ++D)
    {
        //  Use int to calculate the expected result.
        int t0 = A*B - C*D;

        //  If the result is not representable in signed char, skip this case.
        if (t0 < SCHAR_MIN || SCHAR_MAX < t0)
            continue;

        //  Calculate the result with the sample code.
        int t1 = Calculate(A, B, C, D);

        //  Test the result for errors.
        if (t0 != t1)
        {
            printf("%d*%d - %d*%d = %d, but %d was returned.\n",
                A, B, C, D, t0, t1);
            exit(EXIT_FAILURE);
        }
    }
    return 0;
}

b
bradgonesurfing

You could try breaking the equation into smaller components which don't overflow.

AB - CD
= [ A(B - N) - C( D - M )] + [AN - CM]

= ( AK - CJ ) + ( AN - CM)

    where K = B - N
          J = D - M

If the components still overflow you could break them into smaller components recursively and then recombine.


This may or may not be correct, but is definitely confusing. You define K and J, why not N and M. Also, I think you're breaking the equation into bigger pieces. Since your step 3 is the same as the OP's question, except more complicated (AK-CJ) -> (AB-CD)
N is not simplified from anything. It's just a number subtracted from A to make it smaller. Actually it is a similar but inferior solution to paquetp. Here I'm using subtraction instead of integer division to make it smaller.
O
OldCurmudgeon

I may not have covered all of the edge cases, nor have I rigorously tested this but this implements a technique I remember using in the 80s when trying to do 32-bit integer maths on a 16-bit cpu. Essentially you split the 32 bits into two 16-bit units and work with them separately.

public class DoubleMaths {
  private static class SplitLong {
    // High half (or integral part).
    private final long h;
    // Low half.
    private final long l;
    // Split.
    private static final int SPLIT = (Long.SIZE / 2);

    // Make from an existing pair.
    private SplitLong(long h, long l) {
      // Let l overflow into h.
      this.h = h + (l >> SPLIT);
      this.l = l % (1l << SPLIT);
    }

    public SplitLong(long v) {
      h = v >> SPLIT;
      l = v % (1l << SPLIT);
    }

    public long longValue() {
      return (h << SPLIT) + l;
    }

    public SplitLong add ( SplitLong b ) {
      // TODO: Check for overflow.
      return new SplitLong ( longValue() + b.longValue() );
    }

    public SplitLong sub ( SplitLong b ) {
      // TODO: Check for overflow.
      return new SplitLong ( longValue() - b.longValue() );
    }

    public SplitLong mul ( SplitLong b ) {
      /*
       * e.g. 10 * 15 = 150
       * 
       * Divide 10 and 15 by 5
       * 
       * 2 * 3 = 5
       * 
       * Must therefore multiply up by 5 * 5 = 25
       * 
       * 5 * 25 = 150
       */
      long lbl = l * b.l;
      long hbh = h * b.h;
      long lbh = l * b.h;
      long hbl = h * b.l;
      return new SplitLong ( lbh + hbl, lbl + hbh );
    }

    @Override
    public String toString () {
      return Long.toHexString(h)+"|"+Long.toHexString(l);
    }
  }

  // I'll use long and int but this can apply just as easily to long-long and long.
  // The aim is to calculate A*B - C*D without overflow.
  static final long A = Long.MAX_VALUE;
  static final long B = Long.MAX_VALUE - 1;
  static final long C = Long.MAX_VALUE;
  static final long D = Long.MAX_VALUE - 2;

  public static void main(String[] args) throws InterruptedException {
    // First do it with BigIntegers to get what the result should be.
    BigInteger a = BigInteger.valueOf(A);
    BigInteger b = BigInteger.valueOf(B);
    BigInteger c = BigInteger.valueOf(C);
    BigInteger d = BigInteger.valueOf(D);
    BigInteger answer = a.multiply(b).subtract(c.multiply(d));
    System.out.println("A*B - C*D = "+answer+" = "+answer.toString(16));

    // Make one and test its integrity.
    SplitLong sla = new SplitLong(A);
    System.out.println("A="+Long.toHexString(A)+" ("+sla.toString()+") = "+Long.toHexString(sla.longValue()));

    // Start small.
    SplitLong sl10 = new SplitLong(10);
    SplitLong sl15 = new SplitLong(15);
    SplitLong sl150 = sl10.mul(sl15);
    System.out.println("10="+sl10.longValue()+"("+sl10.toString()+") * 15="+sl15.longValue()+"("+sl15.toString()+") = "+sl150.longValue() + " ("+sl150.toString()+")");

    // The real thing.
    SplitLong slb = new SplitLong(B);
    SplitLong slc = new SplitLong(C);
    SplitLong sld = new SplitLong(D);
    System.out.println("B="+Long.toHexString(B)+" ("+slb.toString()+") = "+Long.toHexString(slb.longValue()));
    System.out.println("C="+Long.toHexString(C)+" ("+slc.toString()+") = "+Long.toHexString(slc.longValue()));
    System.out.println("D="+Long.toHexString(D)+" ("+sld.toString()+") = "+Long.toHexString(sld.longValue()));
    SplitLong sanswer = sla.mul(slb).sub(slc.mul(sld));
    System.out.println("A*B - C*D = "+sanswer+" = "+sanswer.longValue());

  }

}

Prints:

A*B - C*D = 9223372036854775807 = 7fffffffffffffff
A=7fffffffffffffff (7fffffff|ffffffff) = 7fffffffffffffff
10=10(0|a) * 15=15(0|f) = 150 (0|96)
B=7ffffffffffffffe (7fffffff|fffffffe) = 7ffffffffffffffe
C=7fffffffffffffff (7fffffff|ffffffff) = 7fffffffffffffff
D=7ffffffffffffffd (7fffffff|fffffffd) = 7ffffffffffffffd
A*B - C*D = 7fffffff|ffffffff = 9223372036854775807

which looks to me like it's working.

I bet I've missed some of the subtleties such as watching for sign overflow etc. but I think the essence is there.


I think this is an implementation of what @Ofir suggested.
i
i Code 4 Food

For the sake of completeness, since no one mentioned it, some compilers (e.g. GCC) actually provide you with a 128 bit integer nowadays.

Thus an easy solution could be:

(long long)((__int128)A * B - (__int128)C * D)

S
SomeWittyUsername

AB-CD = (AB-CD) * AC / AC = (B/C-D/A)*A*C. Neither B/C nor D/A can overflow, so calculate (B/C-D/A) first. Since the final result won't overflow according to your definition, you can safely perform the remaining multiplications and calculate (B/C-D/A)*A*C which is the required result.

Note, if your input can be extremely small as well, the B/C or D/A can overflow. If it's possible, more complex manipulations might be required according to input inspection.


That won't work as integer division loses information (the fraction of the result)
@Ofir that's correct, however you can't eat the cake and leave it untouched. You have to pay either by precision or by using additional resources (like you suggested in your answer). My answer is of mathematical nature while yours' is computer-oriented. Each can be correct depending on the circumstances.
You are correct - I should have phrased it as - will not give an exact result rather than won't work, as the math is correct. However, note in the cases that likely interest the question submitter (e.g. in the example in the question), the error will probably be surprisingly large - much larger than can be acceptable for any practical application. In any case - it was an insightful answer and I shouldn't have used that language.
@Ofir I don't think your language was inappropriate. The OP clearly requested a "correct" calculation, not one that would lose precision for the sake of being performed under extreme resource constraints.
A
Amir Saniyan

Choose K = a big number (eg. K = A - sqrt(A))

A*B - C*D = (A-K)*(B-K) - (C-K)*(D-K) + K*(A-C+B-D); // Avoid overflow.

Why?

(A-K)*(B-K) = A*B - K*(A+B) + K^2
(C-K)*(D-K) = C*D - K*(C+D) + K^2

=>
(A-K)*(B-K) - (C-K)*(D-K) = A*B - K*(A+B) + K^2 - {C*D - K*(C+D) + K^2}
(A-K)*(B-K) - (C-K)*(D-K) = A*B - C*D - K*(A+B) + K*(C+D) + K^2 - K^2
(A-K)*(B-K) - (C-K)*(D-K) = A*B - C*D - K*(A+B-C-D)

=>
A*B - C*D = (A-K)*(B-K) - (C-K)*(D-K) + K*(A+B-C-D)

=>
A*B - C*D = (A-K)*(B-K) - (C-K)*(D-K) + K*(A-C+B-D)

Note that Because A, B, C and D are big numbers, thus A-C and B-D are small numbers.


How do you choose K in practical? Besides, K*(A-C+B-D) might still overflow.
@ylc: Choose K = sqrt(A), Not that A-C+B-D is a small number. Because A, B, C and D are big numbers, thus A-C is small number.
If you choose K = sqrt(A), then (A-K)*(B-K) might overflow again.
@ylc: OK! I change it to A - sqrt(A) :)
Then K*(A-C+B-D) may overflow.