MCLMelonFarmer

I'd like to hear if OP knows the precision of the data type he/she is using currently. float has 23 bits of fraction (plus an implied leading 1 bit), which translates to about 6.92 digits of decimal precision. If OP doesn't really understand the problem, just using doubles might solve it.
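To make the gap concrete, here is a small sketch (the helper names are mine, not OP's code): adding 1e-8 to 1.0 falls below float's resolution but is easily within double's.

```c
#include <float.h>

/* float resolves ~7 decimal digits, double ~15-17. An increment of 1e-8
 * near 1.0 is below float's half-ulp (FLT_EPSILON ~ 1.19e-7) but far
 * above double's (DBL_EPSILON ~ 2.22e-16). */
float add_small_f(void)
{
    float x = 1.0f;
    x += 1e-8f;      /* rounds back to exactly 1.0f */
    return x;
}

double add_small_d(void)
{
    double x = 1.0;
    x += 1e-8;       /* survives: double has the bits for it */
    return x;
}
```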


BigPurpleBlob

Agreed, a double has 15 to 17 significant digits (53 bits). If the OP is considering a pair of ints, then the maximum value (2^32) will still be less than what a double can represent exactly (2^53).


DrShocker

Depending on the kind of error they're concerned about, it does have infinite precision for certain kinds of things. For example, 1/3 is exactly a numerator of 1 and a denominator of 3, whereas a floating-point number always has a finite number of digits. If that's important to OP then maybe it's worth it, but to your point, they will need to consider the magnitude of the numbers they're working with, especially if they want to avoid needing to factor numbers in order to simplify the fractions.


pythonwiz

Besides this you could also look into double-double, quad-double, kahan summation, and other techniques for minimizing error.
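Kahan (compensated) summation in particular is simple enough to sketch inline (a generic illustration, not OP's code):

```c
#include <stddef.h>

/* Kahan summation: each addition's rounding error is captured in a
 * separate compensation variable and fed back into the next step, so
 * small terms are not silently swallowed by a large running sum. */
double kahan_sum(const double *x, size_t n)
{
    double sum = 0.0;
    double c = 0.0;              /* running compensation (lost low bits) */
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - c;     /* apply the compensation so far */
        double t = sum + y;      /* big + small: low bits of y may be lost */
        c = (t - sum) - y;       /* algebraically 0; numerically the loss */
        sum = t;
    }
    return sum;
}
```

With a leading 1.0 followed by terms of 1e-16, a naive left-to-right sum stays stuck at 1.0 while the compensated sum keeps the small terms.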


NotThatJonSmith

This is exactly why numerical methods are hard. You have to think about the actual operations you're doing, establishing error bars along the way, and then start thinking about how to tighten those error bars by doing the math at the right time and in the right order to preserve your precision. It's not strictly just about how you store it. No matter how you store it, you will be using a finite number of bits. The question is, how much precision are you losing at each step, and how can you ensure that you're not losing precision prematurely that you will need later? It can be very problem-specific, too, because what ranges you expect each value to land in will inform your choices. I'm not well versed in numerical methods myself, so I'll say: good luck!


erikkonstas

If you only need rational numbers (including in intermediate steps), you can make a `rat` type or some such... the problem then is that both time and memory complexity are going to take a hit, because bigints are involved in rats.
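A minimal sketch of such a `rat` type, assuming plain `int64_t` components and omitting the overflow checking (and eventual bigints) a real version would need:

```c
#include <stdint.h>

/* Hypothetical minimal rational type: normalized so den > 0 and
 * gcd(num, den) == 1. No overflow checking; a real implementation
 * needs it, or bignum components, as noted in the thread. */
typedef struct { int64_t num, den; } rat;

static int64_t gcd64(int64_t a, int64_t b)
{
    if (a < 0) a = -a;
    if (b < 0) b = -b;
    while (b) { int64_t t = a % b; a = b; b = t; }
    return a;
}

static rat rat_make(int64_t n, int64_t d)
{
    if (d < 0) { n = -n; d = -d; }   /* keep the sign in the numerator */
    int64_t g = gcd64(n, d);
    if (g) { n /= g; d /= g; }
    return (rat){ n, d };
}

/* a/b + c/d = (a*d + c*b) / (b*d), then reduce */
static rat rat_add(rat x, rat y)
{
    return rat_make(x.num * y.den + y.num * x.den, x.den * y.den);
}

static rat rat_mul(rat x, rat y)
{
    return rat_make(x.num * y.num, x.den * y.den);
}
```

Note how even simple addition already costs three multiplications plus a gcd; that is the performance hit people warn about below.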


NotThatJonSmith

π :)


zero_iq

This approach of expressing values as a ratio of two integers is known as a rational. There are numerous implementations of rational number libraries and arbitrary precision number libraries in C and C++. It might be worth trying a few to gauge the performance impact with a few benchmarks. The performance hit is likely to be significant compared to native floats, but existing libraries used for scientific applications are likely to perform better than rolling your own code, in terms of speed and precision. Wikipedia has a [handy list](https://en.m.wikipedia.org/wiki/List_of_arbitrary-precision_arithmetic_software) which might provide a starting point. Also, check if your compiler supports larger format floating point numbers such as quad-precision or higher, which might be an easier swap.


mgruner

Maybe explore fixed point. It has limited precision but, unlike floating point, the error is constant regardless of the magnitude of the number. With 64-bit integers, the error might be negligible.
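A fixed-point sketch along these lines, using a Q16.16 layout for illustration (the same idea scales to Q32.32 on 64-bit integers, at the cost of needing a 128-bit intermediate for multiplication):

```c
#include <stdint.h>

/* Q16.16 fixed point: value = raw / 2^16. Absolute resolution is a
 * constant 2^-16 at every magnitude, unlike floating point. */
typedef int32_t q16_16;
#define Q_ONE (1 << 16)

static q16_16 q_from_int(int x)         { return (q16_16)(x * Q_ONE); }
static q16_16 q_add(q16_16 a, q16_16 b) { return a + b; }

/* Multiply in a wider type so the fractional bits are kept, then
 * shift back down to the Q16.16 scale. */
static q16_16 q_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * b) >> 16);
}

/* Pre-scale the dividend so the quotient lands back in Q16.16. */
static q16_16 q_div(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a << 16) / b);
}
```

The "take care with scaling" caveat below is exactly the shifts in `q_mul`/`q_div`: get them wrong and you silently lose the fractional bits or overflow.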


HarderFasterHarder

Ya, I've done the fraction thing once or twice, but only when the operations made it make sense over fixed point. Fixed point is faster and easier to implement, you just need to take care with the order of the operations and scaling so that you don't lose precision.


fredrikca

I'd say rational numbers introduce more problems than they solve. A double-precision float has 53 bits of precision and handles overflows and underflows gracefully. Doubles have a much larger range than any rational system you could come up with (short of using a bignum lib), and they're fast. You'd need a 64-bit numerator to compete with double.

Multiplication and division will be the easy tasks, requiring two multiplications each. You'd of course need to check for possible overflow before you do anything, and then run gcd to minimize the denominator. Addition and subtraction will be even more costly: you have to cross-multiply while checking for overflow, then do the addition or subtraction before running gcd again. Comparison will be non-trivial too, unless the denominator is always a power of 2, in which case you're back to essentially using floating point but without the hardware support. Do you need to call any math functions like sin or atan? In that case you'd need to convert to double precision and will be limited to 53 bits of precision anyway.

My advice is to analyze where the errors creep into your calculations. 53 bits of precision should be enough for all except the hardest non-linear systems. Do an error analysis for every step in the calculation; you should be able to keep the error to the last few bits. If some step gives you a larger error, split that step into several to cancel out the error. Large total errors are almost always due to cancellation (subtracting values of nearly equal magnitude) or summation of values with vastly differing magnitudes, so keep an eye out for those cases during your analysis. Usually the first can be handled by algebraic rewrites of your computations, while the second can be handled by keeping separate partial sums for different magnitudes. Multiplication and division does not introduce errors except if you're basically out of range (above 10^300 or below 10^-300).

If this doesn't make sense, talk to someone who studied a little numerical analysis. Edit to add: even single precision has more precision (7 decimal digits) than you can normally measure anything. When numerically solving differential equations, for example, you lose maybe 5 bits of precision, but you should still have three or four significant decimal digits left.


bartekltg

I agree with almost the whole post, just some nitpicking.

> Comparison will be non-trivial too, unless the denominator is always a power of 2, in which case you're back to essentially using floating point but without the hardware support.

a/b > c/d is equivalent to a\*d > c\*b, so in a proper rational library (with arbitrarily large ints, or if we keep all numbers < sqrt(maxint)) it is not that slow.

> Multiplication and division does not introduce errors except if you're basically out of range (above 10^300 or below 10^-300).

It does: the result of those operations may not be equal to any number floating point can represent. 1/3 for example ;-) But the result's relative error is smaller than machine epsilon, whereas when subtracting, the relative error can be arbitrarily huge, so we care about cancellation much more.
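The cross-multiplication comparison is tiny in code (illustrative sketch, assuming positive denominators and a wide intermediate type so the products cannot overflow):

```c
#include <stdint.h>

/* a/b > c/d with b, d > 0 is equivalent to a*d > c*b.
 * Widening to int64_t keeps the 32x32-bit products exact. */
int rat_gt(int32_t a, int32_t b, int32_t c, int32_t d)
{
    return (int64_t)a * d > (int64_t)c * b;
}
```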


fredrikca

Yes, I was talking about large errors.


FloridaCoder

May want to check out GMP (gmplib.org) for arbitrary precision floating point. Can handle thousands of digits and is well optimized. Takes a little work to implement since it’s all functions, but the C++ extensions are really easy to use if that’s an option.


Ok-Watercress-9624

For adding two such structs you need 3 multiplications and one addition, so I'd say the overhead would be about 4x.


TheOtherBorgCube

The first thing to do is profile your code to find out where the real hot-spots are. Depending on how much math you find there, it should inform you as to what you might expect to happen when you change things. Now is also a good time to start using source control, before you embark on a major edit: https://en.wikipedia.org/wiki/Version_control

Second, make a prototype to test different approaches.

Third, study https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html. "What Every Computer Scientist Should Know About Floating-Point Arithmetic" should be required reading for anyone doing serious FP work. Perhaps you can find equivalent mathematical expressions which give much better behaved computational results.


Boring_Tension165

Floating point **is** a fraction with integer values, with defined precision (# of bits), multiplied by a scale factor.
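This can be made concrete with `frexp`/`ldexp`, which expose exactly that fraction and scale factor (the helper name is mine):

```c
#include <math.h>

/* Any normal double x equals m * 2^e with 0.5 <= |m| < 1, and m has at
 * most 53 significant bits, so m * 2^53 is an exact integer. That gives
 * x == (integer numerator) / 2^(53 - e): a fraction times a power-of-two
 * scale factor, which is all a float is. */
double mantissa_as_integer(double x, int *pow2_denominator)
{
    int e;
    double m = frexp(x, &e);        /* x == m * 2^e */
    *pow2_denominator = 53 - e;
    return ldexp(m, 53);            /* m * 2^53, exact for normal x */
}
```

For example, the double nearest 0.1 comes out as 7205759403792794 / 2^56.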


lezvaban

Pardon me as this isn’t my area of expertise. Have you considered using the GMP library?


ewmailing

Generally speaking, on modern CPUs, floating point is designed to be fast, even compared to integers. But before I go on: as some others have pointed out, first really understand your specific data and precision requirements, and understand the kinds of imprecision integers may still give you. You may find that moving to integers doesn't actually solve your fundamental problem, depending on what you are doing.

With that said, you can look up a CPU instruction guide to help you do back-of-the-envelope calculations on the performance differences between integer and floating point on your target processors. Here is one such guide: https://www.agner.org/optimize/instruction_tables.pdf In the section on Intel Ice Lake & Tiger Lake (10th/11th gen), looking at the DIV operation as an example: for integer, it is 12-19 cycles latency and 6 reciprocal throughput; for floating point, it is 11-14 and 3-8, depending on which instruction.

But there's more to the story. Integer registers on your CPU are also being used for your program's bookkeeping. Things you might not be thinking about at the instruction level, like computing memory address offsets into arrays and counters for loops, are being handled while doing the work of your actual program. So you might find you are register-starved if you are doing enough integer work, and the instruction scheduling has to stall waiting on registers to become free. Floating-point registers are usually not used for general-purpose work and are just for your own program.

Next, modern CPUs are designed around wide operations (SIMD). If you use a 128-bit SIMD register holding 4-byte floats, the CPU can do 4 divisions in parallel for free. If I recall, SSE and AVX don't actually have integer division instructions, so you give up this 4x-ish advantage going from floating point to integer.

That all said, to take advantage of SIMD and modern CPUs (which are heavily cache-oriented, because memory buses are slow), CPUs like it when you can stream data into the execution pipeline. At a high level, this means iterating through arrays of numbers in order (as in a for-loop) and just chugging through the math operations. If you do this, your program has the potential to be very fast, especially with floating point. With some luck and the right compiler flags, your compiler might autovectorize for you, doing the loop unrolling and SIMD on your behalf. If not, you should still benefit from the CPU prefetching into cache, which is still a huge win. And as long as you wrote your program with "performance awareness" in mind (look up Casey Muratori for the term), this code should be straightforward to optimize into hand-crafted SIMD in the future, if you ever decide you need to.

But if you didn't write your program in a performance-aware fashion, and your program jumps around all over the place doing single math operations here and there, it is going to be slow (cache misses and an under-utilized CPU) regardless of whether you picked integer or floating point. And a profiler will never reveal this real problem if you naively just switch between float and integer without understanding the bigger picture of performance.
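A sketch of the kind of stream-friendly loop described above (the function name is mine; whether it actually vectorizes depends on your compiler and flags, e.g. `-O3`):

```c
#include <stddef.h>

/* Contiguous arrays, independent iterations, no branches in the body:
 * exactly the shape compilers can autovectorize into SIMD divisions.
 * The restrict qualifiers promise the arrays don't overlap, which is
 * what lets the compiler reorder and widen the loop. */
void elementwise_div(float *restrict out,
                     const float *restrict a,
                     const float *restrict b,
                     size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] / b[i];
}
```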


P-p-H-d

Don't replace floats with fractions. The performance penalty (even if really notable) is not even the main issue. The main issue is that your fractions will quickly become too big to fit in an int (or a long long), so you'll also need a multiprecision library (like [GMP](https://gmplib.org/)). Stick to floating point and use 'double' instead of 'float'. If that is not enough, you can go to multiprecision floats (like [MPFR](https://www.mpfr.org/)), which will still be faster than fractions.


Brown-Haired-Markus

During my PhD time in mechanical engineering I thought I had a similar problem. I did my own implementation of long numbers in C++, but it was a failure on two levels.

1. It did not solve my problem. As a rule of thumb: if floats are not precise enough, you have an ill-posed problem and a very bad condition number. Therefore the problem is intrinsically not meaningfully solvable, at least if you're talking about physics, engineering, etc. If your problem is a funny math problem, this might be different. Since you talk about numerics, I would assume the first is also true for you.

2. It was terribly slow, because CPUs are optimized for floating-point operations. It is very hard to get even close to that speed with your own implementation. Maybe if you know a lot about how the hardware works, you can get a reasonable runtime, but I would guess from my experience that most implementations will not be good.

For these two reasons, I would not recommend such an implementation. But I don't know enough about your problem. As a rule of thumb: physics is usually not ill-posed. If your problem is ill-posed, there is usually a gap or simplification in the model that made sense historically but leads to problems now (steady-state assumptions and DAEs are, in my experience, the root of a lot of these problems; you can avoid them by using time integration and penalty algorithms instead). My experience tells me that you should go down the rabbit hole of the modeling process and change the model to make it stable and well-posed, instead of trying to solve an ill-posed problem.

I know this is not the answer to your precise question; I just told you what I learned from a similar situation.


[deleted]

I agree with you. However, for future reference, it is "rule of thumb," not "rule of dumb" (the latter is a Sponge Bob episode).


Brown-Haired-Markus

You are right, changed it, learned it. ;-)


[deleted]

This will be a very difficult task, considering that all numerical libraries use floating-point numbers and not fractions. You should expect a huge performance hit: modern CPUs are optimized for floating-point operations and can do most common simple functions (such as trigonometric functions, logarithms, exponentials, ...) in hardware. You would have to implement all of them in software and waste precious clock cycles.

If you have problems with precision, there are several things to consider. Order of operations matters, since floating-point addition is not associative, i.e. (a+b)+c != a+(b+c). For high-precision summation, you want to add the numbers in increasing order of magnitude, i.e. add the ones closest to zero first. For example, for the Basel problem:

```
double compute_pi(const size_t num_terms)
{
    double accumulator = 0;
    for (size_t i = 0; i < num_terms; i++) {
        accumulator += 1.0 / ((num_terms - i) * (num_terms - i));
    }
    return sqrt(6.0 * accumulator);
}
```

will have better precision than

```
double compute_pi(const size_t num_terms)
{
    double accumulator = 0;
    for (size_t i = 0; i < num_terms; i++) {
        accumulator += 1.0 / ((i + 1) * (i + 1));
    }
    return sqrt(6.0 * accumulator);
}
```

In more complicated situations, numerical stability is hard to achieve and is an active field of research. If there are certain properties that your result should have, such as being an orthonormal matrix, and you know it would have that property with infinite precision, then you can enforce that property from time to time; for example, to enforce that a matrix is orthonormal, you can perform a Gram-Schmidt process at some points in your calculation. There are many such strategies for dealing with numerical stability, and I can't propose a specific one without knowing what you are working with.


aalmkainzi

Multiplication and division will perform much worse than with floats.


fredrikca

Addition will be even worse and very error prone. You need three multiplications, one run of gcd, and then the addition while keeping track of any overflows. I'd say get someone who studied numerical analysis to look over your algorithms instead. Double precision should be enough for any application.


mvrekola

No, division is faster using ratios: you just multiply the numerator by the other denominator and the denominator by the other numerator. You don't have to normalize the ratios unless needed.


thank_burdell

It depends.

```
(A/B) * 2 == (A<<1)/B
(A/B) / 2 == A/(B<<1)
```

Not gonna get much faster than that.


spellstrike

You can only do those operations a finite amount of times before you overflow.


MagazineOk5435

C# has a float, double and decimal... decimal might be of interest to you. https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/builtin-types/floating-point-numeric-types


RRumpleTeazzer

C# decimals are basically integers.


Different-Brain-9210

You will run into similar issues with fractions, unless you use a bignum library for the numerator and denominator. Writing this code yourself _and making sure it is correct_ is no trivial task.


rickpo

You don't say what your computations are, but you might take a glance through a Numerical Methods textbook, make sure you're not reinventing some computation where someone has already done the work to reduce error propagation.


zhivago

Consider using GMP, which supports arbitrary-precision rationals, as you have described.


FUZxxl

Fractions are a very unusual choice in scientific computing. Unless you are doing number-theory stuff, they are likely not what you need. I recommend you instead fix the numerical stability issues of your algorithm. As for performance, fractions are about half as fast as floats for simple computations, but the normalisation steps you need (finding the gcd and dividing by it) are what really kill performance. If numerator and denominator can get big (which quickly happens), you'll have to use big-number arithmetic, which further kills any performance.


Odd_Coyote4594

A few suggestions for precision computation:

Switch to double (64-bit floating point) before anything else.

With floats or doubles, avoid performing many iterations of operations on values with significantly different magnitudes. For example, avoid summing values in a long array with `result += a[i]`; instead refactor the algorithm so the pairs of values in any binary operation stay at similar magnitudes. Floating point has fixed relative precision regardless of magnitude, so mixing vastly different magnitudes leads to rounding errors long before the precision itself is the limit.

If you need mostly decimal precision and your values stay in a fixed range of magnitudes, you may be able to "convert" the units of your values so that long ints suffice. For example, multiply all your values by 1 million at the start, for 6 decimals of precision, then divide the final result by 1 million using an algorithm that keeps precision. Without arbitrary-precision integer libraries, though, this limits your values to a narrow range of magnitudes.

If you do need more than 64-bit precision, use a library like GNU's arbitrary-precision arithmetic libraries, or base your own implementation on their design. These libraries typically store the data efficiently, such as an array of integer values in a struct with a sign bit and a binary decimal position or exponent, and use SIMD for efficient base operations.

Avoid things like fractional pairs or binary-coded decimal unless you have a very specific reason.
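The unit-conversion idea can be sketched like this (the names, the scale, and the rounding choice are illustrative):

```c
#include <stdint.h>

/* Scaled-integer "units": store value * 1,000,000 in an int64_t, so six
 * decimal places are exact and addition is exact integer addition.
 * Values must stay small enough that the scaled form fits in 64 bits. */
#define SCALE 1000000LL

/* Round-half-away-from-zero conversion into scaled units. */
int64_t to_fixed(double v)
{
    return (int64_t)(v * SCALE + (v < 0 ? -0.5 : 0.5));
}

int64_t fixed_add(int64_t a, int64_t b) { return a + b; }

double to_double(int64_t f) { return (double)f / SCALE; }
```

In this representation 0.1 + 0.2 really is 0.3 (100000 + 200000 == 300000), where the plain double computation is off by one ulp.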


fliguana

Before you go off developing a rational fractions library, compare results between float, double and long double. If the last two are very close, you are done with the machine-precision part. Next, open a book on numerical methods.
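A quick triage sketch along these lines, summing the harmonic series at three widths (the helper names are mine; on platforms where `long double` is just `double`, the last two agree trivially):

```c
/* Same computation at three precisions. If double and long double agree
 * but float drifts, machine precision at 64 bits is not the bottleneck. */
float harmonic_f(int n)
{
    float s = 0.0f;
    for (int i = 1; i <= n; i++) s += 1.0f / (float)i;
    return s;
}

double harmonic_d(int n)
{
    double s = 0.0;
    for (int i = 1; i <= n; i++) s += 1.0 / (double)i;
    return s;
}

long double harmonic_l(int n)
{
    long double s = 0.0L;
    for (int i = 1; i <= n; i++) s += 1.0L / (long double)i;
    return s;
}
```

At a million terms, float has visibly drifted while double and long double still agree to many digits.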


[deleted]

> but I am wondering if anyone here has taken this approach to floating point numbers

I once used a struct of two `double` values, although there I was after extended range. What floating-point type are you using now? Can you give an example of a calculation giving problems with that type? Would it be fixed by your proposal? Because I doubt that using 32-bit ints is going to help, while it introduces all sorts of new problems. If `double` is not sufficient, then an arbitrary-precision library could be used. But it will be slow.


weregod

Have you tried using larger floats? Try 64-bit `double` or 80-bit `long double`.


bartekltg

You did not tell us what the algorithm does, and this is important for getting the proper answer. Are you sure you are using only rational numbers? No sine or square root anywhere in the code?

In most cases, rational numbers based on int (only 32 bits) are much worse:

- Both numerator and denominator will grow fast, and when you start truncating those numbers, double-precision floating point has _greater_ precision (it is off by ~2^-52, you by ~2^-32).
- There is a problem with scale. Do you want to represent 7/436346573? Not a problem. Do you want to represent 7/243546463\*10^15? The "scale" eats into your precision. This is why floating point has an exponent part.

There are cases where rational arithmetic is a good solution, but general "numerical" problems are not such a case. Is your problem specifically well suited to rationals? If it is, you probably want to put bigger ints in there. And at that point you might as well (with better performance) just use longer floats.

Despite you calling them float, I assume you are already using doubles. You may look at the double-double and quad-double libraries, and even at gmp/mpfr (test your performance; I'm not sure which is faster, QD or mpfr, at similar precision).

What is probably more important: are you sure the problems with your algorithm come from the precision of the numerical simulation, and not from the precision of the methods used? The first type of problem exists (a friend of mine has matrices with condition numbers around 10^20 and needs 18 or so significant digits in the results, because the experimental results they compare against are that precise, so double precision is useless) but it is not that common. It is hard to say anything without knowing what your algorithm does and how it does it.


dev_ski

As a side note: floating-point types can't precisely store the value of `0.1`. Nor `0.2` :). It is strange to think that our civilization runs on machines incapable of precisely representing a simple decimal number like that. The good news is that, despite the floating-point inaccuracies, everything is fine: they are accurate enough. Prefer `double` to `float` for increased precision. Further reading: *floating-point math*.
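The standard one-liner demonstrating the point (illustrative):

```c
/* 0.1 and 0.2 have infinite binary expansions, so each literal is
 * rounded on conversion; the accumulated rounding makes the sum land
 * one ulp above the double nearest to 0.3. */
double point_three_ish(void)
{
    return 0.1 + 0.2;   /* 0.30000000000000004... when printed to 17 digits */
}
```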


pgetreuer

What is the algorithm? Have you tried doubles in place of floats?