Number Representation
Signed Numbers
To evaluate how we store negative numbers, we measure against four key requirements:
- Sign bit: Clear indication of polarity ($0$ = positive, $1$ = negative).
- Consistency: Incrementing the bit pattern corresponds to a logical increase in value.
- Single Zero: Avoids logical ambiguity (prevents $+0$ versus $-0$ logic errors).
- Simple Arithmetic: Subtraction can use the same hardware as addition.
Comparison Table
| Method | Sign Bit? | Consistent? | Single Zero? | Simple Math? |
|---|---|---|---|---|
| Sign-Magnitude | Yes | No | No | No |
| One’s Complement | Yes | Yes | No | No |
| Two’s Complement | Yes | Yes | Yes | Yes |
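The table's claims can be checked by brute force over every 4-bit pattern. The sketch below is only an illustration: the function names (`sign_magnitude`, `ones_complement`, `twos_complement`, `zero_patterns`) are made up for this example, and the 4-bit width is an assumption chosen for readability.

```python
def sign_magnitude(bits: int, width: int = 4) -> int:
    """Top bit is the sign; the remaining bits are the magnitude."""
    sign = bits >> (width - 1)
    magnitude = bits & ((1 << (width - 1)) - 1)
    return -magnitude if sign else magnitude


def ones_complement(bits: int, width: int = 4) -> int:
    """A negative value is the bitwise inverse of its magnitude."""
    mask = (1 << width) - 1
    if bits >> (width - 1):
        return -((~bits) & mask)
    return bits


def twos_complement(bits: int, width: int = 4) -> int:
    """A negative value is the pattern minus 2**width."""
    if bits >> (width - 1):
        return bits - (1 << width)
    return bits


def zero_patterns(decode, width: int = 4) -> list[str]:
    """List every bit pattern that decodes to zero."""
    return [format(b, f"0{width}b") for b in range(1 << width)
            if decode(b, width) == 0]


print(zero_patterns(sign_magnitude))   # two patterns decode to zero
print(zero_patterns(ones_complement))  # two patterns decode to zero
print(zero_patterns(twos_complement))  # exactly one pattern

# Consistency: does 1001 -> 1010 (pattern + 1) increase the value?
print(sign_magnitude(0b1001), sign_magnitude(0b1010))    # -1 then -2
print(ones_complement(0b1001), ones_complement(0b1010))  # -6 then -5
```

Sign-magnitude fails the consistency check because incrementing a negative pattern grows the magnitude, which moves the value further below zero.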
Two’s Complement
- The Rule: To negate a number, invert all bits (NOT) and add 1.
- Why it wins: The CPU uses the same adder circuit for signed and unsigned integers. Subtraction is simply $A - B = A + (\overline{B} + 1)$.
- Example (4-bit): $5 = 0101_2$. Invert the bits to get $1010_2$, then add 1 to get $1011_2 = -5$.
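The negate-then-add rule can be sketched in a few lines. This is a minimal illustration assuming a 4-bit word; `WIDTH`, `negate`, `subtract`, and `to_signed` are hypothetical names, and the masking stands in for the fixed register width that real hardware has.

```python
WIDTH = 4                 # assumed word size for the example
MASK = (1 << WIDTH) - 1   # 0b1111, keeps results inside 4 bits


def negate(bits: int) -> int:
    """Invert all bits, add 1, and discard any carry out of the word."""
    return ((~bits & MASK) + 1) & MASK


def subtract(a: int, b: int) -> int:
    """Compute a - b using only addition: a + (NOT b + 1)."""
    return (a + negate(b)) & MASK


def to_signed(bits: int) -> int:
    """Interpret a 4-bit pattern as a two's complement value."""
    return bits - (1 << WIDTH) if bits & (1 << (WIDTH - 1)) else bits


print(format(negate(0b0101), "04b"))         # pattern for -5
print(to_signed(subtract(0b0011, 0b0101)))   # 3 - 5
print(to_signed(subtract(0b0111, 0b0010)))   # 7 - 2
```

The same `subtract` works whether the operands are read as signed or unsigned, which is the point of the scheme: only the interpretation of the result changes.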
Bias (Offset) Encoding
Store value as: $\text{stored} = \text{actual} + \text{bias}$.
- Purpose: Shifts the range so all stored bit patterns are non-negative.
- Benefit: Allows for unsigned comparison of signed values. This is why it is used for exponents in IEEE 754: non-negative floating-point numbers can be ordered by comparing their bit patterns as plain integers, which makes comparison and sorting cheaper.
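A small sketch of why the bias helps with ordering. It assumes an 8-bit field with a bias of 127 (the single-precision exponent parameters); `encode_biased`, `decode_biased`, and `encode_twos` are hypothetical helper names used only for the comparison.

```python
BIAS = 127  # assumed bias, matching an 8-bit exponent field


def encode_biased(value: int) -> int:
    """Shift the value so the stored pattern is a non-negative integer."""
    stored = value + BIAS
    if not 0 <= stored <= 0xFF:
        raise ValueError("value does not fit in 8 biased bits")
    return stored


def decode_biased(stored: int) -> int:
    """Undo the shift to recover the actual value."""
    return stored - BIAS


def encode_twos(value: int) -> int:
    """8-bit two's complement pattern, for contrast."""
    return value & 0xFF


values = [3, -126, 0, -1, 127]

# Sorting by the stored pattern, treated as an unsigned integer:
by_biased = sorted(values, key=encode_biased)
by_twos = sorted(values, key=encode_twos)

print(by_biased)  # numeric order is preserved
print(by_twos)    # negatives land after positives
```

With two's complement, an unsigned comparison puts every negative value above every positive one, so a signed comparator is needed. Biased patterns sort correctly with the unsigned one.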
Floating Point (IEEE 754)
Scientific Notation
Standard base-2: $\pm 1.f \times 2^{E}$, where $f$ is the fraction and $E$ is the exponent.
- The leading `1` is implicit (not stored) to maximize precision.
Single Precision (32-bit) Format
- Sign (1 bit): $0$ = positive, $1$ = negative.
- Exponent (8 bits): Biased by $127$.
- Significand (23 bits): The fractional part (mantissa).
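Normalized Formula: $(-1)^{s} \times 1.f \times 2^{e - 127}$, where $s$ is the sign bit, $e$ is the stored exponent, and $f$ is the significand bits.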
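To make the field layout concrete, the sketch below splits a single-precision value into its three fields and rebuilds it with the normalized formula $(-1)^{s} \times 1.f \times 2^{e-127}$. It uses Python's standard `struct` module; `float32_fields` and `normalized_value` are hypothetical names for this example, and the rebuild applies only to normalized numbers.

```python
import struct


def float32_fields(x: float) -> tuple[int, int, int]:
    """Return (sign, stored exponent, fraction bits) of x as a float32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction


def normalized_value(sign: int, exponent: int, fraction: int) -> float:
    """Apply (-1)^s * 1.f * 2^(e - 127); valid for normalized values only."""
    significand = 1 + fraction / 2**23   # restore the implicit leading 1
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)


s, e, f = float32_fields(-6.25)          # -6.25 = -1.5625 * 2^2
print(s, e, format(f, "023b"))
print(normalized_value(s, e, f))
```

For `-6.25` the stored exponent is 129, which is the true exponent 2 plus the bias of 127.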
Special Cases
| Category | Exponent | Significand | Value/Purpose |
|---|---|---|---|
| Zero | 0000 0000 | 0 | $\pm 0$ |
| Denormal | 0000 0000 | Non-zero | Underflow protection; no implicit leading 1 |
| Infinity | 1111 1111 | 0 | $\pm\infty$ |
| NaN | 1111 1111 | Non-zero | Not a Number (e.g., $0/0$) |
Denormalized Formula: Used for values too small for the standard format. The exponent is fixed at $-126$ and the leading digit is 0 instead of 1: $(-1)^{s} \times 0.f \times 2^{-126}$.
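The special cases reduce to a check on the exponent and significand fields. The sketch below classifies a raw 32-bit pattern and evaluates the denormalized formula $(-1)^{s} \times 0.f \times 2^{-126}$; `classify`, `denormal_value`, and `bits_to_float` are hypothetical names, and `struct` is used only to cross-check the result.

```python
import struct


def classify(bits: int) -> str:
    """Name the category of a 32-bit single-precision pattern."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0x00:
        return "zero" if fraction == 0 else "denormal"
    if exponent == 0xFF:
        return "infinity" if fraction == 0 else "nan"
    return "normalized"


def denormal_value(bits: int) -> float:
    """Apply (-1)^s * 0.f * 2^-126 (no implicit leading 1)."""
    sign = bits >> 31
    fraction = bits & 0x7FFFFF
    return (-1) ** sign * (fraction / 2**23) * 2.0 ** -126


def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a float32."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


for pattern in (0x00000000, 0x80000000, 0x00000001,
                0x7F800000, 0x7FC00000, 0x3F800000):
    print(f"{pattern:#010x}", classify(pattern))

# The smallest positive denormal matches the hardware interpretation.
print(denormal_value(0x00000001) == bits_to_float(0x00000001))
```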
Precision and Step Size
Step Size: The gap between consecutive floating-point numbers (ULP - Unit in the Last Place).
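- Normalized Step: $2^{E - 23}$, where $E$ is the unbiased exponent (stored exponent minus 127)
- Denormalized Step: $2^{-126} \times 2^{-23} = 2^{-149}$ (Constant gap)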
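One way to observe the step size is to add 1 to the raw bit pattern and measure how far the value moves. The sketch below assumes single precision and non-negative finite inputs below the largest float32; `step_after` and the two conversion helpers are hypothetical names.

```python
import struct


def float_to_bits(x: float) -> int:
    """Raw 32-bit pattern of x stored as a float32."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a float32."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def step_after(x: float) -> float:
    """Gap between x and the next float32 above it.

    Only meaningful for non-negative finite x below the largest float32.
    """
    bits = float_to_bits(x)
    return bits_to_float(bits + 1) - bits_to_float(bits)


print(step_after(1.0) == 2.0 ** -23)          # E = 0
print(step_after(1024.0) == 2.0 ** -13)       # E = 10
print(step_after(2.0 ** 24) == 2.0)           # E = 24
print(step_after(2.0 ** -130) == 2.0 ** -149) # denormal: constant gap
```

The gap doubles each time the exponent increases by one, while every denormal shares the same $2^{-149}$ gap.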
Key Implications
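- Relative Precision: The absolute step is smallest near zero and grows with magnitude, while the relative error stays roughly constant (about $2^{-23}$ for normalized single-precision values).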
- Inexact Representation: Most decimal numbers (like $0.1$) cannot be represented exactly in binary floating point.
- Absorption: If a number is large enough, adding $1$ to it does nothing because the “step” is larger than $1$. In single precision, for example, $2^{24} + 1$ rounds back to $2^{24}$.
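Both effects can be reproduced by rounding Python's double-precision floats to single precision. The sketch below does this with a `struct` round trip; `to_float32` is a hypothetical helper, and the choice of $2^{24}$ as the "large enough" number is an assumption that follows from the 23-bit significand.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


# Inexact representation: 0.1 has no finite binary expansion, so the
# float32 and double approximations of it differ.
print(to_float32(0.1) == 0.1)   # False
print(to_float32(0.5) == 0.5)   # True: 0.5 is a power of two

# Absorption: at 2^24 the float32 step is 2, so adding 1 is lost
# when the sum is rounded back to single precision.
big = to_float32(2.0 ** 24)
absorbed = to_float32(big + 1.0)
kept = to_float32(big + 2.0)
print(absorbed == big)          # True: the +1 vanished
print(kept == big + 2.0)        # True: +2 is a full step
```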