Title: Inline Assembler in Delphi (VII) - 128-bit integer arithmetic
Question: This article shows the use of inline assembler to work with 128-bit integers.
Answer:
Inline Assembler in Delphi (VII)
128-bit integer arithmetic
By Ernesto De Spirito edspirito@latiumsoftware.com
Part 1: Introduction
With 32 bits we can represent 2^32 different numbers, i.e. 4294967296 (~4 billion) different numbers, like signed integers from -2147483648 to +2147483647 or unsigned integers from 0 to 4294967295 (types Longint and Longword respectively).
That's enough for many purposes, for example holding the position of a byte within a 4 GB file, but sometimes we need more than that, and for those cases we have TLargeInteger (Windows unit) and Int64 (since Delphi 4) to represent 64-bit integers that can have 2^64 different values, i.e. 18446744073709551616 (~18 quintillion) values, from -9223372036854775808 to +9223372036854775807 (~9 quintillion, 18-19 decimal digits).
That number of digits is really more than enough for me, and right now I can't think of any practical use for more than that. Hey, not even Bill Gates counts his money in quintillions! ;) But from time to time I see someone in a forum asking for more digits than Int64 offers...
Anyway, whether useful or completely useless for a practical purpose, we'll see the implementation of many procedures and functions designed to work with 128-bit integers, that will serve for the purpose of showing examples of the basic assembler instructions. These "large integers", "big integers" or "huge integers" can hold 2^128 different values (38-39 decimal digits).
Representation of the huge integer
I called the new type Hugeint, but for example Bigint (big integer) or Int128 could have been good names. Largeint (large integer) could be confused with the type in the Windows.pas unit which refers to a 64-bit integer.
When it comes to the representation of the new type, there are also many ways to do it. I decided the simplest is to represent it as an array of four 32-bit integers:
 type
 Hugeint = packed array [0..3] of Longword;
I also decided to use the little-endian format since it's the standard in the Intel architecture, and this means that the first element of the array (lowest address) will hold the low-order (least-significant) 32 bits of the large integer, and the last element of the array (highest address) will hold the high-order (most-significant) 32 bits of the large integer.
This is how the numbers 5 and 5000000000 ($12A05F200) would be represented:
 +---- Low-order 32 bits
 |
 v
 +-------------+-------------+-------------+-------------+
 | $00000005 | $00000000 | $00000000 | $00000000 | = 5
 +-------------+-------------+-------------+-------------+
 0 1 2 3
 +-------------+-------------+-------------+-------------+
 | $2A05F200 | $00000001 | $00000000 | $00000000 | = 5000000000
 +-------------+-------------+-------------+-------------+ $12A05F200
 ^
 |
 High-order 32 bits ----+
Integers themselves are also stored in little-endian format (low-order byte first). If we see the byte representation of the numbers in a memory dump, it would look like this (byte values are represented in hexadecimal notation):
 $00000005
 +-------------+-------------+-------------+-------------+
 | 05 00 00 00 | 00 00 00 00 | 00 00 00 00 | 00 00 00 00 | = 5
 +-------------+-------------+-------------+-------------+
 0 1 2 3
 +-------------+-------------+-------------+-------------+
 | 00 F2 05 2A | 01 00 00 00 | 00 00 00 00 | 00 00 00 00 | = 5000000000
 +-------------+-------------+-------------+-------------+ $12A05F200
 $2A05F200 $00000001
However, for almost all operations we can make abstraction of the byte order and consider the 32-bit integers as atomic units, since the byte order is handled transparently.
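To make the layout concrete, here is a minimal sketch in plain Pascal (no assembler) that builds a value from its four elements and reads back individual bytes of its memory image. MakeHuge, HugeByte and THugeBytes are hypothetical helper names that do not appear in the attached source, and the byte order shown assumes a little-endian (Intel) machine.

```pascal
type
  Hugeint = packed array [0..3] of Longword;
  THugeBytes = packed array [0..15] of Byte;  // hypothetical: byte view of a Hugeint

// Builds a Hugeint from its four 32-bit elements, low-order element first.
function MakeHuge(w0, w1, w2, w3: Longword): Hugeint;
begin
  Result[0] := w0;
  Result[1] := w1;
  Result[2] := w2;
  Result[3] := w3;
end;

// Returns byte number Index (0..15) of the in-memory image of x.
function HugeByte(const x: Hugeint; Index: Integer): Byte;
begin
  Result := THugeBytes(x)[Index];
end;
```

With these helpers, MakeHuge($2A05F200, 1, 0, 0) represents 5000000000; HugeByte returns $F2 for byte 1 and $01 for byte 4 of that value, which matches the memory dump shown above.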
A few useful instructions
Before we begin, let's see some useful instructions that we might use in this article (mainly in the continuation of this part), but first allow me to say that it isn't the purpose of this article to actually teach you assembler. All I can do in this limited space is show you examples of some instructions. For reference material, I recommend these links:
Intel 80386 Reference Programmer's Manual
An HTML version of this Intel manual. The pseudo-code helps explain the instructions and their effects on the flags. Excellent.
http://people.freebsd.org/~jhb/386htm/toc.htm
There are some broken links, but the pages are there. Try finding them in the directory index (http://people.freebsd.org/~jhb/386htm/).
iAPx86 - Norton Guide
Not as explanatory as the above document, but it contains all the instructions from the 8086 to the Pentium and Pentium Pro, with size and timing information not included in the above document.
http://www.clipx.net/ng/iapx86/index.php
The IA-32 Intel Architecture Software Developer's Manual, Volume 2:
Instruction Set Reference
PDF Manual describing the instructions for the IA-32 processors (Pentium, Pentium Pro, Pentium II, Pentium III, Pentium 4, and Xeon). Includes pseudo-code to explain the instructions and how they affect the flags in the flags register.
http://www.intel.com/design/pentium4/manuals/245471.htm
Optimization
How to optimize for the Pentium family of microprocessors
Excellent optimization guide written by Agner Fog
http://fatphil.org/x86/pentopt/index.html
Optimizations for Intel's 32-Bit Processors
Another excellent optimization guide.
http://x86.ddj.com/ftp/manuals/686/optimgd.pdf
OK, now let's get to the instructions.
Reference:
 Z/ZF: Zero Flag
 S/SF: Sign Flag
 C/CF: Carry Flag
 P/PF: Parity Flag
 A/AF: Auxiliary Flag
 s: sign bit (high-order bit)
 o: odd bit (low-order bit)
 x: bit value
 0: namely, the value 0
 1: namely, the value 1
 r: bit is reversed from previous value
 u: bit is unchanged from previous value
 XX: unknown value (register, immediate, or memory reference)
In the examples it should be assumed the value of AL previous to each operation is sxxxxxxo (sign bit, 6 unknown bits, and odd bit).
Here are some instructions to begin:
 SHL al,1 AL := xxxxxxo0 CF := s Shift left
 SAL al,1 AL := xxxxxxo0 CF := s Synonym for SHL
 SHR al,1 AL := 0sxxxxxx CF := o Shift right
 SAR al,1 AL := ssxxxxxx CF := o Shift arithmetic right
 SAR al,7 AL := ssssssss CF := x This extends the sign bit
 ROL al,1 AL := xxxxxxos CF := s Rotate left
 ROR al,1 AL := osxxxxxx CF := o Rotate right
 RCL al,1 AL := xxxxxxoC CF := s Rotate thru carry left
 RCR al,1 AL := Csxxxxxx CF := o Rotate thru carry right
 AND al,al AL := uuuuuuuu CF := 0 Sets flags (see below)
 AND al,-1 AL := uuuuuuuu CF := 0 -1 = $FF = 11111111
 Sets flags (see below)
 AND al,$01 AL := 0000000u CF := 0 $01 = 00000001
 AND al,$80 AL := u0000000 CF := 0 $80 = 10000000
 AND al,$5A AL := 0u0uu0u0 CF := 0 $5A = 01011010
 AND al,0 AL := 00000000 CF := 0 XOR AL,AL or MOV AL,0 are better
 TEST AL,XX AL := uuuuuuuu
 TEST is like AND, but the result doesn't get stored in the
 destination. The result is used to set flags (see below).
 TEST AL,-1 It's usually better than AND AL,-1 and OR AL,AL because it
 doesn't write to AL, which allows certain optimizations in
 some cases.
 OR al,al AL := uuuuuuuu CF := 0 Sets flags (see below)
 OR al,$01 AL := uuuuuuu1 CF := 0 $01 = 00000001
 OR al,$80 AL := 1uuuuuuu CF := 0 $80 = 10000000
 OR al,$5A AL := u1u11u1u CF := 0 $5A = 01011010
 OR al,-1 AL := 11111111 CF := 0 Like MOV AL,-1 (but sets flags)
 XOR al,al AL := 0 CF := 0 Use MOV AL,0 to preserve flags
 XOR al,$5A AL := ururruru CF := 0 $5A = 01011010
 XOR al,-1 AL := rrrrrrrr CF := 0 Same as NOT AL
Except for the rotation instructions (ROL, RCL, ROR, and RCR) all of the above set SF, ZF and PF based on the result of the operation:
 SF = value of the high-order bit of the result
 ZF = 1 ("set") if the result is zero, 0 ("cleared") otherwise
 PF = 1 ("set") if the low-order byte of the result contains an even
 number of 1 bits, 0 ("cleared") otherwise
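As a rough illustration of these three rules, the following Pascal sketch derives the flags from an 8-bit result such as the value left in AL. The function names SignFlag, ZeroFlag and ParityFlag are hypothetical and only model what the processor does in hardware.

```pascal
// Sign Flag: the high-order bit of the result.
function SignFlag(b: Byte): Boolean;
begin
  Result := (b and $80) <> 0;
end;

// Zero Flag: set when the result is zero.
function ZeroFlag(b: Byte): Boolean;
begin
  Result := b = 0;
end;

// Parity Flag: set when the low-order byte holds an even number of 1 bits.
function ParityFlag(b: Byte): Boolean;
var
  i, Ones: Integer;
begin
  Ones := 0;
  for i := 0 to 7 do
    if (b and (1 shl i)) <> 0 then
      Inc(Ones);
  Result := not Odd(Ones);
end;
```

For example, a result of $FF (eight 1 bits) gives SF = 1, ZF = 0 and PF = 1, while $7F (seven 1 bits) gives SF = 0, ZF = 0 and PF = 0.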
Let's see more instructions:
 STC CF := 1 Set Carry Flag
 CLC CF := 0 Clear Carry Flag
 CMC CF := r Complement Carry Flag
 LAHF AH := SZxAxPxC
 SAHF Assuming AH is SZxAxPxC:
 SF := S; ZF := Z; AF := A; PF := P; CF := C
 SETc AL AL := CF Set if carry
 SETs AL AL := SF Set if sign
 SETz AL AL := ZF Set if zero
 SETe AL AL := ZF Set if equal (synonym of SETZ)
 SETp AL AL := PF Set if parity
 SETpe AL AL := PF Set if parity even (synonym of SETP)
 SETo AL AL := OF Set if overflow
 SETnc AL AL := NOT CF Set if not carry
 SETns AL AL := NOT SF Set if not sign
 SETnz AL AL := NOT ZF Set if not zero
 SETne AL AL := NOT ZF Set if not equal (synonym of SETNZ)
 SETnp AL AL := NOT PF Set if not parity
 SETpo AL AL := NOT PF Set if parity odd (synonym of SETNP)
 SETno AL AL := NOT OF Set if not overflow
 SETa (or SETNbe), SETae (or SETnb), SETb (or SETnae), SETbe (SETna),
 SETg (or SETNle), SETge (or SETnl), SETl (or SETnge), and SETle
 (SETng) set the destination byte to 1 or 0 depending on whether the
 specified condition is met or not.
 ADD AL,XX AL := AL+XX CF := 1 if operation generated a carry
 0 otherwise
 SUB AL,XX AL := AL-XX CF := 1 if operation needed a borrow
 0 otherwise
 SUB AL,0 AL := uuuuuuuu Set flags based on AL
 SUB AL,AL AL := 0 Same as XOR AL,AL or MOV AL,0
 CMP AL,XX CMP is like SUB, but the result doesn't get stored in the
 destination. The operation simply sets the flags
 ADC AL,XX AL := AL+XX+C CF := 1 if operation generated a carry
 0 otherwise
 SBB AL,XX AL := AL-C-XX CF := 1 if operation needed a borrow
 0 otherwise
 NEG AL AL := -AL CF := 1 if previous AL <> 0
 NOT AL; INC AL is the same
 NOT AL AL := rrrrrrrr CF := u XOR AL,-1 is the same
Conversion functions
These functions will help us understand the representation of these huge integers.
Longword to Hugeint
Let's start by converting a Longword into a huge integer. The lowest 32 bits of the result will be the 32 bits of the parameter and the higher 96 bits will be zero.
 function UToHugeint(const x: Longword): Hugeint; overload;
 // Result := Hugeint(x);
 // Parameters: EAX = x; EDX = @Result;
 asm
 xor ecx, ecx // ECX := 0;
 mov [edx+_0_], eax // Result[0] := x;
 mov [edx+_1_], ecx // Result[1] := 0;
 mov [edx+_2_], ecx // Result[2] := 0;
 mov [edx+_3_], ecx // Result[3] := 0;
 end;
Comments:
* "_0_", "_1_", "_2_", and "_3_"? What are these?
 They are constants that represent the offsets of the four elements of
 the array, allowing us to write cleaner code.
 const
 _0_ = 0;
 _1_ = 4;
 _2_ = 8;
 _3_ = 12;
Longint to Hugeint
The lowest 32 bits of the result will be the 32 bits of the parameter. If the number is positive or zero, then the higher 96 bits will be 0, but if the number is negative, the higher 96 bits will be 1.
It might seem like we need to make a comparison or test the sign and then to perform a conditional jump based on the result:
 function ToHugeint(const x: Longint): Hugeint; overload;
 // Result := Hugeint(x);
 // Parameters: EAX = x; EDX = @Result;
 asm
 or eax, eax // EAX := EAX or EAX; // EAX remains unchanged
 // Side effect: SF (Sign Flag) := EAX < 0
 mov ecx, 0 // ECX := 0;
 jns @@not_negative // if not SF then goto @@not_negative;
 dec ecx // ECX := ECX - 1; // 0 - 1 = -1 = $FFFFFFFF
 @@not_negative:
 mov [edx+_0_], eax // Result[0] := x;
 mov [edx+_1_], ecx // Result[1] := ECX; // 0 or $FFFFFFFF
 mov [edx+_2_], ecx // Result[2] := ECX; // 0 or $FFFFFFFF
 mov [edx+_3_], ecx // Result[3] := ECX; // 0 or $FFFFFFFF
 end;
Comments:
Notice the use of "MOV ECX, 0" instead of "XOR ECX, ECX" to avoid changing the state of the Sign Flag (SF) set in the preceding instruction (OR) and then used in the conditional jump that appears in the following instruction (JNS). Of course we could have changed the order of the operations for this to be unnecessary...
Instead of:
 or eax, eax
 jns @@not_negative
the following pairs of instructions would achieve the same:
 * and eax, eax // EAX keeps the value, but SF gets the sign
 jns @@not_negative // if SF = 0 then goto @@not_negative
 * test eax, $80000000 // result will be zero only if sign bit is 0
 jz @@not_negative // if ZF then goto @@not_negative
 * test eax, $87654321 // any value with bit 31 (sign bit) set
 jns @@not_negative // if SF = 0 then goto @@not_negative
 * cmp eax, 0 // compares eax with 0
 jge @@not_negative // if greater or equal then goto @@not_negative
Notice the use of "DEC ECX" to turn the value of ECX from $00000000 to $FFFFFFFF (by decrementing the value of the register). "NOT ECX" would have accomplished the same thing (by inverting the bits), at the same speed, and taking the same number of bytes to code the instruction, but NOT isn't a pairable instruction while DEC is. For this reason NOT is usually avoided and substituted as follows:
If you know beforehand that the previous value is 0, use DEC Dest
If you know beforehand that the previous value is -1, use INC Dest
If you don't know what the previous value is, use XOR Dest, -1
Also notice in the order of the instructions that we never used a register that was set in the immediately previous instruction. This is one of the conditions for pairing to occur. You'll find more information about instruction pairing in the documents about optimization that we recommended above.
We can simplify the function thanks to the CDQ instruction which extends the sign of EAX into EDX. This is basically how CDQ works:
 if EAX >= 0 then
 EDX := $0
 else
 EDX := $FFFFFFFF;
Here's a smaller and simpler implementation using CDQ:
 function ToHugeint(const x: Longint): Hugeint; overload;
 // Result := Hugeint(x);
 // Parameters: EAX = x; EDX = @Result;
 asm
 mov ecx, edx // ECX := @Result;
 cdq // EDX := IIF(x>=0, 0, $FFFFFFFF);
 mov [ecx+_0_], eax // Result[0] := x;
 mov [ecx+_1_], edx // Result[1] := EDX; // 0 or $FFFFFFFF
 mov [ecx+_2_], edx // Result[2] := EDX; // 0 or $FFFFFFFF
 mov [ecx+_3_], edx // Result[3] := EDX; // 0 or $FFFFFFFF
 end;
CDQ is usually replaced using MOV and SAR, which offer the advantage that the source doesn't have to be EAX and the destination doesn't have to be in EDX (plus they are pairable instructions). Let's see an example:
 function ToHugeint(const x: Longint): Hugeint; overload;
 // Result := Hugeint(x);
 // Parameters: EAX = x; EDX = @Result;
 asm
 mov ecx, eax // ECX := x;
 sar ecx, 31 // ECX := IIF(x>=0, 0, $FFFFFFFF);
 mov [edx+_0_], eax // Result[0] := x;
 mov [edx+_1_], ecx // Result[1] := ECX; // 0 or $FFFFFFFF
 mov [edx+_2_], ecx // Result[2] := ECX; // 0 or $FFFFFFFF
 mov [edx+_3_], ecx // Result[3] := ECX; // 0 or $FFFFFFFF
 end;
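For comparison, the same sign extension can be written in plain Pascal. This is only a reference sketch to check the assembler versions against; LongintToHuge is a hypothetical name, not one of the overloads above.

```pascal
type
  Hugeint = packed array [0..3] of Longword;

// Result := Hugeint(x): the low element receives x, the three high
// elements receive 32 copies of the sign bit of x.
function LongintToHuge(x: Longint): Hugeint;
var
  Fill: Longword;
begin
  if x < 0 then
    Fill := $FFFFFFFF   // what CDQ or SAR 31 produce for a negative value
  else
    Fill := 0;
  Result[0] := Longword(x);
  Result[1] := Fill;
  Result[2] := Fill;
  Result[3] := Fill;
end;
```

Converting -2 this way gives the elements $FFFFFFFE, $FFFFFFFF, $FFFFFFFF, $FFFFFFFF, while converting 5 gives 5, 0, 0, 0.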
Hugeint to Longint
A Hugeint can be converted to a Longint by simply taking the low-order 32 bits. The high-order 96 bits of the Hugeint should be all 0 or all 1, matching the sign bit of what would be the result (bit 31), for the Hugeint value to be in the range of a Longint, but the function doesn't check for that and performs the conversion blindly (in the same way that a Longint is converted to a Shortint, for example).
 function ToLongint(const x: Hugeint): Longint; overload;
 // Result := Longint(x);
 // No exception is raised if the value is not within
 // range (high-order 96 bits are discarded).
 // Parameters: EAX = @x;
 asm
 mov eax, [eax+_0_] // Result := x[0];
 end;
Int64 to Hugeint
Int64 parameters are passed on the stack, so functions with an Int64 parameter will automatically create a stack frame. The lowest 64 bits of the result will be the 64 bits of the parameter, and the higher 64 bits of the result will extend the sign bit of the high-order integer that makes up the int64 value.
 {$IFDEF DELPHI4}
 function ToHugeint(const x: Int64): Hugeint; overload;
 // Result := Hugeint(x);
 // Parameters: x on the stack; EAX = @Result;
 asm
 mov edx, dword[x+_0_] // EDX := x[0];
 mov ecx, dword[x+_1_] // ECX := x[1];
 mov [eax+_0_], edx // Result[0] := x[0];
 mov [eax+_1_], ecx // Result[1] := x[1];
 sar ecx, 31 // ECX := IIF(x[1]>=0, 0, $FFFFFFFF);
 mov [eax+_2_], ecx // Result[2] := ECX; // 0 or $FFFFFFFF
 mov [eax+_3_], ecx // Result[3] := ECX; // 0 or $FFFFFFFF
 end;
 {$ENDIF}
Int64 values are stored in little-endian format, so the low-order integer comes first, at offset 0 from the base address of the variable, and the high-order integer comes second, at offset 4 from the base address of the variable. In this case the base address of the variable is EBP+8 (see the first chapter of this series of articles), so the first element is at EBP+8 (EBP+8+0), and the second element is at EBP+12 (EBP+8+4). I could have used EBP+8 and EBP+12 to address the elements, but "x+_0_" and "x+_1_" refer to these addresses more transparently. The "DWORD" size specifier is mandatory since the assembler takes "x+_0_" and "x+_1_" as pointers to 64-bit data (because "x" is considered a pointer to 64-bit data) and doesn't allow moving the referenced value to a 32-bit register.
Hugeint to Int64
A Hugeint can be converted to an Int64 by simply taking the low-order 64 bits. The high-order 64 bits of the Hugeint should be all 0 or all 1, matching the sign bit of what would be the result (bit 63), for the Hugeint value to be in the range of an Int64, but the function doesn't check for that and performs the conversion blindly:
 {$IFDEF DELPHI4}
 function ToInt64(const x: Hugeint): Int64; overload;
 // Result := Int64(x)
 // No exception is raised if the value is not within
 // range (high-order 64 bits are discarded).
 // Parameters: EAX = @x;
 asm
 mov edx, [eax+_1_] // EDX := x[1];
 mov eax, [eax+_0_] // EAX := x[0];
 // Result = EDX:EAX = x[1]:x[0]
 end;
 {$ENDIF}
Comment:
Int64 return values should be placed in the EDX (high-order 32 bits) and EAX (low-order 32 bits).
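The two Int64 conversions can also be expressed in plain Pascal, which makes it easy to verify that they round-trip. This is a sketch under the assumption that only shifts and masks are used; Int64ToHuge and HugeToInt64 are hypothetical names.

```pascal
type
  Hugeint = packed array [0..3] of Longword;

// Low 64 bits come from x; the high 64 bits repeat the sign bit of x.
function Int64ToHuge(x: Int64): Hugeint;
var
  Fill: Longword;
begin
  if x < 0 then
    Fill := $FFFFFFFF
  else
    Fill := 0;
  Result[0] := Longword(x and $FFFFFFFF);           // low-order 32 bits
  Result[1] := Longword((x shr 32) and $FFFFFFFF);  // high-order 32 bits
  Result[2] := Fill;
  Result[3] := Fill;
end;

// Keeps the low-order 64 bits and discards the rest without any range check.
function HugeToInt64(const x: Hugeint): Int64;
begin
  Result := (Int64(x[1]) shl 32) or Int64(x[0]);
end;
```

Converting -5000000000 to a Hugeint and back returns -5000000000, and the positive value 5000000000 yields the elements $2A05F200, $00000001, 0, 0 seen in the diagrams of the introduction.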
More assembler instructions
In the source code example (attached) you'll find the implementation of some functions to operate with the Hugeint data type. The purpose is to exemplify the instructions we've seen so far along with some new ones:
 BT (Bit Test):
 BT dword ptr [eax], edx -- CF = value of the EDXth bit in the
 memory pointed by EAX
 BTS (Bit Test and Set):
 BTS dword ptr [eax], edx -- sets to 1 the EDXth bit in the memory
 pointed by EAX
 CF = previous value of that bit
 BTR (Bit Test and Reset):
 BTR dword ptr [eax], edx -- sets to 0 the EDXth bit in the memory
 pointed by EAX
 CF = previous value of that bit
 BTC (Bit Test and Complement):
 BTC dword ptr [eax], edx -- toggles the value of the EDXth bit in
 the memory pointed by EAX
 CF = previous value of that bit
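When the bit index is in a register and the operand is in memory, these instructions can address any of the 128 bits of a Hugeint, not just the first 32. A plain Pascal model of the four operations might look like the sketch below. The names BitMask, HugeBitTest, HugeWithBitSet, HugeWithBitReset and HugeWithBitToggled are hypothetical, and unlike the instructions they return a modified copy instead of changing memory in place and reporting the old bit in CF.

```pascal
type
  Hugeint = packed array [0..3] of Longword;

const
  HugeZero: Hugeint = (0, 0, 0, 0);

// Mask selecting bit (Bit mod 32) inside one 32-bit element.
function BitMask(Bit: Integer): Longword;
begin
  Result := Longword(1) shl (Bit and 31);
end;

// Like BT: value of bit number Bit (0..127); element index is Bit div 32.
function HugeBitTest(const x: Hugeint; Bit: Integer): Boolean;
begin
  Result := (x[Bit shr 5] and BitMask(Bit)) <> 0;
end;

// Like BTS: copy of x with the bit set to 1.
function HugeWithBitSet(const x: Hugeint; Bit: Integer): Hugeint;
begin
  Result := x;
  Result[Bit shr 5] := Result[Bit shr 5] or BitMask(Bit);
end;

// Like BTR: copy of x with the bit cleared to 0.
function HugeWithBitReset(const x: Hugeint; Bit: Integer): Hugeint;
begin
  Result := x;
  Result[Bit shr 5] := Result[Bit shr 5] and (BitMask(Bit) xor $FFFFFFFF);
end;

// Like BTC: copy of x with the bit inverted.
function HugeWithBitToggled(const x: Hugeint; Bit: Integer): Hugeint;
begin
  Result := x;
  Result[Bit shr 5] := Result[Bit shr 5] xor BitMask(Bit);
end;
```

Setting bit 100 of zero, for instance, touches only the last element (100 div 32 = 3) and leaves it with the value $10 (bit 100 mod 32 = 4).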
We won't reproduce the functions here since you can find them in the source code attached, but we'll show different possible implementations of the function _IsNeg, simply to provide more examples of the instructions we've seen so far:
 function _IsNeg(x: Hugeint): boolean;
 // Result := x < 0;
 // Parameters: EAX = @x
 asm
 mov eax, [eax+_3_] // EAX := High order 32 bits of x
 shr eax, 31 // AL := High order bit of EAX (sign bit)
 end;
 function _IsNeg(x: Hugeint): boolean;
 asm
 cmp dword ptr [eax+_3_], 0 // if x[3] < 0 then
 jl @@negative // goto @@negative
 mov al, 0 // Result := False;
 ret // exit;
 @@negative: // @@negative:
 mov al, 1 // Result := True;
 end;
 function _IsNeg(x: Hugeint): boolean;
 asm
 // set the Sign Flag and then put it in AL
 mov eax, [eax+_3_] // EAX := High order 32 bits of x
 or eax, eax // SF := Sign bit of EAX
 // alt.: add eax, 0
 // also: sub eax, 0
 // also: and eax, eax
 // also: and eax, -1 // or any negative value
 // also: test eax, eax
 // also: test eax, -1 // or any negative value
 sets al // AL := SF; // Sign Flag
 // alt.: lahf; shr ax, 15
 // also: lahf; rol ax, 1; and al, $1
 end;
 function _IsNeg(x: Hugeint): boolean;
 asm
 // set the Carry Flag with the Sign Bit to put it in AL
 mov eax, [eax+_3_] // EAX := High order 32 bits of x
 bt eax, 31 // CF := Sign bit of EAX
 // alt.: shl/rol/rcl eax, 1
 setc al // AL := CF; // Carry Flag
 // alt.: mov al, 0; rcl al, 1
 // also: mov al, 0; adc al, al
 // also: lahf; mov al, ah; and al, $1
 // also: lahf; ror/rcr/shr/sar ax, 1; shr al, 7
 // also: lahf; ror/shr/sar ax, 8; and al, $1
 // also: lahf; rol ax, 8; and al, $1
 // also: lahf; rcl ax, 9; and al, $1
 end;
 function _IsNeg(x: Hugeint): boolean;
 asm
 // set the Parity Flag and then negate it in AL
 mov al, [eax+_3_+3] // AL := High order 8 bits of x
 and al, $80 // PF := Not Sign bit (AL holds 0 or 1 set bits)
 // alt.: or al, $7F; setp al // here PF = Sign bit
 setnp al // AL := Not PF; // Not Parity Flag
 // alt.: lahf; rol ax, 6 / rcl ax, 7;
 // xor al,-1 / not al; and al, $1;
 // also: lahf; ror/shr/sar/rcr ax, 10;
 // xor al,-1 / not al; and al, $1;
 end;
In the next part we'll see functions to add, subtract, multiply and divide huge integers.
Part 2: The four fundamental operations
In this second and last part we'll finally get to see the actual arithmetic, with the four fundamental operations (addition, subtraction, multiplication and division).
Before getting into them I'd like to say that the procedures and functions introduced in the preceding part have been corrected and further optimized. I still haven't been able to test them as much as I'd have liked to. If you find any bugs or have any comments about the source code, please drop me an email.
Addition
How do we add two numbers, each made up of four 32-bit integers? Well, it's actually pretty easy. We simply add them in the same way that we would add two numbers of four decimal digits (like 3597 and 0015 for instance), except that here each "digit" can have about 4 billion different (2^32) values instead of just ten. The algorithm would be like this:
 function AddWithCarry(x: Longint; y: Longint;
 var Carry: Boolean): Longint; forward;
 function HugeAdd(x: Hugeint; y: Hugeint): Hugeint;
 // Result := x + y;
 var
 Carry: Boolean;
 begin
 Carry := False;
 Result[0] := AddWithCarry(x[0], y[0], Carry);
 Result[1] := AddWithCarry(x[1], y[1], Carry);
 Result[2] := AddWithCarry(x[2], y[2], Carry);
 Result[3] := AddWithCarry(x[3], y[3], Carry);
 end;
AddWithCarry is a fictitious function which returns an integer with the low order 32 bits of the result of the addition of the two arguments, plus 1 if Carry (the third argument) is True. It also stores True or False to the Carry (passed by reference) depending on whether the addition generated a carry or not (or whether the carry is 1 or 0, if you want to see it that way). Actually, this function doesn't have to be fictitious:
 function AddWithCarry(x: Longint; y: Longint;
 var Carry: Boolean): Longint;
 asm
 // if Carry then CF := 1 else CF := 0;
 test byte ptr [ecx], -1 // Side-effect: CF := 0;
 jz @@NoCarry
 stc // CF := 1;
 @@NoCarry:
 // Result := x + y + CF; CF := GeneratedCarry;
 adc eax, edx
 // Carry := CF;
 setc byte ptr [ecx]
 end;
It would be more efficient to have HugeAdd coded entirely in assembler:
 function HugeAdd(x: Hugeint; y: Hugeint): Hugeint;
 // Result := x + y;
 // Parameters: EAX = @x; EDX = @y; ECX = @Result
 asm
 push esi
 mov esi, [eax+_0_] // ESI := x[0];
 add esi, [edx+_0_] // ESI := ESI + y[0];
 mov [ecx+_0_], esi // Result[0] := ESI;
 mov esi, [eax+_1_] // ESI := x[1];
 adc esi, [edx+_1_] // ESI := ESI + y[1] + Carry;
 mov [ecx+_1_], esi // Result[1] := ESI;
 mov esi, [eax+_2_] // ESI := x[2];
 adc esi, [edx+_2_] // ESI := ESI + y[2] + Carry;
 mov [ecx+_2_], esi // Result[2] := ESI;
 mov esi, [eax+_3_] // ESI := x[3];
 adc esi, [edx+_3_] // ESI := ESI + y[3] + Carry;
 mov [ecx+_3_], esi // Result[3] := ESI;
 pop esi
 end;
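As a cross-check for the assembler version, the same carry propagation can be sketched in plain Pascal by doing each partial sum in an Int64, where bit 32 of the sum plays the role of the Carry Flag. HugeAddPas and MakeHuge are hypothetical names, and the elements are treated as unsigned.

```pascal
type
  Hugeint = packed array [0..3] of Longword;

// Builds a Hugeint from its four 32-bit elements, low-order element first.
function MakeHuge(w0, w1, w2, w3: Longword): Hugeint;
begin
  Result[0] := w0;
  Result[1] := w1;
  Result[2] := w2;
  Result[3] := w3;
end;

// Result := x + y; a carry out of the highest element is discarded,
// just as the final ADC leaves it behind in the Carry Flag.
function HugeAddPas(const x, y: Hugeint): Hugeint;
var
  i: Integer;
  Sum, Carry: Int64;
  R: Hugeint;          // local copy, so Result may safely alias x or y
begin
  Carry := 0;
  for i := 0 to 3 do
  begin
    Sum := Int64(x[i]) + Int64(y[i]) + Carry;  // needs at most 33 bits
    R[i] := Longword(Sum and $FFFFFFFF);       // low-order 32 bits
    Carry := Sum shr 32;                       // 0 or 1
  end;
  Result := R;
end;
```

Adding 1 to $FFFFFFFF shows the carry moving into the second element: the result is 0, 1, 0, 0, i.e. $100000000.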
Subtraction
Subtraction works very much like addition, but instead of generating a carry, the operation generates a borrow (also represented by the Carry Flag) if the minuend (first operand) is less than the subtrahend (second operand):
 function SubtractWithBorrow(x: Longint; y: Longint;
 var Borrow: Boolean): Longint; forward;
 function HugeSub(x: Hugeint; y: Hugeint): Hugeint;
 // Result := x - y;
 var
 Borrow: Boolean;
 begin
 Borrow := False;
 Result[0] := SubtractWithBorrow(x[0], y[0], Borrow);
 Result[1] := SubtractWithBorrow(x[1], y[1], Borrow);
 Result[2] := SubtractWithBorrow(x[2], y[2], Borrow);
 Result[3] := SubtractWithBorrow(x[3], y[3], Borrow);
 end;
 function SubtractWithBorrow(x: Longint; y: Longint;
 var Borrow: Boolean): Longint;
 asm
 // if Borrow then CF := 1 else CF := 0;
 test byte ptr [ecx], -1 // Side-effect: CF := 0;
 jz @@NoBorrow
 stc // CF := 1;
 @@NoBorrow:
 // Result := x - y - CF; CF := NeededBorrow;
 sbb eax, edx
 // Borrow := CF;
 setc byte ptr [ecx]
 end;
You should now be ready to write a pure assembler version of HugeSub: it's the same as HugeAdd, and all you have to do is replace ADD and ADC with SUB and SBB respectively.
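If you want something to test your assembler HugeSub against, a plain Pascal reference could look like the following sketch. HugeSubPas and MakeHuge are hypothetical names; a negative partial difference stands in for the borrow that SBB would take from the Carry Flag.

```pascal
type
  Hugeint = packed array [0..3] of Longword;

// Builds a Hugeint from its four 32-bit elements, low-order element first.
function MakeHuge(w0, w1, w2, w3: Longword): Hugeint;
begin
  Result[0] := w0;
  Result[1] := w1;
  Result[2] := w2;
  Result[3] := w3;
end;

// Result := x - y; a borrow out of the highest element is discarded.
function HugeSubPas(const x, y: Hugeint): Hugeint;
var
  i: Integer;
  Diff, Borrow: Int64;
  R: Hugeint;          // local copy, so Result may safely alias x or y
begin
  Borrow := 0;
  for i := 0 to 3 do
  begin
    Diff := Int64(x[i]) - Int64(y[i]) - Borrow;
    Borrow := Ord(Diff < 0);                 // 1 when a borrow was needed
    R[i] := Longword(Diff and $FFFFFFFF);    // wraps a negative difference
  end;
  Result := R;
end;
```

Subtracting 1 from 0 borrows through all four elements and yields $FFFFFFFF in each of them, which is -1 in two's complement.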
Opposite number
Given a number, these implementations of HugeNeg return its opposite number (two's complement):
 function HugeNeg(x: Hugeint): Hugeint;
 begin
 // Result := (Not x) + 1;
 Result := HugeAdd(HugeNot(x), IntToHuge(1));
 end;
 function HugeNeg(x: Hugeint): Hugeint;
 begin
 // Result := 0 - x;
 Result := HugeSub(IntToHuge(0), x);
 end;
The second one is the simplest and fastest because it involves a single operation, and now that we know how to subtract, we can implement it in assembler:
 function HugeNeg(x: Hugeint): Hugeint;
 // Result := -x;
 // Parameters: EAX = @x; EDX = @Result
 asm
 // Result := 0 - x;
 push esi
 xor esi, esi
 mov ecx, [eax+_0_] // x[0]
 sub esi, ecx // 0 - x[0]
 mov ecx, 0
 mov [edx+_0_], esi // Result[0]
 mov esi, [eax+_1_] // x[1]
 sbb ecx, esi // 0 - x[1] - Borrow
 mov esi, 0
 mov [edx+_1_], ecx // Result[1]
 mov ecx, [eax+_2_] // x[2]
 sbb esi, ecx // 0 - x[2] - Borrow
 mov ecx, 0
 mov [edx+_2_], esi // Result[2]
 mov esi, [eax+_3_] // x[3]
 sbb ecx, esi // 0 - x[3] - Borrow
 mov [edx+_3_], ecx // Result[3]
 pop esi
 end;
Multiplication
A way of multiplying numbers is by means of an addition loop:
 function HugeMul(x: Hugeint; y: Hugeint): Hugeint;
 begin
 SetZero(Result);
 while not HugeIsZero(y) do begin
 Result := HugeAdd(Result, x);
 y := HugeSub(y, IntToHuge(1));
 end;
 end;
Computationally speaking, this algorithm is quite poor. For example, if the value of "y" was 4 million, the loop would repeat 4 million times! Still, the idea would be good if we could somehow accelerate the process. Let's play a little bit with algebra:
 x * y = x * (y[3]*2^96 + y[2]*2^64 + y[1]*2^32 + y[0]*2^0)
 = (x*y[3])*2^96 + (x*y[2])*2^64 + (x*y[1])*2^32 + (x*y[0])*2^0
Now we have reduced the problem of multiplying two Hugeint numbers to multiplying a Hugeint number by a 32-bit integer. We multiply the first operand by the four integers that make up the second operand and then we shift the partial results by 0, 32, 64, and 96 bits (to multiply them by 2^0, 2^32, 2^64 and 2^96), and finally we add these values to get the final result.
 function HugeMulInt(x: Hugeint; y: Longint): Hugeint; forward;
 function HugeMul(x: Hugeint; y: Hugeint): Hugeint;
 begin
 Result := HugeShl(HugeMulInt(x, y[3]), 96)
 + HugeShl(HugeMulInt(x, y[2]), 64)
 + HugeShl(HugeMulInt(x, y[1]), 32)
 + HugeMulInt(x, y[0]);
 end;
This is exactly the way we multiply decimal numbers when performing calculations on paper, except that here the base is 2^32 instead of ten. Let's see now how we can multiply a Hugeint by an integer:
 function MultiplyWithCarry(x: Longint; y: Longint;
 var Carry: Longint): Longint; forward;
 function HugeMulInt(x: Hugeint; y: Longint): Hugeint;
 // Result := x * y;
 var
 Carry: Longint;
 begin
 Carry := 0;
 Result[0] := MultiplyWithCarry(x[0], y, Carry);
 Result[1] := MultiplyWithCarry(x[1], y, Carry);
 Result[2] := MultiplyWithCarry(x[2], y, Carry);
 Result[3] := MultiplyWithCarry(x[3], y, Carry);
 end;
 function MultiplyWithCarry(x: Longint; y: Longint;
 var Carry: Longint): Longint;
 // Result := LoDWord(x * y + Carry);
 // Carry := HiDWord(x * y + Carry);
 asm
 // EDX:EAX := EAX * EDX; // x * y
 mul edx
 // Inc(EDX:EAX, Carry);
 add eax, [ecx]
 adc edx, 0
 // Carry := EDX; // High order 32 bits of the result
 mov [ecx], edx;
 end;
MultiplyWithCarry is very much like AddWithCarry, but it performs a multiplication instead of an addition, and it generates a carry of 32 bits instead of just one bit (the multiplication of two 32-bit values generates a 64-bit result, while the addition of two 32-bit values can generate a 33-bit result).
MultiplyWithCarry first performs an unsigned multiplication of "x" (EAX) by "y" (EDX), using the MUL opcode. The result is a 64-bit unsigned integer in EDX:EAX, to which the function adds the Carry passed by parameter. The function returns the lower 32 bits of this final result (located in EAX), and the higher 32 bits (EDX) constitute the carry for the next multiplication, which is stored in the Carry parameter (passed by reference).
An assembler implementation of HugeMul and HugeMulInt can be found in the source code attached. For reasons of simplicity, in the examples above the functions consider the numbers are unsigned, but the assembler implementations consider signed numbers. Also, the attached version of HugeMul doesn't call HugeMulInt or HugeShl, and is highly optimized.
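The unsigned HugeMulInt scheme described above can also be sketched without assembler, provided the compiler offers a working unsigned 64-bit type. That is an assumption here: UInt64 is available in Free Pascal and in later Delphi versions, but may not be fully supported by older compilers. HugeMulIntPas and MakeHuge are hypothetical names.

```pascal
type
  Hugeint = packed array [0..3] of Longword;

// Builds a Hugeint from its four 32-bit elements, low-order element first.
function MakeHuge(w0, w1, w2, w3: Longword): Hugeint;
begin
  Result[0] := w0;
  Result[1] := w1;
  Result[2] := w2;
  Result[3] := w3;
end;

// Result := x * y, treating both operands as unsigned; the part of the
// product that doesn't fit in 128 bits is discarded.
function HugeMulIntPas(const x: Hugeint; y: Longword): Hugeint;
var
  i: Integer;
  Prod, Carry: UInt64;   // assumption: unsigned 64-bit type is available
  R: Hugeint;
begin
  Carry := 0;
  for i := 0 to 3 do
  begin
    // 32 x 32 bits gives 64 bits; adding a 32-bit carry can't overflow
    Prod := UInt64(x[i]) * UInt64(y) + Carry;
    R[i] := Longword(Prod and $FFFFFFFF);   // what MUL leaves in EAX
    Carry := Prod shr 32;                   // what MUL leaves in EDX
  end;
  Result := R;
end;
```

Multiplying $FFFFFFFF by 2 gives the elements $FFFFFFFE, 1, 0, 0: the high half of the first partial product became the carry added into the second element.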
There is another way to look at the multiplication. Instead of considering a huge integer as four 32-bit integers multiplied by four powers of 2^32, we can consider it as 128 1-bit integers multiplied by 128 powers of 2:
 bit127 * 2^127 + bit126 * 2^126 + ... + bit1 * 2^1 + bit0 * 2^0
Since each bit can only be 0 or 1, the algorithm shown above can be greatly simplified:
 function HugeMul(x: Hugeint; y: Hugeint): Hugeint;
 // Result := x * y;
 var
 i: Longint;
 begin
 SetZero(Result);
 for i := 0 to 127 do
 if BitTest(y, i) then
 Result := HugeAdd(Result, HugeShl(x, i));
 end;
The idea is to add "x" multiplied by different powers of 2, chosen according to the bits that are set in "y". For example, if "y" was 20, bits 4 and 2 would be on (20 in decimal is 10100 in binary, counting bits from 0), so only two additions would be performed, and the result would be HugeShl(x, 2) plus HugeShl(x, 4).
This algorithm can be coded quite efficiently in assembler, but still the first algorithm will work faster. The reason why I've shown this is because it'll make it easier to understand the algorithm we'll use for divisions.
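Here is a self-contained Pascal sketch of this shift-and-add idea, treating the operands as unsigned. Instead of calling HugeShl(x, i) in every iteration, it keeps a running copy of "x" that is doubled each time (adding a number to itself shifts it left by one bit); that is a small variation on the pseudo-code above, not the attached implementation. HugeAddPas, HugeMulPas and MakeHuge are hypothetical names.

```pascal
type
  Hugeint = packed array [0..3] of Longword;

// Builds a Hugeint from its four 32-bit elements, low-order element first.
function MakeHuge(w0, w1, w2, w3: Longword): Hugeint;
begin
  Result[0] := w0;
  Result[1] := w1;
  Result[2] := w2;
  Result[3] := w3;
end;

// Result := x + y; a carry out of the highest element is discarded.
function HugeAddPas(const x, y: Hugeint): Hugeint;
var
  i: Integer;
  Sum, Carry: Int64;
  R: Hugeint;
begin
  Carry := 0;
  for i := 0 to 3 do
  begin
    Sum := Int64(x[i]) + Int64(y[i]) + Carry;
    R[i] := Longword(Sum and $FFFFFFFF);
    Carry := Sum shr 32;
  end;
  Result := R;
end;

// Result := x * y (low-order 128 bits of the product).
function HugeMulPas(const x, y: Hugeint): Hugeint;
var
  i: Integer;
  Acc, Shifted: Hugeint;
begin
  Acc := MakeHuge(0, 0, 0, 0);
  Shifted := x;                                  // x * 2^0
  for i := 0 to 127 do
  begin
    // is bit i of y set?
    if (y[i shr 5] and (Longword(1) shl (i and 31))) <> 0 then
      Acc := HugeAddPas(Acc, Shifted);           // add x * 2^i
    Shifted := HugeAddPas(Shifted, Shifted);     // becomes x * 2^(i+1)
  end;
  Result := Acc;
end;
```

Multiplying 3 by 20 performs additions only for bits 2 and 4 of the second operand (12 + 48) and gives 60.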
Division
Let's first see the case of a division of a Hugeint by a 32-bit integer, which should be easy to understand:
 function DivideWithRemainder(x: Longint; y: Longint;
 var Remainder: Longint): Longint; forward;
 function HugeDivInt(x: Hugeint; y: Longint): Hugeint;
 // Result := x div y;
 var
 Remainder: Longint;
 begin
 Remainder := 0;
 Result[3] := DivideWithRemainder(x[3], y, Remainder);
 Result[2] := DivideWithRemainder(x[2], y, Remainder);
 Result[1] := DivideWithRemainder(x[1], y, Remainder);
 Result[0] := DivideWithRemainder(x[0], y, Remainder);
 asm
 mov edx, Remainder
 end;
 end;
 function DivideWithRemainder(x: Longint; y: Longint;
 var Remainder: Longint): Longint;
 // Result := Remainder:x div y;
 // Remainder := Remainder:x mod y;
 asm
 push esi
 mov esi, edx // y
 mov edx, [ecx] // Remainder
 // EAX := EDX:EAX div ESI;
 // EDX := EDX:EAX mod ESI;
 div esi
 // Remainder := EDX;
 mov [ecx], edx;
 pop esi
 end;
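The same high-to-low scheme can be sketched in plain Pascal, again assuming a compiler with a working unsigned 64-bit type (UInt64 in Free Pascal and later Delphi versions). The remainder of each step becomes the high half of the next 64-bit dividend, exactly like EDX in the DIV instruction. HugeDivModInt, THugeDivIntResult and MakeHuge are hypothetical names, and the operands are treated as unsigned.

```pascal
type
  Hugeint = packed array [0..3] of Longword;
  THugeDivIntResult = record      // hypothetical: quotient plus remainder
    Quotient: Hugeint;
    Remainder: Longword;
  end;

// Builds a Hugeint from its four 32-bit elements, low-order element first.
function MakeHuge(w0, w1, w2, w3: Longword): Hugeint;
begin
  Result[0] := w0;
  Result[1] := w1;
  Result[2] := w2;
  Result[3] := w3;
end;

// Divides x by y, starting with the most significant element.
// A zero divisor triggers the usual division-by-zero runtime error.
function HugeDivModInt(const x: Hugeint; y: Longword): THugeDivIntResult;
var
  i: Integer;
  Cur, R: UInt64;   // assumption: unsigned 64-bit type is available
begin
  R := 0;
  for i := 3 downto 0 do
  begin
    Cur := (R shl 32) or x[i];                   // Remainder:x[i], like EDX:EAX
    Result.Quotient[i] := Longword(Cur div y);   // fits because R < y
    R := Cur mod y;
  end;
  Result.Remainder := Longword(R);
end;
```

Dividing 2^32 (elements 0, 1, 0, 0) by 3 gives the quotient $55555555 in the low element and a remainder of 1.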
HugeDivInt leaves the remainder of the division in EDX, so it can be used in a function returning the remainder of the division:
 function HugeModInt(dividend: Hugeint; divisor: Longint): Longint;
 // Result := dividend mod divisor;
 // Parameters: EAX = @dividend; EDX = divisor;
 asm
 sub esp, TYPE(Hugeint) // Make place on the stack for a Hugeint
 mov ecx, esp // to hold the result of the division
 call HugeDivInt // Perform the division
 add esp, TYPE(Hugeint) // Restore the stack pointer
 mov eax, edx // Result := Remainder; // was left in EDX
 end;
For the case of two huge integers we can think of an algorithm like the one we would use to divide two numbers of four digits with paper and pencil, but it turns out to be quite complex, plus it isn't actually very fast since it involves divisions, multiplications, and subtractions, and sometimes you take one step forwards and two steps back. Is there another possible algorithm? Yes, there is:
 function HugeDiv(dividend: Hugeint; divisor: Hugeint): Hugeint;
 // Result := dividend div divisor;
 begin
 if HugeIsZero(divisor) then
 raise EDivByZero.CreateRes(@sDivByZero);
 HugeSetZero(Result);
 while HugeCmp(dividend, divisor) >= 0 do begin
 dividend := HugeSub(dividend, divisor);
 Result := HugeAdd(Result, IntToHuge(1));
 end;
 end;
Of course, this algorithm turns out to be awfully slow (if we divide 12 million by 3, the loop would execute 4 million times), but we can speed things up if we subtract from the dividend the divisor multiplied by different powers of 2, from higher to lower, setting the corresponding bit of the result every time we perform a subtraction (the bit in the position of the power of 2 that was used). It's the inverse of what we did in the case of a multiplication shown above. The division process would then be reduced to just 128 subtractions at most.
In the following example, the dividend is 20 (10100 in binary) and the divisor is 3 (11 in binary):
 10100 - 11 * 2^2 = 10100 - 1100 = 1000 Result := 100
 1000 - 11 * 2^1 = 1000 - 110 = 10 Result := 110
Initially, 11 * 2^2 is the highest value that is less than or equal to the dividend, so we subtract that value from the dividend and we set bit 2 of the result because we subtracted the divisor multiplied by two to the power of 2. So far, the remainder is 8 (1000 in binary), and 11 * 2^1 is the highest value that is less than or equal to this remainder, so we subtract that value from the remainder, and we set bit 1 of the result because we subtracted the divisor multiplied by two to the power of 1. The remainder is 2 (10 in binary), and since the divisor is greater than that value, division stops there. The remainder of the operation would then be 2 (10 in binary) and since bits 2 and 1 of the result were set, the result is 110 in binary, i.e. 6 in decimal.
 function HugeDiv(dividend: Hugeint; divisor: Hugeint): Hugeint;
 var
 _r_: Hugeint; // remainder
 _d_: Hugeint; // divisor
 _q_: Hugeint; // quotient
 BitPosR, BitPosD, count: integer;
 begin
 _r_ := dividend;
 _d_ := divisor;
 HugeSetZero(_q_);
 BitPosD := HugeBitScanReverse(_d_);
 if BitPosD = -1 then RaiseDivByZero;
 BitPosR := HugeBitScanReverse(_r_);
 count := BitPosR - BitPosD;
 if count > 0 then
 _d_ := HugeShl(_d_, count);
 repeat
 if HugeCmp(_d_, _r_) <= 0 then begin
 _r_ := HugeSub(_r_, _d_);
 HugeBitSet(_q_, count);
 end;
 _d_ := HugeShr(_d_, 1);
 dec(count);
 until count < 0;
 Result := _q_;
 asm
 lea edx, _r_
 end;
 end;
HugeBitScanReverse is a function that returns the position of the first non-zero bit, performing the search from bit 127 down to bit 0. If all bits are zero, the result is -1. We use HugeBitScanReverse to determine the first power of two by which we should multiply the divisor in order to begin the iteration.
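The following self-contained Pascal sketch follows the same steps for unsigned operands: align the divisor with the highest set bit of the dividend, then compare, subtract and shift right once per bit. It is not the attached implementation; it uses a while loop instead of repeat/until, returns quotient and remainder together in a record, and every name in it (THugeDivResult, BitScanRev, GreaterOrEqual, SubInPlace, ShiftLeft1, ShiftRight1, HugeDivMod, MakeHuge) is a hypothetical helper.

```pascal
type
  Hugeint = packed array [0..3] of Longword;
  THugeDivResult = record        // hypothetical result type
    Quotient, Remainder: Hugeint;
  end;

function MakeHuge(w0, w1, w2, w3: Longword): Hugeint;
begin
  Result[0] := w0; Result[1] := w1; Result[2] := w2; Result[3] := w3;
end;

// Position of the highest set bit (127..0), or -1 when x is zero.
function BitScanRev(const x: Hugeint): Integer;
begin
  Result := 127;
  while (Result >= 0) and
        ((x[Result shr 5] and (Longword(1) shl (Result and 31))) = 0) do
    Dec(Result);
end;

// True when a >= b (unsigned), comparing from the high element down.
function GreaterOrEqual(const a, b: Hugeint): Boolean;
var
  i: Integer;
begin
  i := 3;
  while (i > 0) and (a[i] = b[i]) do
    Dec(i);
  Result := a[i] >= b[i];
end;

// a := a - b, propagating the borrow; the caller guarantees a >= b.
procedure SubInPlace(var a: Hugeint; const b: Hugeint);
var
  i: Integer;
  Diff, Borrow: Int64;
begin
  Borrow := 0;
  for i := 0 to 3 do
  begin
    Diff := Int64(a[i]) - Int64(b[i]) - Borrow;
    Borrow := Ord(Diff < 0);
    a[i] := Longword(Diff and $FFFFFFFF);
  end;
end;

procedure ShiftLeft1(var x: Hugeint);
var
  i: Integer;
begin
  for i := 3 downto 1 do
    x[i] := Longword((x[i] shl 1) and $FFFFFFFF) or (x[i - 1] shr 31);
  x[0] := Longword((x[0] shl 1) and $FFFFFFFF);
end;

procedure ShiftRight1(var x: Hugeint);
var
  i: Integer;
begin
  for i := 0 to 2 do
    x[i] := (x[i] shr 1) or Longword((x[i + 1] shl 31) and $FFFFFFFF);
  x[3] := x[3] shr 1;
end;

function HugeDivMod(const Dividend, Divisor: Hugeint): THugeDivResult;
var
  d: Hugeint;
  i, Count: Integer;
begin
  if BitScanRev(Divisor) = -1 then
    RunError(200);                       // division by zero
  Result.Quotient := MakeHuge(0, 0, 0, 0);
  Result.Remainder := Dividend;
  d := Divisor;
  Count := BitScanRev(Dividend) - BitScanRev(Divisor);
  for i := 1 to Count do
    ShiftLeft1(d);                       // d := Divisor * 2^Count
  while Count >= 0 do
  begin
    if GreaterOrEqual(Result.Remainder, d) then
    begin
      SubInPlace(Result.Remainder, d);   // subtract Divisor * 2^Count
      Result.Quotient[Count shr 5] := Result.Quotient[Count shr 5]
        or (Longword(1) shl (Count and 31));   // set bit Count
    end;
    ShiftRight1(d);
    Dec(Count);
  end;
end;
```

For the example above, dividing 20 by 3 with this sketch sets bits 2 and 1 of the quotient (6) and leaves 2 as the remainder. When the dividend is smaller than the divisor, Count starts out negative, the loop never runs, and the dividend is returned unchanged as the remainder.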
The assembler implementation of HugeDiv that you can find attached supports signed numbers. It is just a first approximation, and it can be heavily optimized.
The function leaves in EDX the address of the remainder, so it can be used by a function returning the modulus of the division:
 function HugeMod(dividend: Hugeint; divisor: Hugeint): Hugeint;
 // Result := dividend Mod divisor;
 // Parameters: EAX = @dividend; EDX = @divisor; ECX = @Result
 asm
 push ecx // @Result
 call HugeDiv // EDX := @remainder;
 pop eax // EAX := @Result;
 call HugeMov // EAX^ := EDX^;
 end;