Assembly Language Library by David Parker 
 Home | Subscribe | List of Site Subscribers | Free Viewer 
 Latest news: 5 Sept 2013 | Update to newest version: 5 Sept 2013 
 Math Art Gallery | Documentation | Links | Privacy | Contact 
 
Tornado movie
"Tornado" (movie) by David Parker.
 

NOTE: Until I get more time to work on this page, I'm just marking broken links instead of fixing them.

Free tools | Documentation | Tips and tricks

I always write my commercially available programs entirely in assembly language. This page contains bits of information that I've found to be useful.


 
 Free tools
 
 Introduction

Everything you need to program in assembly language for DOS or Windows (and probably for UNIX, Linux, etc.) is available for free on the internet. In this chapter I list how to obtain the newest versions (and sometimes older versions) of the tools that I actually use. The list is heavily slanted toward tools for writing true 32-bit programs for Windows 95/98/NT4 or later, although most of the tools can still create 16-bit programs for DOS or Windows 3.1.

Sometimes a program that works with an older version of a tool stops working with a newer version. To avoid that problem, I keep different versions of the tools I use in different subdirectories. These subdirectories all live in a top level directory called tools. Right now it looks like this:

 Volume in drive C has no label.
 Volume Serial Number is B099-2C6C

 Directory of C:\tools

[.]                  [..]                 [alink-1-6]
[cvtres-5-00-1736-1] [h2inc-6-12a]        [imagedit-3-10-031]
[link-2-60-5046]     [link-5-12-8078]     [link-5-12-8181]
[link-5-13]          [masm-6-00b]         [masm-6-11c]
[masm-6-11d]         [masm-6-13-7299]     [masm-6-14-8444]
[nasm-0-97]          [pedump-1993]        [rc-3-10]
[rc-4-00-1367]       [win32lib-1998-10-1] [windbg-4-1381]
              21 File(s)              0 bytes
                             48,501,248 bytes free

I use dashes (minus signs) to separate the version information in subdirectory names instead of periods or underscores. I don't use periods because they are also used to separate file types from file names (for example, example-6-c is less confusing than example.6.c, which some systems will interpret as meaning that example.6 is written in c). I don't use underscores because most browsers underline HTML links, which sometimes hides the underscores (for example, masm_6_14_8444).

In the batch files that I use to build my programs I specifically state which version of each tool to use. Instead of a batch file that looks like this:

ml /coff /c /w /Sn /Sl132 /Sp80 /nologo /X project.asm
link @link.tmp

I use something that looks like this:

\tools\masm-6-14-8444\ml /coff /c /w /Sn /Sl132 /Sp80 /nologo /X project.asm
\tools\link-5-12-8181\link @link.tmp

In parentheses after the name of each tool in this chapter is the name that I use for that tool's subdirectory.

 
 Microsoft Macro Assembler (masm-6-11d)

These instructions are nearly the same as those for downloading the Microsoft 32-bit Linker, so you may want to grab a copy of that while you're getting the assembler.

  1. Click on the following two links to download two files that you'll need. Make sure that the directory is set to Desktop when your browser prompts you for a location to save in.After you finish downloading there should be two new icons on your Desktop for BINS_DDK.EXE and 98SETUP.EXE.
  2. To extract the files, double click on each of the two new icons on your Desktop. When you are prompted for a location for the extracted files, type in:

    c:\98ddk

    Warning: If you already have a directory called
    c:\98ddk, then use another name because you might be mixing old and new versions of files in your current c:\98ddk directory. The order in which you extract the files doesn't matter. After you have extracted the files you can drag the icons to the Recycle Bin because you won't need them anymore.
  3. To finish installing the files, first click on Start, then click on Run, then type in:

    c:\98ddk\setup.exe

    then click on OK, and then follow the instructions.
  4. Make a directory for MASM 6.11d (for example c:\tools\masm-6-11d). Now copy the following files from c:\98ddk\bin\win98 to c:\tools\masm-6-11d:

    ml.exe
    ml.err


  5. There are other interesting files in the c:\98ddk directory, but if you don't need any of them, you can uninstall the whole directory. To uninstall, go to Control Panel, go to Add/Remove Programs, select Microsoft 98 DDK in the bottom window, then select Add/Remove.
 
 Microsoft Macro Assembler (masm-6-14-8444)
  1. First, follow the directions above to install MASM 6.11d.
  2. Create a new directory for MASM 6.14.8444 (for example, c:\tools\masm-6-14-8444) and copy all of the files from c:\tools\masm-6-11d.
  3. Click on the following link to download the file ML614.exe. When your browser prompts you for a directory to save it in, enter c:\tools\masm-6-14-8444.
  4. To extract the patch files, first click on Start, then click on Run, then type in:

    c:\tools\masm-6-14-8444\ml614.exe

    and then click on OK. If the program leaves a DOS window open, feel free to close it.
  5. To finish patching the Macro Assembler, first click on Start, then click on Run, then type in:

    c:\tools\masm-6-14-8444\patch.exe

    and then click on OK.
 
 Netwide Assembler (nasm-2-01)

NASM 2.01 is the current stable release of the Netwide Assembler. It is available from many sites, for many operating systems. Follow the links from the NASM home page at http://nasm.sourceforge.net/ for more details.

 
 Anthony's Linker (alink-1-6)

You can download the latest version of Anthony's Linker (alink-1-6) for Windows 95/98/NT4 or later from http://alink.sourceforge.net/download.html (42 KB).

Anthony's Programming Page at http://alink.sourceforge.net/ has information on other useful programs, including other versions of ALINK and ALINK's source code.

 
 Microsoft 16-bit Linker (link-5-60-339)

The latest version of the Microsoft 16-bit Linker (link-5-60-339) for creating DOS programs is available from ftp://ftp.microsoft.com/softlib/MSLFILES/LNK563.EXE (broken link) (281 KB). Create a directory for the linker (for example, c:\tools\link-5-60-339), download the file to the directory, and then run LNK563.EXE to install the linker.

Even though the version number is greater than the 32-bit linker, the 16-bit linker is older.

 
 Microsoft 32-bit Linker (link-5-12-8181)

These instructions are nearly the same as those for downloading the Microsoft Macro Assembler, so you may want to grab a copy of that while you're getting the linker.

  1. Click on the following two links to download two files that you'll need. Make sure that the directory is set to Desktop when your browser prompts you for a location to save in.After you finish downloading there should be two new icons on your Desktop for BINS_DDK.EXE and 98SETUP.EXE.
  2. To extract the files, double click on each of the two new icons on your Desktop. When you are prompted for a location for the extracted files, type in:

    c:\98ddk

    Warning: If you already have a directory called
    c:\98ddk, then use another name because you might be mixing old and new versions of files in your current c:\98ddk directory. The order in which you extract the files doesn't matter. After you have extracted the files you can drag the icons to the Recycle Bin because you won't need them anymore.
  3. To finish installing the files, first click on Start, then click on Run, then type in:

    c:\98ddk\setup.exe

    then click on OK, and then follow the instructions.
  4. Make a directory for Link 5.12.8181 (for example c:\tools\link-5-12-8181). Now copy the following files from c:\98ddk\bin to c:\tools\link-5-12-8181:

    link.exe
    mspdb50.dll


  5. There are other interesting files in the c:\98ddk directory, but if you don't need any of them, you can uninstall the whole directory. To uninstall, go to Control Panel, go to Add/Remove Programs, select Microsoft 98 DDK in the bottom window, then select Add/Remove.
 
 Documentation
 
 Tips and tricks
 
 Divide by a constant

There are many ways to avoid the expense of an integer multiply instruction when multiplying by small integer constants. For example, multiplying Eax by 2 is as easy as:

; Eax = integer.
  Add   Eax,Eax ; Multiply Eax by 2.
; Eax = 2*integer.

Or, multiplying Eax by 9 is as easy as:

; Eax = integer.
  Lea   Eax,[Eax+8*Eax] ; Multiply Eax by 9.
; Eax = 9*integer.

Unfortunately, there don't seem to be any similar tricks for avoiding the expense of an integer divide, unless the constant you are dividing by is a power of 2. For example, to divide an unsigned integer in Edx by 4, you could use:

; Edx = unsigned integer.
  Shr   Edx,2 ; Divide Edx by 4.
; Edx = Edx/4, rounded toward 0.

or, if Edx is a signed integer:

; Edx = signed integer.
  Sar   Edx,2 ; Divide Edx by 4.
; Edx = Edx/4, rounded toward minus infinity.

The closest trick that I have seen for constants that are not powers of two is to multiply by 2^32/(integer constant), making sure that 2^32/(integer constant) is rounded towards infinity. For example, to divide Edx by 10, you could use the following:

; Edx = unsigned integer.
  Mov   Eax,(1 Shl 31)/5+1 ; 2^32/10 = 2^31/5, plus 1 for rounding.
  Mul   Edx ; Multiply Edx by 1/10.
; Edx = Edx/10, rounded toward 0, Eax = destroyed.

On many machines the Mul instruction isn't all that speedy, so this trick is of limited usefulness. On the other hand, if you are using floating point numbers then multiplying by 1/constant is usually much more efficient than dividing by the constant.

 
 Pop with Esp as base register gotcha

Suppose you want to copy a DWord from somewhere on the stack to the bottom of the stack. If the DWord is at offset 16 (for example) from Esp, then the simplest way to copy it to the bottom of the stack is:

; DWord to copy is at [Esp+16].
  Push  [Esp+16] ; Push DWord at [Esp+16].
; New copy is at [Esp], original DWord is now at [Esp+20].

Now suppose our program modifies the copied DWord at [Esp], and we want to copy it back to the original DWord, which is now at [Esp+20]. The naive approach would be to try something like:

; Modified DWord is at [Esp], original DWord is at [Esp+20].
  Pop   [Esp+20] ; Pop DWord from [Esp] to [Esp+20]
; Modified DWord is now at [Esp+16].

Unfortunately, this would be wrong. Unlike any other instruction, Pop calculates the address of the operand as though the instruction has already executed. That means that Esp will be 4 more than expected. The correct code sequence is:

; Modified DWord is at [Esp], original DWord is at [Esp+20].
  Pop   [Esp+16] ; Pop DWord from [Esp] to [Esp+20]
; Modified DWord is now at [Esp+16].
 
 Test for unsigned integer power of two

You can use a classic assembly language trick to test if a unsigned integer is a power of two. In other words, this trick enables you to test an integer to see if one, and only one, bit is set to 1 (and thus the other bits are all set to 0's).

For example, the integers 00100b and 10000b are unsigned integer powers of two because only one bit in each is set to 1, but the integer 11100b is not an unsigned integer power of two because more than one bit is set to 1. The integer 00000b is not an unsigned integer power of two because no bits are set to 1.

The trick relies on a combination of arithmetic and logical operations, and the peculiarities of two's complement arithmetic. The basic idea is that if you bitwise-and an integer with its own negative, and if the result equals the original integer, then one, and only one, bit in the original integer was set to 1, or else the original integer was 0. For example, 00100b gives (-00100b) And 00100b = 11100b And 00100b = 00100b, so 00100b is an unsigned integer power of 2. For another example, 11100b gives (-11100b) And 11100b = 00100b And 11100b = 00100b, so 11100b is not an unsigned integer power of 2. The result of bitwise-anding an integer with its own negative gives the lowest order 1 bit in the integer, or 0 if the integer was 0.

The following code snippet uses the Eax and Ebx registers, but any two integer registers excluding Esp will work. To sanitize the source code for inclusion in this HTML file, I changed all "<" characters to "&lt;", and all ">" characters to "&gt;". If you use this code in your program, you might want to reverse those transformations.

; Eax contains integer to test, Ebx = scratch register.
  Xor   Ebx,Ebx ; Eax = integer, Ebx = 0.
  Sub   Ebx,Eax ; Eax = integer, Ebx = -integer.
  Jz    Zero ; If -integer = 0, integer = 0.
  And   Ebx,Eax ; Eax = integer, Ebx = (-integer) And integer.
  Cmp   Ebx,Eax ; Eax = integer, Ebx = (-integer) And integer.
  Jne   NotPowerOfTwo ; If integer<>(-integer) And integer, multiple 1 bits.
  ...   ; If integer=(-integer) And integer, only one 1 bit.
; Eax contains integer to test, Ebx = lowest order 1 bit in Eax, else 0.

If you don't need the lowest order 1 bit at the end of the sequence, then you can use a shorter alternative. It is based on the fact that if you bitwise-and an integer with one less than the integer, then the result is zero only if the integer was a power of two, or if the integer was zero.

; Eax contains integer to test, Ebx = scratch register.
  Lea   Ebx,[Eax-1] ; Eax = integer, Ebx = integer-1.
  And   Ebx,Eax ; Eax = integer, Ebx = (integer-1) And integer.
  Jnz   NotPowerOfTwo ; If result is non-zero, integer is not 0 or power of 2.
  Test  Eax,Eax ; Was integer 0?
  Jz    Zero ; Yep.
  ...   ; Integer was a power of 2 (i.e. one and only one 1 bit was set).
; Eax contains integer to test, Ebx = integer minus lowest order 1 bit, else 0.
 
 Double precision unsigned integer division

Here's a tricky algorithm I use when I need to find an exact double precision unsigned integer quotient and remainder. The place I seem to use it the most is in 3D edge clipping.

To sanitize the source code for inclusion in this HTML file, I changed all "<" characters to "&lt;", and all ">" characters to "&gt;". If you use this code in your program, you might want to reverse those transformations.

One simple way you might want to customize this code is to check at the beginning if both operands are single precision (right now it only checks to see if the divisor is single precision), and in that case performing only one single precision divide. I didn't do it when I wrote this code because the dividends I was dealing with at the time were almost never single precision.

; Divides the double precision quantity a1*b+a2 by c1*b+c2, where b is the base
; (in this case 2^32), and a1, a2, c1, and c2 are all in the range 0 to b-1,
; inclusive.  If both c1 and c2 are zero, then a divide exception occurs.  In
; all other cases, a double precision quotient and remainder are returned.
;
; By David Parker, from www.dpgraph.com/assembly.html
;
; Expects on entry: Eax=c2
;                   Ebx=a2
;                   Ecx=a1
;                   Edx=c1
;
; Returns on exit:  Eax=lo dword of quotient.
;                   Ebx=lo dword of remainder.
;                   Ecx=hi dword of remainder.
;                   Edx=hi dword of quotient.
;
; The algorithm handles the division using 3 special cases.
;
; CASE 1.
;
; The first case is if c1=0.
; We want to find numbers q1 and r1 such that
;
; (a1*b+a2) = q1*c2 + r1, where 0 <= r1 < c2.
;
; We note that 0 <= q1 <= (b-1)*b+(b-1).  The minimum occurs, for example,
; when a1=a2=0, and c2 = 1.  The maximum occurs when a1=a2=(b-1) and c2=1.
; We first divide a1 by c2 to get q2 and r2 such that:
;
; a1 = q2*c2 + r2, where 0 <= r2 < c2.
;
; The bounds on q2 are 0 <= q2 <= (b-1), the minimum occurring at a2=0,
; the maximum at a1=(b-1), c2=1.
; We then divide r2*b+a2 by c2 to get q3 and r3 such that:
;
; r2*b + a2 = q3*c2 + r3, where 0<=r3<c2.
;
; The bounds on q3 are 0 <= q3 <= (b-1), the minimum occurring at r2=a2=0,
; the maximum at r2=(c2-1), a2=(b-1).
; Starting from (a1*b+a2), we then start substituting:
;
; (a1*b+a2)
; = (q2*c2+r2)*b + a2
; = q2*c2*b + r2*b + a2
; = q2*c2*b + q3*c2 + r3
; = (q2*b+q3)*c2 + r3.
;
; Comparing this to the equation for q1 and r1, we see that since 0 <= r3 < c2,
; that q1=(q2*b+q3) and r1=r3.
;
; CASE 2.
;
; The second special case is if 1 <= c1 < (b-1).  In this case, we want to find
; q1 and r1 such that:
;
; a1*b+a2 = q1*(c1*b+c2) + r1, where 0 <= r1 < (c1*b+c2).
;
; We note that 0 <= q1 <= (b-1), the minimum occurring if a1=a2=0, c1=c2=(b-1),
; and the maximum occurring when c1=1,c2=0,a1=a2=(b-1).
; First, we divide (a1*b+a2) by (c1+1) to get q2 and r2 such that:
;
; (a1*b+a2) = q2*(c1+1) + r2, where 0 <= r2 < (c1+1).
;
; Q2 may be a double precision quantity.
; Next, we divide (c1*b+c2) by (c1+1) to get q3 and r3 such that:
;
; (c1*b+c2) = q3*(c1+1) - r3, where 0 <= r3 < (c1+1).
;
; We note that 1 <= q3 <= b.
; The minus sign in front of r3 is important, so that later things come out
; positive.
;
; Next, we divide q2 by q3 (which, if q3 = b, means we don't have to divide),
; to get q4 and r4 such that:
;
; q2 = q4*q3 + r4, where 0 <= r4 < q3.
;
; Substituting in this last equation for q2 and q3, and rearranging a little
; gives us:
;
; (a1*b+a2) = q4*(c1*b+c2) + q4*r3 + r2 + r4*(c1+1).
;
; We will call this last remainder r5: r5 = q4*r3 + r2 + r4*(c1+1).
; We will now show that either q4=q1, or q4=q1-1.  We note that since all the
; terms in r5 are positive, we have that q4 <= q1.  To see how large r5 can
; get, we first note that since r4<q3, then:
;
; r5< q4*r3 + r2 + (c1*b+c2+r3)/(c1+1)*(c1*1), or
; r5< q4*r3 + r2 + (c1*b+c2+r3), or
; r5< (q4+1)*r3 + r2 + (c1*b+c2).
;
; Since r3 and r2 are both < (c1+1), we can substitute c1 for them to get
;
; r5< (q4+1)*c1 + c1 + (c1*b +c2), or
; r5< (q4+2)*c1 + (c1*b+c2). *******************************
;
; Now, since q4 <= q1 and q1 <= (b-1), we have that q4 <= (b-1), so that
;
; r5 < (b+1)*c1 + (c1*b+c2), or
; r5 < c1*b + c1 + c1*b + c2, or
; r5 < 2*c1*b + c1 + c2.
;
; Finally, since c1*(b-1)+2*c2 > 0, we have
;
; r5 < 2*c1*b + c1 + c2 + c1*(b-1) + 2*c2, or
; r5 < 3*c1*b + 3*c2, or
; r5 < 3*(c1*b+c2).
;
; Thus, we must have that q1-2 <= q4 <= q1.  However, we can examine more
; closely the case that q1-2=q4 to show that it never occurs.  If we assume
; that q4=q1-2, then we must have that r5 >= 2*(c1*b+c2).  Repeating the
; equation with the long line of asterisks, we know that
;
; r5 < (q4+2)*c1 + (c1*b+c2).
;
; If we assume that q4=q1-2, then we have
;
; r5 < q1*c1 + (c1*b+c2).
;
; We know that q1 <= (b-1), so we have
;
; r5 < (b-1)*c1 + (c1*b+c2), or
; r5 < 2*c1*b - c1 + c2.
;
; Since c1+c2 >= 0, we have that
;
; r5 < 2*c1*b - c1 + c2 + c1 + c2, or
; r5 < 2*c1*b + 2*c2, or
; r5 < 2*(c1*b+c2).
;
; This contradicts our assumption that q4=q1-2, so that case can never occur.
; It is fairly easy to construct cases where q4=q1-1, so our final result for
; the case 1 <= c1 < (b-1) is that
;
;  q1-1 <= q4 <= q1, and 0 <= r5 < 2*(c1*b+c2).
;
; Thus, once we have calculated q4 and r5, we can calculate q1 and r1 by
; comparing r5 to (c1*b+c2), and if r5 >= (c1*b+c2), we increment q4 by 1 and
; decrement r5 by (c1*b+c2).
;
; There is some trickiness involved in calculating the remainder r5.  First, we
; need to show that r5 never exceeds the value that can be stored as a
; double precision quantity.  We can start by determining a bound on q1 in
; terms of c1.  We have:
;
; q1=(a1*b+a2-r1)/(c1*b+c2)
;
; q1 will be at a maximum when the numerator is as large as possible (so that
; a1=a2=(b-1) and r1=0), and the denominator is as small as possible (so that
; c2=0).  Thus the maximal value of q1 for a given value of c1 is
;
; q1=((b-1)*b+(b-1))/(c1*b)
;
; Knowing that the maximal values for r2, r3, and r4 are c1, c1, and (b-1),
; respectively, and that q4<=q1, we can plug these into the expression for r5
; to get:
;
; r5<=((b-1)*b+(b-1))/(c1*b)*c1 + c1 + (b-1)*(c1+1), or
; r5<=(c1+2)*b-1-1/b
;
; Since the maximum possible value for c1 is (b-2), we then have
;
; r5<=b^2-1-1/b
;
; Since 1/b is positive, we can add it to the right so that
;
; r5<b^2-1
;
; Noting that the maximum double precision value is (b-1)*b+(b-1)=b^2-1, we
; see that calculating r5 will never overflow double precision.
;
; We can make the calculation of r5 slightly more efficient by noting that the
; term q4*r3 never overflows single precision.  We have to start:
;
; q4*r3<=q1*r3
;
; Using our above limit on q1 in terms of c1, we have
;
; q4*r3<=((b-1)*b+(b-1))/(c1*b)*r3
;
; Using the fact that 0<=r3<(c1+1), we can substitute c1 for r3 to get:
;
; q4*r3<=((b-1)*b+(b-1))/(c1*b)*c1, or
; q4*r3<=((b-1)*b+(b-1))/b, or
; q4*r3<=b-1/b.
;
; Since 1/b is positive, we have
;
; q4*r3<b, or
; q4*r3<=(b-1)
;
; Since b-1 is the maximum single precision integer, we see that q4*r3 never
; overflows single precision.
;
; CASE 3.
;
; The third case is if c1=(b-1).
; Then, since 0<=a1<=(b-1), we have one of the two possibilities:
;
; If a1=(b-1) and a2>=c2, then q1=1 and r1=a2-c2, otherwise
;
; q1=0, and r1=a1*b+a2.

DivDDDD_EcxEbxDDEdxEaxDR_EdxEaxQUEcxEbxRM:
  Test  Edx,Edx ; Is c1 zero?
  Jnz   DivDDDD_Case2 ; Nope, go try something else.
  Xchg  Eax,Ecx ; Yep, Eax=a1,Ebx=a2,Ecx=c2,Edx=0.
  Div   Ecx ; Eax=q2,Ebx=a2,Ecx=c2,Edx=r2.
  Xchg  Eax,Ebx ; Eax=a2,Ebx=q2,Ecx=c2,Edx=r2.
  Div   Ecx ; Eax=q3,Ebx=q2,Ecx=c2,Edx=r3.
  Xchg  Ebx,Edx ; Eax=q3,Ebx=r3,Ecx=c2,Edx=q2.
  Xor   Ecx,Ecx ; Eax=q3,Ebx=r3,Ecx=0,Edx=q2.
  Retn  ; Return to caller.

DivDDDD_Case2:
  Inc   Edx ; Eax=c2,Ebx=a2,Ecx=a1,Edx=c1+1.
  Jz    DivDDDD_Case3 ; If c1=65535 (c1+1=65536=0), go to case 3.
  Push  Esi ; Stack Esi
  Mov   Esi,Eax ; Eax=c2,Ebx=a2,Ecx=a1,Edx=c1+1,Esi=c2.
  Push  Edi ; Stack Edi.
  Mov   Edi,Edx ; Eax=c2,Ebx=a2,Ecx=a1,Edx=c1+1,Esi=c2,Edi=c1+1.
  Push  Ebp ; Stack Ebp.
  Xor   Edx,Edx ; Eax=a1,Ebx=a2,Ecx=c2,Edx=0,Esi=c2,Edi=c1+1.
  Xchg  Eax,Ecx ; Eax=a1,Ebx=a2,Ecx=c2,Edx=c1+1,Esi=c2,Edi=c1+1.
  Div   Edi ; Eax=Q2,Ebx=a2,Ecx=c2,Edx=Ecx,Esi=c2,Edi=c1+1.
  Xchg  Eax,Ebx ; Eax=a2,Ebx=Q2,Ecx=c2,Edx=Ecx,Esi=c2,Edi=c1+1.
  Div   Edi ; Eax=q2,Ebx=Q2,Ecx=c2,Edx=r2,Esi=c2,Edi=c1+1.
  Xchg  Eax,Ecx ; Eax=c2,Ebx=Q2,Ecx=q2,Edx=r2,Esi=c2,Edi=c1+1.
  Mov   Ebp,Edx ; Eax=c2,Ebx=Q2,Ecx=q2,Edx=r2,Esi=c2,Edi=c1+1,Ebp=r2.
  Mov   Edx,Edi ; Eax=c2,Ebx=Q2,Ecx=q2,Edx=c1+1,Esi=c2,Edi=c1+1,Ebp=r2.
  Dec   Edx ; Eax=c2,Ebx=Q2,Ecx=q2,Edx=c1,Esi=c2,Edi=c1+1,Ebp=r2.
  Div   Edi ; Eax=tq3,Ebx=Q2,Ecx=q2,Edx=tr3,Esi=c2,Edi=c1+1,Ebp=r2.
  Xchg  Ebx,Edx ; Eax=tq3,Ebx=tr3,Ecx=q2,Edx=Q2,Esi=c2,Edi=c1+1,Ebp=r2.
  Xchg  Ecx,Eax ; Eax=q2,Ebx=tr3,Ecx=tq3,Edx=Q2,Esi=c2,Edi=c1+1,Ebp=r2.
  Test  Ebx,Ebx ; Is r3 zero?
  Jz    DivDDDD_DoDiv ; Yep, no need to invert r3 or increment q3.
  Sub   Ebx,Edi ; Eax=q2,Ebx=-r3,Ecx=tq3,Edx=Q2,Esi=c2,Edi=c1+1,Ebp=r2.
  Neg   Ebx ; Eax=q2,Ebx=r3,Ecx=tq3,Edx=Q2,Esi=c2,Edi=c1+1,Ebp=r2.
  Inc   Ecx ; Eax=q2,Ebx=r3,Ecx=q3,Edx=Q2,Esi=c2,Edi=c1+1,Ebp=r2.
  Jz    DivDDDD_Rem ; If q3=65536, no need to do next division.
DivDDDD_DoDiv:
  Div   Ecx ; Eax=q4,Ebx=r3,Ecx=q3,Edx=r4,Esi=c2,Edi=c1+1,Ebp=r2.
  Xchg  Eax,Edx ; Eax=r4,Ebx=r3,Ecx=q3,Edx=q4,Esi=c2,Edi=c1+1,Ebp=r2.
DivDDDD_Rem:
  Xchg  Eax,Ebx ; Eax=r3,Ebx=r4,Ecx=q3,Edx=q4,Esi=c2,Edi=c1+1,Ebp=r2.
  Mov   Ecx,Edx ; Eax=r3,Ebx=r4,Ecx=q4,Edx=q4,Esi=c2,Edi=c1+1,Ebp=r2.
  Mul   Edx ; Eax=r3*q4,Ebx=r4,Ecx=q4,Edx=0,Esi=c2,Edi=c1+1,Ebp=r2.
  Add   Eax,Ebp ; Eax=r3*q4+r2,Ebx=r4,Ecx=q4,Edx=0,Esi=c2,Edi=c1+1.
  Adc   Edx,Edx ; Eax=r3*q4+r2,Ebx=r4,Ecx=q4,Edx=Edx*Q4+Ecx,Esi=c2,
                ; Edi=c1+1.
  Mov   Ebp,Ecx ; Eax=r3*q4+r2,Ebx=r4,Ecx=q4,Edx=Edx*Q4+Ecx,Esi=c2,
                ; Edi=c1+1,Ebp=q4.
  Mov   Ecx,Edx ; Eax=r3*q4+r2,Ebx=r4,Ecx=Edx*Q4+Ecx,Edx=Edx*Q4+Ecx,
                ; Esi=c2,Edi=c1+1,Ebp=q4.
  Xchg  Eax,Ebx ; Eax=r4,Ebx=r3*q4+r2,Ecx=Edx*Q4+Ecx,Edx=Edx*Q4+Ecx,
                ; Esi=c2,Edi=c1+1,Ebp=q4.
  Mul   Edi ; Eax=r4*(c1+1),Ebx=r3*q4+r2,Ecx=Edx*Q4+Ecx,
  Xchg  Eax,Ebp ; Eax=q4,Ebx=r5,Ecx=Edi,Edx=Esi*(C1+1),Esi=c2,Edi=c1+1.
                ; Edx=Esi*(C1+1),Esi=c2,Edi=c1+1,Ebp=q4.
  Dec   Edi ; Eax=q4,Ebx=r5,Ecx=Edi,Edx=0,Esi=c2,Edi=c1.
  Add   Ebx,Ebp ; Eax=r4*(c1+1),Ebx=r5,Ecx=Edx*Q4+Ecx,Edx=Esi*(C1+1),
                ; Esi=c2,Edi=c1+1,Ebp=q4.
  Adc   Ecx,Edx ; Eax=r4*(c1+1),Ebx=r5,Ecx=Edi,Edx=Esi*(C1+1),Esi=c2,
                ; Edi=c1+1,Ebp=q4.
  Xor   Edx,Edx ; Eax=q4,Ebx=r5,Ecx=Edi,Edx=0,Esi=c2,Edi=c1+1.
  Pop   Ebp ; Unstack Ebp.
  Cmp   Ecx,Edi ; Compare Edi to c1.
  Ja    DivDDDD_DecRe ; If Edi>c1, then we need to decrease remainder.
  Jb    DivDDDD_RemOk ; If Edi<c1, then the remainder is okay.
  Cmp   Ebx,Esi ; Compare r5 to c2.
  Jb    DivDDDD_RemOk ; If r5<c2, then the remainder is okay.
DivDDDD_DecRe:
  Inc   Eax ; Eax=q4+1,Ebx=r5,Ecx=Edi,Edx=0,Esi=c2,Edi=c1.
  Sub   Ebx,Esi ; Eax=q4+1,Ebx=r5-c2,Ecx=Edi,Edx=0,Edi=c1.
  Sbb   Ecx,Edi ; Eax=q4+1,Ebx=r5-c2,Ecx=Edi-c1,Edx=0.
DivDDDD_RemOk:
  Pop   Edi ; Unstack Edi.
  Pop   Esi ; Unstack Esi.
  Retn  ; Return to caller.

DivDDDD_Case3:
  Inc   Ecx ; Eax=c2,Ebx=a2,Ecx=a1+1,Edx=0.
  Jnz   DivDDDD_Q02 ; If a1<65535, then quotient is 0.
  Sub   Ebx,Eax ; Eax=c2,Ebx=a2-c2,Ecx=0,Edx=0.
  Jb    DivDDDD_Q01 ; If a2<c2, then quotient is 0.
  Mov   Eax,1 ; Eax=1,Ebx=a2-c2,Ecx=0,Edx=0.
  Retn  ; Return to caller.

DivDDDD_Q01:
  Add   Ebx,Eax ; Eax=c2,Ebx=a2,Ecx=a1+1,Edx=0.
DivDDDD_Q02:
  Dec   Ecx ; Eax=c2,Ebx=a2,Ecx=a1,Edx=0.
  Mov   Eax,Edx ; Eax=0,Ebx=a2,Ecx=a1,Edx=0.
  Retn  ; Return to caller.
 
 Saturated unsigned integer addition

When adding two unsigned integers, if the answer overflows we sometimes want to reset the result back to a maximum value. This is called saturated addition, and it is often used in graphics operations. For example, if we are adding two color intensities, each of which ranges from 0 to 255, and the answer exceeds 255, then we usually want to reset the sum back to 255.

Many of the newer Intel-compatible CPU's support saturated addition in hardware via MMX instructions. However, detecting and invoking the MMX instructions is often more of a hassle than it's worth, and the MMX instructions aren't even supported on many older CPU's, such as Intel's original Pentium.

Fortunately, saturated addition is fairly efficient to implement using standard instructions if the saturation value is 255 in the case of byte operations, 65535 in the case of word operations, or 2^32-1 in the case of dword operations. Here is the code for the byte case; the code is analogous for the word and dword cases:

; Al contains sum, Bl contains value to add.
  Add   Al,Bl ; Increment sum.
  Sbb   Bl,Bl ; Bl = 0 if no overflow (no carry), or 255 if overflow.
  Or    Al,Bl ; Al = sum if no overflow (no carry), or 255 if overflow.
; Al contains sum or 255, Bl contains 0 or 255.
 
 Absolute value of an integer

One of the classic assembly language tricks is taking the absolute value of a 2's complement integer. The naive method would be:

; Eax contains integer.
  Test  Eax,Eax ; Is Eax negative?
  Jns   GotAbsoluteValue ; If sign bit is 0, integer is already non-negative.
  Neg   Eax ; Negate a negative integer to get a positive integer.
GotAbsoluteValue:
; Eax contains abs(integer).

This method is good for clarity and for using the minimum number of registers, but is bad for speed because of the test and jump. The classic trick is:

; Eax contains integer.
  Cdq   ; Edx = -1 if Eax negative, Edx = 0 otherwise.
  Xor   Eax,Edx ; Eax = -integer-1 if Eax negative, Eax = integer otherwise.
  Sub   Eax,Edx ; Eax = -integer if Eax negative, Eax = integer otherwise.
; Eax contains abs(integer), Edx = -1 if Eax was negative, Edx = 0 otherwise.

This method is bad for clarity (although most experienced assembly language programmers would recognize it as a common idiom) and for register usage, but is good for speed because it gets rid of the test and jump. Since the Cdq instruction works only with the Eax and Edx registers, a variation that will work with any pair of registers is:

; Eax contains integer.
  Mov   Edx,Eax ; Copy integer.
  Sar   Edx,31 ; Edx = -1 if Eax negative, Edx = 0 otherwise.
  Xor   Eax,Edx ; Eax = -integer-1 if Eax negative, Eax = integer otherwise.
  Sub   Eax,Edx ; Eax = -integer if Eax negative, Eax = integer otherwise.
; Eax contains abs(integer), Edx = -1 if Eax was negative, Edx = 0 otherwise.

And as a final thought, sometimes the full absolute value isn't needed. See, for example, Divide a signed dividend by an unsigned divisor. For that tip/trick, if the integer is positive or zero we want to leave the integer unchanged (just as for the absolute value). However, if the integer is negative what we want is the absolute value minus 1. So, we need only the first three instructions above:

; Eax contains integer.
  Mov   Edx,Eax ; Copy integer.
  Sar   Edx,31 ; Edx = -1 if Eax negative, Edx = 0 otherwise.
  Xor   Eax,Edx ; Eax = -integer-1 if Eax negative, Eax = integer otherwise.
; Eax contains result, Edx = -1 if Eax was negative, Edx = 0 otherwise.
 
 Convert from binary to ASCII hex

Here's a clever trick that I read about in the (http://asmjournal.freeservers.com/ broken link) Assembly Programming Journal, February/March 1999. Suppose you have a binary number from 0h to 0Fh in Al, and you want to convert it to an ASCII character from "0" to "9" or "A" to "F". The following sequence of four instructions (six bytes) will do it:

; Al is from 0h to 0Fh.
  Add   Al,90h ; Al is from 90h to 09Fh.
  Daa   ; Al is from 90h to 99h (carry clear) or 00h to 06h (carry set).
  Adc   Al,40h ; Al is from 0D0h to 0D9h or 41h to 46h.
  Daa   ; Al is from 30h to 39h ("0" to "9") or 41h to 46h ("A" to "F").
 
 Divide a signed dividend by an unsigned divisor

Suppose you want to divide a signed dividend (numerator) by a non-zero unsigned divisor (denominator), to get a signed quotient and a non-negative remainder. In other words, given Num and Den where

-2^31 <= Num <= 2^31-1 and 0 < Den <= 2^32-1

you would like to calculate Quo and Rem such that

Num = Quo*Den + Rem and 0 <= Rem < Den.

Unfortunately, the Idiv instruction can't do this kind of division, because if Num is negative, Idiv will return a negative or zero remainder. Also, the Idiv instruction would require narrowing the range of Den to 0 < Den <= 2^31-1.

Here is the algorithm I use. The basic idea is to convert Num to a positive value if it is negative, perform unsigned division to obtain an intermediate Quo' and Rem', and then correct Quo' and Rem' if Num was originally negative. The algorithm involves no jump instructions, and so won't break the pipeline. It requires two extra registers in addition to Eax and Edx. The two extra registers can be any two of Ebx, Ecx, Esi, Edi, and Ebp. In the following code snippet, I use Esi and Edi.

; Eax = Num; Esi = Den.
  Mov   Edi,Eax ; Copy Num.
  Sar   Edi,31 ; If Num < 0 then Edi = -1 else Edi = 0.
  Xor   Edx,Edx ; Zero hi dword of dividend.
  Xor   Eax,Edi ; If Num < 0 then Eax = -Num-1 else Eax = Num.
  Div   Esi ; If Num < 0 then Quo'*Den + Rem' = -Num-1
            ;            else Quo'*Den + Rem' = Num.
  Xor   Eax,Edi ; If Num < 0 then Eax=Quo=-Quo'-1 else Eax=Quo=Quo'.
  Xor   Edx,Edi ; If Num < 0 then Edx = -Rem'-1 else Edx = Rem'.
  And   Edi,Esi ; If Num < 0 then Edi = Den else Edi = 0.
  Add   Edx,Edi ; If Num < 0 then Edx=Rem=Den-Rem'-1 else Edx=Rem=Rem'.
; Eax = Quo; Edx = Rem; Esi = Den; Edi = Den if Num < 0 else Edi = 0.

You can save a few cycles if you know that -2^15 <= Num <= 2^15-1 and 0 < Den <= 2^16-1 (i.e. that Num is a signed 16-bit integer and Den is a non-zero unsigned 16-bit integer). If so, change Div Esi to Div Si to use faster 16-bit division, and leave the other instructions unchanged.

To show that the algorithm works as advertised, first note that in the case of a non-negative Num the algorithm reduces to simple unsigned division so there is nothing to prove. In the case of a negative Num, we first note that -Num-1 is non-negative if Num is negative. Hence, the result of the Div instruction is a pair of non-negative intermediate numbers Quo' and Rem' which satisfy

-Num-1 = Quo'*Den + Rem' and 0 <= Rem' < Den.

We then calculate the true Quo and Rem using

Quo = -Quo'-1 and Rem = Den-Rem'-1.

Solving for Quo' and Rem' in terms of Quo and Rem, and plugging into the equation for -Num-1 gives

-Num-1 = (-Quo-1)*Den + (Den-Rem-1)

which simplifies to

Num = Quo*Den + Rem

Since 0 <= Rem' < Den, then 0 <= Den-Rem'-1 < Den, and so 0 <= Rem < Den, which shows that the algorithm works as advertised.

 
 Use the parity flag to test three values with one instruction

One bit can store two values: 0b or 1b. Two bits can store four values: 00b, 01b, 10b, or 11b. Sometimes, however, you only need to store three values. If so, you can take advantage of the parity flag to check for the three values using a single instruction. If you let the three values be 00b, 11b, and 01b (or 10b), then you can use the Jpo (jump on parity odd) instruction to do something like:

  Test  Al,11b ; Test bottom two bits of Al.
  Jpo   Value01bOr10b ; 01b or 10b in bottom two bits.
  Jnz   Value11b ; 11b in bottom two bits.
  ...   ; Fall through to here if 00b in bottom two bits.

Note that the order of the jump instructions is important, and also note the Parity flag gotcha below - both of the bits you are testing must be in the same byte. Putting the value into the Al register saves a byte of code, because testing the Al register takes only two bytes of code, while testing any other eight-bit register takes three bytes of code. You could have used a Test Eax,03h instruction, but that takes five bytes of code, and testing any other 32-bit register takes six bytes of code.

 
 Parity flag gotcha

Instructions that alter the parity flag clear it to 0 if there are an odd number of bits in the lowest order byte, and set it to 1 if there are an even number of bits in the lowest order byte. That may seem backwards, but the flag is technically a "parity even" flag so it gets set if there are an even number of bits in the lowest order byte. The gotcha, however, is in the words "lowest order byte". Suppose the Eax register contains the value 0101h (i.e. 1 in the low order byte, and 1 in the second lowest order byte). If you test the whole Eax register using the Test Eax,Eax instruction, you might expect that the parity flag would get set to 1 because there are an even number of bits in the Eax register. However, the parity flag gets cleared to 0 because the parity flag depends only on the number of bits in the lowest order byte. This is always true for any instruction that alters the parity flag; only the bits in the lowest order byte affect the parity flag. So, when using the parity flag for a test, it is often best to use eight-bit registers or single byte memory operands.

 
Copyright © 1997-2013 by David Parker. All rights reserved.