From: pottier@clipper.ens.fr (Francois Pottier)
Subject: csmp-digest-v3-036
Date: Thu, 16 Jun 1994 13:35:08 +0200 (MET DST)

C.S.M.P. Digest             Thu, 16 Jun 94       Volume 3 : Issue 36
 
Today's Topics:
 
        Apple Event example
        Faster Square Root Algorithm
        How to save alpha in PICT files?
        Open Transport & ASLM
        Sending the Mac Screen Image
        What are Universal Headers?


The Comp.Sys.Mac.Programmer Digest is moderated by Francois Pottier
(pottier@clipper.ens.fr).
 
The digest is a collection of article threads from the internet newsgroup
comp.sys.mac.programmer.  It is designed for people who read c.s.m.p. semi-
regularly and want an archive of the discussions.  If you don't know what a
newsgroup is, you probably don't have access to it.  Ask your systems
administrator(s) for details.  If you don't have access to news, you may
still be able to post messages to the group by using a mail server like
anon.penet.fi (mail help@anon.penet.fi for more information).
 
Each issue of the digest contains one or more sets of articles (called
threads), with each set corresponding to a 'discussion' of a particular
subject.  The articles are not edited; all articles included in this digest
are in their original posted form (as received by our news server at
nef.ens.fr).  Article threads are not added to the digest until the last
article added to the thread is at least two weeks old (this is to ensure that
the thread is dead before adding it to the digest).  Article threads that
consist of only one message are generally not included in the digest.

The digest is officially distributed by two means, by email and ftp.

If you want to receive the digest by mail, send email to listserv@ens.fr
with no subject and one of the following commands as body:
    help		                Sends you a summary of commands
    subscribe csmp-digest Your Name	Adds you to the mailing list
    signoff csmp-digest			Removes you from the list
Once you have subscribed, you will automatically receive each new
issue as it is created.

The official ftp info is //ftp.dartmouth.edu/pub/csmp-digest.
Questions related to the ftp site should be directed to
scott.silver@dartmouth.edu. Currently no previous volumes of the CSMP
digest are available there.

Also, the digests are available to WAIS users.  To search back issues
with WAIS, use comp.sys.mac.programmer.src. With Mosaic, use
http://www.wais.com/wais-dbs/comp.sys.mac.programmer.html.


-------------------------------------------------------

>From farago@onyx.ill.fr (Bela Farago)
Subject: Apple Event example
Date: Thu, 2 Jun 94 13:06:18 GMT
Organization: ILL - Grenoble FRANCE

Hi,

please forgive me if it is trivial but I am programming just by hobby (well
not really but this is not my main job)
I need to write a rather trivial code resource which sends the equivalent of
the following AppleScript

tell application anApplication
    save document 1
end tell

An Application is an application :)

It is hard to digest the volume VI of IM. and I could not find simple
example on ftp.apple.com. (I did find some but were not very clear to me) I
am bit lost in descriptors and direct objects etc. Could anybody give me
pointers to simple examples?
                                    Bela Farago (farago@ill.fr)

ps: No flames please, you can always just ignore this message.

+++++++++++++++++++++++++++

>From greer@utdallas.edu (Dale M. Greer)
Date: 2 Jun 1994 16:54:39 GMT
Organization: The University of Texas at Dallas

Bela Farago (farago@onyx.ill.fr) wrote:
> Hi,

> please forgive me if it is trivial but I am programming just by hobby (well
> not really but this is not my main job)
> I need to write a rather trivial code resource which sends the equivalent of
> the following AppleScript

> tell application anApplication
>     save document 1
> end tell

> An Application is an application :)

> It is hard to digest the volume VI of IM. and I could not find simple
> example on ftp.apple.com. (I did find some but were not very clear to me) I
> am bit lost in descriptors and direct objects etc. Could anybody give me
> pointers to simple examples?
>                                     Bela Farago (farago@ill.fr)

> ps: No flames please, you can always just ignore this message.


Here's how I did it.

/**************************************************** MyCreateDocContainer */
// 07-APR-94 : Dale M. Greer

#pragma segment AEvents
void MyCreateDocContainer(AEDesc *myDocContainer, char *docName)
{
	AEDesc myDocDescRec, nullDescRec;
	OSErr err;

	// Prepare to make something from nothing
	err = AECreateDesc(typeNull, nil, 0, &nullDescRec);

	// This document container just contains the document name
	err = AECreateDesc(typeChar, docName, strlen(docName), &myDocDescRec);

	// Encapsulate the object
	err = CreateObjSpecifier((DescType)cDocument, &nullDescRec, 
		(DescType)formName, &myDocDescRec, true, myDocContainer);

	AEDisposeDesc(&nullDescRec);
}

/********************************************************** SaveDocumentAs */
// 07-APR-94 : Dale M. Greer

#pragma segment AEvents
void SaveDocumentAs(AEAddressDesc *theAddress, char *docName, char *saveName)
{
	OSErr err;
	AppleEvent appleEvent, reply;
	AEDesc myDocDescRec, theName;

	err = AECreateAppleEvent(kAECoreSuite, kAESave, theAddress,
			kAutoGenerateReturnID, 1L, &appleEvent);

	// Make the doc container
	MyCreateDocContainer(&myDocDescRec, docName);

	// Append it to the appleEvent
	AEPutParamDesc(&appleEvent, keyDirectObject, &myDocDescRec);

	// Append the new filename descriptor to the appleEvent
	AECreateDesc(typeChar, saveName, strlen(saveName), &theName);
	AEPutParamDesc(&appleEvent, keyAEFile, &theName);

	// Send it to the Application
	err = AESend(&appleEvent, &reply, kAEWaitReply + kAENeverInteract,
			kAENormalPriority, kAEDefaultTimeout,
			(IdleProcPtr) MyIdleFunction, nil);

	AEDisposeDesc(&myDocDescRec);
	AEDisposeDesc(&appleEvent);
	AEDisposeDesc(&reply);
}


You get the address of the target application when you launch it, or if it
is already launched, use the following routine, where "creator" is the
creator type of your application.  For example, this creator type is
'XCEL' for Excel, 'ttxt' for TeachText, etc.

// This function originated from James "im" Beninghaus of Apple DTS
/*********************************************************** AppLaunched */
// Find the process serial number of your application

#pragma segment Launch
Boolean AppLaunched(AEAddressDesc *theAddress, OSType creator,
					ProcessSerialNumber *PSN)
{
	OSErr				osErr;
	ProcessSerialNumber process;
	ProcessInfoRec		procInfo;
	Str255				procName;
	FSSpec procSpec;
	
	osErr = noErr;
	process.highLongOfPSN = 0;
	process.lowLongOfPSN  = 0;
	
	procInfo.processInfoLength	= sizeof(procInfo);
	procInfo.processName		= procName;
	procInfo.processAppSpec		= &procSpec;

	while (procNotFound != (osErr = GetNextProcess(&process))) {
		if (noErr == (osErr = GetProcessInformation(&process, &procInfo))) {
			if (procInfo.processSignature == creator) {
            	AECreateDesc(typeProcessSerialNumber, (Ptr)&process, 
					sizeof(ProcessSerialNumber), theAddress);
					*PSN = process;
				return(true);
			}
		}
	}
	return(false);
}

In summary, the program snippet accomplishing what you desire would
look like this.

if (AppLaunched(&theAddress, theCreator, &PSN) {
	SaveDocumentAs(&theAddress, &docName, &saveName);
}

If the application is not already launched, there are a number of ways
to launch it, but I will leave that as an exercise for the reader, or
for a later tutorial if you're still having problems.

--

Dale Greer, greer@utdallas.edu
"They know that it is human nature to take up causes whereby a man may
 oppress his neighbor, no matter how unjustly. ... Hence they have had
 no trouble in finding men who would preach the damnability and heresy
 of the new doctrine from the very pulpit..." - Galileo Galilei, 1615


---------------------------

>From busfield@hurricane.seas.ucla.edu (John D. Busfield)
Subject: Faster Square Root Algorithm
Date: Tue, 3 May 1994 17:23:52 GMT
Organization: School of Engineering & Applied Science, UCLA.

    Does anyone have an algorithm or code for computing square roots.
The emphasis is on speed rather than accuracy in that I only need the
result rounded to the nearest integer (ie sqrt(1000) = 31 or 32).
Thanks in advance.

John Busfield
busfield@hurricane.seas.ucla.edu


+++++++++++++++++++++++++++

>From rmah@panix.com (Robert S. Mah)
Date: Tue, 03 May 1994 17:41:55 -0500
Organization: One Step Beyond

busfield@hurricane.seas.ucla.edu (John D. Busfield) wrote:

>     Does anyone have an algorithm or code for computing square roots.
> The emphasis is on speed rather than accuracy in that I only need the
> result rounded to the nearest integer (ie sqrt(1000) = 31 or 32).

There is an article on generating very fast square root aproximations
in the book "Graphics Gems", Ed. Glassner. 

It generates a lookup table which is indexed by munging the exponent
of the argument and then uses the mantissa to do an (linear?) 
aproximation to the final result.  It's fairly accurate to within a few
decimal points.  I think the source code is also available somewhere 
on the net, but I'm afraid I don't know where.

Another option, since you only want integer aproximations, is to create a 
table of squares and do a binary search over it.  A 1000 element table 
will give you a range of 1->1,000,000 with an average of log2(1000) = 10 
lookups using a simple binary search.  A table with 32K or so elements will

have a range of 1->1,000,000,000 and will be searchable in log2(32K) = 30 
lookups.  If you can spare a few dozen K, then this may be fast enough.

Cheers,
Rob
___________________________________________________________________________
Robert S. Mah  -=-  One Step Beyond  -=-  212-947-6507  -=-  rmah@panix.com

+++++++++++++++++++++++++++

>From gewekean@studentg.msu.edu (Andrew Geweke)
Date: Tue,  3 May 1994 20:56:35 -0400
Organization: Michigan State University

In article <rmah-030594174155@rmah.dialup.access.net>, rmah@panix.com (Robert 
S. Mah) writes:
> Another option, since you only want integer aproximations, is to create 
> a table of squares and do a binary search over it.  A 1000 element 
> table will give you a range of 1->1,000,000 with an average of log2
> (1000) = 10 lookups using a simple binary search.  A table with 32K or 
> so elements will 
>  
> have a range of 1->1,000,000,000 and will be searchable in log2(32K) = 30 
> lookups.  If you can spare a few dozen K, then this may be fast enough. 

Actually, I once found this algorithm:

int intSqrt (int num) {
   for (i = j = 1, k = 3; num > j; j += (k+=2), ++i)
          ;

   return i;
}

Note that this algorithm rounds up, not down. I'm not sure exactly how 
correct this is (I just coded it off the top of my head) but you get the 
basic idea in any case. Adding the odd integers gives you the squares; i.e. 1 
= 1; 1 + 3 = 4; 1 + 3 + 5 = 9; 1 + 3 + 5 + 7 = 16; 1 + 3 + 5 + 7 + 9 = 25; 
and so on.

I don't know how fast this is in your specific instance, but it should be 
quite fast.


+++++++++++++++++++++++++++

>From rmah@panix.com (Robert S. Mah)
Date: Tue, 03 May 1994 22:04:22 -0500
Organization: One Step Beyond

gewekean@studentg.msu.edu (Andrew Geweke) wrote:

> rmah@panix.com (Robert  S. Mah) writes:
> > Another option, since you only want integer aproximations, is to create 
> > a table of squares and do a binary search over it.  A 1000 element 
> > [...]
> > have a range of 1->1,000,000,000 and will be searchable in log2(32K) = 30 
> > lookups.  If you can spare a few dozen K, then this may be fast enough. 
> 
> Actually, I once found this algorithm:
> 
> int intSqrt (int num) {
>    for (i = j = 1, k = 3; num > j; j += (k+=2), ++i)
>           ;
> 
>    return i;
> }
> 
> Note that this algorithm rounds up, not down. I'm not sure exactly how 
> correct this is (I just coded it off the top of my head) but you get the 
> basic idea in any case. Adding the odd integers gives you the squares; 
> i.e. 1  = 1; 1 + 3 = 4; 1 + 3 + 5 = 9; 1 + 3 + 5 + 7 = 16; 1 + 3 + 5 + 
> 7 + 9 = 25;  and so on.
> 
> I don't know how fast this is in your specific instance, but it should  
> be quite fast.

That's a great little algorithm!  That sequence looks _so_ familiar, but
for the life of me, I can't recall the name...

Anyway, it looks like it would be order O(sqrt(n)) where n is the argument 
to the function.  Only a problem if the domain range for the application 
is large, that is Sqrt(10000) would take 10 times as long as Sqrt(100).  
However, unlike the table method, the upper bounds isn't fixed at compile
time.

Cheers,
Rob
___________________________________________________________________________
Robert S. Mah  -=-  One Step Beyond  -=-  212-947-6507  -=-  rmah@panix.com

+++++++++++++++++++++++++++

>From pottier@corvette.ens.fr (Francois Pottier)
Date: 4 May 1994 09:42:29 GMT
Organization: Ecole Normale Superieure, PARIS, France

In article <busfield.767985832@hurricane>,
John D. Busfield <busfield@hurricane.seas.ucla.edu> wrote:

>    Does anyone have an algorithm or code for computing square roots.


There was a thread on this subject about a year ago. Here's what
I kept:

- -----------------------------------------------------------------------


>Does somebody have code or an algorithm for extracting a long integer
>square root and returning an integer?

There was a long series of letters in Dr.  Dobbs' Journal a few years
back that I was a part of.  Here are two 'competing' subroutines for the
68000 that I wrote.  One is a Newton-Raphson iterator, the other a
hybrid of three different subroutines using two different methods,
heavily optimized for speed.  The non-Newton one is faster in some cases
and slower in others.  Choosing one as 'best' depends on what kind of
input ranges you expect to see most often.  Newton's time doesn't vary
much on the average based on the input argument.  The non-Newton one
ranges from a lot better (small arguments) to a little worse (large
arguments), but is always better than Newton's worst case.  Don't
neglect the fact that on a FPU-equipped machine it may be faster to use
floating point.  I've done no research into this possibility, nor have I
timed this stuff on any 68020+ system.  The speed balance is no doubt
different then.  (For that matter, the 68000 testbed had no wait states
nor interrupt system overhead, which doesn't necessarily apply to a
68000 PowerBook, and certainly doesn't apply to a Mac+ or earlier.)

I'm personally fond of the non-Newton version, because the algorithm
only uses shifts and adds, so it could be implemented in microcode with
about same speed as a divide.

*************************************************************************
*                                                                       *
*       Integer Square Root (32 to 16 bit).                             *
*                                                                       *
*       (Newton-Raphson method).                                        *
*                                                                       *
*       Call with:                                                      *
*               D0.L = Unsigned number.                                 *
*                                                                       *
*       Returns:                                                        *
*               D0.L = SQRT(D0.L)                                       *
*                                                                       *
*       Notes:  Result fits in D0.W, but is valid in longword.          *
*               Takes from 338 cycles (1 shift, 1 division) to          *
*               1580 cycles (16 shifts, 4 divisions)  (including rts).  *
*               Averages 854 cycles measured over first 65535 roots.    *
*               Averages 992 cycles measured over first 500000 roots.   *
*                                                                       *
*************************************************************************

        .globl lsqrt
*                       Cycles
lsqrt   movem.l d1-d2,-(sp) (24)
        move.l d0,d1    (4)     ; Set up for guessing algorithm.
        beq.s return    (10/8)  ; Don't process zero.
        moveq #1,d2     (4)

guess   cmp.l d2,d1     (6)     ; Get a guess that is guaranteed to be
        bls.s newton    (10/8)  ; too high, but not by much, by dividing the
        add.l d2,d2     (8)     ; argument by two and multiplying a 1 by 2
        lsr.l #1,d1     (10)    ; until the power of two passes the modified
        bra.s guess     (10)    ; argument, then average these two numbers.

newton  add.l d1,d2     (8)     ; Average the two guesses.
        lsr.l #1,d2     (10)
        move.l d0,d1    (4)     ; Generate the next approximation(s)
        divu d2,d1      (140)   ; via the Newton-Raphson method.
        bvs.s done      (10/8)  ; Handle out-of-range input (cheats!)
        cmp.w d1,d2     (4)     ; Have we converged?
        bls.s done      (10/8)
        swap d1         (4)     ; No, kill the remainder so the
        clr.w d1        (4)     ; next average comes out right.
        swap d1         (4)
        bra.s newton    (10)

done    clr.w d0        (4)     ; Return a word answer in a longword.
        swap d0         (4)
        move.w d2,d0    (4)
return  movem.l (sp)+,d1-d2  (28)
        rts             (16)

        end
*************************************************************************
*                                                                       *
*       Integer Square Root (32 to 16 bit).                             *
*                                                                       *
*       (Exact method, not approximate).                                *
*                                                                       *
*       Call with:                                                      *
*               D0.L = Unsigned number.                                 *
*                                                                       *
*       Returns:                                                        *
*               D0.L = SQRT(D0.L)                                       *
*                                                                       *
*       Notes:  Result fits in D0.W, but is valid in longword.          *
*               Takes from 122 to 1272 cycles (including rts).          *
*               Averages 610 cycles measured over first 65535 roots.    *
*               Averages 1104 cycles measured over first 500000 roots.  *
*                                                                       *
*************************************************************************

        .globl lsqrt
*                       Cycles
lsqrt   tst.l d0        (4)     ; Skip doing zero.
        beq.s done      (10/8)
        cmp.l #$10000,d0 (14)   ; If is a longword, use the long routine.
        bhs.s glsqrt    (10/8)
        cmp.w #625,d0   (8)     ; Would the short word routine be quicker?
        bhi.s gsqrt     (10/8)  ; No, use general purpose word routine.
*                               ; Otherwise fall into special routine.
*
*  For speed, we use three exit points.
*  This is cheesy, but this is a speed-optimized subroutine!

        page
*************************************************************************
*                                                                       *
*       Faster Integer Square Root (16 to 8 bit).  For small arguments. *
*                                                                       *
*       (Exact method, not approximate).                                *
*                                                                       *
*       Call with:                                                      *
*               D0.W = Unsigned number.                                 *
*                                                                       *
*       Returns:                                                        *
*               D0.W = SQRT(D0.W)                                       *
*                                                                       *
*       Notes:  Result fits in D0.B, but is valid in word.              *
*               Takes from 72 (d0=1) to 504 (d0=625) cycles             *
*               (including rts).                                        *
*                                                                       *
*       Algorithm supplied by Motorola.                                 *
*                                                                       *
*************************************************************************

* Use the theorem that a perfect square is the sum of the first
* sqrt(arg) number of odd integers.

*                       Cycles
        move.w d1,-(sp) (8)
        move.w #-1,d1   (8)
qsqrt1  addq.w #2,d1    (4)
        sub.w d1,d0     (4)
        bpl qsqrt1      (10/8)
        asr.w #1,d1     (8)
        move.w d1,d0    (4)
        move.w (sp)+,d1 (12)
done    rts             (16)

        page
*************************************************************************
*                                                                       *
*       Integer Square Root (16 to 8 bit).                              *
*                                                                       *
*       (Exact method, not approximate).                                *
*                                                                       *
*       Call with:                                                      *
*               D0.W = Unsigned number.                                 *
*                                                                       *
*       Returns:                                                        *
*               D0.L = SQRT(D0.W)                                       *
*                                                                       *
*       Uses:   D1-D4 as temporaries --                                 *
*               D1 = Error term;                                        *
*               D2 = Running estimate;                                  *
*               D3 = High bracket;                                      *
*               D4 = Loop counter.                                      *
*                                                                       *
*       Notes:  Result fits in D0.B, but is valid in word.              *
*                                                                       *
*               Takes from 544 to 624 cycles (including rts).           *
*                                                                       *
*               Instruction times for branch-type instructions          *
*               listed as (X/Y) are for (taken/not taken).              *
*                                                                       *
*************************************************************************

*                       Cycles
gsqrt   movem.l d1-d4,-(sp) (40)
        move.w #7,d4    (8)     ; Loop count (bits-1 of result).
        clr.w d1        (4)     ; Error term in D1.
        clr.w d2        (4)
sqrt1   add.w d0,d0     (4)     ; Get 2 leading bits a time and add
        addx.w d1,d1    (4)     ; into Error term for interpolation.
        add.w d0,d0     (4)     ; (Classical method, easy in binary).
        addx.w d1,d1    (4)
        add.w d2,d2     (4)     ; Running estimate * 2.
        move.w d2,d3    (4)
        add.w d3,d3     (4)
        cmp.w d3,d1     (4)
        bls.s sqrt2     (10/8)  ; New Error term > 2* Running estimate?
        addq.w #1,d2    (4)     ; Yes, we want a '1' bit then.
        addq.w #1,d3    (4)     ; Fix up new Error term.
        sub.w d3,d1     (4)
sqrt2   dbra d4,sqrt1   (10/14) ; Do all 8 bit-pairs.
        move.w d2,d0    (4)
        movem.l (sp)+,d1-d4 (44)
        rts             (16)

        page
*************************************************************************
*                                                                       *
*       Integer Square Root (32 to 16 bit).                             *
*                                                                       *
*       (Exact method, not approximate).                                *
*                                                                       *
*       Call with:                                                      *
*               D0.L = Unsigned number.                                 *
*                                                                       *
*       Returns:                                                        *
*               D0.L = SQRT(D0.L)                                       *
*                                                                       *
*       Uses:   D1-D4 as temporaries --                                 *
*               D1 = Error term;                                        *
*               D2 = Running estimate;                                  *
*               D3 = High bracket;                                      *
*               D4 = Loop counter.                                      *
*                                                                       *
*       Notes:  Result fits in D0.W, but is valid in longword.          *
*                                                                       *
*               Takes from 1080 to 1236 cycles (including rts.)         *
*                                                                       *
*               Two of the 16 passes are unrolled from the loop so that *
*               quicker instructions may be used where there is no      *
*               danger of overflow (in the early passes).               *
*                                                                       *
*               Instruction times for branch-type instructions          *
*               listed as (X/Y) are for (taken/not taken).              *
*                                                                       *
*************************************************************************

*                       Cycles
glsqrt  movem.l d1-d4,-(sp) (40)
        moveq #13,d4    (4)     ; Loop count (bits-1 of result).
        moveq #0,d1     (4)     ; Error term in D1.
        moveq #0,d2     (4)
lsqrt1  add.l d0,d0     (8)     ; Get 2 leading bits a time and add
        addx.w d1,d1    (4)     ; into Error term for interpolation.
        add.l d0,d0     (8)     ; (Classical method, easy in binary).
        addx.w d1,d1    (4)
        add.w d2,d2     (4)     ; Running estimate * 2.
        move.w d2,d3    (4)
        add.w d3,d3     (4)
        cmp.w d3,d1     (4)
        bls.s lsqrt2    (10/8)  ; New Error term > 2* Running estimate?
        addq.w #1,d2    (4)     ; Yes, we want a '1' bit then.
        addq.w #1,d3    (4)     ; Fix up new Error term.
        sub.w d3,d1     (4)
lsqrt2  dbra d4,lsqrt1  (10/14) ; Do first 14 bit-pairs.

        add.l d0,d0     (8)     ; Do 15-th bit-pair.
        addx.w d1,d1    (4)
        add.l d0,d0     (8)
        addx.l d1,d1    (8)
        add.w d2,d2     (4)
        move.l d2,d3    (4)
        add.w d3,d3     (4)
        cmp.l d3,d1     (6)
        bls.s lsqrt3    (10/8)
        addq.w #1,d2    (4)
        addq.w #1,d3    (4)
        sub.l d3,d1     (8)

lsqrt3  add.l d0,d0     (8)     ; Do 16-th bit-pair.
        addx.l d1,d1    (8)
        add.l d0,d0     (8)
        addx.l d1,d1    (8)
        add.w d2,d2     (4)
        move.l d2,d3    (4)
        add.l d3,d3     (8)
        cmp.l d3,d1     (6)
        bls.s lsqrt4    (10/8)
        addq.w #1,d2    (4)
lsqrt4  move.w d2,d0    (4)
        movem.l (sp)+,d1-d4 (44)
        rts             (16)

        end
-- 
+----------------+
! II      CCCCCC !  Jim Cathey
! II  SSSSCC     !  ISC-Bunker Ramo
! II      CC     !  TAF-C8;  Spokane, WA  99220
! IISSSS  CC     !  UUCP: uunet!isc-br!jimc (jimc@isc-br.isc-br.com)
! II      CCCCCC !  (509) 927-5757
+----------------+

- -----------------------------------------------------------------------------

/*
 * ISqrt--
 *     Find square root of n.  This is a Newton's method approximation,
 * converging in 1 iteration per digit or so.  Finds floor(sqrt(n) + 0.5).
 */
int ISqrt(register int n)
{
    register int i;
    register long k0, k1, nn;

    for (nn = i = n, k0 = 2; i > 0; i >>= 2, k0 <<= 1)
        ;
    nn <<= 2;
    for (;;) {
        k1 = (nn / k0 + k0) >> 1;
        if (((k0 ^ k1) & ~1) == 0)
            break;
        k0 = k1;
    }
    return (int) ((k1 + 1) >> 1);	
}

-- 
Joseph Nathan Hall  |  Joseph's Rule of Thumb: Most folks aren't interested
Software Architect  |                          in your rules of thumb.
Gorca Systems Inc.  |  ----- joseph@joebloe.maple-shade.nj.us (home) -----
(on assignment)     |        (602) 732-2549 jnhall@sat.mot.com (work)

- -----------------------------------------------------------------------------

Here's an article I saved from c.s.m.p four months ago.  Strangely, it
was only distributed in North America.  I don't know how fast it works
in practice, but there are no multiplications or divisions in its inner
loop, which is promising.


#include <math.h>
 
/*
 * It requires more space to describe this implementation of the manual
 * square root algorithm than it did to code it.  The basic idea is that
 * the square root is computed one bit at a time from the high end.  Because
 * the original number is 32 bits (unsigned), the root cannot exceed 16 bits
 * in length, so we start with the 0x8000 bit.
 *
 * Let "x" be the value whose root we desire, "t" be the square root
 * that we desire, and "s" be a bitmask.  A simple way to compute
 * the root is to set "s" to 0x8000, and loop doing the following:
 *
 *	t = 0;
 *	s = 0x8000;
 *	do {
 *		if ((t + s) * (t + s) <= x)
 *			t += s;
 *		s >>= 1;
 *	while (s != 0);
 *
 * The primary disadvantage to this approach is the multiplication.  To
 * eliminate this, we begin simplying.  First, we observe that
 *
 *	(t + s) * (t + s) == (t * t) + (2 * t * s) + (s * s)
 *
 * Therefore, if we redefine "x" to be the original argument minus the
 * current value of (t * t), we can determine if we should add "s" to
 * the root if
 *
 *	(2 * t * s) + (s * s) <= x
 *
 * If we define a new temporary "nr", we can express this as
 *
 *	t = 0;
 *	s = 0x8000;
 *	do {
 *		nr = (2 * t * s) + (s * s);
 *		if (nr <= x) {
 *			x -= nr;
 *			t += s;
 *		}
 *		s >>= 1;
 *	while (s != 0);
 *
 * We can improve the performance of this by noting that "s" is always a
 * power of two, so multiplication by "s" is just a shift.  Also, because
 * "s" changes in a predictable manner (shifted right after each iteration)
 * we can precompute (0x8000 * t) and (0x8000 * 0x8000) and then adjust
 * them by shifting after each step.  First, we let "m" hold the value
 * (s * s) and adjust it after each step by shifting right twice.  We
 * also introduce "r" to hold (2 * t * s) and adjust it after each step
 * by shifting right once.  When we update "t" we must also update "r",
 * and we do so by noting that (2 * (old_t + s) * s) is the same as
 * (2 * old_t * s) + (2 * s * s).  Noting that (s * s) is "m" and that
 * (r + 2 * m) == ((r + m) + m) == (nr + m):
 *
 *	t = 0;
 *	s = 0x8000;
 *	m = 0x40000000;
 *	r = 0;
 *	do {
 *		nr = r + m;
 *		if (nr <= x) {
 *			x -= nr;
 *			t += s;
 *			r = nr + m;
 *		}
 *		s >>= 1;
 *		r >>= 1;		
 *		m >>= 2;
 *	} while (s != 0);
 *
 * Finally, we note that, if we were using fractional arithmetic, after
 * 16 iterations "s" would be a binary 0.5, so the value of "r" when
 * the loop terminates is (2 * t * 0.5) or "t".  Because the values in
 * "t" and "r" are identical after the loop terminates, and because we
 * do not otherwise use "t"  explicitly within the loop, we can omit it.
 * When we do so, there is no need for "s" except to terminate the loop,
 * but we observe that "m" will become zero at the same time as "s",
 * so we can use it instead.
 *
 * The result we have at this point is the floor of the square root.  If
 * we want to round to the nearest integer, we need to consider whether
 * the remainder in "x" is greater than or equal to the difference
 * between ((r + 0.5) * (r + 0.5)) and (r * r).  Noting that the former
 * quantity is (r * r + r + 0.25), we want to check if the remainder is
 * greater than or equal to (r + 0.25).  Because we are dealing with
 * integers, we can't have equality, so we round up if "x" is strictly
 * greater than "r":
 *
 *	if (x > r)
 *		r++;
 */
 
unsigned int isqrt(unsigned long x)
{
	unsigned long r, nr, m;
 
	r = 0;
	m = 0x40000000;
	do {
		nr = r + m;
		if (nr <= x) {
			x -= nr;
			r = nr + m;
		}
		r >>= 1;
		m >>= 2;
	} while (m != 0);
 
	if (x > r)
		r++;
	return(r);
}
--
(Dr.) John Bruner, Deputy Director			bruner@csrd.uiuc.edu
Center for Supercomputing Research & Development	(217) 244-4476 (voice)
University of Illinois at Urbana-Champaign		(217) 244-1351 (FAX)
465 CSRL, MC-264; 1308 West Main St.; Urbana, IL  61801-2307


-- 
 Jamie McCarthy		Internet: k044477@kzoo.edu	AppleLink: j.mccarthy

- -----------------------------------------------------------------------------

Then there is the easy way - use the toolbox:

#define LongSqrt(x) ((short)(FracSqrt((Fract)(x)) >> 15))

FractSqrt is in FixMath.h. This works because given two fixed point number 
of the form x.y, where x is the number of bits before the decimal point 
and y is the number of bits after the decimal point, twos complement 
multiplication will yield a result of: x1.y1 * x2.y2 = x1+x2.y1+y2. If the 
numbers are the same format, this becomes x.y * x.y = 2x.2y. This holds 
for squaring a number also: x.y^2 = 2x.2y. The sqrt is then sqrt(x.y) = 
x/2.y/2. FracSqrt is of the form FracSqrt(2.30) = 2.30. Since we know how 
sqrt works this can also be expressed as FracSqrt(x.y) = x/2.y/2 << 15 
(This is a "virtual" shift since the calculation is compleated with more 
precision the 15 bits shifted in are "real" and not 0). So if you want to 
use FracSqrt for a long you get FracSqrt(32.0) = 16.0 << 15. So you shift 
the result down be 15 to get a short (you can add 1 << 14 prior to 
shifting down by 15 to round instead of truncing your result). If you 
wanted a long sqrt with a 16.16 (Fixed) result you would use:

#define LongFixedSqrt(x) ((Fixed)(FracSqrt((Fract)(x)) << 1))

If you want a Fixed (16.16) sqrt with a Fixed result (16.16) you would use:

#define FixedSqrt(x) ((Fixed)FracSqrt((Fract)(x)) >> 7))

You can do this kind of work with all of the fixed point routines and 
standard operators.

Sean Parent


-- 
Francois Pottier                                            pottier@dmi.ens.fr
- ----------------------------------------------------------------------------
This area dedicated to the preservation of endangered species.         "Moof!"

+++++++++++++++++++++++++++

>From christer@cs.umu.se (Christer Ericson)
Date: Wed, 4 May 1994 12:16:40 GMT
Organization: Dept. of Computing Science, Umea Univ., 901 87 Umea, Sweden

In <busfield.767985832@hurricane> busfield@hurricane.seas.ucla.edu (John D. Busfield) writes:

>    Does anyone have an algorithm or code for computing square roots.
>The emphasis is on speed rather than accuracy in that I only need the
>result rounded to the nearest integer (ie sqrt(1000) = 31 or 32).
>Thanks in advance.

I posted this to comp.sys.m68k some time ago, here it is again:


unsigned short ce_quick_sqrt(n)
register unsigned long n;
{
	asm {
;-------------------------
;
; Routine       : ISQRT; integer square root
; I/O parameters: d0.w = sqrt(d0.l)
; Registers used: d0-d2/a0
;
; This is a highly optimized integer square root routine, based
; on the iterative Newton/Babylonian method
;
;	r(n + 1) = (r(n) + A / R(n)) / 2
;
; Verified over the full interval [0x0L,0xFFFFFFFFL]
;
; Some minor compromises have been made to make it perform well
; on the 68000 as well as 68020/030/040. This routine outperforms
; the best of all other algorithms I've seen (except for a table
; lookup).
;
; Copyright (c) Christer Ericson, 1993. All rights reserved.
;
; Christer Ericson, Dept. of Computer Science, University of Umea,
; S-90187 UMEA, SWEDEN. Internet: christer@cs.umu.se
;

;-------------------------
;
; Macintosh preamble since THINK C passes first register param
; in d7. Remove this section on any other machine
;
	move.l	d7,d0
;-------------------------
;
; Actual integer square root routine starts here
;
	move.l	d0,a0		; save original input value for the iteration
	beq.s	@exit		; unfortunately we must special case zero
	moveq	#2,d1		; init square root guess
;-------------------------
;
; Speed up the process of finding a power of two approximation
; to the actual square root by shifting an appropriate amount
; when the input value is large enough
;
; If input values are word only, this section can be removed
;
	move.l	d0,d2
	swap	d2		; do [and.l 0xFFFF0000,d2] this way to...
	tst.w	d2		; go faster on 68000 and to avoid having to...
	beq.s	@skip8		; reload d2 for the next test below
	move.w	#0x200,d1	; faster than lsl.w #8,d1 (68000)
	lsr.l	#8,d0
@skip8	and.w	#0xFE00,d2	; this value and shift by 5 are magic
	beq.s	@skip4
	lsl.w	#5,d1
	lsr.l	#5,d0
@skip4

;-------------------------
;
; This finds the power of two approximation to the actual square root
; by doing the integer equivalent to sqrt(x) = 2 ^ (log2(x) / 2). This
; is done by shifting the input value down while shifting the root
; approximation value up until they meet in the middle. Better precision
; (in the step described below) is gained by starting the approximation
; at 2 instead of 1 (this means that the approximation will be a power
; of two too large so remember to shift it down).
;
; Finally (and here's the clever thing) since, by previously shifting the
; input value down, it has actually been divided by the root approximation
; value already so the first iteration is "for free". Not bad, eh?
;
@loop	add.l	d1,d1
	lsr.l	#1,d0
	cmp.l	d0,d1
	bcs.s	@loop
@skip	lsr.l	#1,d1		; adjust the approximation
	add.l	d0,d1		; here we just add and shift to...
	lsr.l	#1,d1		; get the first iteration "for free"!
;-------------------------
;
; The iterative root value is guaranteed to be larger than (or equal to)
; the actual square root, except for the initial guess. But since the first
; iteration was done above, the loop test can be simplified below
; (Oh, without the bvs.s the routine will fail on most large numbers, like
; for instance, 0xFFFF0000. Could there be a better way of handling these,
; so the bvs.s can be removed? Nah...)
;
@loop2	move.l	a0,d2		; get original input value
	move.w	d1,d0		; save current guess
	divu.w	d1,d2		; do the Newton method thing
	bvs.s	@exit		; if div overflows, exit with current guess
	add.w	d2,d1
	roxr.w	#1,d1		; roxr ensures shifting back carry overflow
	cmp.w	d0,d1
	bcs.s	@loop2		; exit with result in d0.w
@exit
	}
}


Christer Ericson --- Internet: christer@cs.umu.se --- tel: +46-90-166794
Department of Computer Science, University of Umea, S-90187 UMEA, SWEDEN

+++++++++++++++++++++++++++

>From usenet@lowry.eche.ualberta.ca (Brian Lowry)
Date: 4 May 1994 16:38:23 GMT
Organization: Dept. of Chemical Eng., University of Alberta

Well, if people are going to start posting square root routines that use
division, here's about the simplest possible, coded in C:


#define precLim 1e-8   //for good single precision performance

float sqrt(float x);

float sqrt(float x)
{	register float d = 0.5*(1.0 - x);
		register float b = 1.0 - d;
	
	while ((d > precLim)||(d < precLim))
	{	d *= 0.5*d/b; b -= d;}

	return b;
}


  This takes about three to five iterations, but on the PowerPC that
amounts to a whopping 100 to 170 clock cycles.  Since that's enough cycles
for about 80 multiplies and/or flops, division is definitely not the best
route here (though it's certainly compact...)

Brian Lowry

+++++++++++++++++++++++++++

>From parkb@bigbang.Stanford.EDU (Brian Park)
Date: 4 May 1994 20:43:37 GMT
Organization: Stanford University

I am assuming that the original poster wanted integer in/ integer out
since he only wanted accurary within 1.

The following is a description of code which follows at the end of this
article.

lsqrt2() - traditional Newton's method
- ------
The traditional Newton's method of getting y = sqrt(n), coded as lsqrt2()
below)

	n = y^2
	y(0) = n
	y(i+1) = (y(i) + n/y(i))/2

is second-order convergent, meaning that the relative error in the i+1
iteration is proportional to the square of the previous error
(remember, at sufficiently large iteration, relative error << 1).  [Do
a Taylor series exansion to verify that the first order term
vanishes].  Hence, for a fixed tolerance, in this case, within unity
(ie. error of 1/n), the algorithm performs a bisection on the
error at every iteration.

My point is that Newton's method is O(log_2(n)) and you are not going
to do much better than that.  Binary lookup tables are O(log_2(n)) 
efficient.  The only thing you might save is an integer division
compared to the Newton's method (see n/y(i) above).  But it also globbles
up memory and has fixed size.  A hash table provides the fastest
lookup table, but it takes up even more memory than the binary lookup
table, and coding it can be more tricky.  Do you really want to use up
that much memory for a simple square root function?

I should add that Newton's method is guaranteed to converge for all n.

lsqrt1() - bit-shift method
- -----
There is yet another method (see lsqrt1() function below) that I dreamed
up last night, and coded it quickly this morning.  It uses bit-shifts
to perform an integer square root by taking log_2(n), dividing it
by 2 (hence doing a square root) and shifting it back.  This routine is
also O(log_2(n)) but has the peculiar property that it is _faster_
for larger numbers.  You can see why below.  It has the disadvantage that
it is quite inaccurate for small numbers, but surprisingly accurate for 
large numbers.

lsqrt3() - combination
- ------
If you really need accurary within unity, then you can plug the result
of the bit-shift method, lsqrt1(), use it as the initial guess for
the Newton's method and converge the answer down to its limiting
accurary.  This combination is really really fast, since Newton's
method converges within an iteration or two when the initial guess is
very close to the actual answer.  This has been coded as lsqrt3()
below.

For comparision, the straight Newton's method takes longer
because its initial guess is simply the original answer, n.
But let me empasize that even with straight Newton's method is O(log_2(n)),
Sqrt(10^9) takes only 19 iterations, as you can verify.  That's only 19
integer divisions.  How fast do you want your square root to be?

I have even more ideas on how to speed things up, but I think the code
below should get you going.  I'm sorry it's not coded in Macintosh,
I'm on the Unix box right now, and my mac's at home.  It hasn't
been tested extensively on the boundary points (ie. 0,1, 2^32-1), 
but you can do that.

My final comment concerns the O(sqrt(n)) algorithm that was posted
a short while ago.  Although it is simple and cute, I would not recommend
it when you have so many O(log_2(n)) routines to choose from.

Regards,
Brian Park

- - cut for code below ---

#include <stdio.h>

#define BITSIZE (sizeof(unsigned long)*8)

unsigned long lsqrtng();
unsigned long lsqrt1();
unsigned long lsqrt2();
unsigned long lsqrt3();

/***
Usage: 
	lsqrt n

Example:

lsqrt 100000
l = 1000000
bit shift: 976
iter = 13
newton: 1000
iter = 2
bit shift + newton: 1000
***/
main(int argc, char **argv)
{
	unsigned long l;

	l = atol(argv[1]);
	printf("l = %lu\n", l);

	printf("bit shift: %lu\n", lsqrt1(l));
	printf("newton: %lu\n", lsqrt2(l));
	printf("bit shift + newton: %lu\n", lsqrt3(l));
}

/***
Take square root by performing bit-shifts, getting the log_2(a),
dividing by 2, and shifting back.
***/
unsigned long lsqrt1(unsigned long a)
{
	register unsigned int i;
	register unsigned long l = a;

	if (a==0) return 0;

	for (i=0; l<(unsigned long)(1L<<(BITSIZE-1)) && i<BITSIZE; ++i)
		l<<=1;

	i = (32-i)>>1;
	a >>= i;
	return a;
}

/***
Take square root by using Newton's method with an initial guess of itself.
***/
unsigned long lsqrt2(unsigned long a)
{
	return lsqrtng(a, a);
}

/***
Newton's method with guess.
Newton's method is second order convergence, hence it should be very fast.
***/
unsigned long lsqrtng(register unsigned long a, unsigned long guess)
{
	register unsigned long b, b0;
	int i = 0;

	if (a == 0) return 0;

	b = guess;
	b0 = a/b;
	while (labs(b-b0)>1) {
		/* printf("guess = %lu\n", b);*/
		++i;
		b0 = b;
		b=(a/b+b)>>1;
	}
	printf("iter = %d\n", i);
	return b;
}

/***
Combination Newton's method and bit-shifts.
***/
unsigned long lsqrt3(unsigned long a)
{
	return lsqrtng(a, lsqrt1(a));
}

+++++++++++++++++++++++++++

>From sparent@mv.us.adobe.com (Sean Parent)
Date: Wed, 4 May 1994 20:46:30 GMT
Organization: Adobe Systems Incorporated

In article <9405032056.AA35937@geweke.ppp.msu.edu>,
gewekean@studentg.msu.edu (Andrew Geweke) wrote:

> Actually, I once found this algorithm:
> 
> int intSqrt (int num) {
>    for (i = j = 1, k = 3; num > j; j += (k+=2), ++i)
>           ;
> 
>    return i;
> }

This is a difference algorithim and similar algorithms can be used to solve
most any polynomial expression. This is the method used by Charles Babage
(sp?) in the late 1800's in his difference engine. For example: y = x^3 can
be calculated over a range with a given step in x of n by:

y = y` + a

Where a = (x + n)^3 - x^3. Or a = 3nx^2 + 3xn^2 + n^3.

The x^2 term (i) can be similary reduce:
i = i` + b
Where b = (x + n)^2 - x^2. Or b = 2xn + n^2.

Finaly the x term is reduced as:
x = x` + n

This can be coded up as a loop consisting of only additions. As to the
original question. To calculate a square root of an integer on a Mac, I
would code it as:

inline long LongSqrt(long x)
{
  return FracSqrt(x) >> 15;
}

FracSqrt is in FixMath.h. This works because squaring a fixed point number
doubles the number of fractional bits (so a number of type Fract, which is
2d30 multiplied by a Fract would yield a 4d60 number). Taking the square
root of a number then halves the number of bits (to the original
precision). So a strait square root of a fract would yield a 1d15 value.
Apple's FracSqrt return a Fract so in effect it is calculating sqrt(x) <<
15 [with the bits shifted in actually being calculated]. So if you want a
number of type Fixed as a result for the sqrt of an integer you could use.

inline Fixed FixedSqrtLong(long x)
{
	 return FracSqrt(x) << 1;  // 0/2 + 15 + 1 = 16
}

Or a the sqrt of a Fixed yielding a fixed is:

inline Fixed FixedSqrt(Fixed x)
{
  return FracSqrt(x) >> 7; // 16/2 + 15 - 7 = 16
}

This is a difficult thing to explain, I don't no of any good terminology
for talking about fixed point numbers. Play with it some and it will make
sense.

-- 
Sean Parent

+++++++++++++++++++++++++++

>From jimc@tau-ceti.isc-br.com (Jim Cathey)
Date: Wed, 4 May 1994 18:04:33 GMT
Organization: Olivetti North America, Spokane, WA

In article <busfield.767985832@hurricane> busfield@hurricane.seas.ucla.edu (John D. Busfield) writes:
>    Does anyone have an algorithm or code for computing square roots.
>The emphasis is on speed rather than accuracy in that I only need the
>result rounded to the nearest integer (ie sqrt(1000) = 31 or 32).

Here's two:  A Newton's Method, and a cool bit-banger that's a little
faster under certain circumstances.  (Timings were all on a 68000
as these are kinda old.  Re-test to see what works best now.)

*************************************************************************
*                                                                       *
*       Integer Square Root (32 to 16 bit).                             *
*                                                                       *
*       (Newton-Raphson method).                                        *
*                                                                       *
*       Call with:                                                      *
*               D0.L = Unsigned number.                                 *
*                                                                       *
*       Returns:                                                        *
*               D0.L = SQRT(D0.L)                                       *
*                                                                       *
*       Notes:  Result fits in D0.W, but is valid in longword.          *
*               Takes from 338 cycles (1 shift, 1 division) to          *
*               1580 cycles (16 shifts, 4 divisions)  (including rts).  *
*               Averages 854 cycles measured over first 65535 roots.    *
*               Averages 992 cycles measured over first 500000 roots.   *
*                                                                       *
*************************************************************************

        .globl lsqrt
*                       Cycles
lsqrt   movem.l d1-d2,-(sp) (24)
        move.l d0,d1    (4)     ; Set up for guessing algorithm.
        beq.s return    (10/8)  ; Don't process zero.
        moveq #1,d2     (4)

guess   cmp.l d2,d1     (6)     ; Get a guess that is guaranteed to be
        bls.s newton    (10/8)  ; too high, but not by much, by dividing the
        add.l d2,d2     (8)     ; argument by two and multiplying a 1 by 2
        lsr.l #1,d1     (10)    ; until the power of two passes the modified
        bra.s guess     (10)    ; argument, then average these two numbers.

newton  add.l d1,d2     (8)     ; Average the two guesses.
        lsr.l #1,d2     (10)
        move.l d0,d1    (4)     ; Generate the next approximation(s)
        divu d2,d1      (140)   ; via the Newton-Raphson method.
        bvs.s done      (10/8)  ; Handle out-of-range input (cheats!)
        cmp.w d1,d2     (4)     ; Have we converged?
        bls.s done      (10/8)
        swap d1         (4)     ; No, kill the remainder so the
        clr.w d1        (4)     ; next average comes out right.
        swap d1         (4)
        bra.s newton    (10)

done    clr.w d0        (4)     ; Return a word answer in a longword.
        swap d0         (4)
        move.w d2,d0    (4)
return  movem.l (sp)+,d1-d2  (28)
        rts             (16)

        end
*************************************************************************
*                                                                       *
*       Integer Square Root (32 to 16 bit).                             *
*                                                                       *
*       (Exact method, not approximate).                                *
*                                                                       *
*       Call with:                                                      *
*               D0.L = Unsigned number.                                 *
*                                                                       *
*       Returns:                                                        *
*               D0.L = SQRT(D0.L)                                       *
*                                                                       *
*       Notes:  Result fits in D0.W, but is valid in longword.          *
*               Takes from 122 to 1272 cycles (including rts).          *
*               Averages 610 cycles measured over first 65535 roots.    *
*               Averages 1104 cycles measured over first 500000 roots.  *
*                                                                       *
*************************************************************************

        .globl lsqrt
*                       Cycles
lsqrt   tst.l d0        (4)     ; Skip doing zero.
        beq.s done      (10/8)
        cmp.l #$10000,d0 (14)   ; If is a longword, use the long routine.
        bhs.s glsqrt    (10/8)
        cmp.w #625,d0   (8)     ; Would the short word routine be quicker?
        bhi.s gsqrt     (10/8)  ; No, use general purpose word routine.
*                               ; Otherwise fall into special routine.
*
*  For speed, we use three exit points.
*  This is cheesy, but this is a speed-optimized subroutine!

        page
*************************************************************************
*                                                                       *
*       Faster Integer Square Root (16 to 8 bit).  For small arguments. *
*                                                                       *
*       (Exact method, not approximate).                                *
*                                                                       *
*       Call with:                                                      *
*               D0.W = Unsigned number.                                 *
*                                                                       *
*       Returns:                                                        *
*               D0.W = SQRT(D0.W)                                       *
*                                                                       *
*       Notes:  Result fits in D0.B, but is valid in word.              *
*               Takes from 72 (d0=1) to 504 (d0=625) cycles             *
*               (including rts).                                        *
*                                                                       *
*       Algorithm supplied by Motorola.                                 *
*                                                                       *
*************************************************************************

* Use the theorem that a perfect square is the sum of the first
* sqrt(arg) number of odd integers.

*                       Cycles
        move.w d1,-(sp) (8)
        move.w #-1,d1   (8)
qsqrt1  addq.w #2,d1    (4)
        sub.w d1,d0     (4)
        bpl qsqrt1      (10/8)
        asr.w #1,d1     (8)
        move.w d1,d0    (4)
        move.w (sp)+,d1 (12)
done    rts             (16)

        page
*************************************************************************
*                                                                       *
*       Integer Square Root (16 to 8 bit).                              *
*                                                                       *
*       (Exact method, not approximate).                                *
*                                                                       *
*       Call with:                                                      *
*               D0.W = Unsigned number.                                 *
*                                                                       *
*       Returns:                                                        *
*               D0.L = SQRT(D0.W)                                       *
*                                                                       *
*       Uses:   D1-D4 as temporaries --                                 *
*               D1 = Error term;                                        *
*               D2 = Running estimate;                                  *
*               D3 = High bracket;                                      *
*               D4 = Loop counter.                                      *
*                                                                       *
*       Notes:  Result fits in D0.B, but is valid in word.              *
*                                                                       *
*               Takes from 544 to 624 cycles (including rts).           *
*                                                                       *
*               Instruction times for branch-type instructions          *
*               listed as (X/Y) are for (taken/not taken).              *
*                                                                       *
*************************************************************************

*                       Cycles
gsqrt   movem.l d1-d4,-(sp) (40)
        move.w #7,d4    (8)     ; Loop count (bits-1 of result).
        clr.w d1        (4)     ; Error term in D1.
        clr.w d2        (4)
sqrt1   add.w d0,d0     (4)     ; Get 2 leading bits a time and add
        addx.w d1,d1    (4)     ; into Error term for interpolation.
        add.w d0,d0     (4)     ; (Classical method, easy in binary).
        addx.w d1,d1    (4)
        add.w d2,d2     (4)     ; Running estimate * 2.
        move.w d2,d3    (4)
        add.w d3,d3     (4)
        cmp.w d3,d1     (4)
        bls.s sqrt2     (10/8)  ; New Error term > 2* Running estimate?
        addq.w #1,d2    (4)     ; Yes, we want a '1' bit then.
        addq.w #1,d3    (4)     ; Fix up new Error term.
        sub.w d3,d1     (4)
sqrt2   dbra d4,sqrt1   (10/14) ; Do all 8 bit-pairs.
        move.w d2,d0    (4)
        movem.l (sp)+,d1-d4 (44)
        rts             (16)

        page
*************************************************************************
*                                                                       *
*       Integer Square Root (32 to 16 bit).                             *
*                                                                       *
*       (Exact method, not approximate).                                *
*                                                                       *
*       Call with:                                                      *
*               D0.L = Unsigned number.                                 *
*                                                                       *
*       Returns:                                                        *
*               D0.L = SQRT(D0.L)                                       *
*                                                                       *
*       Uses:   D1-D4 as temporaries --                                 *
*               D1 = Error term;                                        *
*               D2 = Running estimate;                                  *
*               D3 = High bracket;                                      *
*               D4 = Loop counter.                                      *
*                                                                       *
*       Notes:  Result fits in D0.W, but is valid in longword.          *
*                                                                       *
*               Takes from 1080 to 1236 cycles (including rts.)         *
*                                                                       *
*               Two of the 16 passes are unrolled from the loop so that *
*               quicker instructions may be used where there is no      *
*               danger of overflow (in the early passes).               *
*                                                                       *
*               Instruction times for branch-type instructions          *
*               listed as (X/Y) are for (taken/not taken).              *
*                                                                       *
*************************************************************************

*                       Cycles
glsqrt  movem.l d1-d4,-(sp) (40)
        moveq #13,d4    (4)     ; Loop count (bits-1 of result).
        moveq #0,d1     (4)     ; Error term in D1.
        moveq #0,d2     (4)
lsqrt1  add.l d0,d0     (8)     ; Get 2 leading bits a time and add
        addx.w d1,d1    (4)     ; into Error term for interpolation.
        add.l d0,d0     (8)     ; (Classical method, easy in binary).
        addx.w d1,d1    (4)
        add.w d2,d2     (4)     ; Running estimate * 2.
        move.w d2,d3    (4)
        add.w d3,d3     (4)
        cmp.w d3,d1     (4)
        bls.s lsqrt2    (10/8)  ; New Error term > 2* Running estimate?
        addq.w #1,d2    (4)     ; Yes, we want a '1' bit then.
        addq.w #1,d3    (4)     ; Fix up new Error term.
        sub.w d3,d1     (4)
lsqrt2  dbra d4,lsqrt1  (10/14) ; Do first 14 bit-pairs.

        add.l d0,d0     (8)     ; Do 15-th bit-pair.
        addx.w d1,d1    (4)
        add.l d0,d0     (8)
        addx.l d1,d1    (8)
        add.w d2,d2     (4)
        move.l d2,d3    (4)
        add.w d3,d3     (4)
        cmp.l d3,d1     (6)
        bls.s lsqrt3    (10/8)
        addq.w #1,d2    (4)
        addq.w #1,d3    (4)
        sub.l d3,d1     (8)

lsqrt3  add.l d0,d0     (8)     ; Do 16-th bit-pair.
        addx.l d1,d1    (8)
        add.l d0,d0     (8)
        addx.l d1,d1    (8)
        add.w d2,d2     (4)
        move.l d2,d3    (4)
        add.l d3,d3     (8)
        cmp.l d3,d1     (6)
        bls.s lsqrt4    (10/8)
        addq.w #1,d2    (4)
lsqrt4  move.w d2,d0    (4)
        movem.l (sp)+,d1-d4 (44)
        rts             (16)

        end
-- 
+----------------+
! II      CCCCCC !  Jim Cathey
! II  SSSSCC     !  ISC-Bunker Ramo
! II      CC     !  TAF-C8;  Spokane, WA  99220
! IISSSS  CC     !  UUCP: uunet!isc-br!jimc (jimc@isc-br.isc-br.com)
! II      CCCCCC !  (509) 927-5757

+++++++++++++++++++++++++++

>From lasley@umdsp.umd.edu (Scott E. Lasley)
Date: Wed,  4 May 1994 20:25:02 -0400
Organization: Space Physics Group, University of Maryland

In article <rmah-030594174155@rmah.dialup.access.net>, rmah@panix.com (Robert 
S. Mah) writes:
> There is an article on generating very fast square root aproximations 
> in the book "Graphics Gems", Ed. Glassner. 
>  
> It generates a lookup table which is indexed by munging the exponent 
> of the argument and then uses the mantissa to do an (linear?) 
> aproximation to the final result.  It's fairly accurate to within a 
> few decimal points.  I think the source code is also available 
> somewhere on the net, but I'm afraid I don't know where. 

some (perhaps all) of the code in the Graphics Gems series of books can be
found at ftp.princeton.edu.  here is a URL

ftp://ftp.princeton.edu//pub/Graphics/GraphicsGems

there are directories for books I through IV. the ReadMe file in the 
GemsIII directory contains the following line:

    sqrt.c		II.1	Steve Hill, "IEEE Fast Square Root"

hope this helps,


lasley@umdsp.umd.edu
"... and there were bumper stickers saying "Free Love"
 like it was some kind of political prisoner..."  David Wilcox


+++++++++++++++++++++++++++

>From nagle@netcom.com (John Nagle)
Date: Thu, 5 May 1994 02:23:34 GMT
Organization: NETCOM On-line Communication Services (408 241-9760 guest)

busfield@hurricane.seas.ucla.edu (John D. Busfield) writes:
>    Does anyone have an algorithm or code for computing square roots.
>The emphasis is on speed rather than accuracy in that I only need the
>result rounded to the nearest integer (ie sqrt(1000) = 31 or 32).
>Thanks in advance.

     The classic for floating point is to check the exponent and if odd,
shift the mantissa right one bit.  Then halve the exponent.  Use the
high 4 bits of the mantissa to select a starting mantissa from a table of
16 values.  Then execute Newton's Method for two or three iterations,
written out in line.  On a machine with an FPU, this beats most iterative
methods.  Try it on a PPC.  This may actually be faster than an 
iterative integer method.

					John Nagle

+++++++++++++++++++++++++++

>From Ron_Hunsinger@bmug.org (Ron Hunsinger)
Date: Thu,  5 May 94 05:35:48 PST
Organization: Berkeley Macintosh Users Group

parkb@bigbang.Stanford.EDU (Brian Park) writes:

>lsqrt2() - traditional Newton's method
>--------
>The traditional Newton's method of getting y = sqrt(n), coded as lsqrt2()
>below)
>
>n = y^2
>y(0) = n
>y(i+1) = (y(i) + n/y(i))/2
>
>is second-order convergent, 

You can do square root much faster than this.  Square root is about as
complex, and takes about as much time, as a divide.  A square root is
like a division in which the quotient and divisor are equal.
The usual division algorithm knows the divisor from the beginning, and
discovers the quotient a bit at a time.  You just have to tweak the
algorithm a little so that it discovers the divisor and quotient in
parallel, a bit at a time.

Think about how you would calculate square root by hand.  Then try
to do it in binary.  Notice how easy it becomes.

>My point is that Newton's method is O(log_2(n)) and you are not going
>to do much better than that.  

Newton's method is O(log n) ITERATIONS, but you do a divide in each 
iteration.  If division takes O(log n) shift-subtracts, then you have
a run time of O((log n)^2) shift-subtracts, compared to only O(log n)
shift-subtracts using a modified divide algorithm.

[Not that it really matters.  You aren't writing an arbitrary-precision
square root routine.  You have a fixed upper bound on the size of the
input, and ANY operation with only a finite number of valid inputs
can be implemented in O(1) time.]

[And if you have a very fast divide, as fast as a shift-subtract, as
might happen if you use a hardware divide instruction on a RISC computer,
then both methods might have nearly the same speed.  But if shifting
is faster than division, Newton's method is going to lose.]

-Ron Hunsinger

+++++++++++++++++++++++++++

>From afcjlloyd@aol.com (AFC JLloyd)
Date: 5 May 1994 12:37:02 -0400
Organization: America Online, Inc. (1-800-827-6364)

In article <2q91dp$rf@nntp2.Stanford.EDU>, parkb@bigbang.Stanford.EDU (Brian
Park) writes:

>>>
But let me empasize that even with straight Newton's method is O(log_2(n)),
Sqrt(10^9) takes only 19 iterations, as you can verify.  That's only 19
integer divisions.  How fast do you want your square root to be?
<<<

If you simply provide a better starting estimate you can reduce your 19
iterations
down to a maximum of 5.  A very cheap way to provide a better estimate is to
find the most significant bit and use it to estimate log2(N) and thus sqrt(N).
I don't know about you, but when I want fast code I'll take a factor of four
anyway I can get. 

However, if you're really out for speed, it turns out that Newton's
method is not the best.  Yesterday,  
pottier@corvette.ens.fr (Francois Pottier) posted an algorithm
by (Dr.) John Bruner,   bruner@csrd.uiuc.edu:

>>>>
unsigned int isqrt(unsigned long x)
{
 unsigned long r, nr, m;
 
 r = 0;
 m = 0x40000000;
 do {
  nr = r + m;
  if (nr <= x) {
   x -= nr;
   r = nr + m;
  }
  r >>= 1;
  m >>= 2;
 } while (m != 0);
 
 return(r);
}
<<<<

(I've removed two lines of code that round the result up if
rounding to nearest integer is desired).
This is an incredibly great algorithm, which uses exactly
16 iterations but no integer multiply or divides.

However,  if you really want the fastest possible code, 
you should rewrite it like this:

unsigned short isqrt(register unsigned long x)
{
 register unsigned long r;
 register unsigned long nr;
 register unsigned long m;
 
 r = 0;
 m = 0x40000000;
  nr = m;       if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
  nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
  nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
  nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
  nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
  nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
  nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
  nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
  nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
  nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
  nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
  nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
  nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
  nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
  nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2;
  nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1;

  return(r);
}

This code is 40% faster (on my 840av) than code that I have been 
using that used 4 iterations of Newton-Raphson after seeding 
along the lines I mentioned above.

It may be possible to speed this up even further by using the
fast estimate of Log2(N) to reduce the number of iterations
from 16 to the absolute minimum (ceil(Log2(N)/2), though
I expect the improvement to be minimal since each iteration of
this algorithm is so cheap.

Jim Lloyd
afcjlloyd@aol.com


+++++++++++++++++++++++++++

>From trussler@panix.com (Tom Russler)
Date: Fri, 06 May 1994 00:38:31 -0400
Organization: PANIX Public Access Internet and UNIX (NYC)

 busfield@hurricane.seas.ucla.edu (John D. Busfield) writes:
 >    Does anyone have an algorithm or code for computing square roots.
 >The emphasis is on speed rather than accuracy in that I only need the
 >result rounded to the nearest integer (ie sqrt(1000) = 31 or 32).
 >Thanks in advance.

In 68000 assembly in THINK C:

  	   	move.l 	number, d0
  		   move.l	 #1, d1
  		   move.l 	d0, d2
  	@1	 cmp.l	  d1, d2
  		   ble	   	@2
  		   lsl.l	  #1, d1
  		   lsr.l	  #1, d2
  	   	bra		   @1
  	@2	 add.l  	d2, d1
  		   lsr.l   #1, d1
  		   divu	   d1, d0
  		   add.w	  d1, d0
  		   add.w	  #1, d0
  		   lsr.w	  #1, d0

-- 
Tom Russler
trussler@panix.com

+++++++++++++++++++++++++++

>From zstern@adobe.com (Zalman Stern)
Date: Fri, 6 May 1994 04:35:37 GMT
Organization: Adobe Systems Incorporated

Ron_Hunsinger@bmug.org (Ron Hunsinger) writes
> [And if you have a very fast divide, as fast as a shift-subtract, as
> might happen if you use a hardware divide instruction on a RISC computer,

This is not even close to true for any RISC implementation I know of. If the  
divisor is loop-invariant, one can multiply by the reciprocal which may be a  
single cycle operation, however this is not the case in the example being  
discussed. Most RISC architecture now include a floating-point square root  
instruction. This is partially due to this operations similarity to division  
(i.e. can use the same hardware) and partially due to its common appearance  
in electrical CAD codes :-)

I believe the integer square root routine I use is based on the algorithm  
Ron referred to. It has the following comment at the top (thanks to Mark  
Hamburg):

/*
 *      A square-root routine.  See Dijkstra, "A Discipline of Programming",
 *      page 65.
 */

I threw in an optimization to use the PowerPC count-leading-zeros  
instruction to efficiently reduce the number of iterations. (There is a  
similar instruction on the 68K.) The result is competitive with a lookup  
table based method for computing sqrt(x^2 + y^2) that uses a divide. (I'd  
post the code, but I didn't write it and I'm sure Adobe would frown on my  
doing so. Besides, reading Dijkstra is good for you :-))
--
Zalman Stern		   zalman@adobe.com		    (415) 962 3824
Adobe Systems, 1585 Charleston Rd., POB 7900, Mountain View, CA 94039-7900
		  There is no lust like the present.

+++++++++++++++++++++++++++

>From hall_j@sat.mot.com (Joseph Hall)
Date: Thu, 5 May 1994 21:55:49 GMT
Organization: Motorola Inc., Satellite Communications

Seems it was nagle@netcom.com (John Nagle) who said:
>busfield@hurricane.seas.ucla.edu (John D. Busfield) writes:
>>    Does anyone have an algorithm or code for computing square roots.
>>The emphasis is on speed rather than accuracy in that I only need the
>>result rounded to the nearest integer (ie sqrt(1000) = 31 or 32).
>>Thanks in advance.
>
>     The classic for floating point is to check the exponent and if odd,
>shift the mantissa right one bit.  Then halve the exponent. [...]

You can consult Plaugher's "The Standard C Library" for a portable
floating point sqrt.  For a positive real number, the algorithm is:

  1) separate exponent and mantissa -- basically bit-masking except
       in cases of gradual underflow
  2) compute a quadratic least-squares fit to sqrt for the mantissa
       (y = (-.1984742*x+.8804894)*x + .3176687)
  3) follow with 3 iterations of Newton's method (y = .5*(y+x/y))
  4) if exponent is odd, a) multiply mantissa by sqrt(2) and
                         b) subtract 1 from exponent
  5) divide exponent by 2
  6) reassemble exponent and mantissa

Right-shifting the mantissa for odd exponents before computing
the root, to eliminate the multiplication by sqrt(2), is harder to 
express portably, but is certainly possible.  Plaugher doesn't make
any particular efforts to replace things like ".5 * x" with shifts.

Handy little book.  ISBN 0-13-131509-9.

(begin digression)

Plaugher uses conventional floating-point techniques to compute
transcendental functions.  This is pretty efficient, and these days
may be as good as any technique.  However, there are also algorithms
that allow you to compute trigonometric functions, as well as exponent
and log, by using only shifts and adds/subtracts--as if you were doing a
divide.  In fact, the computational effort is just twice that of
a divide.  I have seen relatively few references to these techniques,
but they are easy to implement.

For example, to compute an exponent, you need a table of ln(101(2)),
ln(11(2)), ln(1(2)), ln(1.1(2)), ln(1.01(2)), ln(1.001(2)), etc.   
Then you can code a fixed-point version like this (please excuse typos 
since I can't paste this directly in here at the moment):

  unsigned long logTbl[] = 
      { /* binary ln((2^i + 65536)/65536), for i == 32..1), e.g. */
        0xa65b1, 0x9b441, 0x902d3 ... 
        /* ln 2147549184/65536, ln 1073807360/65536, ln 536936448/65536 */ };

  unsigned long fexp(unsigned long x)
  {
    long i = 1L, j = 0L, k = 0x10000L, l = 0L, *lt = logTbl - 1;
    while (i <= 32) {
      j = l + lt[i];
      if (j > x) {
        i++;
      } else {
        l = j;
        k += k >> i;
      }
    }
    return k;
  }

This computes the exponent iteratively in k.  The log of the approximation
is maintained in l.  If l + the current entry in the table is <=
than the value you are computing the exponent of, then you add the
log in the table to l, and multiply k by a number that is conveniently
in the form 1 + some power of 2.  Repeat until the table is exhausted.

These days this may not be any faster than using the coprocessor.  However,
this technique was occasionally used to write very high-performance libraries 
for 8- and early 16-bit processors.  I doubt it was ever used in any
commercial BASIC, though, because it is not well known, and BASIC would
have been a hell of a lot faster for numerical work if it had been.  I first
heard this technique described in the mid 70s, but the literature goes
back to at least the 60s.

I wonder if this technique has ever been used to implement transcendental 
functions in microcode.  At one time I was thinking of trying to get it 
working on a TI bit slice CPU, that is, build my own coprocessor.  :-)

-- 
Joseph Nathan Hall | Joseph's Law of Interface Design: Never give your users
Software Architect | a choice between the easy way and the right way.
Gorca Systems Inc. |                 joseph@joebloe.maple-shade.nj.us (home)
(on assignment)    | (602) 732-2549 (work)  Joseph_Hall-SC052C@email.mot.com

+++++++++++++++++++++++++++

>From Ron_Hunsinger@bmug.org (Ron Hunsinger)
Date: Sat,  7 May 94 13:41:39 PST
Organization: Berkeley Macintosh Users Group

parkb@bigbang.Stanford.EDU (Brian Park) writes:

>But let me empasize that even with straight Newton's method is O(log_2(n)),
>Sqrt(10^9) takes only 19 iterations, as you can verify.  That's only 19
>integer divisions.  How fast do you want your square root to be?

I want it to be as fast as a single integer division.  Why take 19 times
as long as I have to?


zstern@adobe.com (Zalman Stern) writes:

>Ron_Hunsinger@bmug.org (Ron Hunsinger) writes
>> [And if you have a very fast divide, as fast as a shift-subtract, as
>> might happen if you use a hardware divide instruction on a RISC computer,
>
>This is not even close to true for any RISC implementation I know of. 

Glad to hear it.  I only included the comment because I did not have
instruction timings handy, and I was afraid someone was going to come 
along and say I needn't worry about avoiding divides because on the 
PowerPC (or some other RISC computer) EVERYTHING, including divide,
took only a single cycle, and so divide was just as fast as shift.

I knew shift couldn't be slower than divide, and if it really is faster
(which is reasonable), then finding a square root algorithm that does
not do any divides is a worthwhile objective.

>I believe the integer square root routine I use is based on the algorithm  
>Ron referred to. It has the following comment at the top (thanks to Mark  
>Hamburg):
>
>/*
> *      A square-root routine.  See Dijkstra, "A Discipline of Programming",
> *      page 65.
> */

A distinct possibility, since I read all the Dijkstra books I could get
my hands on, way back when.  However, I knew that square root could be
done as quickly as division at least as far back as high school, before
I had ever heard of Dijkstra (or seen a computer for that matter).  As 
I said, the method suggests itself as soon as you assail to calculate 
square root using the traditional pencil and paper method in binary.

>I threw in an optimization to use the PowerPC count-leading-zeros  
>instruction to efficiently reduce the number of iterations. (There is a  
>similar instruction on the 68K.) 

There is?  Is this a 68040 instruction?  I can't find any such instruction
at least through the 68030.  I thought the 68040 was just a 68030 with
a bigger cache and a built-in FPU.  Sure would be nice, though.  The
Unisys A-series computers have a similar instruction, accessible through
the built-in FIRSTONE function in Burroughs Extended Algol, which I
found many interesting (and not always obvious) uses for.

>The result is competitive with a lookup  
>table based method for computing sqrt(x^2 + y^2) that uses a divide. (I'd  
>post the code, but I didn't write it and I'm sure Adobe would frown on my  
>doing so. Besides, reading Dijkstra is good for you :-))

No problem.  I'll post mine, which I wrote from scratch before checking
your Dijkstra reference. My routine and his, neither of which uses divides
(except by 2 or 4), bear only a vague similarity.  (But reading Dijkstra 
is still good for you.)  The whole thing is as fast as a single divide 
done by subroutine (as, for example, would be required to divide 32-bit 
integers on a 68000).

Normally, a division subroutine will keep the divisor fixed and shift
the dividend past it, doing subtractions (and incrementing the quotient)
as needed.  My first thought was to do the same thing, but that's messy
to do in C.  (In assembler, you would use LSL, ROXL to shift bits out
of one register into another; but that doesn't map well to C.)  So 
instead, I keep the dividend fixed, and shift the divisor right.  Just as
in the pencil and paper square root method, you take the digits (bits) of
the argument two at a time.  But since the quotient (a.k.a. the square
root) is accumulating bits one at a time in the opposite direction, it
ends up only shifting one bit at a time.  

Without further comment, my version (which is meant to be fast, not 
necessarily readable) follows:

#define BITS 32
// BITS is the smallest even integer such that 2^BITS > ULONG_MAX.
// If you port this routine to a computer where 2^(BITS-1) > ULONG_MAX
// (Unisys A-series, for example, where integers would generally be
// stored in the 39-bit mantissa of a 48-bit word), unwind the first 
// iteration of the loop, and special-case references to 'half' in that 
// first iteration so as to avoid overflow.

unsigned long lsqrt (unsigned long n) {
    // Compute the integer square root of the integer argument n
    // Method is to divide n by x computing the quotient x and remainder r
    // Notice that the divisor x is changing as the quotient x changes

    // Instead of shifting the dividend/remainder left, we shift the
    // quotient/divisor right.  The binary point starts at the extreme
    // left, and shifts two bits at a time to the extreme right.

    // The residue contains n-x^2.  (Within these comments, the ^ operator
    // signifies exponentiation rather than exclusive or.  Also, the /
    // operator returns fractions, rather than truncating, so 1/4 means
    // one fourth, not zero.)

    // Since (x + 1/2)^2 == x^2 + x + 1/4,
    //   n - (x + 1/2)^2 == (n - x^2) - (x + 1/4)
    // Thus, we can increase x by 1/2 if we decrease (n-x^2) by (x+1/4)

    unsigned long residue;   // n - x^2
    unsigned long root;      // x + 1/4
    unsigned long half;      // 1/2

    residue = n;             // n - (x = 0)^2, with suitable alignment
    root = 1 << (BITS - 2);  // x + 1/4, shifted left BITS places
    half = root + root;      // 1/2, shifted left BITS places
    
    do {
        if (root <= residue) {   // Whenever we can,
            residue -= root;         // decrease (n-x^2) by (x+1/4)
            root += half; }          // increase x by 1/2
        half >>= 2;          // Shift binary point 2 places right
        root -= half;        // x{+1/2}+1/4 - 1/8 == x{+1/2}+1/8
        root >>= 1;          // 2x{+1}+1/4, shifted right 2 places
        } while (half);      // When 1/2 == 0, binary point is at far right
    
    // Omit the following line to truncate instead of rounding
    if (root < residue) ++root;
    
    return root;   // Guaranteed to be correctly rounded (or truncated)
    }

Enjoy,
-Ron Hunsinger

+++++++++++++++++++++++++++

>From Ron_Hunsinger@bmug.org (Ron Hunsinger)
Date: Sun,  8 May 94 04:45:40 PST
Organization: Berkeley Macintosh Users Group

trussler@panix.com (Tom Russler) writes:
>
> busfield@hurricane.seas.ucla.edu (John D. Busfield) writes:
> >    Does anyone have an algorithm or code for computing square roots.
> >The emphasis is on speed rather than accuracy in that I only need the
> >result rounded to the nearest integer (ie sqrt(1000) = 31 or 32).
> >Thanks in advance.
>
>In 68000 assembly in THINK C:
>
>     move.l number, d0
>     move.l #1, d1
>     move.l d0, d2
>  @1 cmp.l  d1, d2
>     ble   @2
>     lsl.l  #1, d1
>     lsr.l  #1, d2
>     bra   @1
>  @2 add.l  d2, d1
>     lsr.l   #1, d1
>     divu   d1, d0
>     add.w  d1, d0
>     add.w  #1, d0
>     lsr.w  #1, d0

Sorry.  Valiant effort, but no cigar.  You don't handle boundary 
conditions correctly.  Examine what happens when 'number' is 0xFFFFaaaa, 
where aaaa is any arbitrary bit pattern.

>     move.l number, d0     ; d0 = 0xFFFFaaaa
>     move.l #1, d1         ; d1 = 0x00000001
>     move.l d0, d2         ; d2 = 0xFFFFaaaa

>  @1 cmp.l  d1, d2
>     ble   @2
>     lsl.l  #1, d1
>     lsr.l  #1, d2
>     bra   @1

       ; after 16 iterations, you break out of the above loop, with:
                               ; d1 = 0x00010000 = 65536
                               ; d2 = 0x0000FFFF = 65535

>  @2 add.l  d2, d1         ; d1 = 0x0001FFFF
>     lsr.l   #1, d1         ; d1 = 0x0000FFFF = 65535
>     divu   d1, d0   <-----OVERFLOW:  d0/d1 >= 65536, to big for a short

In fact, you should have seen that overflow was a problem.  DIVU will
always overflow, no matter what the divisor, if the dividend is greater
than 65535*65535 = 0xFFFE0001.

You may have been counting on the fact that, if overflow is detected,
the operands are unaffected, but that doesn't help.  The contents of 
d0.w after the divide is aaaa, which was a completely arbitrary value.
Continuing:

>     add.w  d1, d0   <-----OVERFLOW AGAIN:  Since d1=65535, the
>     add.w  #1, d0         ; effect of these two instruction is to
                               ; overflow at some point, but otherwise
                               ; do nothing.  d0 still contains aaaa

>     lsr.w  #1, d0         ; divide the garbage by 2.

The final result is in the range 0..32767.  The correct result should
be 65535 (if aaaa is 0000) or 65536 (for any other value of aaaa).
Again, you should have seen this coming.  Your final instruction can
never return a value >32767, meaning that you cannot possibly return
the correct result for any number > 32767*32768 = 0x3FFF8000.


You also do not achieve the stated accuracy goal, "I [only] need the
result rounded to the nearest integer", even when you don't overflow.

Suppose 'number' = 0x10004000 = 268451840.  Your code computes a value
of 16794.  The correctly rounded value is 16384.  Off by 410 is not "the
nearest integer".

-Ron "who gets very picky sometimes :)" Hunsinger

+++++++++++++++++++++++++++

>From fixer@faxcsl.dcrt.nih.gov (Chris Gonna' Find Ray Charles Tate)
Date: Mon, 9 May 1994 14:04:10 GMT
Organization: DCRT, NIH, Bethesda, MD

In article <0014A722.fc@bmug.org>, Ron_Hunsinger@bmug.org (Ron Hunsinger) writes:
>
>zstern@adobe.com (Zalman Stern) writes:
>>
>>I threw in an optimization to use the PowerPC count-leading-zeros  
>>instruction to efficiently reduce the number of iterations. (There is a  
>>similar instruction on the 68K.) 
>
>There is?  Is this a 68040 instruction?  I can't find any such instruction
>at least through the 68030.  I thought the 68040 was just a 68030 with
>a bigger cache and a built-in FPU.

I think you're looking for the BFFFO instruction, which my 680x0 family
reference describes as "Find First One in Bit Field," and says it's
a 60820 instruction (and later, of course).

The book I'm looking in is Motorola's "M68000 Family Programmer's
Reference Manual," which doesn't seem to have any Library of Congress
registration (!), but which is Motorola part number M68000PM/AD.

- -------------------------------------------------------------
Christopher Tate             |  "So he dropped the heart - 
MSD, Inc.                    |     the floor's clean."
fixer@faxcsl.dcrt.nih.gov    |                 - Sidney Harris

+++++++++++++++++++++++++++

>From christer@cs.umu.se (Christer Ericson)
Date: Tue, 10 May 1994 10:24:38 GMT
Organization: Dept. of Computing Science, Umea Univ., 901 87 Umea, Sweden

In <00148DC5.fc@bmug.org> Ron_Hunsinger@bmug.org (Ron Hunsinger) writes:
>
>parkb@bigbang.Stanford.EDU (Brian Park) writes:
>>[...suggests using Newton's method...]
>
>You can do square root much faster than this.

I beg to differ. In article <CpA23t.DwM@cs.umu.se> I posted an 68000-
based implementation of Newton's method (with quirks) which is something
like 20-30% faster than the equivalent hand-coded optimized assembly
version of the routine you posted in another article. It all depends on
how good your initial guess for Newton's method is.

So there! :-)


Christer Ericson --- Internet: christer@cs.umu.se --- tel: +46-90-166794
Department of Computer Science, University of Umea, S-90187 UMEA, SWEDEN

+++++++++++++++++++++++++++

>From afcjlloyd@aol.com (AFC JLloyd)
Date: 10 May 1994 18:01:03 -0400
Organization: America Online, Inc. (1-800-827-6364)

In article <CpL0x4.HKJ@cs.umu.se>, christer@cs.umu.se (Christer Ericson)
writes:

>In <00148DC5.fc@bmug.org> Ron_Hunsinger@bmug.org (Ron Hunsinger) writes:
>>
>>parkb@bigbang.Stanford.EDU (Brian Park) writes:
>>>[...suggests using Newton's method...]
>>
>>You can do square root much faster than this.
>
>I beg to differ. In article <CpA23t.DwM@cs.umu.se> I posted an 68000-
>based implementation of Newton's method (with quirks) which is something
>like 20-30% faster than the equivalent hand-coded optimized assembly
>version of the routine you posted in another article. It all depends on
>how good your initial guess for Newton's method is.

I agree with Christer.  Until this thread I had been using Newton's method
with 5 iterations for obtaining the square root of unsigned longs, where
the initial estimate came from finding the most signficant bit, using
that to estimate Log2(n), and using that to estimate sqrt(n).

When I saw this thread I tried the code attributed to Dr. John Bruner
(similar to Ron's code) and saw a 40% speedup over my Newton's method
code.  Convinced that this was a precious find,  I then tried to apply 
the same technique to taking the square root of Fixed point values and 
discovered that it couldn't beat my Newton's method code.  The reason 
was primarily that 24 (instead of 16) iterations are required for 
fixed point square root (the Fract case would require 31 iterations).

This caused me to do more thinking about Newton's method.  
The great thing about Newton's method is that it _doubles_ the 
precision with each iteration.  Of course, double of nothing
is nothing, so Newton's method is worthless if you give it a poor
estimate, but given a good estimate Newton's method is fantastic.

So, I decided to use a table lookup to provide the initial estimate for
Newton's method.  I found that I could reduce the number of iterations
down from five to _one_ for the integer case, and just _two_ iterations
for both the Fixed and Fract cases.  The resulting code is significantly
faster than any of my previous attempts.

Jim Lloyd
afcjlloyd@aol.com


+++++++++++++++++++++++++++

>From Ron_Hunsinger@bmug.org (Ron Hunsinger)
Date: Sat, 14 May 94 03:01:56 PST
Organization: Berkeley Macintosh Users Group

christer@cs.umu.se (Christer Ericson) writes:

>In <00148DC5.fc@bmug.org> Ron_Hunsinger@bmug.org (Ron Hunsinger) writes:
>>
>>parkb@bigbang.Stanford.EDU (Brian Park) writes:
>>>[...suggests using Newton's method...]
>>
>>You can do square root much faster than this.
>
>I beg to differ. In article <CpA23t.DwM@cs.umu.se> I posted an 68000-
>based implementation of Newton's method (with quirks) which is something
>like 20-30% faster than the equivalent hand-coded optimized assembly
>version of the routine you posted in another article. It all depends on
>how good your initial guess for Newton's method is.

You've got me mixed up with somebody else.  I did say you can do square
root much faster, and explained how, but I didn't post any assembly
version.  I posted the C version (which therefore has the advantage
that it can be compiled native for PowerPC, or ported easily to another
machine that doesn't even have to be binary, although of course it is
unlikely that shifting will be fast on a non-binary machine).

I saw two assembly versions.  One, which was quite long, very tightly
optimized, and attributed to Motorola, was indeed faster than my short 
simple C version, but not by all that much.  Also, it always truncated,
where mine would produce the correctly rounded result (or the truncated
result by commenting out one line if that's really what you wanted).

The other, very short assembler version was hopelessly flawed.  It
computed an initial guess that could be off by a factor of two, and
then did *ONE* iteration of Newton's method, and stopped there, producing
a value that was not even close (in absolute terms), even when it did
not lose all significance due to overflow.  If asked to take the square 
root of zero, it would divide by zero.  It probably was fast (when it
didn't crash), but who cares how quickly you get the wrong answer?

Which version was yours?

I'll be charitable and assume it was the Motorola version.  I no longer
have the original message, but I seem to recall the max running time
was something like 1262 cycles on a 68000, and (quite a bit) faster
on small values.  I modified my C version to special-case small values
too, and unwind the iterations of the main loop up to the iteration that
finds the high bit of the root.  The main benefit of special-casing small 
values is so I don't have to worry about that boundary in the remaining 
code -- I am then guaranteed that the (unrounded) root is at least two 
bits long.  And although I am not displeased by the improved efficiency, 
my main reason for unwinding the first iterations was to remove the 
restriction in the prior version that it would only work on machines in 
which the number of bits in an unsigned long was even.  This code will 
now work even on machines with odd integer sizes.  (Yes, Virginia, such 
machines do exist!  An example is the Unisys A-series computers, in which 
ALL arithmetic is done in floating point, and integers are just the special 
case where the exponent is zero and the integer value is stored in the 
39-bit mantissa of a 48-bit word.)  

My improved C version appears at the end of this message.

I compiled it with MPW C with no special optimizations, tested it
thoroughly, and then disassembled the code and calculated the running
time on a 68000.  (On later machines, the presence of the cache makes
accurate timings much more difficult to do by hand.)  I did not make
any attempt to "improve" the C-generated code, although there are some 
easy and obvious optimizations available.  The running time for my
version (the one that correctly rounds), counting the RTS and the
(quite unnecessary) LINK/MOVEM.L/.../MOVE.L <rslt>,D0/MOVEM.L/UNLK is:

   If N <= 12 then 204+4N, else 620+46u+32z+6r, where
        N = input argument
        u = Number of one-bits in the unrounded root
        z = Number of non-leading zero-bits in the unrounded root
        r = 1 if rounding increases the root, else 0

   The actual running time for various values of N is:

                 N   Time  Comments
         ---------  -----  -------------------------
                 0    204  Best time
                13    718  Best time not using lookup table
               625    822  This is a cut-point for the Motorola code
          0..65535    935  Average* time over inputs that fit in 16 bits
             65535    994  Worst time if N fits in 16 bits
     0..0xFFFFFFFF   1247  Average* time over all 32-bit inputs
        0xFFFFFFFF   1362  Worst case
       *average times computed assuming uniform distrubution of outputs

If my memory serves me right, this is only about 8% slower than the
hand-crafted Motorola code, which I think is a cheap price to pay for
the portability.  Especially considering that most of that difference 
is due to the fact that they split the general loop into a main loop 
that could get away with word-size operations for some variables, and 
a few unwound iterations that had to use long-sized operands.  Notice
that this is counterproductive on 68020 and up, where long operations 
are just as fast as word operations when the operands are in registers,
and the only effect of unwinding the last two iterations is to 
guarantee that they won't be in the cache.  

The compiled 68K code fits in 104 bytes (decimal). (96 bytes without
the debugger symbol. This really was a vanilla compile.)  How big was 
your code?  How big is the code cache on your machine?  How much of
your caller's code do you push out of the cache?


>So there! :-)

Same to you!!  :-)


#define lsqrt_max4pow (1UL << 30)
// lsqrt_max4pow is the (machine-specific) largest power of 4 that can
// be represented in an unsigned long.

unsigned long lsqrt (unsigned long n) {
    // Compute the integer square root of the integer argument n
    // Method is to divide n by x computing the quotient x and remainder r
    // Notice that the divisor x is changing as the quotient x changes

    // Instead of shifting the dividend/remainder left, we shift the
    // quotient/divisor right.  The binary point starts at the extreme
    // left, and shifts two bits at a time to the extreme right.

    // The residue contains n-x^2.  (Within these comments, the ^ operator
    // signifies exponentiation rather than exclusive or.  Also, the /
    // operator returns fractions, rather than truncating, so 1/4 means
    // one fourth, not zero.)

    // Since (x + 1/2)^2 == x^2 + x + 1/4,
    //   n - (x + 1/2)^2 == (n - x^2) - (x + 1/4)
    // Thus, we can increase x by 1/2 if we decrease (n-x^2) by (x+1/4)

    unsigned long residue;      // n - x^2
    unsigned long root;         // x + 1/4
    unsigned long half;         // 1/2

    residue = n;                // n - (x = 0)^2, with suitable alignment
    
    // if the correct answer fits in two bits, pull it out of a magic hat
#ifndef lsqrt_truncate
    if (residue <= 12)
        return (0x03FFEA94 >> (residue *= 2)) & 3;
#else
    if (residue <= 15)
        return (0xFFFEAA54 >> (residue *= 2)) & 3;
#endif
    root = lsqrt_max4pow;       // x + 1/4, shifted all the way left
//  half = root + root;         // 1/2, shifted likewise
    
    // Unwind iterations corresponding to leading zero bits 
    while (root > residue) root >>= 2;
    
    // Unwind the iteration corresponding to the first one bit
    // Operations have been rearranged and combined for efficiency
    // Initialization of half is folded into this iteration
    residue -= root;            // Decrease (n-x^2) by (0+1/4)
    half = root >> 2;           // 1/4, with binary point shifted right 2
    root += half;               // x=1.  (root is now (x=1)+1/4.)
    half += half;               // 1/2, properly aligned

    // Normal loop (there is at least one iteration remaining)
    do {
        if (root <= residue) {  // Whenever we can,
            residue -= root;        // decrease (n-x^2) by (x+1/4)
            root += half; }         // increase x by 1/2
        half >>= 2;             // Shift binary point 2 places right
        root -= half;           // x{+1/2}+1/4 - 1/8 == x{+1/2}+1/8
        root >>= 1;             // 2x{+1}+1/4, shifted right 2 places
        } while (half);         // When 1/2 == 0, bin. point is at far right

#ifndef lsqrt_truncate
    if (root < residue) ++root;  // round up if (x+1/2)^2 < n
#endif
    
    return root;        // Guaranteed to be correctly rounded (or truncated)
    }

-Ron Hunsinger

+++++++++++++++++++++++++++

>From christer@cs.umu.se (Christer Ericson)
Date: Mon, 16 May 1994 08:44:47 GMT
Organization: Dept. of Computing Science, Umea Univ., 901 87 Umea, Sweden

In <0014E819.fc@bmug.org> Ron_Hunsinger@bmug.org (Ron Hunsinger) writes:
>
>christer@cs.umu.se (Christer Ericson) writes:
>>[...]
>>I beg to differ. In article <CpA23t.DwM@cs.umu.se> I posted an 68000-
>>based implementation of Newton's method (with quirks) which is something
>>like 20-30% faster than the equivalent hand-coded optimized assembly
>>version of the routine you posted in another article. It all depends on
>>how good your initial guess for Newton's method is.
>
>You've got me mixed up with somebody else.  I did say you can do square
>root much faster, and explained how, but I didn't post any assembly
>version.  I posted the C version (which therefore has the advantage
>that it can be compiled native for PowerPC, or ported easily to another
>machine that doesn't even have to be binary, although of course it is
>unlikely that shifting will be fast on a non-binary machine).

Gosh! What if you read what I wrote above. I didn't claim your code was
written in assembly. I refer rather clearly to highly optimized assembly
code _equivalent_ to the C code you posted.


>I saw two assembly versions.  One, which was quite long, very tightly
>optimized, and attributed to Motorola, was indeed faster than my short 
>simple C version, but not by all that much.  Also, it always truncated,
>where mine would produce the correctly rounded result (or the truncated
>result by commenting out one line if that's really what you wanted).
>
>The other, very short assembler version was hopelessly flawed.  It
>computed an initial guess that could be off by a factor of two, and
>then did *ONE* iteration of Newton's method, and stopped there, producing
>a value that was not even close (in absolute terms), even when it did
>not lose all significance due to overflow.  If asked to take the square 
>root of zero, it would divide by zero.  It probably was fast (when it
>didn't crash), but who cares how quickly you get the wrong answer?
>
>Which version was yours?

Neither of these. Instead of arguing about some program you thought I
wrote, wouldn't it have been better (as well as making you look better)
if you had looked my article up, or in the case it had expired at your
site, requested a copy from me before making yourself silly in public?

Instead of commenting on your rather irrelevant comments on some code
that wasn't mine, I've appended below a copy of my code and if you
care to compare it to your code, I'll take the time to respond to
any comments you might have (size, speed, what have you). Fair?

Finally, the point I wanted to make was that an optimized version of
Newton's method very well beats an optimized version of the type of
algorithm your program is a version of. I'm not interested in any
"...b-b-b-but C is more portable..." arguments.


- - cut here ---
unsigned short ce_quick_sqrt(n)
register unsigned long n;
{
	asm {
;-------------------------
;
; Routine       : ISQRT; integer square root
; I/O parameters: d0.w = sqrt(d0.l)
; Registers used: d0-d2/a0
;
; This is a highly optimized integer square root routine, based
; on the iterative Newton/Babylonian method
;
;	r(n + 1) = (r(n) + A / R(n)) / 2
;
; Verified over the full interval [0x0L,0xFFFFFFFFL]
;
; Some minor compromises have been made to make it perform well
; on the 68000 as well as 68020/030/040. This routine outperforms
; the best of all other algorithms I've seen (except for a table
; lookup).
;
; Copyright (c) Christer Ericson, 1993. All rights reserved.
;
; Christer Ericson, Dept. of Computer Science, University of Umea,
; S-90187 UMEA, SWEDEN. Internet: christer@cs.umu.se
;

;-------------------------
;
; Macintosh preamble since THINK C passes first register param
; in d7. Remove this section on any other machine
;
	move.l	d7,d0
;-------------------------
;
; Actual integer square root routine starts here
;
	move.l	d0,a0		; save original input value for the iteration
	beq.s	@exit		; unfortunately we must special case zero
	moveq	#2,d1		; init square root guess
;-------------------------
;
; Speed up the process of finding a power of two approximation
; to the actual square root by shifting an appropriate amount
; when the input value is large enough
;
; If input values are word only, this section can be removed
;
	move.l	d0,d2
	swap	d2		; do [and.l 0xFFFF0000,d2] this way to...
	tst.w	d2		; go faster on 68000 and to avoid having to...
	beq.s	@skip8		; reload d2 for the next test below
	move.w	#0x200,d1	; faster than lsl.w #8,d1 (68000)
	lsr.l	#8,d0
@skip8	and.w	#0xFE00,d2	; this value and shift by 5 are magic
	beq.s	@skip4
	lsl.w	#5,d1
	lsr.l	#5,d0
@skip4

;-------------------------
;
; This finds the power of two approximation to the actual square root
; by doing the integer equivalent to sqrt(x) = 2 ^ (log2(x) / 2). This
; is done by shifting the input value down while shifting the root
; approximation value up until they meet in the middle. Better precision
; (in the step described below) is gained by starting the approximation
; at 2 instead of 1 (this means that the approximation will be a power
; of two too large so remember to shift it down).
;
; Finally (and here's the clever thing) since, by previously shifting the
; input value down, it has actually been divided by the root approximation
; value already so the first iteration is "for free". Not bad, eh?
;
@loop	add.l	d1,d1
	lsr.l	#1,d0
	cmp.l	d0,d1
	bcs.s	@loop
@skip	lsr.l	#1,d1		; adjust the approximation
	add.l	d0,d1		; here we just add and shift to...
	lsr.l	#1,d1		; get the first iteration "for free"!
;-------------------------
;
; The iterative root value is guaranteed to be larger than (or equal to)
; the actual square root, except for the initial guess. But since the first
; iteration was done above, the loop test can be simplified below
; (Oh, without the bvs.s the routine will fail on most large numbers, like
; for instance, 0xFFFF0000. Could there be a better way of handling these,
; so the bvs.s can be removed? Nah...)
;
@loop2	move.l	a0,d2		; get original input value
	move.w	d1,d0		; save current guess
	divu.w	d1,d2		; do the Newton method thing
	bvs.s	@exit		; if div overflows, exit with current guess
	add.w	d2,d1
	roxr.w	#1,d1		; roxr ensures shifting back carry overflow
	cmp.w	d0,d1
	bcs.s	@loop2		; exit with result in d0.w
@exit
	}
}

Christer Ericson --- Internet: christer@cs.umu.se --- tel: +46-90-166794
Department of Computer Science, University of Umea, S-90187 UMEA, SWEDEN

+++++++++++++++++++++++++++

>From Ron_Hunsinger@bmug.org (Ron Hunsinger)
Date: Tue, 24 May 94 23:23:06 PST
Organization: Berkeley Macintosh Users Group

christer@cs.umu.se (Christer Ericson) writes:

>In <0014E819.fc@bmug.org> Ron_Hunsinger@bmug.org (Ron Hunsinger) writes:
>>
>>christer@cs.umu.se (Christer Ericson) writes:
>>>[...]
>>>I beg to differ. In article <CpA23t.DwM@cs.umu.se> I posted an 68000-
>>>based implementation of Newton's method (with quirks) which is something
>>>like 20-30% faster than the equivalent hand-coded optimized assembly
>>>version of the routine you posted in another article. It all depends on
>>>how good your initial guess for Newton's method is.
>>
>>You've got me mixed up with somebody else.  I did say you can do square
>>root much faster, and explained how, but I didn't post any assembly
>>version.  I posted the C version (which therefore has the advantage
>>that it can be compiled native for PowerPC, or ported easily to another
>>machine that doesn't even have to be binary, although of course it is
>>unlikely that shifting will be fast on a non-binary machine).
>
>Gosh! What if you read what I wrote above. I didn't claim your code was
>written in assembly. I refer rather clearly to highly optimized assembly
>code _equivalent_ to the C code you posted.

I did read what you wrote.  You have to admit that it isn't all that
clear.  One could, and I did, read it that you thought I had posted an
"equivalent hand-coded optimized assembly".

But as long as we're on the subject of carefully reading each other's
notes, lets look again at what I said.  I said that Newton/Raphson was
not the fastest ALGORITHM for taking square roots, because it uses
several division steps, and there is another algorithm (which I described)
which is as fast as a single division.  I also made it clear that in
making that comparison, I was comparing only similar implementations
of the two operations: my sqare root method implemented entirely in
software is as fast and about the same size as division implemented
entirely in software; my square root method implemented entirely in
hardware is as fast and uses about the same number of logic gates as
division implemented entirely in hardware.

Comparing my wholly-software IMPLEMENTATION to your implementation using
hardware divide is unfair and irrelevant.  I stand by my claim.  Square
root is no harder or slower than division.  And there is an algorithm
for square root that will beat Newton/Raphson hands down whenever both
algorithms are given similar implementations.

To back up my claim, consider that ANY software-only implementation of 
Newton/Raphson would have to, as a minimum, test for a zero input (to 
avoid dividing by zero), and do at least one division.  That is, it would 
have to be at least as complex as the following routine to compute a 
sort-of reciprocal:

    unsigned long lrecip (unsigned long n) {
        return (n == 0) ? 0 : 0xFFFFFFFF / n;
        }

On a 68000, the division has to be done in software, because the 68000
does not have a DIVU.L instruction.  Examining the code generated by
this routine shows that its worst-case running time on a 68000 is 1312 
clocks.  My square-root routine takes at most 1362 clocks.  Close enough?


>Finally, the point I wanted to make was that an optimized version of
>Newton's method very well beats an optimized version of the type of
>algorithm your program is a version of. I'm not interested in any
>"...b-b-b-but C is more portable..." arguments.

But I DO care that C is more portable, and it is definitely relevant to
the comparison.  Our versions were not optimized equivalently, because
you are focusing only on optimization to a particular piece of silicon
that happens to implement division, but not square root, in hardware.
Take our implementations unchanged, and run them on a PowerPC, and mine
will leave yours in the dust.  ("Unchanged" means I get to compile mine
native - that's what I bought when I paid the price of being portable.
"Unchanged" means you have to have yours emulated - that's what you
bought when you decided to write processor-dependent code.)  Or take
our implementations unchanged and run them on an Intel processor.  Watch
mine still run at acceptable speed.  Watch yours not run at all.

I also care that mine meets the design specs.  The original requestor
said he wanted to compute square root "rounded to the nearest integer".
My method rounds, and rounds correctly in every case.  Yours, and all
the other methods I've seen, truncate.  I can truncate too, if that's
what you want, but fixing yours to round takes a little work.  Essentially,
rounding involves comparing the unrounded root with the remainder from
the last division.  If you implement Newton/Raphson in C or Pascal, that
remainder is not available to you, and you have to do at least another
multiply and subtract to get it.  If done in 68K Assembler, the remainder
is available, so rounding won't slow you down much, but it will complicate
your logic some, and of course there's that portability issue you don't
want to talk about.  Either way, remember that rounding may produce a 
result of 65536, so the routine has to return an unsigned long, just 
like the requestor asked for, instead of an unsigned short.

>I've appended below a copy of my code and if you
>care to compare it to your code, I'll take the time to respond to
>any comments you might have (size, speed, what have you). Fair?

Well OK, as long as you keep in mind we're comparing apples to
oranges.

Before looking at speed, there is the more important issue of 
correctness.  As nearly as I can tell, you always return the correct 
value (except for not rounding).  I won't quite go so far as to say the 
code is all correct, because you make some early mistakes that get 
compensated for later:

  o You test for a zero input by doing a BEQ.S right after moving the
    input to register A0 for safekeeping.  Are you aware that moving to
    an A register does not affect the condition codes?  Fortunately,
    the condition codes are still set correctly from the prior move
    of the same data from D7 to D0.  However, you have a comment that
    says that move can be eliminated.  If someone does, they might be 
    in trouble.

  o Your logic to do large initial shifts is faulty.  On some inputs (like
    0x02000000) you overshift.  On other inputs (in the range 512..65535) 
    I think you are not shifting as far as you think you are shifting.  
    (The logic around label @skip8 is muddled.)  Fortunately, the Newton 
    method is robust enough to compensate for a bad initial guess, and
    you still arrive at the correct answer, although not as efficiently
    as you could have.

Maybe you don't care about either of these points - after all, you already 
said you didn't care about portability - but has no one ever pointed out
to you that is much easier to write provably correct code in a high level
language?  Does the fact that your code does not work the way you think it 
does suggest anything?

It's hard to estimate your average running time, because there are
so many paths through the optimization and it's difficult to assign
proper weights to each path, so I'll settle for comparing worst cases.

  o Your worst case is with the input 0x02000000, for which you have to
    do 4 divisions.  You take a total of 994 clocks on a 68000, or
    459 clocks on a 68020.

  o My worst case is with the input 0xFFFFFFFF.  I take 1362 clocks on
    a 68000, 609 on a 68020.

1362 / 994 = 1.37.  You are 37% faster than I am on a 68000.
609 / 459 = 1.33.  You are 33% faster than I am on a 68020.

Truth be told, I'm not unhappy that my portable software solution fares
so well against your non-portable hardware solution, even on the hardware
of your choice.  Your speed advantage stems from the fact that you are
coding in Assembler, not from any advantage of the algorithm.  If you 
implemented your algorithm in C, then either your code would be calling a
division subroutine (and your performance would go to hell), or you
could specifically tell the compiler to generate 68020 code to get
inline divide instructions.  But then, besides no longer running on a 
68000, you would still come out behind because the compiler would be
compelled to generate a DIVU.L instruction instead of your DIVU.W, and 
that difference alone is enough to dissipate all your advantage.  Add 
the extra multiply you need to do correct rounding and you're behind.

-Ron Hunsinger

+++++++++++++++++++++++++++

>From christer@cs.umu.se (Christer Ericson)
Date: Thu, 2 Jun 1994 08:17:18 GMT
Organization: Dept. of Computing Science, Umea Univ., 901 87 Umea, Sweden

In <00155867.fc@bmug.org> Ron_Hunsinger@bmug.org (Ron Hunsinger) writes:
>[...]                                  I also made it clear that in
>making that comparison, I was comparing only similar implementations
>of the two operations: my sqare root method implemented entirely in
>software is as fast and about the same size as division implemented
>entirely in software; my square root method implemented entirely in
>hardware is as fast and uses about the same number of logic gates as
>division implemented entirely in hardware.

OK, I can agree with that. But apart from that... well, see below.


>Comparing my wholly-software IMPLEMENTATION to your implementation using
>hardware divide is unfair and irrelevant.  I stand by my claim.  Square
>root is no harder or slower than division.  And there is an algorithm
>for square root that will beat Newton/Raphson hands down whenever both
>algorithms are given similar implementations.

It's neither unfair nor irrelevant. You're living in a dream world.
In the real world (where the rest of us live) there's no such thing
as "similar implementations". The better (which equals to faster in
this case) implementation wins, regardless of whether it uses all
available machine instructions or not.

In the real world do you always cry, "Not fair! Not fair! You used
hardware division and I didn't <stomp of feet>."? Hey, good luck to you!


>But I DO care that C is more portable, and it is definitely relevant to
>the comparison.  Our versions were not optimized equivalently, because
>you are focusing only on optimization to a particular piece of silicon
>that happens to implement division, but not square root, in hardware.
>Take our implementations unchanged, and run them on a PowerPC, and mine
>will leave yours in the dust.  ("Unchanged" means I get to compile mine
>native - that's what I bought when I paid the price of being portable.
>"Unchanged" means you have to have yours emulated - that's what you
>bought when you decided to write processor-dependent code.)  Or take
>our implementations unchanged and run them on an Intel processor.  Watch
>mine still run at acceptable speed.  Watch yours not run at all.

You're being silly! I use my code as posted on a 68K machine, I use another
(general) piece of code on another machine. Ever heard the word #ifdef?


>I also care that mine meets the design specs.  The original requestor
>said he wanted to compute square root "rounded to the nearest integer".
>My method rounds, and rounds correctly in every case.  Yours, and all
>the other methods I've seen, truncate.  I can truncate too, if that's
>what you want, but fixing yours to round takes a little work.  Essentially,
>rounding involves comparing the unrounded root with the remainder from
>the last division.  If you implement Newton/Raphson in C or Pascal, that
>remainder is not available to you, and you have to do at least another
>multiply and subtract to get it.  If done in 68K Assembler, the remainder
>is available, so rounding won't slow you down much, but it will complicate
>your logic some, and of course there's that portability issue you don't
>want to talk about.  Either way, remember that rounding may produce a 
>result of 65536, so the routine has to return an unsigned long, just 
>like the requestor asked for, instead of an unsigned short.

I didn't post my article as a reply to the original post, but as a proof
that a carefully coded N/B-approach very well outperforms (in terms of
execution time) the algebraic algorithm your code is a version of.


>>[I wrote:]
>>I've appended below a copy of my code and if you
>>care to compare it to your code, I'll take the time to respond to
>>any comments you might have (size, speed, what have you). Fair?
>
>Well OK, as long as you keep in mind we're comparing apples to
>oranges.

I think it's interesting to note that when you compared your code to
Jim Cathey's code (which you erroneously thought was mine), then you
thought size was important. Was that only because your code was smaller
than Cathey's? Now that my code is smaller than yours, it seems that
you don't think size is important no longer. Nor did you bother to
compare average times.

And this from someone who cries "fair" all the time...


>Before looking at speed, there is the more important issue of 
>correctness.  As nearly as I can tell, you always return the correct 
>value (except for not rounding).  I won't quite go so far as to say the 
>code is all correct, because you make some early mistakes that get 
>compensated for later:
>
>  o You test for a zero input by doing a BEQ.S right after moving the
>    input to register A0 for safekeeping.  Are you aware that moving to
>    an A register does not affect the condition codes?  Fortunately,
>    the condition codes are still set correctly from the prior move
>    of the same data from D7 to D0.  However, you have a comment that
>    says that move can be eliminated.  If someone does, they might be 
>    in trouble.

Yes this is true. That comment is really a mistake. I added it in haste
when posting the code to comp.sys.m68k some while ago, when there were
a discussion of fast implementations of integer square roots. The comment
is faulty. The code is, and was, correct, however.


>  o Your logic to do large initial shifts is faulty.  On some inputs (like
>    0x02000000) you overshift.  On other inputs (in the range 512..65535) 
>    I think you are not shifting as far as you think you are shifting.  
>    (The logic around label @skip8 is muddled.)  Fortunately, the Newton 
>    method is robust enough to compensate for a bad initial guess, and
>    you still arrive at the correct answer, although not as efficiently
>    as you could have.

Harsh words from someone who doesn't bother to attribute code to the
correct person. There is nothing wrong with the section of code you're
refering to. That code is there to reduce the number of iterations in
the section just below, and to _reduce_it_as_much_as_possible_on_average_
_without_over-complicating_the_code_.

Now, say again, is it correct or is it faulty? If you consider it to
be faulty, I'd very much like to see you post what you consider to be
a correct version!


>Does the fact that your code does not work the way you think it 
>does suggest anything?

While we're being sarcastic, I'd like to ask you if the fact that you
find errors that aren't there suggest anything to you?


>It's hard to estimate your average running time, because there are
>so many paths through the optimization and it's difficult to assign
>proper weights to each path, so I'll settle for comparing worst cases.
>
>  o Your worst case is with the input 0x02000000, for which you have to
>    do 4 divisions.  You take a total of 994 clocks on a 68000, or
>    459 clocks on a 68020.
>
>  o My worst case is with the input 0xFFFFFFFF.  I take 1362 clocks on
>    a 68000, 609 on a 68020.
>
>1362 / 994 = 1.37.  You are 37% faster than I am on a 68000.
>609 / 459 = 1.33.  You are 33% faster than I am on a 68020.

Well, since you didn't bother running the code, I did. On an IIcx I got
the following results (all numbers in ticks):

R = Ron's code, as posted
A = Optimized assembly code equivalent to Ron's code
C = My code, as posted

	[0..65536)	[0..500000)	500000 "random" numbers
R	142		1088		1135
A	118		906		944
C	95		626		473

142 / 95 = 1.49
1088 / 626 = 1.74
1135 / 473 = 2.4

These figures should be considerably larger on a 68000, if we can believe
your figures. (Note that you could squeeze an extra 20% out of your code,
which for what is essentially a library routine is quite large. I would use
my code though.)


>Truth be told, I'm not unhappy that my portable software solution fares
>so well against your non-portable hardware solution, even on the hardware
>of your choice.

You were saying?


Christer Ericson --- Internet: christer@cs.umu.se --- tel: +46-90-166794
Department of Computer Science, University of Umea, S-90187 UMEA, SWEDEN

---------------------------

>From baer@qiclab.scn.rain.com (Ken Baer)
Subject: How to save alpha in PICT files?
Date: Sat, 28 May 1994 00:24:17 GMT
Organization: SCN Research/Qic Laboratories of Tigard, Oregon.

I need to save 32-bit images as PICT files that include the upper 8 bits
of data (the alpha buffer).  My code currently saves as 32-bit PICT files,
but the upper 8-bit of data are missing!  I am using the methods described
in Inside Mac and developer notes.  The sequence I use now is as follows:
 - allocate and write 32-bit data to an offscreen GWorld
 - OpenPicture()
 - CopyBits(offgworld, offgworld) (copy the pixmap to itself)
 - ClosePicture()

Is there an extra writing step to jam in the alpha buffer?  Is there any
sample code (in C) with a more modern method?  The sample code I'm working
from is pretty dated.

Thanks in advance.

	-Ken Baer.
	baer@qiclab.scn.rain.com
-- 
 \_       -Ken Baer.  Programmer/Animator, Hash Inc.
<[_]   Usenet: baer@qiclab.scn.rain.com/AppleLink:KENBAER/Office: (206)750-0042
 =# \, Animation Master: Finally, 3D animation software an artist can afford!

+++++++++++++++++++++++++++

>From zapper2@netcom.com (Chris Athanas)
Date: Sun, 29 May 1994 09:19:37 GMT
Organization: NETCOM On-line Communication Services (408 261-4700 guest)

Ken Baer (baer@qiclab.scn.rain.com) wrote:
: I need to save 32-bit images as PICT files that include the upper 8 bits
: of data (the alpha buffer).  My code currently saves as 32-bit PICT files,
: but the upper 8-bit of data are missing!  I am using the methods described
: in Inside Mac and developer notes.  The sequence I use now is as follows:
:  - allocate and write 32-bit data to an offscreen GWorld
:  - OpenPicture()
:  - CopyBits(offgworld, offgworld) (copy the pixmap to itself)
:  - ClosePicture()

: Is there an extra writing step to jam in the alpha buffer?  Is there any
: sample code (in C) with a more modern method?  The sample code I'm working
: from is pretty dated.

Under system 6.0, you could use the upper 8 bits as an alpha channel.
>From what I understand, that was a bug.

Current PICT files only support 24-bit images. The upper 8bits are 
truncated, and the picture is stored as 24-bits per pixel.

To associate an alpha channel with a PICT, you will need to do some 
custom stuff. I'm not sure what your application is, but if you are doing 
everything yuorself, you could store the normal picture as a 24-bit pict 
file (normal PICT file) and create a PICT resource in that file that 
contains the alpha channel. This way "normal" applications can use your 
data-fork PICT file normally, and you can use your alpha channel in your 
own apps.

Also, if you are planning on using photoshop, photoshop has a buil-in 
filter that allows you to take a PICT resource from a file. So your users 
could still have access to your "custom" alpha channel in photoshop.

Another alternative is to use the photoshop standard as your file format. 
The ps format allows for as many channels as you would like. And 
photoshop is a pretty standard application now, and is widely used.

Hope this helps.

Chris Athanas

+++++++++++++++++++++++++++

>From baer@qiclab.scn.rain.com (Ken Baer)
Date: Mon, 30 May 1994 17:51:58 GMT
Organization: SCN Research/Qic Laboratories of Tigard, Oregon.

In article <zapper2CqK4Kq.Fn2@netcom.com> zapper2@netcom.com (Chris Athanas) writes:
>Ken Baer (baer@qiclab.scn.rain.com) wrote:
>: I need to save 32-bit images as PICT files that include the upper 8 bits
>: of data (the alpha buffer).  My code currently saves as 32-bit PICT files,
>: but the upper 8-bit of data are missing!  I am using the methods described
>: in Inside Mac and developer notes.  The sequence I use now is as follows:
>:  - allocate and write 32-bit data to an offscreen GWorld
>:  - OpenPicture()
>:  - CopyBits(offgworld, offgworld) (copy the pixmap to itself)
>:  - ClosePicture()
>
>: Is there an extra writing step to jam in the alpha buffer?  Is there any
>: sample code (in C) with a more modern method?  The sample code I'm working
>: from is pretty dated.
>
>Under system 6.0, you could use the upper 8 bits as an alpha channel.
>From what I understand, that was a bug.

A damn useful bug.  I guess 32-bit QuickDraw was a typo too :-(.

>Current PICT files only support 24-bit images. The upper 8bits are 
>truncated, and the picture is stored as 24-bits per pixel.

What's the logic behind this?  Or as Seinfeld would say, "Who are the 
programming wizards that came up with this one?!?"

>To associate an alpha channel with a PICT, you will need to do some 
>custom stuff. I'm not sure what your application is, but if you are doing 
>everything yuorself, you could store the normal picture as a 24-bit pict 
>file (normal PICT file) and create a PICT resource in that file that 
>contains the alpha channel. This way "normal" applications can use your 
>data-fork PICT file normally, and you can use your alpha channel in your 
>own apps.

The application in question is a 3D renderer which normally outputs 
QuickTime (which has no problem dealing with real 32-bit data), but will
also put out Targa, PICS, and PICT.  I want PICT files with an alpha
channel that can be used primarily in Photoshop, and in other applications
that support 32-bit PICT files.

>Also, if you are planning on using photoshop, photoshop has a buil-in 
>filter that allows you to take a PICT resource from a file. So your users 
>could still have access to your "custom" alpha channel in photoshop.

Are you saying that Photoshop uses the alpha in a resource method that you
described above?  Is this documented anywhere?

>
>Another alternative is to use the photoshop standard as your file format. 
>The ps format allows for as many channels as you would like. And 
>photoshop is a pretty standard application now, and is widely used.

I'd rather stick with PICT, though it is easily my leasy favorite image
format.  It's really a shame that the Mac standard format is also the
least portable to other systems.  It's a big deal to us since our
application also runs on Windows and SGI.

>
>Hope this helps.

Yes, it does.  My big question is what is everyone else doing for saving
alpha with PICT files?

>
>Chris Athanas


-- 
 \_       -Ken Baer.  Programmer/Animator, Hash Inc.
<[_]   Usenet: baer@qiclab.scn.rain.com/AppleLink:KENBAER/Office: (206)750-0042
 =# \, Animation Master: Finally, 3D animation software an artist can afford!

+++++++++++++++++++++++++++

>From scottsquir@aol.com (ScottSquir)
Date: 30 May 1994 18:54:02 -0400
Organization: America Online, Inc. (1-800-827-6364)

In article <1994May30.175158.28837@qiclab.scn.rain.com>,
baer@qiclab.scn.rain.com (Ken Baer) writes:

> My big question is what is everyone else doing for saving
>alpha with PICT files?

Set the cmpCount in the GWorld pixmap to 4.  It defaults to 3.

If you resize or do other CopyBits manipulation it may get lost.
-scott


+++++++++++++++++++++++++++

>From Darrin.Cardani@AtlantaGA.NCR.COM (Darrin Cardani)
Date: Tue, 31 May 1994 18:17:38 GMT
Organization: UNITY, AT&T GIS

In article <1994May30.175158.28837@qiclab.scn.rain.com> baer@qiclab.scn.rain.com (Ken Baer) writes:

>Are you saying that Photoshop uses the alpha in a resource method that you
>described above?  Is this documented anywhere?

You can get all the docs on Photoshop's format from archive.umich.edu in the
mac/graphics/util directory. (Might be mac/development directory, actually, I'm
not sure).

>>
>>Another alternative is to use the photoshop standard as your file format. 
>>The ps format allows for as many channels as you would like. And 
>>photoshop is a pretty standard application now, and is widely used.

>Yes, it does.  My big question is what is everyone else doing for saving
>alpha with PICT files?

I'm currently saving them as 2 PICT files. I've debated using 1 file with 2
PICT resources in it. I think I'll end up using the Photoshop method, because
it adds usefulness to my program.

Also, someone mentioned you could have as many channels as needed with the
photoshop format, I believe you can only have up to 16, though.

>-- 
> \_       -Ken Baer.  Programmer/Animator, Hash Inc.
><[_]   Usenet: baer@qiclab.scn.rain.com/AppleLink:KENBAER/Office: (206)750-0042
> =# \, Animation Master: Finally, 3D animation software an artist can afford!

Really? What it is and can a demo be downloaded anywhere?

Darrin

Darrin Cardani
Opinions above are mine. Mine! Mine! Mine!
Darrin.Cardani@AtlantaGA.NCR.COM

+++++++++++++++++++++++++++

>From a ()
Date: 2 Jun 1994 01:11:24 GMT
Organization: Apple Computer, Inc., Cupertino, California

In article <1994May30.175158.28837@qiclab.scn.rain.com>,
baer@qiclab.scn.rain.com (Ken Baer) wrote:

> In article <zapper2CqK4Kq.Fn2@netcom.com> zapper2@netcom.com (Chris Athanas) writes:
> >Ken Baer (baer@qiclab.scn.rain.com) wrote:
> >: I need to save 32-bit images as PICT files that include the upper 8 bits
> >: of data (the alpha buffer).  My code currently saves as 32-bit PICT files,
> >: but the upper 8-bit of data are missing!  I am using the methods described
> >: in Inside Mac and developer notes.  The sequence I use now is as follows:
> >:  - allocate and write 32-bit data to an offscreen GWorld
> >:  - OpenPicture()
> >:  - CopyBits(offgworld, offgworld) (copy the pixmap to itself)
> >:  - ClosePicture()
> >
> >: Is there an extra writing step to jam in the alpha buffer?  Is there any
> >: sample code (in C) with a more modern method?  The sample code I'm working
> >: from is pretty dated.
> >
> >Under system 6.0, you could use the upper 8 bits as an alpha channel.
> >From what I understand, that was a bug.
> 
> A damn useful bug.  I guess 32-bit QuickDraw was a typo too :-(.
> 
> >Current PICT files only support 24-bit images. The upper 8bits are 
> >truncated, and the picture is stored as 24-bits per pixel.
> 
> What's the logic behind this?  Or as Seinfeld would say, "Who are the 
> programming wizards that came up with this one?!?"
> 
> >To associate an alpha channel with a PICT, you will need to do some 
> >custom stuff. I'm not sure what your application is, but if you are doing 
> >everything yuorself, you could store the normal picture as a 24-bit pict 
> >file (normal PICT file) and create a PICT resource in that file that 
> >contains the alpha channel. This way "normal" applications can use your 
> >data-fork PICT file normally, and you can use your alpha channel in your 
> >own apps.
> 
> The application in question is a 3D renderer which normally outputs 
> QuickTime (which has no problem dealing with real 32-bit data), but will
> also put out Targa, PICS, and PICT.  I want PICT files with an alpha
> channel that can be used primarily in Photoshop, and in other applications
> that support 32-bit PICT files.
> 
> >Also, if you are planning on using photoshop, photoshop has a buil-in 
> >filter that allows you to take a PICT resource from a file. So your users 
> >could still have access to your "custom" alpha channel in photoshop.
> 
> Are you saying that Photoshop uses the alpha in a resource method that you
> described above?  Is this documented anywhere?
> 
> >
> >Another alternative is to use the photoshop standard as your file format. 
> >The ps format allows for as many channels as you would like. And 
> >photoshop is a pretty standard application now, and is widely used.
> 
> I'd rather stick with PICT, though it is easily my leasy favorite image
> format.  It's really a shame that the Mac standard format is also the
> least portable to other systems.  It's a big deal to us since our
> application also runs on Windows and SGI.
> 
> >
> >Hope this helps.
> 
> Yes, it does.  My big question is what is everyone else doing for saving
> alpha with PICT files?
> 
> >
> >Chris Athanas
> 
> 
> -- 
>  \_       -Ken Baer.  Programmer/Animator, Hash Inc.
> <[_]   Usenet: baer@qiclab.scn.rain.com/AppleLink:KENBAER/Office: (206)750-0042
>  =# \, Animation Master: Finally, 3D animation software an artist can afford!

The technique is described in IM VI pp 17-22 through 17-23; basically the
idea is that before calling CopyBits you need to set the packType field in
the source pixmap to four and the cmpCount to 4 also.

Hope this really helps.

Guillermo

---------------------------

>From wingo@apple.com (Tony Wingo)
Subject: Open Transport & ASLM
Date: Thu, 2 Jun 1994 18:06:30 GMT
Organization: Apple Computer

The following is a response from the Open Transport Team to the recent
threads on OpenTransport, ASLM and DLL's in general.

Comments may be sent to opentpt@applelink.apple.com

=====================================================================

When Open Transport was developed, ASLM was the only general option
for dynamic linking - CFM and SOM were not anywhere in sight. In
fact, today we still can only use CFM for dynamic linking on PowerPC.
It's not available for 68K, and SOM is not available for either
platform.

Whether or not ASLM is used as the dynamic linking mechanism, the
fact that C++ objects generated by different compilers have different
formats makes it virtually impossible to export object-oriented
interfaces that everyone can use.  Yes - SOM will solve this problem,
but that solution does not come without a performance penalty -
parameters have to be marshalled, and solving the fragile base class
problem requires a much more indirect method of dispatching than a
jump through a vtable.  

We really wanted to be able to create an object-oriented framework
for writing STREAMS modules, and ASLM allowed us to do that without a
performance penalty.   Now that CFM and SOM are either here or on the
horizon, we've backed off of that goal.

>From a client perspective, Open Transport supports a complete
procedural interface using Pascal calling conventions for use in the
widest range of development environments possible.  It also has a C++
object interface, which is usable with MPW Cfront, and with the PPCC
Script/tools for PowerPC.  We believe that the C++ interface is also
usable from Symantec C++ for 68K, but this has not been tested. We
have also not tested with the Metrowerks products, but are assuming
that, like Symantec, they have a way to import MPW .o and XCOFF
files.

ASLM is currently used as the dynamically loading mechanism for both
client code and STREAMS modules, but neither version REQUIRES using
C++ interface.  

When SOM becomes available as part of the system, we will look at
converting our C++ client interfaces to become SOM objects.  The
procedural interfaces for STREAMS modules will remain ASLM-based for
68K and CFM-based for PowerPC.

The bottom line is:  If you can't get Open Transport client files to
link with your tools using our "C" interfaces, then either your tools
don't support MPW-generated libraries,  your tools can't deal with
Pascal calling conventions, or the Open Transport team has a bug.

We have not yet published any information on how to write a STREAM
module for Open Transport.  As soon as we do, it will tell you how to
create both ASLM and CFM-based STREAM modules.

I hope this clears up any misunderstandings on using Open Transport.

---------------------------

>From kennedy@fenton.cs.udcavis.edu (Kennedy)
Subject: Sending the Mac Screen Image
Date: Thu, 2 Jun 1994 01:56:33 GMT
Organization: University of California, Davis


    This might be too involved of a question for this news group
But I was wondering if anyone had any Ideas. What I need to do is 
send screen image of a Macintosh so that it can be viewed in all
its glory in an X window. The screen should refresh flicker free
and should look to the viewer of the X window as is he/she was 
looking at the Mac monitor. This has to be done using TCP/IP. 
     If you have any Ideas, example code, algorithms, or whether 
you think it can be done or not, let me know. Thanx.

Brian Kennedy (kennedy@fenton.cs.ucdavis.edu)


+++++++++++++++++++++++++++

>From rmah@panix.com (Robert S. Mah)
Date: Thu, 02 Jun 1994 01:22:22 -0500
Organization: One Step Beyond

kennedy@fenton.cs.udcavis.edu (Kennedy) wrote:

>     This might be too involved of a question for this news group
> But I was wondering if anyone had any Ideas. What I need to do is 
> send screen image of a Macintosh so that it can be viewed in all
> its glory in an X window. The screen should refresh flicker free
> and should look to the viewer of the X window as is he/she was 
> looking at the Mac monitor. This has to be done using TCP/IP. 
>      If you have any Ideas, example code, algorithms, or whether 
> you think it can be done or not, let me know. Thanx.

Since you said "in all its glory" I take it you mean color.

Let's see...640x480 minimum screen size means 300K per frame.  At 24 fps,
that's around 7MBytes per second.  Too much.

You'll have to send screen differences (i.e. just what changed on the 
screen) to get a decent frame rate.  Keep a copy of the screen image on
the Mac, note the changes, send it via TCP/IP.  Receive it on the UNIX 
machine.  Have the X Server send it to the X Client.

You could also patch all of QuickDraw and send the QD command opcodes to
the UNIX box.  There, you'd have to write a QD emulator and execute the
opcodes as they come in.  This would be very hard to write, but would 
work faster than the screen difference method.

Have you considered just buying it.  I think there are a few products 
that do this out there.

Cheers,
Rob
___________________________________________________________________________
Robert S. Mah  -=-  One Step Beyond  -=-  212-947-6507  -=-  rmah@panix.com

+++++++++++++++++++++++++++

>From David K}gedal <davidk@lysator.liu.se>
Date: 2 Jun 1994 11:44:58 GMT
Organization: Lysator Academic Computer Society

In article <rmah-020694012222@rmah.dialup.access.net> Robert S. Mah,
rmah@panix.com writes:
>You'll have to send screen differences (i.e. just what changed on the 
>screen) to get a decent frame rate.  Keep a copy of the screen image on
>the Mac, note the changes, send it via TCP/IP.  Receive it on the UNIX 
>machine.  Have the X Server send it to the X Client.

Oops, this is a very common confusion. It is of course the X client that 
will send it to the X server. The server is the one that controls the 
actual screen and does the drawing that the clients tells it to do.

 /David

+++++++++++++++++++++++++++

>From d88-jwa@mumrik.nada.kth.se (Jon W�tte)
Date: 2 Jun 1994 12:56:48 GMT
Organization: The Royal Institute of Technology

In <rmah-020694012222@rmah.dialup.access.net> rmah@panix.com (Robert S. Mah) writes:

>Let's see...640x480 minimum screen size means 300K per frame.  At 24 fps,
>that's around 7MBytes per second.  Too much.

You only need send what's drawn, and you can compress it while you
send.

There are commercial apps that do EXACTLY this. There are two
mechanisms you can use; patching the QuickDraw bottlenecks, or
patching ShieldCursor().

QuickDraw does all drawing through bottlenecks, so if you want
object graphics, send the bottleneck commands across the network.
You have to send port information as well, I guess.

QuickDraw also calls ShieldCursor() for the enclosing rectangle of
what it draws, and ShowCursor() when it's done, so if you only
want a bitmap, use that area, and send the (packed) bitmap once
drawing is done (ShowCursor() time) or accumulate it all into a
region which you send at WaitNextEvent time.
-- 
 -- Jon W{tte, h+@nada.kth.se, Mac Software Engineer Deluxe --
   This signature is kept shorter than 4 lines in the interests of UseNet
   S/N ratio.

+++++++++++++++++++++++++++

>From rmah@panix.com (Robert S. Mah)
Date: Thu, 02 Jun 1994 11:40:51 -0500
Organization: One Step Beyond

David K}gedal <davidk@lysator.liu.se> wrote:

> Robert S. Mah, rmah@panix.com writes:
> >You'll have to send screen differences (i.e. just what changed on the 
> >screen) to get a decent frame rate.  Keep a copy of the screen image on
> >the Mac, note the changes, send it via TCP/IP.  Receive it on the UNIX 
> >machine.  Have the X Server send it to the X Client.
> 
> Oops, this is a very common confusion. It is of course the X client that 
> will send it to the X server. The server is the one that controls the 
> actual screen and does the drawing that the clients tells it to do.

Right.  I don't do X, so I'm more than a bit hazy about this stuff, but
after you mentioned it, I did recall reading an article about this and
saying to myself, "That's supposed to make sense?"

So, to correct myself...the X-client would ask the X-server to draw the
stuff.  Hey...couldn't you simply put the X-client on the Mac?  Could
be an interesting academic project.

Cheers,
Rob
___________________________________________________________________________
Robert S. Mah  -=-  One Step Beyond  -=-  212-947-6507  -=-  rmah@panix.com

---------------------------

>From jaffe@rabbit.cita.utoronto.ca (Andrew Jaffe)
Subject: What are Universal Headers?
Date: Thu, 19 May 1994 13:49:41 GMT
Organization: CITA


That's about it:
	What are the Universal Headers?

AJ
--
- --------------------------------------------------------------------
Andrew Jaffe					jaffe@cita.utoronto.ca
CITA, U. Toronto				(416) 978-8497, 6879
60 Saint George St.				         -3921 (Fax)
Toronto, ON M5S 1A1 CANADA			

+++++++++++++++++++++++++++

>From dean@genmagic.com (Dean Yu)
Date: 19 May 1994 17:20:31 GMT
Organization: General Magic, Inc.

In article <JAFFE.94May19094941@rabbit.cita.utoronto.ca>,
jaffe@rabbit.cita.utoronto.ca (Andrew Jaffe) wrote:
> 
> That's about it:
> 	What are the Universal Headers?
> 

  Universal Headers are the twisted idea of a couple of system software
engineers who have since left the company to try to avoid the proliferation
of interface files as the Mac API was going cross platform. That is, rather
than having a set of header files you compile with if you were doing say a
PowerPC application, and different set you use if you were doing a 68K
application, you would have one set that can be used for any platform that
supplied the Mac API by flipping a couple of compile time switches.
  Like any grand unifying theory, it was very elusive, long in the coming,
and probably a little less all-encompassing than it should have been.

-- Dean Yu
   Negative Ethnic Role Model
   General Magic, Inc.

+++++++++++++++++++++++++++

>From Lars.Farm@nts.mh.se (Lars Farm)
Date: Fri, 20 May 1994 09:43:17 +0100
Organization: Mid Sweden University

In article <dean-190594101355@dean_yu.genmagic.com>, dean@genmagic.com
(Dean Yu) wrote:

> In article <JAFFE.94May19094941@rabbit.cita.utoronto.ca>,
> jaffe@rabbit.cita.utoronto.ca (Andrew Jaffe) wrote:
> > 
> > That's about it:
> > 	What are the Universal Headers?
> > 
> 
>   Universal Headers are the twisted idea of a couple of system software
[...]
>   Like any grand unifying theory, it was very elusive, long in the coming,
> and probably a little less all-encompassing than it should have been.

Still, it is very nice to have _one_ set of Mac headers, after all there is
only one API. Now it's a lot simpler to move code between compilers. This
can't be bad.

-- 
Lars.Farm@nts.mh.se

+++++++++++++++++++++++++++

>From slosser@apple.com (Eric Slosser)
Date: Sat, 28 May 1994 21:14:40 GMT
Organization: Apple Computer, Inc.

> In article <dean-190594101355@dean_yu.genmagic.com>, dean@genmagic.com
> (Dean Yu) wrote:
> 
> > 
> >   Universal Headers are the twisted idea of a couple of system software
> [...]
> >   Like any grand unifying theory, it was very elusive, long in the coming,
> > and probably a little less all-encompassing than it should have been.

Well, with the kind of people they had working on the project, what did you
expect, Dean?

PS.  Don't forget the really novel idea of having the C, Pascal, and Asm
headers auto-generated from a master file, rather than maintained by hand.

-- 
Eric Slosser / Apple Computer / slosser@apple.com

+++++++++++++++++++++++++++

>From jum@anubis.han.de (Jens-Uwe Mager)
Date: Thu, 2 Jun 94 01:05:11 MET
Organization: (none)


In article <slosser-280594131440@mac96.kip.apple.com> (comp.sys.mac.programmer), slosser@apple.com (Eric Slosser) writes:

> PS.  Don't forget the really novel idea of having the C, Pascal, and Asm
> headers auto-generated from a master file, rather than maintained by hand.


Hmmm, how does your specification language look like?

______________________________________________________________________________
Jens-Uwe Mager			jum@anubis.han.de
30177 Hannover			jum@helios.de
Brahmsstr. 3			Tel.: +49 511 660238

---------------------------

End of C.S.M.P. Digest
**********************