Hints

HINTS FOR EFFICIENT PROGRAMMING Copyright (c) 1992-95 by Omega Point, Inc.
Copyright (c) 1989-92 by Microsystems Software, Inc. ------------------------------------------------------------------------NOTE: The text below is taken from the CodeRunneR (R) support documentation with the permission from Omega Point, Inc. CodeRunneR is a TSR library for C and assembly programmers, thus some hints deal implicitly with size optimization. For more information about CodeRunneR or if you have comments, corrections or suggestions call or write to: Omega Point, Inc. TEL: (508) 877-1819 25 Birch Road FAX: (508) 877-0915 Framingham, MA 01701 BBS: (508) 875-8009 USA ------------------------------------------------------------------------CodeRunneR was designed with the greatest attention to efficiency (size and speed). All functions were carefully crafted in assembly language. Combined with some care in the C code you can produce programs rivaling in size to average assembly language programs, especially on medium and large projects. On very small programs (few hundreds bytes) the assembly language will still be ahead. In terms of speed, your C programs & TSRs, should be highly competitive with the best of assembly code. The C programming hints below have been found useful in reducing the efficiency gap between C and assembly code. 1. INT or CHAR - Current compilers are notoriously bad in handling expressions involving byte size data. The reason for this is an overly literal interpretation of C "abstract machine", which mandates that the result should be evaluated as if extended to integer. On the other hand, the xxx86 processors have a rich subset of instructions operating on byte size items. These instructions make great majority of "integral promotions" unnecessary. Unfortunately, only a small fraction of these are utilized by C compilers, resulting in much more expensive operations on char versus int variables. Thus a saving of one byte for the variable storage is practically always more than paid for in code size. The following rules have been found useful in reducing this type of inefficiency: a) If a char variable is used within expressions, or is being passed to functions, or returned by functions, it is better to use int instead. This is true even in case when variable is only used as TRUE/FALSE flag. If defined as an int, the variable will occupy 1 more byte, but for each use one or more bytes are saved in generated code. In addition the speed is often improved due to an even address alignment of variables, preferred by the 16/32 bit CPU-s. b) In some situation rule (a) cannot be applied, since an array of characters is the source for byte size variables. In that case the characters should be cast as SIGNED whenever possible since the instruction CBW used to extend signed char to int is 1 byte, while the instruction MOV AH,0 used for unsigned is 2 bytes. c) CodeRunner functions which receive char or byte as an argument
will ignore high byte passed. Hence you can safely pass integers without having to mask out high byte. This also allows application of the rule (b) independently of char being below or above 0x80. d) Auto variables of char type will occupy two bytes on stack anyways, hence one should always gain by using int instead. e) If an integer variable is to be compared to a char, or passed to function expecting char, casting integer to char is better than AND-ing it with 0xff. 2. INITIALIZED AUTO variables for strings, arrays, structures or any composite objects are very inefficient. Each such object will occupy exactly twice the size defined. Also, the compiler will on every function entry call an internal transfer function to copy the variable from original place into the auto variable. Besides wasting time, this transfer function is not usable by the C code as a memory copy function. And lastly, the space used as the destination is also internal space (stack space), which will often place much heavier demand on stack size. You can often create a "local copy" of such item in some temporarily free buffer (e.g. a disk i/o buffer when disk i/o is idle). a) If this type of variables is truly needed (rare case), one may use other general purpose memory copy function (like memcpy()) which is usable for other purposes as well. In case of Turbo C this is even more important, since Turbo C uses far call and 32 bit pointers to perform the copying, even in small model. b) In most situations, the intention was only to keep a variable local to the function and not to reinitialize variable each time function is activated. In that case the variable should be turned into static, accomplishing same result in half the size. Hence instead of: func() { char msg[]="This is an initialized local string."; } it is better to use: func() { static char msg[]="This is an initialized local string."; } This is one example where ANSI C standard has violated a useful classic C rule that more efficient constructs are shorter to type. 3. Another case of the same Classic C rule violation is the passing of structures by value to/from functions. Such deceivingly efficient constructs are wasteful on memory and time even when implemented well, and especially in the current C compilers where they were added as a marketing feature to boost the count of ANSI compliance points. Hence instead of:
struct x_type func(struct x_type x) { .... process(x.field); .... return(x); .... } it is more efficient to use: void funct(struct x_type *x) { ... process(x->field); ... return; } In rare circumstances, where a local copy of a structure is needed in order to be modified while preserving the callers copy, it is still better to use a general purpose memory copy, rather than have compiler link in an internal-use-only transfer function. 4. Uninitialized AUTO variables are more efficient than the equivalent static/global variables both in memory (typically 1 byte per access) and speed (2-3 clocks). Also, since such variables are visible only from the function containing it, optimizing compilers can take advantage of this and eliminate or some accesses completely. Additionally, the space used by auto variables is released when function exits. Therefore both simple and composite objects needed temporarily and locally should be defined as auto variables. Similar situation is with arguments a function receives - they are equivalent to auto variables in efficiency, thus they are often better than global variables. 5. Variables which are OFTEN being passed as arguments between functions should be redefined as globals, ar at least static for that module. This rule needs to be counterbalanced with the previous rule - the decision parameter is "often". Namely each access to a global/static costs on average 1 extra byte (compared to access of a function argument), but the cost of passing such variable is on average 5 bytes. 6. If a variable occurs in the code as say: n-1, two or more times, it is better to define another variable, n1=n-1 and use n1 instead. Just the code space saved on the single re-use pays for the extra space of variable n1. The program speed will benefit too. This rule is applicable to char, int, long variables as well as pointers to any objects. More complex expressions (than n-1) will benefit even more. Note also that the assignment (n1=n-1; above) should be combined with the first use of (n-1): For example instead of: y = f(x) + n - 1; z = v * (n-1);
use: y = f(x) + (n1=n-1); z = v * n1; 7. 2-D arrays are not efficiently handled by C compilers. Each access will produce multiply instruction to compute an offset into the array. If a 2-D array is used mostly in a sequential manner (like screens, edit buffers, matrices, etc.) one should define them as 1-D arrays and compute starting offset explicitly, and from then on use ++,-operators (on either pointers or indices). 8. Larger expressions are compiled more efficiently than the equivalent step-by-step sequence of expressions. This is also true for a function invocation within an argument of another function invocation. The reason for this is the freedom compilers have when evaluating expressions - compiler need not worry about side effects occurring during evaluation, their sequencing is in most cases undefined, thus compilers can ignore aliasing. On the other hand, the equivalent sequence of simple expressions has so called "sequence points" after each step, requiring often that the intermediate results be stored in memory. In other words the sequence of small expressions places additional, often useless, constraints on evaluation. The drawbacks to be weighed in a decision are higher susceptibility to errors, lower readability and lower flexibility of complex expressions. 9. Functions which accept string pointers and their lengths as arguments, should have lengths placed before pointers. This is advantageous when the following invocation method is used: char *s; .... str_func( str_len(s), s); .... The reason for this is the order in which arguments are pushed on the stack - in case of C parameter convention this is from right to left. Thus in the example above, the code will fetch the value s and push it twice than call str_len(). Had the argument order been opposite, the value s would be accessed (computed) twice. 10. When segments and offsets are defined within structures, offset should be placed first, followed by the segment. When passing far pointers as a separate words for offset and segment, offset should be first. The target function should receive this pair of words as the far pointer, and will access it using single instruction (LES/LDS). 11. If the total amount of data used by program exceeds 64K, the bulk of this space is normally occupied by arrays (programs with 30,000 simple variables are rare). The largest of these arrays should be moved into far heap (see section on (far) memory allocation). This is always more efficient than switching globally to large-data models. CodeRunner is compiled in small data model, allowing for 64K of data and 64K of code using near access, and up to memory size of far data. The library functions can move data to/from far memory. One can also use far pointers available in Turbo and MS C compilers (mixed models).
12. Accessing array elements via pointers (as opposed to index) is more efficient, especially if arrays have int elements (or larger). This only holds if the pointer is used without additional offset, i.e. as *p instead of *(p+i) or p[i]. 13. If an index is used to access array elements, global arrays will compile more efficiently than arrays defined via starting pointer. Hence if an alternatives are p[i] and a[i] (p is a pointer, a[] is an array), the array is preferable. This should not be confused with the "abstract C machine" which mandates that array references are converted to pointers - the actual code does make distinction. The reason is the same as for the constants being more efficient than variables - in this case a[] is a constant, while p is a variable, both representing some memory address. One can unify this and the preceding rules as: The pointers are more efficient than arrays if they follow the array elements, and less efficient if they are fixed (a variable playing role of a constant). 14. Loops which always execute at least once, are best implemented as do {...} while() loops, as opposed to for() {...} loops or while() {...} loops. The saving is 2-3 bytes. 15. Switch statement is often implemented more efficiently using two arrays - one with words (or preferably bytes) containing case values, and another array containing function pointers for the matching cases. There are three reasons for this: A. Switch statement will often generate code that does exactly same thing: it has the two arrays, both with 16-bit elements plus the code which scans the first one (case values) then dispatches from the second one. You can not only use common search function, but also often reduce the case values to 8 bits. B. Actions done for particular case can be delegated to functions which are reusable in other switch statements or other code. C. For each case there will be a jump code to join the common end point of the switch statement - this is 2-3 byte instruction. With the suggested replacement, the RET instruction is used which is always 1 byte. This method also has a side benefit of allowing variables for case labels. Within the context of CodeRunner, this method is useful for hot-key response. The hot-key service routine receives as an argument the hot-key which has triggered TSR activation. The CR.LIB function word_pos() when applied to hot-key and the array of all hot-keys will return 1-based index of the hot-key in the array. The matching array of function pointers should have first pointer set to an error function (no match case), and the rest in the same order as the hot-key array. The whole decision process could look as follows: word hot_key_list[] = { key1, key2, key3 }; /* Hot keys */ fp dispatch[]={err_func,func1,func2,func3 }; /* Function pointers */ hot_key_service(key) {
..... dispatch[word_pos(key,hot_key_list,3)](); ..... } When a switch statement is coded in this manner, the err_func() should replace the default case of the switch statement. As usually there is a tradeoff involved - the switch statement allows access of auto variables belonging to the same function. If these have to be recoded as globals, or passed to individual case functions the overhead for globals (or function entry frame setup) may offset the benefits. 16. When left to choose register variables, compilers will generally pick the first two integer variables. You may override the default choice by weighing space/time priorities. If the space is important select variables which occur (literally) most frequently within a function. If the speed is important (and you don't use assembler) select variables used most frequently in the inner-most loop within a function. 17. For pop-up applications, compiler should be set to optimize for size, since the bulk of time consuming (low level) screen handling is done by the CodeRunner functions. In some cases the size optimization will actually produce faster code than speed optimization due to compilers disregard of CPU prefetch queue effects. And in almost all cases the speed difference will be beyond human perception. 18. Compiler switch for stack-overflow checking should be disabled for TSR application. Namely, the normal action of the compiler will be to abort the program and exit to DOS. This obviously will not work once program goes resident, since officially the program has already "exited" once. Thus all the checking code is at best useless in this environment (CodeRunneR library substitutes checking and exit function with a neutral function). An alternate approach is offered by a pair of CodeRunner functions _watch_stack() and _unused_stack(). These functions are intended for the test stage of development, when they can help determine minimal, but still safe stack size. They should be removed from the final program. 19. When selecting functions from the library, if there is a tossup choice between the two functions, you should select one already used somewhere else in the code (or at least the one more likely usable elsewhere). When the choice is not a tossup, but requires extra C code to make it use the same function, one should use the best suited library function. In other words, one should avoid variety used for variety sake. 20. The CodeRunner video functions are fast enough to make redisplaying of a line or a window an acceptable alternative to trying to repaint only the changed portions of the screen. This can save a great deal of code space. 21. For the same reason as in previous rule, the nested windows may be implemented without buffers to save an overlayed portion of another window. Calling the same function which drew the overlayed window to start with, will save on the data space. In case of two levels of nesting, a swap_blk() functions may be used to swap the screen block and another buffer.
22. Goto statement is still legal in C language. Your C compiler will never generate more than 3 bytes per goto (most often just 2). In the same time, even a single variable used to control the early exit from nested loops will generate 4 bytes to get set and 4-5 bytes for each test, and then 2-3 bytes to jump anyways. 23. CodeRunneR's data compaction will not dispose strings specified as function arguments. These strings should be defined in the disposable data module, and accessed as an external variable from the disposable code module. For example, instead of: main() { dsp("Signon Screen"); /* This string is not disposable */ } it is more efficient to use: extern char msg[]; /* Defined in disposable data module */ main() { dsp(msg); } In assembler you can just define the string within the same module, but in _IDATA segment to make it disposable. 24. Educational C texts providing lists of DO-s and DON'T-s for portable, modular, readable, structured and object-oriented coding practices are a rich source of DON'T-s and DO-s for efficient programs. This advice should be taken with grain of salt. Namely, when sufficient effort has been spent polishing a program and when it looks that no more improvements are possible using "noble" methods, the point of tradeoffs has been reached. Only at this point, if further size/speed improvements are needed, one should look at the DON'T-s from such lists (or DO-s in your code) and choose the least of evils which will help achieve the desired performance. Violation of the good practices before the tradeoff point is sloppy rather than efficient programming. 25. ELSE statement generates a JMP instruction around the ELSE-block. This can be eliminated as shown in the following example: Instead of: if (x<y) z=10; else z=0; Use more compact form: z=0; if (x<y) z=10;
NOTE: This also applies to ?: operator, which is just a compactly written variation of IF-ELSE statement. 26. ELSE IF statement may often be skipped gaining in space at the expense of the execution speed. This is especially useful when the statements within each conditional block are simple assignment or single integer operator like +,-,^,|,&,++,--. In such case there is almost no speed penalty for the space gained. For example: if ((c>='a')&&(c<='z')) c-=0x20; else if (c<0x20) c=0x20; can be replaced with seemingly redundant, but more compact check: if ((c>='a')&&(c<='z')) c-=0x20; if (c<0x20) c=0x20; This type of savings is especially useful in keystroke processing code which, due to human typing speed, need not be very fast. In addition to greater compactness, the decoupling of conditional statements makes code simpler to shuffle around. 27. EXTRACTION of common expressions into functions can often eliminate lots of repetitive expressions. Simple call to a function with no arguments takes only 3 bytes. Any simple assignment will take at least that much (usually more). Therefore, any pair of assignments which repeats in the code 2 or more times can be made into a function. For example: ... crs_x=X0; crs_y=Y0; ... crs_x=X0; crs_y=Y0; ... /* Put cursor in the corner */ /* Reset cursor to the same corner */
can be done in less space using: ... corner0 (); ... corner0 (); ... corner0 () { crs_x=X0; crs_y=Y0; } This example will save space even if Y0 is not used, and X0 is some global variable. 28. The previous hint is an example of TIME/SPACE tradeoff. One can generalize the simple case above to any piece of code used more than
once within a program. Such space optimizations are often done after the program is already finished, and can be done almost mechanically. Pieces of code which are similar, can be often made identical and then replaced with a function. Variables originally placed arbitrarily can be reorganized into records, allowing for simpler argument passing to common functions. In interactive applications, the response speed faster than the screen refresh time (around 20 ms) is without any visible effect. Since the function replacement method described above adds under 20 microseconds even on a 4.77 Mhz IBM PC, then fifty such replacements will add only one millisecond to the execution time, which is humanly unperceptible difference. On the other hand, few hundred bytes saved are certainly perceptible with ever growing DOS with each new version. The function-blocking method has an additional benefit that the code is easier to read and modify, especially if function names are well chosen. The resulting code will look as if designed top-down, without having to think through entire program up-front. 29. EXAMINE GENERATED CODE by making compiler produce ASSEMBLER output. Most of todays C compilers have a switch to generate ASM file (-S for Turbo C and /Fa for Microsoft C). If you are familiar with assembly language, this exploration will show which constructs need improvements. Typically, you would look for expressions in which compiler extends char or byte to int or word (instructions like CBW or MOV AH,0 or XOR AH,AH). Such expressions can usually be typecast so that no extension instructions are generated. Other things to watch are Jumps-Around-Jumps necessitated by large conditional statement blocks. Reducing the size of these blocks by redistributing some work-load to functions will produce smaller and cleaner code (perhaps slightly slower). 30. Arithmetic decisions. is toggled pointer is expressions can often beneficially replace IF/ELSE Consider case of PING-PONG buffers: A pointer (char*)p between buffers buf1 and buf2. The switching of the usually done as:
if (p==buf1) p=buf2; else p=buf1; More efficient construct is: p = (char*) ((word)p ^ ((word)buf1 ^ (word)buf2)); NOTE: The cast to word is done to allow use of ^ operator which is generally not defined on pointers. In this case it is well defined. In practical use the buf1^buf2 would be a constant computed in disposable code. This XOR trick can be used whenever some (integer) variable x needs to be toggled between two specified values a and b. You simply create variable ab=a^b, and toggle x via x^=ab; . In many cases the control variables could be defined in such way that the plain arithmetic could be applied to produce desired result.
31. Along the lines of previous example, you should check the bit patterns of variables occurring in the program. There are often surprising properties of binary 2-complement arithmetic which can greatly optimize the code size and speed. a) A trivial example is replacement of say (x % 8) with (x & 7), or the same way for any power of two. b) Less obvious is, for example, the task of extracting the bit mask for the right-most bit set in a word (e.g. for 1101000 you want just 0001000). word x,mask; Instead of: mask=1; do if (mask & x) break; /* Test if non-zero bit found */ while ((mask<<=1)!=0); /* Else shift the mask to next bit */ Use: mask = -x & x; c) Similarly if you are checking if a non-zero integer is power of two (i.e. it has exactly 1 bit set) you can use: #define is_power_of_2(x) ( x & (x-1) ) d) Thus to count bits in a word x, instead of checking all bits via shifting a bit mask, you can use the following: for (cnt=0; x; ++cnt) x &= x-1; This will produce bit count (cnt) on average twice as fast as the simple shift and test method since the loop will execute once for each bit set (to 1) in x, instead of once for each bit in the word. Additional speedup comes from only one conditional test & jump (x==0?) instaed of two in the brute force approach. e) One often needs to cycle some subset of a bit pattern through all bit combinations, e.g. cycling integer x through all 16 combinations of the low 4 bits (0000, 0001, 0010, 0011,... , 1111). If bits which are affected make a contiguous section of the word, the job is easy. Say, m is a bitmask for the contiguous bits which need to be cycled, e.g. m=0xF, to cycle the low 4 bits of x. In this case, to advance to the next value of x (within the set all bit patterns in low 4 bits), you can use: #define next(x) (x+1) & m If you start with x=0, then when x=next(x) produces 0, you have completed all bit combinations (16 patterns).
If the contiguous pattern is not packed in the low bits, as above, you can apply the trick (b) above to extract the bitmask for the lowest bit set in x and add that value instead of 1, i.e. step = -m & m; #define next(x) (x+step) & m f) If the bit pattern above isn't contiguous (e.g. m=1101, where only 3 separated bits need to be cycled through all 8 combinations: 0000, 0001, 0100, 0101, 1000, 1001, 1100, 1101), the brute force method would go all through all 16 combinations and skip over those which have bit set in the wrong place, e.g. int next(x,m) { int cm=~m; /* cm is a bitmask for "wrong" bits */ do ++x; /* Advance x while "wrong" bits set */ while(x & cm); return(x); } One can slightly optimize this by using trick (e) above, but that still does lots of work since it eliminates only the lower "don't care" bits. Here is the magic method which has no loop and produces only the new patterns (and all of them, as in the example m=1101 above) within given subset of bits (given via m): #define next(x) (x-m) & m g) Sometimes one wants to cycle through all bit patterns of given length, but in way which changes only a single bit from one pattern to the next one. Such sequences (so called Gray codes) are useful when bit patterns represent some configurations/sets and one needs to evaluate some complex expressions/functions based on which bits are on/off. In many cases the complex/long computation associated with the bit patterns can be greatly speeded up if one can reuse results from the previous pattern and perform incremental update based on a change of a single bit. For example, you need a 3-bit sequence 000, 001, 011, 010, 110, 111. To get the n-th member of such k-bit sequence (where n=0,1,.., N=(1<<k)-1; i.e. 0..7 in the 3-bit example), use: Xn = (n^(n+n)) >> 1; As you increment n from 0 to N, the Xn's will also take on all N+1 values, and any two successive Xn bit patterns will always differ from each other in exactly 1 bit (the bit-mask of which you can obtain by xor-ing the two successive Xn's).
NOTE: The expressions above work for signed and unsigned integral values (char, int, long). 32. Integer arithmetic in C is modular (also in assembler). Although the ANSI C guarantees this only for unsigned types, in practice on most processors (including 8086,68000,...) this is also true for signed integer variables (consequence of the binary 2-complement representation). This can be most often used when a bit-pattern is shifted to the left (also to the right for unsigned) to eliminate additional count variable set to 8, 16 or 32. Thus instead of: for (i=0,mask=1; i<16; i++,mask<<=1) {...} you can often use (assuming you need a mask): mask=1; do { ... } while ((mask<<=1)!=0); This way the mask doubles as a bit mask and as a counter. NOTE: We have also applied above the rule for loops which always execute at least once, therefore do-while form is used. 33. Another consequence of modular arithmetic is that pointer/index overflow around 64K can be checked without resorting to long arithmetic. Thus instead of: char *p; word step; /* Assumed forward step */ if (((long)p+step)<0x10000L) p+=step; You can use: if (p+step>p) p+=step; In case of backward step (check for wrap around 0) you can use: if (p-step<p) p-=step; 35. You can avoid long arithmetic when subtracting from 64K. Thus instead: word used,free; ... free = 0x10000L-(long)used; ...
you can do: free = -used; 34. Loop counters are more efficient when counting down to 0 rather than from 0 up. Hence instead of: i=0; do {....} while (++i<N); it is better to use (whenever possible): i=N; do {....} while (--i); The latter often exploits CPU zero flag when decrementing i, combining the decrement and the boundary test into a single instruction. This also allows the C compilers to utilize LOOP instruction which combines all three steps (decrement, test for boundary and conditionally jump) into a single instruction. 35. Pre-increment/decrement operators are often more efficient than the post-increment/decrement version. This is valid mostly when these operators are combined in the expressions using compare (like ++i<N), and has no effect if the simple ++i; or --i; is done (without compare). The reason for this is that the processor flags are changed by ++ or --, thus the (i++ < N) must generate an extra code (usually a JMP) to prevent ++ from destroying the result of compare. Note also that while(--i) or if(--i) will also benefit from this rule since they contain an implicit compare with 0. 36. Function argument passing has dual overhead - an overhead for pushing each argument on the stack, plus an overhead for clearing arguments from the stack. The latter is 3 bytes for MSC for any number of arguments (2 bytes for 1 argument in recent versions), but for Turbo C it is 1,2 or 3 bytes for 1,2 or 3 and more arguments, respectively. Thus to reduce stack clearing overhead it is useful (especially for TC) to keep number of arguments to 1 or 2. Of course, any reduction helps also in the argument pushing phase. One way to reduce number of arguments is to replace them with globals. As mentioned earlier, this has a drawback of more expensive global (or static) variable access. Another way is to use structures (or arrays) which hold several variables, and pass only the pointer to the structure (array). A third method is useful when several arguments are passed via a chain of nested calls. For example instead of: func1(int a,int b,int c,int d) { .... func2(b,c,d); .... }
func2(int x,int y,int z) {...} you can use: func1(int a,int b,int c,int d) { ... func2(&b); ... } func2(int *p) { /*** x is p[0], y is p[1] and z is p[2] ***/ } Structures can be in the same manner for more complicated argument lists (or for code documentation purposes). NOTE: You should not use this for independent auto variables inside the func1(), such as int e,f,g; since the variables are not necessarily allocated in the order of declaration. On the other hand, the passing of arguments to functions is additionally constrained with the CPU stack discipline for C convention calls, thus it has well defined behavior across MS-DOS C compilers. 37. If there are 2 or more occurences of some function call, check the surrounding statements. There are aften common pieces of code just before (or after) the call. The common section should be moved into the function. If the surrounding code is merely similar but not identical, one can often modify it by replacing some auto variable with a global, thus making it identical so that it can be moved to the function. If the code surrounding the call appears in the same form in several, but not all occurences, consider what effect it would have to move it into the function anyways. Often the places which don't have the common piece are not affected by such code being executed, thus it can be safely moved into the function. The gain with this rule is even greater than the one from hint #27 since the call already exists. 38. A switch statement may have two or more cases which are similar except for some variables being assigned different value. The similar case-blocks can be made to reuse same code (i.e. fall thru) by initializing the variables to one of the cases and modifying it in the remaining cases. For example this code: switch (i) { ... case KEY0:
func(0,a,b,c);
case KEY1: case KEY2: ... }
break; func(1,a,b,c); break; func(2,a,b,c); break;
can be optimized (for size) as follows: x=0; switch { ... case case case ... } 39. When designing layout of structures place the most frequently accessed element first. This saves 1 byte and 2-3 clocks on access of this element via pointers to the structure. 40. In rule #4 it was stated that auto variables are more efficient than global variables (typically 1 byte or 2-3 CPU clocks). You can obtain some of the benefits for global variables by grouping them into structures and accessing them via (register variable!) pointer to the structure. Additional benefits of such global structures come from savings on argument passing (see rule #5), ability to overload (or shorten) variable names (compared to global variables) and easier context switching in multithreaded execution. Using global structures to hold global variables may also improve code readability since it gives structure to otherwise scattered collection of variables. This is similar to organizing power of functions or block constructs in structured languages except that instead of code it operates on data. 41. Structures which exceed 128 bytes will have 1 byte penalty in accessing fields at offsets beyond 128 bytes. This is due to 80x86 instructions which have more efficient form for range Index-128...Index+127 (other processors have similar short offset optimizations). Here are some ways to avoid the large structure penalty: a) Place more frequently used fields at the top of the structure. b) Move arrays out of the structure or toward the end of the structure so that most of fields will fall within the 128 byte optimum range. c) Use negative offsets into the structure, thus exploiting the full -128..+127 efficent range that 80x86 CPUs provide. The following example shows the "negative offset" technique: (i) KEY2: KEY1: KEY0: x++; x++; func(x,a,b,c); break;
Instead of: typedef struct Big { /* Structure spanning 256 bytes */ int a1,a2,..,a64; int b1,b2,..,b64; /* This half is penalized */ } BIG; BIG vars; func() { register BIG *p=&vars; ... p->b3 = p->b1 + p->b2; ... } Use: typedef struct Half1 { int b1,b2,..,b64; } HALF1; typedef struct Half2 { int a1,a2,..,a64; } HALF2; typedef struct TwoHalfs { HALF1 h1; HALF2 h2; } TWO_HALFS; The idea of "negative offsets" middle of TWO_HALFS (i.e. into of HALF1 with negative offsets one would use macros to access /* First half of BIG */ /* Global variables */
/* Some structure access */
/* Second halfs of BIG */
/* Combined halfs */
is to point a pointer P into the its h2 field) and access elements (-1,..-128). To simplify expression HALF1 as follows:
#define N(p) ((HALF1*)((char*)(p)-sizeof(HALF1))) TWO_HALFS vars; /* Global variables */
func() { register HALF2 *p=&vars.h2; /* Middle-structure pointer */ ... N(p)->b3 = N(p)->b1 + N(p)->b2; ... } Macro N(p) converts pointer type (HALF2*)p into type (HALF1*)p allowing simple and efficient access to HALF1. The expression shown inside func() will save 3 bytes and 6-9 clocks compared to the equivalent statement using the original structure BIG. There is no reason why an optimizing C compiler couldn't do similar optimization on original structure BIG and thus better utilize the native instruction set. This would affect only the compiler's internal rules on how the pointers to structures are mapped into CPU addresses: instead of mechanically mapping BIG_Ptr(vars)=CPU_ADDRESS(vars) it could use BIG_Ptr(vars)=
CPU_ADDRESS(vars)+128. C standards (ANSI/classic) don't prescribe the first form of mapping, it's the compiler writer's choice.

Hints

Загружено:

Сведения о документе

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

Hints

Загружено:

Авторское право:

Доступные форматы

HINTS FOR EFFICIENT PROGRAMMING Copyright (c) 1992-95 by Omega Point, Inc.

case KEY1: case KEY2: ... }

break; func(1,a,b,c); break; func(2,a,b,c); break;

/* Some structure access */

/* Second halfs of BIG */

#define N(p) ((HALF1)((char)(p)-sizeof(HALF1))) TWO_HALFS vars; /* Global variables */

Вам также может понравиться