
Increasing Performance of a High-Resolution Monitor Wall

Jason Monk
Advisor: Prof. Bruce Segee


Department of Electrical and Computer Engineering, University of Maine, Orono, USA
jason.monk@maine.edu

Abstract—This paper examines several aspects of rendering OpenGL graphics on large displays using VirtualGL and VNC. It looks at increasing performance in two areas of rendering on a monitor wall: a CUDA-enhanced version of VirtualGL and the advantages of running multiple VNC servers. It discusses the restrictions imposed by read back and blitting rates and how these rates are affected by the size of the image being rendered. A CUDA extension for VirtualGL was developed that allows faster read back at high resolutions.

I. INTRODUCTION

As we move into the digital age, large displays are becoming more and more popular. The most common way to get one is a projector. A projector is a useful tool; however, as the projected image grows, each pixel grows with it and the image looks worse and worse. A less common alternative is a monitor wall: several monitors set up in a grid. This solution allows large images while maintaining detail, because adding more monitors increases the total resolution without increasing the pixel size, as enlarging a projected image would. There are several ways to achieve high-resolution data visualization, at both the hardware and the software layer. Although most high-resolution systems use a monitor grid, they differ in how the image is distributed to each display. Some create the image on one computer and distribute it to client machines afterwards, whereas others have each client render its own piece of the image. The four-by-four monitor wall located at the UMaine Innovation Center uses the first approach: one computer creates the image and then distributes it. To do this the computer runs two software packages: xtightvncserver, which creates a large virtual display and distributes it, and VirtualGL, which performs the OpenGL rendering for applications run inside the VNC server. Currently the setup achieves frame rates of one to two frames per second while rendering close to full screen (about twenty megapixels).

When nVidia started producing the G8X series of cards it introduced an architecture called CUDA, and most of its video cards since then have had CUDA support. With this architecture nVidia provides extensions for C/C++ and an Application Programming Interface (API) that allow code to be executed on the GPU. Since then the concept of GPGPU (general-purpose computing on graphics processing units) has been growing: the GPU is very good at algebra and at running work in parallel, so that power should be put to use for other applications.

Fig. 1. Display Wall at the Innovation Center.

II. PROJECT GOAL

The goal of this project is to increase the performance of the display wall at the Innovation Center, preferably by harnessing unused processing power available, through CUDA, in the GPUs of the computers hosting the wall.

III. READ BACK RATE TESTING

At the start of the project it was believed that the factor limiting the frame rate of the display was read back from the video card (the process of getting each frame from the video card to the CPU for distribution). No reliable read rates for these cards were available online, so the rates had to be measured. A program was written (Appendix A) that uses CUDA to write and read large random amounts of data to and from video card memory and then processes the collected times; the program was later expanded to test the several types of memory accessible through cudaMemcpy. The data collected showed that write speeds were usually between 1700 and 2000 MB/s and read speeds only slightly slower, between 1500 and 1800 MB/s; results are shown in Figure 2. When writing to the card, cudaMemcpy appeared to have an overhead of about 1.2 ms; when reading from the card the overhead was only 37 us. Using page-locked (pinned) memory can significantly increase both read and write speeds, to over 3 GB/s.
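The full benchmark is listed in Appendix A. As a minimal sketch of the measurement idea (the buffer size, device index, and helper name below are illustrative assumptions, not taken from the report), the host-to-device rate for pageable versus page-locked memory can be timed roughly as follows:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <cuda_runtime.h>

/* Time a single host-to-device copy and report MB/s.
   'pinned' selects page-locked host memory (cudaMallocHost)
   versus ordinary pageable memory (malloc). */
static double h2d_rate_mb(size_t bytes, int pinned)
{
    void *h = NULL, *d = NULL;
    struct timespec t0, t1;

    if (pinned) cudaMallocHost(&h, bytes); else h = malloc(bytes);
    cudaMalloc(&d, bytes);
    if (!h || !d) { fprintf(stderr, "allocation failed\n"); exit(1); }
    memset(h, 0x5a, bytes);                 /* touch the pages */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();                /* make sure the copy has finished */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    cudaFree(d);
    if (pinned) cudaFreeHost(h); else free(h);
    return (bytes / (1024.0 * 1024.0)) / sec;
}

int main(void)
{
    size_t bytes = 80u * 1024u * 1024u;     /* 80 MB test buffer */
    cudaSetDevice(0);
    printf("pageable: %7.1f MB/s\n", h2d_rate_mb(bytes, 0));
    printf("pinned:   %7.1f MB/s\n", h2d_rate_mb(bytes, 1));
    return 0;
}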

Fig. 2. Read and Write Test on GTX260.

What had not been taken into account was that, on the wall, rendering may or may not be taking place while read back is occurring. So a new set of tests was run while something requiring rendering was executing on the video card; glxgears was used because it was already set up on the testing machine and was easy to drive from automated tests.

Read and write speeds were tested while rendering square glxgears windows with widths from 50 up to 1150 pixels. The results showed that the read and write speeds while rendering were directly proportional to the number of pixels rendered. Figure 3 plots the read speeds measured while rendering the various square resolutions, and the plot clearly follows a linear trend. Notice the very large drop from the CUDA testing program running by itself to any rendering at all taking place: bandwidth falls by several hundred MB/s once GLX rendering is occurring. Using the linear regression we can extrapolate that there would be no bandwidth left while rendering at sizes upwards of 12 MP, let alone 20 MP. This shows that reading back while rendering is undesirable, and that better frame rates can only be achieved if rendering and reading are performed one after another rather than concurrently. The question then became what exactly VirtualGL was doing, so the next step was examining the VirtualGL source code.

Fig. 3. Read Speeds while Rendering.

IV. VIRTUALGL

VirtualGL is software that intercepts OpenGL commands from any program and runs them on the video card in an off-screen buffer called a Pbuffer. Once each frame is rendered, VirtualGL reads it from the Pbuffer back into host memory (memory available to the CPU), where it becomes available to other software such as a VNC server. VirtualGL is written mostly in C++, with a few pieces in C. Already built into VirtualGL is a profiler that tracks its frame rate; it times the read back, the blit (writing to the X server), and the total frame rate. VirtualGL and a VNC server were compiled and installed on a separate machine so that VirtualGL's speed could be tested and any code changes could be run in a test environment instead of on the working system. Profiling on the testing machine (at low resolutions) confirmed that the issue was read back, so read back speed needed to be increased. VirtualGL reads each frame with an OpenGL function named glReadPixels, which is widely acknowledged not to perform nearly as fast as it should; performance testing showed that glReadPixels was not transferring at the full speed found earlier with CUDA. This means that if the transfer could be performed using CUDA rather than the GL call, it could run faster. nVidia has been working on a set of CUDA functions that allow interoperability between CUDA and OpenGL; currently they allow an OpenGL Pixel Buffer Object (PBO) to be mapped to a pointer available to CUDA calls. The problem with using CUDA to read the frame is that the frame first has to be moved from the Pbuffer it was rendered into to a PBO, and the only way to do that is with glReadPixels. This glReadPixels call may not be as slow as the original, since the transfer stays within GPU memory, which has high bandwidth (greater than 10 GB/s). A similar example in the nVidia CUDA SDK renders a frame, calls glReadPixels to move the pixels into a PBO, and then maps the PBO to a pointer readable by CUDA calls. In the nVidia example glReadPixels copies at 10-40 GB/s, as opposed to less than 1 GB/s when copying to host memory.

nVidia's example provided in the CUDA SDK requested a different pixel format than the other programs used for testing VirtualGL: it requested not only RGB pixels but alpha as well. The function controlling these pixel formats is glXChooseVisual, which takes a list of attributes for the window and produces an XVisual structure to be used. Since VirtualGL already intercepts this call, only a simple change is required to replace the requested attributes with ones that improve performance. The performance increase arises because glReadPixels takes a parameter specifying the format to read the pixels into; when the requested format and the format of the buffer do not match, a conversion must be done, creating a lot of overhead.

The process of mapping and un-mapping a PBO each read cycle also creates a lot of overhead. When dealing with a small window of only a few megapixels, the overhead is so great that no read back improvement can be measured. However, at larger resolutions, specifically 20 MP, an increase in read back rate is found: testing Google Earth at a full screen of 20 MP, read back improved by about 30%, from the original 120 MP/s in the testing environment to around 160 MP/s.

To achieve this speedup, several things must be done whenever a Pbuffer is created:
1) Before any CUDA GL calls can be made, cudaGLSetGLDevice must be called; it is also good practice to call cudaSetDevice to make sure which device is being used.
2) The PBO must be created and filled with data; this can be done using glGenBuffers, glBindBuffer, and glBufferData.
3) The PBO must be registered with CUDA using cudaGLRegisterBufferObject; upon exiting, cudaGLUnregisterBufferObject should be called.

Each time a frame is read, several steps are also required:
1) The PBO is bound to the pack buffer and filled using glReadPixels.
2) The PBO is unbound from the pack buffer and mapped using cudaGLMapBufferObject.
3) The data is read from the PBO using cudaMemcpy.
4) The PBO is unmapped using cudaGLUnmapBufferObject.

The entire process from the user's program to the X server is shown in Figure 4. Although this process is more complex, it is faster because CUDA is better at utilizing the bandwidth between the CPU and the GPU. The corresponding VirtualGL code can be found in Appendix B.

Once the frame was already being transferred through CUDA, an attempt was made at transferring and blitting only the changes between frames. A simple algorithm was used: if a change was found within a group of 512 pixels, those 512 pixels would be updated. The overhead associated with checking for changes and transferring small groups of pixels turned out to be too large to make this approach worth using, but it shows promise for addressing the problem associated with blitting large frames to the X server.
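For illustration, a minimal sketch of that block-wise change detection might look like the following. The 512-pixel granularity comes from the text; the 32-bit pixel layout and the send_block callback are assumptions made here for the sketch:

#include <stdint.h>
#include <string.h>

#define BLOCK_PIXELS 512            /* granularity used in the experiment */

/* Compare the current frame against the previous one in groups of 512
 * pixels (4 bytes per pixel assumed) and hand each changed block to a
 * user-supplied update callback. Returns the number of blocks sent. */
size_t update_changed_blocks(const uint32_t *cur, uint32_t *prev,
                             size_t total_pixels,
                             void (*send_block)(size_t offset,
                                                const uint32_t *px,
                                                size_t count))
{
    size_t sent = 0;
    for (size_t off = 0; off < total_pixels; off += BLOCK_PIXELS) {
        size_t count = total_pixels - off;
        if (count > BLOCK_PIXELS) count = BLOCK_PIXELS;

        if (memcmp(cur + off, prev + off, count * sizeof(uint32_t)) != 0) {
            send_block(off, cur + off, count);      /* blit only this block */
            memcpy(prev + off, cur + off, count * sizeof(uint32_t));
            sent++;
        }
    }
    return sent;
}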

Fig. 4. Flow for CUDA-Enhanced Read Back.
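To make the flow in Figure 4 concrete, the following is a minimal sketch of the set-up and per-frame steps listed above. It is not the actual VirtualGL patch (see Appendix B for that); it assumes a current OpenGL context, GLEW for the buffer-object entry points, an RGBA frame, and the CUDA/OpenGL interoperability API of that era (cudaGLRegisterBufferObject and related calls).

#include <GL/glew.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

static GLuint pbo = 0;

/* One-time set-up for a w x h RGBA Pbuffer. */
void cuda_readback_init(int w, int h)
{
    cudaSetDevice(0);
    cudaGLSetGLDevice(0);                      /* must precede any CUDA GL calls */

    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, pbo);
    glBufferData(GL_PIXEL_PACK_BUFFER_ARB, (size_t)w * h * 4,
                 NULL, GL_STREAM_READ);        /* allocate storage, no initial data */
    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);

    cudaGLRegisterBufferObject(pbo);           /* make the PBO visible to CUDA */
}

/* Per-frame read back of the current read buffer into host memory. */
void cuda_readback_frame(int w, int h, unsigned char *host_bits)
{
    void *dev_ptr = NULL;

    /* 1) Fill the PBO on the GPU; with a NULL pointer, glReadPixels packs
          into the bound PBO instead of host memory. */
    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, pbo);
    glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);

    /* 2) Map the PBO into CUDA's address space. */
    cudaGLMapBufferObject(&dev_ptr, pbo);

    /* 3) Copy the frame to host memory over CUDA's faster transfer path. */
    cudaMemcpy(host_bits, dev_ptr, (size_t)w * h * 4, cudaMemcpyDeviceToHost);

    /* 4) Unmap so OpenGL may use the PBO again next frame. */
    cudaGLUnmapBufferObject(pbo);
}

void cuda_readback_cleanup(void)
{
    cudaGLUnregisterBufferObject(pbo);
    glDeleteBuffers(1, &pbo);
}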

V. VNC

When testing the performance of code changes in VirtualGL, a significant difference was found between the display wall at the Innovation Center and the test environment: the test environment was limited by the time it takes to read back, whereas the display wall, at a much higher resolution, is limited by the blit speed. In the blitting process VirtualGL makes a series of X calls to draw to the window and then calls XSync, which is what takes most of the time, so the limited speed did not appear to be due to VirtualGL itself. At that time the VNC server was also consuming 100% of one CPU; since the VNC server was not multi-threaded, it could not take any more time than that. According to profiling of the VNC server done in the past, compression was taking the largest amount of time. Compression had been enabled because, with no compression, the client machines were often crashing.

Attempting raw mode again was the next step to reduce CPU usage. After enlarging the read and write network buffers, raw mode was able to run for several minutes. During that time it performed at about the same frame rate as before, but the VNC server used much less CPU and network usage went up significantly (probably saturating during image transfer). Since these two transfer modes gave the same performance, there was likely a balance point between compression and network usage that would perform better. After a few tests, better performance was achieved by setting the JPEG quality on the VNC clients to 3, which increased the frame rate to 2-3 frames per second with noticeable image quality degradation. With this setting the CPU is no longer fully occupied, but the network is not at full capacity either.

Knowing that such a performance change could be found in the VNC server, the next logical step was running it in parallel. The VNC server being used handled each request sequentially; in this environment two main clients were accessing the server, each of the client nodes driving eight monitors. Because requests are handled sequentially, the second client must always wait for the first to finish before its request is processed. Although the VNC server was no longer using all of a CPU, it was still using more than half of its time, meaning it was quite busy.

The VNC server being used was a program called xtightvnc, a modified version of the original Xvnc. There is, to date, no version of Xvnc that handles client requests in parallel. When Xvnc is launched it handles all of the X server setup required to create a second display (e.g. display :1). Another common VNC server is x11vnc, which connects to an already existing display and allows clients to view whatever is on that display; it is most commonly used by users who want remote access to display :0.

A simple test was performed in an attempt to relieve the CPU restriction of serving clients sequentially. Instead of having the client nodes connect to a single VNC server, the clients connected to individual x11vnc servers, one handling the top half of the screen and the other the bottom half. The performance increase was easily noticeable to the eye. After this, xtightvnc was no longer used: an X virtual frame buffer creates display :1, and an x11vnc server is set up for each client that wants to connect. The same 2-3 frames per second could be achieved without the quality degradation caused by lowering the JPEG quality setting.

There are many advantages to having multiple x11vnc servers rather than a single Xvnc server, such as one client not affecting the speed of another and being able to serve only sections of the screen. There are also disadvantages. One worth noting is that the clients' updates are no longer synchronized, so they may run at different frame rates. Another concern is that x11vnc can read the screen at the same time VirtualGL is updating it, causing tearing or other strange effects in the image.

VI. CONCLUSION

This project identified several factors affecting the performance of rendering on a monitor wall. A balance between network and CPU usage must be found for the VNC server for optimal performance; this is most easily done by running multiple VNC servers. It was also found that the read back rate can be increased noticeably by using CUDA to read pixels rather than reading through OpenGL.

APPENDIX

A. CUDA Memory Bandwidth Testing Program


#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <math.h>

#include <cutil_inline.h>

#define PINNED      0
#define CPUGPU      1
#define GPUGPU      2
#define CPUCPU      3
#define PINTOUNPIN  4
#define PINTOPIN    5
#define GPINTOUNPIN 6

int memtest(int type, long long int bytes, double *to, double *from);
void printtype(int type);

//#define SAMPLES 4
#define MEMCPY 0

int main(int argc, char *argv[])
{
    long long int bytes;
    double **toarray, **fromarray, *to, *from;
    long long int **bytesarray, *bytesr;
    int curtype;
    int i;
    double averageto, averagefrom;
    int SAMPLES;

    if (argc == 2)
        sscanf(argv[1], "%d", &SAMPLES);
    else
        SAMPLES = 3;

    cutilSafeCall(cudaSetDevice(0));

    toarray = (double **)malloc(6*sizeof(double *));
    fromarray = (double **)malloc(6*sizeof(double *));
    bytesarray = (long long int **)malloc(6*sizeof(long long int *));
    for (i = 0; i < 6; i++) {
        toarray[i] = (double *)malloc(SAMPLES*sizeof(double));
        fromarray[i] = (double *)malloc(SAMPLES*sizeof(double));
        bytesarray[i] = (long long int *)malloc(SAMPLES*sizeof(long long int));
    }

    for (curtype = 0; curtype < 6; curtype++) {
        i = 0;
        to = toarray[curtype];
        from = fromarray[curtype];
        bytesr = bytesarray[curtype];
        while ((i < SAMPLES)) {
            bytes = rand() % 153600000;
            if (bytes < 1000000)
                continue;
            if (!memtest(curtype, bytes, to+i, from+i)) {
                bytesr[i++] = bytes;
            }
        }
    }

    printf("Type:\t\t\tTo (MB/s)\t\tFrom (MB/s)\n");
    for (curtype = 0; curtype < 6; curtype++) {
        to = toarray[curtype];
        from = fromarray[curtype];
        bytesr = bytesarray[curtype];
        averageto = 0;
        averagefrom = 0;
        for (i = 0; i < SAMPLES; i++) {
            averageto += ((*(bytesr+i)/1024.0/1024.0) / (*(to+i))*(1000000000.0));
            averagefrom += ((*(bytesr+i)/1024.0/1024.0) / (*(from+i))*(1000000000.0));
        }
        averageto /= SAMPLES;
        averagefrom /= SAMPLES;
        printtype(curtype);
        printf("\t%7.2lf\t\t\t%7.2lf\n", averageto, averagefrom);
    }
    return 0;
}

void printtype(int type)
{
    switch(type) {
    case PINNED:
        printf("GPU to Pinned Memory:  ");
        break;
    case CPUGPU:
        printf("CPU to GPU Memory:     ");
        break;
    case GPUGPU:
        printf("GPU to GPU Memory:     ");
        break;
    case CPUCPU:
        printf("CPU to CPU Memory:     ");
        break;
    case PINTOUNPIN:
        printf("CPUPIN to CPU Memory:  ");
        break;
    case PINTOPIN:
        printf("PIN to PIN Memory:     ");
        break;
    case GPINTOUNPIN:
        printf("GPU to CPUPIN to CPU Memory: \n");
        break;
    }
    return;
}

int memtest(int type, long long int bytes, double *to, double *from)
{
    int pinned;
    void *h, *d, *tmp;
    int i = 0;
    int j;
    double rate1, rate2;
    struct timespec start, stop;

    if (type == GPUGPU) {
        tmp = malloc(bytes);
        if (!tmp)
            return 1;
        d = NULL;
        cudaMalloc(&d, bytes);
        if (!d) {
            return 1;
        }
        h = NULL;
        cudaMalloc(&h, bytes);
        if (!h) {
            cudaFree(d);
            return 1;
        }
        for (i = 0; i < (bytes/sizeof(double)); i++) {
            *((double*)tmp+i) = rand();
        }
        cutilSafeCall(cudaMemcpy(h, tmp, bytes, cudaMemcpyHostToDevice));

        clock_gettime(CLOCK_MONOTONIC, &start);
        cutilSafeCall(cudaMemcpy(d, h, bytes, cudaMemcpyDeviceToDevice));
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate1 = ((stop.tv_nsec-start.tv_nsec));

        clock_gettime(CLOCK_MONOTONIC, &start);
        cutilSafeCall(cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToDevice));
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate2 = ((stop.tv_nsec-start.tv_nsec));

        printf("%lld\t%.2lf\t%.2lf\n", bytes, rate1, rate2);
        free(tmp);
        cudaFree(d);
        cudaFree(h);
    }

    if ((type == CPUCPU)||(type == PINTOUNPIN)||(type == PINTOPIN)) {
        if ((type == PINTOUNPIN)||(type == PINTOPIN)) {
            cutilSafeCall( cudaMallocHost((void **)&h, bytes) );
        } else {
            h = malloc(bytes);
        }
        if (!h) {
            return 1;
        }
        if ((type == PINTOPIN)) {
            cutilSafeCall( cudaMallocHost((void **)&d, bytes) );
        } else {
            d = malloc(bytes);
        }
        if (!d) {
            if ((type == PINTOUNPIN)||(type == PINTOPIN))
                cudaFreeHost(h);
            else
                free(h);
            return 1;
        }
        for (i = 0; i < (bytes/sizeof(double)); i++) {
            *((double*)h+i) = rand();
        }

        clock_gettime(CLOCK_MONOTONIC, &start);
#if MEMCPY
        memcpy(d, h, bytes);
#else
        cutilSafeCall(cudaMemcpy(d, h, bytes, cudaMemcpyHostToHost));
#endif
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate1 = ((stop.tv_nsec-start.tv_nsec));

        for (i = 0; i < (bytes/sizeof(double)); i++) {
            *((double*)d+i) = rand();
        }

        clock_gettime(CLOCK_MONOTONIC, &start);
#if MEMCPY
        memcpy(h, d, bytes);
#else
        cudaMemcpy(h, d, bytes, cudaMemcpyHostToHost);
#endif
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate2 = ((stop.tv_nsec-start.tv_nsec));

        printf("%lld\t%.2lf\t%.2lf\n", bytes, rate1, rate2);
        if ((type == PINTOUNPIN)||(type == PINTOPIN)) {
            cudaFreeHost(h);
        } else {
            free(h);
        }
        if ((type == PINTOPIN)) {
            cudaFreeHost(d);
        } else {
            free(d);
        }
    }

    if ((type == PINNED)||(type == CPUGPU)) {
        if (type == PINNED) {
            pinned = 1;
        } else {
            pinned = 0;
        }
        h = NULL;
        if (pinned == 0) {
            h = malloc(bytes);
        } else {
            cutilSafeCall( cudaMallocHost((void **)&h, bytes) );
        }
        if (!h) {
            return 1;
        }
        d = NULL;
        cudaMalloc(&d, bytes);
        if (!d) {
            free(h);
            return 1;
        }
        for (j = 0; j < (bytes/sizeof(double)); j++) {
            *((double*)h+j) = rand();
        }

        clock_gettime(CLOCK_MONOTONIC, &start);
        cutilSafeCall(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice));
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate1 = ((stop.tv_nsec-start.tv_nsec));

        clock_gettime(CLOCK_MONOTONIC, &start);
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
        clock_gettime(CLOCK_MONOTONIC, &stop);
        rate2 = ((stop.tv_nsec-start.tv_nsec));

        if ((rate1 < 0) || (rate2 < 0))
            return 1;
        printf("%lld\t%.2lf\t%.2lf\n", bytes, rate1, rate2);
        if (pinned == 0) {
            free(h);
        } else {
            cutilSafeCall(cudaFreeHost(h));
        }
        cudaFree(d);
    }

    *to = rate1;
    *from = rate2;
    return 0;
}
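Building the program is not described in the report; presumably it can be compiled with nvcc, with the include path pointed at the CUDA SDK directory that provides cutil_inline.h and with librt linked for clock_gettime, for example nvcc -I$CUDA_SDK/C/common/inc memtest.cu -o memtest -lrt (the SDK path and file name here are assumptions).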

B. Changes to pbwin.cpp in VirtualGL


void pbwin::readpixels(GLint x, GLint y, GLint w, GLint pitch, GLint h,
    GLenum format, int ps, GLubyte *bits, GLint buf, bool stereo)
{
    static int zfq = 0;
    struct timespec start, stop;

    GLint readbuf = GL_BACK;
    _glGetIntegerv(GL_READ_BUFFER, &readbuf);

    tempctx tc(_localdpy, EXISTING_DRAWABLE, GetCurrentDrawable());

glReadBuffer(buf); glPushClientAttrib(GL_CLIENT_PIXEL_STORE_BIT);

    if(pitch%8==0) glPixelStorei(GL_PACK_ALIGNMENT, 8);
    else if(pitch%4==0) glPixelStorei(GL_PACK_ALIGNMENT, 4);
    else if(pitch%2==0) glPixelStorei(GL_PACK_ALIGNMENT, 2);
    else if(pitch%1==0) glPixelStorei(GL_PACK_ALIGNMENT, 1);

    // Clear previous error
    int e=glGetError();
    while(e!=GL_NO_ERROR) e=glGetError();

    if ((!first)&&((cw!=w)||(ch!=h))) {
        if (!cudafl) {
            cudafl = 1;
            i = 1;
            rrout.PRINT("[VGL] Resolution Change Attempting CUDA Acceleration (%d,%d) to (%d,%d)\n",
                cw, ch, w, h);
        }
        cudachangesize(w, h);
        cw = w;
        ch = h;
    }
    if (first) {
        cudafl = 1;
    }
    if (cudafl) {
        if (first) {
            static int go = 1;
            int i = 0;
            cw = w;
            ch = h;
            glewInit();
            cudastart(w, h);
            if (go) {
                go = 0;
                cutilSafeCall(cudaSetDevice(0));
                cutilSafeCall(cudaGLSetGLDevice(0));
            }
            while (cudamakebuffer(x, y/(1<<i), &buffer, &p)) i++;
            rrout.PRINT("[VGL] Successfully Created Buffer for i = %d\n", i);
            cutilSafeCall(cudaGLMapBufferObject(&p, buffer));
            first = 0;
        }
        if (i) {
            rrout.PRINT("[VGL] Starting CUDA Accelerated Mode\n");
            i = 0;
            _prof_rb.startframe();
            glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, buffer);
            clock_gettime(CLOCK_MONOTONIC, &start);
            _glReadPixels(x, y, w, h, format, GL_UNSIGNED_BYTE, (GLvoid *)NULL);
            clock_gettime(CLOCK_MONOTONIC, &stop);
            rrout.PRINT("[VGL] glReadPixels took %lf ns\n",
                (double)stop.tv_nsec - start.tv_nsec);
            if ((((double)stop.tv_nsec - start.tv_nsec) > 1000000)
                || (((double)stop.tv_nsec - start.tv_nsec) < 0)) {
                cudafl = 0;
                rrout.PRINT("[VGL] Switching to Normal Mode\n");
            }
            cutilSafeCall(cudaGLMapBufferObject(&p, buffer));
            cudaMemcpy(bits, p, w*h*sizeof(GL_UNSIGNED_BYTE), cudaMemcpyDeviceToHost);
            glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);
            cutilSafeCall(cudaGLUnmapBufferObject(buffer));
            _prof_rb.endframe(w*h, w*h*4, stereo? 0.5 : 1);
        } else {
            _prof_rb.startframe();
            glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, buffer);
            _glReadPixels(x, y, w, h, format, GL_UNSIGNED_BYTE, (GLvoid *)NULL);
            cutilSafeCall(cudaGLMapBufferObject(&p, buffer));
            cudaMemcpy(bits, p, w*h*sizeof(GL_UNSIGNED_BYTE), cudaMemcpyDeviceToHost);
            glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);
            cutilSafeCall(cudaGLUnmapBufferObject(buffer));
            _prof_rb.endframe(w*h, w*h*4, stereo? 0.5 : 1);
        }
    } else {
        _prof_rb.startframe();
        _glReadPixels(x, y, w, h, format, GL_UNSIGNED_BYTE, (GLvoid *)bits);
        _prof_rb.endframe(w*h, w*h*4, stereo? 0.5 : 1);
    }

checkgl("Read Pixels");

    // Gamma correction
    if(!_gammacorrectedvisual && fconfig.gamma!=0.0 && fconfig.gamma!=1.0
        && fconfig.gamma!=-1.0)
    {
        _prof_gamma.startframe();
        static bool first=true;
        #ifdef USEMEDIALIB
        if(first)
        {
            first=false;
            if(fconfig.verbose)
                rrout.println("[VGL] Using mediaLib gamma correction (correction factor=%f)\n",
                    (double)fconfig.gamma);
        }
        mlib_image *image=NULL;
        if((image=mlib_ImageCreateStruct(MLIB_BYTE, ps, w, h, pitch, bits))!=NULL)
        {
            unsigned char *luts[4]={fconfig.gamma._lut, fconfig.gamma._lut,
                fconfig.gamma._lut, fconfig.gamma._lut};
            mlib_ImageLookUp_Inp(image, (const void **)luts);
            mlib_ImageDelete(image);
        }
        else
        {
        #endif
        if(first)
        {
            first=false;
            if(fconfig.verbose)
                rrout.println("[VGL] Using software gamma correction (correction factor=%f)\n",
                    (double)fconfig.gamma);
        }
        unsigned short *ptr1, *ptr2=(unsigned short *)(&bits[pitch*h]);
        for(ptr1=(unsigned short *)bits; ptr1<ptr2; ptr1++)
            *ptr1=fconfig.gamma._lut16[*ptr1];
        if((pitch*h)%2!=0) bits[pitch*h-1]=fconfig.gamma._lut[bits[pitch*h-1]];
        #ifdef USEMEDIALIB
        }
        #endif
        _prof_gamma.endframe(w*h, 0, stereo?0.5 : 1);
    }

    // If automatic faker testing is enabled, store the FB color in an
    // environment variable so the test program can verify it
    if(fconfig.autotest)
    {
        unsigned char *rowptr, *pixel;
        int match=1;
        int color=-1, i, j, k;
        color=-1;
        if(buf!=GL_FRONT_RIGHT && buf!=GL_BACK_RIGHT) _autotestframecount++;
        for(j=0, rowptr=bits; j<h && match; j++, rowptr+=pitch)
            for(i=1, pixel=&rowptr[ps]; i<w && match; i++, pixel+=ps)
                for(k=0; k<ps; k++)
                {
                    if(pixel[k]!=rowptr[k]) {match=0; break;}
                }
        if(match)
        {
            if(format==GL_COLOR_INDEX)
            {
                unsigned char index;
                glReadPixels(0, 0, 1, 1, GL_COLOR_INDEX, GL_UNSIGNED_BYTE, &index);
                color=index;
            }
            else
            {
                unsigned char rgb[3];
                glReadPixels(0, 0, 1, 1, GL_RGB, GL_UNSIGNED_BYTE, rgb);
                color=rgb[0]+(rgb[1]<<8)+(rgb[2]<<16);
            }
        }
        if(buf==GL_FRONT_RIGHT || buf==GL_BACK_RIGHT)
        {
            snprintf(_autotestrclr, 79, "__VGL_AUTOTESTRCLR%x=%d",
                (unsigned int)_win, color);
            putenv(_autotestrclr);
        }
        else
        {
            snprintf(_autotestclr, 79, "__VGL_AUTOTESTCLR%x=%d",
                (unsigned int)_win, color);
            putenv(_autotestclr);
        }
        snprintf(_autotestframe, 79, "__VGL_AUTOTESTFRAME%x=%d",
            (unsigned int)_win, _autotestframecount);
        putenv(_autotestframe);
    }

    glPopClientAttrib();
    tc.restore();
    glReadBuffer(readbuf);
}
