
Page-Mapping Techniques for CC-NUMA Multiprocessors

Jian Huang
Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455, USA. E-mail: huangj@cs.umn.edu

Guohua Jin
Department of Computer Science, Rice University, Houston, TX 77005, USA. E-mail: jin@cs.rice.edu

Zhiyuan Li
Department of Computer Sciences, Purdue University, West Lafayette, IN 47907, USA. E-mail: li@cs.purdue.edu

Careful page mapping has been shown in the past to be effective for reducing cache conflicts on both uniprocessors and Uniform Memory Access (UMA) multiprocessors. This paper extends previous page-mapping schemes to the more recent Cache-Coherent Non-Uniform Memory Access (CC-NUMA) multiprocessors. These extensions maintain the program's data-task affinity, which is important on CC-NUMA machines, while reducing cache-set conflicts by carefully selecting page frames. Using an execution-driven simulator that models a CC-NUMA machine with a 4-MB secondary cache and a 16-KB primary cache on each of its 4-issue superscalar processors, we find that, when non-coherence cache misses are relatively heavy, it is quite important for page mapping to preserve the compiler-generated memory module ID (MID), which determines data distribution among the processors. We also find that a straight application of page-coloring performs 10-45% worse than bin-hopping, while by hashing the page color with part of the MID, page-coloring can perform comparably to bin-hopping.
1 Introduction

Cache-Coherent Non-Uniform Memory Access (CC-NUMA) multiprocessors have become increasingly attractive as an architecture that provides transparent access to local and remote memories along with good scalability. Systems based on this architecture include research prototypes such as the Stanford DASH and FLASH and the MIT Alewife, as well as commercial products including the Sequent STiNG, Hewlett-Packard SPP, and Silicon Graphics Origin 2000 (see our technical report [5] for references). A CC-NUMA machine has a number of nodes connected by an interconnection network. Each node consists of one or a few processors, a private cache hierarchy, and a local memory module.

Each node has a Node ID (NID), each processor has a Processor ID (PID), and each memory module has a Module ID (MID). The local memory modules of all nodes together form a shared, contiguous memory space used by the operating system (OS). References to the local memory module avoid the network latency, while references to non-local memory modules (remote references) may experience a two-hop or three-hop network delay. Although this is transparent to the programmer, the OS and the compiler need to minimize the number of remote references. The compiler can analyze the program and allocate the data in memory such that it aligns with the parallel tasks [1,2,7]. The paging subsystem in the OS keeps the allocation information and maps each virtual page to a physical page. This process includes the choice of the MID and of the so-called page color [6]. Previous work shows that the page assignment made by the OS can affect the number of cache-set conflicts, and hence program performance, on uniprocessor machines as well as on multiprocessors with uniform memory access [3,6]. Various techniques have been proposed, such as page-coloring, bin-hopping, best-bin, the hierarchical method [6], compiler-assisted page-coloring [3], and dynamic re-mapping [11]. Page-coloring and bin-hopping are simple and hence the most popular. Silicon Graphics Inc. adopts the page-coloring scheme in its products, while DEC ships OSF/1 with bin-hopping. In this paper, we extend these two techniques to CC-NUMA multiprocessors. A new page-mapping scheme is then proposed to further reduce cache-set conflicts under page-coloring. Four popular SPEC floating-point benchmark programs and one program from Numerical Recipes [10] were parallelized for the experiments. The rest of the paper is organized as follows: Section 2 discusses the extensions of page-coloring and bin-hopping to the CC-NUMA environment. It also discusses the issue of unnecessary cache misses in page-coloring. Section 3 describes the experimental setup. Section 4 analyzes the results, and Section 5 concludes the paper.

2 Careful Page Mapping in CC-NUMA Machines

On a paging-based virtual memory system, the OS must assign a physical memory module to each virtual page that is brought into physical memory. Selection of the physical memory module is usually done by designating a set of n = log2(p) bits in the physical address to identify the MID, where p stands for the number of nodes in the whole system. For convenience, our model assumes that each node has only one processor; hence, the MID and the NID are the same for each individual node. After the MID is extracted, we need to find a physical page from the free-page pool in the identified memory module and allocate it for the faulting virtual page.
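As a concrete illustration, the following C sketch extracts the MID from a physical address. The bit positions, page size, and node count are hypothetical (as discussed below, they are flexible in the design); they are chosen only to match the 16-node, 4KB-page examples used later in this section:

    #include <stdint.h>

    /* Hypothetical layout: 4 MID bits (16 nodes) placed directly above
       a 12-bit page offset (4KB pages); positions are illustrative. */
    #define PAGE_OFFSET_BITS 12
    #define MID_BITS         4
    #define MID_MASK         ((1u << MID_BITS) - 1)

    /* Return the memory-module ID encoded in a physical address. */
    static inline unsigned mid_of(uint64_t paddr) {
        return (unsigned)(paddr >> PAGE_OFFSET_BITS) & MID_MASK;
    }

Under this layout, for example, mid_of(0x100001a8) yields 0, because bits 12-15 of that address are zero.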

[Figure 1: Description of Cache-Bin. A virtual address (virtual page number, page offset) is translated into a real address (page frame, page offset); the bin-index bits within the page frame select which cache sets the page maps to, i.e., its cache-bin.]
[Figure 2: Description of Cache-Bin. The address is interpreted as a virtual page number (comprising the page color and the MID bits) followed by the page offset; the set index spans the page-color bits and part of the page offset, down to the cache-line offset.]

For a given physical page, there is a certain number of cache sets that can cache the data of this page. All these cache sets together are called a cache-bin [6]. The page-mapping process is essentially the selection of a cache-bin for a particular virtual page (Figure 1). Since the number of cache-bins is limited, we may see frequent cache-set conflicts if we do not utilize all the bins intelligently. We interpret an address in the way described in Figure 2. The set index (SI) decides which cache set an address resides in. The MID bits and the SI bits are kept separate so that physical addresses from different memory modules can be mapped to the whole cache. If n MID bits overlap the SI bits in hardware, we essentially divide the cache into 2^n portions, which hurts cache utilization. A natural decision is therefore to separate the MID bits from the SI bits in the physical address, which gives the OS the greatest flexibility to influence cache mapping. The MID bits are also kept separate from the page offset so that each page resides in one processor's memory in its entirety. Other than these constraints, the MID position in the physical address and the virtual address is quite flexible, although it must be fixed for a given hardware and OS. The compiler can generate virtual addresses according to the MID positions such that data locality is improved.

[Figure 3: Conflicting cache-bin ID for two arrays. A1(650) starts at 0x100001a8 = 00010000000000000000000110101000, and I1(256) starts at 0x10001728 = 00010000000000000001011100101000; each address is divided into bin-index, MID, and page-offset fields.]

Part of the SI lies inside the page offset and should be left untouched during the mapping process. The rest of the SI is considered the color of a page, which is also called the cache-bin ID [6]. In page-coloring, the OS tries to allocate a physical page that has the same color as the virtual page, while in bin-hopping, a physical page whose cache-bin ID is consecutive with that of the previously allocated cache-bin is assigned. The page-coloring scheme picks a page randomly if the desired one is not available, while bin-hopping resorts to the next cache-bin in the sequence. A direct application of these schemes to CC-NUMA machines extracts the color from the virtual address (page-coloring) or calculates the next cache-bin ID (bin-hopping) and looks for a desired page. The MID for each physical page is obtained in a round-robin manner. This practice distributes the requests for pages evenly across all memory modules, but it discards the compiler-designated MID and thus destroys the data-task affinity cultivated by the compiler. We call this simple extension the MID-insensitive extension. Alternatively, page mapping can be done by first identifying the MID from the embedded MID bits and then carrying out the page-selection process inside the designated memory module. This extension keeps the compiler-cultivated MID information for the address and is hence preferred. The OS maintains an independent pool of free pages for each memory module. If page-coloring is used to select a bin on the target memory module, we have MID-sensitive page-coloring. Similarly, if bin-hopping is used, we have MID-sensitive bin-hopping.
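The following C sketch contrasts the two MID-sensitive allocators just described. The per-module, per-color free lists and the constants are our own assumptions for illustration; the paper does not prescribe a particular data structure:

    #include <stdlib.h>

    #define NUM_NODES  16
    #define NUM_COLORS 128   /* cache-bins per module; illustrative */

    typedef struct page { struct page *next; unsigned frame; } page_t;

    /* One free-page pool per memory module, bucketed by cache-bin ID. */
    static page_t *free_pool[NUM_NODES][NUM_COLORS];
    static unsigned next_bin;  /* globally managed cursor for bin-hopping */

    static page_t *take(unsigned mid, unsigned color) {
        page_t *p = free_pool[mid][color];
        if (p) free_pool[mid][color] = p->next;
        return p;
    }

    /* MID-sensitive page-coloring: honor the virtual page's color within
       the compiler-designated module; pick randomly if it is unavailable. */
    page_t *alloc_pc(unsigned mid, unsigned vcolor) {
        page_t *p = take(mid, vcolor);
        for (int tries = 0; !p && tries < 4 * NUM_COLORS; tries++)
            p = take(mid, (unsigned)rand() % NUM_COLORS);
        return p;
    }

    /* MID-sensitive bin-hopping: ignore the virtual color and hand out
       consecutive cache-bin IDs, hopping to the next bin on failure. */
    page_t *alloc_bh(unsigned mid) {
        for (int tries = 0; tries < NUM_COLORS; tries++) {
            page_t *p = take(mid, next_bin);
            next_bin = (next_bin + 1) % NUM_COLORS;
            if (p) return p;
        }
        return NULL;
    }

Note that next_bin is shared across all modules; as explained below, this global management is what spreads the cache-bin IDs of the in-use pages evenly across the bins regardless of which module serves each fault.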

[Figure 4: Part of the page-color bits are replaced by the selected MID bits. From bit 31 down to bit 0: page color, MID bits, cache line, page offset; the high part of the page color is overwritten with the selected MID bits to form the modified color.]

On a CC-NUMA machine, the distributed data in different memory modules may compete for a limited portion of each node's private cache. This problem can be illustrated by Figure 3, assuming a system with 16 processors, a 32-byte cache line, a 4KB page size, and bin-index bits that are separate from the MID bits. Array A1 has 650 double words (5KB), starting from 0x100001a8, and array I1 has 256 double words, starting from 0x10001728. Because the arrays are distributed by cache line to the 16 processors, and the MID is considered part of the virtual page number, there will be 16 page faults. All 16 segments of the same array share one page color, although originally a 5KB array occupies two different pages. As a result, all elements of A1 and I1 have zero as the page color, and they will compete for one cache-bin if the page-coloring scheme is applied. Bin-hopping can solve this problem by using a globally managed cache-bin ID to allocate pages. That is, if processor x has a virtual page with MID m_x mapped to a page with n as the cache-bin ID, processor y, which comes after x to ask for a physical page with MID m_y, will get n + 1 as the cache-bin ID. Note that m_x and m_y may differ. This mapping technique guarantees that the cache-bin IDs of all the in-use physical pages spread evenly across the available cache-bins. Hence all 16 faulting pages will be mapped to 16 different cache-bins. For the page-coloring case, we propose to hash the cache-bin ID with part of the MID bits, as described below. Suppose the page color has c bits; we can take k of the MID bits, treat them as part of the color of a page, and then carry out page-coloring based on the modified color. We call these k bits the replacing bits. This is explained in Figure 4. When we select one MID bit as part of the color, we essentially divide the cache sets into halves. Suppose the MID bit we pick is the highest one and the number of MID bits is n; then the addresses with a MID of 2^(n-1) - 1 or smaller will map to the first half of the cache, and the addresses with a MID of 2^(n-1) or larger will occupy the second half. In the above example, A1's MID bits are different from I1's. If we select two replacing bits, we will get cache-bin 96 for A1 and cache-bin 64 for I1.
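A small C sketch of this hash follows, continuing the earlier illustrative layout; the bit widths (c = 7 color bits) and field positions are assumptions, not values fixed by the scheme:

    /* Replace the top k page-color bits with the top k of the n MID bits. */
    unsigned hash_color(unsigned vcolor,   /* original page color (c bits) */
                        unsigned mid,      /* module ID (n bits) */
                        unsigned c, unsigned n, unsigned k) {
        unsigned low_mask = (1u << (c - k)) - 1;   /* low color bits kept */
        unsigned repl = mid >> (n - k);            /* top k MID bits */
        return (repl << (c - k)) | (vcolor & low_mask);
    }

With c = 7, n = 4, and k = 2, a page of color 0 whose top two MID bits are 11 hashes to color 96 (1100000 in binary), and one whose top two MID bits are 10 hashes to color 64, matching the A1/I1 example above.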

Program   Speed-Up      L2 Cache Miss     L1 Cache Miss      Memory Miss
Adi       5.5 - 6.5     1.0 - 1.2%        21.4 - 22.0%       41.16 - 44.13%
Swm256    9.5 - 12.5    0.41 - 1.86%      11.3 - 11.5%       16.08 - 16.22%
Ora       14.5 - 16.5   0.007 - 6.440%    26.07 - 53.09%     45.07 - 89.89%
Mgrid     4.3 - 6.5     0.67 - 1.92%      9.15 - 18.34%      41.06 - 45.96%
Tomcatv   2.5 - 4.0     1.54 - 8.62%      12.11 - 25.60%     43.37 - 66.79%

Table 1: Speed-Up and Miss Ratio Summary

The references are therefore scattered. It may happen that we fail to find a wanted page, in which case a page is picked randomly from the free-page pool.

3 Experimental Setup

Each program is parallelized and instrumented by Panorama [4], an inter-procedural parallelizing compiler developed at the University of Minnesota, and compiled by SGI's f77 compiler with the -O2 flag on IRIX 5.3. The executable object code is then fed to our multiprocessor simulator (NUMAsim). All simulations are done on an SGI Challenge cluster with MIPS R10000 microprocessors. We select block scheduling, also known as simple scheduling on the SGI cluster, as the default scheduling technique, since our experiments [9] showed that other scheduling schemes are inferior to it for our benchmarks. NUMAsim is an execution-driven multiprocessor simulator based on MINT [12], with modifications to support multiple issue, out-of-order execution, and weakly ordered memory consistency. The simulated system has 16 nodes. Each node has two levels of non-blocking cache: an on-chip 16-KB level-one (L1) cache and an off-chip 4-MB level-two (L2) cache. The level-two caches of different nodes are kept coherent (see our technical report [5] for the details of the simulator). The Panorama compiler uses a data-task co-allocation scheme [8] to align the data with the tasks. It then instruments the FORTRAN source code by inserting directives to identify the starting and ending address of each array and to specify the data-allocation decisions. The simulator uses the inserted information and re-maps addresses at run time for simulation. We first evaluate the effect of the MID-sensitive schemes without introducing context switches. However, the existence of context switches in a multiprogramming environment does change the behavior of programs. In order to simulate the multiprogramming environment, we select a time slice of 4 million cycles between context switches and flush the whole cache after each switch. This setup is only used to compare MID-insensitive bin-hopping and MID-sensitive bin-hopping. Various statistics are collected. The execution time is measured in CPU cycles.


[Figure 5: Execution Time for Tomcatv. Execution time (in 10M cycles) versus L2 associativity (2 to 8) for PC with 0-4 replacing bits (p0-p4) and BH (p0).]

Since not all references go through the level-two cache, we define the extended cache miss rate as the number of level-two cache misses over the total number of references. We also define the memory miss rate as the number of remote memory accesses over the total number of cache misses and write-backs. Coherence misses refer to the misses caused by invalidation. Table 1 lists the speed-up and cache-miss information for the benchmarks. The data summarizes the statistics we collected over a variety of simulation setups, and it also shows the parallelism and memory-reference characteristics of these programs.
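In symbols, the two rates just defined are:

    \[
      \text{extended miss rate} = \frac{\#\,\text{L2 misses}}{\#\,\text{references}},
      \qquad
      \text{memory miss rate} = \frac{\#\,\text{remote memory accesses}}{\#\,\text{cache misses} + \#\,\text{write-backs}}
    \]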

4 Data and Discussions

4.1 Evaluation of Schemes

Data was collected on the MID-sensitive extensions of page-coloring (PC) and bin-hopping (BH). For MID-sensitive page-coloring (MSPC), all 4 MID bits could potentially be used as replacing bits. We present 0, 1, and 2 bits here because previous experiments [9] on a different processor model revealed that 3 and 4 replacing bits have relatively poor performance. For the bin-hopping (MSBH) case, we only show the zero-replacing-bit result with globally managed cache-bin IDs; this scheme spreads out the cache-bin accesses in the best way, and we did not find it useful to overlap the MID and the bin-index bits. Figures 5 to 14 graph the execution time and the extended cache miss rate over different associativities of the level-two cache. In the figures, PC -- px means page-coloring with x MID bits replacing the highest x page-color bits; BH refers to bin-hopping with globally managed cache-bin IDs.

From the data, we make the following observations.

[Figure 6: Extended Cache Miss Rate for Tomcatv. Extended miss rate (percent) versus L2 associativity (2 to 8) for PC -- p0 through p4 and BH -- p0.]

[Figure 7: Execution Time Curves for Swm256. Execution time (in 10M cycles) versus L2 associativity for the same schemes.]

[Figure 8: Extended Cache Miss Rate for Swm256. Extended miss rate (percent) versus L2 associativity for the same schemes.]


[Figure 9: Execution Time Curves for Adi. Same axes and schemes as above.]

[Figure 10: Extended Cache Miss Rate for Adi.]

[Figure 11: Execution Time Curves for Ora.]


[Figure 12: Extended Cache Miss Rate for Ora.]

[Figure 13: Execution Time Curves for Mgrid.]

[Figure 14: Extended Cache Miss Rate for Mgrid.]


Page-coloring with zero replacing bits performs worse than all the other schemes. Page-coloring with one replacing bit improves performance over the zero-replacing-bit case by as much as 50% for Swm256, 45% for Tomcatv, 35% for Mgrid, and 40% for Ora, while Adi sees only about a 1% improvement. When two replacing bits are selected, we see up to 40% improvement for Tomcatv, 50% for Swm256, 10-30% for Ora, 10-35% for Mgrid, and 1% for Adi. Note that bin-hopping with globally managed cache-bin IDs performs better than MSPC with one or more replacing bits. Performance improves with higher associativity of the level-two cache, but the above findings remain mostly consistent. The advantage of two replacing bits over one diminishes quickly as the associativity of the level-two cache increases. In our simulation setup, sixteen bits are used for the page offset and cache-line offset. Seven to nine bits are used for the cache-bin ID, depending on the associativity of the level-two cache, and the size of a cache-bin varies from 8KB to 32KB. An array of 512KB (nineteen bits) in size will occupy eight different cache-bins. Ora has only two arrays; the other data are scalars. In total, 6KB of data competes for one cache-bin. Some of the loops referencing these arrays are sequential, so a higher miss rate in accessing these arrays lengthens the critical path. In Tomcatv, seven of the nine major arrays are 512KB each. Adi has five privatized arrays, totaling 4KB, mapped to the same cache-bin. Although the remaining nine arrays are 512KB each, eight of them are copied to local arrays (2KB each) for computation; only one 512-KB array is continuously used in parallel loops. Hence the application of replacing bits does not help much. Mgrid has 8 arrays that are trivial in size, and three other shared arrays that are 326KB each, competing for six cache-bins. Most arrays in Swm256 are 512KB in size. When a whole array is referenced in the same loop repeatedly, MSPC sees many conflict misses, and its performance is poor. Bin-hopping does not honor the color of the virtual page; it assigns a different cache-bin ID each time a page fault occurs, in round-robin style, which spreads the cache references evenly. When some replacing bits are used for MSPC, we take advantage of the differing MID bits among the references and make better use of the cache. Since MID-sensitive bin-hopping has superior performance in all five benchmarks, and it does not involve the complication of using non-page-color bits, this scheme is a good choice for CC-NUMA multiprocessors.

4.2 MID-sensitive and MID-insensitive Bin-hopping

Data-task affinity directly affects the service time of non-coherence misses. In the presence of context switches, more cold misses are introduced because the cache is flushed periodically.

    Doall I = 1 to N
      Do J = 1 to N
        A(J, I) = B(J, I) * C(J, I) - D(J, I)
      enddo
    enddo
    Doall J = 1 to N
      Do I = 2 to N
        A(J, I) = B(J, I) * E(J, I) + A(J, I-1)
      enddo
    enddo
    Example of poor inter-task affinity

    Doall I = 1 to N
      Do J = 1 to N
        A(J, I) = B(J, I) * C(J, I) - D(J, I)
      enddo
    enddo
    Doall I = 1 to N
      Do J = 2 to N
        A(J, I) = B(J, I) * E(J, I) + A(J-1, I)
      enddo
    enddo
    Example of good inter-task affinity

Figure 15: Task Affinity
Program   Extended Miss  Coherence Miss  Memory Miss  Execution Time
Swm256    0.411%         0.296%          16.22%       28.2M
Mgrid     0.672%         0.269%          41.06%       39.6M
Adi       0.986%         0.794%          41.16%       65.1M
Tomcatv   1.555%         1.120%          54.07%       19.5M
Ora       0.007%         0.0001%         41.06%       6.87M

Table 2: Program behavior without context switch

In this case, the MID-insensitive bin-hopping scheme (MIBH) is expected to perform poorly, since it destroys the compiler-cultivated data-task co-allocation information. Swm256 and Mgrid conformed to this expectation, while for Ora, Tomcatv, and Adi, MID-sensitive bin-hopping (MSBH) did not outperform its counterpart. In this simulation, the first-level cache is set to 4-way and the secondary cache to 8-way associative. The data are summarized in Tables 2 to 6. For Tomcatv and Adi, the affinity between parallel tasks is poor. This is because some loops access the data by row while others sweep it by column (illustrated by Figure 15). These poorly aligned parallel loops are enclosed in a sequential loop and execute iteratively, so data moves back and forth between different nodes throughout the execution. The performance is dominated by coherence-related misses: their extended miss rates are relatively high (1% - 1.6%), and most misses are coherence-related (70% - 80%).
Scheme   Swm256  Mgrid  Adi   Tomcatv  Ora
MIBH     57.8    67.5   84.1  18.8     7.3
MSBH     27.5    49.5   85.4  18.9     7.3

Table 3: Execution time with context switch (in millions of cycles)


Scheme   Swm256  Mgrid   Adi     Tomcatv  Ora
MIBH     2.363%  2.588%  2.724%  2.508%   0.010%
MSBH     0.719%  1.427%  2.759%  2.494%   0.010%

Table 4: Extended cache miss rate with context switch
Scheme   Swm256  Mgrid   Adi     Tomcatv  Ora
MIBH     0.411%  0.513%  0.245%  0.403%   0.0001%
MSBH     0.191%  0.391%  0.241%  0.415%   0.0001%

Table 5: Coherence miss rate with context switch

Data allocation therefore does not play as important a role, and MIBH performs similarly to MSBH. Ora is mainly a scalar program. Its extended miss rate is very low (0.007%), and there are almost no coherence-related misses, so little data allocation is involved. Moreover, since the execution time for Ora is short (7 million cycles), only one context switch occurs. As expected, the MSBH and MIBH schemes perform similarly. Swm256 and Mgrid have good inter-task affinity and fairly low extended miss rates (below 0.7%). After the introduction of context switches, many cold misses are triggered, which pumps up the extended miss rate. A high miss rate and a high percentage of non-coherence misses emphasize the importance of data-task co-allocation, as indicated by MIBH's high memory miss rate. Furthermore, the longer execution time of MIBH brings more context switches, which drives up the extended miss rate further. We see that MSBH is about 27-50% faster than MIBH. The execution time of Swm256 with context switches under MSBH is shorter than that without context switches because the memory miss ratio of the former is much lower than that of the latter: some coherence misses are converted to cold misses, which are serviced by the local memory instead of a remote cache. To summarize, in a multiprogramming environment, non-coherence misses are relatively important. The percentage of data residing in remote memory modules decides the latency caused by these kinds of misses.
Scheme   Swm256  Mgrid   Adi     Tomcatv  Ora
MIBH     35.19%  30.51%  53.51%  51.39%   76.27%
MSBH     10.50%  20.39%  53.06%  49.65%   75.79%

Table 6: Memory miss rate with context switch


The compiler can usually cultivate effective data-allocation information in the virtual address. We should honor this embedded information in the page-mapping step in order to achieve good performance.

5 Conclusion

For the system we simulated and the benchmarks we tested, MID-sensitive bin-hopping with globally managed cache-bin IDs performs better than MID-sensitive page-coloring. For page-coloring, it is important to properly overlap the MID bits with the bin index. MID-sensitive page mapping outperforms MID-insensitive page mapping significantly when non-coherence cache misses are heavy. Our preliminary results suggest that MID-sensitive bin-hopping is a good choice for CC-NUMA multiprocessors with configurations similar to the one we simulated. Future work is needed to see how our observations stand for other programs and other system configurations, including programs with bigger data sets running on larger systems.

References

1. A. Agarwal, D. Kranz, and V. Natarajan. Automatic partitioning of parallel loops and data arrays for distributed shared memory multiprocessors. In Proc. International Conference on Parallel Processing, volume I: Architecture, pages 2-11, St. Charles, IL, 1993.
2. J. M. Anderson and M. S. Lam. Global optimizations for parallelism and locality on scalable parallel machines. In Proc. ACM SIGPLAN Conf. on Prog. Lang. Design and Imp., pages 112-125, June 1993.
3. E. Bugnion, J. Anderson, T. Mowry, M. Rosenblum, and M. S. Lam. Compiler-directed page coloring for multiprocessors. In Proc. of the 7th Int. Sym. on Architectural Support for Programming Languages and Operating Systems, October 1996.
4. J. Gu, Z. Li, and G. Lee. Experience with efficient array data-flow analysis for array privatization. In Proc. of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, June 1997.
5. J. Huang and Z. Li. Reducing cache misses for CC-NUMA by careful page-mapping. Technical Report 97-036, Dept. of Computer Science and Engineering, University of Minnesota.
6. R. E. Kessler and M. D. Hill. Page placement algorithms for large real-indexed caches. ACM Transactions on Computer Systems, 10(4), November 1992.
7. W. Li and K. Pingali. Access normalization: Loop restructuring for NUMA computers. ACM Transactions on Computer Systems, 11(4), November 1993.
8. T. N. Nguyen. Inter-procedural Compiler Analysis for Reducing Memory Latency and Network Traffic. PhD thesis, University of Minnesota, 1996.
9. T. N. Nguyen, Z. Li, J. Huang, G. Jin, and D. Kim. Performance evaluation of memory allocation schemes on CC-NUMA multiprocessors. Technical Report 96-043, Department of Computer Science, University of Minnesota.
10. W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes: The Art of Scientific Computing (Fortran Version). Cambridge University Press, 1989.
11. T. Romer, D. Lee, B. Bershad, and J. Chen. Dynamic page-mapping policies for cache conflict resolution on standard hardware. In Proc. of the First Symposium on Operating Systems Design and Implementation, November 1994.
12. J. E. Veenstra and R. J. Fowler. MINT tutorial and user manual. Technical Report 452, Dept. of Computer Science, University of Rochester, June 1993.


