
 

Lecture 36: Alan Edelman and Julia Language



Lecture 36: Alan Edelman and Julia Language

In this video, Alan Edelman discusses the power of programming languages for machine learning and their importance in mathematics. He highlights the recent recognition of the Julia language by Google for its technical merits and usability in machine learning. Edelman explains how forward-mode automatic differentiation works in Julia and, as an example, computes the derivative of the square root of x via the Babylonian algorithm without using numerical finite differences or a symbolic derivative formula. He also discusses the use of types in Julia for efficient computation and shows how block matrices can simplify the backpropagation process. Overall, Edelman stresses the importance of linear algebra for mathematical computations and its role in understanding complex phenomena.

  • 00:00:00 In this section, Alan Edelman discusses a demonstration by Professor Strang on row rank equals column rank and how this concept applies to the zero matrix. He then talks about recent developments in Julia, a programming language that has been gaining traction in machine learning, and how Google has recognized its power in this field. Google recently published a blog post stating that there are only two languages powerful enough for machine learning, and Julia is one of them. Edelman provides examples to illustrate this point and encourages students to check out the blog post for further information.

  • 00:05:00 In this section, Alan Edelman discusses the importance of programming languages in a mathematical sense and their ability to do more than just implement algorithms. He explains that Julia, Swift, C++, and Rust are the four programming languages deemed appropriate for machine learning based on their technical merits and usability. Edelman stresses the importance of linear algebra as a basis for all courses in engineering and notes that, historically, it arrived on the scene unfortunately late. He then delves into automatic differentiation and how it relates to calculus, his initial skepticism towards it, and the technical details he has explored in his notebook on forward-mode automatic differentiation.

  • 00:10:00 In this section, Alan Edelman discusses his initial thoughts on automatic differentiation and how he thought it was just like calculus that he learned in school, but with a computer. However, he soon realized that there was a third method that was neither finite differences nor the chain rule, which fascinated him. He then shares an example of how he computed the square root of x using the Babylonian algorithm in Julia, and how he was able to get the derivative of the square root without explicitly typing the derivative formula, thanks to Julia's automatic differentiation feature.

  • 00:15:00 In this section, the speaker describes using Julia code to calculate the derivative of the square root of a number without using finite-difference calculations. The code creates a type called a "dual number," a pair of floats representing a function value and its derivative. The speaker then overloads the plus and divide operations to implement the sum and quotient rules, allowing the Babylonian algorithm to produce both the square root and its derivative. The code works without any numerical finite differences, and the speaker notes that Julia allows existing code to be reused in new contexts to perform this "magic".
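
A minimal C sketch of the dual-number idea (the lecture uses Julia's operator overloading so that the original Babylonian code runs unchanged; here the dual arithmetic is written out by hand, and all names are illustrative):

```c
#include <stdio.h>

/* A dual number carries a value and its derivative. */
typedef struct { double val, deriv; } Dual;

/* Sum rule and quotient rule, written out by hand.  In Julia these would be
   overloads of + and / so that existing code "just works" on dual numbers. */
static Dual dual_add(Dual a, Dual b) {
    return (Dual){ a.val + b.val, a.deriv + b.deriv };
}
static Dual dual_div(Dual a, Dual b) {
    return (Dual){ a.val / b.val,
                   (a.deriv * b.val - a.val * b.deriv) / (b.val * b.val) };
}

/* Babylonian iteration for sqrt(x): t <- (t + x/t) / 2. */
static Dual babylonian_sqrt(Dual x, int iters) {
    Dual t   = { 1.0, 0.0 };   /* initial guess is a constant (derivative 0) */
    Dual two = { 2.0, 0.0 };
    for (int i = 0; i < iters; i++)
        t = dual_div(dual_add(t, dual_div(x, t)), two);   /* t = (t + x/t)/2 */
    return t;
}

int main(void) {
    Dual x = { 49.0, 1.0 };    /* seed derivative dx/dx = 1 */
    Dual r = babylonian_sqrt(x, 20);
    /* Expect r.val ~ 7 and r.deriv ~ 1/(2*sqrt(49)) ~ 0.0714 */
    printf("sqrt  = %.10f\nderiv = %.10f\n", r.val, r.deriv);
    return 0;
}
```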

  • 00:20:00 In this section, Alan Edelman explains how the Julia programming language can efficiently compute the derivative by running the Babylonian algorithm on a dual number, all the way down to tight assembler code. He demonstrates how the same computation in Python's symbolic computation package produces symbolic expressions with enormous coefficients, which is very inefficient. He also mentions the SVD as another algorithm that helped convince him of how well this approach works. By effectively taking the derivative of every line of the code, the algorithm converges to both the square root and the derivative of the square root. The resulting derivative is neither symbolic nor numerical; it applies the quotient rule and addition rule at every step to get the answer.

  • 00:25:00 In this section, Alan Edelman, a co-creator of the Julia language, discusses how automatic differentiation works in the language. Instead of manually taking derivatives of each line, Edelman explains that software can do this automatically by letting the JIT compiler handle it. This eliminates the need for writing a translator or differentiating by hand, making coding much more streamlined. Edelman notes that machine learning relies heavily on optimization problems, which are all about taking derivatives, making automatic differentiation an essential component of the process. Finally, he explains how using types can simplify creating structured matrices to store data.

  • 00:30:00 In this section, Alan Edelman discusses the use of types in Julia to efficiently store only what is needed when performing computations, which sets it apart from languages like Python and MATLAB that carry more overhead. He then briefly touches on the idea of reverse-mode differentiation in neural networks, starting with a scalar example and generalizing to matrices and vectors. He writes out the linear algebra involved in this process, but runs out of time before he can fully explain it.

  • 00:35:00 In this section, Edelman explains how to use block matrices in Julia to perform backpropagation without having to compute the derivatives by hand. He shows how a diagonal matrix and a lower-triangular matrix can simplify the backpropagation process and make use of Julia's built-in functions. Using linear algebra, he demonstrates how the backslash operator can solve the lower-triangular system, making the task of calculating derivatives much easier. Edelman emphasizes that linear algebra is essential for many mathematical computations and is the secret to understanding many complex phenomena.
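
As a hedged sketch of the linear algebra being referred to (a generic formulation, not necessarily Edelman's exact notation): differentiating each layer x_i = f_i(x_{i-1}) gives dx_i = A_i dx_{i-1}, and stacking the layers produces a unit lower-triangular block system that a triangular solve, Julia's backslash, resolves.

```latex
% Per-layer linearization: dx_i = A_i\, dx_{i-1}, with A_i = f_i'(x_{i-1}).
\begin{pmatrix}
 I     &      &   \\
 -A_2  & I    &   \\
       & -A_3 & I
\end{pmatrix}
\begin{pmatrix} dx_1 \\ dx_2 \\ dx_3 \end{pmatrix}
=
\begin{pmatrix} A_1\, dx_0 \\ 0 \\ 0 \end{pmatrix}
% Forward substitution on this lower-triangular system propagates a perturbation
% forward; the transposed (upper-triangular) solve corresponds to backpropagation.
```
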
Lecture 36: Alan Edelman and Julia Language
  • 2019.05.16
  • www.youtube.com
MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructors: Alan Edelman, Gilbert Strang. View the complete cou...
 

MIT 6.172 Performance Engineering of Software Systems, Fall 2018 - 1. Introduction and Matrix Multiplication



1. Introduction and Matrix Multiplication

In this YouTube video titled "1. Introduction and Matrix Multiplication," the lecturer discusses the importance of performance engineering and how it has evolved over time. Using the example of matrix multiplication, the speaker shows how coding techniques and machine specifications can greatly impact performance. The discussion covers topics such as loop order, cache usage, and parallel programming. The speaker also explores ways to optimize code for different processors and arithmetic calculations. Overall, the video provides valuable insights into the world of performance engineering and its practical applications in modern day computing systems.

  • 00:00:00 In this section, the lecturer discusses the importance of performance engineering and why it is studied. Performance is not always the top priority when it comes to software development. However, it is the currency of computing and is used to buy other desired properties such as ease of programming, security, and correctness. Performance degradation can cause unusable software, and people's main complaint about computing systems is that they are too slow. Therefore, while performance may not have intrinsic value, it contributes to the things that the developers care about.

  • 00:05:00 In this section, the speaker traces the history of performance engineering in computing: from the early days of intense performance engineering forced by limited machine resources, to the Moore's Law era in which chip densities doubled roughly every two years and waiting for hardware to catch up was often more effective than optimizing software. However, the speaker notes that Dennard scaling, which let clock speeds keep rising by scaling down transistor power, came to an end in 2004, so performance engineering is now more important than ever. The speaker includes quotes from computer scientists Donald Knuth, Bill Wulf, and Michael Jackson about the importance of readable code and cautioning against premature optimization.

  • 00:10:00 In this section, the speaker discusses the limits on clock speeds that were reached in the early 2000s due to power density, resulting in the development of multi-core processors to scale performance. However, this meant that performance was no longer free and required parallel programming, which was not something that the industry had done before. From that point on, software had to adapt to the new hardware configurations to be effective, and as a result, there was an increased focus on performance in software development jobs.

  • 00:15:00 In this section, the speaker begins by explaining the practical and theoretical importance of learning how to write software that uses modern hardware efficiently. They then introduce a running example of performance engineering, the well-studied problem of matrix multiplication, describing the machine they will use, including its specifications and capabilities such as parallel processing, cache sizes, and memory capacity, and concluding with the basic Python code for matrix multiplication. The speaker emphasizes the power of the machine and the potential for fun projects that utilize its capabilities.

  • 00:20:00 In this section, the lecturer discusses the performance of Python, Java, and C++ in the context of matrix multiplication. The lecture demonstrates that Python is too slow for large matrix multiplication, with a running time of approximately 21,000 seconds, Java is faster and produces almost a nine times speed-up over Python, and C++ is the fastest with a running time of approximately 1,100 seconds, which is twice as fast as Java and 18 times faster than Python.

  • 00:25:00 In this section, the speaker discusses the differences in performance between interpreted languages like Python, compiled languages like C, and languages like Java that are compiled to bytecode and then interpreted. Interpreters like Python are versatile but slow, whereas compilers like C take advantage of hardware and machine instructions to optimize performance. JIT compilers, like those used in Java, recover some performance by only compiling the most frequently executed pieces of code. The speaker notes that the performance model of Python is difficult to understand, which is why they will use C for performance comparisons in the course. However, there will be a guest lecture where they will discuss performance engineering in managed languages like Python.

  • 00:30:00 In this section, the importance of loop order for performance in matrix multiplication is discussed. The order of the loops affects cache locality, which impacts the running time. Matrices are laid out in memory in row-major order in C and in column-major order in Fortran. For the ijk loop order, the access pattern has good spatial locality for the output matrix C but poor spatial locality for the matrix B. The tool Cachegrind is useful for measuring the miss rates of code, and simple changes such as adjusting compiler flags can also improve performance.
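
A small C sketch of the loop-order effect (illustrative code, not the lecture's exact kernel). With C's row-major layout, the ikj order streams through rows of B and C with stride 1, while ijk strides down a column of B:

```c
#define N 1024
double A[N][N], B[N][N], C[N][N];

/* ijk order: the inner loop walks down a column of B (stride of N doubles),
   which has poor spatial locality in C's row-major layout. */
void matmul_ijk(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* ikj order: the inner loop walks along rows of B and C (stride 1),
   so consecutive accesses hit the same cache lines. */
void matmul_ikj(void) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];
}
```

Running both variants under Cachegrind (valgrind --tool=cachegrind) makes the difference in miss rates visible.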

  • 00:35:00 In this section, the speaker discusses how to optimize code to get better performance out of a machine. It is important to choose a good optimization flag, such as -O2 or -O3, but sometimes a lower optimization level actually produces faster code, depending on the program. Additionally, multi-core processors can be exploited with parallel loops using the Cilk infrastructure, resulting in a speed-up of almost 18x on 18 cores. The speaker emphasizes that parallelizing the outer loop is more effective than parallelizing the inner loop, which can actually slow the program down. However, there are still further opportunities for optimization, such as reducing cache misses and making effective use of vectorized instructions.
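
A sketch of parallelizing only the outer loop with Cilk, assuming an OpenCilk-style toolchain (compile with -fopencilk); the arrays are the same as in the previous sketch:

```c
#include <cilk/cilk.h>

#define N 1024
extern double A[N][N], B[N][N], C[N][N];

void matmul_parallel(void) {
    cilk_for (int i = 0; i < N; i++)     /* parallelize the outermost loop only */
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];
}
```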

  • 00:40:00 In this section, the speaker discusses how to reduce cache misses by restructuring the computation to reuse data in the cache as much as possible. Using a tiling scheme, the matrices are broken into smaller submatrices and multiplied block by block to reduce the number of memory accesses needed. The speaker explains that a tuning parameter determines the size of the submatrices, and suggests that experimentation is the best way to find the optimal value. Through this approach, the speaker demonstrates that it is possible to significantly reduce the number of memory accesses required for the same computation, making it more efficient.
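
A C sketch of the tiling idea, with the tile size S as the tuning parameter mentioned above (N is assumed to be a multiple of S for brevity):

```c
#define N 1024
#define S 64                 /* tile size: the tuning parameter, found by experiment */
double A[N][N], B[N][N], C[N][N];

void matmul_tiled(void) {
    for (int ih = 0; ih < N; ih += S)
        for (int kh = 0; kh < N; kh += S)
            for (int jh = 0; jh < N; jh += S)
                /* multiply one S x S block of A by one S x S block of B;
                   the blocks stay resident in cache while they are reused */
                for (int i = ih; i < ih + S; i++)
                    for (int k = kh; k < kh + S; k++)
                        for (int j = jh; j < jh + S; j++)
                            C[i][j] += A[i][k] * B[k][j];
}
```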

  • 00:45:00 In this section, the speaker discusses the benefits of using blocking, or tiling, when performing matrix multiplication, such as improved cache usage and faster runtime. They explain the different levels of caches that chips have, and how programmers can use tuning parameters to optimize their code for each cache level. However, the process of tuning becomes more complex with each level of cache, and the code can become hard to read and understand. The speaker suggests a divide and conquer approach that uses parallel programming to recursively solve smaller sub-problems. Although this method improves cache usage, the overhead of function calls slows down the computation, highlighting the need for clever performance engineering.
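
A compact, serial C sketch of the divide-and-conquer scheme for square power-of-two matrices stored row-major with leading dimension n (THRESHOLD plays the role of the base case discussed next):

```c
#define THRESHOLD 32   /* below this size, fall back to the ordinary triple loop */

/* C += A * B for size x size submatrices embedded in n x n row-major arrays. */
static void matmul_dac(double *C, const double *A, const double *B,
                       int size, int n) {
    if (size <= THRESHOLD) {
        for (int i = 0; i < size; i++)
            for (int k = 0; k < size; k++)
                for (int j = 0; j < size; j++)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];
        return;
    }
    int h = size / 2;                          /* split each matrix into quadrants */
    #define SUB(M, r, c) (M + (r)*h*n + (c)*h)
    matmul_dac(SUB(C,0,0), SUB(A,0,0), SUB(B,0,0), h, n);
    matmul_dac(SUB(C,0,0), SUB(A,0,1), SUB(B,1,0), h, n);
    matmul_dac(SUB(C,0,1), SUB(A,0,0), SUB(B,0,1), h, n);
    matmul_dac(SUB(C,0,1), SUB(A,0,1), SUB(B,1,1), h, n);
    matmul_dac(SUB(C,1,0), SUB(A,1,0), SUB(B,0,0), h, n);
    matmul_dac(SUB(C,1,0), SUB(A,1,1), SUB(B,1,0), h, n);
    matmul_dac(SUB(C,1,1), SUB(A,1,0), SUB(B,0,1), h, n);
    matmul_dac(SUB(C,1,1), SUB(A,1,1), SUB(B,1,1), h, n);
    #undef SUB
}
```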

  • 00:50:00 In this section, the speaker discusses optimizing matrix multiplication using the divide-and-conquer technique and adjusting the threshold at which the recursion stops. Below the threshold, a base case uses ordinary matrix multiplication. The best value for the base case was found to be 32, resulting in a roughly 1.3-second runtime and about 12% of peak performance. Additionally, the speaker introduces vectorization using the vector hardware, which processes data in a single-instruction, multiple-data (SIMD) fashion. The speaker outlines different ways to encourage vectorization, including vectorization reports, flags for specific architectures, and the fast-math flag, which allows the compiler to reassociate floating-point operations. Using the architecture-native and fast-math flags roughly doubles the performance obtained from the compiler's vectorizer.

  • 00:55:00 In this section of the video, the speaker discusses the various bit sizes used for arithmetic calculations, with 64-bit being the most common for linear algebra calculations. They also mention that lower precision arithmetic can be used for AI applications to improve performance. The speaker goes on to talk about using vector instructions and the AVX intrinsics, which provide up to 41% peak performance and a speed-up of about 50,000. They caution that performance improvements similar to those achieved in matrix multiplication may not be seen in other areas, but this course will teach students how to achieve substantial performance improvements on their own. Additionally, the course will focus on multi-core computing rather than GPUs or other areas.
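
For flavor, a tiny example of the kind of AVX intrinsics being referred to; this is a generic 4-wide multiply-add loop, not the lecture's actual kernel (compile with -mavx):

```c
#include <immintrin.h>

/* c[i] += a[i] * b[i], four doubles at a time; n assumed to be a multiple of 4. */
void vec_muladd(double *c, const double *a, const double *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);
        __m256d vb = _mm256_loadu_pd(b + i);
        __m256d vc = _mm256_loadu_pd(c + i);
        vc = _mm256_add_pd(vc, _mm256_mul_pd(va, vb));
        _mm256_storeu_pd(c + i, vc);
    }
}
```
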
1. Introduction and Matrix Multiplication
  • 2021.10.06
  • www.youtube.com
MIT 6.172 Performance Engineering of Software Systems, Fall 2018. Instructor: Charles Leiserson. View the complete course: https://ocw.mit.edu/6-172F18. YouTube Pl...
 

Lecture 2. Bentley Rules for Optimizing Work



2. Bentley Rules for Optimizing Work

This YouTube video discusses various optimization techniques for computer programs. The Bentley rules for optimizing work are introduced and the optimizations are grouped into data structures, loops, logic, and functions. Different techniques like encoding values, data structure augmentation, pre-computation, caching, and utilizing sparse matrices are discussed. The speaker also touches on the benefits of using a sparse matrix representation for graphs, logic optimization and collision detection optimization in graphics programs. Implementing these optimization techniques helps in reducing the running time of programs, making them more efficient.

The second part of the video covers several categories of optimization techniques, including loop hoisting, the use of sentinels in loops, loop unrolling and fusion, and function inlining. The speaker advises against premature optimization and emphasizes the importance of maintaining correctness and using regression testing. The video also outlines the Bentley Rules for Optimizing Work, a six-step guide to increasing productivity and achieving goals in an efficient manner. These rules include establishing clear goals, breaking tasks down, planning and organizing, prioritizing tasks, minimizing distractions, and regularly reviewing and adjusting one's approach.

  • 00:00:00 In this section, the lecturer introduces the concept of optimizing work in computer programs and explains how reducing the work of a program can improve its performance. He also introduces the Bentley rules for optimizing work, named after Jon Louis Bentley, who wrote a classic book on writing efficient programs. The optimizations are grouped into four categories: data structures, loops, logic, and functions, and the lecturer plans to cover all 22 rules in a series of mini-lectures throughout the course. While reducing the work of a program is a good heuristic for improving its running time, the complex nature of computer hardware means that it does not always translate directly into a reduction in running time, so the lecturer will also cover architecture-specific optimizations later in the course.

  • 00:05:00 In this section of the video, the speaker discusses packing and encoding data structures: storing more than one data value in a machine word, and converting data values into a representation that requires fewer bits. Reducing the amount of data that has to be moved around in memory is a good heuristic for decreasing the running time of a program. The speaker gives the example of encoding dates so they can be stored in fewer bits while still making it easy to fetch the month, year, or day of a specific date. The speaker suggests using C's bit-field facilities to declare the packed structure, making the fields directly accessible. This representation requires slightly more bits than the minimal encoding but is much more efficient for accessing individual fields within the structure.
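
A C sketch of the packed-date idea using bit fields; the field widths are illustrative (4 bits cover 12 months, 5 bits cover 31 days, and the year field is sized arbitrarily):

```c
#include <stdio.h>

/* Packed date: fits in a single 32-bit word, yet each field is still
   directly addressable, which is the trade-off described above. */
typedef struct {
    unsigned int year  : 13;   /* 0..8191, e.g. an offset-based year */
    unsigned int month : 4;    /* 1..12 */
    unsigned int day   : 5;    /* 1..31 */
} PackedDate;

int main(void) {
    PackedDate d = { .year = 2018, .month = 9, .day = 23 };
    printf("%u-%02u-%02u, sizeof = %zu bytes\n",
           (unsigned)d.year, (unsigned)d.month, (unsigned)d.day,
           sizeof(PackedDate));
    return 0;
}
```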

  • 00:10:00 In this section, the video discusses three optimization techniques. The first is deciding whether to encode values as sequential integers or to unpack them for faster access. The second is data structure augmentation, where adding information to a data structure can speed up common operations; an example is adding a tail pointer to a singly linked list to make appending one list to another more efficient. The third is pre-computation: performing calculations in advance to avoid doing them at mission-critical times, for example pre-computing binomial coefficients with Pascal's triangle to speed up a program that uses them.
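
A small C sketch of the tail-pointer augmentation, with made-up names: appending one list to another becomes O(1) instead of a walk to the end.

```c
#include <stddef.h>

typedef struct Node { int value; struct Node *next; } Node;
typedef struct { Node *head; Node *tail; } List;   /* tail is the augmentation */

/* Append list b onto list a in constant time, leaving b empty. */
void list_append(List *a, List *b) {
    if (b->head == NULL) return;               /* nothing to append */
    if (a->head == NULL) { *a = *b; }
    else { a->tail->next = b->head; a->tail = b->tail; }
    b->head = b->tail = NULL;
}
```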

  • 00:15:00 In this section, the speaker discusses how to pre-compute Pascal's triangle using a recursive formula and C code. They explain that the recursive formula involves calling the choose function, and how to pre-compute the table at runtime using a loop. Additionally, they discuss how to initialize the table at compile time by putting the table values in the source code, which saves time during program execution. Finally, they provide an example table of binomial coefficients up to 10 that can be accessed during program execution.
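
A C sketch of the runtime pre-computation (CHOOSE_MAX is an illustrative bound chosen so the entries fit in 64 bits; the compile-time variant would instead paste the finished table into the source):

```c
#define CHOOSE_MAX 64
static long long choose_table[CHOOSE_MAX][CHOOSE_MAX];

/* Fill Pascal's triangle once, up front, using C(n,k) = C(n-1,k-1) + C(n-1,k). */
void init_choose(void) {
    for (int n = 0; n < CHOOSE_MAX; n++) {
        choose_table[n][0] = 1;
        for (int k = 1; k < n; k++)
            choose_table[n][k] = choose_table[n-1][k-1] + choose_table[n-1][k];
        choose_table[n][n] = 1;
    }
    /* After this, choose_table[n][k] replaces a recursive choose(n, k) call. */
}
```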

  • 00:20:00 In this section, the speaker introduces the concept of meta programming as a way to avoid manually typing in large tables of constant values into a program. By writing a program that generates the necessary code, the tedious task can be done automatically. The speaker provides a Python code snippet as an example. The topic of caching is also introduced, as a technique to avoid repeating costly computations by storing recently accessed results in a cache. The example given is calculating the hypotenuse of a right triangle using the square root operator, where the cache pre-stores previous hypotenuses, along with the values of a and b. The hypotenuse function first checks if the input values match those in the cache, and if so, returns the cached value without having to recompute the square root.
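
A minimal C sketch of the caching idea with a one-entry cache (the lecture's version similarly remembers recent inputs; a larger software cache generalizes this):

```c
#include <math.h>
#include <stdbool.h>

static double cached_a, cached_b, cached_h;
static bool cache_valid = false;

double hypotenuse(double a, double b) {
    if (cache_valid && a == cached_a && b == cached_b)
        return cached_h;                  /* cache hit: skip the expensive sqrt */
    cached_a = a;
    cached_b = b;
    cached_h = sqrt(a * a + b * b);
    cache_valid = true;
    return cached_h;
}
```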

  • 00:25:00 In this section, the speaker discusses the concept of caching to optimize work in a program. By storing commonly computed values in a cache, programs can save time by not having to repeatedly compute the same values. While hardware does caching, software can do it as well; for example, a cache of size 1,000 could store the 1,000 most recently computed values. This optimization can speed up a program by about 30% if the same values are repeatedly computed. The speaker then introduces the idea of exploiting sparsity in an input to avoid computing on its zero elements, thus saving computation time. They demonstrate this with matrix-vector multiplication and introduce the Compressed Sparse Row (CSR) data structure, which can speed up multiplication by sparse matrices.

  • 00:30:00 In this section, the speaker discusses how to optimize the storage and computational efficiency of sparse matrices using the Compressed Sparse Row (CSR) format. The CSR format stores the non-zero elements of a matrix in three arrays: the values array, the columns array, and the rows array. The speaker explains how to compute the length of a row and how to perform matrix-vector multiplication using the CSR format. The CSR format can be more space-efficient than the dense matrix representation if the number of non-zero elements is significantly less than N^2. However, for relatively dense matrices, the dense matrix representation might take less space. The speaker provides a code example for performing matrix-vector multiplication using the CSR format.
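
A C sketch of CSR matrix-vector multiplication following the three-array layout just described, where rows holds each row's starting offset plus a final entry equal to the total number of nonzeros:

```c
typedef struct {
    int     n;       /* number of rows */
    int    *rows;    /* length n+1: rows[i]..rows[i+1]-1 index row i's nonzeros */
    int    *cols;    /* column index of each stored nonzero */
    double *vals;    /* value of each stored nonzero */
} CsrMatrix;

/* y = A * x, touching only the nonzero entries of A. */
void csr_spmv(const CsrMatrix *A, const double *x, double *y) {
    for (int i = 0; i < A->n; i++) {
        double sum = 0.0;
        for (int k = A->rows[i]; k < A->rows[i + 1]; k++)
            sum += A->vals[k] * x[A->cols[k]];
        y[i] = sum;
    }
}
```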

  • 00:35:00 In this section, the instructor discusses the benefits of using a sparse matrix to represent a graph and how it can be utilized to run classic graph algorithms such as breadth-first search and PageRank more efficiently. The sparse graph representation consists of two arrays: offsets and edges, where the degrees of each vertex can be obtained by taking the difference between consecutive offsets. The weight of the edges can also be stored either in a separate array or interleaved with the edges to improve cache locality. The section ends with a brief introduction to logic optimizations, specifically constant folding and propagation, which evaluates constant expressions during compilation to reduce the runtime.

  • 00:40:00 In this section of the video, the speaker discusses different optimization techniques for code, including constant folding and propagation, common sub-expression elimination, and exploiting algebraic identities. By defining constants at compile time, the compiler can evaluate them and save time during runtime. Common sub-expression elimination allows for avoiding computing the same expression multiple times by storing the result for future use. Exploiting algebraic identities involves replacing more expensive expressions with cheaper alternatives. The speaker emphasizes that while the compiler is usually good at implementing these optimizations, it is still important to know them for situations when the compiler does not do it automatically.

  • 00:45:00 In this section of the video, the speaker discusses two optimization techniques. The first is reducing the usage of the square root operator, which is computationally expensive, by using algebraic identities to avoid square roots. The second optimization technique is short-circuiting, which involves stopping a series of tests as soon as the result is known, in the case where an array contains non-negative integers and the sum of the values in the array needs to be checked against a limit. The optimization can eliminate the need to look at all elements in the array and can speed up program execution, but should be used judiciously based on the frequency of when the test can be short-circuited.
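
Two tiny C sketches of these points: comparing squared quantities instead of calling sqrt, and exiting a summation early once the limit is exceeded (valid only because the values are non-negative):

```c
#include <stdbool.h>
#include <stddef.h>

/* Is sqrt(a*a + b*b) < c?  For c >= 0, compare squares and skip the sqrt. */
bool hypot_less_than(double a, double b, double c) {
    return c >= 0 && a * a + b * b < c * c;
}

/* Does the sum of n non-negative values exceed limit?  Stop as soon as it does. */
bool sum_exceeds(const unsigned *v, size_t n, unsigned long long limit) {
    unsigned long long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += v[i];
        if (sum > limit) return true;   /* short-circuit: no need to see the rest */
    }
    return false;
}
```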

  • 00:50:00 In this section, the video discusses short-circuiting logical operators and their usefulness in optimizing code. The double ampersand and double vertical bar are short-circuiting operators that can save time by not evaluating their right-hand operand when the left-hand one already determines the result. Additionally, the video suggests ordering a series of tests by how frequently they succeed and how expensive they are to execute, so that cheap, frequently decisive tests run first and expensive tests are often skipped. Finally, the video introduces the idea of creating a fast path by using cheap checks to exit early when the result is already known, for example checking whether the bounding boxes of two balls intersect before checking the other collision conditions.

  • 00:55:00 In this section, the lecturer discusses ways to optimize collision-detection tests in graphics programs. He suggests a fast-path test that determines whether two balls' bounding boxes intersect before performing the more expensive test for collision; this involves checking whether the absolute value of the difference on each coordinate is greater than the sum of the two radii. He also notes that combining tests through a single switch statement, or even table lookups, can greatly improve performance. Overall, these optimizations can be beneficial for many applications and graphics programs.
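
A C sketch of the fast-path idea for two balls (structure and names are mine): if the surrounding boxes are already separated along any coordinate, the more expensive sphere test is skipped.

```c
#include <math.h>
#include <stdbool.h>

typedef struct { double x, y, z, r; } Ball;

/* Fast path: the bounding boxes are disjoint if the centers are farther apart
   than r1 + r2 along any single axis. */
static bool boxes_disjoint(const Ball *a, const Ball *b) {
    double rr = a->r + b->r;
    return fabs(a->x - b->x) > rr ||
           fabs(a->y - b->y) > rr ||
           fabs(a->z - b->z) > rr;
}

bool balls_collide(const Ball *a, const Ball *b) {
    if (boxes_disjoint(a, b)) return false;            /* cheap early exit */
    double dx = a->x - b->x, dy = a->y - b->y, dz = a->z - b->z;
    double rr = a->r + b->r;
    return dx * dx + dy * dy + dz * dz <= rr * rr;      /* exact test, no sqrt */
}
```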

  • 01:00:00 In this section, the video covers the third category of optimizations, which relates to loops. The first loop optimization discussed is hoisting, which avoids recomputing loop-invariant code on every pass through the body of a loop: computing an expression once and storing it in a variable, rather than recomputing it in every iteration, saves running time. The second loop optimization is the use of sentinels, special dummy values placed in data structures to simplify the handling of boundary conditions and the logic of loop-exit tests. By adding two extra entries to the array as sentinels, the loop only needs to perform one check per iteration.
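
Two small C sketches of these loop optimizations, illustrative rather than the lecture's exact code: hoisting a loop-invariant expression, and using a sentinel so the search loop needs only one test per iteration.

```c
#include <math.h>
#include <stddef.h>

/* Hoisting: compute the loop-invariant factor once instead of every iteration. */
void scale(double *y, const double *x, size_t n, double theta) {
    double factor = exp(theta) / (1.0 + theta);   /* hoisted out of the loop */
    for (size_t i = 0; i < n; i++)
        y[i] = factor * x[i];
}

/* Sentinel: a[n] is an extra slot owned by the caller.  Planting the key there
   guarantees the loop terminates, so it needs only one test per iteration. */
size_t find(int *a, size_t n, int key) {
    a[n] = key;                 /* sentinel */
    size_t i = 0;
    while (a[i] != key)
        i++;
    return i;                   /* i == n means "not found" */
}
```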

  • 01:05:00 In this section, the speaker discusses two techniques for optimizing code: boundary conditions and loop unrolling. For boundary conditions, he shows how to modify a loop to efficiently handle the special case when all the elements of the input array are zero. By adding a dummy element to the end of the array and checking for an overflow, the code can detect when it should stop. For loop unrolling, he explains two types: full and partial. While full loop unrolling is rare due to larger loop sizes, partial loop unrolling reduces the number of iterations of a loop by combining several into one, which can improve performance by reducing the number of control flow instructions executed.

  • 01:10:00 In this section, the instructor discusses loop unrolling and loop fusion optimizations. Loop unrolling involves combining multiple iterations of a loop into a single iteration, thereby reducing the overhead of loop control and enabling more compiler optimizations. However, unrolling too much can negatively affect performance by polluting the instruction cache. Loop fusion, on the other hand, combines multiple loops over the same index range into a single loop, reducing loop control overhead and improving cache locality. The instructor also discusses eliminating wasted iterations by modifying loop bounds to avoid executing loop iterations over empty loop bodies.
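
Two more tiny C sketches: partial unrolling of a reduction by a factor of four (n assumed to be a multiple of 4 for brevity), and fusing two loops over the same index range.

```c
#include <stddef.h>

/* Partial unrolling: four adds per iteration means one quarter of the loop control. */
long sum4(const int *a, size_t n) {       /* n assumed to be a multiple of 4 */
    long s = 0;
    for (size_t i = 0; i < n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    return s;
}

/* Loop fusion: one pass over the data instead of two. */
void fused(double *c, double *d, const double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
        d[i] = a[i] - b[i];
    }
}
```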

  • 01:15:00 In this section, the lecturer discusses function optimizations, specifically inlining to avoid the overhead of a function call. By declaring a function static inline, the compiler will try to replace a call to the function with the body of the function itself, which eliminates the need for a separate function call. While macros can do this too, inline functions are more structured and evaluate each of their arguments exactly once, avoiding the risk of a macro pasting an expensive expression multiple times into the code. He advises against premature optimization and encourages developers to first ensure that their program is correct and to use regression testing to maintain correctness. Finally, he points out that many optimizations at this level can be done automatically by the compiler, and checking the assembly code can reveal whether they were.
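
A C sketch of the inlining point: a static inline function gets the convenience of a macro without the argument-reevaluation hazard (the macro below evaluates its argument twice).

```c
/* Macro version: SQUARE_MACRO(expensive_call()) calls expensive_call() twice. */
#define SQUARE_MACRO(x) ((x) * (x))

/* Inline-function version: the argument is evaluated exactly once, and the
   compiler is still free to paste the body at the call site. */
static inline double square(double x) {
    return x * x;
}
```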

  • 01:20:00 In this section, the Bentley Rules for Optimizing Work are outlined in a series of six steps. These rules consist of establishing clear goals, breaking tasks down into smaller parts, taking time to plan and organize, prioritizing tasks, minimizing distractions, and regularly reviewing and adjusting your approach. By following these guidelines, you can increase your productivity and achieve your goals in a more efficient manner. Additionally, incorporating these strategies into your daily routine can help you maintain a strong work ethic and stay focused on your objectives.
2. Bentley Rules for Optimizing Work
  • 2019.09.23
  • www.youtube.com
MIT 6.172 Performance Engineering of Software Systems, Fall 2018. Instructor: Julian Shun. View the complete course: https://ocw.mit.edu/6-172F18. YouTube Playlist...
 

Lecture 3. Bit Hacks



3. Bit Hacks

This YouTube video covers a variety of bit-manipulation topics, including binary representation, two's complement, bitwise operators, swapping variables without a temporary variable, and code optimization. The video demonstrates various bit tricks, such as finding the minimum of two integers without using if-else statements and swapping two integers without using a temporary variable. The speaker discusses unpredictable branches and introduces a no-branch minimum bit trick for cases where branches are unpredictable, and shows how bit hacks can optimize code by replacing costly operations such as division with simple bitwise operations. The video also discusses the de Bruijn sequence and bit tricks used in problems such as the N Queens problem.

The second part discusses solving the N Queens problem using bit vectors and efficiently counting the number of 1 bits in a binary word. Backtracking is used to implement the N Queens problem, and bit vectors are used to efficiently represent the board. Three checks are described for safely placing a queen in the N Queens problem, and a method for counting the number of 1 bits in a word by recursively eliminating the least significant 1 bit is presented. Additionally, the use of table lookup and register manipulation to count the number of 1 bits is discussed. The video ends with a demonstration of a divide-and-conquer approach to counting 1 bits that has a performance proportional to the log base two of the word length. Resources for further learning are also provided.

  • 00:00:00 In this section, we learn about binary representation for words and how to compute integer values from them. We also learn about the two's complement representation for signed integers and the special cases of the all zeros and all ones word. Furthermore, we see an important identity for the ones complement of X and its relation to the negative of X in two's complement representation.
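
The identity being referred to is that, in two's complement, -x equals ~x + 1; a one-line C check for illustration:

```c
#include <assert.h>

int main(void) {
    int x = 42;
    assert(-x == ~x + 1);   /* two's complement: negate = complement, then add 1 */
    return 0;
}
```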

  • 00:05:00 In this section, the presenter explains ones' complement and how to obtain the negative of a number by adding 1 to the ones' complement. He also goes over the use of hex numbers for representing large binary constants and gives a table for converting between hex and binary. The video then goes over the bitwise operators in C and shows examples of how to use them for bitwise AND, OR, exclusive-OR, and left and right shift operations.

  • 00:10:00 In this section, the video discusses various common idioms that can be implemented using bitwise operators in C. One example is setting or clearing the kth bit of a word x, using a shift and an OR to set it, or a shifted mask, a NOT, and an AND to clear it. Another example is extracting a bit field from a word x by generating a mask with ones in the positions of the desired bits and zeros elsewhere, then ANDing the mask with x and right-shifting the extracted bits. The video also mentions that these tricks are particularly useful when working with compressed or encoded data.
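
A C sketch of these idioms (bit positions and field widths are illustrative):

```c
#include <stdint.h>

/* Set, clear, or flip bit k of word x. */
uint64_t set_bit(uint64_t x, int k)   { return x |  ((uint64_t)1 << k); }
uint64_t clear_bit(uint64_t x, int k) { return x & ~((uint64_t)1 << k); }
uint64_t flip_bit(uint64_t x, int k)  { return x ^  ((uint64_t)1 << k); }

/* Extract a field of `width` bits starting at bit `shift`:
   mask off the field, then right-shift it down to bit 0. */
uint64_t extract_field(uint64_t x, int shift, int width) {
    uint64_t mask = (((uint64_t)1 << width) - 1) << shift;
    return (x & mask) >> shift;
}
```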

  • 00:15:00 In this section, the video discusses how to swap two integers without using a temporary variable by using some bit tricks. The code involves setting X equal to X XOR Y, then Y equal to X XOR Y, and finally X equal to X XOR Y. This works because XOR is its own inverse and the first line generates a mask with ones where the bits in X and Y differ, which is then used to flip the bits in Y that are different from X. This allows for the swap without the use of a temporary variable. The video also emphasizes the importance of masking to ensure safety when working with these tricks.

  • 00:20:00 In this section, the speaker discusses two bit hacks. The first hack is a method to swap two variables without using a temporary variable. The hack involves using XOR and a mask to flip the bits that differ between the two variables. While this hack is cool, it is not the most efficient way to swap variables due to poor instruction level parallelism. The second hack is a way to find the minimum of two integers without using if-else statements, which can cause performance issues due to branch misprediction. Instead, the speaker shows how to use bitwise operations to compare the integers and calculate the minimum without branching.
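
C sketches of the two hacks just described; the branchless minimum relies on the mask -(x < y), which is all ones when x < y and all zeros otherwise:

```c
#include <stdint.h>

/* Swap without a temporary: each XOR is its own inverse.
   Assumes x and y point to distinct objects. */
void xor_swap(uint64_t *x, uint64_t *y) {
    *x = *x ^ *y;      /* x now holds the mask of differing bits */
    *y = *x ^ *y;      /* y becomes the original x */
    *x = *x ^ *y;      /* x becomes the original y */
}

/* Branch-free minimum of two signed integers. */
int64_t branchless_min(int64_t x, int64_t y) {
    return y ^ ((x ^ y) & -(x < y));   /* mask is ~0 if x < y, else 0 */
}
```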

  • 00:25:00 In this section, the speaker discusses code optimization and the use of the "restrict" keyword in a subroutine that merges two sorted arrays. The process is explained through an example where two green arrays are merged into a blue array. The predictability of each branch in the code is also discussed, with the first branch being predictable while the second is unpredictable.

  • 00:30:00 In this section, the video discusses predictable and unpredictable branches in programming and introduces a no-branch minimum bit trick as a solution to the unpredictable branch issue. The trick involves using a variable called "comp" to store the comparison result between the first element of a and b and getting the minimum of a and b using a bit trick. Then, the minimum value is placed in C, and the value of A or B is incremented or decremented based on the value of "comp." The video notes that while the trick may not work in some cases, it is helpful to understand what the compiler is doing and that many bit hacks for words can extend to bit and word hacks for vectors, making it useful to know about these tricks.

  • 00:35:00 In this section, the video discusses bit hacks and their usefulness in programming. The example given is a bit trick for modular addition that avoids division: setting z to the sum of x and y and then checking whether it is less than n; the program returns z if it is within range, and z minus n otherwise to bring it back into range. This uses logic similar to the minimum trick and has an unpredictable branch that can cause a branch misprediction. Another example is rounding a value up to the nearest power of two by decrementing n, ORing n with itself right-shifted by successively larger amounts, and then incrementing at the end.
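
C sketches of these two tricks (the modular add assumes x and y are already reduced modulo n; the round-up operates on 64-bit words):

```c
#include <stdint.h>

/* (x + y) mod n without a divide, assuming 0 <= x, y < n. */
uint64_t add_mod(uint64_t x, uint64_t y, uint64_t n) {
    uint64_t z = x + y;
    return (z < n) ? z : z - n;   /* one unpredictable branch, but no division */
}

/* Round n up to the next power of two (exact powers of two map to themselves). */
uint64_t round_up_pow2(uint64_t n) {
    n--;                          /* so exact powers of two are preserved */
    n |= n >> 1;  n |= n >> 2;  n |= n >> 4;
    n |= n >> 8;  n |= n >> 16; n |= n >> 32;
    return n + 1;
}
```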

  • 00:40:00 In this section of the video, the speaker discusses two bit-manipulation problems. The first is computing the smallest power of two that is at least a given value n; the decrement at the start guarantees that the result is n itself when n is already a power of two. The second is computing the mask of the least-significant one bit in a word x, which is obtained by ANDing x with its negation, negative x. The speaker also presents code to find the index of that bit by multiplying x by a constant and looking up the result in a table. The section ends with the speaker performing a math magic trick involving bit manipulation.
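
A C sketch of both tricks: isolating the least-significant set bit with x & (-x), and recovering its index with a de Bruijn multiply-and-lookup. The constant is the widely used 32-bit de Bruijn number; the table is generated at startup rather than hard-coded:

```c
#include <stdint.h>

/* Mask of the least-significant 1 bit. */
uint32_t lsb_mask(uint32_t x) { return x & (-x); }

/* Index of the single set bit in a power of two, via de Bruijn multiplication. */
static const uint32_t DEBRUIJN = 0x077CB531u;   /* a de Bruijn sequence B(2,5) */
static int debruijn_table[32];

void init_debruijn_table(void) {
    for (int i = 0; i < 32; i++)
        debruijn_table[(DEBRUIJN << i) >> 27] = i;   /* each 5-bit window is unique */
}

int log2_pow2(uint32_t pow2) {        /* pow2 must have exactly one bit set */
    return debruijn_table[(pow2 * DEBRUIJN) >> 27];
}
```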

  • 00:45:00 In this section, a YouTube video shows a group performing a magic trick with cards and a bell. The performer claims to read the audience's minds and asks them to cut the deck of cards before distributing them. The performer predicts the suit and number of each person's card and is seemingly successful. They attribute their abilities to wearing an "awesome onesie" and a magic hat, as well as clearing the air with a bell.

  • 00:50:00 In this section, we learn about the de Bruijn sequence and how it is used to compute the log base 2 of a power of 2. A de Bruijn sequence is a cyclic bit sequence in which every possible bit string of length k occurs exactly once as a substring. Multiplying the de Bruijn constant by a power of 2 left-shifts the sequence, and the bit string that lands at the top of the word can be looked up in a conversion table that gives its starting position in the sequence, which is the log base 2 of the power of 2 we started with. This is also what resolves the card trick demonstrated earlier.

  • 00:55:00 In this section, the speaker discusses the de Bruijn sequence, a cyclic sequence over a binary alphabet that contains each possible n-bit string exactly once; it is the basis of the card trick performed earlier. The speaker explains how the bit trick works to determine the position of a substring of the de Bruijn sequence after a left shift. The performance of this trick is limited by the cost of the multiplication and the table lookup, but there is now a hardware instruction that computes the same thing.

  • 01:00:00 In this section, the speaker discusses the N Queens problem, which involves placing N chess queens on an NxN chessboard so that no two queens threaten each other. The problem is often implemented using backtracking, where queens are placed row by row and the program backtracks when no valid position can be found. Different representations of the board are also discussed, with the most compact being a set of three bit vectors. The down vector stores the presence of queens in each column, the left diagonal vector stores the presence of queens on each left diagonal, and the right diagonal vector stores the presence of queens on each right diagonal. Using bit vectors allows for a more efficient implementation of the algorithm.
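
A compact C sketch of the bit-vector backtracking just described: each recursion level is one row, and three words track the attacked columns and the two diagonal directions (written for boards smaller than 32x32):

```c
#include <stdint.h>

static int n;            /* board size, set by the caller */
static long solutions;

/* cols, ldiag, rdiag mark attacked columns / left diagonals / right diagonals,
   shifted so that bit j corresponds to column j of the current row. */
static void place_row(uint32_t cols, uint32_t ldiag, uint32_t rdiag) {
    if (cols == ((uint32_t)1 << n) - 1) {   /* every column filled: a solution */
        solutions++;
        return;
    }
    uint32_t avail = ~(cols | ldiag | rdiag) & (((uint32_t)1 << n) - 1);
    while (avail) {
        uint32_t bit = avail & (-avail);    /* least-significant free column */
        avail ^= bit;
        place_row(cols | bit,
                  (ldiag | bit) << 1,       /* diagonals shift as we move down a row */
                  (rdiag | bit) >> 1);
    }
}

long count_queens(int board_size) {
    n = board_size;
    solutions = 0;
    place_row(0, 0, 0);
    return solutions;
}
```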

  • 01:05:00 In this section, the process of checking whether a position is safe for a queen in the n-queens problem using bit vectors is described. The three checks test whether there is already a queen in the same column, on the same left diagonal, or on the same right diagonal as the candidate position. This method is efficient and guarantees that a queen can be safely placed if it passes all three checks. Another problem discussed is population count, which counts the number of 1 bits in a word x. The presented method repeatedly eliminates the least-significant 1 bit of x until it becomes zero, so the number of iterations is proportional to the number of 1 bits in x.

  • 01:10:00 In this section, the speaker discusses the use of table lookup to efficiently count the number of 1 bits in a binary word. The table lookup method involves creating a table of size 256 that stores the number of ones in each 8-bit word and then looking up the count for each 8-bit substring in the binary word. The speaker notes that the performance of this method is bottlenecked by the memory operations required to access the table. The speaker goes on to present a third way to count the number of 1 bits using registers, where they create masks and execute specific instructions that allow them to count the number of ones in a binary word without accessing memory. The presenter goes through an example to explain how this method works.

  • 01:15:00 In this section, the speaker demonstrates how to count the number of 1s in an input word using a parallel divide-and-conquer method, where the performance is proportional to log base two of the word length, W. It's pointed out that most modern machines have an intrinsic pop count instruction that is faster than anything that can be coded, accessible via compiler intrinsics. However, this can make code less portable across different machines. The speaker provides some resources for learning more about bit tricks, including a website, a textbook, a chess programming website, and a book called Hacker's Delight.
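
C sketches of the counting methods just described: the loop that clears one set bit per iteration, and the branch-free parallel (divide-and-conquer) version. Most compilers also expose the hardware instruction through a builtin such as __builtin_popcountll.

```c
#include <stdint.h>

/* Clear the least-significant 1 bit each iteration: runs in time
   proportional to the number of set bits. */
int popcount_loop(uint64_t x) {
    int count = 0;
    for (; x != 0; x &= x - 1)
        count++;
    return count;
}

/* Parallel divide-and-conquer: log2(64) = 6 rounds of masked adds. */
int popcount_parallel(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ull);                           /* 2-bit sums */
    x = (x & 0x3333333333333333ull) + ((x >> 2) & 0x3333333333333333ull); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0Full;                           /* 8-bit sums */
    return (int)((x * 0x0101010101010101ull) >> 56);                      /* add the bytes */
}
```
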
3. Bit Hacks
  • 2019.09.23
  • www.youtube.com
MIT 6.172 Performance Engineering of Software Systems, Fall 2018. Instructor: Julian Shun. View the complete course: https://ocw.mit.edu/6-172F18. YouTube Playlist...
 

Lecture 4. Assembly Language & Computer Architecture



Lecture 4. Assembly Language & Computer Architecture

This video provides a comprehensive overview of assembly language and computer architecture. Assembly language is an important interface for optimizing code performance, and understanding computer architecture is essential for mastering assembly language. The speaker explains the history of x86 64 architecture and its development, its key registers, data types, memory addressing modes, and instruction set architecture, including stacks, integer and binary logic, boolean logic, and subroutines. They also discuss extensions like zero and sign extension and various addressing modes in assembly language. Additionally, the video discusses floating-point types, vectors, and vector units, traditional and SSE instructions, and computer architecture design features like superscalar processing, out of order execution, and branch prediction.

The video also covers several topics related to assembly language and computer architecture. One of the central themes is instruction level parallelism (ILP) and pipeline stalls, which are caused by hazards such as data dependencies. The speaker discusses true, anti, and output data dependencies and how superscalar processors can exploit more parallelism in hardware to execute multiple instructions at a time. However, to avoid hazards, architects have implemented strategies such as renaming and reordering, as well as speculative execution to guess the outcome of a branch and execute it beforehand. The speaker encourages the audience to understand these methods to better comprehend software optimizations.

  • 00:00:00 In this section, the instructor talks about assembly language and computer architecture, which are often not covered in modern software courses. Understanding these concepts is necessary for optimizing code performance, and assembly language is the best interface for doing so. The process of compiling code involves several stages, including pre-processing, compiling, assembling, and linking. Assembly language is a mnemonic structure of the machine code that makes it more human-readable, and the final executable is produced through a linking stage using the LD command.

  • 00:05:00 In this section, the speaker explains that assembly and machine code are very similar in structure, with opcodes in assembly corresponding to specific bit patterns in machine code. The speaker introduces objdump, a program that can produce a disassembly of machine code, revealing the mnemonic, human-readable assembly code. The speaker then discusses several reasons why someone might want to look at assembly code, including revealing what the compiler did and did not do, debugging compiler errors, and reverse engineering a program without access to its source.

  • 00:10:00 In this section, the speaker explains the expectations for the class, which include understanding how a compiler implements linguistic constructs with x86 instructions, being able to read x86 assembly language, and understanding the performance implications of common assembly patterns. Students should also be able to write code from scratch if necessary, and gain mastery to a level where this is feasible. The speaker then dives into the x86 64 instruction set architecture, including registers, instructions, data types, and memory addressing modes. Key registers include general purpose registers and the flags register, which tracks arithmetic operations and carries. The instruction pointer guides the hardware through the sequence of instructions in an assembly language program.

  • 00:15:00 In this section, the speaker discusses the history of x86 64 architecture and its development from 16-bit word machines with limited memory to wider words for indexing as more memory became available. The speaker also explains the addition of vector registers such as the xmm and AVX registers for vectorization, and how they are used. They also touch on the topic of general-purpose registers and how their functional purposes have changed significantly over time but their names have stuck due to historical reasons. Additionally, the speaker explains how certain registers are aliased to the same thing for lower and upper halves of the short word, 32-bit, or 64-bit words.

  • 00:20:00 In this section, the speaker explains the basics of x86 64 assembly language and how it works. Registers have different names depending on which part of the register is being accessed, and some have specific purposes like RSP being used as the stack pointer. The format of an x86 64 instruction code is to have an opcode and an operand list, usually with 0-3 operands that can be sources or destinations. The speaker explains the difference between the AT&T syntax and the Intel syntax for referring to operations and provides a list of common op codes like "move" and "conditional move." Additionally, the speaker explains the importance of extending the sign when moving from a 32-bit value register into a 64-bit register.

  • 00:25:00 In this section, the speaker discusses the various classes of instructions in assembly language, including instructions for stacks, integer arithmetic, binary logic, boolean logic, jumps, and subroutines. Opcodes may be augmented with a suffix that describes the data type of the operation or with a condition code. For example, a move with a "q" suffix indicates that the data being moved is a quad word, which consists of 8 bytes or 64 bits. The x86 64 data types and their assembly suffixes are also discussed, and examples are given to illustrate how sign extension works. The presence or absence of certain opcodes and mnemonics in assembly language can be confusing, but the compiler needs to understand these instructions to execute the program correctly.

  • 00:30:00 In this section, the speaker discusses the extensions like zero extension and sign extension that occur when moving 32-bit operations to 64-bit operations. They also mention how the Intel instruction set can become more complicated with the addition of new instructions. The speaker goes on to explain the different condition codes used in conditional jumps and conditional moves and the flags that are used with them, such as the zero flag and the overflow flag. The reason for certain condition codes checking the zero flag is also discussed.

  • 00:35:00 In this section, the speaker discusses the different direct and indirect addressing modes in assembly language. Direct addressing modes include immediate, direct memory, and using a register. Register indirect and register indexed addressing modes allow for accessing memory by providing the address in a register or offsetting the address. The speaker notes that accessing memory can be slow and expensive, so it is important to store values in registers whenever possible to speed up the process. They also mention the importance of caching in optimizing memory access.

  • 00:40:00 In this section, the video discusses instruction-pointer-relative addressing, where the instruction pointer is used to index data rather than a general-purpose register. The base-index-scale-displacement addressing mode is also introduced, a more complicated form that supports general indexing off a base pointer. The video provides an overview of some assembly language idioms, including the XOR opcode, which is used for zeroing registers, and the test opcode, which computes the bitwise AND of two values and sets the condition flags without storing the result. Lastly, the video touches on jump instructions and labels, which allow for control flow in the code.

  • 00:45:00 In this section, the video discusses assembly language and computer architecture, including various instruction sets and the history of floating point. The x86 instruction sets, including x87 and SSE, support single and double precision scalar floating point arithmetic, and vector instructions. The SSE instructions are generally preferred by compilers over x87 due to their simplicity in compilation and optimization. The purpose of using no operation (nop) instructions in assembly, such as alignment optimization, is also explained.

  • 00:50:00 In this section, the speaker discusses floating-point types, vectors, and vector units that are utilized in computer architecture. The speaker explains that the way vector units work is that the processor issues instructions to all vector units, each of which is called a lane. The lanes contain the integer or floating-point arithmetic and all operate in lockstep, meaning they all do the same thing. They can perform several operations at once, and all this can be done with a single instruction. The speaker explains that some architectures require vector operands to be aligned, while others can shift vectors in memory, which results in a performance difference between the two. Additionally, there are architectures that support cross-lane operations, such as permuting, shuffling, and scatter-gather.

  • 00:55:00 In this section, the speaker discusses the differences between traditional and SSE instructions and how they can be used. They also mention the aliasing trick whereby xmm registers alias the lower halves of the ymm registers, which AVX-512 extends to 512 bits. They then move into a discussion of computer architecture, specifically the difference between a five-stage pipeline and a modern processor featuring 14 to 19 pipeline stages. The design features they discuss include vector hardware, superscalar processing, out-of-order execution, and branch prediction, though they mention that they will not delve deeply into out-of-order execution due to time constraints.

  • 01:00:00 In this section of the video, the instructor discusses instruction-level parallelism (ILP) and pipeline stalls. ILP involves finding opportunities to execute multiple instructions simultaneously in order to improve throughput. However, pipeline stalls can occur when an instruction cannot be executed due to a hazard, such as a structural hazard, data hazard, or control hazard, with data hazards being the most common. A true dependence (read after write) occurs when an instruction reads a value that an earlier instruction writes, while an anti-dependence (write after read) occurs when an instruction wants to write to a location but must wait until an earlier instruction has read its value. The compiler attempts to reduce pipeline stalls by optimizing the code to avoid hazards.

  • 01:05:00 In this section, the concept of data dependencies in assembly language is discussed. There are three types of data dependencies: true, anti, and output. Arithmetic operations, especially complex ones such as floating-point arithmetic, have longer latency and require separate functional units in the processor, sometimes with separate registers. To increase instruction-level parallelism, architects have implemented techniques like issuing multiple instructions per cycle to exploit more parallelism in the hardware.

  • 01:10:00 In this section, the video explains how superscalar processors can execute multiple instructions at a time, using the example of Haswell breaking instructions into simpler operations called micro ops and emitting up to four micro ops per cycle. The video then details strategies to free processors from executing instructions in order, including bypassing and register renaming, which allow for non-dependent instructions to be executed out of order, resulting in faster processing times. These strategies require keeping track of dependencies and executing code in a way that avoids hazards such as data dependencies.

  • 01:15:00 In this section, the speaker discusses renaming and reordering, which are two important methods used in modern processors to avoid data hazards. The speaker also talks about speculative execution, which is used in branch prediction to guess the outcome of a branch and execute it beforehand, to avoid stalling. However, if the guess is wrong, it will cost about 15 to 20 cycles to undo the computation. The speaker briefly mentions how branch predictors work but points out that it is complicated and requires further study. Despite the hurried end, the speaker encourages the audience to understand the various methods, which will help in understanding why certain software optimizations work and don't work.
4. Assembly Language & Computer Architecture
  • 2019.09.23
  • www.youtube.com
MIT 6.172 Performance Engineering of Software Systems, Fall 2018. Instructor: Charles Leiserson. View the complete course: https://ocw.mit.edu/6-172F18. YouTube Pl...
 

Lecture 5. C to Assembly



Lecture 5. C to Assembly

In this portion of the video, the importance of understanding C to assembly language is discussed, along with how C code is implemented in assembly language using a compiler. The focus is specifically on how LLVM IR is translated into assembly in the Linux x86 64 calling convention. The presenter explains the basic components of LLVM IR and how constructs in the C programming language are translated into LLVM IR. The video also covers the layout of virtual memory, the issue of coordinating function calls between multiple functions, and the Linux x86 64 calling convention's use of two pointers - the BP and SP - to manage all stack frames.

The video also explains the strategies for maintaining register state in C-to-assembly programming, such as designating registers as either caller-saved or callee-saved, and how the x86 calling convention avoids wasting work. It covers how function calls work in C and assembly, discussing the process of saving arguments and local variables onto the stack, as well as the common optimization of using the stack pointer instead of the base pointer. The video also shows the process of compiling LLVM IR down to assembly code, discussing the function prologue, saving registers, handling conditions, and converting C code to assembly code using a control flow graph. Finally, it talks about the function epilogue used to restore registers before returning results.

  • 00:00:00 In this section of the video, TB Schardl discusses the importance of understanding the C-to-assembly translation. He notes that assembly language provides greater precision than C code and can reveal details about a program that are not obvious from the C code alone. Looking at assembly also lets developers determine what the compiler did or did not do when optimizing a program. In addition, bugs can arise that only create unexpected behavior when code is optimized at certain levels, making them difficult to spot. Lastly, understanding assembly allows developers to modify the assembly code by hand or to reverse engineer code.

  • 00:05:00 In this section, the video discusses how C code is implemented in assembly language through the use of a compiler, which has to make complex decisions regarding what assembly instructions to use, how to implement C conditionals and loops, and how to coordinate function calls. To understand the translation process, the video introduces LLVM IR, a simplified assembly-like representation that the compiler uses to reason about the program. The video then explains how constructs in the C programming language, such as functions, conditionals, and loops, get translated into LLVM IR. The section ends with a brief mention of LLVM IR attributes.

  • 00:10:00 In this section, the focus is on how LLVM IR gets translated into assembly, specifically in the Linux x86 64 calling convention. The presenter provides a brief primer on LLVM IR, explaining its basic components of functions, instructions, and registers, which resemble a simpler version of assembly. LLVM IR registers are similar to C variables and are distinguished by their names, and each function has its own local register names. The presenter highlights the registers in an example code snippet and notes that the same syntax is reused when referring to different basic blocks in LLVM. The talk concludes with a case study of how this process works for a simple code that computes Fibonacci numbers.

  • 00:15:00 In this section, the speaker explains the syntax of LLVM IR instructions, which involves a register name, an equal symbol, an opcode, and an operand list. While some instructions return a value that gets stored in a local register, others don't. The LLVM IR instruction set is smaller than that of x86 since it contains only a few instructions for data movement, arithmetic and logic operations, and control flow. In LLVM IR, everything is exposed as types, which include integers (with a defined number of bits), floating-point types, pointer types, vector types, and label types for basic blocks. The speaker also mentions that LLVM IR vector operations don't look like SSE or AVX, but rather like ordinary operations that work on a vector type.

  • 00:20:00 In this section, the video discusses how sequences of operations in C code are translated into LLVM IR, and the rules of thumb for interpreting the code in IR. The excerpt also explains how LLVM IR handles primitive and aggregate types, such as arrays and structs, and shows examples of how accessing elements in an array involves computing addresses in memory. Additionally, the video explains how C functions are translated into LLVM IR, with the corresponding return statements in both languages.

  • 00:25:00 In this section, the video explains how functions and parameters in C translate to LLVM IR. Functions in LLVM IR are similar to functions in C, but C parameters become lists of parameters in LLVM IR. However, reading LLVM IR can be challenging as the registers are not well-named, making it difficult to decipher. The video also discusses basic blocks in LLVM IR, which are chunks of code partitioned based on control flow instructions. The connections between basic blocks are based on edges induced by branch instructions. Finally, the video touches upon conditionals in LLVM IR, where if-then-else statements become conditional branch instructions that induce basic blocks and control flow edges.

  • 00:30:00 In this section, the video explains how conditional operations and branches work in LLVM IR. The process begins with a comparison between a literal value and a value stored in a register, which creates a one-bit integer or boolean result. This result is then passed to a conditional branch action along with two labels of basic blocks indicating where to go if the predicate is true or false. If there is an unconditional branch with one operand, the result is a direct jump to a specific basic block. The video also shows how to create a diamond shape in the control flow graph whenever there is a conditional statement, and provides an example of a loop written in C code.

  • 00:35:00 In this section, the video discusses the components of a C loop, which include the loop body and loop control. The loop body is executed on each iteration, while the loop control manages all iterations of the loop. A C loop produces a looping pattern in the control flow graph, which in turn creates a loop structure in the LLVM IR. The video then analyzes the LLVM IR code for the loop control, where a phi instruction appears, something that arises commonly when dealing with loops. The phi instruction addresses a problem in the LLVM representation of the code, where the induction variable i is represented by a whole bunch of different registers, making it unclear where the value of i actually lives.

  • 00:40:00 In this section, the video discusses the mapping of the induction variable in a loop to various locations in LLVM, which it does due to the variable's changing values across iterations of the loop. This leads to the "static single assignment" invariant in LLVM that each instruction only defines the value of a register once, which poses a problem for induction variables. The "phi" instruction solves this issue by defining a register value that depends on how the control flow merges at the entry point of a loop. The video also discusses attributes in LLVM and how they can provide extra information for LLVM constructs, such as the NSW attribute attached to the "add" instruction.

  • 00:45:00 In this section of the video, the focus is on LLVM IR, which is similar to assembly but simpler in many ways, as all computed values are stored in registers that are like ordinary C variables. LLVM IR uses the static single assignment paradigm where each variable is written by at most one instruction. The video describes how to translate C code into LLVM IR and then into assembly, with some additional complexities in the process. The compiler selects the actual assembly instructions needed for the LLVM IR operations, decides which general-purpose registers are used, and coordinates function calls between different source files and binary libraries. The discussion then turns to the Linux x86 64 calling convention.

  • 00:50:00 In this section, the layout of virtual memory for a program is discussed. The virtual memory is divided into segments, such as the stack and heap segments, which are organized with the use of assembler directives in assembly code. Additionally, different types of storage directives are discussed, such as the .space directive, which reserves memory, and the .long directive, which stores constant values. Attention is then turned to the stack segment, where data is stored to manage function calls and returns, including local variables, function arguments, the return address, and possibly intermediate results.

  • 00:55:00 In this section of the video, the presenter discusses the issue of coordinating function calls across multiple functions, some of which may originate from different files or libraries. To ensure that all functions play nice together, a calling convention has been established that all functions must obey. The Linux x86-64 calling convention uses two pointers, %rbp and %rsp, to manage all stack frames for each function instantiation. When a call instruction is executed, the current value of the instruction pointer is pushed onto the stack as the return address, and the call instruction jumps to the address of some function in program memory. The return instruction undoes the operations of the call instruction and pops the return address off the stack to return to the caller's execution. To handle the problem of multiple functions wanting to use the same registers, the calling convention details rules for which registers each function can use, and how they must preserve those registers across function calls.

  • 01:00:00 In this section, the video discusses the strategies for maintaining register states when operating with different functions in C to Assembly programming. It mentions the two methods that can be used, one where the caller saves off the register state before invoking a call, and the other where the callee saves all the register state. The x86 calling convention involves both, specifying some registers as callee-save and others as caller-save to avoid wasting work. The video also explores how stack memory is organized, and the stack grows down. Finally, it discusses coordinating how functions are going to use overlapping parts of stack memory.

  • 01:05:00 In this section, the video discusses how a function call works in C and assembly. Before calling function C, function B places non-register arguments for function C onto a reserved linkage block in its own stack memory below its local variables. It will access those arguments with negative offsets. When function C starts, it executes a function prologue, which saves off the base pointer for B's stack frame and sets %rbp equal to %rsp because it is entering a new frame. It then allocates space on the stack for its own local variables as well as for any linkage blocks that it will create for the functions it calls. The video also mentions a common optimization where the compiler gets rid of the base pointer and does all of the indexing off the stack pointer %rsp to gain one more general-purpose register.

  • 01:10:00 In this section, the instructor takes the audience through the process of compiling LLVM IR down to assembly code. The first step defines the function 'fib' as a globally accessible function using assembler directives such as .globl fib and an alignment directive. Then the function prologue is emitted as a pushq %rbp followed by a movq %rsp, %rbp. The assembly code also pushes a couple of callee-saved registers onto the stack before taking the function argument, which arrives in %rdi, and squirreling it away into the %rbx register. Lastly, a comparison instruction checks whether the value of n is less than two, and if so, the function returns the input value. Otherwise, it goes through some straight-line code to compute fib of n minus one and fib of n minus two, adds them together, and returns that result.

  • 01:15:00 In this section, the video discusses conditional jumps and the varying behaviors that occur next in the code depending on the comparison results. The JGE instruction jumps to the label for the false side of the LLVM branch operation when n is greater than or equal to 2, while the fall-through operations correspond to the true side when n is less than 2. The LEA instruction is used for address calculation, while a move operation saves the result of the function call into %r14. The next set of instructions computes the value of n-2, stashes it into %rdi, and then calls fib on n-2, returning the result in %rax. Finally, the code adds %r14 to %rax.

  • 01:20:00 In this section, the video discusses the conversion from C code to assembly code. The speaker explains that the process uses a control flow graph, composed of basic blocks connected by control flow edges, to represent the code. They also mention the complexity of dealing with calling conventions in assembly code. The function epilogue is used to restore registers before anything in the stack frame, before returning the result. The video concludes by summarizing the main topics covered throughout the section.
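
For reference, a minimal C version of the Fibonacci routine used as the running example in this lecture might look like the following; the exact code shown on the slides may differ slightly.

    #include <stdint.h>

    // Recursive Fibonacci: the running example that gets compiled to LLVM IR
    // and then to x86-64 assembly in this lecture.
    int64_t fib(int64_t n) {
        if (n < 2) return n;              // base case: a compare and a conditional branch in the IR
        return fib(n - 1) + fib(n - 2);   // two recursive calls, an add, and a return
    }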
5. C to Assembly
  • 2019.09.23
  • www.youtube.com
MIT 6.172 Performance Engineering of Software Systems, Fall 2018. Instructor: Tao B. Schardl. View the complete course: https://ocw.mit.edu/6-172F18. YouTube Playl...
 

Lecture 6. Multicore Programming



Lecture 6. Multicore Programming

This video lecture discusses multi-core programming and the emergence of multi-core processors due to Moore's Law and the end of scaling of clock frequencies. The speaker explains the power density issue faced by processors and how it led to the addition of multiple cores to chips to keep up with Moore's Law. The lecture also covers the basics of cache coherence protocols in shared memory hardware and concurrency platforms like Pthreads, TBB, OpenMP, and Cilk that provide abstractions for parallel programming. The pros and cons of each platform are discussed and demonstrated with examples of implementing Fibonacci programs. The video provides a comprehensive overview of multi-core programming and the challenges and solutions faced by programmers.

The video also covers various aspects of Cilk, an abstraction for handling parallel processing. The speaker discusses topics such as nested cilk_for loops, generating assembly code, reduction using reducers, the scheduler, and optimizing for performance. They also provide an overview of the Cilk ecosystem and related tools such as Cilksan and Cilkscale for debugging and analyzing scalability, respectively. The main takeaway is that writing parallel programs for multicore processors can be challenging, but Cilk simplifies the process by handling complex tasks efficiently, giving programmers more control over the execution of their code.

  • 00:00:00 In this section, the instructor discusses multi-core programming and the reasons for the emergence of multi-core processors. With the advent of Moore's Law, which states that the number of transistors that can fit on a chip doubles every two years, and the end of scaling of clock frequencies around 2004 to 2005, semiconductor vendors started producing chips with multiple processor cores on them. This was because it was no longer possible to keep increasing the frequency of single cores on a chip, making it necessary to switch to a design model that allowed for parallel processing. The instructor also clarifies the relationship between the number of transistors on a chip and the maximum frequency of the processor.

  • 00:05:00 In this section, the speaker discusses the power density issue faced by processors, where increasing clock frequency causes the power density to increase exponentially. This led to multiple cores being added to chips to keep up with Moore's Law without exceeding the power density limits. The speaker then introduces the abstract multi-core architecture, known as a chip multiprocessor, and outlines the lecture's coverage of hardware challenges and software solutions for programming parallel programs on multi-core machines using different concurrency platforms such as Pthreads, WinAPI threads, Intel Threading Building Blocks, OpenMP, and Cilk Plus.

  • 00:10:00 In this section, the speaker explains how caches work in multicore programming. When one processor loads a value into its cache from main memory, it will keep that value in cache in case it needs to access it again in the future. If another processor wants to load the same value, it can also get it through the first processor's cache instead of going all the way to main memory. However, a problem arises when one processor updates the value in its own cache, making the value in another processor's cache stale. The MSI protocol is a basic solution to this problem, which labels cache lines with three possible states (M, S, and I) to ensure coherency between caches.

  • 00:15:00 In this section, the lecture discusses the basics of cache coherence protocols in shared memory hardware, particularly how modifications to a cache line in one processor's cache can invalidate other caches' copies of the same line. The lecture introduces a simple protocol and explains how other more complex protocols exist in practice. The hardware is responsible for managing conflicts when multiple processors modify the same cache line, but this can result in an "invalidation storm" and slow down performance. The lecture also notes that concurrency platforms can abstract these complexities and help with synchronization and communication in parallel programming.

  • 00:20:00 In this section, the speaker discusses different concurrency platforms using the Fibonacci numbers example. The Fibonacci program is implemented using a recursive algorithm that computes Fibonacci numbers multiple times, making it a poor algorithm. The two recursive calls can be parallelized as they are completely independent calculations. Pthreads, a standard API for threading, can be used to implement this parallelization.

  • 00:25:00 In this section, the speaker discusses Pthreads, a library of functions that enable concurrency in programming. Pthreads provides a do-it-yourself concurrency platform, as it allows you to specify concurrency in your code using a library of functions with special semantics. Each thread in Pthreads implements an abstraction of a processor and is multiplexed onto the actual machine resources. Pthreads also provides mutexes and other primitives that help with inter-thread coordination. The Pthreads library has key functions such as pthread_create, which creates a new thread and stores an identifier for it, and pthread_join, which waits until the thread finishes before continuing on in the code. The speaker demonstrates how to implement Fibonacci using Pthreads (a sketch of this structure appears after this list).

  • 00:30:00 In this section, the presenter discusses the implementation of Pthreads code for executing the Fibonacci sequence in parallel. They explain that if the input size is small enough, it is better to execute the program sequentially due to the cost of creating threads in parallel. The code marshals the input argument to the thread and stores it in the arguments structure. The presenter discusses Pthread create, Pthread join, and some of its disadvantages, such as becoming much more complicated if the code needs to recursively create threads. They also mention that this code only creates one thread, so the maximum speed-up possible is two if run on four processors.

  • 00:35:00 In this section of the video, the issues with Pthreads and the code for Fibonacci sequence are discussed. The high overhead to creating a thread results in coarse-grained concurrency, and the scalability issue is that the two cores are not doing the same amount of work. The code also lacks modularity, as the Fibonacci logic is not encapsulated nicely within one function, making it difficult to maintain and change. The code also becomes complicated due to argument marshalling, which is similar to having to write code in assembly. The video then introduces Threading Building Blocks (TBB), a library developed by Intel to provide a solution for these problems.

  • 00:40:00 In this section, the video discusses the use of the Intel Thread Building Blocks (TBB) library, a C++ library designed to allow programmers to use native threads and a work-stealing algorithm without having to deal with threads directly. By specifying tasks, TBB load balances them across threads, making code simpler to write and performing better than using POSIX threads. The video shows an example of implementing a Fibonacci program using TBB, demonstrating how it recursively creates child tasks, allowing for more parallelism and scalability across multiple processors. TBB also provides templates for loop parallelism, data aggregation, and software pipelining, as well as concurrent container classes for safe concurrent access to shared data.

  • 00:45:00 In this section, the speaker introduces two linguistic solutions to the concurrency problem: OpenMP and Cilk. OpenMP is a specification that provides linguistic extensions to C and C++, as well as Fortran, using compiler pragmas to specify which pieces of code should run in parallel. It supports loop parallelism, task parallelism, and pipeline parallelism, and has many scheduling and data sharing directives, as well as synchronization constructs. The speaker provides an example of implementing Fibonacci in OpenMP, which is simpler and performs better than using Pthreads or TBB. OpenMP is a popular solution for writing parallel programs as it provides many features and simplifies the code.

  • 00:50:00 In this section, the speaker discusses the Cilk programming platform, which supports fork-join and vector parallelism and includes a provably efficient work-stealing scheduler. The platform also has a hyperobject library for parallelizing code with global variables and comes with programming tools such as the Cilkscreen race detector and a scalability analyzer called Cilkview. The speaker also notes that while they will not be using Cilk Plus in the class, they will be using a better compiler known as Tapir/LLVM, which produces better code relative to its base compiler than all other implementations of Cilk out there.

  • 00:55:00 In this section, the use of cilk_spawn and cilk_sync statements to enable parallel execution is discussed through examples. The first example is the Cilk code for Fibonacci, which includes cilk_spawn and cilk_sync statements to indicate that fib of n-1 may be executed in parallel with the function calling it while fib of n-2 is being executed; the runtime system decides whether or not to actually run these tasks in parallel (the Fibonacci version is sketched after this list). Another example discussed is loop parallelism, where the cilk_for statement is used to execute a for loop in parallel, with the compiler recursively dividing the iteration space in half and using cilk_spawn and cilk_sync statements until the range becomes too small to execute in parallel. It is important to guarantee that iterations are independent for correct results, and using cilk_for adds some additional overheads.

  • 01:00:00 In this section, the video discusses using nested cilk_for loops and the details of generating assembly code using a flag in the Clang compiler. The example code for a summation of values using a cilk_for loop raises the issue of a determinacy race when multiple processors write to the same memory location (a minimal racy version of such a loop appears after this list). Cilk addresses this issue through the use of reducers, which are hyperobjects that handle the addition for a given variable and are registered and unregistered with reducer macros. This is a way to deal with the problem of reduction, which comes up often in multicore programming, and many other reduction operators are also available for use.

  • 01:05:00 In this section, the speaker discusses reducers in Cilk, algebraic structures that have an associative binary operation and an identity element. Cilk has several predefined reducers, including addition, multiplication, min, max, and XOR, and you can define your own reducer. One of the benefits of Cilk is that there is always a valid serial interpretation of the program, making debugging easier, and it has a runtime scheduler that monitors and maps the program onto available processor cores, using a work-stealing scheduling algorithm to balance tasks evenly. Unlike other concurrency platforms, Cilk's scheduler is theoretically efficient.

  • 01:10:00 In this section, the speaker provides a high-level overview of the Cilk ecosystem for multicore programming, including how to compile Cilk source code, benchmark parallel and serial performance, and debug issues using tools such as Cilksan and Cilkscale. They also emphasize the importance of optimizing the serial program for performance, and how Cilk's scheduler can handle different tasks during program execution. Additionally, the speaker explains how Cilkscale can determine the maximum number of processors that a code can take advantage of, making it a useful tool for analyzing scalability.

  • 01:15:00 In this section, the speaker summarizes the key takeaways from the lecture on multicore programming. They explain that most modern processors have multiple cores, making it necessary to write parallel programs for high performance. However, writing such programs can be difficult, which is where Cilk comes in. It abstracts processor cores from the programmer and handles synchronization, communication protocols, and load balancing. The speaker also mentions a future project where students will implement their own parallel screensaver using Cilk.
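
For the Pthreads discussion at 00:25:00 and 00:30:00 above, here is a minimal sketch of how a Pthreads Fibonacci along the lines of the lecture is typically structured; the struct name, threshold, and helper names are illustrative, not the lecture's exact code.

    #include <pthread.h>
    #include <stdint.h>

    typedef struct { int64_t input, output; } thread_args;   // illustrative argument-marshalling struct

    static int64_t fib(int64_t n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }

    // Thread start routine: unmarshal the argument, compute, store the result.
    static void *thread_func(void *ptr) {
        thread_args *args = (thread_args *)ptr;
        args->output = fib(args->input);
        return NULL;
    }

    int64_t fib_parallel(int64_t n) {
        if (n < 30) return fib(n);            // small inputs: not worth the cost of creating a thread
        pthread_t thread;
        thread_args args = { n - 1, 0 };
        if (pthread_create(&thread, NULL, thread_func, &args) != 0)
            return fib(n);                    // fall back to serial if thread creation fails
        int64_t result = fib(n - 2);          // compute the other subproblem on this thread
        pthread_join(thread, NULL);           // wait for the spawned thread to finish
        return result + args.output;
    }

Because only one extra thread is ever created, the maximum possible speed-up of this version is about two, no matter how many processors are available.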
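
For the cilk_spawn and cilk_sync discussion at 00:55:00 above, the canonical Cilk Fibonacci looks roughly like this; a sketch assuming the Cilk Plus / OpenCilk header <cilk/cilk.h>.

    #include <cilk/cilk.h>
    #include <stdint.h>

    int64_t fib(int64_t n) {
        if (n < 2) return n;
        int64_t x = cilk_spawn fib(n - 1);  // the child may run in parallel with the continuation
        int64_t y = fib(n - 2);             // the parent works on the other subproblem meanwhile
        cilk_sync;                          // wait for the spawned child before using x
        return x + y;
    }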
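
For the reduction discussion at 01:00:00 above, a minimal sketch of the kind of racy summation loop being described: every iteration reads and writes the shared variable sum, which is exactly the determinacy race that reducers (or per-worker partial sums) are meant to eliminate. The function name is illustrative.

    #include <cilk/cilk.h>

    int sum_array(const int *a, int n) {
        int sum = 0;
        cilk_for (int i = 0; i < n; ++i) {
            sum += a[i];   // determinacy race: parallel iterations update sum concurrently
        }
        return sum;        // the result may be wrong; use a reducer or local partial sums instead
    }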
6. Multicore Programming
  • 2019.09.23
  • www.youtube.com
MIT 6.172 Performance Engineering of Software Systems, Fall 2018. Instructor: Julian Shun. View the complete course: https://ocw.mit.edu/6-172F18. YouTube Playlist...
 

Lecture 7. Races and Parallelism



Lecture 7. Races and Parallelism

The video covers a range of topics related to races, parallelism, and computation dags in Cilk programming. Some key concepts covered include the spawn and sync statements for concurrent execution, the importance of avoiding race conditions, and using Cilk's race detector to identify them. The video also covers Amdahl's law, the work law, and the span law as ways to quantify the amount of parallelism in a program, along with ways to analyze the work and span of computations. The potential speedup and parallelism of parallel sorting algorithms and the concept of scheduling theory are also discussed, with a focus on the greedy scheduler's theorem. Overall, the video provides valuable insights into understanding and optimizing program performance in Cilk programming.

The video explains the corollaries of the greedy scheduler bound, which essentially states that any greedy scheduler achieves near-perfect linear speed-up as long as T1/Tinfinity is greater than or equal to P^2. The Cilk scheduler, which uses work stealing, can achieve near-perfect linear speed-up as long as the number of processors is much less than T1/Tinfinity. The Cilk runtime system works by maintaining a work deque of ready strands on each worker and manipulating the bottom of the deque like a stack. The video also discusses the cactus stack, which allows multiple views of the stack to exist in parallel and makes parallel function calls possible. The upper bound on stack space required by a P-processor execution is often much looser than the actual amount needed, since each processor may not need to go all the way down the computation graph every time it steals work.

  • 00:00:00 In this section, the instructor introduces the topic of races and parallelism in Cilk, which will be important for the upcoming homework assignment and project. The basics of Cilk's spawn and sync statements are reviewed, which allow for concurrent execution of parent and child functions. The instructor also emphasizes the importance of meeting with MITPOSSE members (the course's code-review mentors) and of avoiding race conditions in code, which can lead to disastrous consequences as seen in past examples. The hardest race bugs to find are those that caused catastrophic events, and conventional testing is often not effective. Cilk does offer a tool to help identify potential race bugs in code.

  • 00:05:00 In this section, the video explains why races are among the most challenging bugs for developers to find: they may occur only in rare interleavings, when logically parallel code accesses the same memory location and at least one of the accesses is a write. As an example, the video demonstrates a simple code and its dependency graph to show how a race bug shared between two parallel paths does not always manifest. The race happens when both stores write to the same memory location, which can produce different outputs depending on which execution path executes entirely first.

  • 00:10:00 In this section, the speaker explains the concept of determinacy races, which occur when two logically parallel instructions access the same memory location and at least one of the instructions is a write, resulting in non-deterministic behavior in the program. The speaker gives tips on how to avoid races, such as ensuring that the iterations of a cilk_for loop are independent, and making sure that code between a cilk_spawn statement and the corresponding cilk_sync statement is also independent. The speaker also notes that the machine word size matters, and care should be taken when reading and writing to packed data structures that use non-standard types.

  • 00:15:00 In this section, the video discusses the Cilk platform's race detector, which can help identify potential race conditions in a program. By using the -fsanitize=cilk flag to generate an instrumented program, the race detector can report and localize offending races, which helps debug code. The video also explains the concept of parallelism, and how the Cilk execution model uses computation graphs to illustrate the unfolding of computation during execution. These concepts are important to understand when trying to optimize and improve program performance.

  • 00:20:00 In this section, the speaker describes the types of vertices in the computation dag: the initial strand, continuation strands, function-call strands, and final strands. The initial strand contains the first instruction to be executed, and the final strand contains the last instruction executed in the program. Continuation strands represent the continuation of a spawn operation, whereas function-call strands represent the execution of a function call. It's important to note that the computation dag dynamically unfolds during execution and is processor-oblivious, meaning the runtime system figures out how to map tasks to processors at runtime. In summary, the computation dag is a powerful tool for understanding the parallelism and concurrency of a program.

  • 00:25:00 In this section, we learn about computation dags and how they can be used to analyze parallelism in a program. The computation dag represents the dependencies between different parts of the program and has different types of edges, including spawn edges, call edges, return edges, and continue edges. It is possible to use computation dags to analyze how much parallelism there is in a program, and we use Amdahl's law to quantify the amount of parallelism. In the example provided, there are fewer than seven nodes that need to be executed sequentially, indicating that there is some degree of parallelism in the computation.

  • 00:30:00 In this section, the concept of Amdahl's Law is introduced as a way to determine the maximum speed-up possible in parallel processing. It is determined that the serial fraction of the program is 3/18, resulting in a maximum speed-up of 6. While Amdahl's Law provides an upper bound on parallelism, it is often too loose in practical cases. The work law and span law are introduced as alternative definitions of parallelism, with the work law stating that the execution time on P processors is greater than or equal to the work of the program divided by P, and the span law stating that the execution time on P processors is at least the execution time on an infinite number of processors. These laws provide better upper bounds on maximum speed-up than Amdahl's Law in many cases.

  • 00:35:00 In this section, the speaker discusses how to compose the work and span quantities of different computations. They explain that when executing two parallel computations, the work is still the sum of the work of the individual computations, and the span is the maximum of the span of the individual computations. The speaker also defines speed-up on P processors and discusses sublinear, linear, and superlinear speed-up. They note that the maximum possible speed-up is equal to the parallelism of the computation, which is equal to the average amount of work per step along the computation. Overall, this section provides insight into the composition of computations and how to measure their parallelism.

  • 00:40:00 In this section, the speaker discusses how to analyze the work and span of computations to compute the maximum parallelism, using examples such as a computation DAG and a parallel quicksort algorithm. The speaker introduces the Cilkscale scalability analyzer, which uses compiler instrumentation to analyze the serial execution of a program and derive upper bounds on the parallel speed-up of the program. The speaker also mentions that analyzing the parallelism of large computations by hand can be tedious, but Cilkscale can help with this.

  • 00:45:00 In this section, the video discusses the results of running a parallel computation and generating a plot that shows the observed speed-up, as well as bounds from the span and work laws. The plot shows that the maximum speed-up is about 5, indicating that the parallelism in the program is low. The analysis reveals that the bottleneck to the parallelism is executing the partition function sequentially, and the expected work and span of the parallel version of quicksort is on the order of n log n. The maximum parallelism that can be achieved is on the order of log log n, which is not very high. To increase the parallelism, the partition function must be parallelized.

  • 00:50:00 In this section, the speaker discusses the importance of parallelizing the partition function and merge sort in order to see significant speed-up. Parallel versions of these algorithms have an overall span bound of log squared n and a high parallelism of n over log n. Additionally, there are many other practical parallel algorithms out there that have high parallelism and a work bound asymptotically equal to that of the corresponding sequential algorithm. The speaker also introduces the concept of scheduling theory, explaining that a centralized scheduler can map computation DAGs onto available processors at runtime, while a distributed scheduler is used in Cilk programming. A greedy scheduler, which does as much as possible on every step of the computation, is used as an example.

  • 00:55:00 In this section, the concept of a greedy scheduler is introduced, which performs as much work as possible in the current time step without thinking about the future. A complete step, where at least P strands are ready, and an incomplete step, where fewer than P strands are ready, are defined. The famous theorem, first shown by Ron Graham in 1968, states that the time bound achieved by a greedy scheduler is less than or equal to T1/P + T infinity, with T1 being the work and T infinity being the span. The proof for this theorem is provided, and it is shown that any greedy scheduler achieves within a factor of two of the optimal running time.

  • 01:00:00 In this section, the video explains the corollaries of the greedy scheduler bound, which achieves within a factor of two of the optimal scheduler. The corollary states that any greedy scheduler achieves near-perfect linear speed-up when T1/Tinfinity is greater than or equal to P^2, where (T1/Tinfinity)/P measures the amount of parallelism in the computation relative to the number of processors available (these bounds are written out after this list). The Cilk scheduler uses work stealing and has an expected running time of TP equal to T1/P plus order T infinity, with a big O in front of the T infinity to account for the overheads of scheduling; it can achieve near-perfect linear speed-up as long as the number of processors is much less than T1/Tinfinity. The Cilk runtime system works by maintaining a work deque of ready strands on each worker and manipulating the bottom of the deque like a stack. The instrumentation in Cilkscale allows the work and span terms to be measured in order to determine the parallelism in the program.

  • 01:05:00 In this section, the speaker discusses the concept of spawning and parallelism across processors. They explain that when a processor calls a function, it places that function's frame at the bottom of its deque, and spawned frames are likewise pushed onto the bottom of the deque. Multiple processors can work in parallel, and returns can be made from spawns and calls. When workers run out of work to do, they steal from the top of another processor's deque. The famous theorem states that workers steal very infrequently, giving near-linear speed-up. The speaker also notes that Cilk supports C's rules for pointers, where a pointer to stack space can be passed from a parent to a child, but not from a child to a parent.

  • 01:10:00 In this section of the video, the speaker discusses the cactus stack, which gives each task in a Cilk program a view of the stack frames of its ancestors in the computation graph. This allows multiple views of the stack to exist in parallel, which makes parallel function calls possible. The speaker explains that the stack space required by a Cilk program can be bounded by taking the stack space required by a serial execution of the program (S sub 1) and multiplying it by the number of processors (P) that will be used (S sub P ≤ P x S sub 1). The speaker provides a high-level proof of this bound and notes that the upper bound on stack space required by a P-processor execution is often much looser than the actual amount needed, since each processor may not need to go all the way down the computation graph every time it steals work.
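
For the 00:30:00 and 00:35:00 items above, the laws and composition rules can be written compactly, with T_1 the work, T_infinity the span, and T_P the running time on P processors:

    T_P \ge T_1 / P        \quad \text{(work law)}
    T_P \ge T_\infty       \quad \text{(span law)}

    \text{Series composition: } \; T_1 = T_1(A) + T_1(B), \qquad T_\infty = T_\infty(A) + T_\infty(B)
    \text{Parallel composition: } \; T_1 = T_1(A) + T_1(B), \qquad T_\infty = \max\{T_\infty(A),\, T_\infty(B)\}

    \text{Speed-up on } P \text{ processors: } T_1 / T_P, \qquad \text{Parallelism: } T_1 / T_\infty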
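
For the greedy-scheduler theorem at 00:55:00 above, the bound and the factor-of-two corollary can be written as follows: there are at most T_1/P complete steps (each does P units of work) and at most T_infinity incomplete steps (each reduces the remaining span by one), and any schedule needs at least max{T_1/P, T_infinity} time by the work and span laws.

    T_P \le \frac{T_1}{P} + T_\infty \le 2 \max\!\left\{\frac{T_1}{P},\, T_\infty\right\} \le 2\, T_P^{\mathrm{opt}}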
7. Races and Parallelism
  • 2019.09.23
  • www.youtube.com
MIT 6.172 Performance Engineering of Software Systems, Fall 2018. Instructor: Julian Shun. View the complete course: https://ocw.mit.edu/6-172F18. YouTube Playlist...
 

Lecture 8. Analysis of Multithreaded Algorithms



Lecture 8. Analysis of Multithreaded Algorithms

This video discusses the master method for analyzing divide and conquer recurrences, and applies it to multithreaded algorithms by comparing n^(log_B A) with F(n) to determine which term dominates the running time. The video presents a cheat sheet with solutions for basic multithreaded algorithms and covers topics such as work and span analysis, work efficiency, and overcoming overhead costs through grain size optimization. The importance of empathy when communicating technical topics is emphasized, and the concept of a DAG is introduced as a means of balancing work and span to increase parallelism and decrease runtime.

The video also discusses the analysis of multithreaded algorithms, focusing on work and span, and how to maximize parallelism while minimizing span to achieve near perfect linear speed-up. The speaker suggests a divide and conquer approach to multi-threaded algorithms, where the number of tasks given to threads is sufficiently large, and warns against spawning numerous small tasks due to scheduling overheads. The example code presented includes an efficient one and a lousy one. The speaker also discusses how to represent sub-matrices in two-dimensional coding and indexing and emphasizes using 'restrict' in the divide and conquer matrix multiplication code to improve performance.

  • 00:00:00 In this section, the instructor begins by reminding the students about divide and conquer recurrences and the general method for solving them, called the master method. The method deals with recurrences of the form T(n) = A*T(n/B) + F(n), where A is the number of subproblems, B is the factor by which the problem size shrinks, and F(n) is the work to divide the problem into subproblems and combine their results. The students are reminded of the recursion tree: each level splits into subproblems of size n/B, the running time is added across the rows, the height of the tree is log base B of n, and level k has A^k subproblems. Finally, the number of leaves works out to A^(log_B n) = n^(log_B A), which gives the leaf-level contribution to the running time.

  • 00:05:00 In this section, the speaker explains how to determine which term dominates when evaluating these recurrences: the key is to compare n^(log_B A) with F(n). The speaker covers three cases. In the first, n^(log_B A) grows polynomially faster than F(n); a constant fraction of the work then sits in the leaves, and the answer is n^(log_B A). In the second, F(n) is within a polylogarithmic factor of n^(log_B A), namely F(n) = Θ(n^(log_B A) log^k n); you then tack on an extra log, and the answer is n^(log_B A) log^(k+1) n. In the third, n^(log_B A) grows polynomially slower than F(n); F then has to satisfy a regularity condition, which is satisfied by all the functions they will be looking at, such as polynomials and logarithms, and the answer is Θ(F(n)). (The three cases are written out after this list.)

  • 00:10:00 In this section, the speaker presents a cheat sheet of solutions for recurrences of the form T(n) = 4T(n/2) + F(n) that come up often when analyzing multithreaded algorithms; here n^(log_B A) = n². With F(n) = n, the solution is Θ(n²), since it falls under case one. With F(n) = n², the solution is Θ(n² log n), since it falls under case two with k = 0. With F(n) = n³, the solution is Θ(n³), since it falls under case three. The tricky one is F(n) = n²/log n, which does not fall under any case of the master method; its solution is Θ(n² log log n).

  • 00:15:00 In this section, the speaker mentions the Akra-Bazzi method and how it is more complicated than the master method, which is sufficient for analyzing most programs. They then provide an example of a parallel loop with in-place matrix transpose code where the outer loop is parallelized. The speaker emphasizes the importance of correctly defining the iteration space and load balancing for efficient parallel processing. They also explain how loops are implemented by the OpenCilk compiler and runtime system.

  • 00:20:00 In this section, the speaker describes the recursive program that a parallel loop is turned into: divide and conquer splits the iteration range into two halves and recursively calls itself on each side until it reaches a one-element range. This computation is analyzed in terms of work and span, with the leaf work being order n squared and the upper, loop-control part of the tree being theta n, since a binary tree with n leaves has fewer than n internal nodes. The OpenCilk runtime system does not have a loop primitive, so loops are effectively translated into this parallel divide and conquer.

  • 00:25:00 In this section, the speaker discusses the work and span analysis of a multithreaded algorithm. He explains that the total work is Θ(n²) because constant work is done for each of the n leaves of the complete binary tree. The span of the loop control is log(n), which means that an infinite number of processors could complete the task in logarithmic time. However, since there is a linear component to the calculation (n), the total span is Θ(n). Therefore, the parallelism is Θ(n), which is a good amount for most practical systems. The speaker then explores the scenario in which the inner loop is parallelized as well, and discusses the resulting amount of work.

  • 00:30:00 In this section of the video, the professor discusses the concept of work efficiency in parallel algorithms. Parallelizing algorithms does not change the work, but it should hopefully reduce the span of the calculation in order to achieve big parallelism. However, sometimes parallelizing can result in adding more work, which can be problematic because it slows down the program compared to the original algorithm. Work-efficient parallel algorithms are what the professor is interested in teaching, as they are able to reduce the span to achieve big parallelism without adding more work. The professor emphasizes the importance of asking questions and making mistakes in the learning process, as it can help others who may have the same questions but are afraid to ask. He encourages students to participate and engage in the learning process, even if they are not in the top 10%.

  • 00:35:00 In this section, the importance of empathy in communication when it comes to technical topics is emphasized. The professor encourages students to be patient with others in the class who may not be familiar with the material being covered. Moving on to the analysis of multithreaded algorithms, a divide and conquer algorithm for vector addition is presented, and the parallelism is found to be n over log n. The relationship between parallelism and the number of processors is discussed, with the professor emphasizing that while more parallelism may seem better, it is only beneficial up to a certain threshold.

  • 00:40:00 In this section, the speaker discusses optimizing away some of the overhead in a multithreaded algorithm, specifically the subroutine call overhead. They introduce the concept of "grain size," which suggests to the compiler that there be up to G elements per chunk at the leaves of the divide-and-conquer process. This allows for the subroutine overhead to be amortized across G iterations rather than just one, resulting in better performance. If the grain size isn't specified, the system makes its best guess to minimize overhead. The speaker breaks down the variables I, G, and s to analyze the work in this context and compares it to the work in the original code without the parallel control stuff.

  • 00:45:00 In this section, the speaker talks about analyzing multithreaded algorithms and how the loop control is cheap on a modern out of order processor. They look at the span, which is the time it takes to execute the algorithm with an infinite number of processors, and discuss the overhead cost versus the cost of the iterations. The speaker breaks down the span in terms of the number of leaves and the colorful operations, referred to as 's', that need to be performed on each of them. They also answer questions on the number of internal nodes in a complete binary tree and the paths taken to calculate the span.

  • 00:50:00 In this section, the speaker discusses the analysis of multithreaded algorithms, particularly the concept of a DAG (directed acyclic graph) and how it affects the path of the algorithm. They emphasize the need for independence between different subtrees of the DAG in order to allow for parallel processing. The goal is to balance the work and span of the algorithm, as these two factors push in opposite directions. The key is to have G (the grain size) be much greater than S/I (the ratio of the spawn overhead S to the time I for one iteration of the loop body), which keeps the overhead small relative to the useful work. The speaker also mentions an implementation of this concept, but notes that it will be discussed later in the lecture series.

  • 00:55:00 In this section, the speaker presents an implementation of multithreaded algorithms using a for-loop to add two vectors. The loop uses a serial function called vadd to perform the vector addition, and the loop spawns off the additions in groups of G (a strip-mined version is sketched after this list). Assuming G equals one, the work of the loop is order n and the span is also order n, so the parallelism, the ratio of work to span, simplifies to 1. The speaker emphasizes that the goal of analyzing multithreaded algorithms is to increase parallelism and decrease span to minimize runtime, and highlights the technique called strip mining, where a loop of length n is replaced by a doubly nested loop, as one way to improve multithreaded loops.

  • 01:00:00 In this section, the speaker analyzes the performance of multithreaded algorithms in terms of work and span. They discuss how to minimize span to maximize parallelism, as the span is in the denominator. The key is to generate ten times more parallelism than processors if one wants near perfect linear speed-up. The speaker also suggests trading some parallelism off to reduce the work overhead if necessary. They encourage fiddling with the grain size to do so, as long as sufficient parallelism is retained.

  • 01:05:00 In this section, the speaker discusses the use of divide and conquer strategies in multi-threaded algorithms, advising against spawning numerous small tasks one after the other, unless they are chunked to a few tasks, in which case the overhead is fine. Generally, there should be a divide and conquer approach, where the number of tasks given to threads is sufficiently large. The speaker warns to watch out for scheduling overheads and suggests parallelizing the outer loop instead of the inner loops, which have more overheads. The example codes presented here show an efficient one, where scheduling occurs only once, and a lousy one, where overhead occurs n times. Matrix multiplication is used as an example of a multi-threaded algorithm using a divide and conquer strategy to multiply n by n matrices by doing 8 multiplications of n over 2 by n over 2 matrices, and then adding two n by n matrices.

  • 01:10:00 In this section, the speaker discusses how to represent sub-matrices with two-dimensional indexing, particularly in row-major and column-major order for languages like C and Fortran. The idea is that every two-dimensional matrix can be indexed as a one-dimensional array, and when dealing with a sub-matrix one needs to know the row length (stride) of the enclosing matrix: the (i, j) element lives at offset i times that row length plus j from the sub-matrix's starting corner (an indexing macro is sketched after this list). Furthermore, when implementing divide and conquer, knowing the starting corners of each of the four sub-matrices is crucial. Ultimately, the speaker emphasizes using 'restrict' in the divide and conquer matrix multiplication code to tell the compiler which pointers cannot alias.

  • 01:15:00 In this section, the speaker discusses a multithreaded algorithm for matrix multiplication. The algorithm is based on recursive subdivision of the matrices into smaller submatrices, with each submatrix being recursively multiplied to obtain the final result. The speaker highlights a bit trick to quickly determine whether n is a power of two, and uses a macro to simplify the indexing of the matrices. The algorithm exhibits high parallelism, which can be reduced to improve performance. Other similar algorithms are also mentioned.
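
For the master-method discussion at 00:00:00 and 00:05:00 above, the recurrence and its three cases can be written as follows; this is the standard statement, and the lecture's phrasing may differ slightly.

    T(n) = a\,T(n/b) + f(n)

    \text{Case 1: } f(n) = O\!\left(n^{\log_b a - \varepsilon}\right) \;\Rightarrow\; T(n) = \Theta\!\left(n^{\log_b a}\right)
    \text{Case 2: } f(n) = \Theta\!\left(n^{\log_b a} \lg^{k} n\right),\ k \ge 0 \;\Rightarrow\; T(n) = \Theta\!\left(n^{\log_b a} \lg^{k+1} n\right)
    \text{Case 3: } f(n) = \Omega\!\left(n^{\log_b a + \varepsilon}\right) \text{ plus a regularity condition} \;\Rightarrow\; T(n) = \Theta(f(n))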
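
For the vector-addition and grain-size discussion at 00:55:00 through 01:00:00 above, a sketch of the strip-mined loop looks like this; the helper name v_add and the grain size G are illustrative, not the lecture's exact code or values.

    #include <cilk/cilk.h>

    // Serial vector addition over a chunk of length m.
    static void v_add(double *c, const double *a, const double *b, int m) {
        for (int j = 0; j < m; ++j) c[j] = a[j] + b[j];
    }

    #define G 2048   // grain size: iterations per spawned chunk (illustrative value)

    void v_add_parallel(double *c, const double *a, const double *b, int n) {
        // Strip mining: one loop of length n becomes a parallel outer loop over
        // roughly n/G chunks and a serial inner loop of length at most G, so the
        // spawn overhead is amortized over G iterations.
        cilk_for (int i = 0; i < n; i += G) {
            int m = (n - i < G) ? (n - i) : G;
            v_add(c + i, a + i, b + i, m);
        }
    }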
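
For the sub-matrix indexing discussion at 01:10:00 above, a minimal sketch of row-major indexing with an explicit stride; the macro and function names are illustrative and differ from the lecture's.

    #include <stddef.h>

    // Row-major indexing: element (i, j) of a sub-matrix whose top-left corner is
    // at pointer M, inside a larger matrix whose full row length (stride) is nn.
    #define ELEM(M, nn, i, j) ((M)[(size_t)(i) * (nn) + (j)])

    // Zero out an m x m sub-matrix of a larger matrix with nn columns.
    void zero_block(double *M, size_t nn, size_t m) {
        for (size_t i = 0; i < m; ++i)
            for (size_t j = 0; j < m; ++j)
                ELEM(M, nn, i, j) = 0.0;
    }

Under this scheme the four n/2 x n/2 quadrants of an n x n row-major matrix A start at A, A + n/2, A + (n/2)*n, and A + (n/2)*n + n/2.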
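
For the bit trick mentioned at 01:15:00 above, the standard test for a power of two clears the lowest set bit and checks that nothing remains:

    #include <stdbool.h>
    #include <stdint.h>

    // n is a power of two exactly when it has a single set bit:
    // n & (n - 1) clears the lowest set bit and must leave zero.
    static inline bool is_power_of_two(uint64_t n) {
        return n != 0 && (n & (n - 1)) == 0;
    }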
8. Analysis of Multithreaded Algorithms
  • 2019.09.23
  • www.youtube.com
MIT 6.172 Performance Engineering of Software Systems, Fall 2018. Instructor: Charles Leiserson. View the complete course: https://ocw.mit.edu/6-172F18. YouTube Pl...
 

Lecture 9. What Compilers Can and Cannot Do



Lecture 9. What Compilers Can and Cannot Do

This video discusses how compilers can optimize code, using various examples. It explains how compilers can eliminate unnecessary storage and memory references, and how they can optimize loops to improve performance.

This video also explains the concept of compiler optimization passes, and how they can be used to improve code performance. It also discusses the limitations of compilers, specifically in regards to vectorization. Compilers can only vectorize code if they can determine that the pointers in the code will not alias. If the pointers do alias, the compiler will not be able to vectorize the code. As a programmer, you can help the compiler by annotating your pointers to provide more information about their use.

  • 00:00:00 In this lecture, Tao Schardl discusses what compilers can and cannot do. He explains that studying compiler optimizations can help developers understand how to write code that takes advantage of optimizations, and how to avoid manually optimizing code that the compiler can already optimize.

  • 00:05:00 The compiler is a big piece of software, and understanding what it does can absolutely help with debugging, including when the compiler itself has bugs. It helps to understand where the compiler might have made a mistake or where the compiler simply didn't do what you thought it should be able to do.

  • 00:10:00 The compiler can do a lot of optimizations, but there are some things it can't do. For example, it can't always create a fast path or optimize loops.

  • 00:15:00 The compiler can do a lot to optimize code, but there are some things it cannot do well. One example is data structures - the compiler is not smart about them in the same way as it is about functions. Another is loops - the compiler can do some optimizations on them, but there are restrictions.

  • 00:20:00 This YouTube video explains how compilers can use simple operations to replace more complex ones. For example, to replace a division operation, a compiler can multiply by a magic number and then shift the result (a worked example of this trick appears after this list). The video also explains how address calculation works.

  • 00:25:00 The video discusses how compilers can optimize code, using a simple "vex scale" function as an example. The code is first shown with no optimizations, and then with various optimizations applied. The optimizations shown include reducing the number of instructions, using vector instructions, and using magic numbers.

  • 00:30:00 This video discusses how compilers can optimize code by eliminating unnecessary memory operations. In particular, it shows how a compiler can optimize a function that multiplies two scalar values by eliminating the need to store the values in memory. Finally, it shows how a compiler can optimize a function that operates on a structure by eliminating the need to store the structure in memory.

  • 00:35:00 In this YouTube video, the speaker explains how compilers can optimize code by eliminating unnecessary storage and memory references. He uses the example of a structure with multiple fields, and demonstrates how the compiler can optimize the code by carefully analyzing the math involved in the address calculations. By understanding what each instruction does, the compiler can optimize the code and remove unnecessary operations.

  • 00:40:00 The compiler works hard to transform data structures and scalar values to store as much as it possibly can purely within registers and avoid using any local storage if possible. For function calls, the compiler sees the call instruction and the code for the function being called. It then tries to make the function in the code above even faster.

  • 00:45:00 The compiler can inline code to avoid function call overhead, and it can also optimize code by removing unnecessary operations.

  • 00:50:00 In this YouTube video, the speaker explains that there are three main reasons why a compiler may not inline a function: if the function is recursive, if the function is in a different compilation unit, or if the function could increase code size and hurt performance. The speaker also gives some tips for controlling function inlining in your own programs.

  • 00:55:00 This video looks at how compilers can optimize loops to make programs run faster. Code hoisting, or loop invariant code motion, is one optimization that can be used to improve performance.

  • 01:00:00 The compiler can optimize code by performing a sequence of transformation passes. These passes are designed to find properties like identical address calculations and replace them with more efficient code.

  • 01:05:00 The compiler is able to vectorize this loop even though it does not know whether the addresses of 'x' and 'y' overlap: it emits both a vectorized and an unvectorized version together with a runtime check for overlap, which is why the structure of the compiled code is more complicated than the two-line function that was given as input.

  • 01:10:00 This YouTube video explains what compilers can and cannot do. Compilers can vectorize code if the code does not contain any loops. However, if the code does contain loops, the compiler can only vectorize the code if the loop is full of vector operations. If the loop is not full of vector operations, the compiler will not be able to vectorize the code.

  • 01:15:00 The question of whether or not a compiler can vectorize code is a difficult one, as it depends on a number of factors. In particular, the compiler needs to be able to determine whether or not two pointers will alias, which can be a difficult task. As a programmer, you can help the compiler by annotating your pointers to provide more information about their use.
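
For the magic-number division mentioned at 00:20:00 above, one well-known instance is unsigned division by 3; the constant 0xAAAAAAAB is the rounded-up value of 2^33/3, and the constants used in the lecture may differ.

    #include <assert.h>
    #include <stdint.h>

    // Unsigned division by 3 via multiply-and-shift: 0xAAAAAAAB is ceil(2^33 / 3),
    // so (x * 0xAAAAAAAB) >> 33 equals x / 3 for every 32-bit unsigned x.
    static inline uint32_t div3(uint32_t x) {
        return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
    }

    int main(void) {
        for (uint32_t x = 0; x < 1000000; ++x)
            assert(div3(x) == x / 3);   // spot-check the identity on a range of inputs
        return 0;
    }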
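
For the loop-invariant code motion (hoisting) mentioned at 00:55:00 above, a minimal before/after sketch; the function names are illustrative.

    #include <math.h>

    // Before: sqrt(scale) is recomputed on every iteration even though it
    // does not depend on i.
    void scale_before(double *y, const double *x, double scale, int n) {
        for (int i = 0; i < n; ++i)
            y[i] = x[i] * sqrt(scale);
    }

    // After loop-invariant code motion: the invariant computation is hoisted
    // out of the loop, which is what the compiler tries to do automatically.
    void scale_after(double *y, const double *x, double scale, int n) {
        double s = sqrt(scale);
        for (int i = 0; i < n; ++i)
            y[i] = x[i] * s;
    }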
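
For the aliasing discussion at 01:15:00 above, a small sketch of the kind of annotation being described: marking the pointers restrict (and const where appropriate) promises the compiler that the arrays do not overlap, which lets it vectorize without a runtime aliasing check. The function name is illustrative.

    // Without restrict the compiler must assume y and x may overlap and either
    // give up on vectorization or emit a runtime overlap check.
    void daxpy(double *restrict y, const double *restrict x, double a, int n) {
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }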
9. What Compilers Can and Cannot Do
  • 2019.09.23
  • www.youtube.com
MIT 6.172 Performance Engineering of Software Systems, Fall 2018. Instructor: Tao B. Schardl. View the complete course: https://ocw.mit.edu/6-172F18. YouTube Playl...