Arvind Sudarsanam

Greater Boston

About

I am an experienced compiler engineer and have worked with multiple compilers including…

Experience & Education

  • Intel Corporation

Publications

  • Data flow computation for greater parallelism

    Proceedings of the High Performance Embedded Computing workshop (HPEC)

    This paper proposed three changes that can be made to the current CMP hardware architectures and software methodologies to more efficiently take advantage of multi-core processors. These changes, though non-trivial from the hardware and tool developer's point of view, would not require the development, training, and use of a new software programming paradigm to use CMPs more efficiently, as is often proposed. Changing compute hardware and compilation methodologies is far easier than changing users.

  • Memory architecture template for Fast Block Matching algorithms on FPGAs

    Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing

    Fast Block Matching (FBM) algorithms for video compression are well suited for acceleration using a parallel data-path architecture on Field Programmable Gate Arrays (FPGAs). However, designing an efficient on-chip memory subsystem that provides the required throughput to this parallel data-path architecture is a complex problem. This paper proposes a memory architecture template that is explored using a Bounded Set algorithm to design efficient on-chip memory subsystems for FBM algorithms. The resulting memory subsystems are compared with three existing memory subsystems. Results show that our memory subsystems can provide full parallelism in the majority of test cases and can process integer pixels of a 1080p video sequence at up to 275 frames per second.
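
    As background for the data-access problem the template solves: block matching estimates a motion vector by minimizing a cost, typically the sum of absolute differences (SAD), over candidate positions in a reference frame, and FBM algorithms prune this candidate set. Below is a minimal full-search software reference, assuming numpy and illustrative block/search-range sizes; the paper's contribution is the memory subsystem that feeds the overlapping search windows, not this algorithm.

    ```python
    import numpy as np

    def sad(a, b):
        """Sum of absolute differences between two pixel blocks."""
        return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

    def full_search(cur, ref, bx, by, n=16, radius=8):
        """Motion vector for the n x n block at (bx, by) of `cur`, found by
        exhaustively testing every candidate within +/- radius pixels in `ref`."""
        blk = cur[by:by + n, bx:bx + n]
        best = (0, 0, float("inf"))
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                y, x = by + dy, bx + dx
                if 0 <= y <= ref.shape[0] - n and 0 <= x <= ref.shape[1] - n:
                    cost = sad(blk, ref[y:y + n, x:x + n])
                    if cost < best[2]:
                        best = (dx, dy, cost)
        return best  # (dx, dy, best SAD)
    ```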

  • A Power Efficient Linear Equation Solver on a Multi-FPGA Accelerator

    International Journal of Computers and Applications

    This paper presents an approach to exploiting a commercial multi field programmable gate array (FPGA) system as a high-performance accelerator, addressing the problem of solving an LU-decomposed linear system of equations using forward and back substitution. A block-based right-hand-side solver algorithm is described, and novel data flow and memory architectures that can support arbitrary data types, block sizes, and matrix sizes are proposed. These architectures have been implemented on a multi-FPGA system. The capabilities of the accelerator system are pushed to their limits by implementing the problem for double-precision complex floating-point data. Detailed timing data is presented and augmented with data from a performance model proposed in this paper. The performance of the accelerator system is evaluated against that of a state-of-the-art low-power Beowulf cluster node running an optimized LAPACK implementation. Both systems are compared using the power efficiency (performance/watt) metric. The FPGA system is about eleven times more power efficient than the compute node of the cluster.
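
    The core kernel being accelerated is the triangular solve. A minimal software sketch of solving LUx = b by forward and back substitution, assuming numpy, a unit-diagonal L (the usual LAPACK convention), and double-precision complex data to match the paper; the paper's version is block-based and runs on multi-FPGA hardware:

    ```python
    import numpy as np

    def solve_lu(L, U, b):
        """Solve L U x = b: forward substitution for L y = b, then
        back substitution for U x = y."""
        n = len(b)
        y = np.zeros(n, dtype=np.complex128)
        for i in range(n):                     # forward: L assumed unit lower triangular
            y[i] = b[i] - L[i, :i] @ y[:i]
        x = np.zeros(n, dtype=np.complex128)
        for i in range(n - 1, -1, -1):         # back: divide by U's diagonal
            x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
        return x
    ```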

  • Analysis and Design of a Context Adaptable SAD/MSE Accelerator

    International Journal of Reconfigurable Computing

    Design of flexible multimedia accelerators that can cater to multiple algorithms is being aggressively pursued in the media processor community. Such an approach is justified in the era of sub-45 nm technology, where an increasingly dominant leakage power component is forcing designers to make the best possible use of on-chip resources. In this paper we present an analysis of two commonly used window-based operations (sum of absolute differences and mean squared error) across a variety of search patterns and block sizes. We propose a context adaptable architecture that has (i) a configurable 2D systolic array and (ii) a 2D Configurable Register Array (CRA). The CRA can cater to variable pixel access patterns while reusing fetched pixels across search windows. Benefits of the proposed architecture when compared to 15 other published architectures are adaptability, high throughput, and low latency, at the cost of an increased footprint when ported to a Xilinx FPGA.
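
    Both window operations reduce a per-pixel difference over a block, which is what makes a single adaptable datapath viable: only the per-pixel operator and the final scaling differ. A numpy sketch of the two operations (block contents and shapes are illustrative assumptions; the paper implements them on the configurable systolic array):

    ```python
    import numpy as np

    def window_op(cur, ref, per_pixel, reduce_mean=False):
        """Apply a per-pixel operator to the block difference, then reduce."""
        d = per_pixel(cur.astype(np.int32) - ref.astype(np.int32))
        return d.mean() if reduce_mean else d.sum()

    def sad(cur, ref):
        return window_op(cur, ref, np.abs)                       # sum of absolute differences

    def mse(cur, ref):
        return window_op(cur, ref, np.square, reduce_mean=True)  # mean squared error
    ```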

  • Dynamically Reconfigurable Systolic Array Accelerators: A Case Study with EKF and DWT Algorithms

    IET Computers & Digital Techniques

    Field programmable gate arrays (FPGAs) are increasingly being adopted as the primary on-board computing system for autonomous deep space vehicles. There is a need to support several complex applications for navigation and image processing in a rapidly responsive on-board FPGA-based computer. Developing such a computer requires the designer to explore and combine several design concepts such as systolic array (SA) design, hardware-software partitioning, and partial dynamic reconfiguration (PDR). In this study a microprocessor/co-processor design that can simultaneously accelerate multiple single-precision floating-point algorithms is proposed. Two such algorithms are the extended Kalman filter (EKF) and the discrete wavelet transform (DWT). Key contributions include (i) a polymorphic systolic array (PolySA) comprising partially reconfigurable regions that can accelerate algorithms amenable to being mapped onto linear SAs, and (ii) a performance model to predict the overall execution time of the EKF algorithm on the proposed PolySA architecture. When implemented on a low-end Xilinx Virtex4 SX35 FPGA, the design provides a speed-up of at least 4.18x and 6.61x over a state-of-the-art microprocessor used in spacecraft systems for the EKF and DWT algorithms, respectively. The performance of the EKF algorithm on the proposed PolySA architecture was compared against its performance on two types of conventional (non-polymorphic) hardware architectures, and the results showed that the proposed architecture outperformed the other two in most of the test cases.
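
    Of the two accelerated kernels, the DWT is the simpler to illustrate. A one-level 1-D Haar transform in single precision is sketched below as a software stand-in; Haar is an assumption for illustration, since the abstract does not restate the paper's wavelet filter choice:

    ```python
    import numpy as np

    def haar_dwt_1d(signal):
        """Split an even-length signal into approximation and detail bands."""
        s = np.asarray(signal, dtype=np.float32)   # single precision, as in the paper
        even, odd = s[0::2], s[1::2]
        approx = (even + odd) / np.sqrt(2.0)       # low-pass band
        detail = (even - odd) / np.sqrt(2.0)       # high-pass band
        return approx, detail
    ```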

  • PRR-PRR Dynamic Relocation

    IEEE Computer Architecture Letters

    Partial bitstream relocation (PBR) on FPGAs has been gaining attention in recent years as a potentially promising technique to scale the parallelism of accelerator architectures at run time, enhance fault tolerance, etc. PBR techniques to date have focused on reading inactive bitstreams stored in memory, on-chip or off-chip, whose contents are generated for a specific partial reconfiguration region (PRR) and modified on demand for configuration into a PRR at a different location. As an alternative, we propose a PRR-PRR relocation technique to generate source and destination addresses, read the bitstream from an active PRR (source) in a non-intrusive manner, and write it to the destination PRR. We describe two options for realizing this on Xilinx Virtex 4 FPGAs: (a) a hardware-based accelerated relocation circuit (ARC) and (b) a software solution executed on a MicroBlaze. A comparative performance analysis highlighting the speed-up obtained using ARC is presented. For real test cases, the performance of our implementations is compared to the estimated performance of two state-of-the-art methods.

  • Methodology to Derive Polymorphic Soft-IP Cores for FPGAs

    IET Computers & Digital Techniques

    The configurable nature of field-programmable gate arrays (FPGAs) has allowed designers to take advantage of various data flow characteristics in application kernels to create custom architecture implementations, by optimising instruction-level parallelism (ILP) and pipelining at the register transfer level. However, not all applications are composed of pure data flow kernels. The intermingling of control and data flows in applications offers more interesting challenges in creating custom architectures. The authors present one possible way to take advantage of correlations that may be present among data flow graphs (DFGs) embedded in control flow graphs. In certain cases, where there is sufficient correlation and ILP, the proposed context adaptable architecture (CAA) design methodology results in an interesting and useful custom architecture for such embedded DFGs. Certain other application characteristics may demand the use of alternative methodologies such as partial and dynamic reconfiguration (PDR) or a mixture of PDR and common sub-graph methods (PDR-CSG). The authors present a rigorous analysis, combined with benchmarking efforts, to showcase the differences, advantages, and disadvantages of the CAA methodology relative to the other methodologies. The authors also present an analysis of how the core algorithm underlying their methodology compares with other published algorithms, and of the differences in the resulting designs on an FPGA for a sample set of test cases.

  • Performance of LU decomposition on a Multi-FPGA System Compared to a Low Power Commodity Microprocessor System

    Scalable Computing: Practice and Experience

    Lower/Upper triangular (LU) factorization plays an important role in scientific and high performance computing. This paper presents an implementation of the LU decomposition algorithm for double-precision complex numbers on a star-topology-based multi-FPGA platform. The out-of-core implementation moves data through multiple levels of a hierarchical memory system (hard disk, DDR SDRAMs, and FPGA block RAMs) using completely pipelined data paths in all steps of the algorithm. Detailed performance numbers for all phases of the algorithm are presented and compared to a highly optimized implementation for a low-power microprocessor-based system. We also compare the performance/watt for the FPGA and microprocessor systems. Finally, recommendations are given on how improvements to the FPGA design would increase the performance of the double-precision complex LU factorization on the FPGA-based system.
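
    For reference, the factorization itself is a right-looking elimination over the trailing submatrix. The sketch below is in-core, unblocked, and without pivoting, all simplifying assumptions relative to the paper's out-of-core blocked design:

    ```python
    import numpy as np

    def lu_inplace(A):
        """Right-looking LU without pivoting: on return, the upper triangle
        holds U and the strict lower triangle holds L (unit diagonal implied).
        Double-precision complex, matching the paper's data type."""
        A = np.array(A, dtype=np.complex128)
        n = A.shape[0]
        for k in range(n):
            A[k + 1:, k] /= A[k, k]                                    # column of L
            A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])  # trailing update
        return A
    ```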

  • Memory Support Design for LU Decomposition on Starbridge Hypercomputer

    IEEE Proceedings of the conference on Field Programmable Technology (FPT)

    LU matrix decomposition is a linear algebra algorithm used to reduce the complexity required to solve a large system of linear equations. Large systems of equations frequently need to be solved in physics, engineering, and computational chemistry. In hardware implementations of such LU algorithms, supporting modules must be included to handle the transfer of data between the disk and the processing nodes. This paper looks at the data transfer hardware that supports an implementation of a block-based LU algorithm on a multi-FPGA system. Preliminary results are provided which show the required areas and latencies of these designs.

  • Multi-FPGA based High Performance LU Decomposition

    Proceedings of the High Performance Embedded Computing workshop (HPEC)

    LU Decomposition is a linear algebra routine that is used to bring down the complexity of solving a system of linear equations with multiple right-hand sides. Its applications can be found in computational physics (modeling 2-D structures), image processing, and computational chemistry (design and analysis of molecular structures). This paper investigates the hardware-software co-design of a large-scale block-based LU decomposition algorithm on the Starbridge Hypercomputer. Results are shown for a double-precision complex matrix of size 1024x1024 implemented on a system comprising a single PC connected to 'N' FPGAs via a single PCI bus. Performance results and comparisons with a cluster will be provided at the time of presentation at the conference.

  • Design of Embedded Compute Intensive Processing Elements and their Scheduling in a Reconfigurable Environment

    Canadian Journal of Electrical and Computer Engineering (CJECE)

    This paper addresses the problem of accelerating computationally intensive algorithms such as those found in multimedia and graphics applications. A novel methodology to design embedded compute-intensive processing elements (ECIPEs) is proposed. In order to identify common data flow patterns among core data flow graphs (DFGs), a low-complexity, parallelism-aware common subgraph extraction algorithm is proposed. In addition, a reconfiguration-aware static scheduling technique to manage task and resource dependencies is proposed. To validate this approach, estimates of reconfiguration times obtained from several experiments (on an assorted set of algorithms taken from media standards such as MPEG-4 and from frequently used graphics algorithms) are provided, and the potential for reducing the number of reconfiguration cycles is shown.

  • Implementation of Polymorphic Matrix Inversion using Viva

    Military and Aerospace conference on Programmable Logic Devices (MAPLD)/NASA

    The matrix inverse operation is an important module in many scientific applications, including the solution of a system of linear equations, eigenvalue computation, and the training of a neural network. Many of these applications encompass matrices of varying order and data type (fixed, floating, complex, etc.). Introducing a polymorphic feature in the hardware implementation of matrix inverse has an important benefit, as the design's shelf-life is increased and the associated design time is reduced. Polymorphism is a well-known concept in software, but is often ignored in hardware design techniques. Existing software solutions fail to capture the massive parallelism that is available in a typical matrix inverse algorithm in a manner that efficiently balances performance with silicon real estate. FPGAs present an ideal implementation platform for matrix inversion due to their inherent parallel architecture and high inter-connectivity. There have been prior efforts towards the hardware implementation of matrix inverse for FPGA-based systems. Most of these implementations focus on a specific type of matrix (triangular/sparse). In the past, researchers have also proposed methods targeting LU decomposition, but none of these approaches are polymorphic in data type, order of the matrix, or information rate, thereby compelling the designer to restart the design nearly from scratch when the data type or the order of the matrix changes.
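
    The underlying operation is easy to state in software; the paper's contribution is making the hardware version polymorphic. A minimal Gauss-Jordan sketch with partial pivoting (the algorithm choice and the complex dtype here are illustrative assumptions):

    ```python
    import numpy as np

    def invert(A):
        """Invert A by Gauss-Jordan elimination on the augmented matrix [A | I]."""
        n = A.shape[0]
        M = np.hstack([A.astype(np.complex128), np.eye(n, dtype=np.complex128)])
        for k in range(n):
            p = k + np.argmax(np.abs(M[k:, k]))   # partial pivot for stability
            M[[k, p]] = M[[p, k]]
            M[k] /= M[k, k]                       # normalize the pivot row
            for i in range(n):
                if i != k:
                    M[i] -= M[i, k] * M[k]        # eliminate column k elsewhere
        return M[:, n:]
    ```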

  • A Fast and Efficient FPGA-Based Implementation for Solving a System of Linear Interval Equations

    IEEE Proceedings of the conference on Field Programmable Technology (FPT)

    This paper addresses the problem of solving a system of linear interval equations (an NP-hard problem), wherein the coefficients on the LHS and the RHS are all represented using intervals. This problem is transformed into a global optimization problem, and a modified branch and bound algorithm suited for an FPGA-based implementation is proposed. The algorithm is modified to extract parallelism, and further speed-up is achieved by pipelining the implementation. The implementation was designed using Xilinx ISE 6.1, with VHDL as the design entry language. A speed-up of 14x was obtained on a Xilinx Virtex-2P30 FPGA over an implementation on a 1.5 GHz Intel Centrino processor.
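
    The branch and bound pattern being pipelined is: bound each interval box with interval arithmetic, prune boxes whose lower bound exceeds the best known upper bound, and bisect the rest. A minimal 1-D software sketch (the objective, tolerance, and interval extension below are illustrative assumptions, not from the paper):

    ```python
    import heapq

    def interval_f(lo, hi):
        """Interval extension of f(x) = x*x - 4*x: bounds on f over [lo, hi]."""
        sq_lo = 0.0 if lo <= 0.0 <= hi else min(lo * lo, hi * hi)
        sq_hi = max(lo * lo, hi * hi)
        return sq_lo - 4.0 * hi, sq_hi - 4.0 * lo   # [x^2] + [-4x]

    def branch_and_bound(lo, hi, tol=1e-6):
        best_upper = float("inf")                   # best proven upper bound on min f
        heap = [(interval_f(lo, hi)[0], lo, hi)]
        while heap:
            f_lo, a, b = heapq.heappop(heap)
            if f_lo > best_upper:                   # bound: box cannot hold the minimum
                continue
            mid = 0.5 * (a + b)
            best_upper = min(best_upper, interval_f(mid, mid)[1])
            if b - a > tol:                         # branch: bisect the box
                for c, d in ((a, mid), (mid, b)):
                    fl = interval_f(c, d)[0]
                    if fl <= best_upper:
                        heapq.heappush(heap, (fl, c, d))
        return best_upper

    print(branch_and_bound(-10.0, 10.0))            # ~= -4.0, attained at x = 2
    ```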

  • High Level - Application Analysis Techniques & Architectures - to Explore Design possibilities for Reduced Reconfiguration Area Overheads in FPGAs executing Compute Intensive Applications

    Proceedings of the Reconfigurable Architectures Workshop

    This paper proposes a novel common subgraph extraction algorithm which aims to minimize the total number of gates (reconfiguration area overhead) involved in implementing compute-intensive scientific and media applications using reconfigurable architectures. The motivation behind the proposed research is illustrated using an example from the Biochemical Algorithms Library (BALL). The design of novel context adaptable architectures to implement common subgraphs is also proposed, with an example from the video warping functions of the MPEG-4 standard. Three different models for mapping such architectures onto hybrid/pure FPGA systems are proposed. Estimates obtained by applying these techniques and architectures to various media and scientific functions are shown.

  • Novel predicated data flow analysis based memory design for data and control intensive multimedia applications

    Proceedings of the SPIE conference on Electronic Imaging

    There has been an ever-increasing demand for fast and power-efficient solutions for mobile multimedia computing applications. The research discussed in this paper proposes an automated tool-set to design a reconfigurable architecture targeted towards multimedia applications, which are both data and control intensive. One important design step is custom memory design. This paper discusses a novel methodology to design a power-, area-, and time-efficient memory architecture for a given Control Data Flow Graph (CDFG) of an application. It uses the concept of Predicated Data Flow Analysis to obtain the memory requirements of each control path of the CDFG, and a novel algorithm is used to merge these requirements. The final memory architecture is reconfigurable at run time, and a dynamic memory manager has been designed to support it. An illustrative example involving a self-generated CDFG demonstrates the flow of the proposed algorithm. Results for various multimedia algorithms found in the MPEG-4 codec show the effectiveness of this approach over memory design based on conventional Data Flow Analysis techniques.

  • Resource Estimation and Task Scheduling for Multithreaded Reconfigurable Architectures

    Proceedings of the International Conference on Parallel and Distributed Systems


    Reconfigurable computing is an emerging paradigm of research that offers cost-effective solutions for computationally intensive applications through hardware reuse. There is a growing need in this domain for techniques to exploit parallelism inherent in the target application and to schedule the parallelized application. This paper proposes a method to estimate the optimal number of resources through critical path analysis while keeping resource utilization near optimal. We also propose an algorithm to optimally schedule the parallel threads of execution in linear time. Our algorithm is based on the idea of enhanced partial critical path (ePCP) and handles memory latencies and reconfiguration overheads. Results obtained show the effectiveness of our approach over other critical path based methods.

  • Task Scheduling of Control Data Flow Graphs for Reconfigurable Architectures

    Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms

    Task scheduling is an essential part of the design cycle of a reconfigurable hardware implementation for a given application. Most current multimedia applications offer users a large number of variations and hence are control-dominated. Arriving at an optimal schedule for such applications would require a highly complex scheduling algorithm. This paper proposes a low-complexity scheduling algorithm that provides a near-optimal solution. Existing approaches suggest that the Branch and Bound method of scheduling gives the optimal solution, but it is at the same time highly complex. Our approach introduces the concept of an enhanced Partial Critical Path. Our scheduling algorithm generates a near-optimal solution with O(n) complexity. The Branch and Bound algorithm can be run selectively to approach optimality, thus reducing the overall complexity. Special cases involving loops have also been addressed. The effect of reconfiguration on the schedule has been analyzed and a solution has been proposed.

  • Current Trends for Silicon and Embedded Computing Solutions for Automotive Applications

    Proceedings of the convergence conference of SAE

    Automotive applications have started providing functionalities like dynamic navigation and multimedia computing that are both complex and real-time in nature. A single-chip implementation of these applications is possible only if the processing power of the chip is high enough. The advances in silicon technology have been significant, and the silicon real estate available for processing on a single chip is increasing at a rapid pace. Moore's law predicts that by 2005, a billion transistors will reside on a single chip. This makes it possible for automobile designers to provide the end-user with a complete embedded solution.

  • Pattern recognition tool to detect reconfigurable patterns in MPEG4 video processing

    IEEE Proceedings of Parallel and Distributed Processing Symposium

    Current approaches towards building a reconfigurable processor are targeted towards general purpose computing or a limited range of media-specific applications and are not specifically tuned for mobile multimedia applications. The increasing demand for mobile multimedia processing, with stringent constraints for low power, low chip area, and high flexibility at both the encoder and decoder, naturally demands the design and development of a dynamically reconfigurable multimedia processor. We have performed a detailed complexity analysis of the MPEG-4 video coding mode, which has illustrated the potential for reconfigurable computing. We have recently proposed a methodology for designing a reconfigurable media processor. This involves the design of a parser that identifies data/control flow graphs generated from the input assembly code of an UltraSPARC V-9 architecture; a recurring pattern analyzer that uses a clustering-based approach to identify specific sequences of operations that can potentially be implemented in hardware; and finally a count of such modules at every level of granularity, with associated weights based on the complexity of computation and data transfers, used by the partitioner and router. In this paper we propose the design of the parser and pattern recognizer, with results for detecting the reconfigurable patterns in MPEG-4.

Patents

  • Reconfigurable processing

    Issued US PCT/US2004/003609

    A method of producing a reconfigurable circuit device for running a computer program of moderate complexity, such as multimedia processing. Code for the application is compiled into Control Flow Graphs representing distinct parts of the application to be run. Basic blocks are extracted from those Control Flow Graphs and converted to Data Flow Graphs by a compiler utility. From two or more Data Flow Graphs, a largest common subgraph is determined. The largest common subgraph is ASAP scheduled and substituted back into the Data Flow Graphs, which also have been scheduled. The separate Data Flow Graphs containing the scheduled largest common subgraph are converted to data paths that are then combined to form code for operating the application. The largest common subgraph is effected in hardware that is shared among the parts of the application from which the Data Flow Graphs were developed. Scheduling of the overall code is effected for sequencing, providing the fastest run times, and the code is implemented in hardware by partitioning and placement of processing elements on a chip and design of the connective fabric for the design elements.
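
    The ASAP step referenced above is a standard longest-path computation over the data flow graph: each node starts as soon as all its predecessors finish. A minimal sketch with an illustrative toy DFG (the node latencies are assumptions):

    ```python
    def asap(succ, latency):
        """Earliest start time of every node in a DAG; `succ` maps node -> successors."""
        pred = {n: [] for n in succ}
        for n in succ:
            for s in succ[n]:
                pred[s].append(n)
        start = {}
        def walk(n):
            if n not in start:
                start[n] = max((walk(p) + latency[p] for p in pred[n]), default=0)
            return start[n]
        for n in succ:
            walk(n)
        return start

    # Toy DFG: a -> c, b -> c, c -> d
    succ = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
    lat = {"a": 1, "b": 2, "c": 1, "d": 1}
    print(asap(succ, lat))   # {'a': 0, 'b': 0, 'c': 2, 'd': 3}
    ```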

  • Near optimal configurable adder tree for arbitrary shaped 2D block sum of absolute differences (SAD) calculation engine

    Filed US 12/581,482

    Embodiments of a near optimal configurable adder tree for arbitrary shaped 2D block sum of absolute differences (SAD) calculation engine are generally described herein. Other embodiments may be described and claimed. In some embodiments, a configurable two-dimensional adder tree architecture for computing a sum of absolute differences (SAD) for various block sizes up to 16 by 16 comprises a first stage of one-dimensional adder trees and a second stage of one-dimensional adder trees, wherein each one-dimensional adder tree comprises an input routing network, a plurality of adder units, and an output routing network.
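
    A behavioral sketch of the two-stage reduction the claim describes: a 1-D tree collapses each row of absolute differences, and a second 1-D tree collapses the row sums. The block dimensions are parameters; the input/output routing networks and masking for arbitrary shapes are not modeled here.

    ```python
    def tree_sum(vals):
        """Pairwise log-depth reduction, mirroring a hardware adder tree."""
        while len(vals) > 1:
            pairs = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
            vals = pairs + (vals[-1:] if len(vals) % 2 else [])
        return vals[0]

    def block_sad(cur, ref, h, w):
        """SAD of an h x w block: per-pixel |cur - ref|, then the 2D tree."""
        row_sums = [tree_sum([abs(cur[y][x] - ref[y][x]) for x in range(w)])
                    for y in range(h)]            # first stage: one 1-D tree per row
        return tree_sum(row_sums)                 # second stage: tree across row sums
    ```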

Honors & Awards

  • Outstanding Research Assistant

    ECE Department, USU
