Arvind Sudarsanam

Greater Boston

About

I am an experienced compiler engineer and have worked with multiple compilers including…

Experience & Education

  • Intel Corporation

Publications

  • Data flow computation for greater parallelism

    Proceedings of the High Performance Embedded Computing workshop (HPEC)

    This paper proposed three changes that can be made to the current CMP hardware architectures and software methodologies to more efficiently take advantage of multi-core processors. These changes, though non-trivial from the hardware and tool developer's point of view, would not require the development, training, and use of a new software programming paradigm to use CMPs more efficiently, as is often proposed. Changing compute hardware and compilation methodologies is far easier than changing users.

  • Memory architecture template for Fast Block Matching algorithms on FPGAs

    Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing

    Fast Block Matching (FBM) algorithms for video compression are well suited for acceleration using a parallel data-path architecture on Field Programmable Gate Arrays (FPGAs). However, designing an efficient on-chip memory subsystem that provides the required throughput to this parallel data-path architecture is a complex problem. This paper proposes a memory architecture template that is explored using a Bounded Set algorithm to design efficient on-chip memory subsystems for FBM algorithms. The resulting memory subsystems are compared with three existing memory subsystems. Results show that our memory subsystems can provide full parallelism in the majority of test cases and can process integer pixels of a 1080p video sequence at up to 275 frames per second.
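
    As background for the data-access problem the template solves: block matching estimates a motion vector by minimizing a cost, typically the sum of absolute differences (SAD), over candidate positions in a reference frame, and FBM algorithms prune this candidate set. Below is a minimal full-search software reference, assuming numpy and illustrative block/search-range sizes; the paper's contribution is the memory subsystem that feeds the overlapping search windows, not this algorithm.

    ```python
    import numpy as np

    def sad(a, b):
        """Sum of absolute differences between two pixel blocks."""
        return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

    def full_search(cur, ref, bx, by, n=16, radius=8):
        """Motion vector for the n x n block at (bx, by) of `cur`, found by
        exhaustively testing every candidate within +/- radius pixels in `ref`."""
        blk = cur[by:by + n, bx:bx + n]
        best = (0, 0, float("inf"))
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                y, x = by + dy, bx + dx
                if 0 <= y <= ref.shape[0] - n and 0 <= x <= ref.shape[1] - n:
                    cost = sad(blk, ref[y:y + n, x:x + n])
                    if cost < best[2]:
                        best = (dx, dy, cost)
        return best  # (dx, dy, best SAD)
    ```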

  • A Power Efficient Linear Equation Solver on a Multi-FPGA Accelerator

    International Journal of Computers and Applications

    This paper presents an approach to exploiting a commercial multi field programmable gate array (FPGA) system as a high-performance accelerator, addressing the problem of solving an LU-decomposed linear system of equations using forward and back substitution. A block-based right-hand-side solver algorithm is described, and novel data flow and memory architectures that can support arbitrary data types, block sizes, and matrix sizes are proposed. These architectures have been implemented on a multi-FPGA system. The capabilities of the accelerator system are pushed to their limits by implementing the problem for double-precision complex floating-point data. Detailed timing data is presented and augmented with data from a performance model proposed in this paper. The performance of the accelerator system is evaluated against that of a state-of-the-art low-power Beowulf cluster node running an optimized LAPACK implementation. Both systems are compared using the power efficiency (performance/watt) metric. The FPGA system is about eleven times more power efficient than the compute node of the cluster.
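
    The core kernel being accelerated is the triangular solve. A minimal software sketch of solving LUx = b by forward and back substitution, assuming numpy, a unit-diagonal L (the usual LAPACK convention), and double-precision complex data to match the paper; the paper's version is block-based and runs on multi-FPGA hardware:

    ```python
    import numpy as np

    def solve_lu(L, U, b):
        """Solve L U x = b: forward substitution for L y = b, then
        back substitution for U x = y."""
        n = len(b)
        y = np.zeros(n, dtype=np.complex128)
        for i in range(n):                     # forward: L assumed unit lower triangular
            y[i] = b[i] - L[i, :i] @ y[:i]
        x = np.zeros(n, dtype=np.complex128)
        for i in range(n - 1, -1, -1):         # back: divide by U's diagonal
            x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
        return x
    ```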

  • Analysis and Design of a Context Adaptable SAD/MSE Accelerator

    International Journal of Reconfigurable Computing

    Design of flexible multimedia accelerators that can cater to multiple algorithms is being aggressively pursued in the media processor community. Such an approach is justified in the era of sub-45 nm technology, where an increasingly dominant leakage power component is forcing designers to make the best possible use of on-chip resources. In this paper we present an analysis of two commonly used window-based operations (sum of absolute differences and mean squared error) across a variety of search patterns and block sizes. We propose a context adaptable architecture that has (i) a configurable 2D systolic array and (ii) a 2D Configurable Register Array (CRA). The CRA can cater to variable pixel access patterns while reusing fetched pixels across search windows. Benefits of the proposed architecture when compared to 15 other published architectures are adaptability, high throughput, and low latency, at the cost of an increased footprint when ported to a Xilinx FPGA.
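
    Both window operations reduce a per-pixel difference over a block, which is what makes a single adaptable datapath viable: only the per-pixel operator and the final scaling differ. A numpy sketch of the two operations (block contents and shapes are illustrative assumptions; the paper implements them on the configurable systolic array):

    ```python
    import numpy as np

    def window_op(cur, ref, per_pixel, reduce_mean=False):
        """Apply a per-pixel operator to the block difference, then reduce."""
        d = per_pixel(cur.astype(np.int32) - ref.astype(np.int32))
        return d.mean() if reduce_mean else d.sum()

    def sad(cur, ref):
        return window_op(cur, ref, np.abs)                       # sum of absolute differences

    def mse(cur, ref):
        return window_op(cur, ref, np.square, reduce_mean=True)  # mean squared error
    ```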

  • Dynamically Reconfigurable Systolic Array Accelerators: A Case Study with EKF and DWT Algorithms

    IET Computers & Digital Techniques

    Field programmable gate arrays (FPGAs) are increasingly being adopted as the primary on-board computing system for autonomous deep space vehicles. There is a need to support several complex applications for navigation and image processing in a rapidly responsive on-board FPGA-based computer. Developing such a computer requires the designer to explore and combine several design concepts such as systolic array (SA) design, hardware-software partitioning, and partial dynamic reconfiguration (PDR). In this study a microprocessor/co-processor design that can simultaneously accelerate multiple single-precision floating-point algorithms is proposed. Two such algorithms are the extended Kalman filter (EKF) and the discrete wavelet transform (DWT). Key contributions include (i) a polymorphic systolic array (PolySA) comprising partially reconfigurable regions that can accelerate algorithms amenable to being mapped onto linear SAs, and (ii) a performance model to predict the overall execution time of the EKF algorithm on the proposed PolySA architecture. When implemented on a low-end Xilinx Virtex4 SX35 FPGA, the design provides a speed-up of at least 4.18x and 6.61x over a state-of-the-art microprocessor used in spacecraft systems for the EKF and DWT algorithms, respectively. The performance of the EKF algorithm on the proposed PolySA architecture was compared against its performance on two types of conventional (non-polymorphic) hardware architectures, and the results showed that the proposed architecture outperformed the other two in most of the test cases.
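
    Of the two accelerated kernels, the DWT is the simpler to illustrate. A one-level 1-D Haar transform in single precision is sketched below as a software stand-in; Haar is an assumption for illustration, since the abstract does not restate the paper's wavelet filter choice:

    ```python
    import numpy as np

    def haar_dwt_1d(signal):
        """Split an even-length signal into approximation and detail bands."""
        s = np.asarray(signal, dtype=np.float32)   # single precision, as in the paper
        even, odd = s[0::2], s[1::2]
        approx = (even + odd) / np.sqrt(2.0)       # low-pass band
        detail = (even - odd) / np.sqrt(2.0)       # high-pass band
        return approx, detail
    ```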

  • PRR-PRR Dynamic Relocation

    IEEE Computer Architecture Letters

    Partial bitstream relocation (PBR) on FPGAs has been gaining attention in recent years as a potentially promising technique to scale the parallelism of accelerator architectures at run time, enhance fault tolerance, etc. PBR techniques to date have focused on reading inactive bitstreams stored in memory, on-chip or off-chip, whose contents are generated for a specific partial reconfiguration region (PRR) and modified on demand for configuration into a PRR at a different location. As an alternative, we propose a PRR-PRR relocation technique to generate source and destination addresses, read the bitstream from an active PRR (source) in a non-intrusive manner, and write it to the destination PRR. We describe two options for realizing this on Xilinx Virtex 4 FPGAs: (a) a hardware-based accelerated relocation circuit (ARC) and (b) a software solution executed on a MicroBlaze. A comparative performance analysis highlighting the speed-up obtained using ARC is presented. For real test cases, the performance of our implementations is compared to the estimated performance of two state-of-the-art methods.

  • Methodology to Derive Polymorphic Soft-IP Cores for FPGAs

    IET Computers & Digital Techniques

    The configurable nature of field-programmable gate arrays (FPGAs) has allowed designers to take advantage of various data flow characteristics in application kernels to create custom architecture implementations, by optimising instruction-level parallelism (ILP) and pipelining at the register transfer level. However, not all applications are composed of pure data flow kernels. The intermingling of control and data flows in applications offers more interesting challenges in creating custom architectures. The authors present one possible way to take advantage of correlations that may be present among data flow graphs (DFGs) embedded in control flow graphs. In certain cases, where there is sufficient correlation and ILP, the proposed context adaptable architecture (CAA) design methodology results in an interesting and useful custom architecture for such embedded DFGs. Certain other application characteristics may demand the use of alternative methodologies such as partial and dynamic reconfiguration (PDR) or a mixture of PDR and common sub-graph methods (PDR-CSG). The authors present a rigorous analysis, combined with benchmarking efforts, to showcase the differences, advantages, and disadvantages of the CAA methodology relative to the other methodologies. The authors also present an analysis of how the core algorithm underlying their methodology compares with other published algorithms, and of the differences in the resulting designs on an FPGA for a sample set of test cases.

  • Performance of LU decomposition on a Multi-FPGA System Compared to a Low Power Commodity Microprocessor System

    Scalable Computing: Practice and Experience

    Lower/Upper triangular (LU) factorization plays an important role in scientific and high performance computing. This paper presents an implementation of the LU decomposition algorithm for double-precision complex numbers on a star-topology-based multi-FPGA platform. The out-of-core implementation moves data through multiple levels of a hierarchical memory system (hard disk, DDR SDRAMs, and FPGA block RAMs) using completely pipelined data paths in all steps of the algorithm. Detailed performance numbers for all phases of the algorithm are presented and compared to a highly optimized implementation for a low-power microprocessor-based system. We also compare the performance/watt for the FPGA and microprocessor systems. Finally, recommendations are given on how improvements to the FPGA design would increase the performance of the double-precision complex LU factorization on the FPGA-based system.
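
    For reference, the factorization itself is a right-looking elimination over the trailing submatrix. The sketch below is in-core, unblocked, and without pivoting, all simplifying assumptions relative to the paper's out-of-core blocked design:

    ```python
    import numpy as np

    def lu_inplace(A):
        """Right-looking LU without pivoting: on return, the upper triangle
        holds U and the strict lower triangle holds L (unit diagonal implied).
        Double-precision complex, matching the paper's data type."""
        A = np.array(A, dtype=np.complex128)
        n = A.shape[0]
        for k in range(n):
            A[k + 1:, k] /= A[k, k]                                    # column of L
            A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])  # trailing update
        return A
    ```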

  • Memory Support Design for LU Decomposition on Starbridge Hypercomputer

    IEEE Proceedings of the conference on Field Programmable Technology (FPT)

    LU matrix decomposition is a linear algebra algorithm used to reduce the complexity required to solve a large system of linear equations. Large systems of equations frequently need to be solved in physics, engineering, and computational chemistry. In hardware implementations of such LU algorithms, supporting modules must be included to handle the transfer of data between the disk and the processing nodes. This paper looks at the data transfer hardware that supports an implementation of a block-based LU algorithm on a multi-FPGA system. Preliminary results are provided which show the required areas and latencies of these designs.

  • Multi-FPGA based High Performance LU Decomposition

    Proceedings of the High Performance Embedded Computing workshop (HPEC)

    LU Decomposition is a linear algebra routine that is used to bring down the complexity of solving a system of linear equations with multiple right-hand sides. Its applications can be found in computational physics (modeling 2-D structures), image processing, and computational chemistry (design and analysis of molecular structures). This paper investigates the hardware-software co-design of a large-scale block-based LU decomposition algorithm on the Starbridge Hypercomputer. Results are shown for a double-precision complex matrix of size 1024x1024 implemented on a system comprising a single PC connected to 'N' FPGAs via a single PCI bus. Performance results and comparisons with a cluster will be provided at the time of presentation at the conference.

  • Design of Embedded Compute Intensive Processing Elements and their Scheduling in a Reconfigurable Environment

    Canadian Journal of Electrical and Computer Engineering (CJECE)

    This paper addresses the problem of accelerating computationally intensive algorithms such as those found in multimedia and graphics applications. A novel methodology to design embedded compute-intensive processing elements (ECIPEs) is proposed. In order to identify common data flow patterns among core data flow graphs (DFGs), a low-complexity, parallelism-aware common subgraph extraction algorithm is proposed. In addition, a reconfiguration-aware static scheduling technique to manage task and resource dependencies is proposed. To validate this approach, estimates of reconfiguration times obtained from several experiments (on an assorted set of algorithms taken from media standards such as MPEG-4 and from frequently used graphics algorithms) are provided, and the potential for reducing the number of reconfiguration cycles is shown.

  • Implementation of Polymorphic Matrix Inversion using Viva

    Military and Aerospace conference on Programmable Logic Devices (MAPLD)/NASA

    The matrix inverse operation is an important module in many scientific applications, including the solution of a system of linear equations, eigenvalue computation, and the training of a neural network. Many of these applications encompass matrices of varying order and data type (fixed, floating, complex, etc.). Introducing a polymorphic feature in the hardware implementation of matrix inverse has an important benefit, as the design's shelf-life is increased and the associated design time is reduced. Polymorphism is a well-known concept in software, but is often ignored in hardware design techniques. Existing software solutions fail to capture the massive parallelism that is available in a typical matrix inverse algorithm in a manner that efficiently balances performance with silicon real estate. FPGAs present an ideal implementation platform for matrix inversion due to their inherent parallel architecture and high inter-connectivity. There have been prior efforts towards the hardware implementation of matrix inverse for FPGA-based systems. Most of these implementations focus on a specific type of matrix (triangular/sparse). In the past, researchers have also proposed methods targeting LU decomposition, but none of these approaches are polymorphic in data type, order of the matrix, or information rate, thereby compelling the designer to restart the design nearly from scratch when the data type or the order of the matrix changes.
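
    The underlying operation is easy to state in software; the paper's contribution is making the hardware version polymorphic. A minimal Gauss-Jordan sketch with partial pivoting (the algorithm choice and the complex dtype here are illustrative assumptions):

    ```python
    import numpy as np

    def invert(A):
        """Invert A by Gauss-Jordan elimination on the augmented matrix [A | I]."""
        n = A.shape[0]
        M = np.hstack([A.astype(np.complex128), np.eye(n, dtype=np.complex128)])
        for k in range(n):
            p = k + np.argmax(np.abs(M[k:, k]))   # partial pivot for stability
            M[[k, p]] = M[[p, k]]
            M[k] /= M[k, k]                       # normalize the pivot row
            for i in range(n):
                if i != k:
                    M[i] -= M[i, k] * M[k]        # eliminate column k elsewhere
        return M[:, n:]
    ```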

  • A Fast and Efficient FPGA-Based Implementation for Solving a System of Linear Interval Equations

    IEEE Proceedings of the conference on Field Programmable Technology (FPT)

    This paper addresses the problem of solving a system of linear interval equations (an NP-hard problem), wherein the coefficients on the LHS and the RHS are all represented using intervals. This problem is transformed into a global optimization problem, and a modified branch and bound algorithm suited for an FPGA-based implementation is proposed. The algorithm is modified to extract parallelism, and further speed-up is achieved by pipelining the implementation. The implementation was designed using Xilinx ISE 6.1, with VHDL as the design entry language. A speed-up of 14x was obtained on a Xilinx Virtex-2P30 FPGA over an implementation on a 1.5 GHz Intel Centrino processor.
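
    The branch and bound pattern being pipelined is: bound each interval box with interval arithmetic, prune boxes whose lower bound exceeds the best known upper bound, and bisect the rest. A minimal 1-D software sketch (the objective, tolerance, and interval extension below are illustrative assumptions, not from the paper):

    ```python
    import heapq

    def interval_f(lo, hi):
        """Interval extension of f(x) = x*x - 4*x: bounds on f over [lo, hi]."""
        sq_lo = 0.0 if lo <= 0.0 <= hi else min(lo * lo, hi * hi)
        sq_hi = max(lo * lo, hi * hi)
        return sq_lo - 4.0 * hi, sq_hi - 4.0 * lo   # [x^2] + [-4x]

    def branch_and_bound(lo, hi, tol=1e-6):
        best_upper = float("inf")                   # best proven upper bound on min f
        heap = [(interval_f(lo, hi)[0], lo, hi)]
        while heap:
            f_lo, a, b = heapq.heappop(heap)
            if f_lo > best_upper:                   # bound: box cannot hold the minimum
                continue
            mid = 0.5 * (a + b)
            best_upper = min(best_upper, interval_f(mid, mid)[1])
            if b - a > tol:                         # branch: bisect the box
                for c, d in ((a, mid), (mid, b)):
                    fl = interval_f(c, d)[0]
                    if fl <= best_upper:
                        heapq.heappush(heap, (fl, c, d))
        return best_upper

    print(branch_and_bound(-10.0, 10.0))            # ~= -4.0, attained at x = 2
    ```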

  • High Level - Application Analysis Techniques & Architectures - to Explore Design possibilities for Reduced Reconfiguration Area Overheads in FPGAs executing Compute Intensive Applications

    Proceedings of the Reconfigurable Architectures Workshop

    This paper proposes a novel common subgraph extraction algorithm which aims to minimize the total number of gates (reconfiguration area overhead) involved in implementing compute-intensive scientific and media applications using reconfigurable architectures. The motivation behind the proposed research is illustrated using an example from the Biochemical Algorithms Library (BALL). The design of novel context adaptable architectures to implement common subgraphs is also proposed, with an example from the video warping functions of the MPEG-4 standard. Three different models for mapping such architectures onto hybrid/pure FPGA systems are proposed. Estimates obtained by applying these techniques and architectures to various media and scientific functions are shown.

  • Novel predicated data flow analysis based memory design for data and control intensive multimedia applications

    Proceedings of the SPIE conference on Electronic Imaging

    There has been an ever-increasing demand for fast and power-efficient solutions for mobile multimedia computing applications. The research discussed in this paper proposes an automated tool-set to design a reconfigurable architecture targeted towards multimedia applications, which are both data and control intensive. One important design step is custom memory design. This paper discusses a novel methodology to design a power-, area-, and time-efficient memory architecture for a given Control Data Flow Graph (CDFG) of an application. It uses the concept of Predicated Data Flow Analysis to obtain the memory requirements of each control path of the CDFG, and a novel algorithm is used to merge these requirements. The final memory architecture is reconfigurable at run time, and a dynamic memory manager has been designed to support it. An illustrative example involving a self-generated CDFG demonstrates the flow of the proposed algorithm. Results for various multimedia algorithms found in the MPEG-4 codec show the effectiveness of this approach over memory design based on conventional Data Flow Analysis techniques.

  • Resource Estimation and Task Scheduling for Multithreaded Reconfigurable Architectures

    Proceedings of the International Conference on Parallel and Distributed Systems


    Reconfigurable computing is an emerging paradigm of research that offers cost-effective solutions for computationally intensive applications through hardware reuse. There is a growing need in this domain for techniques to exploit parallelism inherent in the target application and to schedule the parallelized application. This paper proposes a method to estimate the optimal number of resources through critical path analysis while keeping resource utilization near optimal. We also propose an algorithm to optimally schedule the parallel threads of execution in linear time. Our algorithm is based on the idea of enhanced partial critical path (ePCP) and handles memory latencies and reconfiguration overheads. Results obtained show the effectiveness of our approach over other critical path based methods.

  • Task Scheduling of Control Data Flow Graphs for Reconfigurable Architectures

    Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms

    Task scheduling is an essential part of the design cycle of a reconfigurable hardware implementation for a given application. Most current multimedia applications offer users a large number of variations and hence are control-dominated. Arriving at an optimal schedule for such applications would require a highly complex scheduling algorithm. This paper proposes a low-complexity scheduling algorithm that provides a near-optimal solution. Existing approaches suggest that the Branch and Bound method of scheduling gives the optimal solution, but it is at the same time highly complex. Our approach introduces the concept of an enhanced Partial Critical Path. Our scheduling algorithm generates a near-optimal solution with O(n) complexity. The Branch and Bound algorithm can be run selectively to approach optimality, thus reducing the overall complexity. Special cases involving loops have also been addressed. The effect of reconfiguration on the schedule has been analyzed and a solution has been proposed.

  • Current Trends for Silicon and Embedded Computing Solutions for Automotive Applications

    Proceedings of the convergence conference of SAE

    Automotive applications have started providing functionalities like dynamic navigation and multimedia computing that are both complex and real-time in nature. A single-chip implementation of these applications is possible only if the processing power of the chip is high enough. The advances in silicon technology have been significant, and the silicon real estate available for processing on a single chip is increasing at a rapid pace. Moore's law predicts that by 2005, a billion transistors will reside on a single chip. This makes it possible for automobile designers to provide the end-user with a complete embedded solution.

  • Pattern recognition tool to detect reconfigurable patterns in MPEG4 video processing

    IEEE Proceedings of Parallel and Distributed Processing Symposium

    Current approaches towards building a reconfigurable processor are targeted towards general purpose computing or a limited range of media-specific applications and are not specifically tuned for mobile multimedia applications. The increasing demand for mobile multimedia processing, with stringent constraints for low power, low chip area, and high flexibility at both the encoder and decoder, naturally demands the design and development of a dynamically reconfigurable multimedia processor. We have performed a detailed complexity analysis of the MPEG-4 video coding mode, which has illustrated the potential for reconfigurable computing. We have recently proposed a methodology for designing a reconfigurable media processor. This involves the design of a parser that identifies data/control flow graphs generated from the input assembly code of an UltraSPARC V-9 architecture; a recurring pattern analyzer that uses a clustering-based approach to identify specific sequences of operations that can potentially be implemented in hardware; and finally a count of such modules at every level of granularity, with associated weights based on the complexity of computation and data transfers, used by the partitioner and router. In this paper we propose the design of the parser and pattern recognizer, with results for detecting the reconfigurable patterns in MPEG-4.

Patents

  • Reconfigurable processing

    Issued US PCT/US2004/003609

    A method of producing a reconfigurable circuit device for running a computer program of moderate complexity, such as multimedia processing. Code for the application is compiled into Control Flow Graphs representing distinct parts of the application to be run. Basic blocks are extracted from those Control Flow Graphs and converted to Data Flow Graphs by a compiler utility. From two or more Data Flow Graphs, a largest common subgraph is determined. The largest common subgraph is ASAP scheduled and substituted back into the Data Flow Graphs, which also have been scheduled. The separate Data Flow Graphs containing the scheduled largest common subgraph are converted to data paths that are then combined to form code for operating the application. The largest common subgraph is effected in hardware that is shared among the parts of the application from which the Data Flow Graphs were developed. Scheduling of the overall code is effected for sequencing, providing the fastest run times, and the code is implemented in hardware by partitioning and placement of processing elements on a chip and design of the connective fabric for the design elements.
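
    The ASAP step referenced above is a standard longest-path computation over the data flow graph: each node starts as soon as all its predecessors finish. A minimal sketch with an illustrative toy DFG (the node latencies are assumptions):

    ```python
    def asap(succ, latency):
        """Earliest start time of every node in a DAG; `succ` maps node -> successors."""
        pred = {n: [] for n in succ}
        for n in succ:
            for s in succ[n]:
                pred[s].append(n)
        start = {}
        def walk(n):
            if n not in start:
                start[n] = max((walk(p) + latency[p] for p in pred[n]), default=0)
            return start[n]
        for n in succ:
            walk(n)
        return start

    # Toy DFG: a -> c, b -> c, c -> d
    succ = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
    lat = {"a": 1, "b": 2, "c": 1, "d": 1}
    print(asap(succ, lat))   # {'a': 0, 'b': 0, 'c': 2, 'd': 3}
    ```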

  • Near optimal configurable adder tree for arbitrary shaped 2D block sum of absolute differences (SAD) calculation engine

    Filed US 12/581,482

    Embodiments of a near optimal configurable adder tree for arbitrary shaped 2D block sum of absolute differences (SAD) calculation engine are generally described herein. Other embodiments may be described and claimed. In some embodiments, a configurable two-dimensional adder tree architecture for computing a sum of absolute differences (SAD) for various block sizes up to 16 by 16 comprises a first stage of one-dimensional adder trees and a second stage of one-dimensional adder trees, wherein each one-dimensional adder tree comprises an input routing network, a plurality of adder units, and an output routing network.
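
    A behavioral sketch of the two-stage reduction the claim describes: a 1-D tree collapses each row of absolute differences, and a second 1-D tree collapses the row sums. The block dimensions are parameters; the input/output routing networks and masking for arbitrary shapes are not modeled here.

    ```python
    def tree_sum(vals):
        """Pairwise log-depth reduction, mirroring a hardware adder tree."""
        while len(vals) > 1:
            pairs = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
            vals = pairs + (vals[-1:] if len(vals) % 2 else [])
        return vals[0]

    def block_sad(cur, ref, h, w):
        """SAD of an h x w block: per-pixel |cur - ref|, then the 2D tree."""
        row_sums = [tree_sum([abs(cur[y][x] - ref[y][x]) for x in range(w)])
                    for y in range(h)]            # first stage: one 1-D tree per row
        return tree_sum(row_sums)                 # second stage: tree across row sums
    ```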

Honors & Awards

  • Outstanding Research Assistant

    ECE Department, USU
