Peipei Zhou

Pittsburgh, Pennsylvania, United States
4K followers 500+ connections

About

I am a tenure-track assistant professor in the ECE Department at the University of Pittsburgh. I…

Experience & Education

  • University of Pittsburgh

Publications

  • CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture

    Association for Computing Machinery

    We design an end-to-end deep learning acceleration framework on the AMD Versal ACAP.

  • (PhD Dissertation) Modeling and Optimization for Customized Computing: Performance, Energy and Cost Perspective

    UCLA Electronic Theses and Dissertations

    This dissertation investigates design targets, modeling, and optimization for field-programmable gate array (FPGA) customized computing at the chip, node, and cluster levels. FPGAs have gained popularity in the acceleration of a wide range of applications with 10x-100x performance/energy efficiency over general-purpose processors. The design choices of FPGA accelerators for different targets at different levels are enormous. To guide designers toward the best design choices, modeling is indispensable.

  • (IEEE TCAD Donald O. Pederson Best Paper Award) Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks

    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

    With the recent advancement of multilayer convolutional neural networks (CNNs) and fully connected networks (FCNs), deep learning has achieved amazing success in many areas, especially in visual content understanding and classification. To improve the performance and energy efficiency of the computation-demanding CNN, the FPGA-based acceleration emerges as one of the most attractive alternatives. In this paper, we design and implement Caffeine, a hardware/software co-designed library to efficiently accelerate the entire CNN and FCN on FPGAs. First, we propose a uniformed convolutional matrix-multiplication representation for both computation-bound convolutional layers and communication-bound FCN layers. Based on this representation, we optimize the accelerator microarchitecture and maximize the underlying FPGA computing and bandwidth resource utilization based on a revised roofline model. Moreover, we design an automation flow to directly compile high-level network definitions to the final FPGA accelerator. As a case study, we integrate Caffeine into the industry-standard software deep learning framework Caffe. We evaluate Caffeine and its integration with Caffe by implementing VGG16 and AlexNet networks on multiple FPGA platforms. Caffeine achieves a peak performance of 1460 giga fixed point operations per second on a medium-sized Xilinx KU060 FPGA board; to our knowledge, this is the best published result. It achieves more than 100× speedup on FCN layers over prior FPGA accelerators. An end-to-end evaluation with Caffe integration shows up to 29× and 150× performance and energy gains over Caffe on a 12-core Xeon server, and 5.7× better energy efficiency over the GPU implementation. Performance projections for a system with a high-end FPGA (Virtex7 690t) show even higher gains.

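    As an illustration of the uniformed convolution-as-matrix-multiplication view described above (my own NumPy sketch with made-up layer sizes, not the paper's implementation), the snippet below lowers a small convolutional layer to a single GEMM via im2col; a fully connected layer is already a GEMM, which is what lets one accelerator design serve both layer types.

    import numpy as np

    def im2col(x, kh, kw):
        """Unroll a (C, H, W) input into a (C*kh*kw, out_h*out_w) matrix."""
        c, h, w = x.shape
        out_h, out_w = h - kh + 1, w - kw + 1
        cols = np.zeros((c * kh * kw, out_h * out_w))
        for i in range(out_h):
            for j in range(out_w):
                cols[:, i * out_w + j] = x[:, i:i + kh, j:j + kw].ravel()
        return cols

    # Toy layer: 3 input channels, 8 filters, 3x3 kernel, 6x6 input (no padding/stride).
    x = np.random.rand(3, 6, 6)
    w = np.random.rand(8, 3, 3, 3)          # (filters, channels, kh, kw)
    y = w.reshape(8, -1) @ im2col(x, 3, 3)  # convolution as one (8x27)*(27x16) GEMM
    y = y.reshape(8, 4, 4)                  # back to (filters, out_h, out_w)
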
  • (Best Paper Nominee) SODA: Stencil with Optimized Dataflow Architecture

    2018 International Conference On Computer Aided Design

    Stencil computation is one of the most important kernels in many application domains such as image processing, solving partial differential equations, and cellular automata. Many of the stencil kernels are complex, usually consist of multiple stages or iterations, and are often computation-bounded. Such kernels are often off-loaded to FPGAs to take advantage of the efficiency of dedicated hardware. However, implementing such complex kernels efficiently is not trivial, due to complicated data dependencies, difficulties of programming FPGAs with RTL, as well as large design space.
    In this paper we present SODA, an automated framework for implementing Stencil algorithms with Optimized Dataflow Architecture on FPGAs. The SODA microarchitecture minimizes the on-chip reuse buffer size required by full data reuse and provides flexible and scalable fine-grained parallelism. The SODA automation framework takes high-level user input and generates efficient, high-frequency dataflow implementations. This significantly reduces the difficulty of programming FPGAs efficiently for stencil algorithms. The SODA design-space exploration framework models the resource constraints and searches for the performance-optimized configuration with accurate models for post-synthesis resource utilization and on-board execution throughput. Experimental results from on-board execution using a wide range of benchmarks show up to 3.28x speedup over a 24-thread CPU, and our fully automated framework achieves better performance compared with manually designed state-of-the-art FPGA accelerators.

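    As a rough illustration of the line-buffer reuse that stencil dataflow architectures rely on (a generic Python model, not SODA's generated hardware), a 3x3 stencil over a row-major pixel stream needs only about two image rows plus a few pixels of on-chip buffering for full data reuse:

    from collections import deque
    import numpy as np

    def stencil_3x3_stream(img):
        """3x3 mean filter over a pixel stream using a ~2-row reuse buffer."""
        h, w = img.shape
        window = deque(maxlen=2 * w + 3)          # reuse buffer: two lines + 3 pixels
        out = np.zeros((h - 2, w - 2))
        for idx, pix in enumerate(img.ravel()):   # pixels arrive in raster order
            window.append(pix)
            i, j = divmod(idx, w)
            if i >= 2 and j >= 2:                 # buffer now spans rows i-2..i
                win = list(window)
                taps = [win[r * w + c] for r in range(3) for c in range(3)]
                out[i - 2, j - 2] = sum(taps) / 9.0
        return out

    img = np.arange(36, dtype=float).reshape(6, 6)
    ref = np.array([[img[i:i + 3, j:j + 3].mean() for j in range(4)] for i in range(4)])
    print(np.allclose(stencil_3x3_stream(img), ref))   # True
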
  • Latte: Locality Aware Transformation for High-Level Synthesis

    2018 IEEE International Symposium on Field-Programmable Custom Computing Machines

    First-author paper
    Modern FPGA chips feature abundant reconfigurable resources such as LUTs, FFs, BRAMs and DSPs. High-level synthesis (HLS) further advances users' productivity in designing accelerators and scaling out designs quickly via fine-grain and coarse-grain pipelining and duplication to utilize on-chip resources. However, current HLS tools fail to consider data locality in the scaled-out design; this leads to a long critical path which results in a low operating frequency. In this paper we summarize the timing degradation problems to four common collective communication and computation patterns in HLS-based accelerator design: scatter, gather, broadcast and reduce. These widely used patterns scale poorly in one-to-all or all-to-one data movements between the off-chip communication interface and on-chip storage, or inside the computation logic. Therefore, we propose the Latte microarchitecture featuring pipelined transfer controllers (PTC) along data paths in these patterns. Furthermore, we implement an automated framework to apply our Latte implementation in HLS with minimal user effort. Our experiments show that Latte-optimized designs greatly improve the timing of baseline HLS designs by 1.50x with only 3.2% LUT overhead on average, and 2.66x with 2.7% overhead at maximum.

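    To give a feel for the pipelined-transfer idea in patterns like broadcast (a toy functional model written for illustration, not Latte's PTC hardware), data can be forwarded hop by hop through a register chain so that each wire only spans neighboring processing elements, at the cost of one cycle of latency per hop:

    def daisy_chain_broadcast(value, n_pes):
        """Broadcast via a pipelined chain: one hop per cycle instead of a
        single source fanning out to all PEs over long wires."""
        regs = [None] * n_pes
        cycles = 0
        while any(r is None for r in regs):
            nxt = regs[:]                 # each cycle, every PE forwards what it
            nxt[0] = value                # latched in the previous cycle
            for i in range(1, n_pes):
                if regs[i - 1] is not None:
                    nxt[i] = regs[i - 1]
            regs = nxt
            cycles += 1
        return cycles                     # n_pes cycles, but short point-to-point wires

    print(daisy_chain_broadcast(42, 8))   # 8
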
  • ST-Accel: A High-Level Programming Platform for Streaming Applications on FPGA

    2018 IEEE International Symposium on Field-Programmable Custom Computing Machines

    In recent years we have witnessed the emergence of the FPGA in many high-performance systems. This is due to the FPGA's high reconfigurability and improved user-friendly programming environment. OpenCL, supported by major FPGA vendors, is a high-level programming platform that liberates hardware developers from having to deal with the complex and error-prone HDL development. While OpenCL exposes a GPU-like programming model, which is well-suited for compute-intensive tasks, in many state-of-the-art systems that deploy FPGAs we observe that the workloads are streaming-like, which is communication-intensive. This mismatch leads to low throughput and high end-to-end latency.
    In this paper, we propose ST-Accel, a new high-level programming platform for streaming applications on FPGA. It has the following advantages: (i) ST-Accel adopts the multiprocessing programming model to capture the inherent pipeline-level parallelism of streaming applications while reducing the end-to-end latency. (ii) A message-passing-based host/FPGA communication model is used to avoid the coherency issue of shared memory, thus enabling host/FPGA communication during kernel execution. (iii) ST-Accel provides a high-level abstraction for I/O devices to support direct I/O device access, which eliminates host-CPU overhead and reduces I/O latency. (iv) ST-Accel enables the decoupled access/execute architecture to maximize the utilization of I/O devices. (v) The host/FPGA communication interface is redesigned to cater to the demands of both latency-critical and throughput-critical scenarios. The experimental results on the Amazon AWS cloud and a local machine show that ST-Accel can achieve 1.6X-166X the throughput and one-third the latency of OpenCL for typical streaming workloads.

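    The multiprocessing, pipeline-parallel style described above can be sketched in plain Python (a loose analogy, not ST-Accel's host/FPGA API): two stages run as separate processes and stream records through a bounded queue, which also provides natural backpressure.

    import multiprocessing as mp

    def producer(out_q, n):
        for i in range(n):
            out_q.put(i)                 # stage 1: ingest/parse one record
        out_q.put(None)                  # end-of-stream marker

    def consumer(in_q, result_q):
        total = 0
        while True:
            item = in_q.get()
            if item is None:
                break
            total += item * item         # stage 2: transform/aggregate
        result_q.put(total)

    if __name__ == "__main__":
        q, res = mp.Queue(maxsize=64), mp.Queue()
        p1 = mp.Process(target=producer, args=(q, 1000))
        p2 = mp.Process(target=consumer, args=(q, res))
        p1.start(); p2.start()
        p1.join()
        print(res.get())                 # 332833500 = sum of i*i for i < 1000
        p2.join()
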
  • (Best Paper Nominee) Doppio: I/O-Aware Performance Analysis, Modeling and Optimization for In-Memory Computing Framework

    2018 IEEE International Symposium on Performance Analysis of Systems and Software

    First-author paper
    In conventional Hadoop MapReduce applications, I/O used to play a heavy role in the overall system performance. More recently, a study from the Apache Spark community—the state-of-the-art in-memory cluster computing framework—reports that I/O is no longer the bottleneck and has a marginal performance impact on applications like SQL processing. However, we observe that simply replacing HDDs with SSDs in a Spark cluster can have over 10x performance improvement for certain stages in large-scale production-quality genome processing. Therefore, one key question arises: How does I/O quantitatively impact the performance of today’s big data applications developed using in-memory cluster computing frameworks like Apache Spark? In this paper we select an important yet complex application—the Spark-based Genome Analysis ToolKit (GATK4)—to guide our modeling. We first use different combinations of HDDs and SSDs to measure the I/O impact on GATK4 and change the CPU core number to discover the relation between computation and I/O access. Combining this with Spark's underlying implementation, we further analyze the inherent cause of the above observations and build our model based on the analysis. Although built upon GATK4, our model maintains generality to other applications. Experimental results show that we can achieve a performance prediction error rate within 10% for typical Spark applications of both iterative and shuffle-heavy algorithms. Finally, we further extend our model to a broader area: optimal configuration selection in the public cloud. In Google Cloud, our model enables us to save 38% to 57% of the cost for genome sequencing compared with its recommended default configurations. Currently, more and more companies are adopting cloud computing for specific workloads. Our proposed model can have a huge impact on their choices, while also enabling them to significantly reduce their costs.

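    A toy stage-time model in the same spirit as the I/O-aware analysis above (my own simplification with made-up numbers, not Doppio's actual model): when compute and I/O overlap, stage time is roughly the larger of the two terms, so faster storage only speeds up stages that are I/O-bound.

    def stage_time(bytes_io, bandwidth_bps, compute_s, overlap=True):
        """Estimate one stage's runtime from its I/O volume and compute time."""
        io_s = bytes_io / bandwidth_bps
        return max(io_s, compute_s) if overlap else io_s + compute_s

    GB = 1e9
    # Hypothetical stage: 200 GB of disk traffic plus 500 s of CPU work.
    hdd = stage_time(200 * GB, 150e6, 500)   # ~150 MB/s HDD
    ssd = stage_time(200 * GB, 2e9, 500)     # ~2 GB/s NVMe SSD
    print(f"HDD {hdd:.0f} s, SSD {ssd:.0f} s, speedup {hdd / ssd:.1f}x")   # 2.7x, capped by compute
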
  • Bandwidth Optimization Through On-Chip Memory Restructuring for HLS

    54th Annual Design Automation Conference

    High-level synthesis (HLS) is getting increasing attention from both academia and industry for high-quality and high-productivity designs. However, when inferring primitive-type arrays in HLS designs into on-chip memory buffers, commercial HLS tools fail to effectively organize FPGAs’ on-chip BRAM building blocks to realize high-bandwidth data communication; this often leads to suboptimal quality of results. This paper addresses this issue via automated on-chip buffer restructuring. Specifically, we present three buffer restructuring approaches and develop an analytical model for each approach to capture its impact on performance and resource consumption. With the proposed model, we formulate the process of identifying the optimal design choice into an integer non-linear programming (INLP) problem and demonstrate that it can be solved efficiently with the help of a one-time C-to-HDL (hardware description language) synthesis. The experimental results show that our automated source-to-source code transformation tool improves the performance of a broad class of HLS designs by 4.8x on average.

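    As a schematic of the kind of on-chip buffer restructuring at issue (a Python model of cyclic banking, not the paper's source-to-source transformation), splitting one logical array across several banks lets a group of consecutive elements be fetched in the same cycle; HLS tools expose similar restructuring through array-partitioning directives.

    def cyclic_partition(data, banks):
        """Split one logical buffer into `banks` physical banks, round-robin."""
        return [data[b::banks] for b in range(banks)]

    def read_burst(bank_data, banks, start):
        """Elements start..start+banks-1 land in distinct banks -> one parallel read."""
        return [bank_data[(start + k) % banks][(start + k) // banks] for k in range(banks)]

    data = list(range(16))
    bank_data = cyclic_partition(data, 4)   # bank b holds data[b], data[b+4], ...
    print(bank_data[1])                     # [1, 5, 9, 13]
    print(read_burst(bank_data, 4, 6))      # [6, 7, 8, 9], one element per bank
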
  • Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks

    36th International Conference on Computer-Aided Design

    With the recent advancement of multilayer convolutional neural networks (CNN), deep learning has achieved amazing success in many areas, especially in visual content understanding and classification. To improve the performance and energy-efficiency of the computation-demanding CNN, the FPGA-based acceleration emerges as one of the most attractive alternatives. In this paper we design and implement Caffeine, a hardware/software co-designed library to efficiently accelerate the entire CNN on FPGAs. First, we propose a uniformed convolutional matrix-multiplication representation for both computation-intensive convolutional layers and communication-intensive fully connected (FCN) layers. Second, we design Caffeine with the goal to maximize the underlying FPGA computing and bandwidth resource utilization, with a key focus on the bandwidth optimization by the memory access reorganization not studied in prior work. Moreover, we implement Caffeine in portable high-level synthesis and provide various hardware/software definable parameters for user configurations. Finally, we also integrate Caffeine into the industry-standard software deep learning framework Caffe. We evaluate Caffeine and its integration with Caffe by implementing VGG16 and AlexNet networks on multiple FPGA platforms. Caffeine achieves a peak performance of 365 GOPS on the Xilinx KU060 FPGA and 636 GOPS on the Virtex7 690t FPGA. This is the best published result to the best of our knowledge. We achieve more than 100x speedup on FCN layers over previous FPGA accelerators. An end-to-end evaluation with Caffe integration shows up to 7.3x and 43.5x performance and energy gains over Caffe on a 12-core Xeon server, and 1.5x better energy-efficiency over the GPU implementation on a medium-sized FPGA (KU060). Performance projections to a system with a high-end FPGA (Virtex7 690t) show even higher gains.

  • Energy Efficiency of Full Pipelining: A Case Study for Matrix Multiplication

    24th IEEE International Symposium on Field-Programmable Custom Computing Machines

    First-author paper
    Customized pipeline designs that minimize the pipeline initiation interval (II) maximize the throughput of FPGA accelerators designed with high-level synthesis (HLS). What is the impact of minimizing II on energy efficiency? Using a matrix-multiply accelerator, we show that matrix multiplies with II>1 can sometimes reduce dynamic energy below II=1 due to interconnect savings, but II=1 always achieves energy close to the minimum. We also identify sources of inefficient mapping in the commercial tool flow.

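    For context on initiation interval (II): a pipelined loop takes roughly depth + (iterations - 1) * II cycles, so for long loops throughput scales almost inversely with II. A small arithmetic sketch of this standard HLS estimate (not the paper's measured results):

    def pipeline_cycles(iterations, depth, ii):
        """Fill the pipeline once (depth), then retire one iteration every II cycles."""
        return depth + (iterations - 1) * ii

    n, depth = 1_000_000, 10
    for ii in (1, 2, 4):
        print(f"II={ii}: {pipeline_cycles(n, depth, ii):,} cycles")
    # II=1: 1,000,009   II=2: 2,000,008   II=4: 4,000,006
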
  • A Fully Pipelined and Dynamically Composable Architecture of CGRA

    2014 FCCM

Courses

  • Advanced Computer Architecture

    CS251A

  • Algorithms

    CS280

  • Arithmetic Algorithm and Processor

    CS252A

  • Data Science and Data Analytics

    CS249

  • Database Systems

    CS143

  • Design of VLSI Circuits and Systems (section 1)

    EE216A

  • Domain Specific Computing

    CS259

  • Machine Learning Algorithm

    CS260

  • Object-Oriented Programming in C++

    CS32

  • Parallel Computer Architecture

    CS251B

  • Parallel and Distributed Computing

    CS133

  • Programming Languages

    CS131

  • Special Topics in Circuits & Embedded System

    EE209AS

  • Special Topics in Signals & Systems

    EE239AS

Projects

  • Sorting Algorithm in OpenCL on heterogeneous platform: CPU/GPU/FPGA

    Using OpenCL, implemented sorting algorithms on heterogeneous platforms, including:
    a 32-core Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz server,
    an NVIDIA Tesla K10 GPU, and
    a Xilinx Zynq 7000 FPGA.
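
    The profile does not name the algorithm; bitonic sort is a common choice for OpenCL targets because its compare-exchange network is data-independent, so every stage maps to independent work-items. A host-side Python model of that network (an assumed example, not the project's kernels):

    def bitonic_sort(a):
        """In-place bitonic sorting network; requires a power-of-two length."""
        n = len(a)
        assert n and n & (n - 1) == 0
        k = 2
        while k <= n:                      # size of bitonic subsequences
            j = k // 2
            while j > 0:                   # compare-exchange distance
                for i in range(n):
                    partner = i ^ j
                    if partner > i:
                        ascending = (i & k) == 0
                        if (a[i] > a[partner]) == ascending:
                            a[i], a[partner] = a[partner], a[i]
                j //= 2
            k *= 2
        return a

    print(bitonic_sort([7, 3, 9, 1, 6, 0, 5, 2]))   # [0, 1, 2, 3, 5, 6, 7, 9]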

  • CS-BWAMEM: A Cloud-Scale Sequence Aligner for DNA Sequencing

    Goal: Create a new tool to handle the ever-increasing scale of human genome data (300+GB per individual)
    Duty: The main architect of CS-BWAMEM
    Results:
    (1) Developed over 80% of the code on top of Spark and HDFS/Tachyon
    (2) Maintained the GitHub repository
    Highlights:
    (1) ~25K-line code base: ~15K lines of Scala plus 2-3K lines of C/C++ libraries;
    (2) Used Parquet to reduce disk I/O overhead;
    (3) Bypassed the Spark broadcast path to improve software scalability;
    (4) Used native execution (C/C++) and hardware accelerators to replace slower Java code;
    (5) Finishes a 300GB task on a 300+-core cluster, providing a 12x speedup over the best existing tool

    CS-BWAMEM available at: https://github.com/ytchen0323/cloud-scale-bwamem

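    A minimal sketch of the distribution pattern (assuming PySpark purely for illustration; CS-BWAMEM itself is Scala on Spark, and align_read below is a hypothetical stand-in for the BWA-MEM alignment kernel, with an illustrative HDFS path):

    from pyspark.sql import SparkSession

    def align_read(read_line):
        # Hypothetical stand-in for aligning one read against the reference genome.
        return read_line.upper()

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("toy-aligner").getOrCreate()
        reads = spark.sparkContext.textFile("hdfs:///data/sample.fastq")      # illustrative path
        reads.map(align_read).saveAsTextFile("hdfs:///data/sample.aligned")   # parallel map over reads
        spark.stop()
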
  • Movie Search Website and Database System

    RDBMS with an index structure implemented as a B+ tree, developed in C++
    Designed a movie-database online search website based on MySQL and PHP
    Optimized the search algorithm, improving performance by 100x-1000x over a linear-search baseline

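    A toy illustration of why an index beats a linear scan (Python's bisect over a sorted key list standing in for the B+ tree; not the project's C++ code):

    import bisect

    titles = sorted(f"movie_{i:06d}" for i in range(1_000_000))   # sorted index

    def indexed_lookup(key):
        """O(log n): binary search over the sorted keys, like a B+-tree descent."""
        pos = bisect.bisect_left(titles, key)
        return pos if pos < len(titles) and titles[pos] == key else -1

    def linear_lookup(key):
        """O(n): the scan the index replaces."""
        return next((i for i, t in enumerate(titles) if t == key), -1)

    print(indexed_lookup("movie_765432"))   # 765432, ~20 comparisons
    print(linear_lookup("movie_765432"))    # 765432, ~765k comparisons
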
  • Object-Oriented C++ projects

    Maze video game with 7 different characters in C++; optimized the path-finding algorithm and improved runtime performance
    Designed a database system based on a binary tree in C++ and optimized the search algorithm

  • Accelerator-Rich Architectures Exploration

    Goal: Develop a prototyping flow to enable rapid design space exploration for accelerator-rich architectures
    Product: ARAPrototyper, including an automated synthesis flow, system software stack, and user APIs
    Highlight: Users can evaluate their designs and applications on a real silicon prototype (Xilinx Zynq SoC)

Honors & Awards

  • IEEE Transactions on Computer-Aided Design Donald O. Pederson Best Paper Award

    IEEE Council on EDA

    Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks won the Donald O. Pederson Best Paper Award, which is given annually to recognize the best paper published in the IEEE Transactions on CAD in the two calendar years preceding the award.

  • UCLA Samueli School of Engineering Outstanding Ph.D. Researcher award

    UCLA Samueli School of Engineering

    UCLA recently graduated its first Centennial Class (1919-2019). Peipei Zhou, a graduating Ph.D. from the VAST Lab, received the 2019 Computer Science Department Outstanding Ph.D. Researcher award as a member of this first Centennial Class. Her dissertation title is "Modeling and Optimization for Customized Computing: Performance, Energy and Cost Perspective".

  • Best Paper Nominee at ICCAD'18

    2018 International Conference On Computer Aided Design

    SODA: Stencil with Optimized Dataflow Architecture received a Best Paper nomination at the 2018 International Conference on Computer-Aided Design (ICCAD'18).

  • Phi Tau Phi Scholarship

    Phi Tau Phi Scholastic Honor Society of America

    The West America Chapter of the Phi Tau Phi Scholastic Honor Society offers four or more awards each year to undergraduate and graduate students in recognition of their academic achievements and scholarly contributions. In addition to accomplished scholars, students who are talented in areas other than academics, such as those who have demonstrated exceptional leadership in society, special talents in fine arts, or strong commitment to Chinese heritage and culture, are encouraged to apply.

  • Best Paper Nominee at ISPASS'18

    2018 IEEE International Symposium on Performance Analysis of Systems and Software

    Doppio: I/O-Aware Performance Analysis, Modeling and Optimization for In-Memory Computing Framework received a Best Paper nomination at the 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'18).

Languages

  • English

  • Chinese


Recommendations received

4 people have recommended Peipei
