skip to main content
research-article
Open access

Halide: decoupling algorithms from schedules for high-performance image processing

Published: 27 December 2017 Publication History
  • Get Citation Alerts
  • Abstract

    Writing high-performance code on modern machines requires not just locally optimizing inner loops, but globally reorganizing computations to exploit parallelism and locality---doing things such as tiling and blocking whole pipelines to fit in cache. This is especially true for image processing pipelines, where individual stages do much too little work to amortize the cost of loading and storing results to and from off-chip memory. As a result, the performance difference between a naive implementation of a pipeline and one globally optimized for parallelism and locality is often an order of magnitude. However, using existing programming tools, writing high-performance image processing code requires sacrificing simplicity, portability, and modularity. We argue that this is because traditional programming models conflate the computations defining the algorithm with decisions about intermediate storage and the order of computation, which we call the schedule.
    We propose a new programming language for image processing pipelines, called Halide, that separates the algorithm from its schedule. Programmers can change the schedule to express many possible organizations of a single algorithm. The Halide compiler then synthesizes a globally combined loop nest for an entire algorithm, given a schedule. Halide models a space of schedules which is expressive enough to describe organizations that match or outperform state-of-the-art hand-written implementations of many computational photography and computer vision algorithms. Its model is simple enough to do so often in only a few lines of code, and small changes generate efficient implementations for x86, ARM, Graphics Processors (GPUs), and specialized image processors, all from a single algorithm.
    Halide has been public and open source for over four years, during which it has been used by hundreds of programmers to deploy code to tens of thousands of servers and hundreds of millions of phones, processing billions of images every day.

    References

    [1]
    Adams, A., Talvala, E., Park, S.H., Jacobs, D.E., Ajdin, B., Gelfand, N., Dolson, J., Vaquero, D., Baek, J., Tico, M., Lensch, H.P.A., Matusik, W., Pulli, K., Horowitz, M., Levoy, M. The Frankencamera: An experimental platform for computational photography. ACM Trans. Graph. 29, 4 (2010), 29:1--29:12.
    [2]
    Aubry, M., Paris, S., Hasinoff, S.W., Kautz, J., Durand, F. Fast local Laplacian filters: Theory and applications. ACM Trans. Graph. 33, 5 (2014), 167.
    [3]
    Bacon, D.F., Graham, S.L., Sharp, O.J. Compiler transformations for high-performance computing. ACM Comput Surv. 26, 4 (Dec. 1994).
    [4]
    Blythe, D. The Direct3D 10 system. ACM Trans. Graph. 25, (2006), 724--734.
    [5]
    Buck, I. GPU computing: Programming a massively parallel processor. In Proceedings of the International Symposium on Code Generation and Optimization (Tessellations Publishing, Phoenix, Arizona, 2007).
    [6]
    Chamberlain, B., Callahan, D., Zima, H. Parallel programmability and the Chapel language. Int J High Perform Comput Appl. 21, (2007), 291--312.
    [7]
    Chen, J., Paris, S., Durand, F. Real-time edge-aware image processing with the bilateral grid. ACM Trans. Graph. 26, 3 (2007), 103:1--103:9.
    [8]
    Elliott, C. Functional image synthesis. In Proceedings of Bridges 2001, Mathematical Connections in Art, Music, and Science (IEEE Computer Society, Washington, DC, USA, 2001).
    [9]
    Fatahalian, K., Horn, D.R., Knight, T.J., Leem, L., Houston, M., Park, J.Y., Erez, M., Ren, M., Aiken, A., Dally, W.J., Hanrahan, P. Sequoia: Programming the memory hierarchy. In ACM/IEEE conference on Supercomputing (ACM, New York, NY, 2006).
    [10]
    Feautrier, P. Dataflow analysis of array and scalar references. Int J Parallel Program. 20, 1 (1991), 23--53.
    [11]
    Frigo, M., Johnson, S.G. The design and implementation of FFTW3. Proc IEEE 93, 2 (2005).
    [12]
    Gordon, M.I., Thies, W., Karczmarek, M., Lin, J., Meli, A.S., Leger, C., Lamb, A.A., Wong, J., Hoffman, H., Maze, D.Z., Amarasinghe, S. A stream compiler for communication-exposed architectures. In International Conference on Architectural Support for Programming Languages and Operating Systems (ACM, New York, NY, 2002).
    [13]
    Govindaraju, N., Lloyd, B., Dotsenko, Y., Smith, B., Manferdelli, J. High performance discrete Fourier transforms on graphics processors. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. IEEE (Washington, DC, January 2008).
    [14]
    Halide source repository. http://github.com/halide/Halide.
    [15]
    Hasinoff, S.W., Sharlet, D., Geiss, R., Adams, A., Barron, J.T., Kainz, F., Chen, J., Levoy, M. Burst photography for high dynamic range and low-light imaging on mobile cameras. ACM Trans. Graph. 35, 6 (2016).
    [16]
    Holzmann, G. Beyond Photography: The Digital Darkroom. Prentice Hall, Englewood Cliffs, NJ, 1988.
    [17]
    Mullapudi, R.T., Adams, A., Sharlet, D., Ragan-Kelley, J., Fatahalian, K. Automatically scheduling halide image processing pipelines. ACM Trans. Graph. 35, 4 (2016).
    [18]
    Mullapudi, R.T., Vasista, V., Bondhugula, U. PolyMage: Automatic optimization for image processing pipelines. In ACM SIGPLAN Notices (ACM, New York, NY, 2015), volume 50, 429--443.
    [19]
    The OpenCL specification, version 1.2. http://www.khronos.org/registry/cl/specs/opencl-1.2.pdf, 2011.
    [20]
    Püschel, M., Moura, J.M.F., Johnson, J., Padua, D., Veloso, M., Singer, B., Xiong, J., Franchetti, F., Gacic, A., Voronenko, Y., Chen, K., Johnson, R.W., Rizzolo, N. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation" 93, 2 (2005), 232--275.
    [21]
    Ragan-Kelley, J. Decoupling algorithms from the organization of computation for high performance image processing. PhD thesis, Massachusetts Institute of Technology (2014).
    [22]
    Ragan-Kelley, J., Adams, A., Paris, S., Levoy, M., Amarasinghe, S., Durand, F. Decoupling algorithms from schedules for easy optimization of image processing pipelines. ACM Trans. Graph. 31, 4 (2012).
    [23]
    Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., Amarasinghe, S. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (ACM, New York, NY, 2013).
    [24]
    Rudy, G., Khan, M.M., Hall, M., Chen, C., Chame, J. A programming language interface to describe transformations and code generation. In Proceedings of the 23rd International Conference on Languages and Compilers for Parallel Computing LCPC'10, (Springer-Verlag, Berlin, Heidelberg, 2011), 136--150.
    [25]
    Suriana, P., Adams, A., Kamil, S. Parallel associative reductions in halide. In Proceedings of the 2017 International Symposium on Code Generation and Optimization (ACM, New York, NY, 2017).

    Cited By

    View all
    • (2024)Research on the Application of Multimedia Image Processing Technology in Sports Sociology EducationInternational Journal of Web-Based Learning and Teaching Technologies10.4018/IJWLTT.34798919:1(1-21)Online publication date: 17-Jul-2024
    • (2024)DNNOPT: A Framework for Efficiently Selecting On-chip Memory Loop Optimizations of DNN AcceleratorsProceedings of the 21st ACM International Conference on Computing Frontiers10.1145/3649153.3649196(126-137)Online publication date: 7-May-2024
    • (2024)Optimizing Nested Recursive QueriesProceedings of the ACM on Management of Data10.1145/36392712:1(1-27)Online publication date: 26-Mar-2024
    • Show More Cited By

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image Communications of the ACM
    Communications of the ACM  Volume 61, Issue 1
    January 2018
    110 pages
    ISSN:0001-0782
    EISSN:1557-7317
    DOI:10.1145/3176926
    Issue’s Table of Contents
    This work is licensed under a Creative Commons Attribution-NoDerivs International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 December 2017
    Published in CACM Volume 61, Issue 1

    Permissions

    Request permissions for this article.

    Check for updates

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • NSF
    • Department of Energy

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)5,693
    • Downloads (Last 6 weeks)208

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Research on the Application of Multimedia Image Processing Technology in Sports Sociology EducationInternational Journal of Web-Based Learning and Teaching Technologies10.4018/IJWLTT.34798919:1(1-21)Online publication date: 17-Jul-2024
    • (2024)DNNOPT: A Framework for Efficiently Selecting On-chip Memory Loop Optimizations of DNN AcceleratorsProceedings of the 21st ACM International Conference on Computing Frontiers10.1145/3649153.3649196(126-137)Online publication date: 7-May-2024
    • (2024)Optimizing Nested Recursive QueriesProceedings of the ACM on Management of Data10.1145/36392712:1(1-27)Online publication date: 26-Mar-2024
    • (2024)Optimizing Dynamic-Shape Neural Networks on Accelerators via On-the-Fly Micro-Kernel PolymerizationProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 210.1145/3620665.3640390(797-812)Online publication date: 27-Apr-2024
    • (2024)An Optimizing Framework on MLIR for Efficient FPGA-based Accelerator Generation2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA57654.2024.00017(75-90)Online publication date: 2-Mar-2024
    • (2024)Template-Based Automatic Library Function Generation with Halide for Compute-Intensive Simulink Models2024 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)10.1109/COOLCHIPS61292.2024.10531173(1-6)Online publication date: 17-Apr-2024
    • (2024)Verified Validation for Affine Scheduling in Polyhedral CompilationTheoretical Aspects of Software Engineering10.1007/978-3-031-64626-3_17(287-305)Online publication date: 14-Jul-2024
    • (2024)AuDaLa is Turing CompleteFormal Techniques for Distributed Objects, Components, and Systems10.1007/978-3-031-62645-6_12(221-229)Online publication date: 17-Jun-2024
    • (2024): Deductive Verification and Scheduling Languages Join ForcesTools and Algorithms for the Construction and Analysis of Systems10.1007/978-3-031-57256-2_4(71-89)Online publication date: 6-Apr-2024
    • (2023)FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific CompilerACM Transactions on Architecture and Code Optimization10.1145/362952320:4(1-25)Online publication date: 25-Oct-2023
    • Show More Cited By

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Digital Edition

    View this article in digital edition.

    Digital Edition

    Magazine Site

    View this article on the magazine site (external)

    Magazine Site

    Get Access

    Login options

    Full Access

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media