Moving data between processes is often cited as one of the major bottlenecks in parallel computing. A large body of research strives to improve communication latency and bandwidth on different networks, typically measured with ping-pong benchmarks at different message sizes. In practice, the data to be communicated generally originates in application data structures and must be serialized before it can be sent over serial network channels.
This document describes SeRViSO, a selective retransmission scheme for video streaming over overlay networks. It begins with background on related work in video streaming using application-layer multicast and data-driven overlay networks. It then describes the key components of SeRViSO, which builds on existing data-driven overlay network approaches but introduces a selective retransmission mechanism to handle lost packets. Packets are prioritized based on their importance to video decoding, and a selection algorithm chooses which less important packets may not be retransmitted, improving playback quality when losses are high. The document evaluates SeRViSO through a prototype implementation and tests showing it can reduce retransmission overhead by up to 40% without significantly degrading video quality.
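The paper's exact selection algorithm is not reproduced here, but the idea of priority-based selective retransmission can be sketched roughly as follows. The frame classes, priority values, and loss-rate thresholds below are illustrative assumptions, not taken from SeRViSO itself:

```python
# Hedged sketch of a selective retransmission decision. Frame classes
# follow common video coding practice: I-frames matter most for
# decoding, B-frames least. All thresholds are illustrative.

PRIORITY = {"I": 3, "P": 2, "B": 1}  # higher = more important

def select_for_retransmission(lost_packets, loss_rate, threshold=0.2):
    """Return the subset of lost packets worth retransmitting.

    Under light loss, retransmit everything; once loss_rate exceeds
    the threshold, drop the least important class first.
    """
    if loss_rate <= threshold:
        return list(lost_packets)
    # Under heavy loss skip B-frames; under very heavy loss, also P-frames.
    min_priority = 3 if loss_rate > 2 * threshold else 2
    return [p for p in lost_packets if PRIORITY[p["frame"]] >= min_priority]

lost = [{"seq": 1, "frame": "I"}, {"seq": 2, "frame": "B"}, {"seq": 3, "frame": "P"}]
print([p["seq"] for p in select_for_retransmission(lost, loss_rate=0.3)])  # [1, 3]
```

The key design point is that the sender spends its retransmission budget on packets whose loss would cascade through the decoder, rather than on packets that only degrade a single frame.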
This document provides information about SBGC, an organization that assists students with IEEE projects. It offers project categories both for students who have their own project ideas and for those who want to select from SBGC's list, lists the technologies, domains, and departments it supports, and describes the project deliverables and support provided.
Iaetsd implementation of context features using context-aware information fil... (Iaetsd)
This document proposes a Context-Aware Information Filter (CAIF) to filter unwanted messages in online social networks. It aims to address the limitations of current rule-based filtering systems that do not consider the context of messages. The CAIF would have two inputs - messages and context updates. It presents the AGILE algorithm to make indexing adaptive by increasing or decreasing index accuracy and scope based on the frequency of context updates and message handling calls. This achieves high message throughput while efficiently processing context changes.
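The AGILE algorithm itself is not detailed above, but its adaptive-indexing idea can be sketched: track how often context updates arrive versus how often messages are matched, and adjust index granularity accordingly. The class below is an illustrative assumption about that mechanism, not AGILE's actual data structure:

```python
# Illustrative sketch of adaptive index granularity: when context
# updates dominate, coarsen the index so updates stay cheap; when
# message matching dominates, refine it so lookups stay precise.
# Granularity bounds and the adaptation window are invented values.

class AdaptiveIndex:
    def __init__(self, granularity=4, lo=1, hi=64):
        self.granularity = granularity  # buckets per attribute; higher = more precise
        self.lo, self.hi = lo, hi
        self.updates = 0
        self.matches = 0

    def on_context_update(self):
        self.updates += 1
        self._adapt()

    def on_message(self):
        self.matches += 1
        self._adapt()

    def _adapt(self, window=100):
        total = self.updates + self.matches
        if total < window:
            return
        if self.updates > 2 * self.matches and self.granularity > self.lo:
            self.granularity //= 2      # updates dominate: coarsen
        elif self.matches > 2 * self.updates and self.granularity < self.hi:
            self.granularity *= 2       # matching dominates: refine
        self.updates = self.matches = 0
```

A workload of mostly message handling would drive the granularity up, while a burst of context churn would drive it back down, which is the throughput/accuracy trade-off the filter exploits.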
Multipath TCP (MPTCP) has the potential to significantly improve application performance by using multiple paths transparently. MPTCP was designed and implemented as a backward-compatible replacement for TCP. For this reason, it exposes the standard socket API, so applications cannot control how the different paths are used. This is a key feature for applications that are unaware of the multipath nature of the network, but a limitation for applications that could benefit from specific knowledge to use multiple paths in a way that fits their needs. Hosts are often connected by multiple paths, yet TCP restricts each transport connection to a single path. Resource usage within the network would be more efficient were these multiple paths used concurrently, and doing so should enhance the user experience through improved resilience to network failures and higher throughput. In this paper, we focus on MPTCP and discuss its performance issues and their solutions. We believe this work will be useful for future MPTCP performance evaluations.
This paper focuses on packet routing in Delay Tolerant Networks (DTNs) where end-to-end connectivity is intermittent. It studies routing policies for transferring files when packets arrive progressively at the source node. It analyzes the optimality conditions for routing policies in terms of delivery probability and delay. It proposes piecewise-threshold policies that perform better than existing work-conserving policies, especially when there is an energy constraint. It extends the analysis to coded packets generated using linear block codes and rateless coding. Numerical results show piecewise-threshold policies have higher efficiency than work-conserving policies.
Centrality-Based Network Coder Placement For Peer-To-Peer Content Distribution (IJCNC Journal)
Network coding has been shown to achieve optimal multicast throughput, but at a high computational cost: every node in the network has to code. Aiming to minimize the resource consumption of network coding while maintaining its performance, we propose in this paper a practical network coder placement algorithm that achieves content distribution time comparable to network coding while substantially reducing the number of network coders compared to a full network coding solution in which all peers must encode, i.e. become encoders. Our algorithm is derived from two key elements. First, it is
based on the insight that coding at upstream peers eliminates information duplication to downstream peers,
which results in efficient content distribution. Second, our placement strategy exploits centrality
characteristics of the network topology to quickly determine key positions to place encoders. Performance
evaluation using various topology and algorithm parameters confirms the effectiveness of our proposed
method.
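The placement strategy above can be sketched concretely: rank peers by a centrality measure and place coders at the top-k positions. The snippet below uses plain degree centrality on a toy topology; the paper's actual centrality measure and selection rule may differ, and the topology is invented for illustration:

```python
# Minimal sketch of centrality-guided encoder placement: rank peers by
# degree centrality and make the k most central peers the encoders.

def degree_centrality(adj):
    """adj: dict node -> set of neighbours. Normalised degree per node."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

def place_coders(adj, k):
    """Pick the k most central nodes as encoders; all others just forward."""
    scores = degree_centrality(adj)
    return sorted(scores, key=lambda v: (-scores[v], v))[:k]

# A small star-like topology: node "s" is clearly the most central.
topology = {
    "s": {"a", "b", "c", "d"},
    "a": {"s", "b"},
    "b": {"s", "a"},
    "c": {"s"},
    "d": {"s"},
}
print(place_coders(topology, 1))  # ['s']
```

This captures the intuition from the abstract: coding at the most central (upstream) peers removes the most information duplication per encoder added.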
NEW TECHNOLOGY FOR MACHINE TO MACHINE COMMUNICATION IN SOFTNET TOWARDS 5G (ijwmn)
Machine-to-Machine (M2M) communication refers to a model in which devices communicate directly with each other over available wired or wireless channels. M2M is a concept proposed under 3GPP (3rd Generation Partnership Project), and several research groups are working on M2M communication solutions for 5G networks. Challenges associated with M2M communication include the lack of standards, security, poor infrastructure, interoperability, and diverse architectures. In this paper, we propose a new mechanism called TM2M5G (The Machine to Machine for 5G) based on the SOFTNET platform, which supports 5G heterogeneous networks. We propose an architecture for M2M communication based on SOFTNET and provide new features such as security algorithms for data transmission among devices and a scheduling algorithm for seamless transmission of data packets over the network. Finally, simulation results for this algorithm, based on a system-level simulator and considering two different approaches for analyzing parameters such as delay, throughput, and bandwidth, are presented.
The document proposes a Crosslayered and Power Conserved Routing Topology (CPCRT) for congestion control in mobile ad hoc networks. CPCRT aims to distinguish between packet loss due to link failure versus other causes like congestion. It takes a cross-layer approach using information from the physical, MAC, and application layers. The proposed method also aims to conserve power during packet transmission by adjusting transmission power levels based on received signal strength. Simulation results show that CPCRT can better utilize resources and conserve power during congestion control compared to other approaches.
Adaptive resource allocation and internet traffic engineering on data network (csandit)
This research paper addresses bandwidth allocation, optimum capacity allocation, network operational cost reduction, and improving the Internet user experience. Traffic engineering (TE) is used to manipulate network traffic to achieve certain requirements and meet certain needs; TE has become one of the most important building blocks in the design of the Internet backbone infrastructure. The research objectives are: efficient allocation of bandwidth across multiple paths; optimum path selection; and minimizing network traffic delays while maximizing bandwidth utilization over multiple network paths. Bandwidth is allocated proportionally over multiple paths based on each path's capacity.
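The proportional allocation rule mentioned above is simple enough to state directly. A minimal sketch (units and values invented for the example):

```python
# Split a traffic demand across multiple paths in proportion to each
# path's capacity, as in capacity-proportional traffic engineering.

def allocate_bandwidth(demand, capacities):
    """Return per-path allocations proportional to path capacity."""
    total = sum(capacities)
    return [demand * c / total for c in capacities]

# 90 Mb/s of demand over paths of capacity 10, 20 and 30 Mb/s:
print(allocate_bandwidth(90, [10, 20, 30]))  # [15.0, 30.0, 45.0]
```

Each path then carries the same fraction of its own capacity, so no single path saturates before the others.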
The Minimum Cost Forwarding Using MAC Protocol for Wireless Sensor Networks (IJMER)
The document discusses the Minimum Cost Forwarding (MCF) routing protocol for wireless sensor networks. MCF establishes optimal routing paths with few message exchanges and is scalable and simple to implement. The authors formally model MCF as timed automata and use model checking to verify its properties. Their analysis identified weaknesses in MCF concerning equal-cost paths and node failures. The authors present improvements to address deficiencies in the original MCF protocol.
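The core of minimum cost forwarding is a cost field: each node learns its minimum cost to the sink, and a packet is forwarded only along links that decrease the remaining cost. The sketch below computes such a field with a Dijkstra-style relaxation from the sink; the topology and link costs are invented for the example, and MCF's actual distributed setup with few message exchanges differs from this centralized computation:

```python
import heapq

# Build the cost field behind minimum cost forwarding: each node's
# minimum cost to the sink (Dijkstra run from the sink).

def build_cost_field(links, sink):
    """links: dict node -> list of (neighbour, link_cost)."""
    cost = {sink: 0}
    heap = [(0, sink)]
    while heap:
        c, u = heapq.heappop(heap)
        if c > cost.get(u, float("inf")):
            continue
        for v, w in links.get(u, []):
            if c + w < cost.get(v, float("inf")):
                cost[v] = c + w
                heapq.heappush(heap, (c + w, v))
    return cost

links = {  # undirected: each edge listed both ways
    "sink": [("a", 1), ("b", 4)],
    "a": [("sink", 1), ("b", 1), ("c", 5)],
    "b": [("sink", 4), ("a", 1), ("c", 1)],
    "c": [("a", 5), ("b", 1)],
}
field = build_cost_field(links, "sink")
print(field["c"])  # 3  (c -> b -> a -> sink)
```

The equal-cost-path weakness identified by the model checking shows up naturally here: when two neighbours offer the same remaining cost, the forwarding rule alone does not pick a unique next hop.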
Classroom Shared Whiteboard System using Multicast Protocol (ijtsrd)
Multiple hosts wish to receive the same data from one or more senders. Multicast routing defines extensions to IP routers to support broadcasting data in IP networks. Multicast data is sent and received at a multicast address, which defines a group; data travels between senders and receivers along routing trees. Demonstrative lectures require sharing the lecturer's computer screen with the students as well as holding discussions with them. The multicast protocol is the most suitable method because of its speed and better-synchronized operation. The word multicast typically refers to IP multicast, which is often employed for streaming media and Internet television applications. Wit Yee Swe, Khaing Thazin Min, Khin Chan Myae Zin, and Yi Yi Aung, "Classroom Shared Whiteboard System using Multicast Protocol", International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume 3, Issue 5, August 2019. URL: https://www.ijtsrd.com/papers/ijtsrd27976.pdf Paper URL: https://www.ijtsrd.com/engineering/electronics-and-communication-engineering/27976/classroom-shared-whiteboard-system-using-multicast-protocol/wit-yee-swe
Analytical average throughput and delay estimations for LTE (Spiros Louvros)
This document summarizes an article that appeared in a journal published by Elsevier. The article proposes an analytical model to estimate average throughput and packet transmission delay for uplink cell edge users in LTE networks. The model uses probability analysis and mathematical modeling to estimate transmission delay and throughput, providing cell planners with an analytical tool for evaluating uplink performance under different conditions. The model accounts for factors like scheduling decisions, resource allocation, channel conditions and buffering that impact transmission delay and throughput for cell edge users.
A Real Time Framework of Multiobjective Genetic Algorithm for Routing in Mobi... (IDES Editor)
Routing in mobile networks is a multiobjective
optimization problem. The problem needs to consider multiple
objectives simultaneously such as Quality of Service
parameters, delay and cost. This paper uses the NSGA-II
multiobjective genetic algorithm to solve the dynamic shortest
path routing problem in mobile networks and proposes a
framework for real-time software implementation.
Simulations confirm a good quality of solution (route
optimality) and a high rate of convergence.
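NSGA-II's machinery rests on the Pareto-dominance test between candidate solutions. A minimal sketch of that test applied to routes scored on (delay, cost), both to be minimised, with invented objective values:

```python
# Pareto dominance and the resulting non-dominated front, as used in
# multiobjective route selection. Objective values are illustrative.

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all <=, at least one <)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(routes):
    """routes: dict name -> (delay, cost). Keep the non-dominated ones."""
    return {
        r: obj for r, obj in routes.items()
        if not any(dominates(other, obj) for o, other in routes.items() if o != r)
    }

routes = {"r1": (10, 5), "r2": (8, 7), "r3": (12, 6)}
print(sorted(pareto_front(routes)))  # ['r1', 'r2']  (r3 is dominated by r1)
```

The front contains the routes among which no single "best" exists: r1 has lower cost, r2 lower delay, and the algorithm keeps both as candidate solutions.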
This document summarizes a research paper that proposes CafRep, an adaptive congestion control protocol for delay-tolerant networks (DTNs). CafRep uses implicit heuristics based on contact and resource congestion to offload traffic from congested parts of the network to less congested areas. It also adaptively replicates messages at lower rates in different parts of the network with non-uniform congestion levels. The paper evaluates CafRep across three real mobility traces and shows it outperforms state-of-the-art DTN forwarding algorithms in maintaining high delivery rates while keeping low delays and packet loss, especially in congested networks.
An overview on application of machine learning techniques in optical networks (Khaleda Ali)
This document provides an overview of machine learning techniques applied to optical networks. It discusses how optical networks have become more complex with the introduction of technologies like coherent transmission and elastic optical networks. This increased complexity motivates the use of machine learning to analyze network data and make decisions. The document surveys existing work on machine learning applications in optical communications and networking. It aims to introduce researchers to this field and propose new research directions to further the application of machine learning to optical networks.
Performing Network Simulators of TCP with E2E Network Model over UMTS Networks (AM Publications, India)
Wireless link losses result in poor TCP throughput, since losses are perceived as congestion by TCP. With the evolution of 3G technologies such as the Universal Mobile Telecommunication System (UMTS), TCP has become more widely used for reliable end-to-end (e2e) data delivery. However, TCP was initially designed for wired networks and therefore suffers performance degradation when the radio signal is affected by fading, shadowing, and interference. The research community has proposed many strategies to improve TCP performance over wireless links, such as introducing link-layer retransmission, explicitly notifying the sender of network conditions, or using new TCP variants. As UMTS network coverage and availability are growing rapidly, optimizing the various internal components of its wireless network is very important. One such optimization is the introduction of High Speed Downlink Packet Access (HSDPA). This architecture not only allows higher data rates but also more reliable data transfer through the introduction of Hybrid ARQ (HARQ). With this enhancement to the UMTS network, it becomes vital to examine the performance of TCP in such a network. Therefore, in this thesis, we evaluate two aspects of UMTS networks: first, the impact of HSDPA parameters such as the scheduling algorithm and RLC/MAC-hs buffer size on overall TCP performance; and second, the behaviour of two categories of TCP rate and flow control, loss-based and delay-based. Our simulations show that delay-based TCP tends to perform better than loss-based TCP in the selected scenarios. The simulations were performed using the NS-2 network simulator with an e2e network model for enhanced UMTS (EURANE).
IRJET- A Review on Audible Sound Analysis based on State Clustering throu... (IRJET Journal)
This document reviews progress in acoustic modeling for statistical parametric speech synthesis, from early hidden Markov models to recent neural network approaches. It discusses how hidden Markov models were previously dominant but artificial neural networks are now replacing them due to improvements in naturalness. The document also examines developing accurate audio classifiers using machine learning techniques on public datasets and improving classification accuracy beyond current levels of 50-79% by employing strategies like cross-validation.
This document introduces the Reactive Data System (RDS) framework called RFX for solving fast data problems reactively. It discusses how RFX was developed to handle common issues like counting pageviews, unique users, and real-time marketing. RFX is an open source, full stack framework that uses various tools like Kafka, Spark, and Redis to process high volumes of event data in real-time for applications like analytics, advertising, and monitoring. The document provides an example architecture and topology for collecting tracking data, processing it through RFX components, and generating reports.
A Day in the Life of a Hadoop Administrator (Edureka!)
This document outlines the daily tasks of a Hadoop administrator, which include monitoring the cluster, planning maintenance tasks, executing regular utility tasks like backups and file merging, upgrading systems, assisting developers, and troubleshooting issues. It also provides demonstrations on achieving high availability in Hadoop and YARN clusters, and discusses tools for monitoring cluster resources, user permissions, and common error messages. The document promotes an online Hadoop administration certification course from Edureka that teaches skills for planning, deploying, monitoring, tuning and securing Hadoop clusters.
Where is my next jobs in the age of Big Data and Automation (Trieu Nguyen)
The document discusses how automation is impacting knowledge work jobs and proposes that the best approach is augmentation, where humans and machines work together. It provides examples of how different knowledge work jobs like teachers, lawyers, and financial advisors could take steps to augment their work with automation. The key steps include humans mastering automated systems, identifying new areas for automation, focusing on tasks they currently do best, finding niche roles, and building automated systems. The implications are that organizations should adopt an augmentation perspective, select the right technologies, design work for humans and machines, provide transition options for employees, and appoint a leader to manage workplace changes.
This document summarizes the key findings from the 2016 O'Reilly Data Science Salary Survey, which collected responses from 983 data professionals. Some of the main findings include: Python and Spark contribute most to salary; those who code more earn higher salaries; SQL, Excel, R and Python are the most commonly used tools; attending more meetings correlates with higher pay; women earn less than men for the same work; and geographic location, as measured by GDP, serves as a proxy for salary variation. The report also clusters respondents based on their tool usage and tasks to identify subgroups.
Slide 3: Fast Data processing with Kafka, RFX and Redis (Trieu Nguyen)
1. The document discusses using the RFX (Reactive Function X) framework to solve problems with fast data processing.
2. RFX is a design pattern and collection of open source tools that can be used to quickly build data products and implement an agile data pipeline.
3. Examples of how RFX could be used for web analytics are presented, including counting pageviews and unique users in near real-time and detecting DDOS attacks.
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in Production (Cloudera, Inc.)
This document discusses best practices for upgrading Hadoop clusters with Cloudera Manager. It describes how the Cloudera Manager upgrade wizard provides a simplified, guided process for upgrading Hadoop distributions with minimal downtime. The upgrade wizard automates many of the manual steps previously required for upgrades and allows rolling upgrades for non-major upgrades when certain conditions are met. Following best practices like testing upgrades in non-production environments and having backup policies in place can help avoid issues during upgrades.
Agenda:
• Background for the development: From commodity to experience
• Indirect use of experiences: Experience as value adding
• Experience process
• Selling pure experiences: Using the experience realm model
• How to develop experiences
• Creating the experience settings
Introduction to Human Data Theory for Digital Economy (Trieu Nguyen)
Key ideas in this slide:
1) Knowledge about the theory of “Human Data World”
2) Examples about Data Product in real life
3) How to build a Data Product
This presentation on building servers explains what Netty is and why to choose it, and shows how, with very little code, you can build an asynchronous app server.
This document is a table of contents for a book titled "Netty Cookbook: Recipes for building asynchronous event-driven network applications". The book contains recipes for using the Netty framework to build scalable and high performance networked applications. It is aimed at Java developers with some networking knowledge who want to use Netty. The book covers topics like building TCP servers and clients, handling different protocols, integrating with web frameworks, real-time applications, security, and connecting to big data systems. It includes 9 chapters with recipes to solve common network programming problems using Netty.
Netty Notes Part 2 - Transports and Buffers (Rick Hightower)
This document provides notes on Netty Part 2 focusing on transports and buffers. It discusses the different Netty transport options including NIO, epoll, and OIO. It explains that Netty provides a common interface for different implementations. The document also covers Netty buffers including ByteBuf, direct vs array-backed buffers, composite buffers, and buffer pooling. It emphasizes that performance gains come from reducing byte copies and buffer allocation.
Art Nouveau was a total art style that emerged in the late 19th century, incorporating architecture, design and fine arts. It took inspiration from natural, organic forms like vines and flowers. Two key figures were Antoni Gaudí, a Spanish architect known for unique structures like Casa Milà and Park Güell that featured curving shapes, and Charles Rennie Mackintosh, a Scottish designer who pioneered the Art Nouveau interior style using flowing lines and nature-inspired motifs. Art Nouveau emphasized harmony and rejected historical influences in favor of a modern aesthetic focused on the natural world.
Building a useful target architecture - Myth or reality (Regine Deleu)
The document discusses the value of developing a target enterprise architecture and how to do so successfully. It emphasizes the importance of having an "enterprise" mindset that is willing to take risks and invest in innovations. Key tools for informing decisions about transformation include a business capability model, investment portfolio, and enterprise architecture with aligned strategies, standards, and perspectives. The architecture should be based on key events and guide the organization flexibly towards its goals.
The document discusses consciousness as a limitation. It begins by reviewing concepts covered so far, and introduces the idea that consciousness arises from limitations of our mental capabilities.
It then tells two stories to illustrate its point. The first is about an alien visiting special needs children and mistakenly thinking their condition results from something extra they possess rather than deficits. The second introduces the concept of a "fully aware being".
The document argues that traditional views of the mind are mistaken in assuming consciousness results from something added rather than limitations. It asserts consciousness arises from the distributed and limited nature of information processing in the brain, not from any single structure or region.
Crossroads of Asynchrony and Graceful Degradation (C4Media)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1VmbI3t.
Nitesh Kant describes how embracing asynchrony in the Netflix applications, from networking to business processing, creates gracefully degrading and highly resilient applications. Filmed at qconsf.com.
Nitesh Kant is an engineer in Netflix’s Edge Gateway team, working on Netflix’s asynchronous Inter Process Communication stack. He is the author of RxNetty which forms the core of this stack and is currently moving Zuul to this new architecture.
Hardback solution to accelerate multimedia computation through mgp in cmp (eSAT Publishing House)
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
The document proposes a federated learning approach for decentralized traffic flow prediction to address privacy and latency issues. Federated learning allows multiple nodes to build a shared machine learning model without sharing local datasets. The authors describe federated learning and its advantages like data security, diversity, and efficiency. They discuss applications and related work using federated learning for traffic prediction. The proposed approach uses a federated averaging algorithm to train a model across decentralized nodes holding local traffic data, aiming to accurately predict traffic flow while preserving data privacy.
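The federated averaging step named above has a simple core: the server combines locally trained model weights, weighting each node by its local dataset size. A minimal sketch with weights as plain lists of floats (values invented for the example):

```python
# Federated averaging (FedAvg) aggregation step: a weighted average of
# per-node model weights, weighted by local sample counts.

def federated_average(local_weights, sample_counts):
    """Weighted average of per-node weight vectors."""
    total = sum(sample_counts)
    dim = len(local_weights[0])
    return [
        sum(w[i] * n for w, n in zip(local_weights, sample_counts)) / total
        for i in range(dim)
    ]

# Two nodes: one with 100 local samples, one with 300.
w_global = federated_average([[1.0, 2.0], [5.0, 6.0]], [100, 300])
print(w_global)  # [4.0, 5.0]
```

Only the weight vectors leave each node; the raw traffic data stays local, which is the privacy property the approach relies on.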
A3: application-aware acceleration for wireless data networks (Zhenyun Zhuang)
This document discusses application-aware acceleration (A3) for improving application performance over wireless networks. It presents results showing that while enhanced transport protocols improve performance for FTP, they provide little benefit for other popular applications like CIFS, SMTP, and HTTP. This is because the behavior of these applications, designed for reliable LANs, negatively impacts their performance over lossy wireless links. The document proposes A3 as a middleware solution that offsets these behavioral problems through application-specific design principles, while remaining transparent to applications.
Load Balance in Data Center SDN Networks (IJECE, IAES)
In the last two decades, networks have changed in step with rapidly changing requirements. Current Data Center Networks host large numbers of hosts (tens of thousands) with special bandwidth needs, as cloud networking and multimedia content computing increase. Conventional Data Center Networks (DCNs) are strained by the growing numbers of users and bandwidth requirements, which impose many implementation limitations. In current networking devices, the coupling of the control and forwarding planes results in network architectures that are not suitable for dynamic computing and storage needs. Software Defined Networking (SDN) was introduced to change this notion of traditional networks by decoupling the control and forwarding planes. Due to the rapid increase in the number of applications, websites, and storage space, some network resources are underutilized because of static routing mechanisms. To overcome these limitations, an OpenFlow-based Software Defined Data Center network architecture is used to obtain better performance parameters and to implement a traffic load-balancing function. Load balancing distributes traffic requests over the connected servers to diminish network congestion and reduce the underutilization of servers. As a result, SDN affords more effective configuration, enhanced performance, and more flexibility for dealing with huge network designs.
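The load-balancing function described above can be illustrated with one common policy; the abstract does not name the exact algorithm used, so the least-connections rule and server names below are assumptions for the sketch:

```python
# Sketch of a load balancer an SDN controller might implement: assign
# each new request to the server with the fewest active connections.

class LeastConnectionsBalancer:
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def assign(self):
        # Break ties deterministically by server name.
        server = min(self.active, key=lambda s: (self.active[s], s))
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1

lb = LeastConnectionsBalancer(["s1", "s2"])
print([lb.assign() for _ in range(4)])  # ['s1', 's2', 's1', 's2']
```

In an SDN setting the controller would install the corresponding flow rules, so the policy lives in one place rather than being baked into static routes.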
The document describes lessons learned from developing protocols to enable data sharing in a virtual enterprise. It discusses protocols selected by the NIIIP Consortium that build on STEP to allow engineering organizations to share technical product data over the Internet. The protocols included SDAI Java/IDL bindings, EXPRESS-X for data mapping, and STEP Services for data integration. These were used to implement a Virtual Enterprise Product Data Repository (VEPR) demonstrated in the last of three cycles to integrate product data from multiple sources. Key lessons included the need for standards to contribute and access controlled data in a VEPR as well as for applications to operate on data from different repositories.
The development of embedded applications (such as Wireless Sensor Network protocols) often requires a shift to formal specifications. To ensure the reliability and performance of WSNs, such protocols must be designed following methods that reduce the error rate. Formal methods (such as automata, Petri nets, algebras, logics, etc.) have been widely used in the specification, analysis, and verification of these protocols. Implementation is then an important phase for deploying, testing, and using those protocols in real environments. The main objective of the current paper is to formalize, and automate, the transformation from a high-level specification (in Timed Automata) to a low-level implementation (in the NesC language on the TinyOS system). The proposed transformation approach defines a set of rules that allow the passage between these two levels. We implemented our solution and illustrated the proposed approach on a protocol case study for "humidity" and "temperature" sensing in WSN applications.
An Adjacent Analysis of the Parallel Programming Model Perspective: A Survey (IRJET Journal)
This document provides an overview and analysis of parallel programming models. It begins with an abstract discussing the growing demand for parallel computing and challenges with existing parallel programming frameworks. It then reviews several relevant studies on parallel programming models and architectures. The document goes on to describe several key parallel programming models in more detail, including the Parallel Random Access Machine (PRAM) model, Unrestricted Message Passing (UMP) model, and Bulk Synchronous Parallel (BSP) model. It discusses aspects of each model like architecture, communication methods, and associated cost models. The overall goal is to compare benefits and limitations of different parallel programming models.
Contemporary Energy Optimization for Mobile and Cloud Environment (ijceronline)
Cloud and mobile computing applications are growing heavily in usage, and these two areas are extending the usability of systems. This review paper provides information about cloud and mobile applications in terms of the resources they consume, the need for users in several locations to choose among a variety of features, and the evolutionary provisions for service providers and end users. The two fields are combined to provide good functionality, efficiency, and effectiveness with mobile phones, with enhancements that account for power consumption given the resource-constrained nature of devices, the communication media, and cost effectiveness. The paper discusses concepts related to power consumption, underlying protocols, and other performance issues.
This paper presents a technique for end hosts to detect if intermediaries like routers are applying compression to traffic flows without the end hosts' knowledge. The technique is non-intrusive and only uses packet inter-arrival times for detection, requiring no changes to or cooperation from intermediaries. Simulations and internet experiments show the approach can accurately detect compression applied by intermediaries. The technique could help end hosts optimize their own use of compression resources by avoiding redundant compression when intermediaries are already compressing traffic.
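The detection idea can be sketched at a high level: if an intermediary compresses traffic, a burst of highly compressible probe packets traverses the bottleneck faster than equally sized incompressible ones, which shows up in inter-arrival times. The comparison below is an illustrative simplification of that principle; the paper's actual probe design and thresholds are not reproduced here:

```python
# Hedged sketch: compare average inter-arrival gaps of two probe trains,
# one of low-entropy (compressible) packets and one of random
# (incompressible) packets. A much smaller gap for the compressible
# train suggests an intermediary is compressing the flow.
# The ratio threshold is an invented value.

def looks_compressed(t_compressible, t_random, ratio=0.7):
    """Return True if the compressible train arrives markedly faster."""
    def mean_gap(times):
        return sum(b - a for a, b in zip(times, times[1:])) / (len(times) - 1)
    return mean_gap(t_compressible) < ratio * mean_gap(t_random)

# Compressible probes arrive ~2 ms apart, random probes ~10 ms apart:
print(looks_compressed([0, 2, 4, 6], [0, 10, 20, 30]))  # True
```

Because the test uses only arrival timestamps at the receiver, it needs no cooperation from, or changes to, the intermediaries, which matches the non-intrusive property claimed above.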
Evaluation of load balancing approaches for Erlang concurrent application in ... (TELKOMNIKA JOURNAL)
Cloud system accommodates the computing environment including PaaS (platform as a service), SaaS (software as a service), and IaaS (infrastructure as service) that enables the services of cloud systems. Cloud system allows multiple users to employ computing services through browsers, which reflects an alternative service model that alters the local computing workload to a distant site. Cloud virtualization is another characteristic of the clouds that deliver virtual computing services and imitate the functionality of physical computing resources. It refers to an elastic load balancing management that provides the flexible model of on-demand services. The virtualization allows organizations to improve high levels of reliability, accessibility, and scalability by having a capability to execute applications on multiple resources simultaneously. In this paper we use a queuing model to consider a flexible load balancing and evaluate performance metrics such as mean queue length, throughput, mean waiting time, utilization, and mean traversal time. The model is aware of the arrival of concurrent applications with an Erlang distribution. Simulation results regarding performance metrics are investigated. Results point out that in Cloud systems both the fairness and load balancing are to be significantly considered.
Improvement in Error Resilience in BIST using hamming codeIJMTST Journal
In the current scenario of IP core based SoC, to test the CUT we need to communication link between Circuit Under Test and ATPG, so before applying to actual DUT. If there is a problem with this link, there may be a lip in bit of test data. Compared to original test data, if there is a bit lip in the original data, the codeword may change and hence the decompressed data will have a large number of bit deviation. This deviation in bits can severely degrade the test quality and overall fault coverage which may affect yield. The error resilience is the capability of the test data to resist against such bit lips. Here in this paper, the earlier methods of error resilience is compared and a Hamming code based error resilience technique is proposed to improve the error resilience capacity of compressed test data. This method is applied on Huffman code based compressed test data of widely used ISCAS benchmark circuits. The fault coverage measurement results show the effectiveness of the proposed method. The basic goal here is to survey the effect of bit lips on fault coverage and prepare a platform for further development in this avenue.
This document summarizes a research paper that proposes a framework for dynamically partitioning mobile applications between a mobile device and cloud computing resources. The framework consists of runtime systems on both the mobile device and cloud to support adaptive partitioning and distributed execution. It aims to efficiently serve large numbers of users by allowing computation instances on the cloud to be shared among multiple applications and tenants. The paper formulates the partitioning problem as an optimization problem that allocates application components and wireless bandwidth to maximize throughput.
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDMOD...ijgca
This document discusses modeling cloud computing data centers as queuing systems to analyze performance factors. It presents an analytical model of a cloud data center as a [(M/G/1) : (∞/GDMODEL)] queuing system with single task arrivals and infinite task buffer capacity. The model is solved to obtain important performance metrics like mean number of tasks in the system. Prior work on modeling cloud systems and queuing theory concepts are also reviewed. Key assumptions of the proposed model include tasks following a Poisson arrival process and service times having a general probability distribution.
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDMOD...ijgca
This document discusses modeling cloud computing data centers as queuing systems to analyze performance factors. It begins with background on cloud computing and queuing theory. It then models a cloud data center as an [(M/G/1) : (∞/GDMODEL)] queuing system with single task arrivals and infinite task buffer capacity. Key performance factors analyzed include mean number of tasks in the system. Analytical results are obtained by solving the model to estimate response time distribution and other metrics. The modeling approach allows determining the relationship between performance and number of servers/buffer size.
Review and Performance Comparison of Distributed Wireless Reprogramming Proto...IOSR Journals
Abstract:A Reprogramming service should be efficient, reliable and secured in Wireless sensor network.
Wireless reprogramming for wireless sensor network emphasize over the process of changing or improving the
functionality of simulation or existing code. For challenging and on demand security purpose, secure and
distributed routing protocols such as SDRP and ISDRP were developed. This paper reviews and compares the
propagation delay for two reprogramming protocols, SDRP and ISDRP, which based on hierarchy of energies
in network. Both are based on identity-based cryptography. But in the improved protocol the keys are
distributed to the network as per the sorting and communication capabilities to improve the broadcast or
communication nature of the network. Moreover, ISDRP demonstrates the security concepts, which deals over
the key encryption properties using heap sort algorithm and the confidentiality parameter is enhanced by
changing the private key values after certain interval of time for cluster head in respect to different public keys.
The ISDRP shows high efficiency rate clearly with the throughput and propagation results by implementation in
practice over SRDP.
Keywords: identity-based cryptography,ISDRP, heapsort algorithm, Reprogramming, SDRP, Wireless sensor
network.
Presentation for AFIN - 2012. EC Standardization mandate M/441,
Protocol Meter-Bus for measuring,
M2M API standards, EU-Russia Partnership for Modernization
About M2M standards and their possible extensions.
EC Standardization mandate M/441. Protocol Meter-Bus for measuring. On M2M API standards. EU-Russia Partnership for Modernization
BCFIC-2012
Application-Aware Big Data Deduplication in Cloud EnvironmentSafayet Hossain
The document proposes AppDedupe, a distributed deduplication framework for cloud environments that exploits application awareness, data similarity, and locality. AppDedupe uses a two-tiered routing scheme with application-aware routing at the director level and similarity-aware routing at the client level. It builds application-aware similarity indices with super-chunk fingerprints to speed up intra-node deduplication efficiently. Evaluation results show that AppDedupe consistently outperforms state-of-the-art schemes in deduplication efficiency and achieving high global deduplication effectiveness.
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDM O...ijgca
The ever-increasing status of the cloud computing h
ypothesis and the budding concept of federated clou
d
computing have enthused research efforts towards in
tellectual cloud service selection aimed at develop
ing
techniques for enabling the cloud users to gain max
imum benefit from cloud computing by selecting
services which provide optimal performance at lowes
t possible cost. Cloud computing is a novel paradig
m
for the provision of computing infrastructure, whic
h aims to shift the location of the computing
infrastructure to the network in order to reduce th
e maintenance costs of hardware and software resour
ces.
Cloud computing systems vitally provide access to l
arge pools of resources. Resources provided by clou
d
computing systems hide a great deal of services fro
m the user through virtualization. In this paper, t
he
cloud data center is modelled as
queuing system with a single task arrivals
and a task request buffer of infinite capacity.
final Year Projects, Final Year Projects in Chennai, Software Projects, Embedded Projects, Microcontrollers Projects, DSP Projects, VLSI Projects, Matlab Projects, Java Projects, .NET Projects, IEEE Projects, IEEE 2009 Projects, IEEE 2009 Projects, Software, IEEE 2009 Projects, Embedded, Software IEEE 2009 Projects, Embedded IEEE 2009 Projects, Final Year Project Titles, Final Year Project Reports, Final Year Project Review, Robotics Projects, Mechanical Projects, Electrical Projects, Power Electronics Projects, Power System Projects, Model Projects, Java Projects, J2EE Projects, Engineering Projects, Student Projects, Engineering College Projects, MCA Projects, BE Projects, BTech Projects, ME Projects, MTech Projects, Wireless Networks Projects, Network Security Projects, Networking Projects, final year projects, ieee projects, student projects, college projects, ieee projects in chennai, java projects, software ieee projects, embedded ieee projects, "ieee2009projects", "final year projects", "ieee projects", "Engineering Projects", "Final Year Projects in Chennai", "Final year Projects at Chennai", Java Projects, ASP.NET Projects, VB.NET Projects, C# Projects, Visual C++ Projects, Matlab Projects, NS2 Projects, C Projects, Microcontroller Projects, ATMEL Projects, PIC Projects, ARM Projects, DSP Projects, VLSI Projects, FPGA Projects, CPLD Projects, Power Electronics Projects, Electrical Projects, Robotics Projects, Solor Projects, MEMS Projects, J2EE Projects, J2ME Projects, AJAX Projects, Structs Projects, EJB Projects, Real Time Projects, Live Projects, Student Projects, Engineering Projects, MCA Projects, MBA Projects, College Projects, BE Projects, BTech Projects, ME Projects, MTech Projects, M.Sc Projects, Final Year Java Projects, Final Year ASP.NET Projects, Final Year VB.NET Projects, Final Year C# Projects, Final Year Visual C++ Projects, Final Year Matlab Projects, Final Year NS2 Projects, Final Year C Projects, Final Year Microcontroller Projects, 
Final Year ATMEL Projects, Final Year PIC Projects, Final Year ARM Projects, Final Year DSP Projects, Final Year VLSI Projects, Final Year FPGA Projects, Final Year CPLD Projects, Final Year Power Electronics Projects, Final Year Electrical Projects, Final Year Robotics Projects, Final Year Solor Projects, Final Year MEMS Projects, Final Year J2EE Projects, Final Year J2ME Projects, Final Year AJAX Projects, Final Year Structs Projects, Final Year EJB Projects, Final Year Real Time Projects, Final Year Live Projects, Final Year Student Projects, Final Year Engineering Projects, Final Year MCA Projects, Final Year MBA Projects, Final Year College Projects, Final Year BE Projects, Final Year BTech Projects, Final Year ME Projects, Final Year MTech Projects, Final Year M.Sc Projects, IEEE Java Projects, ASP.NET Projects, VB.NET Projects, C# Projects, Visual C++ Projects, Matlab Projects, NS2 Projects, C Projects, Microcontroller Projects, ATMEL Projects, PIC Projects, ARM Projects, DSP Projects, VLSI Projects, FPGA Projects, CPLD Projects, Power Electronics Projects, Electrical Projects, Robotics Projects, Solor Projects, MEMS Projects, J2EE Projects, J2ME Projects, AJAX Projects, Structs Projects, EJB Projects, Real Time Projects, Live Projects, Student Projects, Engineering Projects, MCA Projects, MBA Projects, College Projects, BE Projects, BTech Projects, ME Projects, MTech Projects, M.Sc Projects, IEEE 2009 Java Projects, IEEE 2009 ASP.NET Projects, IEEE 2009 VB.NET Projects, IEEE 2009 C# Projects, IEEE 2009 Visual C++ Projects, IEEE 2009 Matlab Projects, IEEE 2009 NS2 Projects, IEEE 2009 C Projects, IEEE 2009 Microcontroller Projects, IEEE 2009 ATMEL Projects, IEEE 2009 PIC Projects, IEEE 2009 ARM Projects, IEEE 2009 DSP Projects, IEEE 2009 VLSI Projects, IEEE 2009 FPGA Projects, IEEE 2009 CPLD Projects, IEEE 2009 Power Electronics Projects, IEEE 2009 Electrical Projects, IEEE 2009 Robotics Projects, IEEE 2009 Solor Projects, IEEE 2009 MEMS Projects, IEEE 2009 
J2EE P
Similar to Application-oriented ping-pong benchmarking: how to assess the real communication overheads (20)
Building Your Customer Data Platform with LEO CDP in Travel Industry.pdfTrieu Nguyen
1. The document outlines the Chief Platform Engineer's background and introduces LEO CDP, a customer data platform for the travel industry.
2. It discusses 5 challenges companies face related to customer growth, journeys, data platforms, communication and understanding customers with big data.
3. A case study shows how LEO CDP can be used to create a customer journey map for a travel agency, including personalized promotions and offers sent via email.
How to track and improve Customer Experience with LEO CDPTrieu Nguyen
This document discusses how to track and improve customer experience using LEO CDP. It begins by explaining why measuring customer experience is important, then introduces four key metrics: Customer Feedback Score, Customer Effort Score, Customer Satisfaction Score, and Net Promoter Score. It describes using journey maps to manage customer experience data and visualize the customer journey. Finally, it presents LEO CDP as a software solution for collecting customer experience data, building surveys, and generating reports to gain insights to improve products, services, and the overall customer experience.
[Notes] Customer 360 Analytics with LEO CDPTrieu Nguyen
Part 1: Why should every business need to deploy a CDP ?
1. Big data is the reality of business today
2. What are technologies to manage customer data ?
3. The rise of first-party data and new technologies for Digital Marketing
4. How to apply USPA mindset to build your CDP for data-driven business
Part 2: How to use LEO CDP for your business
1. Core functions of LEO CDP for marketers and IT managers
2. Data Unification for Customer 360 Analytics
3. Data Segmentation
4. Customer Personalization
5. Customer Data Activation
Part 3: Case study in O2O Retail and Ecommerce
1. How to build customer journey map for ecommerce and retail
2. How to do customer analytics to find ideal customer profiles
The ideal customer profile in a B2B context
The ideal customer profile in a B2C context
3. Manage product catalog for customer personalization
4. Monitoring Data of Customer Experience (CX Analytics)
CX Data Flow
CX Rating plugin is embedded in the website, to collect feedback data
An overview of CX Report
A CX Report in a customer profile
5. Monitoring data with real-time event tracking reports
Event Data Flow
Summary Event Data Report
Event Data Report in a Customer Profile
Part 4: How to setup an instance of LEO CDP for free
1. Technical architecture
2. Server infrastructure
3. Setup middlewares: Nginx, ArangoDB, Redis, Java and Python
Network requirements
Software requirements for new server
ArangoDB
Nginx Proxy
SSL for Nginx Server
Java 8 JVM
Redis
Install Notes for Linux Server
Clone binary code for new server
Set DNS hosts for LEO CDP workers
4. Setup data for testing and system verification
Part 5: Summary all key ideas
Why should you invest in LEO CDP ?
Purpose: Big data and AI democracy for SMEs companies
Problem: Customer Analytics and Customer Personalization
Solutions: CDP + CX + Personalization Engine
Product demo: LEO CDP for Ecommerce and Fintech
Business model: Freemium → Ecosystem → Subscription
Market size: 20 billion USD in 2026 and CAGR 34.6%
Differentiation: cloud-native software
Go-to-market approach: Community → Free → Paid
Team: 1 full-stack dev, 1 data scientist and 12,000 fans of BigDataVietnam.org Community
Need 150,000 USD for scaling business (you get 20% share)
The document outlines new features and updates for 2022 from USPA Technology Company, including a new dedicated dashboard for CMOs, updated UI for Customer 360 Insights, and a focus on data-driven business processes and digital marketing in B2B through standardizing data-driven processes and focusing on customer insights.
Lộ trình triển khai LEO CDP cho ngành bất động sảnTrieu Nguyen
1) Hiểu bài toán số hoá trải nghiệm khách hàng
2) Nghiên cứu giải pháp LEO CDP
3) Lộ trình triển khai
Phát triển / số hoá điểm chạm khách hàng
Xây dựng bản đồ hành trình khách hàng
Định nghĩa các metrics và KPI quan trọng
Xây dựng web portal và mobile data hub
Xây dựng kế hoạch Digital Marketing
Triển khai CDP và Marketing Automation
Xây dựng đội Analytics để phân tích dữ liệu
From Dataism to Customer Data PlatformTrieu Nguyen
1) How to think in the age of Dataism with LEO CDP ?
2) Why is Dataism for human, business and society ?
3) How should LEO Customer Data Platform (LEO CDP) work ?
4) How to use LEO CDP for your business ?
Data collection, processing & organization with USPA frameworkTrieu Nguyen
1) How to think in the age of Dataism with USPA framework ?
2) How to collect customer data
3) Data Segmentation Processing for flexibility and scalability
4) Data Organization for personalization and business activation
Part 1: Introduction to digital marketing technologyTrieu Nguyen
This document provides an overview of a mini-course on data-driven marketing using the USPA framework presented by Trieu Nguyen. It includes biographical information about Trieu Nguyen's background and experience in big data projects, machine learning, and digital marketing roles. The document also outlines the topics that will be covered in the mini-course, including digital media models, search engine marketing, social media marketing, advertising technology, customer data platforms, and case studies. Key terms like omnichannel strategy, customer experience strategy, and artificial intelligence strategies for marketing are also defined.
Transform your marketing and sales capabilities with Big Data and A.I
1) Why is Customer Data Platform (CDP) ?
Case study: Enhancing the revenue of your restaurant with CDP and mobile app marketing
Question: Why can CDP disrupt business model for restaurant industry (B2C) ?
2) How would CDP work in practice ?
Introducing USPA.tech as logical framework for implementing CDP in practice
How Can a Customer Data Platform Enhance Your Account-Based Marketing Strategy (B2B) ?
3) How can we implement CDP for business?
Introducing the CDP as customer-first marketing platform for all industries (my key idea in this slide)
How to build a Personalized News Recommendation PlatformTrieu Nguyen
This document discusses how to build a personalized news recommendation platform. It explains that recommendation systems are needed to retain users, increase traffic, and improve the content experience. It describes popular techniques like collaborative filtering, content-based filtering, and hybrid systems. Specifically, it outlines a case study using a USPA framework with real social news data. Key factors for a news recommendation system are discussed like novelty, user history, and location. The document also provides a simple example of building a recommendation engine with Apache Spark.
How to grow your business in the age of digital marketing 4.0Trieu Nguyen
1. The document discusses how businesses can grow in the digital marketing age using technologies like cloud services, big data, AI, and headless CMS platforms.
2. It introduces LeoCloudCMS as a headless API CMS that is built for digital marketing 4.0 and can run scalably on cloud computing.
3. The key idea is to think of your entire business as a "box" and use LeoCloudCMS to attract internet users into the box and offer valuable services.
Video Ecosystem and some ideas about video big dataTrieu Nguyen
Introduction to Video Ecosystem Mind Map
Video Streaming Platform
Video Ad Tech Platform
Video Player Platform
Video Content Distribution Platform
Video Analytics Platform
Summary of key ideas
Q & A
Concepts, use cases and principles to build big data systems (1)Trieu Nguyen
1) Introduction to the key Big Data concepts
1.1 The Origins of Big Data
1.2 What is Big Data ?
1.3 Why is Big Data So Important ?
1.4 How Is Big Data Used In Practice ?
2) Introduction to the key principles of Big Data Systems
2.1 How to design Data Pipeline in 6 steps
2.2 Using Lambda Architecture for big data processing
3) Practical case study : Chat bot with Video Recommendation Engine
4) FAQ for student
This document discusses open over-the-top (OTT) video content platforms. It defines OTT as streaming media distributed directly over the internet bypassing traditional distribution methods. The document then covers OTT market drivers and business models. It examines the most popular OTT platform in Vietnam and challenges for successful OTT platforms including scalability, content acquisition and management, audience engagement, and business models. Finally, it proposes a modular technical architecture for an open OTT video platform using open source technologies.
Apache Hadoop and Spark: Introduction and Use Cases for Data AnalysisTrieu Nguyen
This document provides an introduction to Apache Hadoop and Spark for data analysis. It discusses the growth of big data from sources like the internet, science, and IoT. Hadoop is introduced as providing scalability on commodity hardware to handle large, diverse data types with fault tolerance. Key Hadoop components are HDFS for storage, MapReduce for processing, and HBase for non-relational databases. Spark is presented as improving on MapReduce by using in-memory computing for iterative jobs like machine learning. Real-world use cases of Spark at companies like Uber, Pinterest, and Netflix are briefly described.
Introduction to Recommendation Systems (Vietnam Web Submit)Trieu Nguyen
1) Why do we need recommendation systems ?
2) How can we think with recommendation systems ?
3) How can we implement a recommendation system with open source technologies ?
RFX framework https://github.com/rfxlab
Apache Kafka: https://kafka.apache.org
Apache Spark: https://spark.apache.org
Introduction to Data Science
1.1 What is Data Science, importance of data science,
1.2 Big data and data Science, the current Scenario,
1.3 Industry Perspective Types of Data: Structured vs. Unstructured Data,
1.4 Quantitative vs. Categorical Data,
1.5 Big Data vs. Little Data, Data science process
1.6 Role of Data Scientist
DESIGN AND DEVELOPMENT OF AUTO OXYGEN CONCENTRATOR WITH SOS ALERT FOR HIKING ...JeevanKp7
Long-term oxygen therapy (LTOT) and novel techniques of evaluating treatment efficacy have enhanced the quality of life and decreased healthcare expenses for COPD patients.
The cost of a pulmonary blood gas test is comparable to the cost of two days of oxygen therapy and the cost of a hospital stay is equivalent to the cost of one month of oxygen therapy, long-term oxygen therapy (LTOT) is a cost-effective technique of treating this disease.
A small number of clinical investigations on LTOT have shown that it improves the quality of life of COPD patients by reducing the loss of their respiratory capacity. A study of 8487 Danish patients found that LTOT for 1524 hours per day extended life expectancy from 1.07 to 1.40 years.
Overview of Statistical software such as ODK, surveyCTO,and CSPro
2. Software installation(for computer, and tablet or mobile devices)
3. Create a data entry application
4. Create the data dictionary
5. Create the data entry forms
6. Enter data
7. Add Edits to the Data Entry Application
8. CAPI questions and texts
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...rightmanforbloodline
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B. Fraleigh, Verified Chapters 1 - 56,.pdf
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B. Fraleigh, Verified Chapters 1 - 56,.pdf
Annex K RBF's The World Game pdf documentSteven McGee
Signals & Telemetry Annex K for RBF's The World Game / Trade Federations / USPTO 13/573,002 Heart Beacon Cycle Time - Space Time Chain meters, metrics, standards. Adaptive Procedural template framework structured data derived from DoD / NATO's system of systems engineering tech framework
T. Schneider et al.
as observed by an application. It supports serialization loops in C and Fortran as
well as MPI datatypes for representative application access patterns. Our benchmark,
consisting of seven micro-applications, unveils significant performance discrepancies
between the MPI datatype implementations of state-of-the-art MPI implementations.
Our micro-applications aim to provide a standard benchmark for MPI datatype
implementations to guide optimizations, similarly to the established benchmarks
SPEC CPU and the Livermore Loops.

Keywords: MPI datatypes · Benchmark · Data movement · Access pattern

Mathematics Subject Classification: 68M10 · 68M14
1 Motivation and state of the art
One of the most common benchmarks in HPC to gauge network performance is a
ping-pong benchmark over a range of different message sizes which are sent from
and received into a consecutive buffer. With such a benchmark we can judge the
minimum achievable latency and maximum available bandwidth for an application.
As we show in Fig. 1a, this correlates weakly with the communication overhead
that typical computational science applications experience, because such applications
generally do not communicate consecutive data, but serialize (often called pack) their
data into a consecutive buffer before sending it.
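The pack step just described might look like the following minimal sketch (plain C; the array shape, the function name, and the choice of column are illustrative assumptions, not code from the benchmark):

```c
#include <assert.h>

/* Hypothetical sketch: serialize ("pack") one column of a row-major
 * NX x NY array into a contiguous buffer, as an application would do
 * before handing the buffer to a send call such as MPI_Send. */
enum { NX = 4, NY = 5 };

static void pack_column(const double grid[NX][NY], int col, double *buf)
{
    for (int i = 0; i < NX; ++i)
        buf[i] = grid[i][col];   /* stride-NY access becomes unit stride */
}
```

A ping-pong benchmark that only transfers `buf` measures the network; the application additionally pays for this copy on the sender and for the mirror-image unpack on the receiver.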
The MPI standard [15] is the de facto standard for implementing high-performance
scientific applications. The advantage of MPI is that it enables a user to write
performance-portable codes. This is achieved by abstraction: instead of expressing
a communication step as a set of point-to-point communications in a low-level
communication API, it can be expressed in an abstract and platform-independent way. MPI
implementers can tune the implementation of these abstract communication patterns
for specific machines.
[Figure 1: two panels, (a) manual packing with Fortran 90 and (b) packing with MPI DDTs; each plots Bandwidth [MB/s] (100–500) against Datasize [Byte] (0–150 K) for the tests MILC_su3_zd, NAS_LU_x, SPECFEM3D_cm, Traditional Ping-Pong, WRF_x_vec, and WRF_y_vec.]
Fig. 1 Bandwidth attained by several application benchmarks, compared with a normal ping-pong of the
same size. No application is able to attain the performance outlined by the standard ping-pong benchmark
when manual pack loops are used. MPI datatypes (DDTs) recognize that the buffer in the NAS_LU_x case
(cf. Sect. 2 for a detailed description of all patterns) is already contiguous and do not perform an extra
copy. However, there are also many cases where MPI DDTs perform worse than manual packing. a Manual
packing with Fortran 90. b Packing with MPI DDTs
Application-oriented ping-pong benchmarking
Fig. 2 An example use case for MPI derived datatypes
MPI derived datatypes allow the specification of arbitrary data layouts in all places
where MPI functions accept a datatype argument (e.g., MPI_INT). We give an example
for the usage of MPI DDTs to send/receive a vector of integers in Fig. 2. All elements
with even indices are to be replaced by the received data; elements with odd indices are
to be sent. Without MPI DDTs one would either have to allocate temporary buffers and
manually pack/unpack the data, or send a large number of messages. The usage of MPI
DDTs greatly simplifies this example. If the interconnect in use supports non-contiguous
transfers (such as Cray Gemini [2]), the two copies can be avoided completely. Thus the
usage of MPI DDTs not only simplifies the code but can also improve performance due
to the zero-copy formulation. In Fig. 1b we show
that some applications can benefit from using MPI DDTs instead of manual pack loops
(for example NAS LU and MILC, as was already demonstrated in [13]).
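The even/odd layout of Fig. 2 is a strided pattern that MPI can describe once with a derived datatype, e.g. `MPI_Type_vector(n/2, 1, 2, MPI_INT, &ddt)`. The following hedged sketch (plain C, no MPI runtime required; all names are illustrative) shows the manual pack/unpack loops that such a datatype makes unnecessary:

```c
#include <assert.h>

/* Sketch of the Fig. 2 layout: odd-index elements of 'vec' are
 * serialized for sending, and received data replaces the even-index
 * elements. With MPI DDTs the same layout is described declaratively
 * (MPI_Type_vector(n/2, 1, 2, MPI_INT, &ddt)) and no temporary
 * buffers or copy loops are needed. 'n' is assumed even. */
static void pack_odd(const int *vec, int n, int *sendbuf)
{
    for (int i = 1; i < n; i += 2)
        sendbuf[i / 2] = vec[i];
}

static void unpack_even(int *vec, int n, const int *recvbuf)
{
    for (int i = 0; i < n; i += 2)
        vec[i] = recvbuf[i / 2];
}
```

These two loops are exactly the copies that a zero-copy DDT formulation on scatter/gather-capable hardware can avoid.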
Not many scientific codes leverage MPI DDTs, even though their usage would be
appropriate in many cases. One of the reasons might be that current MPI implementa-
tions in some cases still fail to match the performance of manual packing, despite the
work that is done on improving DDT implementations [8,20,22]. Most of this work
is guided by a small number of micro-benchmarks. This makes it hard to gauge the
impact of a certain optimization on real scientific codes.
Coming back to the high-level language analogy made before, and comparing this
situation to that of people developing new compiler optimization techniques or
microarchitecture extensions, we see that, unlike in those fields, there is no
application-derived set of benchmarks to evaluate MPI datatype implementations.
Benchmark suites such as SPEC [10] or the Livermore Loops [14] are used (e.g., [1])
to evaluate compilers and microarchitectures. To address this issue, we developed a
set of micro-applications1 that represent access patterns of representative scientific
applications as optimized pack loops as well as MPI datatypes. Micro-applications are,
similarly to mini-applications [5,7,12], kernels that represent real production-level codes.
However, unlike mini-applications that represent whole kernels, micro-applications
focus on one particular aspect (or “slice”) of the application, for example the I/O,
the communication pattern, the computational loop structure, or, as in our case, the
communication data access pattern.
1.1 Related work
Previous work in the area of MPI DDTs focuses on improving their performance, either
by improving the way DDTs are represented in MPI or by using more cache-efficient
1 Which can be downloaded from http://unixer.de/research/datatypes/ddtbench.
strategies for packing and unpacking the datatype to and from a contiguous buffer [8].
Interconnect features such as RDMA Scatter/Gather operations [22] have also been
considered. Several application studies demonstrate that MPI datatypes can outper-
form explicit packing in real-world application kernels such as FFTs [13] and matrix
redistribution [4]. However, performance of current datatype implementations remains
suboptimal and has not received as much attention as latency and bandwidth, probably
due to the lack of a reasonable and simple benchmark. For example, Gropp et al.
[11] found that several basic performance expectations are violated by MPI imple-
mentations in use today, i.e., sending data with a more expressive MPI datatype is in
some cases faster than using a less expressive one. The performance of MPI datatypes
is often measured using artificial micro-benchmarks, which are not related to specific
application codes, such as the benchmarks proposed by Reussner et al. [17]. We
identify an unstructured access class, which is present in many molecular dynamics
and finite element codes. This access pattern is completely ignored in many datatype
optimization papers. However, the issue of preparing the communication buffer has
received very little attention compared to tuning the communication itself. In this
work, we show that the serialization parts of the communication can take a share of
up to 80 % of the total communication overheads because they happen at the sender
and at the receiver.
In contrast to the related work discussed in this section, our micro-applications
offer three important features: (1) they represent a comprehensive set of application
use cases, (2) they are easy to compile and use on different architectures, and (3)
they isolate the data access and communication performance parts and thus enable the
direct comparison of different systems. They can be used as benchmarks for tuning
MPI implementations as well as for hardware/software co-design of future (e.g.,
exascale) network hardware that supports scatter/gather access. This paper is an
extended and improved version of [18]. In this version we extended the description
of the analyzed application codes and investigated whether the extracted access
patterns are persistent across the application run. We added results for the datatype
performance of Cray's
vendor MPI and compare the performance of manual pack loops implemented in
Fortran and C.
2 Representative communication data access patterns
We analyzed many parallel applications, mini-apps and application benchmarks for
their local access patterns to send and receive memory. Our analysis covers the domains
of atmospheric sciences, quantum chromodynamics, molecular dynamics, material
science, geophysical science, and fluid dynamics. We created seven micro-apps to
span these application areas. Table 1 provides an overview of investigated application
classes, their test cases, and a short description of the respective data access patterns. In
detail, we analyzed the applications WRF [19], SPECFEM3D_GLOBE [9], MILC [6]
and LAMMPS [16], representing respectively the fields of weather simulation, seismic
wave propagation, quantum chromodynamics, and molecular dynamics. We also
included existing parallel computing benchmarks and mini-apps, such as the NAS
Parallel Benchmarks (NPB) [21], the Sequoia benchmarks, as well as the Mantevo
mini-apps [12].
Table 1 Overview of the application areas, test names, and access patterns of the micro-applications
contained in our benchmark

Application class       Test name                 Access pattern
Atmospheric science     WRF_x_vec, WRF_y_vec,     Struct of 2d/3d/4d face exchanges in different
                        WRF_x_sa, WRF_y_sa        directions (x,y), using different (semantically
                                                  equivalent) datatypes: nested vectors (_vec)
                                                  and subarrays (_sa)
Quantum chromodynamics  MILC_su3_zd               4d face exchange, z direction, nested vectors
Fluid dynamics          NAS_MG_x, NAS_MG_y,       3d face exchange in each direction (x,y,z) with
                        NAS_MG_z                  vectors (y,z) and nested vectors (x)
                        NAS_LU_x, NAS_LU_y        2d face exchange in x direction (contiguous) and
                                                  y direction (vector)
Matrix transpose        FFT                       2d FFT, different vector types on
                                                  send/recv side
                        SPECFEM3D_mt              3d matrix transpose, vector
Molecular dynamics      LAMMPS_full,              Unstructured exchange of different particle types
                        LAMMPS_atomic             (full/atomic), indexed datatypes
Geophysical science     SPECFEM3D_oc,             Unstructured exchange of acceleration data for
                        SPECFEM3D_cm              different earth layers, indexed datatypes
Those applications spend a significant amount of their run-time in communication
functions, for example MILC up to 12 %, SPECFEM3D_GLOBE up to 3 %, and WRF
up to 16 % for the problems we use in our micro-applications, which is confirmed by
the analysis done in [3] and [9].
We found that MPI DDTs are rarely used in the HPC codes considered, and thus
we analyzed the data access patterns of the (pack and unpack) loops that are used to
(de-)serialize data for sending and receiving. Interestingly, the data access patterns of
all those applications can be categorized into three classes: Cartesian Face Exchange,
Unstructured Access and Interleaved Data.
In the following we will describe each of the three classes in detail and give specific
examples of codes that fit each category.
2.1 Face exchange for n-dimensional Cartesian grids
Many applications store their working set in n-dimensional arrays that are distributed
across one or more dimensions. In a communication phase, neighboring processes then
exchange the "sides" or "faces" of their part of the working set. Such access patterns
can be observed in many of the NAS codes, such as LU and MG, as well as in WRF and
T. Schneider et al.
Fig. 3 Data layout of the NAS LU and MG benchmark. a NAS MG, b NAS LU
MILC. For this class of codes, it is possible to construct matching MPI DDTs using the
subarray datatype or nested vectors. Some codes in this class, such as WRF, exchange
faces of more than one array in each communication step. This can be done with MPI
DDTs using a struct datatype to combine the sub-datatypes, each of which represents a
single array.
The Weather Research and Forecasting (WRF) application uses a regular three-
dimensional Cartesian grid to represent the atmosphere. Topographical land informa-
tion and observational data are used to define initial conditions of forecasting simu-
lations. The model solution is computed using a Runge–Kutta time-split integration
scheme in the two horizontal dimensions with an implicit solver in the vertical dimen-
sion. WRF employs data decompositions in the two horizontal dimensions only. WRF
does not store all information in a single data structure, therefore the halo exchange
is performed for a number of similar arrays. The slices of these arrays that have to
be communicated are packed into a single buffer. We create a struct of hvectors of
vector datatypes or a struct of subarray datatypes for the WRF tests, which are named
WRF_{x,y}_{vec,sa}, one test for each direction, and each datatype choice (nested
vectors or subarrays, respectively). WRF contains 150 different static datatypes which
can be reused during the application run.
NAS MG communicates the faces of a 3d array in a 3d stencil where each process
has six neighbors. The data access pattern for one direction is visualized in Fig. 3a.
The data-access pattern in MG can be expressed by an MPI subarray datatype or using
nested vectors. Our NAS_MG micro-app has one test for the exchange in each of the
three directions NAS_MG_{x,y,z} using nested vector datatypes. NAS MG uses a few
different but static access patterns and therefore all datatypes can be reused.
The NAS LU application benchmark solves a three-dimensional system of equa-
tions resulting from an unfactored implicit finite-difference discretization of the
Navier–Stokes equations. In the dominant communication function, LU exchanges
faces of a four-dimensional array. The first dimension of this array is of fixed size
(5). The second (nx) and third (ny) dimension depend on the problem size and are
distributed among a quadratic processor grid. The fourth (nz) dimension is equal
to the third dimension of the problem size. Figure 3b visualizes the data layout.
Our NAS_LU micro-app represents the communication in each of the two directions
NAS_LU_{x,y}. NAS LU uses a few different but static access patterns and therefore
all datatypes can be reused.
The MIMD Lattice Computation (MILC) Collaboration studies Quantum Chro-
modynamics (QCD), the theory of strong interaction, a fundamental force describing
the interactions of quarks and gluons. The MILC code is publicly available for the study
of lattice QCD. The su3_rmd application from that code suite is part of SPEC CPU2006
and SPEC MPI. Here we focus on the CG solver in su3_rmd. Lattice QCD represents
space-time as a four-dimensional regular grid of points. The code is parallelized using
domain decomposition and communicates with neighboring processes that contain
off-node neighbors of the points in its local domain. MILC uses 96 different MPI
DDTs to accomplish its halo exchange along the four dimensions (±x, ±y, ±z, ±t).
The datatypes stay the same over the course of the application run. The MILC_su3_zd
micro-app performs the communication done for the −z direction.
An important observation we made from constructing datatypes for the applications
in the face exchange class is that the performance of the resulting datatype heavily
depends on the data layout of the underlying array. For example, if the exchanged face
is contiguous in memory (e.g., for some directions in WRF and MG), using datatypes
can essentially eliminate the packing overhead completely. That is the reason we
included tests for all different directions of each application.
2.2 Exchange of unstructured elements
The codes in this class maintain scatter–gather lists which hold the indices of elements
to be communicated. Molecular Dynamics applications (e.g., LAMMPS) simulate the
interaction of particles. Particles are often distributed based on their spatial location
and particles close to boundaries need to be communicated to neighboring processes.
Since particles move over the course of the simulation each process keeps a vector of
indices of local particles that need to be communicated in the next communication step.
This access pattern can be captured by an indexed datatype. A similar access pattern
occurs in Finite Element Method (FEM) codes (e.g., Mantevo MiniFE/HPCCG) and
the Seismic Element Method (SEM) codes such as SPECFEM3D_GLOBE. Here
each process keeps a mapping of mesh points in the local mesh defining an element
and the global mesh. Before the simulation can advance in time the contributions
from all elements which share a common global grid point need to be taken into
account.
LAMMPS is a molecular dynamics simulation framework which is capable of
simulating many different kinds of particles (i.e., atoms, molecules, polymers, etc.)
and the forces between them. Similar to other molecular dynamics codes it uses a
spatial decomposition approach for parallelization. Particles are moving during the
simulation and may have to be communicated if they cross a process boundary. The
properties of local particles are stored in vectors and the indices of the particles that
have to be exchanged are not known a priori. Thus, we use an indexed datatype to
represent this access. We created two tests, LAMMPS_{full,atomic}, that differ in the
number of properties associated with each particle. The LAMMPS code in its current
form does not lend itself to datatype reuse.
SPECFEM3D_GLOBE is a spectral-element application that allows the simula-
tion of global seismic wave propagation through high resolution earth models. It is
used on some of the biggest HPC systems available [9]. The earth is described by a
mesh of hexahedral volume elements. Grid points that lie on the sides, edges or cor-
ners of an element are shared between neighboring elements. SPECFEM3D_GLOBE
maintains a mapping between grid points in the local mesh to grid points in the global
mesh. Before the system can be marched forward in time, the contributions from
all grid points that share a common global grid point need to be considered. The
contribution for each global grid point needs to be collected, potentially from neigh-
boring processes. Our micro-app representing SPECFEM3D_GLOBE has two tests,
SPECFEM3D_{oc,cm}, which differ in the amount of data communicated per index.
The nine different datatypes needed by this code can be reused, since their usage only
depends on the used mesh, which does not change during runtime.
The Mantevo mini-app MiniFE has a data access pattern very similar to
SPECFEM3D_GLOBE, which is not surprising, since MiniFE models a finite ele-
ment code and the seismic element method in SPECFEM3D_GLOBE is a variant
of the FEM method. The Mantevo mini-app MiniMD is a miniature version of the
LAMMPS code described above.
Our results show that current MPI DDT implementations are often unable to
improve such unstructured access over packing loops. Furthermore, the overhead of
creating datatypes for this kind of access (indexed datatypes) is high.
2.3 Interleaved data or transpose
Fast Fourier Transforms (FFTs) are used in many scientific applications and are
among the most important algorithms in use today. FFTs can be multi-dimensional:
As the one-dimensional Fourier Transform expresses the input as a superposition of
sinusoids, the multi-dimensional variant expresses the input as a superposition of plane
waves, or multi-dimensional sinusoids. For example, a two-dimensional FFT can be
computed by performing 1d-FFTs along both dimensions. If the input matrix is dis-
tributed among MPI processes along the first dimension, each process can compute
the first 1d-FFT without communication. After this step the matrix has to be redis-
tributed, such that each process now holds complete vectors of the other dimension,
which effectively transposes the distributed matrix. After the second 1d-FFT has been
computed locally the matrix is transposed again to regain the original data layout. In
MPI the matrix transpose is naturally done with an MPI_Alltoall operation.
Hoefler and Gottlieb presented a zero-copy implementation of a 2d-FFT using MPI
DDTs to eliminate the pack and unpack loops in [13] and demonstrated performance
improvements up to a factor of 1.5 over manual packing. The FFT micro-app captures
the communication behavior of a two-dimensional FFT.
SPECFEM3D_GLOBE exhibits a similar pattern, which is used to transpose a
distributed 3D array. We used Fortran’s COMPLEX datatype as the base datatype for
the FFT case in our benchmark (in C two DOUBLEs) and a single precision floating
point value for the SPECFEM3D_mt case. The MPI DDTs used in those cases are
vectors of the base datatypes where the stride is the matrix size in one dimension. To
interleave the data this type is resized to the size of one base datatype. An example for
this technique is given in Fig. 4.
Fig. 4 Datatype for 2d-FFT
Fig. 5 Measurement loops for the micro-applications. The time for each phase (rectangle) is measured on
process 0. a Manual Pack Loop, b Send/Recv with MPI DDTs, c MPI_Pack
3 Micro-applications for benchmarking MPI datatypes
We implemented all data access schemes that we discussed above as micro-
applications with various data sizes. For this, we used the original data layout and
pack loops whenever possible to retain the access pattern of the applications. We also
chose array sizes that represent realistic input cases. The micro-applications are
implemented in Fortran (the language of most presented applications) as well as C to
enable a comparison between compilers. We compiled all benchmarks with highest
optimization.
All benchmark results shown in this paper have been obtained on either the Odin
cluster at IU Bloomington or on JYC, the Blue Waters test system at the National Cen-
ter for Supercomputing Applications. Odin consists of 128 nodes with AMD Opteron
270 HE dual core CPUs and an SDR Infiniband interconnect. JYC consists of a sin-
gle cabinet Cray XE6 (approx. 50 nodes with 1,600 Interlagos 2.3–2.6 GHz cores).
We used the GNU compiler version 4.6.2 and compiled all benchmarks with −O3
optimization.
We performed a ping-pong benchmark between two hosts using MPI_Send() and
MPI_Recv() utilizing the original pack loop and our datatype as shown in Fig. 5. Our
benchmark also performs packing with MPI using MPI_Pack() and MPI_Unpack()
(cf. Fig. 5c); however, results for explicit packing with MPI have been omitted
due to lack of space and the small practical relevance of those functions. For
comparison we also performed a traditional ping-pong of the same data size as the
MPI DDT's type size.
The procedure runs two nested loops: the outer loop creates a new datatype in each
iteration and measures the overhead incurred by type creation and commit; the inner
loop uses the committed datatype a configurable number of times. In all experiments
Fig. 6 Median duration of the different benchmark phases for the WRF_x_vec test, using Open MPI 1.6
on Odin
Fig. 7 Median duration of the different benchmark phases for the SPECFEM3D_cm test, using Open MPI
1.6 on Odin
we used 10 outer loop iterations and 20 iterations of the inner loop. Time for each phase
(rectangles in Fig. 5) is recorded in a result file. We provide an example script for GNU
R to perform the packing overhead analysis as shown in this paper. Measurements are
done only on the client side, so the benchmark does not depend on synchronized clocks.
If we measure the time for each phase multiple times (in the two loops described
above) and plot the median value for each phase, we get a result as shown in Fig. 6,
where we plot the times for three different sizes of the WRF_x_vec test, using Open
MPI 1.6. Note that the time for packing has been added to the communication round
trip time (RTT) of manual packing to enable direct comparison with the MPI DDT case
where packing happens implicitly and is thus also included in the communication time.
It can be seen that using MPI datatypes is beneficial in this case. The WRF applica-
tion exhibits a face exchange access pattern, as explained before. The datatypes needed
for this pattern are simple to create, therefore also the datatype creation overhead is
low. For the SPECFEM3D_cm test (Fig. 7) the situation is different: datatypes for
unstructured exchanges are very costly to construct. Also none of the MPI implemen-
tations was able to outperform manual packing for this test. Unless otherwise noted
we assume datatype reuse in all benchmarks. That means the costs for creating and
destroying datatypes (or allocating/freeing buffers) are not included in the communi-
cation costs. This is reasonable because most applications create their datatypes only
once during their entire run and amortize these costs over many communication steps.
For two-dimensional FFTs (Fig. 8), the usage of derived datatypes also improves
performance, compared with manual packing. Note the large difference in the times
required for manual packing compared to manual unpacking—this is caused by the
fact that during packing large blocks can be copied, while during unpack each element
has to be handled individually.
Fig. 8 Median duration of the different benchmark phases for the FFT test, using Cray MPI on JYC (Blue
Waters test system)
Fig. 9 Packing overheads (relative to communication time) for different micro-apps and datasizes and MPI
implementations on Odin
It is interesting to know the fraction of time that data packing needs, compared with
the rest of the round-trip. Since we can not measure the time for data-packing directly
in the MPI DDT benchmark we use the following method: Let tpp be the time for a
round-trip including all packing operations (implicit or explicit) and tnet the time to
perform a ping-pong of the same size without packing.
The overhead for packing relative to the communication time can be expressed as
ovh = (tpp − tnet) / tpp.
The serial communication time tnet was practically identical for the tested MPI
implementations (<5 % variation). This enables us to plot the relative overheads for
different libraries into a single diagram for a direct comparison. Figure 9 shows those
relative pack overheads for some representative micro-application tests performed
with Open MPI 1.6 as well as MVAPICH 1.8 on the Odin cluster; we always ran one
process per node to isolate the off-node communication.
Fig. 10 Packing overheads (relative to communication time) for different micro-apps and datasizes and
compilers on JYC
Note that the overhead for the creation of the datatype was not included in the
calculations of the packing overheads in Fig. 9, because most applications are able to
cache and reuse datatypes. From this figure we can make some interesting observations:
In the NAS_LU_x test case both MPI implementations outperform manual packing by
far, the packing overhead with MPI DDTs is almost zero. In this case the data is already
contiguous in memory, and therefore does not need to be copied in the first place—the
manual packing is done anyway in the NAS benchmark to simplify the code. Both MPI
implementations seem able to detect that the extra copy is unnecessary. We observe that
the datatype engine of Open MPI performs better than MVAPICH’s implementation.
The SPECFEM3D tests show that unordered accesses with indexed datatypes are not
implemented efficiently by either Open MPI or MVAPICH. This benchmark shows
the importance of optimizing communication memory accesses: up to 81 % of the
communication time of the WRF_x_vec test case is spent packing/unpacking
data, which can be reduced to 73 % with MPI DDTs. In the NAS_LU_x case, which
sends a contiguous buffer, using MPI DDTs reduces the packing overhead from 30 %
to 7 % without increasing the code complexity.
In Fig. 10 we compare the packing overhead of several micro-applications when
different compilers are used. We implemented each test in C as well as in Fortran (while
most of the original code was written in Fortran). For most tests there is no significant
difference. For WRF the packing loop expressed in Fortran is slightly faster. For MILC
the packing loop written in C is much faster on JYC. Cray’s MPI implementation is
outperformed by manual packing in all of our tests. This indicates some optimization
potential in the datatype implementation of Cray MPI.
4 Conclusions
We analyzed a set of scientific applications for their communication buffer access
patterns and isolated those patterns in micro-applications to experiment with MPI
datatypes. In this study, we found three major classes of data access patterns: face
exchanges in n-dimensional Cartesian grids; irregular access of data structures of
varying complexity, based on neighbor lists, in FEM, SEM and molecular dynamics codes;
and access of interleaved data to redistribute data elements in matrix transpositions.
In some cases (such as WRF) several similar accesses to data structures
can be fused into a single communication operation through the use
of a struct datatype. We provide the micro-applications to guide MPI implementers
in optimizing datatype implementations and to aid hardware-software co-design deci-
sions for future interconnection networks.
We demonstrated that the optimization of data packing (implicit or explicit) is
crucial, as packing can account for up to 80 % of the communication time with the data
access patterns of real-world applications. We showed that in some cases zero-copy
formulations can help to mitigate this problem. These findings make clear that system
designers should not rely solely on ping-pong benchmarks with contiguous buffers;
they should take the communication buffer access patterns of real applications into
account.
While we present a large set of results for relevant systems, it is necessary to
repeat the experiments in different environments. Thus, we provide the full benchmark
source-code and data analysis tools at http://unixer.de/research/datatypes/ddtbench/.
Acknowledgments This work was supported by the DOE Office of Science, Advanced Scientific Com-
puting Research, under award number DE-FC02-10ER26011, program manager Sonia Sachs.
References
1. Aiken A, Nicolau A (1988) Optimal loop parallelization. In: Proceedings of the ACM SIGPLAN
conference on programming language design and implementation (PLDI’88), vol 23. ACM, pp 308–
317
2. Alverson R, Roweth D, Kaplan L (2010) The Gemini System interconnect. In: Proceedings of the
IEEE symposium on high performance interconnects (HOTI’10), IEEE Computer Society, pp 83–87
3. Armstrong B, Bae H, Eigenmann R, Saied F, Sayeed M, Zheng Y (2006) HPC benchmarking and
performance evaluation with realistic applications. In: SPEC benchmarking workshop
2. Bajrović E, Träff JL (2011) Using MPI derived datatypes in numerical libraries. In: Recent advances
in the message passing interface (EuroMPI’11). Springer, Berlin, pp 29–38
5. Barrett RF, Heroux MA, Lin PT, Vaughan CT, Williams AB (2011) Poster: mini-applications: Vehicles
for co-design. In: Proceedings of the companion on high performance computing, networking, storage
and analysis (SC’11 companion), ACM, pp 1–2
6. Bernard C, Ogilvie MC, DeGrand TA, Detar CE, Gottlieb SA, Krasnitz A, Sugar RL, Toussaint D
(1991) Studying quarks and gluons on MIMD parallel computers. Int J Supercomput Appl SAGE
5:61–70
7. Brunner TA (2012) Mulard: a multigroup thermal radiation diffusion mini-application. Technical
report, DOE exascale research conference
8. Byna S, Gropp W, Sun XH, Thakur R (2003) Improving the performance of MPI derived datatypes
by optimizing memory-access cost. In: Proceedings of the IEEE international conference on cluster
computing (CLUSTER’03). IEEE Computer Society, pp 412–419
9. Carrington L, Komatitsch D, Laurenzano M, Tikir M, Michéa D, Le Goff N, Snavely A, Tromp J
(2008) High-frequency simulations of global seismic wave propagation using SPECFEM3D_GLOBE
on 62k processors. In: Proceedings of the ACM/IEEE conference on supercomputing (SC’08), IEEE
Computer Society, pp 60:1–60:11
10. Dixit KM (1991) The SPEC benchmarks. In: Parallel computing, vol 17. Elsevier Science Publishers
B.V., Amsterdam, pp 1195–1209
11. Gropp W, Hoefler T, Thakur R, Träff JL (2011) Performance expectations and guidelines for MPI
derived datatypes. In: Recent advances in the message passing interface (EuroMPI’11), LNCS, vol
6960. Springer, New York, pp 150–159
12. Heroux MA, Doerfler DW, Crozier PS, Willenbring JM, Edwards HC, Williams A, Rajan M, Keiter
ER, Thornquist HK, Numrich RW (2009) Improving performance via mini-applications. Technical
report, Sandia National Laboratories, SAND2009-5574
13. Hoefler T, Gottlieb S (2010) Parallel zero-copy algorithms for fast Fourier transform and conjugate
gradient using MPI datatypes. In: Recent advances in the message passing interface (EuroMPI’10),
LNCS, vol 6305. Springer, New York, pp 132–141
14. McMahon FH (1986) The Livermore Fortran kernels: a computer test of the numerical performance
range. Technical report, Lawrence Livermore National Laboratory, UCRL-53745
15. MPI Forum (2009) MPI: a message-passing interface standard. Version 2.2
16. Plimpton S (1995) Fast parallel algorithms for short-range molecular dynamics. J Comput Phys
117:1–19
17. Reussner R, Träff J, Hunzelmann G (2000) A benchmark for MPI derived datatypes. In: Recent
advances in parallel virtual machine and message passing interface (EuroPVM/MPI’00), LNCS, vol
1908. Springer, New York, pp 10–17
18. Schneider T, Gerstenberger R, Hoefler T (2012) Micro-applications for communication data access
patterns and MPI datatypes. In: Recent advances in the message passing interface (EuroMPI’12),
LNCS, vol 7490. Springer, New York, pp 121–131
19. Skamarock WC, Klemp JB (2008) A time-split nonhydrostatic atmospheric model for weather research
and forecasting applications. J Comput Phys 227:3465–3485
20. Träff J, Hempel R, Ritzdorf H, Zimmermann F (1999) Flattening on the fly: Efficient handling of
MPI derived datatypes. In: Recent advances in parallel virtual machine and message passing interface
(EuroPVM/MPI’99), LNCS, vol 1697. Springer, New York, pp 109–116
21. van der Wijngaart RF, Wong P (2002) NAS parallel benchmarks version 2.4. NAS Technical Report
NAS-02-007
22. Wu J, Wyckoff P, Panda D (2004) High performance implementation of MPI derived datatype com-
munication over InfiniBand. In: Proceedings of the international parallel and distributed processing
symposium (IPDPS’04). IEEE Computer Society