subscribe to arXiv mailings

Extraction of Atypical Aspects from Customer Reviews: Datasets and Experiments with Language Models

Authors: Smita Nannaware, Erfan Al-Hossami, Razvan Bunescu

Abstract: A restaurant dinner may become a memorable experience due to an unexpected aspect enjoyed by the customer, such as an origami-making station in the waiting area. If aspects that are atypical for a restaurant experience were known in advance, they could be leveraged to make recommendations that have the potential to engender serendipitous experiences, further increasing user satisfaction. Although… ▽ More A restaurant dinner may become a memorable experience due to an unexpected aspect enjoyed by the customer, such as an origami-making station in the waiting area. If aspects that are atypical for a restaurant experience were known in advance, they could be leveraged to make recommendations that have the potential to engender serendipitous experiences, further increasing user satisfaction. Although relatively rare, whenever encountered, atypical aspects often end up being mentioned in reviews due to their memorable quality. Correspondingly, in this paper we introduce the task of detecting atypical aspects in customer reviews. To facilitate the development of extraction models, we manually annotate benchmark datasets of reviews in three domains - restaurants, hotels, and hair salons, which we use to evaluate a number of language models, ranging from fine-tuning the instruction-based text-to-text transformer Flan-T5 to zero-shot and few-shot prompting of GPT-3.5. △ Less

Submitted 5 November, 2023; originally announced November 2023.

Comments: Proceedings of the Knowledge-aware and Conversational Recommender Systems Workshop (KaRS) @ RecSys, September 19, 2023

arXiv:2310.03210 [pdf, ps, other]

Can Language Models Employ the Socratic Method? Experiments with Code Debugging

Authors: Erfan Al-Hossami, Razvan Bunescu, Justin Smith, Ryan Teehan

Abstract: When employing the Socratic method of teaching, instructors guide students toward solving a problem on their own rather than providing the solution directly. While this strategy can substantially improve learning outcomes, it is usually time-consuming and cognitively demanding. Automated Socratic conversational agents can augment human instruction and provide the necessary scale, however their dev… ▽ More When employing the Socratic method of teaching, instructors guide students toward solving a problem on their own rather than providing the solution directly. While this strategy can substantially improve learning outcomes, it is usually time-consuming and cognitively demanding. Automated Socratic conversational agents can augment human instruction and provide the necessary scale, however their development is hampered by the lack of suitable data for training and evaluation. In this paper, we introduce a manually created dataset of multi-turn Socratic advice that is aimed at helping a novice programmer fix buggy solutions to simple computational problems. The dataset is then used for benchmarking the Socratic debugging abilities of a number of language models, ranging from fine-tuning the instruction-based text-to-text transformer Flan-T5 to zero-shot and chain of thought prompting of the much larger GPT-4. The code and datasets are made freely available for research at the link below. https://github.com/taisazero/socratic-debugging-benchmark △ Less

Submitted 4 October, 2023; originally announced October 2023.

Comments: 8 pages, 2 tables. To be published in Proceedings of the 2024 Technical Symposium on Computer Science Education (SIGCSE'24)

arXiv:2202.04847 [pdf, other]

A Survey on Artificial Intelligence for Source Code: A Dialogue Systems Perspective

Authors: Erfan Al-Hossami, Samira Shaikh

Abstract: In this survey paper, we overview major deep learning methods used in Natural Language Processing (NLP) and source code over the last 35 years. Next, we present a survey of the applications of Artificial Intelligence (AI) for source code, also known as Code Intelligence (CI) and Programming Language Processing (PLP). We survey over 287 publications and present a software-engineering centered taxon… ▽ More In this survey paper, we overview major deep learning methods used in Natural Language Processing (NLP) and source code over the last 35 years. Next, we present a survey of the applications of Artificial Intelligence (AI) for source code, also known as Code Intelligence (CI) and Programming Language Processing (PLP). We survey over 287 publications and present a software-engineering centered taxonomy for CI placing each of the works into one category describing how it best assists the software development cycle. Then, we overview the field of conversational assistants and their applications in software engineering and education. Lastly, we highlight research opportunities at the intersection of AI for code and conversational assistants and provide future directions for researching conversational assistants with CI capabilities. △ Less

Submitted 10 February, 2022; originally announced February 2022.

Comments: 55 pages, 16 Figures, 4 Tables

ACM Class: I.2.2; I.2.7; K.3.1

arXiv:2202.03755 [pdf, other]

doi 10.1007/s10515-022-00331-3

Can We Generate Shellcodes via Natural Language? An Empirical Study

Authors: Pietro Liguori, Erfan Al-Hossami, Domenico Cotroneo, Roberto Natella, Bojan Cukic, Samira Shaikh

Abstract: Writing software exploits is an important practice for offensive security analysts to investigate and prevent attacks. In particular, shellcodes are especially time-consuming and a technical challenge, as they are written in assembly language. In this work, we address the task of automatically generating shellcodes, starting purely from descriptions in natural language, by proposing an approach ba… ▽ More Writing software exploits is an important practice for offensive security analysts to investigate and prevent attacks. In particular, shellcodes are especially time-consuming and a technical challenge, as they are written in assembly language. In this work, we address the task of automatically generating shellcodes, starting purely from descriptions in natural language, by proposing an approach based on Neural Machine Translation (NMT). We then present an empirical study using a novel dataset (Shellcode_IA32), which consists of 3,200 assembly code snippets of real Linux/x86 shellcodes from public databases, annotated using natural language. Moreover, we propose novel metrics to evaluate the accuracy of NMT at generating shellcodes. The empirical analysis shows that NMT can generate assembly code snippets from the natural language with high accuracy and that in many cases can generate entire shellcodes with no errors. △ Less

Submitted 8 February, 2022; originally announced February 2022.

Comments: 33 pages, 5 figures, 9 tables. To be published in Automated Software Engineering journal

arXiv:2109.00279 [pdf, other]

doi 10.1109/ISSRE52982.2021.00042

EVIL: Exploiting Software via Natural Language

Authors: Pietro Liguori, Erfan Al-Hossami, Vittorio Orbinato, Roberto Natella, Samira Shaikh, Domenico Cotroneo, Bojan Cukic

Abstract: Writing exploits for security assessment is a challenging task. The writer needs to master programming and obfuscation techniques to develop a successful exploit. To make the task easier, we propose an approach (EVIL) to automatically generate exploits in assembly/Python language from descriptions in natural language. The approach leverages Neural Machine Translation (NMT) techniques and a dataset… ▽ More Writing exploits for security assessment is a challenging task. The writer needs to master programming and obfuscation techniques to develop a successful exploit. To make the task easier, we propose an approach (EVIL) to automatically generate exploits in assembly/Python language from descriptions in natural language. The approach leverages Neural Machine Translation (NMT) techniques and a dataset that we developed for this work. We present an extensive experimental study to evaluate the feasibility of EVIL, using both automatic and manual analysis, and both at generating individual statements and entire exploits. The generated code achieved high accuracy in terms of syntactic and semantic correctness. △ Less

Submitted 1 September, 2021; originally announced September 2021.

Comments: Paper accepted at the 32nd International Symposium on Software Reliability Engineering (ISSRE 2021)

arXiv:2104.13100 [pdf, other]

doi 10.18653/v1/2021.nlp4prog-1.7

Shellcode_IA32: A Dataset for Automatic Shellcode Generation

Authors: Pietro Liguori, Erfan Al-Hossami, Domenico Cotroneo, Roberto Natella, Bojan Cukic, Samira Shaikh

Abstract: We take the first step to address the task of automatically generating shellcodes, i.e., small pieces of code used as a payload in the exploitation of a software vulnerability, starting from natural language comments. We assemble and release a novel dataset (Shellcode_IA32), consisting of challenging but common assembly instructions with their natural language descriptions. We experiment with stan… ▽ More We take the first step to address the task of automatically generating shellcodes, i.e., small pieces of code used as a payload in the exploitation of a software vulnerability, starting from natural language comments. We assemble and release a novel dataset (Shellcode_IA32), consisting of challenging but common assembly instructions with their natural language descriptions. We experiment with standard methods in neural machine translation (NMT) to establish baseline performance levels on this task. △ Less

Submitted 18 March, 2022; v1 submitted 27 April, 2021; originally announced April 2021.

Comments: Paper accepted to NLP4Prog Workshop 2021 co-located with ACL-IJCNLP 2021. Extended journal version of this work has been published in the Automated Software Engineering journal, Volume 29, Article no. 30, March 2022, DOI: 10.1007/s10515-022-00331-3

Showing 1–6 of 6 results for author: Al-Hossami, E