Databases
- [1] arXiv:2407.17657 [pdf, other]
Title: My Ontologist: Evaluating BFO-Based AI for Definition Support
Subjects: Databases (cs.DB)
Generative artificial intelligence (AI), exemplified by the release of GPT-3.5 in 2022, has significantly advanced the potential applications of large language models (LLMs), including in the realms of ontology development and knowledge graph creation. Ontologies, which are structured frameworks for organizing information, and knowledge graphs, which combine ontologies with actual data, are essential for enabling interoperability and automated reasoning. However, current research has largely overlooked the generation of ontologies extending from established upper-level frameworks like the Basic Formal Ontology (BFO), risking the creation of non-integrable ontology silos. This study explores the extent to which LLMs, particularly GPT-4, can support ontologists trained in BFO. Through iterative development of a specialized GPT model named "My Ontologist," we aimed to generate BFO-conformant ontologies. Initial versions faced challenges in maintaining definition conventions and leveraging foundational texts effectively. My Ontologist 3.0 showed promise by adhering to structured rules and modular ontology suites, yet the release of GPT-4o disrupted this progress by altering the model's behavior. Our findings underscore the importance of aligning LLM-generated ontologies with top-level standards and highlight the complexities of integrating evolving AI capabilities in ontology engineering.
- [2] arXiv:2407.17881 [pdf, other]
Title: Unraveling the Never-Ending Story of Lifecycles and Vitalizing Processes
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Business process management (BPM) has been widely used to discover, model, analyze, and optimize organizational processes. BPM looks at these processes with analysis techniques that assume a clearly defined start and end. However, not all processes adhere to this logic, with the consequence that their behavior cannot be appropriately captured by BPM analysis techniques. This paper addresses this research problem at a conceptual level. More specifically, we introduce the notion of vitalizing business processes that target the lifecycle process of one or more entities. We show the existence of lifecycle processes in many industries and that their appropriate conceptualizations pave the way for the definition of suitable modeling and analysis techniques. This paper provides a set of requirements for their analysis, and a conceptualization of lifecycle and vitalizing processes.
New submissions for Friday, 26 July 2024 (showing 2 of 2 entries)
- [3] arXiv:2407.17941 (cross-list from cs.SE) [pdf, other]
Title: RDFGraphGen: A Synthetic RDF Graph Generator based on SHACL Constraints
Comments: 19 pages
Subjects: Software Engineering (cs.SE); Databases (cs.DB)
This paper introduces RDFGraphGen, a general-purpose, domain-independent generator of synthetic RDF graphs based on SHACL constraints. The Shapes Constraint Language (SHACL) is a W3C standard that specifies ways to validate data in RDF graphs by defining constraining shapes. Although the main purpose of SHACL is the validation of existing RDF data, we envisioned and implemented a reverse role for it to address the lack of available RDF datasets in many RDF-based application development processes: we use SHACL shape definitions as a starting point to generate synthetic data for an RDF graph. The generation process involves extracting the constraints from the SHACL shapes, converting the specified constraints into rules, and then generating artificial data for a predefined number of RDF entities based on these rules. RDFGraphGen is intended to generate small, medium, or large RDF knowledge graphs for benchmarking, testing, quality control, training, and similar purposes in applications from the RDF, Linked Data, and Semantic Web domains. RDFGraphGen is open-source and is available as a ready-to-use Python package.
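The extract-constraints, derive-rules, generate-entities pipeline described in this abstract can be illustrated with a minimal sketch. This is not the RDFGraphGen API. The `PERSON_SHAPE` dictionary, its key names, the `example.org` IRIs, and the `generate_entities` function are all hypothetical, and the constraints are assumed to have already been extracted from a SHACL node shape (they mirror `sh:datatype`, `sh:minCount`, `sh:maxCount`, `sh:minInclusive`, and `sh:maxInclusive`).

```python
import random

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical constraints, as if already extracted from a SHACL node shape.
PERSON_SHAPE = {
    "target_class": "http://example.org/Person",
    "properties": [
        {"path": "http://example.org/age", "datatype": "integer",
         "min_inclusive": 0, "max_inclusive": 120,
         "min_count": 1, "max_count": 1},
        {"path": "http://example.org/nickname", "datatype": "string",
         "min_count": 0, "max_count": 2},
    ],
}


def generate_entities(shape, n, seed=0):
    """Generate (subject, predicate, object) triples for n synthetic entities."""
    rng = random.Random(seed)
    triples = []
    for i in range(n):
        subject = f"http://example.org/entity/{i}"
        triples.append((subject, RDF_TYPE, shape["target_class"]))
        for prop in shape["properties"]:
            # Respect the cardinality constraints of the property.
            count = rng.randint(prop["min_count"], prop["max_count"])
            for _ in range(count):
                if prop["datatype"] == "integer":
                    # Respect the inclusive value range.
                    value = rng.randint(prop["min_inclusive"],
                                        prop["max_inclusive"])
                else:
                    value = "value-" + str(rng.randint(0, 9999))
                triples.append((subject, prop["path"], value))
    return triples


if __name__ == "__main__":
    for triple in generate_entities(PERSON_SHAPE, 3):
        print(triple)
```

Every generated entity would validate against the constraints it was generated from, which is the reversal of SHACL's usual role that the abstract describes.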
- [4] arXiv:2407.18157 (cross-list from cs.CR) [pdf, other]
Title: Enhanced Privacy Bound for Shuffle Model with Personalized Privacy
Subjects: Cryptography and Security (cs.CR); Databases (cs.DB)
The shuffle model of Differential Privacy (DP) is an enhanced privacy protocol which introduces an intermediate trusted server between local users and a central data curator. It significantly amplifies the central DP guarantee by anonymizing and shuffling the local randomized data. Yet, deriving a tight privacy bound is challenging due to its complicated randomization protocol. While most existing work focuses on unified local privacy settings, this work focuses on deriving the central privacy bound for a more practical setting where personalized local privacy is required by each user. To bound the privacy after shuffling, we first need to capture the probability of each user generating clones of the neighboring data points. Second, we need to quantify the indistinguishability between two distributions of the number of clones on neighboring datasets. Existing works either inaccurately capture the probability or underestimate the indistinguishability between neighboring datasets. Motivated by this, we develop a more precise analysis, which yields a general and tighter bound for arbitrary DP mechanisms. First, we derive the clone-generating probability by hypothesis testing from a randomizer-specific perspective, which leads to a more accurate characterization of the probability. Second, we analyze the indistinguishability in the context of $f$-DP, where the convexity of the distributions is leveraged to achieve a tighter privacy bound. Theoretical and numerical results demonstrate that our bound remarkably outperforms the existing results in the literature.
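To make the protocol in this abstract concrete, here is a minimal sketch of a shuffle pipeline with personalized local privacy. It only illustrates the data flow (local randomization, then shuffling) and does not compute the paper's privacy bound. Binary randomized response is chosen as the local randomizer purely for illustration, and the function names and example values are assumptions.

```python
import math
import random


def randomized_response(bit, epsilon, rng):
    """Binary randomized response: keep the true bit with probability
    e^eps / (e^eps + 1), which satisfies eps-local differential privacy."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if rng.random() < p_truth else 1 - bit


def shuffle_reports(bits, epsilons, seed=0):
    """Each user randomizes locally with a personal epsilon; the shuffler
    then permutes the reports so the curator cannot link them to users."""
    rng = random.Random(seed)
    reports = [randomized_response(b, e, rng) for b, e in zip(bits, epsilons)]
    rng.shuffle(reports)  # uniformly random permutation, done in place
    return reports


if __name__ == "__main__":
    true_bits = [0, 1, 1, 0, 1]
    personal_eps = [0.5, 1.0, 2.0, 0.5, 1.0]  # one privacy level per user
    print(shuffle_reports(true_bits, personal_eps))
```

The curator sees only the shuffled multiset of reports. The amplified central guarantee that the paper bounds comes from this anonymization step.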
- [5] arXiv:2407.18241 (cross-list from cs.LG) [pdf, other]
Title: Numerical Literals in Link Prediction: A Critical Examination of Models and Datasets
Subjects: Machine Learning (cs.LG); Databases (cs.DB)
Link Prediction (LP) is an essential task over Knowledge Graphs (KGs), traditionally focussed on using and predicting the relations between entities. Textual entity descriptions have already been shown to be valuable, but models that incorporate numerical literals have shown only minor improvements on existing benchmark datasets. It is unclear whether a model is actually better at using numerical literals or simply better at exploiting the graph structure. This raises doubts about the effectiveness of these methods and about the suitability of the existing benchmark datasets.
We propose a methodology to evaluate LP models that incorporate numerical literals. We propose i) a new synthetic dataset to better understand how well these models use numerical literals and ii) dataset ablation strategies to investigate potential difficulties with the existing datasets. We identify a prevalent trend: many models underutilize literal information and potentially rely on additional parameters for performance gains. Our investigation highlights the need for more extensive evaluations when releasing new models and datasets.
Cross submissions for Friday, 26 July 2024 (showing 3 of 3 entries)
- [6] arXiv:2401.00659 (replaced) [pdf, other]
Title: Cost-effective Datasets Discovery: When Distinctiveness Matters
Subjects: Databases (cs.DB)
In this paper, we aim to find a set of datasets that can enrich a base dataset by introducing the maximum number of distinct tuples (i.e., maximizing distinctiveness), driven by a user's query set and a budget limit. We prove this problem to be NP-hard and, subsequently, we develop a greedy algorithm that attains an approximation ratio of (1-1/e)/2. However, this algorithm lacks efficiency and scalability due to its frequent computation of the exact distinctiveness marginal gain of any candidate dataset for selection, which requires scanning through every tuple in candidate datasets and thus is unaffordable in practice. To overcome this limitation, we propose an efficient machine learning (ML)-based method for estimating the distinctiveness marginal gain of any candidate dataset that effectively eliminates the need to test each tuple individually. Estimating the distinctiveness marginal gain of a dataset involves estimating the number of distinct tuples in the tuple sets returned by each query in a query set across multiple datasets. This can be viewed as the cardinality estimation for a query set on a set of datasets, and the proposed method is the first to tackle this cardinality estimation problem. This is a significant advancement over prior methods limited to single-query cardinality estimation on a single dataset that fall short in identifying overlaps among tuple sets returned by each query in a query set across multiple datasets. Extensive experiments using five real-world data pools demonstrate that our algorithm utilizing ML-based distinctiveness estimation outperforms all relevant baselines in terms of both effectiveness and efficiency.
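The budgeted selection problem in this abstract can be illustrated with a generic cost-benefit greedy that computes exact marginal gains, which is the expensive step the paper replaces with ML-based estimation. This sketch is not the authors' algorithm and makes no claim about their approximation ratio. It assumes each candidate's tuples have already been restricted to the user's query set and that every price is positive. The function name and data are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def greedy_select(base: Set[tuple],
                  candidates: Dict[str, Tuple[Set[tuple], float]],
                  budget: float) -> List[str]:
    """Repeatedly buy the affordable dataset that adds the most new
    distinct tuples per unit price, until nothing useful is affordable."""
    covered = set(base)
    remaining = budget
    pool = dict(candidates)
    chosen: List[str] = []
    while pool:
        best_name, best_ratio = None, 0.0
        for name, (tuples, price) in pool.items():
            if price > remaining:
                continue  # cannot afford this dataset
            # Exact marginal gain: scans every tuple of the candidate.
            gain = len(tuples - covered)
            ratio = gain / price
            if ratio > best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            break  # nothing affordable adds a new tuple
        tuples, price = pool.pop(best_name)
        covered |= tuples
        remaining -= price
        chosen.append(best_name)
    return chosen


if __name__ == "__main__":
    base = {(1,), (2,)}
    candidates = {
        "A": ({(2,), (3,), (4,)}, 2.0),
        "B": ({(4,), (5,)}, 1.0),
        "C": ({(6,), (7,), (8,), (9,)}, 5.0),
    }
    print(greedy_select(base, candidates, budget=3.0))  # ['B', 'A']
```

The set difference inside the loop is what makes each iteration scan every candidate tuple, which is why an estimator for the marginal gain matters at scale.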