Your application was successfully submitted:

5c7ba9e6ef779562381cd489

29/07/2024 6:25 AM

Passport

Contact information

Welcome to this NGI SEARCH Open Call

We strongly recommend carefully reading the Guide for Applicants (GfA) and Frequently Asked Questions (FAQ) documents before starting to fill out this form.

More importantly, please follow the Signup & Application Instructions here for a seamless Open Call Management platform signup process and application form completion. They include crucial elements to take into account!

Hannah Bast

+491637683183

bast@informatik.uni-freiburg.de

Germany

Project

Established

Search and discovery features for existing digital common projects Natural language processing

Make the QLever graph database fully compliant with the SPARQL 1.1 standard by the W3C, while maintaining its high scalability and efficiency advantage over other graph databases.

QLever is a graph database with several unique features. Above all, it is extremely efficient and can handle RDF datasets of hundreds of billions of triples on a standard PC. It also shines on smaller datasets, where it can process queries much faster than other engines. QLever is already well known and used. The main road block for wider usage is that it is not yet fully compliant with the SPARQL 1.1 standard. Achieving this is harder for QLever than for other databases because every feature we implement must not only be correct, but also match our efficiency requirements. The goal of this project is to make QLever fully compliant with the SPARQL 1.1 standard, while maintaining its extremely high efficiency.

https://qlever.cs.uni-freiburg.de

More and more initiatives make their data publicly available as RDF. Prominent examples are: Wikidata (20 billion triples), PubChem (19 billion triples), OpenStreetMap (70 billion triples), and UniProt (160 billion triples). RDF is an elegant data model, standardized by the W3C since 2004, which has interoperability at its core. It allows not only concise semantic searches of the individual datasets, but also federated search over arbitrary combinations of datasets.

Many graph databases exist, but most are slow for complex queries and cannot handle large datasets (like the above) at all. Until recently, there was only a single open-source system (OpenLink Virtuoso, developed by a company) that can handle tens of billions of triples, albeit with various limitations and an old codebase written in C. QLever’s mission is to close this gap, with a fully open-source project that is written in modern C++, and that can handle even the largest datasets efficiently on a standard machine.

Minimum Quality Criteria (MQC)

NOTE: As indicated in the Guide for Applicants, applications will be rejected if your project does not meet the basic MQC requirements.

https://github.com/ad-freiburg/qlever

QLever is fully open-source and all development happens in full daylight, progressing via pull requests of a well-digestible size (not too small and not too large). All code is reviewed. The test coverage is very high and all new code should aim at a test coverage close to 100%. There is extensive continuous integration, in particular: support for a variety of compilers and platforms, automatic testing, and automatic static analysis of the code.

The code (as well as the individual pull requests and commits) is thoroughly documented, which cannot be said of many other graph databases. This is important for the onboarding of new contributors. So far, QLever has attracted 26 contributors, including several from outside the original core group.

The project has a very lively issue tracker, discussion forum, and Wiki. Besides documentation of the code, there is also documentation of the overall architecture and of the main design principles.

https://qlever.cs.uni-freiburg.de

The website https://qlever.cs.uni-freiburg.de provides SPARQL endpoints, based on QLever, for many of the most important publicly available RDF datasets, including: Wikidata, PubChem, OpenStreetMap, and UniProt. Beyond a standardized API, the website also provides a powerful user interface, which allows the incremental construction of SPARQL queries using interactive suggestions obtained after each keystroke. These suggestions are realized themselves via SPARQL queries and are unique to QLever because they only work with a sufficiently efficient query engine.
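As a minimal sketch of what the standardized API means in practice, the following Python snippet builds a query request according to the SPARQL 1.1 Protocol (query sent as the "query" parameter of an HTTP GET request). The endpoint URL, the example query, and the function name build_sparql_request are assumptions chosen for illustration only; the snippet builds the request without sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint URL, used for illustration only.
ENDPOINT = "https://example.org/sparql"

# Hypothetical example query: ten arbitrary subjects with their labels.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?label WHERE {
  ?s rdfs:label ?label .
} LIMIT 10
""".strip()


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build an HTTP GET request as described in the SPARQL 1.1 Protocol."""
    # The query is URL-encoded and passed as the "query" parameter.
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the result in the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
```

Passing the resulting object to urllib.request.urlopen would send the query; any endpoint that implements the protocol should accept a request of this form.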

The DBLP bibliography for computer science has set up its own endpoint based on QLever: https://sparql.dblp.org . The DBLP website alone gets over 10 million page impressions every month.

All endpoints, including the one for DBLP, can also be set up by users themselves, on their own machine, using a tool available under https://github.com/ad-freiburg/qlever-control .

For the impact these services have on the community, see B.2.

https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update/WDQS_backend_alternatives

There is a large community behind each of the aforementioned datasets. QLever has already begun to impact each of them, and we expect more impact in the future. We give two examples:

1. Wikidata has been struggling for years with its growing dataset, and its current graph database is already stretched beyond its limits. QLever is currently on the shortlist and the most promising candidate, see https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update/WDQS_backend_alternatives . The main obstacle is that QLever is not yet fully compliant with the SPARQL 1.1 standard.

2. OpenStreetMap so far has its own tailor-made tool (“Overpass”) for querying its huge data. However, the query language is not standardized and there is no interoperability with other datasets, like Wikidata. QLever enables that and the community has already started to adopt it: https://wiki.openstreetmap.org/wiki/QLever . In particular, our endpoint is used to find errors and inconsistencies in the data.

Excellence

Excellence description

Ambition: Applicants have to demonstrate to what extent the proposed project is beyond the State of the Art and describe the innovative approach behind it (e.g. ground-breaking objectives, novel concepts and approaches, new organisational models).

Innovation: Applicants should provide information about the level of innovation within their market and about the degree of differentiation that this project will bring.

QLever is already well known and used, see the descriptions in the fields above. The focus of our development so far has been on the core features, in particular, on the central data structures and algorithms for query planning and query processing and their careful and extensible implementation in modern C++. That is what makes QLever so fast and sets it apart from other graph databases.

The main road block for wider adoption is the lack of full compliance with the SPARQL 1.1 standard by the W3C. This has been explicitly (and independently) stated by the following three projects: Wikidata, UniProt, and OpenStreetMap. In particular, a full implementation of SPARQL 1.1 Update is needed (dynamic changes of the dataset).

It is the goal of this project to achieve this full SPARQL 1.1 compliance. This is more challenging than for other graph databases because of the strong efficiency requirements of QLever, see the explanations for the next field (“Ambition”).

We use this field to describe what sets QLever apart from other graph databases.

1. Most graph databases use standard algorithms for their core components. For QLever, the first question we ask for every core component is: What is the best algorithm for this task? Our extensive experience in algorithms research obviously helps.

2. It is not enough to choose the theoretically best algorithm. It also has to be implemented carefully and perform well in practice. This is an art known as algorithm engineering. Hannah Bast has been practicing and teaching this for over 20 years.

3. The choice of programming language is important. None of the graph databases written in Java can handle large datasets efficiently (or at all). The only other engine that comes close to QLever’s efficiency is Virtuoso, but its codebase is old, poorly documented, and written in C, which is a major road block for further development. QLever is written in modern C++ and very well documented.

Any project that provides its data as RDF benefits from an efficient graph database in multiple ways:

1. The ability to carry out complex queries. Many existing graph databases are too slow for this, even on medium-sized datasets.

2. The ability to compute and download large results. Existing graph databases have problems with this or cannot do it at all. QLever is explicitly designed to handle this.

3. Statistics queries and queries that identify errors in the dataset. These are often among the most compute-intensive queries and many existing graph databases cannot handle them; QLever can.

4. QLever provides context-sensitive auto-completion that lets users construct SPARQL queries incrementally, requiring only a basic understanding of SPARQL and no particular knowledge of the dataset. This is a unique feature of QLever.

5. QLever allows complex spatial queries and the interactive display of very large numbers of geometric objects on a map. No other spatial database has this feature.

Already now, QLever is the fastest graph database on the market, by a wide margin. As described above, this is thanks to its sophisticated and carefully designed core algorithms and data structures, the very careful algorithm engineering, and the use of modern C++.

As mentioned above, and elaborated more in the Section on “Implementation” below, we have extensive experience in all these fields (algorithm design, algorithm engineering, C++, system building).

Continuing along these lines, we have no doubt that we can make QLever fully compliant with the SPARQL 1.1 standard of the W3C, while maintaining its extraordinarily high efficiency.

Impact

Impact Description

At least 2 out of the 3 impacts listed below need to be addressed (see explanation text for each one of the following 3 criteria)

  • I Market opportunity
  • II Open-Source based Commercial Strategy and Scalability
  • III Environmental and social impact

Market potential: The market for graph databases is booming; the recent survey https://arxiv.org/abs/2102.13027 from 2021 lists over one hundred graph databases. According to a recent Gartner report, “Semantic Data Integration & Knowledge Graphs” is one of the top-10 trends in Data Integration and Engineering: https://www.linkedin.com/posts/juansequeda_add-semantic-data-integration-knowledge-activity-7196105766615801856-FOIk/

Degree of competition: We keep a close eye on the other graph databases. According to a recent performance evaluation, QLever is by far the most efficient: https://github.com/ad-freiburg/qlever/wiki/QLever-performance-evaluation-and-comparison-to-other-SPARQL-engines

End-user needs: As described above, users from various communities (Wikidata, OpenStreetMap, UniProt, …) are currently experiencing significant limitations with current tools because of long query times and limited query capabilities. QLever can overcome all of these limitations.

Interoperability of datasets is one of the most important topics of our time. An area where this is particularly evident is health care. The current situation is still bleak: different hospitals (and often even different departments in the same hospital) use different (often proprietary) systems to store their data, which makes an exchange very difficult.

The HL7 FHIR standard has been a big step forward in this respect in that it creates a common (non-proprietary) exchange format. One of its base standards is RDF, which is the basis for interoperability.

Once large amounts of health-care data become available as FHIR-RDF, efficient graph databases will be key for storing and querying them, just like they are key already now for big projects like Wikidata, UniProt, or OpenStreetMap. We expect this to happen soon. Then an efficient graph database like QLever has the potential to be a game changer.

We already mentioned how other projects (Wikidata, OpenStreetMap, …) each have their own community with its own specific use cases: complex queries, downloading of parts of the dataset, statistics queries, explorative queries, and queries that aim at finding errors or inconsistencies in the dataset.

All these projects suffer from efficiency issues with their current systems. The main road block for using QLever is the missing compliance with the full SPARQL 1.1 standard. Once this compliance is reached, a quick adoption of QLever is very likely, with the corresponding advantages for the users from these communities.

We also reiterate that QLever makes it easy for users to set up services on their own machines. This is harder for other engines not only because the setup is often complicated, but more importantly, because other graph databases require expensive machines or even multiple machines, whereas QLever is designed to function well on a standard PC with moderate resources.

Implementation

Implementation Description

Team: You should see yourself as an NGI Searcher with the ambition of changing the governance of social networks, management and leadership qualities for the better.

Resources: Demonstrate the quality and effectiveness of the resources assigned to complete the proposed milestones.

Hannah Bast has 35 years of experience in algorithms research and building software systems that actually work in practice. In particular, she is known for CompleteSearch, the search technology behind https://dblp.org , the public transit routing on https://www.google.com/maps (designed and implemented during an extended sabbatical at Google), and for https://qlever.cs.uni-freiburg.de .

Johannes Kalmbach has been working on QLever in Hannah Bast’s group since 2018. In a short time, he has become QLever’s chief developer, introducing many important innovations to the code and helping numerous contributors with the onboarding. He is very proficient in modern C++. His commit history speaks for itself:
https://github.com/joka921

Robin Textor-Falconi has been working on QLever in Hannah Bast’s group since 2021. He has implemented live query analysis and is currently working on reducing QLever’s memory footprint during query processing: https://github.com/RobinTF .

We are very enthusiastic about building software systems, based on sophisticated algorithms and data structures, that bring powerful search capabilities to users. Our prime goals are high efficiency, resourcefulness, and making it as easy as possible for users to formulate complex queries with as little expertise as possible. QLever also provides an intuitive query analysis tool, which helps users understand the results and the resource requirements of their queries.

At the same time, we also want to cater to expert users, giving them the possibility to understand what is happening under the hood if they want to. A perfect example of this is our QLever command-line tool at https://github.com/ad-freiburg/qlever-control . Setting up your own SPARQL endpoint is as easy as “qlever setup-config wikidata && qlever get-data && qlever index && qlever start”. On demand, the tool also shows the detailed command lines it uses, and it is highly configurable.

Hannah Bast

University of Freiburg, Chair for Algorithms and Data Structures

Female

Oversees the complete development of QLever, including hiring, documentation, and code reviews, and contributes code herself.

https://ad.informatik.uni-freiburg.de/staff/bast/cv

https://ad.informatik.uni-freiburg.de

Team members

Johannes Kalmbach

Ph.D. student

Germany

--

--

Robin Textor-Falconi

Prospective Ph.D. student

Germany

--

--

Other Team Information

Please also provide the following additional information about the team.

Johannes Kalmbach and Robin Textor-Falconi can work on the project full time.
Hannah Bast can spend more than 50% of her time on the project.

N/A

Work Plan and Resources to be committed

We want to complete the following four milestones, each of which requires about six months of work if performed by a single person (the third and fourth milestones can easily be split up):

1. Implement SPARQL 1.1 Update. This concerns dynamic updates of the dataset and is a deal-breaker for all of the projects mentioned above (Wikidata, OpenStreetMap, UniProt).

2. Implement SPARQL 1.1 Federated search. QLever already has a basic implementation, but important features like variable SERVICE IRIs are still missing, and the implementation is not yet efficient when large results are involved.


3. Implement missing SPARQL 1.1 features. This is a hard requirement of all the projects mentioned above. In particular, support for named graphs is used in many applications and is still missing.


4. Implement missing GeoSPARQL features. This is important for projects based on OpenStreetMap data. In particular, dynamic spatial joins and primitives like geof:distance are not yet supported.

150

The money will be used to pay two full-time positions for 9 months each (Johannes Kalmbach and Robin Textor-Falconi, see above), as well as two student assistants.

The software development of QLever requires people with a strong background in algorithms, but who are also able to put these ideas into practice, and who are experienced in writing professional C++ code. Johannes Kalmbach and Robin Textor-Falconi are ideal candidates for this job, given their track record so far.

The student assistants will help with coding tasks that require less expertise. We have several competent candidates for these jobs.

Statistical Section

Statistical Section Description

We kindly ask you to fill in some statistical questions about how you heard about NGI Search Open Call. Thank you!

Word of mouth

DECLARATION OF HONOUR

Declaration of Honour

Please carefully read the statements below. You will not be able to change the statements after the deadline. By ticking the boxes below, I confirm that:

Moreover, the entity I represent, or persons with power of representation, decision-making or control over the aforementioned legal entity:

PROCESSING OF PERSONAL DATA

Processing of Personal Data Description

You can read the GDPR (processing of personal data) information clause for this open call following this link here.

Yes

Yes

Yes

YES
