My GRC20 Hackathon participation

Participation

We participated in the first GRC20 Hackathon with SpaceDev.

We were assigned the academic fields of Economics and Finance, with the entity types Course, Lesson, Paper, School, and Topic.

We chose to work with Economics and Finance papers and decided to use arXiv.org as our source for the papers.

arXiv is a curated research-sharing platform that provides an API and keeps each paper's metadata well structured and organized into categories.

Data structure

With papers as the main entity to track, we thought the entry point would be to identify their main characteristics: papers belong to spaces related to their categories or academic fields (we focused on Economics and Finance, but we could track other categories associated with the same papers), and each paper has one or more authors associated with it.

For the authors we only have their names, but a name is enough to identify an author as a node and potentially associate different papers with the same author.

It is important to check for already existing academic field spaces, since those are the ones most likely to be reused by others. We decided to use the existing academic fields and relate each paper to them through the academic field property; when an academic field has an associated space, we also link that space in the paper's related spaces.

The papers have some associated categories in arXiv, so we decided to map those categories to tags.

The papers will have a title, an abstract, authors, academic fields, related spaces, tags, a published date, a web URL, and a download URL.
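As a rough sketch, the paper entity described above could be modeled like this in TypeScript (the field names are ours, mirroring the list above rather than any GRC20 SDK type):

```typescript
// Sketch of the paper data model described above.
// Field names are illustrative, not taken from any SDK.
interface Paper {
  title: string;
  abstract: string;
  authors: string[];        // author names; each becomes its own node
  academicFields: string[]; // IDs of the existing academic field entities
  relatedSpaces: string[];  // IDs of spaces linked to those fields, if any
  tags: string[];           // IDs of the tags mapped from arXiv categories
  publishedDate: string;    // ISO 8601 date, e.g. "2024-05-17"
  webUrl: string;           // arXiv abstract page
  downloadUrl: string;      // PDF download link
}
```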

Data processing and uploading

1. Downloading papers from arXiv

We downloaded all available papers from arXiv related to the assigned categories. Since arXiv provides an API, we used a script to fetch all the papers' metadata and store it locally as JSON files.
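As an illustration, the fetch script could look like the following. This is a hedged sketch: it assumes Node 18+ (for the global fetch), the fast-xml-parser npm package, and an illustrative subset of the Economics and Finance arXiv categories; the real script's category list and paging parameters may differ.

```typescript
import { writeFileSync } from 'node:fs';
import { XMLParser } from 'fast-xml-parser';

const CATEGORIES = ['econ.GN', 'q-fin.GN']; // illustrative subset
const PAGE_SIZE = 100;

async function fetchCategory(category: string): Promise<void> {
  const parser = new XMLParser();
  for (let start = 0; ; start += PAGE_SIZE) {
    const url =
      `http://export.arxiv.org/api/query?search_query=cat:${category}` +
      `&start=${start}&max_results=${PAGE_SIZE}`;
    const xml = await (await fetch(url)).text();
    const feed = parser.parse(xml).feed;
    // A page with a single result is parsed as an object, not an array.
    const entries = feed.entry ? [].concat(feed.entry) : [];
    if (entries.length === 0) break; // no more pages
    writeFileSync(
      `papers-${category}-${start}.json`,
      JSON.stringify(entries, null, 2),
    );
    // Pause between requests, as the arXiv API asks clients to rate-limit.
    await new Promise((resolve) => setTimeout(resolve, 3000));
  }
}

for (const category of CATEGORIES) {
  await fetchCategory(category);
}
```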

2. Defining Entities and Database Schema

With all the downloaded files, we identified each component and determined the necessary entity definitions (as described in the previous section). Then, we designed and implemented the database schemas to store each of these entities.
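Under the same hypothetical model, the supporting entities are small records that papers reference by ID (again, these names are ours, not SDK types):

```typescript
// Supporting entities referenced by the Paper interface above; illustrative only.
interface Author {
  id: string;
  name: string; // the only author attribute available in arXiv metadata
}

interface Tag {
  id: string;
  name: string;          // human-readable label
  arxivCategory: string; // the arXiv category it was mapped from, e.g. "q-fin.GN"
}

interface AcademicField {
  id: string;       // ID of the pre-existing academic field entity
  spaceId?: string; // its associated space, when one exists
}
```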

3. Publishing to Testnet

Once the entities were defined, we started publishing test examples of each entity on the testnet, making adjustments as needed.

4. Choosing an Upload Strategy

To ensure all entities were properly connected, we considered two approaches:

  • Option 1: Paper-First Approach. This method involved starting with the paper, checking each relation, and creating any missing related entities on the fly. However, this proved inefficient for large datasets, as certain entities (e.g., an academic field like "Economics") would need to be checked repeatedly across multiple papers.

  • Option 2: Entity-First Approach. In this approach, we first deployed all related entities before uploading the papers. For example, we would first create the "Economics" academic field, and then associate papers with it afterward.

Based on our data structure, we opted for the Entity-First Approach. We first deployed all tags, persons, and the arXiv project, and only then did we proceed with deploying all the papers using these pre-existing entities.
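A sketch of the entity-first flow, under the same assumed model (deployEntity below is a hypothetical stand-in for whatever publishing call the SDK actually provides):

```typescript
// Entity-first upload: publish each shared entity exactly once,
// then publish papers that reference those entities by ID.
// `deployEntity` and `RawPaper` are hypothetical stand-ins.
declare function deployEntity(kind: string, data: object): Promise<string>;

interface RawPaper {
  title: string;
  abstract: string;
  authors: string[];    // author names from arXiv metadata
  categories: string[]; // arXiv categories, later mapped to tags
}

async function uploadAll(papers: RawPaper[]): Promise<void> {
  const authorIds = new Map<string, string>();
  const tagIds = new Map<string, string>();

  // Pass 1: deduplicate and deploy the shared entities.
  for (const paper of papers) {
    for (const name of paper.authors) {
      if (!authorIds.has(name)) {
        authorIds.set(name, await deployEntity('Person', { name }));
      }
    }
    for (const category of paper.categories) {
      if (!tagIds.has(category)) {
        tagIds.set(category, await deployEntity('Tag', { name: category }));
      }
    }
  }

  // Pass 2: papers link to the already-deployed entities.
  for (const paper of papers) {
    await deployEntity('Paper', {
      ...paper,
      authors: paper.authors.map((name) => authorIds.get(name)),
      tags: paper.categories.map((category) => tagIds.get(category)),
    });
  }
}
```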

5. Deploying to Mainnet

Once we were satisfied with the results, we updated the code and repeated the process on mainnet. For the hackathon, we joined the relevant spaces we wanted to contribute to and deployed a small number of papers to verify that everything was functioning correctly.
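The switch itself can be as small as a configuration flag; a hypothetical sketch (the option names are ours, not the SDK's):

```typescript
// Hypothetical network configuration; names are illustrative.
const NETWORK = process.env.GRC20_NETWORK ?? 'TESTNET'; // 'MAINNET' for the real run
// Deploy only a small batch first to verify everything works end to end.
const MAX_PAPERS = NETWORK === 'MAINNET' ? 10 : Number.POSITIVE_INFINITY;
```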

Code

All the code we used in the hackathon was published in a GitHub repository.