Building a GitHub Dependents Scraper with Quarkus and Picocli

2020-07-31 in Java tagged Client / GitHub / Java / JKube / Picocli / Quarkus by Marc Nuri | Last updated: 2021-10-09

Introduction

During the past few months, my team and I have been working very hard to release Eclipse JKube. JKube is the successor of the deprecated Fabric8 Maven Plugin, and as such, our main goal right now is to migrate the current user-base to the new project. You can learn more about JKube and how to get started in this other post.

GitHub provides some fancy stats and metrics, including information about the project’s dependency graph. This information is really valuable since we get to know which projects (within GitHub) depend on ours. So for our user-base migration use case, this information is spot on. Unfortunately, the GitHub developers API offers information about dependencies, but not about dependents.

In this blog post, I will show you how to create a simple web scraper using Picocli and Quarkus to build a native binary that will scrape dependents for any GitHub project.

Note: Make sure you comply with GitHub Scraping and API Usage Restrictions before using this tool.

Application bootstrap

The first step is to bootstrap the project. We can use the handy code.quarkus.io web interface to generate it. In this case, we only need the (experimental) Picocli extension. We’ll also need to add the jsoup dependency to our pom.xml.

Since the application will be scraping the GitHub website, which is served over HTTPS, we need to enable the https URL protocol for the GraalVM native image build. The Quarkus Maven Plugin provides a very simple configuration that allows us to pass additional build arguments to GraalVM.

<!-- ... -->
<build>
  <plugins>
    <plugin>
      <groupId>io.quarkus</groupId>
      <artifactId>quarkus-maven-plugin</artifactId>
      <executions>
        <execution>
          <id>package</id>
          <goals>
            <goal>native-image</goal>
          </goals>
          <configuration>
            <dockerBuild>${dockerBuild}</dockerBuild>
            <additionalBuildArgs>
              <additionalBuildArg>-H:EnableURLProtocols=https</additionalBuildArg>
              <additionalBuildArg>-H:EnableURLProtocols=http</additionalBuildArg>
            </additionalBuildArgs>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
<!-- ... -->

Quarkus Command line application

Next up, we will make our application command-line friendly by using Picocli. There are a few guides online that will help us achieve this, and even a full section in the quarkus-cheat-sheet.

Application Main Class

Our application is really simple and has a single entry point, so to make things easier, we’ll annotate the application’s main class with @CommandLine.Command. This is the resulting code:

import java.net.MalformedURLException;
import java.net.URL;
import javax.inject.Inject; // jakarta.inject.Inject on Quarkus 3 and later
import picocli.CommandLine;

@CommandLine.Command(name = "github-dependents")
public class Application implements Runnable {

  private final ScraperService scraperService;

  // Single mandatory positional argument: the dependents page to scrape
  @CommandLine.Parameters(index = "0", paramLabel = "URL", arity = "1",
    description = "GitHub URL to the project's dependents list")
  String dependentsUrl;

  @Inject
  public Application(ScraperService scraperService) {
    this.scraperService = scraperService;
  }

  @Override
  public void run() {
    try {
      // Fail fast on malformed input before doing any network work
      new URL(dependentsUrl);
      scraperService.scrape(dependentsUrl);
    } catch (MalformedURLException ex) {
      System.err.printf("URL %s is invalid, please provide a valid URL.%n", dependentsUrl);
      CommandLine.usage(this, System.out);
    } catch (Exception e) {
      System.err.println(e.getMessage());
      e.printStackTrace();
    }
  }

}

Since I want to make this application usable for any GitHub project, I need input from the user specifying which page they want to scrape. For this purpose, I’ll annotate a dependentsUrl field with @CommandLine.Parameters. Note that the annotation attributes (paramLabel, description, and so on) are also what Picocli uses to generate the CLI help message.

We’ll use the standard CDI @Inject annotation, supported by Quarkus, to inject the service class that performs the scraping.

Scraper Service

This class contains the actual logic that will scrape the GitHub dependents page. The program will recurse through the different pages and return a JSON representation of each dependent including organization, name, URL, stars, and forks.
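To make the page traversal more concrete, here is a minimal sketch of the idea using only the JDK. It is not the code from the repository: the Dependent and Page records, the collectAll method, and the stubbed fetchPage function are all hypothetical names introduced for illustration, and the real HTML fetching and parsing (done with jsoup in the actual service) is replaced by a stub. The sketch also uses a loop instead of recursion, which visits the pages in the same order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.function.Function;

public class DependentsPaging {

  /** One scraped dependent, with the fields listed in the post (hypothetical type). */
  record Dependent(String organization, String name, String url, int stars, int forks) {
    String toJson() {
      return String.format(Locale.ROOT,
        "{\"organization\":\"%s\",\"name\":\"%s\",\"url\":\"%s\",\"stars\":%d,\"forks\":%d}",
        escape(organization), escape(name), escape(url), stars, forks);
    }

    // Escapes only backslashes and double quotes, which is enough for this sketch
    private static String escape(String value) {
      return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }
  }

  /** Hypothetical result of parsing one page: its dependents plus the next-page link, if any. */
  record Page(List<Dependent> dependents, Optional<String> nextUrl) {}

  /** Follows next-page links until none is left, collecting every dependent on the way. */
  static List<Dependent> collectAll(String firstUrl, Function<String, Page> fetchPage) {
    List<Dependent> all = new ArrayList<>();
    Optional<String> next = Optional.of(firstUrl);
    while (next.isPresent()) {
      Page page = fetchPage.apply(next.get());
      all.addAll(page.dependents());
      next = page.nextUrl();
    }
    return all;
  }

  public static void main(String[] args) {
    // Stubbed two-page site standing in for the real fetching and parsing
    Function<String, Page> stub = url -> url.equals("page-1")
      ? new Page(List.of(new Dependent("acme", "app", "https://github.com/acme/app", 3, 1)),
          Optional.of("page-2"))
      : new Page(List.of(new Dependent("acme", "lib", "https://github.com/acme/lib", 0, 0)),
          Optional.empty());
    collectAll("page-1", stub).forEach(d -> System.out.println(d.toJson()));
  }
}
```

Passing the page fetcher in as a function keeps the traversal logic testable without network access. A real application would more likely rely on a JSON library than on hand-written string formatting.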

The service also performs some validations for the user inputs (valid URL, URL belongs to a GitHub dependents page, etc.) and has some logic to retry in case GitHub returns a 429 - Too Many Requests HTTP status code.
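The two safeguards described above could look roughly like the following JDK-only sketch. Everything in it is an assumption made for illustration rather than the repository's implementation: the isDependentsUrl check, the Response record, the Fetcher interface, the fetchWithRetry method, and the doubling delay between attempts are all hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

public class ScraperGuards {

  /** True when the URL looks like https://github.com/{owner}/{repo}/network/dependents. */
  static boolean isDependentsUrl(String url) {
    try {
      URI uri = new URI(url);
      return "https".equals(uri.getScheme())
        && "github.com".equalsIgnoreCase(uri.getHost())
        && uri.getPath() != null
        && uri.getPath().matches("/[^/]+/[^/]+/network/dependents/?");
    } catch (URISyntaxException e) {
      return false;
    }
  }

  /** Hypothetical HTTP response holding only what this sketch needs. */
  record Response(int status, String body) {}

  /** Hypothetical abstraction over the HTTP call so the retry logic can be tested offline. */
  interface Fetcher {
    Response get(String url) throws IOException;
  }

  /** Retries on 429, doubling the wait between attempts; any other non-200 status fails. */
  static String fetchWithRetry(String url, Fetcher fetcher, int maxAttempts, long initialDelayMs)
      throws IOException, InterruptedException {
    long delay = initialDelayMs;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      Response response = fetcher.get(url);
      if (response.status() == 200) {
        return response.body();
      }
      if (response.status() != 429) {
        throw new IOException("Unexpected status " + response.status() + " for " + url);
      }
      if (attempt < maxAttempts) {
        Thread.sleep(delay); // back off before asking again
        delay *= 2;
      }
    }
    throw new IOException("Still rate limited after " + maxAttempts + " attempts: " + url);
  }

  public static void main(String[] args) throws Exception {
    String url = "https://github.com/eclipse/jkube/network/dependents?package_id=UGFja2FnZS0xMDY0ODYxMDkz";
    System.out.println(isDependentsUrl(url));
    System.out.println(isDependentsUrl("https://example.com/eclipse/jkube"));

    // Fake fetcher: rate limited twice, then succeeds
    int[] calls = {0};
    Fetcher flaky = u -> ++calls[0] < 3 ? new Response(429, "") : new Response(200, "<html>ok</html>");
    System.out.println(fetchWithRetry(url, flaky, 5, 10));
  }
}
```

In a real scraper the wait would probably honor the Retry-After header when the server sends one, instead of relying only on a fixed doubling delay.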

Running the application

JVM

To run the application using a Java Virtual Machine, first, we need to compile and package the application:

mvn clean package

Once we’ve packaged the application we can run it with a target repository of our choice:

java -jar target/github-dependents-scraper-uber.jar "https://github.com/eclipse/jkube/network/dependents?package_id=UGFja2FnZS0xMDY0ODYxMDkz"

Native binary

To run the application using a native binary, first, we need to compile, package and build the native image for the application:

mvn clean package -Pnative

Once the binary file is ready we can run it with a target repository of our choice:

./target/github-dependents-scraper-uber "https://github.com/eclipse/jkube/network/dependents?package_id=UGFja2FnZS0xMDY0ODYxMDkz"

The following GIF shows a quick demo of the application running:

A demo of this application running

Conclusion

In this article, you’ve seen how easy it is to create a simple but very useful command-line tool using Quarkus and Picocli, and how to create a native binary with no pain using Quarkus features.

You can check the full source code for this post in the github-dependents-scraper GitHub repository.
