Automation: Direction #3

Description

@github-actions

Last generated: 2026-01-22T18:46:14.274Z
Provider: openai
Model: gpt-5.2

Summary

This repo is a Maven-based Java MapReduce project targeting Common Crawl data. Some automation likely exists already (new .github content was recently added), but the repo has no clear CI gates for repeatable builds, unit/integration test execution, dependency hygiene, or basic static checks. The lowest-risk automation direction is to make “build + test + package” deterministic and fast in CI first, then add lightweight quality and security checks that do not require a Hadoop cluster.

Direction (what and why)

Direction: Establish a minimal, deterministic CI pipeline around Maven that:

  1. builds and runs tests on every PR/push,
  2. produces a versioned jar artifact, and
  3. enforces basic hygiene (format/lint optional, but dependency & CVE scanning is high-value).

Why: MapReduce/Hadoop jobs often fail late, on the cluster, because of packaging problems, dependency conflicts, or subtle runtime assumptions. A CI gate that validates compilation, tests, shading/assembly (if used), and dependency sanity reduces deploy-time failures and makes changes safer. Keep the pipeline “local-mode friendly” so that CI never needs a running Hadoop cluster.

Plan (next 1-3 steps)

1) Make Maven builds deterministic and CI-friendly

Files to add/change:

  • Add .mvn/wrapper/maven-wrapper.properties and mvnw, mvnw.cmd (Maven Wrapper)
  • Add/adjust .github/workflows/ci.yml

Concrete actions:

  • Use Maven Wrapper in CI to avoid runner Maven version drift.
  • Enable dependency caching in CI (actions/setup-java with cache: maven).
  • Run a build matrix over the supported Java versions (likely 8 and 11). Pick them from pom.xml; if that is inconclusive, start with 8 and 11 and trim later.

CI job steps (example target):

  • ./mvnw -B -ntp -DskipTests=false test
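  • ./mvnw -B -ntp package (the package phase re-runs the test phase; add -DskipTests to this step if the duplicate run is too slow)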
  • Upload built jar(s) from target/ as workflow artifacts.

Definition of done:

  • CI passes on a clean checkout with no manual steps.
  • Artifacts are available for download from the Actions run.

2) Add baseline test scaffolding that doesn’t need a cluster

Files to add/change:

  • src/test/java/... (new unit tests)
  • pom.xml: configure maven-surefire-plugin (unit tests) and, optionally, maven-failsafe-plugin (integration tests).

Concrete actions:

  • Add small tests around any pure functions / parsing utilities (WARC parsing helpers, URL normalization, record filtering rules, etc.).
  • If the project reads sample inputs, add a tiny fixture under src/test/resources/ (keep it <100KB).
  • If Hadoop classes are used, prefer local/unit testing:
    • Test mappers/reducers with MRUnit (if compatible; note that Apache MRUnit has been retired and is no longer maintained) or lightweight harnesses, or
    • Test parsing/extraction logic separately from Hadoop glue.
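
As a rough illustration of keeping logic separate from Hadoop glue, the sketch below shows a pure URL-normalization helper with no Hadoop imports. The class name UrlNormalizer and its exact rules (lower-case scheme and host, drop the fragment, drop default ports) are assumptions made for this example, not something taken from the repo; a mapper would simply delegate to a helper like this.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Hypothetical helper: pure logic, no Hadoop types, so it can be unit tested directly. */
final class UrlNormalizer {
    private UrlNormalizer() {}

    /**
     * Lower-cases scheme and host, drops the fragment, and omits default ports.
     * Returns null for null, unparseable, or relative input.
     */
    static String normalize(String raw) {
        if (raw == null) {
            return null;
        }
        URI uri;
        try {
            uri = new URI(raw.trim());
        } catch (URISyntaxException e) {
            return null;
        }
        if (uri.getScheme() == null || uri.getHost() == null) {
            return null;
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        int port = uri.getPort();
        boolean defaultPort = (port == 80 && scheme.equals("http"))
                || (port == 443 && scheme.equals("https"));

        StringBuilder out = new StringBuilder(scheme).append("://").append(host);
        if (port != -1 && !defaultPort) {
            out.append(':').append(port);
        }
        String path = uri.getRawPath();
        out.append(path == null || path.isEmpty() ? "/" : path);
        if (uri.getRawQuery() != null) {
            out.append('?').append(uri.getRawQuery());
        }
        return out.toString();
    }
}
```

A unit test would then assert on plain strings, for example that normalize("HTTP://Example.COM:80/a/b?x=1#frag") returns "http://example.com/a/b?x=1" and that normalize("not a url") returns null, without starting any Hadoop job.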

Definition of done:

  • At least three unit tests run in CI and validate key behavior.
  • The unit test suite completes in under 30 seconds.

3) Add dependency and security hygiene checks (low noise)

Files to add/change:

  • pom.xml: add OWASP Dependency-Check plugin (or use GitHub’s dependency review + Dependabot)
  • .github/dependabot.yml (optional but recommended)
  • .github/workflows/dependency-review.yml (optional if PR-based)

Concrete actions:

  • Add Dependabot for Maven (pom.xml) with weekly cadence.
  • Add OWASP dependency-check in CI as a non-blocking step initially (warn-only), then promote to blocking once tuned.

Definition of done:

  • Automated dependency update PRs appear.
  • CI surfaces known CVEs without breaking builds unexpectedly.

Risks/unknowns

  • Java/Hadoop compatibility: The required Java version may be pinned by Hadoop dependencies (often Java 8 for older stacks). Check the maven.compiler.* properties and the dependency versions in pom.xml before finalizing the CI matrix.
  • Packaging strategy: If the job requires a shaded “fat jar” (common for Hadoop), pom.xml may need maven-shade-plugin configuration. CI should validate the actual deployable artifact.
  • Tests may require external data: If current code assumes large Common Crawl inputs, we’ll need to factor logic so it can be tested with small fixtures.
  • Existing workflows: A recent commit mentions “workflows synced”. Avoid adding workflows that duplicate or conflict with those already in .github/workflows/; consolidate into one clear CI entrypoint.

Suggested tests

  1. Maven build sanity
    • ./mvnw -B -ntp test
    • ./mvnw -B -ntp package (verify jar exists in target/)

  2. Rule/config validation (if rules.json is consumed by code)
    • Add a unit test that loads rules.json from the repo root (or move it to src/main/resources/) and validates schema/required fields.
    • Fail fast on malformed JSON or missing keys.

  3. Small fixture processing
    • Add a minimal input sample under src/test/resources/ (tiny WARC snippet or representative record format used by the job).
    • Unit test that parsing/extraction returns expected output.

  4. (Optional) Integration test profile
    • ./mvnw -B -ntp -Pintegration-test verify
    • Keep it local-only (no cluster), and skip by default in developer workflows if slow.
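
The small-fixture test in item 3 could start from something like the sketch below, assuming the job reads WARC records. WarcHeaderParser is a hypothetical name, and the parser handles only the basic header layout (a version line such as WARC/1.0, then "Name: value" fields, ending at a blank line). It ignores folded continuation lines and treats field names as case-sensitive, which a real implementation may need to revisit.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for the header block of a single WARC record. */
final class WarcHeaderParser {
    private WarcHeaderParser() {}

    /**
     * Reads the version line and the named fields up to the first blank line.
     * Throws IOException if the input does not look like a WARC header.
     */
    static Map<String, String> parseHeaders(Reader in) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String version = reader.readLine();
        if (version == null || !version.startsWith("WARC/")) {
            throw new IOException("not a WARC record: " + version);
        }
        Map<String, String> headers = new LinkedHashMap<>();
        String line;
        while ((line = reader.readLine()) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                throw new IOException("malformed header line: " + line);
            }
            // Split on the first colon only, so URL values keep their own colons.
            headers.put(line.substring(0, colon).trim(),
                        line.substring(colon + 1).trim());
        }
        return headers;
    }
}
```

In a real test the fixture would be loaded from src/test/resources/ (for example through getResourceAsStream) and wrapped in an InputStreamReader. A StringReader over a few header lines is enough to check that a field such as WARC-Target-URI comes back with the expected value.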

Verification checklist (quick)

  • ./mvnw test passes locally on a clean checkout
  • GitHub Actions CI runs on PR and on push to master
  • CI uploads target/*.jar artifact
  • At least one test fails when intentionally breaking a core behavior (proves tests are meaningful)
  • Dependabot PRs appear (if enabled) and CI runs on them
