
Discussion: Project to Quantify the Commons #17

Closed

annatuma opened this issue Jul 29, 2022 · 4 comments

Comments

@annatuma

Background:

Creative Commons has submitted a project proposal to UMSI (University of Michigan School of Information), which has determined that this project is a potential fit for the course SI 485: Information Analysis Capstone and Final Project. In this course, advanced undergraduate students deliver data-oriented solutions through the development and analysis of data sets, building tools to extract useful information for clients through manipulation, analysis, and visualization. This ticket is intended for discussion of the project, with the goal of refining the potential questions we'd like answered and getting input from those who have considered this challenge in the past.

Project General Information

Project Idea:

Creative Commons (CC) seeks to quantify the use of CC legal tools (works in the commons). CC legal tools include the licenses (e.g. CC BY, CC BY-NC-SA) and public declarations (e.g. CC0, PDM). This project would include data collection, analysis, and visualization.

Potential questions to be answered:

  • How many works are in the commons?
  • What can we determine from the rate of change?
  • How can those works be characterized (e.g. by legal tool, region, language)?
  • How can the data be managed to allow future trend analysis (e.g. which languages saw the largest growth in legal tool adoption)?
  • How can the use of CC legal tools be meaningfully visualized?

Developing reproducible methodologies for gathering information about the use of CC legal tools will help CC communicate its impact, support policy work (at all levels of government and within institutions), and support the wider community.

Full Description

Creative Commons (CC) seeks to quantify the use of CC legal tools (works in the commons). CC legal tools include the licenses (e.g. CC BY, CC BY-NC-SA) and public declarations (e.g. CC0, PDM). This project would include data collection, analysis, and visualization.

First, this project should create reproducible processes or methodologies for creating a dataset of information about works that are CC licensed or dedicated to the public domain. The dataset may be built from platform APIs (e.g. Flickr), Common Crawl data, etc. The project should create a starting place not only for the project itself, but also for future efforts to extend the dataset and the meaning derived from it.
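As a concrete illustration of the platform-API route, here is a minimal sketch that tallies works per license on a single platform (Flickr) using its public REST API. It assumes a Flickr API key in a `FLICKR_API_KEY` environment variable (my naming, not Flickr's); a real collector would need paging, retries, and rate-limit handling on top of this.

```python
# Minimal sketch: count CC-licensed works on one platform (Flickr).
# Assumes a Flickr API key is exported as FLICKR_API_KEY.
import os

import requests

REST = "https://api.flickr.com/services/rest/"
KEY = os.environ["FLICKR_API_KEY"]


def flickr(method, **params):
    """Call one Flickr REST method and return the decoded JSON response."""
    params.update(method=method, api_key=KEY, format="json", nojsoncallback=1)
    response = requests.get(REST, params=params, timeout=30)
    response.raise_for_status()
    return response.json()


# Enumerate Flickr's license identifiers (CC licenses, CC0, PDM, ...).
licenses = flickr("flickr.photos.licenses.getInfo")["licenses"]["license"]

# For each license, issue a one-result search and read the reported total.
for lic in licenses:
    result = flickr("flickr.photos.search", license=lic["id"], per_page=1)
    print(f'{lic["name"]}: {result["photos"]["total"]}')
```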

Second, the project should begin to create meaning from the dataset. How many works are currently in the commons? How has that changed/trended? How can those works be characterized (e.g. by legal tool, region, language)? How can the data be managed to allow future trend analysis (e.g. which languages saw the largest growth in legal tool adoption)?

Third, optionally, how can the data be visualized to communicate meaning and allow exploration?

Project Outcome

  • What deliverable(s) would students produce and share with your organization as a result of this project?
  • How do you plan to use the feedback, recommendations, or product you receive from the student team?

Students should create reproducible processes or methodologies for creating a dataset, the resulting dataset itself, and an analysis of it. Optionally, students may create visualizations of the dataset.

The processes, dataset, and analysis will help Creative Commons communicate its impact, support policy work (at all levels of government and within institutions), and support the wider community.

What do students need for this project to be successful?

Examples: skills needed, social impact orientation, interest or experience in a specific field/domain/industry.

Curiosity, motivation, proficiency in a programming language that can be used to query APIs and manipulate data (e.g. JavaScript, Perl, Python, Ruby; Python is preferred), and a recognition of the value of open knowledge.

Data Proposal Information

Data Set

We expect students to create a new data set for us

Size of Data Set

How big is the data set? Approximately how many rows and columns does it have?

Between 200 million and 2 billion rows with 10 columns. The last effort to quantify the commons in 2017 estimated 1.4 billion works. I expect metadata can be discovered on at least 200 million works. Columns could include: URL, author, date, legal_tool, language, reference_count, etc.
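Purely for illustration, those rows might look like the following; the fields beyond the ones named above are hypothetical guesses, not a CC specification.

```python
# Illustrative only: one possible shape for the ~10-column dataset.
from dataclasses import dataclass


@dataclass
class Work:
    url: str              # where the work (or its metadata) was found
    author: str           # creator attribution, when available
    date: str             # ISO 8601 date the work was published
    legal_tool: str       # e.g. "CC BY 4.0", "CC0 1.0", "PDM 1.0"
    language: str         # BCP 47 tag, e.g. "en", "pt-BR"
    reference_count: int  # inbound links/citations, where measurable
    region: str           # hypothetical: coarse geography, if discoverable
    media_type: str       # hypothetical: "image", "text", "audio", ...
    source: str           # hypothetical: platform or crawl the row came from
    retrieved_at: str     # hypothetical: when this row was gathered
```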

Findings from Data Set

What do you want to learn from your data set? Please share 3-5 specific questions that the data can help solve:

  • How many works are in the commons?
  • What can we determine from the rate of change?
  • How can those works be characterized (e.g. by legal tool, region, language)?
  • How can the data be managed to allow future trend analysis (e.g. which languages saw the largest growth in legal tool adoption)?

Data Availability, Type, Format

No dataset currently exists and CC has not made a recommendation on format. Input is welcome on this subject.

@annatuma added the 🟩 priority: low, 🚦 status: awaiting triage, ✨ goal: improvement, and 💻 aspect: code labels on Jul 29, 2022
@annatuma (Author)

@TimidRobot I've created this discussion ticket on the basis of the proposal you prepared and submitted to UMSI.

I'm still thinking about how to refine the specific questions in terms of findings. In doing so, I've also been recalling past conversations on this challenge and topic with @mathemancer and @kgodey. If they are able to read through and share any feedback, we would certainly welcome their input.

@annatuma added the question and 💬 talk: discussion labels and removed the 🚦 status: awaiting triage, ✨ goal: improvement, and 💻 aspect: code labels on Jul 29, 2022
@cc-open-source-bot added the 🏷 status: label work required label on Jul 30, 2022
@mathemancer (Contributor)

Really cool to see you all are getting back into this!

First question (how many works)

One quick refinement of the first question above might be to choose one of:

  • "there are at least X works" where X is guaranteed,
  • "we estimate there are about Y works", or
  • "we estimate at least Z works" (where Z > X).

Answering the first of these (a guaranteed count) requires more caution in data gathering, and probably means leaning more on the APIs of major known data sources, or simply looking the numbers up by hand. A basic approach would be to ask the appropriate person "How many Wikipedia pages are there?", "How about broken down by language?", "How many files of various licenses are hosted by Wikimedia Commons?", etc. The problem is that this likely won't result in anything but a bare count with very little metadata, and the structure described in the prompt requires more.

So, you'd probably need to use the Wikimedia Commons API (sticking with the Wikimedia example) to gather the data. This can be slow, and requires a fair amount of domain knowledge about which APIs to hit and their structure. For example, the CC Catalog needed about 6 months to completely refresh the image metadata from Wikimedia Commons due to rate limits on their API, and that covers only media files. You'd also need to figure out a way to gather similar metadata about Wikipedia pages (across all languages). As you're aware, many of the software pieces needed for gathering this are already developed and available, but the infrastructure to actually run that software is non-trivial.
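For a feel of what that API walking involves, here's a small sketch against the public MediaWiki API on Wikimedia Commons (an illustration only, not CC Catalog code):

```python
# Sketch: page through Wikimedia Commons files and tally license names.
# A real run needs full continuation handling (hundreds of millions of
# files) and must respect the API's rate limits.
import collections

import requests

API = "https://commons.wikimedia.org/w/api.php"
session = requests.Session()
session.headers["User-Agent"] = "quantifying-sketch/0.1 (example only)"

params = {
    "action": "query",
    "generator": "allimages",   # iterate over files on Commons
    "gailimit": 50,
    "prop": "imageinfo",
    "iiprop": "extmetadata",    # includes LicenseShortName
    "format": "json",
}

counts = collections.Counter()
for _ in range(3):  # a few pages only; a real run continues until exhausted
    data = session.get(API, params=params, timeout=30).json()
    for page in data.get("query", {}).get("pages", {}).values():
        meta = page.get("imageinfo", [{}])[0].get("extmetadata", {})
        counts[meta.get("LicenseShortName", {}).get("value", "unknown")] += 1
    if "continue" not in data:
        break
    params.update(data["continue"])

print(counts.most_common())
```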

Given all that, I suspect the second and/or third question(s) above is/are going to be more appropriate for the project. It's also a more interesting problem to work on. For that, it's probably better to lean on Common Crawl, with some fun analytics to try to estimate how much is missed by a given Common Crawl run (or set of Common Crawl runs). This gives you more flexibility, and less time spent simply waiting for the machinery to gather data. It's also the only "easy" way to get the network data (e.g., links between pages containing CC licenses).
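To make the Common Crawl route concrete, a rough sketch that scans one WAT (link metadata) file for pages linking to a CC license deed, using the warcio library; the file path below is a placeholder, with real paths listed in each crawl's wat.paths.gz:

```python
# Rough sketch: count pages in one Common Crawl WAT file that link to a
# CC license deed. Requires `pip install warcio requests`.
import requests
from warcio.archiveiterator import ArchiveIterator

# Placeholder path; substitute a real one from the crawl's wat.paths.gz.
WAT_URL = ("https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-27/"
           "segments/<segment>/wat/<file>.warc.wat.gz")

licensed_pages = 0
with requests.get(WAT_URL, stream=True, timeout=60) as resp:
    for record in ArchiveIterator(resp.raw):
        if record.rec_type != "metadata":
            continue
        payload = record.content_stream().read()
        # Cheap substring test; a real pipeline would parse the JSON
        # envelope and inspect HTML-Metadata -> Links for exact deed URLs.
        if b"creativecommons.org/licenses/" in payload:
            licensed_pages += 1

print("pages linking to a CC license deed in this WAT file:", licensed_pages)
```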

One could tackle all three questions, but for that I'd suggest two completely separate sub-teams with one working on the first question and the other the second and third.

Second question (rate of change)

This could be tackled with either approach described above, since most API sources track when a work was added to the collection, and Common Crawl runs monthly. In fact, it's probably easier to get a satisfying answer or set of answers to this question than the first, since you don't need to worry as much about missing data.
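Once a dataset with a date column exists, the aggregation itself is simple; a sketch (the filename and columns assume the dataset described above):

```python
# Sketch: monthly growth per legal tool, assuming a dataset with `date`
# and `legal_tool` columns (the CSV name here is made up).
import pandas as pd

works = pd.read_csv("commons_works.csv", parse_dates=["date"])

monthly = (
    works.assign(month=works["date"].dt.to_period("M"))
         .groupby(["month", "legal_tool"])
         .size()
         .unstack(fill_value=0)
)

# Cumulative totals show how the commons has grown over time.
print(monthly.cumsum().tail())
```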

Third question

This depends on the metadata gathered (clearly). I'd suggest tailoring the details to that.

Fourth question

This is kind of a combo of the previous questions. That said, if CC really wants to keep a handle on this over time, they should reduce the scope of the project to what can be robustly implemented in such a way that it's easy to maintain and keep using into the future. The first instinct of most data science students (or indeed most data scientists) given the above questions will be to put together something ad-hoc that answers the questions once. This doesn't suffice in this case, and that will be quite a difficult aspect of the project. Even with all the work and development on CC Catalog, we still had parts of the data-gathering process that required manual intervention by an expert on a regular basis.

@TimidRobot (Member)

@mathemancer Thank you! This is very helpful!

I especially appreciate your insight about reducing scope and splitting the work between sub-teams.

@TimidRobot (Member)

The project repository for Quantifying the Commons is: creativecommons/quantifying
