Discussion: Project to Quantify the Commons #17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@TimidRobot I've created this discussion ticket on the basis of the proposal you prepared and submitted to UMSI. I'm still thinking about how to refine the specific questions in terms of findings. In doing so, I've also been recalling past conversations on this challenge and topic with @mathemancer and @kgodey. If they are able to read through and share any feedback, we would certainly welcome their input.
Really cool to see you all are getting back into this!

First question (how many works)

One quick refinement of the first question above might be to choose one of
Answering the first question requires more caution in data gathering, and probably leaning more on the APIs of the major known data sources, or simply looking it up by hand. A basic approach would be to ask the appropriate person "How many Wikipedia pages are there?", "How about broken down by language?", "How many files of various licenses are hosted by Wikimedia Commons?", etc. The problem is that this likely won't result in anything but a bare count with very little metadata, and the structure described in the prompt requires more. So, you'd probably need to use the WMC API (continuing the Wikipedia example) to gather the data. This can be slow, and requires a fair amount of domain knowledge about which APIs to hit and their structure. For example, the CC Catalog needed about 6 months to completely refresh the image metadata from Wikimedia Commons due to rate limits on their API.

That would only cover media files. You'd also need to figure out a way to gather similar metadata about Wikipedia pages (across all languages). As you're aware, many of the software pieces needed for this gathering are already developed and available, but the infrastructure to actually run that software is non-trivial.

Given all that, I suspect the second and/or third question(s) above will be more appropriate for the project. It's also a more interesting problem to work on. For that, it's probably better to lean on Common Crawl, with some fun analytics to try to estimate how much is missed by a given Common Crawl run (or set of runs). This gives you more flexibility, and less time spent simply waiting for the machinery to gather data. It's also the only "easy" way to get the network data (e.g., links between pages containing CC licenses).

One could tackle all three questions, but for that I'd suggest two completely separate sub-teams, with one working on the first question and the other on the second and third.
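To make the Common Crawl route concrete, here is a rough sketch of how one might spot and tally CC license links in a page's HTML; the regex and function names are mine, not from any existing CC codebase, and a real pipeline would run this over extracted WARC records:

```python
import re
from collections import Counter

# Matches canonical CC license deed URLs, e.g.
# https://creativecommons.org/licenses/by-sa/4.0/
CC_LICENSE_RE = re.compile(
    r"creativecommons\.org/licenses/"
    r"(?P<unit>[a-z-]+)/(?P<version>\d+\.\d+)"
)

def extract_cc_licenses(html: str) -> Counter:
    """Count CC license references (e.g. 'by-sa/3.0') found in raw HTML."""
    return Counter(
        f"{m.group('unit')}/{m.group('version')}"
        for m in CC_LICENSE_RE.finditer(html)
    )

sample = (
    '<a rel="license" href="https://creativecommons.org/licenses/by/4.0/">'
    'CC BY 4.0</a> and '
    '<a href="http://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA</a>'
)
print(extract_cc_licenses(sample))  # Counter({'by/4.0': 1, 'by-sa/3.0': 1})
```

Aggregating these per-page counters across a crawl would yield the per-license totals the first question asks about, though deciding what counts as one "work" (a page? a linked file?) would still need care.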
Second question (rate of change)

This could be tackled with either approach described above, since most API sources track when a work was added to the collection, and Common Crawl runs monthly. In fact, it's probably easier to get a satisfying answer (or set of answers) to this question than to the first, since you don't need to worry as much about missing data.

Third question

This depends on the metadata gathered (clearly). I'd suggest tailoring the details to that.

Fourth question

This is something of a combination of the previous questions. That said, if CC really wants to keep a handle on this over time, they should reduce the scope of the project to what can be robustly implemented in such a way that it's easy to maintain and keep using into the future. The first instinct of most data science students (or indeed most data scientists) given the above questions will be to put together something ad hoc that answers the questions once. That won't suffice in this case, and that will be quite a difficult aspect of the project. Even with all the work and development on CC Catalog, we still had parts of the data-gathering process that required regular manual intervention by an expert.
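As a small illustration of how simple the rate-of-change analysis becomes once monthly totals exist (from either API timestamps or successive crawls), here is a sketch; the function and the totals are hypothetical:

```python
def monthly_growth(counts: list[int]) -> list[float]:
    """Month-over-month growth rates from a series of monthly totals.

    counts[i] is the total number of works observed in month i;
    returns the fractional change between consecutive months.
    """
    return [
        (curr - prev) / prev
        for prev, curr in zip(counts, counts[1:])
    ]

# Hypothetical monthly totals (e.g. CC BY works seen in successive crawls)
totals = [1_000_000, 1_020_000, 1_050_600]
print(monthly_growth(totals))  # [0.02, 0.03]
```

The hard part is not this arithmetic but making the monthly totals comparable, since crawl coverage varies from run to run.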
@mathemancer Thank you! This is very helpful! I especially appreciate your insight about reducing scope and splitting the work between sub-teams.
The project repository for Quantifying the Commons is: creativecommons/quantifying |
Background:
Creative Commons has submitted a project to UMSI, which has determined that it is a potential fit for the course SI 485: Information Analysis Capstone and Final Project. In this course, advanced undergraduate students deliver data-oriented solutions through the development and analysis of data sets, building tools to extract useful information for clients through manipulation, analysis, and visualization. This ticket is intended for discussion of the project, with the goal of refining the potential questions we'd like answered and getting input from those who have considered this challenge in the past.
Project General Information
Project Idea:
Full Description
Project Outcome
What do students need for this project to be successful?
Examples: skills needed, social impact orientation, interest or experience in a specific field/domain/industry.
Data Proposal Information
Data Set
Size of Data Set
How big is the data set? Approximately how many rows and columns does it have?
Findings from Data Set
What do you want to learn from your data set? Please share 3-5 specific questions that the data can help solve:
Data Availability, Type, Format