Commit 8de77c2

closes creativecommons#159 Move information about old GSoC projects to CC Open Source
1 parent bab9b29 commit 8de77c2

6 files changed

+103
-0
lines changed

Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
1+
username: mmthanish
2+
---
3+
name: Ethan Lim
4+
---
5+
md5_hashed_email: 0eab64adad056cff2492e7c407a9aa2
6+
---
7+
about:
8+
9+
[THanish](https://github.com/mnmtanish) is the Open Source Community Coordinator at Creative Commons. He goes by `@dhruvkb` on the CC Community Slack workspace. Dhruv developed [CC Vocabulary](https://opensource.creativecommons.org/cc-vocabulary/) as part of [Google Summer of Code 2019](/gsoc-2019/) and now is a maintainer for the project.
+9
@@ -0,0 +1,9 @@
1+
username: mmthanish
2+
---
3+
name: THanish
4+
---
5+
md5_hashed_email: 0eab64adad056cff2492e7c407a9aa2
6+
---
7+
about:
8+
9+
[THanish](https://github.com/mnmtanish) is the Open Source Community Coordinator at Creative Commons. He goes by `@dhruvkb` on the CC Community Slack workspace. Dhruv developed [CC Vocabulary](https://opensource.creativecommons.org/cc-vocabulary/) as part of [Google Summer of Code 2019](/gsoc-2019/) and now is a maintainer for the project.
@@ -0,0 +1 @@
1+
name: gsoc-2013
@@ -0,0 +1,48 @@
1+
title: Creative Commons Media Fingerprinting Library
2+
---
3+
categories:
4+
gsoc-2013
5+
---
6+
author: Ethan-Lim
7+
---
8+
pub_date: 2013-07-15
9+
---
10+
body:
11+
12+
CC would prefer that all content on the Web include correct licensing metadata. Alas, that is not the case. So we're interested in code that will allow us to identify a given item across the Web, even if there's no metadata alongside (or within) it. The tricky part is: people often crop or resize images, clip videos, re-encode content, or quote only pieces of text. So a simple hash is not sufficient: we need more intelligent fuzzy matching. That's what this project is about.
13+
14+
### Expected Results
15+
A library that provides two methods:
16+
* Given a media file, output a fingerprint, and
17+
* Given a file and a fingerprint, return the likelihood of the file matching the original file.
18+
19+
You can focus your efforts on only one or two media types, or you can do more if it's possible.
20+
The library can be in a low-level language (C/C++) or you can use a higher-level language (JavaScript) if it's feasible. Speed is not a major concern at this point.
21+
Bonus: An additional API/method to detect content inside other files (e.g., a PowerPoint file that includes a CC-licensed image, or a still image inside a video).
22+
23+
### Notes/Resources
24+
The first task is to decide on a strategy to compare two items and decide how similar they are. Some choices are:
25+
* Hamming distance (bitwise; equivalent to Manhattan distance on binary vectors)
26+
* Euclidean distance (plane distance, also good in higher dimensions)
27+
* Set similarity (Jaccard index; MinHash)
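To make the first two options concrete, here is a minimal Python sketch. The function names and sample values are hypothetical and chosen only for illustration: Hamming distance assumes fingerprints stored as non-negative integers, and Euclidean distance assumes equal-length numeric feature vectors.

```python
import math


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two integer fingerprints differ."""
    return bin(a ^ b).count("1")


def euclidean_distance(u, v) -> float:
    """Straight-line distance between two equal-length feature vectors."""
    return math.dist(u, v)  # requires Python 3.8+


# Hypothetical 8-bit fingerprints that differ in two bit positions.
print(hamming_distance(0b10110100, 0b10010110))
# Hypothetical 2-D feature vectors.
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
```

Both measures compare fixed-size representations position by position, so neither copes well with a work that has been cropped or excerpted.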
28+
29+
For this project, set similarity seems like the best choice. It would potentially allow us to detect works remixed into other works, if some portion of them has remained intact in some way. The technique involves distilling a document into a set of things; the similarity of two documents is then the ratio of the things they have in common to all the things that appear in either document (the Jaccard index).
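As a small sketch of set similarity, the Jaccard index is the size of the intersection of two sets divided by the size of their union. The function name and the sample sets below are hypothetical, and treating two empty sets as identical is an assumption made here to avoid dividing by zero.

```python
def jaccard_index(a: set, b: set) -> float:
    """Return |a & b| / |a | b|, a similarity score between 0.0 and 1.0."""
    if not a and not b:
        return 1.0  # assumption: two empty documents count as identical
    return len(a & b) / len(a | b)


# Two hypothetical documents, each already distilled into a set of "things".
doc_a = {"creative", "commons", "license", "metadata"}
doc_b = {"creative", "commons", "fingerprint", "library"}
print(jaccard_index(doc_a, doc_b))
```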
30+
31+
A good way to start is with text, using a technique called shingling. For something like images, we'll need more work to determine which "interesting" features of the image to consider (to generate the set of things). This is called "keypoint extraction" and involves using standard algorithms to find vectors of floats that describe each keypoint. Since two keypoint vectors from similar images might be very close but not identical, some additional work in clustering and mapping to example keypoints is required.
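For text, shingling and MinHash might be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the function names, the word-level shingle size `k`, the number of hash functions, and the use of salted BLAKE2b from Python's standard library to stand in for a family of hash functions are all choices made for this example.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (overlapping word windows)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: set, num_hashes: int = 64) -> list:
    """For each salted hash function, keep the minimum hash over the set."""
    if not items:
        raise ValueError("cannot fingerprint an empty set")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions, which estimates the Jaccard index."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


original = "the quick brown fox jumps over the lazy dog"
quoted = "the quick brown fox jumps over a sleeping cat"
print(estimated_similarity(
    minhash_signature(shingles(original)),
    minhash_signature(shingles(quoted)),
))
```

In this sketch the signature plays the role of the fingerprint, and `estimated_similarity` plays the role of the match likelihood, although here it compares two fingerprints rather than a file and a fingerprint.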
32+
Some reading:
33+
* Chapters 1 and 3 of [Mining Massive Datasets](http://infolab.stanford.edu/~ullman/mmds.html)
34+
* [building shingles in text](https://lingpipe-blog.com/2011/01/12/scaling-jaccard-distance-deduplication-shingling-minhash-locality-sensitive-hashi/)
35+
* [Introduction to Information Retrieval](https://nlp.stanford.edu/IR-book/)
36+
* [OpenCV](https://opencv.org/) for extracting things (features) of images
37+
* BRISK / FREAK: algorithms for "keypoint extraction", for images
38+
* [pHash.org](http://www.phash.org/) might be something we can use.
39+
40+
### Knowledge Prerequisite
41+
`Media formats/encodings`, `JavaScript`, `C/C++`
42+
43+
The CC Media Fingerprinting Library is only possible due to the support and guidance of my mentor [Dan Mills](#) and other CC tech staff members, who have been very supportive at every step of the project. I would also like to thank engineering director [Kriti Godey](https://creativecommons.org/author/kgodey) for her continuous support.
44+
45+
The project is approaching its completion. Can't wait to see it in production.
46+
47+
*Signing off
48+
Ethan Lim*
@@ -0,0 +1,35 @@
1+
title: Creative Commons Pasteboard
2+
---
3+
categories:
4+
gsoc-2013
5+
---
6+
author: THanish
7+
---
8+
pub_date: 2013-07-15
9+
---
10+
body:
11+
12+
The Creative Commons Pasteboard is a browser extension that helps users easily clip content from different sources around the web. It also helps users by automatically adding attribution information when they paste their clips into documents.
13+
### Status
14+
Currently an early experiment.
15+
16+
Pasteboard is not ready for most people to try out. However, there are still many ways to participate:
17+
18+
* If you're a teacher, we would welcome your input and would like to learn about how Pasteboard might (or might not) fit your workflow.
19+
* If you're a developer, check out our code, implement a new feature, or let us know how you might do things differently (by submitting a patch!).
20+
* If you're a designer, check out our designs and let us know how you might improve them.
21+
No matter who you are, though, feel free to learn more about the work we're doing and don't hesitate to reach out or join our discussions if you're interested. We're looking forward to hearing from you!
22+
23+
24+
### Solution - CC Pasteboard
25+
[CC Pasteboard](https://github.com/mnmtanish/pasteboard) aims to solve the problem with a tool that allows teachers to clip portions of webpages and reassemble them into documents (currently using Google Docs), while keeping track of the sources, both for use as references and for later editing (e.g. next year, or by another teacher). This storyboard has a good overview of how we envision the tool being used.
26+
27+
28+
CC Pasteboard is only possible due to the support and guidance of CC tech staff, who have been very supportive at every step of the project. I would also like to thank engineering director [Kriti Godey](https://creativecommons.org/author/kgodey) for her continuous support.
29+
30+
You can follow the project on GitHub: [mnmtanish/pasteboard](https://github.com/mnmtanish/pasteboard). You can also join the discussion on `#cc` on [IRC](https://freenode.net/).
31+
32+
The project is approaching its completion. Can't wait to see it in production.
33+
34+
*Signing off
35+
THanish*

content/internships/history/contents.lr

+1
@@ -35,6 +35,7 @@ CC has participated in Google Summer of Code seven times up to 2013, and twice s
35 35
* [Introducing the Linked Commons](https://creativecommons.org/2020/01/23/introducing-the-linked-commons/)
36 36
* [Here’s a Sneak Peek at the Updated Creative Commons Chooser](https://creativecommons.org/2020/01/27/the-new-cc-license-chooser/)
37 37
* [Google Summer of Code 2013](https://www.google-melange.com/archive/gsoc/2013/orgs/cc)
38+
* [Open Source Blog posts](/blog/categories/gsoc-2013/)
38 39
* [Google Summer of Code 2012](https://www.google-melange.com/archive/gsoc/2012/orgs/cc)
39 40
* [Google Summer of Code 2010](https://www.google-melange.com/archive/gsoc/2010/orgs/creativecommons)
40 41
* [Google Summer of Code 2009](https://www.google-melange.com/archive/gsoc/2009/orgs/cc)
