Introducing the Linked Commons

Maria Belen Guaranda

This post is part of a series introducing the projects built by open source contributors mentored by Creative Commons during Google Summer of Code (GSoC) 2019. Maria Belen Guaranda was one of those contributors and we are grateful for her work on this project.

“By visualizing information, we turn it into a landscape that you can explore with your eyes.” David McCandless

Linked Commons (Feature)
Force-directed graph, “The Linked Commons”, uses one month of data.

The landscape of openly licensed content is wide and varied. Millions of web pages host and share CC-licensed works; in fact, we estimate that there are over 1.6 billion across the web! With this growth of CC-licensed works, Creative Commons (CC) is increasingly interested in learning how hosts and users of CC-licensed materials are connected, as well as the types of content published under a CC license and how this content is shared. Each month, CC uses Common Crawl data to find all domains that contain CC-licensed content. This dataset contains information about the URL of the websites and the licenses used.

Using the Linked Commons

In order to draw conclusions and insights from this dataset, we created the Linked Commons: a visualization that shows how the Commons is digitally connected.

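In the Linked Commons, a node (a unit in a data structure) represents the website of an organization, person, academic institution, etc. A link between two nodes exists if one website hosts CC-licensed content that belongs to, or is hosted by, the other website (represented by a URL link). A community represents a group of websites that are closely related to each other because they produce and/or share CC-licensed content between them.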
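To make the node-and-link model concrete, here is a minimal sketch in Python. It is not the project's actual code: the function names `build_graph` and `neighbors`, and the assumption that the input arrives as simple (source domain, target domain) pairs, are illustrative only.

```python
from collections import defaultdict


def build_graph(link_records):
    """Build a directed adjacency map from (source_domain, target_domain) pairs.

    Hypothetical input format: each pair means the source website links to
    CC-licensed content on the target website.
    """
    graph = defaultdict(set)
    for source, target in link_records:
        if source == target:
            continue  # ignore links from a domain to itself
        graph[source].add(target)
        graph.setdefault(target, set())  # link targets are nodes too
    return dict(graph)


def neighbors(graph, domain):
    """Return every domain that `domain` links to or is linked from."""
    outgoing = graph.get(domain, set())
    incoming = {src for src, targets in graph.items() if domain in targets}
    return outgoing | incoming
```

A lookup like `neighbors(graph, "svgsilh.com")` is the kind of query a "highlight neighbors on hover" interaction would need, under these assumptions.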

Vast quantities of data make any web browser render elements slowly and may eventually cause it to freeze. Because the Linked Commons included 100k nodes, the visualization initially took a long time to render and had a clustered appearance, which was a major concern.

That’s why we decided to use data from only a single month and chose the top 500 websites containing links to CC-licensed material, as well as all of the other domains those 500 nodes are connected to. In addition to lessening the loading time, we found that this was also more user-friendly because navigating the entire dataset’s graph would be dizzying. Even with this smaller dataset, we were able to gather valuable insights from the graph, including discovering subcommunities of CC license hosts and users. One such subcommunity is shown in the image below.

Linked Commons
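Education community, including libraries and universities.

The subcommunity above is an “education” community; it consists of libraries, universities, and schools.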
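The reduction described earlier (keeping the top 500 websites plus every domain they are connected to) can be sketched roughly as follows. This is an illustration, not the project's implementation: the post does not say how the "top" websites were ranked, so the sketch assumes ranking by number of outgoing links, and the function name `top_sites_subgraph` is hypothetical.

```python
def top_sites_subgraph(graph, top_n=500):
    """Keep the `top_n` domains with the most outgoing links, plus their neighbors.

    `graph` maps each domain to the set of domains it links to.
    Ranking by outgoing link count is an assumption made for this sketch.
    """
    out_degree = {domain: len(targets) for domain, targets in graph.items()}
    # Sort by descending out-degree; break ties alphabetically for determinism.
    top = sorted(out_degree, key=lambda d: (-out_degree[d], d))[:top_n]
    top_set = set(top)

    keep = set(top_set)
    for domain in top:
        keep |= graph[domain]  # domains the top sites link to
    for source, targets in graph.items():
        if targets & top_set:
            keep.add(source)  # domains that link to a top site

    # Return the subgraph induced on the kept domains.
    return {domain: graph[domain] & keep for domain in keep}
```

Shrinking the graph this way trades completeness for rendering speed: domains that are not directly connected to a top site disappear from the view.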

Visualizations like these are valuable for CC because they can help guide our outreach efforts and targeted communications. The CC Search team can also use this data to choose which domains to prioritize indexing in the CC Catalog.

The visualization is interactive; users can pan, zoom in and out, hover over a node to see its neighbors, and click on a node to display a pie chart, like the one below. We encourage users to test out the Linked Commons and see what insights they can gather from this information!

Linked Commons (2)
Pie chart of ask.openstack.org.
Linked Commons (3)
Force-directed graph, “The Linked Commons”. Neighbors of domain svgsilh highlighted.

What’s next?

We plan to continue working on the Linked Commons. Here are some features we hope to add:

  • Live updates: The graph is currently static because it uses a single month’s data file that has already been processed. We would like to automatically update the graph as soon as new data is processed.
  • Filtering domains by country: Some domains have suffixes that represent countries, such as domain.au, which corresponds to a domain from Australia. We plan to use these suffixes to filter nodes in the visualization by country.
  • Filtering domains by name: Users might want to check whether a specific domain has CC-licensed content and how that content is used. We plan to add a search bar and give users the ability to search for a specific node by a given domain name and/or URL.
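The two planned filters could look something like the sketch below. These features are described only as plans, so nothing here reflects actual project code; the function names are hypothetical, and the country check simply assumes that a two-letter final label is a country code. That assumption is imperfect, since some two-letter suffixes (for example .io or .tv) are widely used outside their country.

```python
from urllib.parse import urlparse


def country_suffix(domain):
    """Return the final label if it looks like a two-letter country code, else None."""
    label = domain.lower().rstrip(".").rsplit(".", 1)[-1]
    if len(label) == 2 and label.isascii() and label.isalpha():
        return label
    return None


def filter_by_country(domains, code):
    """Keep only the domains whose suffix matches the given country code."""
    wanted = code.lower()
    return [d for d in domains if country_suffix(d) == wanted]


def search_domains(domains, query):
    """Find domains matching a search string, which may be a name or a full URL."""
    q = query.strip().lower()
    if "://" in q:
        # For a full URL, search by its host name only.
        q = urlparse(q).hostname or q
    return sorted(d for d in domains if q in d.lower())
```

For example, `filter_by_country(domains, "au")` would keep domain.au, and `search_domains(domains, "https://ask.openstack.org/questions")` would match the ask.openstack.org node if it is present.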

Interested? Check out the Linked Commons here!

Give us your feedback!

The Linked Commons is an open source project. The project’s source code is available in the GitHub repo. Contributions are welcome! For the technical details of how this project was developed, please read this series of posts on the CC Open Source blog.