Combating dead links with CC Link Checker (2024)

CC Open Source Blog

Combating dead links with CC Link Checker (1)

by Bhumij Gupta on 2019-07-15

Creative Commons provides a number of public copyright licenses for people who want to enable free distribution of their work. Creative Commons licenses currently cover over 1.6 billion resources. The license files are translated into multiple languages and ported to different jurisdictions for international use. People link to the respective licenses alongside their licensed works. These license files are HTML files stored in the creativecommons/creativecommons.org repo.

Problem Statement

These license files contain links to their deeds, to translations of the license in other languages, internal links, and more. Sometimes, due to errors, these files contain dead or broken links. Broken links can leave the viewer with an incorrect or incomplete understanding of the license clauses and permissions, which may in turn lead to incorrect use of the licensed resource.

At the time of writing, the repo contains over 930 files with an average of 50 links per file. New translations of license deeds are regularly added to the repo, and the existing license deeds are updated frequently. Manually testing these files would take a lot of time, and with future additions of licenses, translations and jurisdiction ports, the time required would increase drastically.

Solution - CC Link Checker

CC Link Checker aims to solve this problem by automating the task of checking links in the licenses and reporting erroneous and broken links. The Python script scrapes all the licenses from the repo and checks the status of the links in the files. It checks each link for 40X errors, timeout errors and invalid schema errors.
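
The three error classes can be illustrated with a minimal sketch. It uses only the Python standard library rather than the requests library the script is built on, and the names check_link and fetch_status, the 10-second timeout and the result labels are assumptions made for illustration, not the script's actual code.

```python
import socket
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for url (hypothetical helper)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # urlopen raises on 4xx/5xx responses; the status code is still useful.
        return exc.code


def check_link(url, fetch=fetch_status):
    """Classify a link: 'ok', 'invalid schema', 'timeout', 'unreachable' or 'HTTP 4xx'."""
    # Anything that is not http(s) is reported as an invalid schema.
    if urlsplit(url).scheme not in ("http", "https"):
        return "invalid schema"
    try:
        status = fetch(url)
    except (socket.timeout, TimeoutError):
        return "timeout"
    except urllib.error.URLError as exc:
        # Connection-phase timeouts arrive wrapped in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "unreachable"
    # Only 4xx codes are flagged here, matching the 40X errors named above;
    # every other status is treated as 'ok' in this sketch.
    if 400 <= status < 500:
        return f"HTTP {status}"
    return "ok"
```

The fetch parameter exists so the classification logic can be exercised without network access, for example check_link("https://example.org/", fetch=lambda url: 404).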

First, let's get the features out of the way. The script uses multiprocessing to make full use of multi-core processors, has two modes of CLI output (default and verbose), and can also write to a file the error links, a summary of the result, and a mapping of each error link to the URLs of the pages on which it occurs.
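
As a rough sketch, the two output modes and the error file could be exposed on the command line with argparse as follows. The flag names --verbose and --output-errors are the ones mentioned in the Week 2 notes of this post; the FILE argument, the help strings and the build_parser name are assumptions.

```python
import argparse


def build_parser():
    """Build a command-line parser with the two flags described in the post."""
    parser = argparse.ArgumentParser(
        description="Check the links in license files (illustrative sketch)."
    )
    # Off by default, so the default output mode stays terse.
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="print detailed progress for every file and link",
    )
    # Assumption: the flag takes the name of the file to write to.
    parser.add_argument(
        "--output-errors",
        metavar="FILE",
        help="write the error links and a summary to FILE",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("verbose:", args.verbose, "| error file:", args.output_errors)
```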

Now let's hop on the nerd train and take a deeper look at the development journey.

Development Journey

I started working on the project a month ago, on 13 June. During this journey there were many ups and downs, with some productive and some totally unproductive days. For better understanding, let's look at the journey week by week.

  • Week 1 The approach for the first week was to get the basic functionality done. The script was able to scrape and parse all licenses and links from the GitHub repo using requests and BeautifulSoup. I implemented basic memoization to prevent repeated fetching of already fetched links. This step decreased the execution time of the script severalfold. In parallel with the development of the script, I also had to set up CircleCI. Since it was my first time using a CI service, it took me quite some time to wrap my head around it.

    Things I learned: CircleCI, pipenv, different docstring formats, different code styles and PEP 8 recommendations (black and flake8)

  • Week 2 With the basic functionality completed, it was time to implement some advanced features. This week I worked on implementing the --verbose flag, a.k.a. the verbose mode of CLI output. The verbose mode helps in debugging and gives a deeper look into how the Python script works. I also worked on the --output-errors flag, which prints the error links and a summary to a file.

    Everything was working fine and ahead of schedule until a major bug was detected: incorrect conversion of relative links to absolute links, which resulted in several false positives and false negatives. Fixing this required deducing, from the license file name, the URL at which the license file would be displayed. This step took a lot of time and pushed the project behind schedule.

    Things I learned: Argument parser and help text

  • Week 3 The output file contained only the file name and the erroneous links encountered in the file. At the suggestion of Kriti Godey, I added a summary section to the file with information such as the total number of error links and the number of unique links, along with each erroneous link and the page URLs on which it is present (since many pages contain the same dead link). The next step was optimizing the time taken to run the script.

    Before any optimisation the script took 4+ minutes to run on cloud servers. To reduce this, I started implementing multithreading. This was a big task, as it required major refactoring of the code to make it thread-safe and compatible with concurrency. After successfully refactoring and implementing multithreading, I was able to bring the execution time down to around 3:20.

    As pointed out by my mentor, this came with its own set of problems. Due to Python's Global Interpreter Lock (GIL), no two threads can execute Python code in parallel inside the Python interpreter, and the use of globals and locks made the code more complex. Also, the performance increase was not significant. This was a setback, as my week-long efforts had been in vain.

    Things I learned: Multithreading, locks

  • Week 4 To remedy the situation, my mentors suggested using multiprocessing instead. I had no prior experience with multiprocessing, but thanks to Python's beautiful documentation and examples, I was quickly able to get the hang of it. I finished implementing multiprocessing and, to my surprise, there was a 49.5% performance increase. The script now took only 2:27 to complete.

    The major functionality of the script was complete, and the optimisations were done. To improve the code quality and documentation, it was time to write unit tests for the script. To write the tests I used the pytest framework, which provides several benefits and a higher level of abstraction over the built-in unittest module.

    Things I learned: Multiprocessing, pytest
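
The memoization from Week 1 can be sketched as a small wrapper that remembers the result for every URL it has already seen. This is an illustration of the general technique, not the script's code; make_memoized_checker and the check callback are hypothetical names.

```python
def make_memoized_checker(check):
    """Wrap check(url) so that each distinct URL is checked only once."""
    cache = {}

    def memoized(url):
        # Only call the (slow) checker the first time a URL is seen.
        if url not in cache:
            cache[url] = check(url)
        return cache[url]

    return memoized


if __name__ == "__main__":
    fetched = []

    def fake_check(url):
        # Stand-in for a real network request; records every actual call.
        fetched.append(url)
        return "ok"

    checker = make_memoized_checker(fake_check)
    for link in ["https://example.org/a", "https://example.org/a", "https://example.org/b"]:
        checker(link)
    print(len(fetched), "real checks for 3 lookups")
```

Because many license pages share the same links, a cache like this avoids most repeated requests.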
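
The relative-link bug from Week 2 comes down to which base URL a relative link is resolved against. The sketch below uses urljoin from the standard library; the page URL is only an example of where a license might be displayed, the example.org file URL is hypothetical, and how the script actually derives the page URL from the file name is not shown in this post.

```python
from urllib.parse import urljoin


def to_absolute(page_url, href):
    """Resolve href against the URL of the page on which it appears."""
    return urljoin(page_url, href)


# Example URL at which a license page might be displayed (assumption).
page = "https://creativecommons.org/licenses/by/4.0/legalcode"

print(to_absolute(page, "deed.en"))
# -> https://creativecommons.org/licenses/by/4.0/deed.en
print(to_absolute(page, "../3.0/legalcode"))
# -> https://creativecommons.org/licenses/by/3.0/legalcode
print(to_absolute(page, "/licenses/"))
# -> https://creativecommons.org/licenses/

# Resolving against the wrong base (here a hypothetical file location)
# produces a URL that does not exist, i.e. a false positive.
print(to_absolute("https://example.org/repo/legalcode/by_4.0.html", "deed.en"))
# -> https://example.org/repo/legalcode/deed.en
```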
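
The summary section from Week 3 is essentially a mapping from each error link to the pages that contain it, plus a few counters. A minimal sketch, assuming the check results are available as (page URL, link, is broken) tuples; the summarize function and the key names are illustrative and not taken from the script.

```python
from collections import defaultdict


def summarize(results):
    """Summarize an iterable of (page_url, link, is_broken) tuples."""
    pages_by_error_link = defaultdict(set)
    unique_links = set()
    total_error_links = 0
    for page_url, link, is_broken in results:
        unique_links.add(link)
        if is_broken:
            total_error_links += 1
            # The same dead link often appears on many pages.
            pages_by_error_link[link].add(page_url)
    return {
        "total_error_links": total_error_links,
        "unique_links": len(unique_links),
        "unique_error_links": len(pages_by_error_link),
        "pages_by_error_link": {
            link: sorted(pages) for link, pages in pages_by_error_link.items()
        },
    }


if __name__ == "__main__":
    sample = [
        ("page-a", "https://example.org/dead", True),
        ("page-b", "https://example.org/dead", True),
        ("page-a", "https://example.org/fine", False),
    ]
    print(summarize(sample))
```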
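
The switch to multiprocessing in Week 4 can be sketched with multiprocessing.Pool, which spreads the work over separate processes, each with its own interpreter and therefore its own GIL. The worker below only inspects the URL scheme so that the example needs no network access; has_valid_scheme and check_all are hypothetical names, and the real script's worker and pool settings may differ.

```python
from multiprocessing import Pool
from urllib.parse import urlsplit


def has_valid_scheme(url):
    """Stand-in for the real per-link check; returns (url, passed)."""
    return url, urlsplit(url).scheme in ("http", "https")


def check_all(urls, processes=None):
    """Check urls in worker processes; results keep the input order."""
    # processes=None lets Pool use one worker per available CPU core.
    with Pool(processes=processes) as pool:
        return pool.map(has_valid_scheme, urls)
```

Call check_all(links, processes=2) from under an if __name__ == "__main__": guard, which multiprocessing needs on platforms that start workers by re-importing the main module. The worker must be a module-level function so it can be pickled and sent to the worker processes.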

Future work

  • Optimising the script for further performance gains
  • Making the script more CI-friendly

CC Link Checker is only possible thanks to the support and guidance of my mentors Alden Page and Timid Robot Zehta, who have been very supportive at every step of the project. I would also like to thank engineering director Kriti Godey for her continuous support.

You can follow the project on GitHub: creativecommons/cc-link-checker. You can also join the discussion in #cc-link-checker on Slack.

The project is approaching completion. Can't wait to see it in production.

Signing off
Bhumij Gupta
