Combating dead links with CC Link Checker (2024)

CC Open Source Blog

Combating dead links with CC Link Checker (1)

by Bhumij Gupta on 2019-07-15

Creative Commons provides a vast number of public copyright licenses for people who want to enable free distribution of their work. Creative Commons licenses currently cover over 1.6 billion resources. These license files are translated into multiple languages and ported to different jurisdictions for international use. People link to the respective licenses alongside their licensed works. The license files are HTML files, stored in the creativecommons/creativecommons.org repo.

Problem Statement

These license files contain links to their deeds, translations of the license into other languages, internal links, and many more. Sometimes, due to errors, these files may contain dead or broken links. Broken links lead to an incorrect or incomplete understanding of the license clauses and permissions by the viewer, which may sometimes lead to incorrect usage of the licensed resource.

At the time of writing, the repo contains over 930 files with an average of 50 links per file. New translations of license deeds are regularly added to the repo, and the existing license deeds are also updated frequently. Manually testing these files would take a lot of time. With future additions of licenses, translations and jurisdiction ports, the time required for manual testing would increase drastically.

Solution - CC Link Checker

CC Link Checker aims to solve the problem by automating the task of checking links in the licenses and reporting erroneous and broken links. The Python script scrapes all the licenses from the repo and checks the status of the links in the files. The script checks each link for 40X errors, timeout errors and invalid schema errors.
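The three failure classes named above can be sketched as follows. This is a minimal illustration, not the project's actual code: `classify_link` and the `fetch` callable are hypothetical names, and the sketch uses only the standard library so that it runs without network access. In a real checker, `fetch` would wrap an HTTP request.

```python
from urllib.parse import urlparse


def classify_link(url, fetch):
    """Label a link as "ok", "invalid schema", "timeout" or "error <code>".

    `fetch` is a hypothetical callable that takes a URL and returns an
    HTTP status code, raising TimeoutError if the server is too slow.
    """
    # Anything that is not http(s), such as a mistyped "htp://", cannot be requested.
    if urlparse(url).scheme not in ("http", "https"):
        return "invalid schema"
    try:
        status = fetch(url)
    except TimeoutError:
        return "timeout"
    # 40X responses (404 Not Found, 410 Gone, ...) mark a dead link.
    if 400 <= status < 500:
        return f"error {status}"
    return "ok"
```

Passing `fetch` in as a parameter keeps the classification logic testable with a stub instead of a live server.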

Firstly, let's get the features out of the way. The script uses multiprocessing to make full use of multi-core processors, has two modes of output to the CLI (default and verbose), and can also write to a file the error links, a summary of the result, and a mapping of error links to the URLs on which they occur.
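To make the "mapping of error links to the URLs on which they occur" concrete, here is a small sketch of how such a summary could be built. The function name `summarise`, the tuple layout of `results` and the keys of the returned dictionary are all assumptions made for illustration; the real script may organise this differently.

```python
from collections import defaultdict


def summarise(results):
    """Group broken links by the pages they appear on.

    `results` is a hypothetical list of (page_url, link, status) tuples,
    where status is "ok" or an error label.
    """
    pages_by_link = defaultdict(set)
    for page_url, link, status in results:
        if status != "ok":
            pages_by_link[link].add(page_url)
    return {
        # Each (page, link) pair is counted once.
        "total_errors": sum(len(pages) for pages in pages_by_link.values()),
        "unique_errors": len(pages_by_link),
        "pages_by_link": {
            link: sorted(pages) for link, pages in pages_by_link.items()
        },
    }
```

Grouping by link rather than by page means a dead link that appears on many pages is reported once, together with every page that needs fixing.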

Now let's hop on the nerd train and take a deeper look at the development journey.

Development Journey

I started working on the project a month back, on 13 June. During this journey there were many ups and downs, with some productive days and some totally unproductive ones. For better understanding, let's look at the journey week by week.

  • Week 1 The approach for the first week was to get the basic functionality done. The script was able to scrape and parse all licenses and links from the GitHub repo using requests and BeautifulSoup. I implemented basic memoization to avoid re-fetching links that had already been fetched. This step decreased the execution time of the script severalfold. In parallel with the development of the script, I also had to set up CircleCI. Since it was my first time using a CI service, it took me quite some time to wrap my head around CircleCI.

    Things I learned: CircleCI, pipenv, different docstring formats, different code styles and PEP 8 recommendations (black and flake8)

  • Week 2 With the basic functionality completed, it was time to implement some advanced features. This week I worked on implementing the --verbose flag, a.k.a. the verbose mode of output to the CLI. The verbose mode would help in debugging and give a deeper look into how the Python script was working. I also worked on the --output-errors flag, which would print the error links and a summary to a file.

    Everything was working fine and ahead of schedule until a major bug was detected: incorrect conversion of relative links to absolute links, which resulted in several false positives and false negatives. Fixing this required deducing the pattern that maps a license file name to the URL at which that license is displayed. This step took a lot of time and pushed the project behind schedule.

    Things I learned: Argument parser and help text

  • Week 3 The output file contained only the file name and the erroneous links encountered in that file. At the suggestion of Kriti Godey, I worked on adding a summary section to the file, which would include information like the total number of error links and the number of unique links, along with each erroneous link and the page URLs on which it is present (since many pages contain the same dead link). The next step was optimizing the time taken to run the script.

    Before any optimisation, the script took 4+ minutes to run on cloud servers. To reduce this, I started working on implementing multithreading. This was a big task, as it required major refactoring of the code to make it thread safe and compatible with concurrency. After successfully refactoring and implementing multithreading, I was able to bring the execution time down to around 3:20.

    As pointed out by my mentor, this came with its own set of problems. Due to Python's Global Interpreter Lock (GIL), no two threads can execute Python code in parallel inside the same interpreter, and the use of globals and locks made the code more complex. Also, the performance increase was not significant. This was a setback, as my week-long efforts had gone in vain.

    Things I learned: Multithreading, locks

  • Week 4 To remedy the situation, my mentors suggested using multiprocessing instead. I had no prior experience with multiprocessing, but thanks to Python's beautiful documentation and examples, I was quickly able to get the hang of it. I finished implementing multiprocessing and, to my surprise, there was a 49.5% performance increase. The script now took only 2:27 to complete.

    The major functionality of the script was complete, and the optimisations were done. To improve the code quality and the documentation, it was time to write unit tests for the script. To write the tests I used the pytest framework, which provides several benefits and a higher level of abstraction over the built-in unittest module.

    Things I learned: Multiprocessing, pytest
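The switch from threads to processes described in Weeks 3 and 4 can be illustrated with the standard library's `multiprocessing.Pool`. Each worker process has its own interpreter and its own GIL, so the per-file work can run on several cores at once. This is only a sketch under assumptions: `count_links` is a hypothetical stand-in for the real per-file work (it just counts `href=` occurrences), and the `(name, html)` work items are invented for the example.

```python
from multiprocessing import Pool


def count_links(item):
    """Hypothetical stand-in for the per-file work.

    `item` is a (file_name, html_text) pair; a real checker would parse
    the HTML and check every link instead of just counting them.
    """
    file_name, html_text = item
    return file_name, html_text.count("href=")


def check_all(items, processes=None):
    """Spread the per-file work over worker processes.

    processes=None lets Pool start one worker per CPU core.
    """
    with Pool(processes=processes) as pool:
        # map() preserves input order, so results line up with `items`.
        return pool.map(count_links, items)
```

When run as a script, `check_all(...)` should be called under an `if __name__ == "__main__":` guard, because on platforms that start workers by re-importing the main module an unguarded call would try to create pools recursively. The worker must also be a top-level function so that it can be pickled and sent to the worker processes.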

Future work

  • Optimising the script for further performance gains
  • Making the script more CI-friendly

CC Link Checker is only possible due to the support and guidance of my mentors Alden Page and Timid Robot Zehta, who have been very supportive at every step of the project. I would also like to thank engineering director Kriti Godey for her continuous support.

You can follow the project on GitHub: creativecommons/cc-link-checker. You can also join the discussion in #cc-link-checker on Slack.

The project is approaching completion. I can't wait to see it in production.

Signing off,
Bhumij Gupta
