[tor-dev] CAPTCHA Monitoring Project Updates & Findings

Hi everyone,

I have made progress on the Cloudflare CAPTCHA Monitoring project since my last email, and I wanted to share some updates and findings. This year, the Tor Project is participating in GSoC under the DIAL umbrella, and I have been posting weekly updates to the DIAL blog [1]. I have started mirroring these updates [2] to my project's wiki page, and I will be posting more frequent updates here.

a) Updates:
Firstly, I moved the wiki page that explains the project, the code base, and the issue tracker to Tor Project's GitLab. They are all in the same GitLab project. You can find detailed information about the project on the wiki page [3] and leave comments & suggestions within that repository by creating issues.

Secondly, I got a fully functioning system up and running. The system fetches various URLs with Tor Browser and with Firefox over Tor, and checks for CAPTCHAs. It also checks whether any third-party code was injected by comparing the hash of the received page with an expected hash value. It repeats these experiments using different exit relays and records the results. You can view the results on the dashboard [4] I created. I'm looking for more URLs to track for CAPTCHAs. Feel free to share the websites that you visit frequently and that show you CAPTCHAs, so that I can track them with this tool as well. I want to experiment with all types of CAPTCHAs, so these URLs don't have to be fronted by Cloudflare.
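
As a rough illustration of the check described above, here is a minimal sketch of classifying one fetched page. It is not the project's actual code: the function names, the CAPTCHA marker strings, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib

# Hypothetical marker strings; real CAPTCHA pages may need different ones.
CAPTCHA_MARKERS = ("captcha", "challenge-form")


def sha256_hex(body: bytes) -> str:
    """Return the SHA-256 digest of a page body as a hex string."""
    return hashlib.sha256(body).hexdigest()


def classify_page(body: bytes, expected_hash: str) -> dict:
    """Report whether a page looks like a CAPTCHA and whether its hash matches."""
    text = body.decode("utf-8", errors="replace").lower()
    return {
        "captcha": any(marker in text for marker in CAPTCHA_MARKERS),
        "hash_matches": sha256_hex(body) == expected_hash,
    }


if __name__ == "__main__":
    clean_page = b"<html><body>Hello</body></html>"
    expected = sha256_hex(clean_page)  # recorded once from a trusted fetch
    print(classify_page(clean_page, expected))
    print(classify_page(clean_page + b"<script>x</script>", expected))
```

A plain hash comparison like this only suits pages whose content is static; any dynamic content would change the hash even without injection.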

b) Findings:
So far, I have observed that using the Tor Browser Bundle out of the box, without changing its configuration, doesn't lead to a high CAPTCHA rate on Cloudflare-fronted websites (assuming the website owners don't explicitly block exit relays [5]). That said, changing the user-agent, or making any other modification that makes your browser's fingerprint deviate from that of a typical Tor Browser user, significantly increases the chance of getting CAPTCHAs. For example, using regular Firefox over Tor resulted in CAPTCHAs in ~90% of the measurements. I believe Cloudflare is very aggressive against "Firefox over Tor" users because many people, unfortunately, use Chromium/Firefox + Selenium + Tor to scrape web pages and bypass IP-based rate limits. That's why I'm interested in hearing about your specific browser/Tor configurations, so that I can test them with the CAPTCHA Monitor. Not everyone is affected in the same way because we all use Tor differently, but by experimenting we can understand which differences affect the CAPTCHA rate more than others.
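
To make the per-configuration comparison concrete, here is a minimal sketch of computing a CAPTCHA rate for each configuration from recorded measurements. The record layout (a configuration label plus a boolean) and the function name are assumptions for illustration, and the sample records are placeholder values, not real results.

```python
from collections import defaultdict


def captcha_rate_by_config(measurements):
    """Return {config: fraction of measurements that hit a CAPTCHA}.

    `measurements` is an iterable of (config_label, got_captcha) pairs.
    """
    totals = defaultdict(int)
    captchas = defaultdict(int)
    for config, got_captcha in measurements:
        totals[config] += 1
        captchas[config] += int(got_captcha)
    return {config: captchas[config] / totals[config] for config in totals}


if __name__ == "__main__":
    # Placeholder records for demonstration only.
    sample = [
        ("tor_browser", False),
        ("tor_browser", False),
        ("firefox_over_tor", True),
        ("firefox_over_tor", False),
    ]
    for config, rate in captcha_rate_by_config(sample).items():
        print(f"{config}: {rate:.0%}")
```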

Additionally, I observed that the TLS fingerprint plays a significant role in whether someone gets a CAPTCHA. As part of the project, I decided to capture the HTTP headers during measurements to understand how they affect the CAPTCHA rates. Initially, I used a Python library called seleniumwire to capture the HTTP headers by intercepting the traffic between Tor Browser and Tor. With this setup, I got a very high CAPTCHA rate, around 98% of the time. seleniumwire forwards the traffic transparently, but it has a different TLS fingerprint than Tor Browser. I figured out that the difference in TLS fingerprints was triggering MITM detection on the Cloudflare side, which resulted in the very high CAPTCHA rates.
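
To show why an intercepting proxy can look different even when the HTTP headers are identical, here is a minimal sketch of a JA3-style TLS fingerprint. JA3 is one publicly documented scheme, used here only as an illustration; it is an assumption that anything similar is used on the Cloudflare side. The ClientHello values below are made up for the example and do not describe Tor Browser or seleniumwire.

```python
import hashlib


def ja3_style_hash(version, ciphers, extensions, curves, point_formats):
    """Hash ClientHello parameters the way JA3 does.

    The fields are joined as "version,ciphers,extensions,curves,formats",
    with the values inside each list joined by "-", and then MD5-hashed.
    """
    groups = (ciphers, extensions, curves, point_formats)
    fields = [str(version)] + ["-".join(str(v) for v in group) for group in groups]
    return hashlib.md5(",".join(fields).encode("ascii")).hexdigest()


if __name__ == "__main__":
    # Illustrative values only: same cipher suites, offered in a different order.
    browser = ja3_style_hash(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
    proxy = ja3_style_hash(771, [4867, 4866, 4865], [0, 23, 65281], [29, 23, 24], [0])
    print("fingerprints match:", browser == proxy)
```

Because the hash covers the order of the offered parameters, a proxy that opens its own TLS connection with a different TLS library will usually produce a different value than the browser behind it.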

Interestingly, when I tried the exact same Tor Browser & seleniumwire setup without Tor, I got practically no CAPTCHAs. I believe the MITM detection is more aggressive when the traffic comes through an exit relay. So, I stopped using seleniumwire to capture headers because it didn't reflect what a real human Tor Browser user usually experiences. Please feel free to use the sample code [6] that I used to combine seleniumwire and Tor if you are interested in experimenting further.

c) Next:
I will work on collecting more metrics by testing more configurations and websites. I will create a "Relay Search" section on the dashboard, where CAPTCHA statistics for the relays (exit relays for now) will be available. I will also work on using the collected data to predict the probability of getting CAPTCHAs with a given exit relay and configuration/setup.
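
One simple baseline for such a prediction is a smoothed frequency estimate per (exit relay, configuration) pair, sketched below. This is only an illustration and not necessarily the approach the project will take; the record layout, the relay and configuration labels, and the smoothing parameter `alpha` are all assumptions.

```python
from collections import Counter


def estimate_captcha_probability(measurements, alpha=1.0):
    """Build a predictor from (relay, config, got_captcha) records.

    Uses additive smoothing, so pairs with no data fall back to 0.5
    instead of raising a division-by-zero error.
    """
    totals = Counter()
    captchas = Counter()
    for relay, config, got_captcha in measurements:
        totals[(relay, config)] += 1
        if got_captcha:
            captchas[(relay, config)] += 1

    def predict(relay, config):
        key = (relay, config)
        return (captchas[key] + alpha) / (totals[key] + 2 * alpha)

    return predict


if __name__ == "__main__":
    # Hypothetical relay names and placeholder records.
    records = [
        ("relayA", "tor_browser", True),
        ("relayA", "tor_browser", False),
        ("relayB", "firefox_over_tor", True),
    ]
    predict = estimate_captcha_probability(records)
    print(predict("relayA", "tor_browser"))
    print(predict("relayC", "tor_browser"))  # unseen pair
```

The smoothing keeps estimates for rarely measured relays away from 0 or 1 until more measurements are available.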

Best,
Barkin

[1] https://hub.osc.dial.community/t/tor-project-cloudflare-captcha-monitoring/1558
[2] https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/wikis/Updates
[3] https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/wikis/home
[4] https://dashboard.captcha.wtf/
[5] Cloudflare has a setting to block all traffic originating from the Tor network, but that setting is not "turned on" by default
[6] https://gist.github.com/woswos/38b921f0b82de009c12c6494db3f50c5