Barkin Simsek edited this page Jul 9, 2020 · 1 revision

The following blog posts are mirrored from DIAL's blog.

Table of Contents

[[TOC]]

July 2020

July 3

This week I worked on solving the memory leak problem. I found the root cause and stopped the leak. I was using a timeout function while fetching the pages. It turns out the timeout value I used was shorter than it should have been, and the timeout handler wasn't sending the right signals to properly kill the browser instances. So, I increased the timeout and added the right calls to shut down the browser instances properly. This solved the memory leak issue.
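
The pattern behind the fix can be sketched like this, a minimal illustration (not the actual CAPTCHA Monitor code): use a generous timeout and put the browser shutdown in a `finally` block so every instance is torn down whether the fetch succeeds or times out. `FakeBrowser` is a hypothetical stand-in for the real Tor Browser driver.

```python
# Instances are tracked here only so the demo can verify that every
# browser was shut down; the real fix is the try/finally structure.
INSTANCES = []

class FakeBrowser:
    """Hypothetical stand-in for a Tor Browser driver instance."""
    def __init__(self):
        self.closed = False
        INSTANCES.append(self)

    def get(self, url, timeout):
        # A real driver would block until the page loads or the timeout fires
        if timeout < 1:
            raise TimeoutError("page load timed out")

    def quit(self):
        # The proper shutdown call, instead of leaving the process behind
        self.closed = True

def fetch(url, timeout=30):
    browser = FakeBrowser()
    try:
        browser.get(url, timeout=timeout)
        return "ok"
    except TimeoutError:
        return "timeout"
    finally:
        # Runs on success *and* on timeout, so no instance leaks memory
        browser.quit()
```

The key point is that the cleanup call is unconditional; before the fix, a timed-out fetch skipped it and the orphaned browser kept its memory.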

Next, I worked on the algorithm that decides which test to run for a given exit relay. The algorithm compiles a list of required measurements and checks whether the relay has completed all of them. If not, it assigns one of the uncompleted measurements to the relay; if the relay has completed everything, it refreshes the oldest measurement. To take this algorithm one step further, I plan to add priorities so that the more important measurements run more frequently than others.
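
The scheduling logic described above can be sketched in a few lines; the measurement names below are hypothetical, and `completed` maps each finished measurement to the timestamp it last ran for the relay in question.

```python
# Hypothetical list of measurements every exit relay should complete
REQUIRED_MEASUREMENTS = ["captcha_check", "header_check", "html_diff"]

def next_measurement(completed):
    """Pick the next measurement for a relay.

    completed: dict mapping measurement name -> last-run timestamp.
    """
    # First serve any measurement the relay has never completed
    pending = [m for m in REQUIRED_MEASUREMENTS if m not in completed]
    if pending:
        return pending[0]
    # Otherwise refresh the measurement with the oldest timestamp
    return min(completed, key=completed.get)
```

The planned priority extension would simply sort `pending` (and break ties among stale measurements) by a per-measurement weight instead of list order.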

Finally, I worked on annotating the data with CAPTCHA Monitor's version numbers. The main problem was that I didn't have properly defined versions, so I needed to define them first in the merge requests I made. After that, I added the code that attaches the version information to the results.

This week marks a full month of coding, and it has been great so far. I managed to stick to the timeline I set and released a fully working version of the system. I haven't gotten a lot of CAPTCHAs with my system so far. Over the next weeks, I will work on expanding the modules to track other metrics and test more websites for CAPTCHAs.

June 2020

June 26

This week started with an unexpected issue. The CAPTCHA rates I was getting were very high compared to what Tor Browser users experience in real life. After investigating, I realized that the seleniumwire library I used to capture HTTP headers was causing this. Interestingly, it was the case only with Tor; I wasn't getting high CAPTCHA rates when I used seleniumwire over a regular internet connection. Clearly, using seleniumwire and Tor together triggers something on the Cloudflare side. I think they might be detecting the increased latency or the changed TLS fingerprint.

Anyway, I opted out of using seleniumwire because it was affecting the results negatively. I started using the HTTP-Header-Live addon to capture the headers instead. The addon starts automatically with the browser and captures the headers inside the browser, without touching the traffic itself. When the page is completely loaded, the addon writes the headers to a text file in JSON format. Later, my code reads this file and saves the results. It is not the most elegant way to solve this problem, but I needed to use this method since the elegant method (seleniumwire) caused problems.
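
The hand-off between the addon and the harness can be sketched as follows. This is a simplified illustration, not the project's actual code: the file layout and header fields are made up, and the addon's write step is simulated here so the sketch is self-contained.

```python
import json
import os
import tempfile

def read_headers(path):
    """Read the JSON header dump the addon leaves behind after page load."""
    with open(path) as f:
        return json.load(f)

# Simulate what the addon would write once the page finishes loading
# (hypothetical structure for illustration)
captured = {
    "request": {"User-Agent": "Mozilla/5.0"},
    "response": {"server": "cloudflare", "status": "200"},
}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(captured, f)
    path = f.name

# The harness picks the file up after the page load completes
headers = read_headers(path)
os.remove(path)
```

Because the addon observes headers from inside the browser, the traffic itself is untouched, which is exactly why it doesn't trip Cloudflare the way the proxy-based approach did.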

Here is a sample of the code I used to connect Tor Browser to the Tor network via seleniumwire. Feel free to do further testing if this issue sounds interesting to you: https://gist.github.com/woswos/38b921f0b82de009c12c6494db3f50c5

After solving this unexpected problem, I worked on adding support for older versions of the browsers. Now, the -b or --browser_version flag can be used to provide the exact browser version. The code doesn't automatically download that version of the browser yet, but that could be a nice future addition.
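
With argparse, the new flag boils down to something like the sketch below. Only the flag mentioned above is shown; the program name and any other options are assumptions for illustration.

```python
import argparse

# Minimal sketch of the CLI surface for the new flag; the real tool
# has more options than shown here.
parser = argparse.ArgumentParser(prog="captchamonitor")
parser.add_argument(
    "-b", "--browser_version",
    help="exact browser version to run the measurement with",
)

# Example invocation: captchamonitor -b 9.0.9
args = parser.parse_args(["-b", "9.0.9"])
```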

I also realized that Cloudflare injects code that wasn't a part of the original page. For example, here is the original code:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>Hello world!</title>
    </head>
    <body>
        Hello world!
    </body>
</html>

Here is the version Cloudflare serves:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>Hello world!</title>
    </head>
    <body>
        Hello world!
        <script defer="" src="https://static.cloudflareinsights.com/beacon.min.js" data-cf-beacon="{&quot;rayId&quot;:&quot;5a974a483cf0b6cc&quot;,&quot;version&quot;:&quot;2020.5.1&quot;,&quot;si&quot;:10}"></script>
    </body>
</html>

So, I decided to detect these kinds of changes as well by hashing the page. Now, the system automatically takes the MD5 hash of the page contents and compares it with the hash of the original page. If there is a difference, it saves that change as well.
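
The check itself is tiny, something along these lines (a sketch; the sample pages are abbreviated versions of the HTML shown above):

```python
import hashlib

def page_digest(html):
    """MD5 fingerprint of a page's contents."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()

# Abbreviated stand-ins for the pages shown above
original = "<html><body>Hello world!</body></html>"
served = (
    "<html><body>Hello world!"
    "<script src='https://static.cloudflareinsights.com/beacon.min.js'>"
    "</script></body></html>"
)

# Any injected script, however small, changes the digest
changed = page_digest(original) != page_digest(served)
```

MD5 is fine here because the goal is cheap change detection, not cryptographic integrity.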

Additionally, I created a new section called 'Measurement Search' for showing the individual measurements behind the graphs. It also lets users perform custom queries on the data using the search box:

[Screenshots of the Measurement Search section]

June 19

This week I spent my time parallelizing CAPTCHA Monitor using processes on the host machine. Previously I was using Docker swarm to replicate instances of the code, but it turned out to be slow and memory-hungry. Instead, I used Python's multiprocessing library to replicate the workers. To make this happen, I needed to change the architecture a bit and separate the code that manages Tor and Tor Browser from the main program loop. Now, the main loop creates instances of that code in separate processes and makes sure they keep running. Using the updated code, I started collecting data once more; every day I collect data for a different metric.
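
The shape of that architecture can be sketched with multiprocessing primitives: the main loop owns a job queue and a set of worker processes, and each worker pulls jobs until it receives a shutdown sentinel. The worker body below is a placeholder for the code that drives Tor and Tor Browser, and the doubling step just stands in for running one measurement.

```python
import multiprocessing as mp

def worker(jobs, results):
    """One worker process: pull jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:           # sentinel: shut down cleanly
            break
        results.put(job * 2)      # placeholder for one real measurement

jobs, results = mp.Queue(), mp.Queue()
procs = [mp.Process(target=worker, args=(jobs, results)) for _ in range(4)]
for p in procs:
    p.start()

for job in range(8):              # enqueue the day's measurements
    jobs.put(job)
for _ in procs:                   # one sentinel per worker
    jobs.put(None)

# Drain results before joining so the queue's feeder thread can't block
collected = sorted(results.get() for _ in range(8))
for p in procs:
    p.join()
```

In the real system the main loop would also watch the processes (e.g. via `Process.is_alive()`) and respawn any that die, which is what "makes sure they keep running" amounts to.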

The next step was to display the collected data in a dashboard. You might remember that I already mentioned a dashboard and posted a screenshot of it. Actually, that was the second dashboard solution I tried. In the very beginning, I tried using Grafana. It is a really neat open-source dashboard solution, and it has well-designed layout options. These are all great features, but Grafana is geared towards time-series data like the temperature of a CPU or the RAM usage of a computer, so its data sources and backend are designed for that kind of data. It also doesn't provide much flexibility for data manipulation: Grafana wants to display what the database query returns directly on the dashboard. Unfortunately, I needed more flexibility in the way I process data, and I sometimes needed to combine multiple queries. Still, I used Grafana for a while to see if I was wrong, and I wasn't.

I did further research and found Metabase, another open-source dashboard solution. As opposed to Grafana, Metabase had all the flexibility I needed in the backend to process data before showing it on the dashboard. I really liked using Metabase, but it had a lot of flaws on the frontend. For example, some of the graphs were clipped for no reason, and there was no option to fix that. It was also consuming a lot of memory on my VPS, and I figured I could spend that memory on data collection rather than on the dashboard for no solid reason.

So, I ended up building my own dashboard using Node.js, Bootstrap, Chart.js, and Express.js:

[Screenshot of the new dashboard]

I used what I learned from my weeks of dashboard hunting to create something simple and elegant. I used Node.js & Express.js on the backend to create an API, and Bootstrap & Chart.js on the frontend for displaying data. The cool thing is that I can process the data any way I want on the backend and send it to the dashboard through the API. If I don't like anything about the frontend, I can just change it! Sure, I could make changes in the other open-source dashboard solutions as well, but I would need to go through an unnecessary number of steps to do so. Also, I can now use the same backend API for other purposes. I was already planning to have an API for third parties to fetch data from the system, and there it is!

Finally, I spent some time moving my project to Tor Project's new GitLab server. Previously, the code, issue tracker, and wiki were all in different locations. Now, they are unified in the same place. GitLab also has a lot of extra productivity tools, and I can't wait to use them. Here is the new home for my code: https://gitlab.torproject.org/woswos/CAPTCHA-Monitor

June 12

For the first time, I encountered problems with the speed of my code, and I'm glad it happened, because it forced me to learn how to make my code run faster. As a part of my project, I need to perform daily measurements on the Tor exit relays, and there are many of them. I repeat the exact same measurement over and over again and compare the results. At this scale, every extra second in an individual measurement adds roughly 25 minutes to the overall execution time (about 1,500 measurements per run at one extra second each is ~25 minutes).

When I first started, the total execution time was well over 100 hours, and we only have 24 hours in a day. This week I worked on implementing a worker pool to run many measurements in parallel. The worker pool reduced the total execution time significantly (down to 40 hours), but that still isn't enough. Later, I started looking at similar projects like exitmap to see how they handle measurements. This was helpful as well, and I applied what I learned from those projects, but I still need to cut the total execution time down a lot.
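
Why a worker pool helps can be shown with a toy example: total wall-clock time becomes the number of batches times the slowest task, rather than the sum of all tasks. The sketch below uses a thread pool and a short sleep as a stand-in for one ~10-second browser measurement; the real project parallelizes with processes, but the arithmetic is the same.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure(relay):
    """Stand-in for one browser measurement against an exit relay."""
    time.sleep(0.05)              # pretend this is the ~10 s real work
    return relay, "done"          # placeholder result

relays = [f"relay-{i}" for i in range(16)]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(measure, relays))
elapsed = time.monotonic() - start

# Sequentially this would take 16 * 0.05 = 0.8 s; with 8 workers it runs
# in two batches, so roughly 2 * 0.05 = 0.1 s of wall-clock time.
```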

The biggest bottleneck is the web browser itself. Currently, every individual measurement takes ~10 seconds, and ~7 seconds of that is spent starting the web browser. I hope to cut the individual measurement time down to ~5 seconds. If that is not possible, I will try to find ways to run more workers in parallel more efficiently.

June 5

This week I spent my time getting the first version of the system up and running. I deployed a continuously running instance of my code to my server, and I connected the database to the dashboard. I also worked on adding a few meaningful graphs to the dashboard.

[Screenshot of the dashboard]

I communicated with my mentors to make sure that I am on track and to get feedback on the dashboard. Based on the feedback I received, I will update the dashboard and the way I collect data with my code.

Meanwhile, I integrated Tor Stem into the system, and now I can specify a Tor exit node for testing purposes. I also merged the code that I have been restructuring into master, and I updated the README file to reflect the changes. Now, I'm working on integrating the Cloudflare API, and I plan to finish implementing it this weekend.

As you may have realized, I love flowcharts & diagrams, and I made another one to explain the current state of my code :) Actually, the code doesn't ask for these details step by step; I enter all of them at once at the beginning. That being said, I believe breaking the process down into smaller steps helps us humans understand what is going on.

[Flowchart of the current state of the code]

May 2020

May 29

This week I restructured the preliminary code I had written previously so that it works as explained in my project diagram below. The changes I implemented also made it possible to easily download the database.

[Project diagram]

Later, I worked on making the code more reliable, because it didn't always work in "headless" mode. There was an undocumented dependency in the tor-browser-selenium library I was using: I needed to install the Firefox browser to use the library reliably. I don't think it depends on Firefox itself, but rather on some component that the Firefox package installs. It took a long time to figure this out, and I will raise the issue in the library's GitHub repository to investigate further with the maintainers.

I also added the functionality to add new headers to the requests and save the response HTTP headers. The original selenium library doesn't offer this, so I needed to find another way to interact with the headers. I ended up using selenium-wire, an extension of the original selenium library that allows interacting with HTTP headers.

So, I spent this week getting a very basic working version of the project together, and I did it! Now, I will extend and complete it during the coding phase. I think it has been a great community bonding period, which helped me get feedback from the community and prepare for the coding phase.

May 22

This week I created the parent and child trac tickets for my project's milestones. These trac tickets will help me let the community know about my progress and get feedback from them while I keep track of what I do each week.

I already received some feedback on what I have done so far via ticket #33010. This week, I also spent my time incorporating this feedback into my project and the milestones.

I additionally worked on getting an SSL certificate for my IRC bouncer setup, because it is 2020 and encryption is a must. I discovered that Let's Encrypt issues SSL certificates at no cost, and I got one for my IRC bouncer server.

May 15

This week I expanded my knowledge about "receiving IRC messages 24/7", even while my computer is turned off. This was an important issue for me to solve, since IRC doesn't buffer incoming messages. Instead, users need special systems to buffer these messages, so that they can view them once they are back online.

Meanwhile, I started talking to the OONI people about improving my project and benefitting from their past experience in this field.

Finally, I sent a semi-formal introduction of my project to the tor-dev mailing list and asked for feedback from the community. I waited until the wiki article was finalized to send this email, because I thought it would be more meaningful to have documentation of the project attached. I also brainstormed various approaches for getting feedback and involving the community in the idea development process.

Once again, I am looking forward to finishing the finals for my university classes and starting to work on the code!

May 9

This week (or in the last 3 days), I spent my time updating the GSoC page on the Tor Project's trac and talking to my mentors about future plans for my project. They told me to create a "core page" on Tor Project's trac explaining my project, so we can direct people to that page if they want to learn more about the project or check the code. I started working on that page as well.

IRC is not something new to me, but I had never used it actively. I once read an article that compared IRC to a crowded bar, where everyone talks to everyone loudly. This week, I realized how true that is. I also observed that people tag their messages in different ways (like using numbers or letters) to make it easier for others to reply using those tags. It is a different type of communication than I am used to, but I am getting the hang of it.

I am looking forward to finishing the finals for my university classes and starting to work on the code!
