Malware scanning #1267

Joshua-Tihabwangye · 2025-05-16T16:18:09Z

Contributor checklist

This pull request is on a separate branch and not the main branch
I have run the tests for the backend and frontend depending on what's needed for my changes as described in the testing section of the contributing guide

Description

I have made the malware scanning functionality where each file is to be scanned before being stored in the database. I used a python library which is pyClamd. I created the malware_scan.py file inside the utils folder, i also had to set the pyclamd and its clamav deamon in settings.py, then other files that were edited are, serializers, docker file of backend, the docker compose file and the requirements file, and the requirements.txt.

Related issue

Add filescan capability to backend #1171

netlify · 2025-05-16T16:18:14Z

✅ Deploy Preview for activist-org canceled.

Name	Link
🔨 Latest commit	`7709042`
🔍 Latest deploy log	https://app.netlify.com/projects/activist-org/deploys/6828debc28cefc0008afba07

github-actions · 2025-05-16T16:18:32Z

Thank you for the pull request! ❤️

The activist team will do our best to address your contribution as soon as we can. If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Development rooms once you're in. Also consider attending our bi-weekly Saturday developer syncs! It'd be great to meet you 😊

github-actions · 2025-05-16T16:18:33Z

Maintainer Checklist

The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

The TypeScript, pytest and formatting workflows within the PR checks do not indicate new errors in the files changed
The Playwright end to end and Zap penetration tests have been ran and are passing (if necessary)
The changelog has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

mattburnett-repo · 2025-05-17T13:20:13Z

Hello @Joshua-Tihabwangye ! Thanks for all of the work getting this together; I'm looking forward to taking a look at the code!

Before we go further, there are a few things we should do, to get the PR in good shape. Could you please:

In the 'Contributor Checklist' at the top of this page, check the item that confirms these commits are on a separate branch.
Just below that, if you have tests for this code, please check the item that confirms you have run the tests.
- If there are no tests yet, no problem! A good place to start with that is a simple test, that checks if the scanner is up and running at the expected port. A second test would check that you can send a file to the scanner, plus check the response from the scanner. Testing can be a little overwhelming sometimes, so it's ok to start with a few basic tests that check if the tested code is running and runs as expected for a simple use case. We can always build more tests out later, but we need to start with these two basic tests.
For the 'Description' field, please add a short description for what the code does. A high-level description that explains what the code is intended for, where it is in the codebase, and how it works should be enough. Simple and clear is good!
For the 'Related Issue' please enter the original issue number that this PR addresses. I think you can just replace 'ISSUE_NUMBER' with the actual issue number.

I noticed that there are some conflicts that need to be resolved. If you can fix these, please go ahead to do so. If not, I'll take a look and see what we can do to fix them.

Overall, you built some code and created a PR! That is fantastic, and we really appreciate your efforts here. Our next step is to fix the items above, then we can get to the fun part of digging through the code!

Joshua-Tihabwangye · 2025-05-17T18:52:41Z

Am sorry for not checking them, i got interrupted by other things and forgot but let me work on them right now

Joshua-Tihabwangye · 2025-05-17T19:15:29Z

I need help, from where i have reached. If i run docker in the logs it will show that the database has been scanned and there is no problem in it, but when i access the admin dashboard and i upload the image it cant show i the filescan container any logs of showing that the image has been scanned or nay file has been scanned. so am stack there. am here that we work together as i learn from you.

mattburnett-repo · 2025-05-18T13:35:23Z

.env.dev

@@ -1,26 +1,28 @@
-FRONTEND_PORT="3000"


The change in the formatting of the values (the "" ) is not necessary.

Usually, the only changes to a .env file should be adding / removing variables. It is best to leave the formatting as-is.

@Joshua-Tihabwangye , could you please revert this file? Thanks!

mattburnett-repo · 2025-05-18T13:37:37Z

backend/Dockerfile

@@ -5,7 +5,10 @@ WORKDIR /app
 COPY requirements.txt .
 RUN pip install -r requirements.txt

-COPY  . /app/backend/.


Is it possible to add the clamav parts to the requirements. txt file? If so, we should do that.

mattburnett-repo · 2025-05-18T13:41:36Z

backend/content/serializers.py

+        uploaded_file = data["file_object"]
+        file_content = uploaded_file.read()
+        logger.info("About to scan uploaded file in memory.")  # Add this log
+        scan_result = scan_file_in_memory(file_content)


This is a good start, but we should use the file scan like a separate service, instead of in-line like we are doing here.

There are some questions I still have about this. @andrewtavis , your thoughts about how to best architect this? I was thinking in terms of treating the filescan like a microservice, separate from the rest of the backend code. Your comments on this are appreciated.

I'd agree that having it as it's own service would make the most sense, @mattburnett-repo 😊 Loading it into serializers sounds good to me :)

mattburnett-repo · 2025-05-18T13:43:08Z

backend/Dockerfile

@@ -5,7 +5,10 @@ WORKDIR /app
 COPY requirements.txt .
 RUN pip install -r requirements.txt

-COPY  . /app/backend/.
+RUN apt-get update && apt-get install -y clamav clamav-daemon


During Docker build, getting this error:

RUN apt-get update && apt-get install -y clamav clamav-daemon

/bin/sh: apt-get: not found

See comment above, about including clamav modules in the requirements.txt file.

mattburnett-repo · 2025-05-18T13:44:37Z

backend/requirements-dev.txt

@@ -190,3 +190,6 @@ wheel==0.45.1
 # The following packages are considered to be unsafe in a requirements file:
 # pip
 # setuptools
+pyclamd


Ah, ok. I see here that the clamav modules are included in requirements text file.

Just wondering why we added 'factory-boy' here?

Looks like these were added directly into this file as well, @Joshua-Tihabwangye :) Totally fine that this is a bit confusing. Was for me when we started doing it 😊 See the directions on how to update dependencies at the top of backend/requirements.in, and please let us know if you have any questions! This uses pip-tools :)

mattburnett-repo · 2025-05-18T13:44:47Z

backend/requirements.in

@@ -16,3 +16,5 @@ gunicorn
 pillow
 psycopg2-binary
 python-dotenv
+pyclamd


Ah, ok. I see here that the clamav modules are included in requirements text file.

mattburnett-repo · 2025-05-18T13:45:32Z

backend/requirements.txt

@@ -94,3 +94,5 @@ urllib3==2.3.0
    # via
    #   requests
    #   types-requests
+pyclamd


Not understanding why this is duplicated here.

Just checking for clarity: Won't we need the file scan for production, so it needs to be in both requirement files?

You are correct. Thanks for pointing that out. #derp

mattburnett-repo · 2025-05-18T13:47:18Z

backend/utils/malware_scan.py

@@ -0,0 +1,52 @@
+import pyclamd


This is a good start! Is it possible to move this code out of the main backend codebase, and instead treat it like a microservice? Waiting to hear from @andrewtavis to get his thoughts about this.

I'm fine with whatever the architecture decision is, but I guess we're not talking a microservice in the true sense of it, but rather another sub-directory or app within the backend like communities, events, etc?

CCing also @to-sta here for this part of the discussion of where the malware scanning should live and how it should be ran 😊

Actually, I was thinking about it as something separate from the backend code, and not a sub-dir as you mentioned.

The idea behind treating it as a microservice is simply to isolate the scan functionality from the rest of the backend codebase. The filescan stuff isn't technically part of the overall core functionality for the backend. Also, in the event that there is a directed attack against the filescanner, isolating the filescanner from the rest of the backend code reduces the attack surface.

These are just initial thoughts. I can see advantages to either solution, although I tend toward treating it as something external to the core backend codebase. However, my opinion on this isn't strong. I'm interested in @to-sta 's thoughts on this.

CCing also @to-sta here for this part of the discussion of where the malware scanning should live and how it should be ran 😊

I decided to integrate it in the serializers that handle the file uploads, in order to capture any file that will be saved to the database

This is a good start! Is it possible to move this code out of the main backend codebase, and instead treat it like a microservice? Waiting to hear from @andrewtavis to get his thoughts about this.

I don't know how we can do it outside the serializers so I think there we need help one another. @andrewtavis some idea about the architecture we should use besides using it in the API endpoint of uploading the files

My first thought would be to use a seperate background worker that checks the file after it's uploaded:

Upload the file to a temporary quarantine folder.

The worker scans the file for malicious content.

Based on the scan results, the file is either released or removed.

mattburnett-repo · 2025-05-18T13:48:46Z

docker-compose.yml

-        --events-per-org 1 \
-        --resources-per-entity 1 \
-        --faq-entries-per-entity 1 &&
+      python manage.py populate_db --users 10 --orgs-per-user 1 --groups-per-org 1 --events-per-org 1 --resources-per-entity 1 --faq-entries-per-entity 1 &&


Not sure why this was changed, to put all of the populate_db options on one line.

Yes this would be nice to revert :)

@Joshua-Tihabwangye , could you please revert these changes? Thanks!

mattburnett-repo · 2025-05-18T13:50:36Z

docker-compose.yml

-
-# volumes:
-#  clamav_db:
+  filescan:


This is a very, very, very good start!

If we are pulling the clamav images here, do we need to also add clamav modules to requirement files?

mattburnett-repo · 2025-05-18T13:56:18Z

Hi @Joshua-Tihabwangye ! You will certainly get help from us! We try our best to be helpful, and to promote a positive environment here!

I've made some initial comments about the code. DON'T STRESS!!! :) You're doing fine! We will go through all of the issues and fix them, one-by-one.

Right now, I have a question for @andrewtavis about how to architect this solution. It's actually a very basic question about where we want the code to go, and how we want to access it.

You've taken on a big task, and have made good progress towards solving it. We need to hear from Andrew, and then we can take next steps. Thanks for all of your work so far!

Joshua-Tihabwangye · 2025-05-18T17:15:29Z

I duplicated it after getting problems with including the clamav modules into the requirements.txt file so, when I added in one file I thought of duplicating it to cover all the requirements.txt files

…

On Sun, May 18, 2025, 4:47 PM Matt Burnett ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In backend/utils/malware_scan.py <#1267 (comment)> : > @@ -0,0 +1,52 @@ +import pyclamd This is a good start! Is it possible to move this code out of the main backend codebase, and instead treat it like a microservice? Waiting to hear from @andrewtavis <https://github.com/andrewtavis> to get his thoughts about this. — Reply to this email directly, view it on GitHub <#1267 (review)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/BCHYFTDO6TLQ6XPH34Z6FYL27CFPXAVCNFSM6AAAAAB5JE2SZCVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDQNBYHE4TENZYGI> . You are receiving this because you were mentioned.Message ID: ***@***.***>

Joshua-Tihabwangye · 2025-05-18T18:20:24Z

We can use the small micro service to implement the malware scanning functionality. I have made some research on both options of integrating the functionality into the API end points especially in serializers as it is right now and then I have also the one of creating a small microservice and I've found out that it is a good idea to make the functionality as a small microservice.
It is good because it handles much data at one time it's is important to separate it ,in case of any problem like crashing of the Clamav it can't affect the backend code.

andrewtavis · 2025-05-18T18:38:58Z

I duplicated it after getting problems with including the clamav modules into the requirements.txt file so, when I added in one file I thought of duplicating it to cover all the requirements.txt files

No stress, @Joshua-Tihabwangye! I can run the scripts to generate the dependencies in the final review here. We're good for now 😊

We can use the small micro service to implement the malware scanning functionality. I have made some research on both options of integrating the functionality into the API end points especially in serializers as it is right now and then I have also the one of creating a small microservice and I've found out that it is a good idea to make the functionality as a small microservice. It is good because it handles much data at one time it's is important to separate it ,in case of any problem like crashing of the Clamav it can't affect the backend code.

Let's do it then! 🚀 I think this will be a very nice first microservice to add into the codebase :)

Joshua-Tihabwangye · 2025-05-23T07:35:56Z

I have made some research on how the microservice would look like inside our main repo or project.

In the microservice we shall have the following directories which will have different files.

.env file = environment variables
Docker file = Dependencies for running the container.
Requirements.txt = will have the pyrhon dependencies of the API and the library responsible for scanning.
App
Tests
Scripts
Readme.md file.

Then we need to have an API to help is integrate it or connect it to the main project.

The API wil have the;
* File Scan Endpoint
* POST as the method

Joshua-Tihabwangye · 2025-05-23T07:39:49Z

@mattburnett-repo , @to-sta @andrewtavis I need your idea and help on what my findings are, is there other things needed to fit in the structure or architecture of the microservice?

mattburnett-repo · 2025-05-23T12:23:48Z

I have made some research on how the microservice would look like inside our main repo or project.

In the microservice we shall have the following directories which will have different files.

.env file = environment variables

Docker file = Dependencies for running the container.

Requirements.txt = will have the pyrhon dependencies of the API and the library responsible for scanning.

App

Tests

Scripts
Readme.md file.

Then we need to have an API to help is integrate it or connect it to the main project.

The API wil have the; * File Scan Endpoint * POST as the method

This is a good overview and a good first step towards building the filescan functionality!

I would suggest that we do the following:

Revert the .env.dev file. The changes made to this file (the " characters surrounding the values) are not necessary in almost all cases. This change should be removed from the PR.
Revert the docker-compose.yml file. The formatting of the code python manage.py populate_db --users 10 --orgs-per-user 1... onto a single line is not necessary. We should keep the original formatting, instead. This change should be removed from the PR.
We can start to build the filescan module with the following steps, focusing first on the infrastructure:
- Create a separate module outside of the backend folder. We can start by creating a new folder called services that is on the same level as the backend and frontend folders.
- In this services folder, create another folder called filescan. In this folder, we can build the actual filescan module. We will build this module later, after we create the infrastructure for it.
- In the filescan folder, create a small, simple http service that
  - Accepts only a POST request. This POST request should include a file to be scanned.
  - Creates three Responses:
    - First response is for the case where the file is ok.
    - Second response is for the case where the file not ok.
    - Third response is for the cases where an error of some type occurred.
  - For now, this simple service doesn't need to actually do the filescan. For now, we just want to get an endpoint for the service created. We can add the filescan functionality after the endpoint is set up and accessible by the backend.
  - To build this simple service, look at Flask as a possible tool to use. Flask allows you to build simple http-based endpoints. If you prefer to use something else, feel free to do so.
- Create a Dockerfile for this simple service.
- Create three tests, one for each of the three Responses mentioned above (ok, malware, error).

We don't need to focus on the actual filescanning right now. What is important at this step is to create the 'home' for the filescan functionality, meaning the folders/http service/Dockerfile/tests. Once we have this 'home' created and working well, we can add the real filescan code.

@andrewtavis @to-sta , as always we're interested in any thoughts or comments you might have!

andrewtavis · 2025-05-23T16:13:54Z

I think the general plan sounds really good, @mattburnett-repo :) Thanks for outlining it in such detail 😊

mattburnett-repo · 2025-05-23T16:21:03Z

@Joshua-Tihabwangye , I've outlined some steps, above, to take in order to move forward with this task. Please take a look; they should be helpful.

Joshua-Tihabwangye · 2025-05-23T16:37:55Z

Thank you @mattburnett-repo for the highlights given to us. It will really help me and the development team in this task

Joshua-Tihabwangye and others added 2 commits May 5, 2025 09:50

Implementing the malware scanning functionality using PyClamd library.

305b703

I have created the malware scan functionality, using pyClamd

a06f55d

andrewtavis requested review from andrewtavis and mattburnett-repo May 17, 2025 11:37

Delete backend/utils/malware_scanner.py

7709042

mattburnett-repo reviewed May 18, 2025

View reviewed changes

Malware scanning #1267

Are you sure you want to change the base?

Malware scanning #1267

Uh oh!

Conversation

Joshua-Tihabwangye commented May 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Contributor checklist

Description

Related issue

Uh oh!

netlify bot commented May 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for activist-org canceled.

Uh oh!

github-actions bot commented May 16, 2025

Thank you for the pull request! ❤️

Uh oh!

github-actions bot commented May 16, 2025

Maintainer Checklist

Uh oh!

mattburnett-repo commented May 17, 2025

Uh oh!

Joshua-Tihabwangye commented May 17, 2025

Uh oh!

Joshua-Tihabwangye commented May 17, 2025

Uh oh!

mattburnett-repo May 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mattburnett-repo May 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mattburnett-repo May 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

to-sta May 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mattburnett-repo commented May 18, 2025

Joshua-Tihabwangye commented May 16, 2025 •

edited

Loading

netlify bot commented May 16, 2025 •

edited

Loading

mattburnett-repo May 18, 2025 •

edited

Loading

mattburnett-repo May 18, 2025 •

edited

Loading

mattburnett-repo May 18, 2025 •

edited

Loading

to-sta May 18, 2025 •

edited

Loading