Skip to content

Malware scanning #1267

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

Joshua-Tihabwangye
Copy link
Contributor

@Joshua-Tihabwangye Joshua-Tihabwangye commented May 16, 2025

Contributor checklist


Description

I have made the malware scanning functionality where each file is to be scanned before being stored in the database. I used a python library which is pyClamd. I created the malware_scan.py file inside the utils folder, i also had to set the pyclamd and its clamav deamon in settings.py, then other files that were edited are, serializers, docker file of backend, the docker compose file and the requirements file, and the requirements.txt.

Related issue

Copy link

netlify bot commented May 16, 2025

Deploy Preview for activist-org canceled.

Name Link
🔨 Latest commit 7709042
🔍 Latest deploy log https://app.netlify.com/projects/activist-org/deploys/6828debc28cefc0008afba07

Copy link
Contributor

Thank you for the pull request! ❤️

The activist team will do our best to address your contribution as soon as we can. If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Development rooms once you're in. Also consider attending our bi-weekly Saturday developer syncs! It'd be great to meet you 😊

Copy link
Contributor

Maintainer Checklist

The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

  • The TypeScript, pytest and formatting workflows within the PR checks do not indicate new errors in the files changed

  • The Playwright end to end and Zap penetration tests have been ran and are passing (if necessary)

  • The changelog has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

@mattburnett-repo
Copy link
Member

Hello @Joshua-Tihabwangye ! Thanks for all of the work getting this together; I'm looking forward to taking a look at the code!

Before we go further, there are a few things we should do, to get the PR in good shape. Could you please:

  • In the 'Contributor Checklist' at the top of this page, check the item that confirms these commits are on a separate branch.
  • Just below that, if you have tests for this code, please check the item that confirms you have run the tests.
    • If there are no tests yet, no problem! A good place to start with that is a simple test, that checks if the scanner is up and running at the expected port. A second test would check that you can send a file to the scanner, plus check the response from the scanner. Testing can be a little overwhelming sometimes, so it's ok to start with a few basic tests that check if the tested code is running and runs as expected for a simple use case. We can always build more tests out later, but we need to start with these two basic tests.
  • For the 'Description' field, please add a short description for what the code does. A high-level description that explains what the code is intended for, where it is in the codebase, and how it works should be enough. Simple and clear is good!
  • For the 'Related Issue' please enter the original issue number that this PR addresses. I think you can just replace 'ISSUE_NUMBER' with the actual issue number.

I noticed that there are some conflicts that need to be resolved. If you can fix these, please go ahead to do so. If not, I'll take a look and see what we can do to fix them.

Overall, you built some code and created a PR! That is fantastic, and we really appreciate your efforts here. Our next step is to fix the items above, then we can get to the fun part of digging through the code!

@Joshua-Tihabwangye
Copy link
Contributor Author

Am sorry for not checking them, i got interrupted by other things and forgot but let me work on them right now

@Joshua-Tihabwangye
Copy link
Contributor Author

I need help, from where i have reached. If i run docker in the logs it will show that the database has been scanned and there is no problem in it, but when i access the admin dashboard and i upload the image it cant show i the filescan container any logs of showing that the image has been scanned or nay file has been scanned. so am stack there. am here that we work together as i learn from you.

@@ -1,26 +1,28 @@
FRONTEND_PORT="3000"
Copy link
Member

@mattburnett-repo mattburnett-repo May 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change in the formatting of the values (the "" ) is not necessary.

Usually, the only changes to a .env file should be adding / removing variables. It is best to leave the formatting as-is.

@Joshua-Tihabwangye , could you please revert this file? Thanks!

@@ -5,7 +5,10 @@ WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . /app/backend/.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to add the clamav parts to the requirements. txt file? If so, we should do that.

uploaded_file = data["file_object"]
file_content = uploaded_file.read()
logger.info("About to scan uploaded file in memory.") # Add this log
scan_result = scan_file_in_memory(file_content)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good start, but we should use the file scan like a separate service, instead of in-line like we are doing here.

There are some questions I still have about this. @andrewtavis , your thoughts about how to best architect this? I was thinking in terms of treating the filescan like a microservice, separate from the rest of the backend code. Your comments on this are appreciated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd agree that having it as it's own service would make the most sense, @mattburnett-repo 😊 Loading it into serializers sounds good to me :)

@@ -5,7 +5,10 @@ WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . /app/backend/.
RUN apt-get update && apt-get install -y clamav clamav-daemon
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During Docker build, getting this error:

RUN apt-get update && apt-get install -y clamav clamav-daemon

/bin/sh: apt-get: not found

See comment above, about including clamav modules in the requirements.txt file.

@@ -190,3 +190,6 @@ wheel==0.45.1
# The following packages are considered to be unsafe in a requirements file:
# pip
# setuptools
pyclamd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok. I see here that the clamav modules are included in requirements text file.

Just wondering why we added 'factory-boy' here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like these were added directly into this file as well, @Joshua-Tihabwangye :) Totally fine that this is a bit confusing. Was for me when we started doing it 😊 See the directions on how to update dependencies at the top of backend/requirements.in, and please let us know if you have any questions! This uses pip-tools :)

@@ -16,3 +16,5 @@ gunicorn
pillow
psycopg2-binary
python-dotenv
pyclamd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok. I see here that the clamav modules are included in requirements text file.

@@ -94,3 +94,5 @@ urllib3==2.3.0
# via
# requests
# types-requests
pyclamd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not understanding why this is duplicated here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking for clarity: Won't we need the file scan for production, so it needs to be in both requirement files?

Copy link
Member

@mattburnett-repo mattburnett-repo May 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct. Thanks for pointing that out. #derp

@@ -0,0 +1,52 @@
import pyclamd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good start! Is it possible to move this code out of the main backend codebase, and instead treat it like a microservice? Waiting to hear from @andrewtavis to get his thoughts about this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with whatever the architecture decision is, but I guess we're not talking a microservice in the true sense of it, but rather another sub-directory or app within the backend like communities, events, etc?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CCing also @to-sta here for this part of the discussion of where the malware scanning should live and how it should be ran 😊

Copy link
Member

@mattburnett-repo mattburnett-repo May 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I was thinking about it as something separate from the backend code, and not a sub-dir as you mentioned.

The idea behind treating it as a microservice is simply to isolate the scan functionality from the rest of the backend codebase. The filescan stuff isn't technically part of the overall core functionality for the backend. Also, in the event that there is a directed attack against the filescanner, isolating the filescanner from the rest of the backend code reduces the attack surface.

These are just initial thoughts. I can see advantages to either solution, although I tend toward treating it as something external to the core backend codebase. However, my opinion on this isn't strong. I'm interested in @to-sta 's thoughts on this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CCing also @to-sta here for this part of the discussion of where the malware scanning should live and how it should be ran 😊

I decided to integrate it in the serializers that handle the file uploads, in order to capture any file that will be saved to the database

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good start! Is it possible to move this code out of the main backend codebase, and instead treat it like a microservice? Waiting to hear from @andrewtavis to get his thoughts about this.

I don't know how we can do it outside the serializers so I think there we need help one another. @andrewtavis some idea about the architecture we should use besides using it in the API endpoint of uploading the files

Copy link
Collaborator

@to-sta to-sta May 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first thought would be to use a seperate background worker that checks the file after it's uploaded:

  1. Upload the file to a temporary quarantine folder.
  2. The worker scans the file for malicious content.
  3. Based on the scan results, the file is either released or removed. 

--events-per-org 1 \
--resources-per-entity 1 \
--faq-entries-per-entity 1 &&
python manage.py populate_db --users 10 --orgs-per-user 1 --groups-per-org 1 --events-per-org 1 --resources-per-entity 1 --faq-entries-per-entity 1 &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why this was changed, to put all of the populate_db options on one line.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this would be nice to revert :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Joshua-Tihabwangye , could you please revert these changes? Thanks!


# volumes:
# clamav_db:
filescan:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very, very, very good start!

If we are pulling the clamav images here, do we need to also add clamav modules to requirement files?

@mattburnett-repo
Copy link
Member

Hi @Joshua-Tihabwangye ! You will certainly get help from us! We try our best to be helpful, and to promote a positive environment here!

I've made some initial comments about the code. DON'T STRESS!!! :) You're doing fine! We will go through all of the issues and fix them, one-by-one.

Right now, I have a question for @andrewtavis about how to architect this solution. It's actually a very basic question about where we want the code to go, and how we want to access it.

You've taken on a big task, and have made good progress towards solving it. We need to hear from Andrew, and then we can take next steps. Thanks for all of your work so far!

@Joshua-Tihabwangye
Copy link
Contributor Author

Joshua-Tihabwangye commented May 18, 2025 via email

@Joshua-Tihabwangye
Copy link
Contributor Author

We can use the small micro service to implement the malware scanning functionality. I have made some research on both options of integrating the functionality into the API end points especially in serializers as it is right now and then I have also the one of creating a small microservice and I've found out that it is a good idea to make the functionality as a small microservice.
It is good because it handles much data at one time it's is important to separate it ,in case of any problem like crashing of the Clamav it can't affect the backend code.

@andrewtavis
Copy link
Member

I duplicated it after getting problems with including the clamav modules into the requirements.txt file so, when I added in one file I thought of duplicating it to cover all the requirements.txt files

No stress, @Joshua-Tihabwangye! I can run the scripts to generate the dependencies in the final review here. We're good for now 😊

We can use the small micro service to implement the malware scanning functionality. I have made some research on both options of integrating the functionality into the API end points especially in serializers as it is right now and then I have also the one of creating a small microservice and I've found out that it is a good idea to make the functionality as a small microservice. It is good because it handles much data at one time it's is important to separate it ,in case of any problem like crashing of the Clamav it can't affect the backend code.

Let's do it then! 🚀 I think this will be a very nice first microservice to add into the codebase :)

@Joshua-Tihabwangye
Copy link
Contributor Author

I have made some research on how the microservice would look like inside our main repo or project.

In the microservice we shall have the following directories which will have different files.

  1. .env file = environment variables
  2. Docker file = Dependencies for running the container.
  3. Requirements.txt = will have the pyrhon dependencies of the API and the library responsible for scanning.
  4. App
  5. Tests
  6. Scripts
    Readme.md file.

Then we need to have an API to help is integrate it or connect it to the main project.

The API wil have the;
* File Scan Endpoint
* POST as the method

@Joshua-Tihabwangye
Copy link
Contributor Author

@mattburnett-repo , @to-sta @andrewtavis I need your idea and help on what my findings are, is there other things needed to fit in the structure or architecture of the microservice?

@mattburnett-repo
Copy link
Member

I have made some research on how the microservice would look like inside our main repo or project.

In the microservice we shall have the following directories which will have different files.

  1. .env file = environment variables
  2. Docker file = Dependencies for running the container.
  3. Requirements.txt = will have the pyrhon dependencies of the API and the library responsible for scanning.
  4. App
  5. Tests
  6. Scripts
    Readme.md file.

Then we need to have an API to help is integrate it or connect it to the main project.

The API wil have the; * File Scan Endpoint * POST as the method

This is a good overview and a good first step towards building the filescan functionality!

I would suggest that we do the following:

  • Revert the .env.dev file. The changes made to this file (the " characters surrounding the values) are not necessary in almost all cases. This change should be removed from the PR.
  • Revert the docker-compose.yml file. The formatting of the code python manage.py populate_db --users 10 --orgs-per-user 1... onto a single line is not necessary. We should keep the original formatting, instead. This change should be removed from the PR.
  • We can start to build the filescan module with the following steps, focusing first on the infrastructure:
    • Create a separate module outside of the backend folder. We can start by creating a new folder called services that is on the same level as the backend and frontend folders.
    • In this services folder, create another folder called filescan. In this folder, we can build the actual filescan module. We will build this module later, after we create the infrastructure for it.
    • In the filescan folder, create a small, simple http service that
      • Accepts only a POST request. This POST request should include a file to be scanned.
      • Creates three Responses:
        • First response is for the case where the file is ok.
        • Second response is for the case where the file not ok.
        • Third response is for the cases where an error of some type occurred.
      • For now, this simple service doesn't need to actually do the filescan. For now, we just want to get an endpoint for the service created. We can add the filescan functionality after the endpoint is set up and accessible by the backend.
      • To build this simple service, look at Flask as a possible tool to use. Flask allows you to build simple http-based endpoints. If you prefer to use something else, feel free to do so.
    • Create a Dockerfile for this simple service.
    • Create three tests, one for each of the three Responses mentioned above (ok, malware, error).

We don't need to focus on the actual filescanning right now. What is important at this step is to create the 'home' for the filescan functionality, meaning the folders/http service/Dockerfile/tests. Once we have this 'home' created and working well, we can add the real filescan code.

@andrewtavis @to-sta , as always we're interested in any thoughts or comments you might have!

@andrewtavis
Copy link
Member

I think the general plan sounds really good, @mattburnett-repo :) Thanks for outlining it in such detail 😊

@mattburnett-repo
Copy link
Member

@Joshua-Tihabwangye , I've outlined some steps, above, to take in order to move forward with this task. Please take a look; they should be helpful.

@Joshua-Tihabwangye
Copy link
Contributor Author

Thank you @mattburnett-repo for the highlights given to us. It will really help me and the development team in this task

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants