This folder provides a reference implementation of the paper "Exploring Self-Explainable Street-Level IP Geolocation with Graph Information Bottleneck" (ICASSP 2024).
The code was tested with Python 3.8.13, PyTorch 1.12.1, cudatoolkit 11.6.0, and cuDNN 7.6.5. Install the dependencies via Anaconda:
# create virtual environment
conda create --name ExGeo python=3.8.13
# activate environment
conda activate ExGeo
# install pytorch & cudatoolkit
conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge
# install other requirements
conda install numpy pandas
pip install scikit-learn
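
Optionally, run a quick sanity check that the installed PyTorch build and CUDA toolkit roughly match the tested versions. This snippet is only a convenience check, not part of the repository:

```python
# Sanity check: versions should roughly match the tested setup (PyTorch 1.12.1, CUDA 11.6).
import torch

print(torch.__version__)          # e.g. 1.12.1
print(torch.version.cuda)         # e.g. 11.6
print(torch.cuda.is_available())  # True if the GPU setup is working
```
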
# Open the "ExGeo" folder
cd ExGeo
# data preprocessing (executing IP clustering)
python generateidx.py --dataset "New_York"
python generateidx.py --dataset "Los_Angeles"
python generateidx.py --dataset "Shanghai"
python preprocess.py --dataset "New_York"
python preprocess.py --dataset "Los_Angeles"
python preprocess.py --dataset "Shanghai"
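
The two preprocessing scripts can also be driven from a small Python loop equivalent to the six commands above. This is a convenience sketch, not part of the repository:

```python
# Convenience sketch: run both preprocessing steps for every dataset.
import subprocess

for dataset in ["New_York", "Los_Angeles", "Shanghai"]:
    for script in ["generateidx.py", "preprocess.py"]:
        subprocess.run(["python", script, "--dataset", dataset], check=True)
```
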
# run the model ExGeo
python main.py --dataset "New_York" --dim_in 30 --lr 2e-3 --saved_epoch 10
python main.py --dataset "Los_Angeles" --dim_in 30 --lr 2e-3 --saved_epoch 10
python main.py --dataset "Shanghai" --dim_in 51 --lr 1e-3 --saved_epoch 10
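
The three training commands differ only in --dim_in and --lr. The sketch below launches them with the per-dataset settings shown above; again, this is a convenience wrapper, not part of the repository:

```python
# Convenience sketch: per-dataset settings taken from the training commands above.
import subprocess

SETTINGS = {
    "New_York":    {"dim_in": "30", "lr": "2e-3"},
    "Los_Angeles": {"dim_in": "30", "lr": "2e-3"},
    "Shanghai":    {"dim_in": "51", "lr": "1e-3"},
}

for dataset, cfg in SETTINGS.items():
    subprocess.run(["python", "main.py", "--dataset", dataset,
                    "--dim_in", cfg["dim_in"], "--lr", cfg["lr"],
                    "--saved_epoch", "10"], check=True)
```
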
# load the checkpoint and then test
python test.py --dataset "New_York" --dim_in 30 --lr 2e-3 --load_epoch 100
python test.py --dataset "Los_Angeles" --dim_in 30 --lr 2e-3 --load_epoch 100
python test.py --dataset "Shanghai" --dim_in 51 --lr 1e-3 --load_epoch 70

The main hyperparameters used by main.py and test.py are described below:

| Hyperparameter | Description |
|---|---|
| seed | the random seed used for parameter initialization during training |
| model_name | the name of the model |
| dataset | the dataset used by main.py |
| lambda_1 | the trade-off coefficient of the data-perturbation term in the loss function |
| lambda_2 | the trade-off coefficient of the parameter-perturbation term in the loss function |
| lr | the learning rate |
| harved_epoch | the learning rate is halved if the performance does not improve for this many consecutive epochs |
| early_stop_epoch | training stops if the performance does not improve for this many consecutive epochs |
| saved_epoch | the interval (in epochs) at which checkpoints are saved for testing |
| dim_in | the dimension of the input data |
| dim_med | the dimension of the middle layers |
| dim_z | the dimension of the vector representation |
| eta | the magnitude of the data perturbation |
| zeta | the magnitude of the parameter perturbation |
| step | the number of gradient-ascent steps in a single parameter perturbation |
| mu | the inner learning rate of the parameter perturbation |
| c_mlp | whether to use collaborative_mlp for prediction |
| epoch_threshold | the epoch at which perturbations of both the data and the parameters start being added |
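
For reference, the table above corresponds to command-line flags of main.py and test.py (as used in the commands earlier). The sketch below shows how such an argument parser could look; flag names follow the table, but the exact types, defaults, and parsing details in the repository may differ:

```python
# Illustrative sketch only: an argparse layout mirroring the hyperparameter table.
import argparse

parser = argparse.ArgumentParser(description="ExGeo options (sketch)")
parser.add_argument("--seed", type=int, help="random seed for parameter initialization")
parser.add_argument("--model_name", type=str, help="name of the model")
parser.add_argument("--dataset", type=str, help="New_York / Los_Angeles / Shanghai")
parser.add_argument("--lambda_1", type=float, help="trade-off coefficient of the data perturbation")
parser.add_argument("--lambda_2", type=float, help="trade-off coefficient of the parameter perturbation")
parser.add_argument("--lr", type=float, help="learning rate")
parser.add_argument("--harved_epoch", type=int, help="epochs without improvement before halving the lr")
parser.add_argument("--early_stop_epoch", type=int, help="epochs without improvement before stopping")
parser.add_argument("--saved_epoch", type=int, help="checkpoint-saving interval in epochs")
parser.add_argument("--dim_in", type=int, help="input feature dimension")
parser.add_argument("--dim_med", type=int, help="middle-layer dimension")
parser.add_argument("--dim_z", type=int, help="representation dimension")
parser.add_argument("--eta", type=float, help="magnitude of the data perturbation")
parser.add_argument("--zeta", type=float, help="magnitude of the parameter perturbation")
parser.add_argument("--step", type=int, help="gradient-ascent steps per parameter perturbation")
parser.add_argument("--mu", type=float, help="inner learning rate of the parameter perturbation")
parser.add_argument("--c_mlp", action="store_true", help="use collaborative_mlp for prediction")
parser.add_argument("--epoch_threshold", type=int, help="epoch at which perturbations start")
args = parser.parse_args()
print(args)
```
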
The folder structure of ExGeo is as follows:

└── ExGeo
├── datasets # Contains three large-scale real-world street-level IP geolocation datasets.
│   ├── New_York # Street-level IP geolocation dataset collected from New York City, including 91,808 IP addresses.
│   ├── Los_Angeles # Street-level IP geolocation dataset collected from Los Angeles, including 92,804 IP addresses.
│   └── Shanghai # Street-level IP geolocation dataset collected from Shanghai, including 126,258 IP addresses.
├── lib # Contains the model implementation files.
│   ├── layers.py # The code of the attention mechanism.
│   ├── model.py # The core source code of the proposed ExGeo.
│   ├── sublayers.py # The support file for layers.py.
│   └── utils.py # Auxiliary functions.
├── asset # Contains saved checkpoints and logs produced when running the model.
│   ├── log # Contains logs produced when running the model.
│   └── model # Contains the saved checkpoints.
├── generateidx.py # Generates the indices of target nodes and landmark nodes.
├── preprocess.py # Preprocesses the dataset and executes IP clustering for model running.
├── main.py # Runs the model for training and testing.
├── test.py # Loads a checkpoint and then runs testing.
└── README.md

The "datasets" folder contains three subfolders corresponding to three large-scale real-world street-level IP geolocation datasets collected from New York City, Los Angeles, and Shanghai. There are three files in each subfolder (a loading sketch follows this list):
- data.csv # features (including attribute knowledge and network measurements) and labels (longitude and latitude) for street-level IP geolocation
- ip.csv # IP addresses
- last_traceroute.csv # last four routers and corresponding delays for efficient IP host clustering
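
The snippet below is a minimal sketch of how these files could be inspected with pandas; the paths assume the directory layout shown above, and the New_York dataset is used only as an example:

```python
# Minimal sketch (assumed paths follow the directory tree above).
import pandas as pd

root = "datasets/New_York"  # or Los_Angeles / Shanghai
data = pd.read_csv(f"{root}/data.csv")                  # features and longitude/latitude labels
ips = pd.read_csv(f"{root}/ip.csv")                     # IP addresses
last_hops = pd.read_csv(f"{root}/last_traceroute.csv")  # last four routers and their delays

print(data.shape, ips.shape, last_hops.shape)
```
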
The detailed columns of data.csv in the New York dataset are described as follows:
| Column Name | Data Description |
|---|---|
| ip | The IPv4 address |
| as_mult_info | The ID of the autonomous system where the IP is located |
| country | The country where the IP is located |
| prov_cn_name | The state/province where the IP is located |
| city | The city where the IP is located |
| isp | The Internet Service Provider of the IP |
| vp900/901/..._ping_delay_time | The ping delay from probing hosts "vp900/901/..." to the IP host |
| vp900/901/..._trace | The traceroute list from probing hosts "vp900/901/..." to the IP host |
| vp900/901/..._tr_steps | The number of steps in the traceroute from probing hosts "vp900/901/..." to the IP host |
| vp900/901/..._last_router_delay | The delay from the last router to the IP host in the traceroute list from probing hosts "vp900/901/..." |
| vp900/901/..._total_delay | The total delay from probing hosts "vp900/901/..." to the IP host |
| longitude | The longitude of the IP host (label) |
| latitude | The latitude of the IP host (label) |
PS: The columns of data.csv in the other two datasets are similar to those of the New York dataset.
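
Based on the column naming above, the following is a hedged sketch of splitting data.csv into delay-measurement features and coordinate labels; the column selection here is illustrative and not necessarily the exact feature set used by preprocess.py:

```python
# Illustrative only: select delay-related measurement columns and the labels.
import pandas as pd

data = pd.read_csv("datasets/New_York/data.csv")

# Measurement columns follow the "vp<id>_..." naming pattern described above.
delay_cols = [c for c in data.columns
              if c.endswith(("_ping_delay_time", "_last_router_delay", "_total_delay"))]

features = data[delay_cols]               # network measurement features
labels = data[["longitude", "latitude"]]  # geolocation labels
print(features.shape, labels.shape)
```
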