Build Your Own Backdoor Dataset
This is a simple script to build your own backdoor dataset. Specifically, this file includes two parts:
How to add a new dataset into our framework.
A further demonstration of how the class prepro_cls_DatasetBD_v2 works.
How to add a new dataset into our framework
Step 1: Download the dataset and put it into the folder ./data/
Step 2: Add the dataset name into the file ./data/__init__.py
Step 3: Add the dataset class into the file ./data/{dataset_name}.py
    from torch.utils.data import Dataset

    class YourDatasetName(Dataset):
        def __init__(self, ):
            ...

        def __len__(self):
            return ...

        def __getitem__(self, idx):
            return ...
Your custom dataset should follow the format of torch.utils.data.Dataset.
The __getitem__ function should return a tuple (img, label).
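As a minimal sketch of a class that satisfies these two requirements, the toy dataset below keeps its images and labels in memory. The class name ToyImageDataset and the nested-list "images" are assumptions made for illustration only; a real dataset would typically load PIL images from ./data/. The fallback to a plain base class is there only so the sketch runs without PyTorch installed.

```python
# Fall back to a plain base class when PyTorch is not installed, so the
# sketch still runs; in the framework you would subclass the real Dataset.
try:
    from torch.utils.data import Dataset
except ImportError:
    Dataset = object


class ToyImageDataset(Dataset):  # hypothetical name, for illustration only
    """Holds images and labels in memory and returns (img, label) pairs."""

    def __init__(self, images, labels):
        assert len(images) == len(labels)
        self.images = images
        self.labels = labels

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # No transform is applied here; transforms are attached later.
        return self.images[idx], self.labels[idx]


# Two 2x2 grayscale "images" stored as nested lists (stand-ins for PIL images).
toy_dataset = ToyImageDataset(
    images=[[[0, 0], [0, 0]], [[255, 255], [255, 255]]],
    labels=[0, 1],
)
img, label = toy_dataset[1]
```

Keeping __getitem__ free of transforms matches Step 4, where the dataset is expected without any transform and the transforms are returned separately.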
Step 4: Specify how to load your dataset and obtain num_classes, input_shape, normalization, and other basic information in backdoorbench/utils/aggregate_block/dataset_and_transform_generate.py. Take CIFAR-10 as an example.
    def dataset_and_transform_generate(args):
        ...
        elif args.dataset == 'cifar10':
            from torchvision.datasets import CIFAR10
            train_dataset_without_transform = CIFAR10(
                args.dataset_path,
                train=True,
                transform=None,
                download=True,
            )
            test_dataset_without_transform = CIFAR10(
                args.dataset_path,
                train=False,
                transform=None,
                download=True,
            )
        ...
        return train_dataset_without_transform, \
               train_img_transform, \
               train_label_transform, \
               test_dataset_without_transform, \
               test_img_transform, \
               test_label_transform
The train_dataset_without_transform and test_dataset_without_transform should be the dataset without any transform.
The train_img_transform and test_img_transform should be the transform for the input image.
The train_label_transform and test_label_transform should be the transform for the input label.
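To illustrate why the dataset and its transforms are returned separately, here is a rough sketch of a wrapper that applies the transforms only when a sample is accessed. TransformedDataset and scale_to_unit are hypothetical names invented for this example and are not the framework's own wrapper; a list of (img, label) pairs stands in for train_dataset_without_transform.

```python
class TransformedDataset:  # hypothetical helper, not part of the framework
    """Applies an image transform and a label transform on access."""

    def __init__(self, dataset_without_transform,
                 img_transform=None, label_transform=None):
        self.dataset = dataset_without_transform
        self.img_transform = img_transform
        self.label_transform = label_transform

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        img, label = self.dataset[idx]
        if self.img_transform is not None:
            img = self.img_transform(img)
        if self.label_transform is not None:
            label = self.label_transform(label)
        return img, label


# A list of (img, label) pairs stands in for train_dataset_without_transform.
raw_train = [([0, 128, 255], 3), ([255, 0, 0], 7)]


def scale_to_unit(img):
    """Toy image transform: scale pixel values to the range [0, 1]."""
    return [pixel / 255 for pixel in img]


train_dataset = TransformedDataset(raw_train, img_transform=scale_to_unit)
scaled_img, scaled_label = train_dataset[0]
```

Because the raw dataset is never modified, a trigger can later be injected into the untransformed images before any augmentation or normalization is applied.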
Further demonstration of how the class prepro_cls_DatasetBD_v2 works (the file is at backdoorbench/utils/bd_dataset_v2.py)
The basic idea is that this dataset takes a clean dataset without transformation, and we use poison_indicator to indicate whether each sample is poisoned. If a sample is poisoned, we apply bd_image_pre_transform and bd_label_pre_transform to it. Otherwise, the sample is left unchanged.
To save space, we use poisonedCLSDataContainer to store only the poisoned samples on disk or in RAM.
original_index_array records the index of each sample in the original dataset. For example, if the original dataset is the CIFAR-10 training set, then original_index_array will be the array of integers from 0 to 49999. Note that this array also works with the subset functionality.
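The role of original_index_array under subsetting can be sketched with a toy example. It assumes that taking a subset simply keeps the chosen positions of the index array, while poison_indicator stays indexed by the original position; plain Python lists are used here in place of NumPy arrays, and chosen_positions is a made-up variable.

```python
# Toy dataset of 10 samples; original samples 3 and 8 are marked as poisoned.
original_index_array = list(range(10))
poison_indicator = [0] * 10
poison_indicator[3] = 1
poison_indicator[8] = 1

# Assumed subset behavior: keep only the chosen positions of the index array.
chosen_positions = [1, 3, 8]
original_index_array = [original_index_array[i] for i in chosen_positions]

# Position 2 of the subset still points at original sample 8, so the
# poison flag (indexed by original index) is looked up correctly.
subset_flags = [poison_indicator[orig] for orig in original_index_array]
```

After the subset, the dataset has three entries, yet each one can still be traced back to its position in the full dataset.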
When we call this class
Step 1: We will first load all given information and set them.
    self.dataset = full_dataset_without_transform

    if poison_indicator is None:
        poison_indicator = np.zeros(len(full_dataset_without_transform))
    self.poison_indicator = poison_indicator

    assert len(full_dataset_without_transform) == len(poison_indicator)

    self.bd_image_pre_transform = bd_image_pre_transform
    self.bd_label_pre_transform = bd_label_pre_transform

    self.save_folder_path = save_folder_path  # since when we want to save this dataset, this may cause problem

    self.original_index_array = np.arange(len(full_dataset_without_transform))

    self.bd_data_container = poisonedCLSDataContainer(self.save_folder_path, ".png")
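The poison_indicator passed in here has one 0/1 entry per sample. One possible way to build it is sketched below: pick a fixed fraction of positions at random. The helper make_poison_indicator and its parameters are hypothetical; in practice the result would usually be a NumPy array rather than a list, and the selection rule depends on the attack.

```python
import random


def make_poison_indicator(num_samples, poison_ratio, seed=0):  # hypothetical helper
    """Return a 0/1 list with round(num_samples * poison_ratio) ones."""
    rng = random.Random(seed)
    num_poison = round(num_samples * poison_ratio)
    # Sample distinct positions so no index is chosen twice.
    poison_positions = rng.sample(range(num_samples), num_poison)
    indicator = [0] * num_samples
    for position in poison_positions:
        indicator[position] = 1
    return indicator


poison_indicator = make_poison_indicator(num_samples=100, poison_ratio=0.1)
```

The length of the indicator must equal the length of the clean dataset, which is what the assert in the code above checks.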
Step 2: We check whether any entry in poison_indicator is 1. If so, we perform trigger injection.
    if sum(self.poison_indicator) >= 1:
        self.prepro_backdoor()

    def prepro_backdoor(self):
        for selected_index in tqdm(self.original_index_array, desc="prepro_backdoor"):
            if self.poison_indicator[selected_index] == 1:
                img, label = self.dataset[selected_index]
                img = self.bd_image_pre_transform(img, target=label, image_serial_id=selected_index)
                bd_label = self.bd_label_pre_transform(label)
                self.set_one_bd_sample(
                    selected_index, img, bd_label, label
                )

    def set_one_bd_sample(self, selected_index, img, bd_label, label):
        '''
        1. To pil image
        2. set the image to container
        3. change the poison_index.

        logic is that no matter by the prepro_backdoor or not, after we set the bd sample,
        This method will automatically change the poison index to 1.

        :param selected_index: The index of bd sample
        :param img: The converted img that want to put in the bd_container
        :param bd_label: The label bd_sample has
        :param label: The original label bd_sample has
        '''
        # we need to save the bd img, so we turn it into PIL
        if (not isinstance(img, Image.Image)) and self.save_folder_path is not None:
            if isinstance(img, np.ndarray):
                img = img.astype(np.uint8)
            img = ToPILImage()(img)
        self.bd_data_container.setitem(
            key=selected_index,
            value=(img, bd_label, label),
            relative_loc_to_save_folder_name=f"{label}",
        )
        self.poison_indicator[selected_index] = 1
prepro_backdoor processes every sample marked as poisoned; set_one_bd_sample stores a single bd sample.
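The injection loop can be illustrated with a toy version that runs without the framework. Everything below is an assumption made for the sketch: add_corner_patch is a made-up trigger, all_to_one_label is a made-up label transform, images are nested lists, and a plain dict stands in for poisonedCLSDataContainer. Only the call pattern (image transform called with target and image_serial_id, label transform called with the label) follows the code above.

```python
import copy


def add_corner_patch(img, target=None, image_serial_id=None):
    """Toy trigger: set the bottom-right pixel to 255.

    The keyword arguments mirror the call made in prepro_backdoor;
    this toy trigger ignores them.
    """
    poisoned = copy.deepcopy(img)  # never modify the clean image in place
    poisoned[-1][-1] = 255
    return poisoned


def all_to_one_label(label, target_label=0):
    """Toy label transform: map every label to one target class."""
    return target_label


# Clean toy dataset: three 2x2 images with labels 1, 2 and 3.
clean_dataset = [
    ([[0, 0], [0, 0]], 1),
    ([[10, 10], [10, 10]], 2),
    ([[20, 20], [20, 20]], 3),
]
poison_indicator = [0, 1, 0]
bd_container = {}  # stand-in for poisonedCLSDataContainer

for selected_index in range(len(clean_dataset)):
    if poison_indicator[selected_index] == 1:
        img, label = clean_dataset[selected_index]
        bd_img = add_corner_patch(img, target=label, image_serial_id=selected_index)
        bd_label = all_to_one_label(label)
        # Store (poisoned image, poisoned label, original label) by original index.
        bd_container[selected_index] = (bd_img, bd_label, label)
```

Only the poisoned sample ends up in the container, keyed by its original index, while the clean dataset stays untouched.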
Step 3: We often need more than the image and label, for example the index and the original label. These extra fields are set here and returned by __getitem__.
    self.getitem_all = True
    self.getitem_all_switch = False

    def __getitem__(self, index):
        original_index = self.original_index_array[index]
        if self.poison_indicator[original_index] == 0:
            # clean
            img, label = self.dataset[original_index]
            original_target = label
            poison_or_not = 0
        else:
            # bd
            img, label, original_target = self.bd_data_container[original_index]
            poison_or_not = 1

        if not isinstance(img, Image.Image):
            img = ToPILImage()(img)

        if self.getitem_all:
            if self.getitem_all_switch:
                # this is for the case that you want original targets, but you do not want change your testing process
                return img, \
                       original_target, \
                       original_index, \
                       poison_or_not, \
                       label
            else:
                # here should corresponding to the order in the bd trainer
                return img, \
                       label, \
                       original_index, \
                       poison_or_not, \
                       original_target
        else:
            return img, label
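The three return layouts are easy to mix up, so the sketch below reproduces only the ordering logic for a single sample. pack_sample is a hypothetical helper written for this illustration; the string "img" stands in for an actual image.

```python
def pack_sample(img, label, original_index, poison_or_not, original_target,
                getitem_all=True, getitem_all_switch=False):
    """Mirror the return ordering of __getitem__ for one sample."""
    if not getitem_all:
        return img, label
    if getitem_all_switch:
        # The original label takes the second slot and the
        # (possibly poisoned) label moves to the last slot.
        return img, original_target, original_index, poison_or_not, label
    return img, label, original_index, poison_or_not, original_target


# A poisoned sample: original class 5, relabeled to target class 0.
sample = dict(img="img", label=0, original_index=42,
              poison_or_not=1, original_target=5)

default_view = pack_sample(**sample)
switched_view = pack_sample(**sample, getitem_all_switch=True)
plain_view = pack_sample(**sample, getitem_all=False)
```

With the switch on, code that reads the second element as the label sees the original target instead of the poisoned one, without any change to the consuming loop.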
Step 4 (Optional): We can also save the dataset for later use. retrieve_state and set_state are used to save and load all information in the dataset.
    def retrieve_state(self):
        return {
            "bd_data_container": self.bd_data_container.retrieve_state(),
            "getitem_all": self.getitem_all,
            "getitem_all_switch": self.getitem_all_switch,
            "original_index_array": self.original_index_array,
            "poison_indicator": self.poison_indicator,
            "save_folder_path": self.save_folder_path,
        }

    def set_state(self, state_file):
        self.bd_data_container = poisonedCLSDataContainer()
        self.bd_data_container.set_state(
            state_file['bd_data_container']
        )
        self.getitem_all = state_file['getitem_all']
        self.getitem_all_switch = state_file['getitem_all_switch']
        self.original_index_array = state_file["original_index_array"]
        self.poison_indicator = state_file["poison_indicator"]
        self.save_folder_path = state_file["save_folder_path"]
Saving and loading are also used in save_attack_result and load_attack_result; you can take a look at them.
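A round trip through retrieve_state and set_state can be sketched with a reduced stand-in class. ToyBackdoorState is hypothetical and keeps only four of the fields; serializing the state dictionary with pickle is an assumption made for this example, not a statement about how save_attack_result stores it.

```python
import pickle


class ToyBackdoorState:  # hypothetical stand-in for the real dataset class
    def __init__(self):
        self.getitem_all = True
        self.getitem_all_switch = False
        self.original_index_array = [0, 1, 2]
        self.poison_indicator = [0, 1, 0]

    def retrieve_state(self):
        # Collect everything needed to rebuild the object later.
        return {
            "getitem_all": self.getitem_all,
            "getitem_all_switch": self.getitem_all_switch,
            "original_index_array": self.original_index_array,
            "poison_indicator": self.poison_indicator,
        }

    def set_state(self, state_file):
        # Overwrite the current fields with the saved ones.
        self.getitem_all = state_file["getitem_all"]
        self.getitem_all_switch = state_file["getitem_all_switch"]
        self.original_index_array = state_file["original_index_array"]
        self.poison_indicator = state_file["poison_indicator"]


saved = ToyBackdoorState()
saved.getitem_all_switch = True
saved.poison_indicator = [1, 1, 0]
blob = pickle.dumps(saved.retrieve_state())  # serialize the state dict

restored = ToyBackdoorState()
restored.set_state(pickle.loads(blob))
```

Saving a plain dictionary rather than the object itself keeps the stored file independent of the clean dataset, which has to be supplied again when the state is loaded.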