Handling Duplicate Files in a RAG System

The real-world problem

Scenario: A file named haha.excel has already been imported into your RAG system. Your manager then sends a new haha.excel with different content. You need to replace the old file without affecting any other file.

Challenges:
- How do you know which chunks belong to the old haha.excel?
- How do you delete exactly those chunks without removing chunks from other files?
- How do you make sure the replacement itself does not introduce errors?

Detailed solution

Step 1: Design how each file is "remembered" (Metadata Design)

Problem: Once a file has been split into chunks, how do you tell which chunk came from which file?

Solution: Every chunk must carry a "label" that records its source file.

Example:

// Chunk from haha.excel
{
    "text": "Q1 2024 revenue grew 20%",
    "metadata": {
        "file_name": "haha.excel",           // File name
        "file_path": "/sales/haha.excel",    // Full path
        "chunk_number": 1,                   // Position of the chunk in the file
        "imported_date": "2024-01-15"        // Import date
    }
}

// Chunk from test.pdf
{
    "text": "2024 marketing plan",
    "metadata": {
        "file_name": "test.pdf",
        "file_path": "/marketing/test.pdf",
        "chunk_number": 1,
        "imported_date": "2024-01-10"
    }
}

Benefit: You can now find every chunk that belongs to haha.excel by filtering on file_name = "haha.excel".
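
The lookup described above can be sketched with a plain Python list standing in for the vector database. The helper name `find_chunks_by_file` and the in-memory `chunks` list are assumptions made for illustration; a real vector store would expose its own metadata filter.

```python
from typing import Any

Chunk = dict[str, Any]

# Hypothetical in-memory stand-in for the vector database.
chunks: list[Chunk] = [
    {"text": "Q1 2024 revenue grew 20%",
     "metadata": {"file_name": "haha.excel", "file_path": "/sales/haha.excel", "chunk_number": 1}},
    {"text": "Q2 2024 revenue grew 5%",
     "metadata": {"file_name": "haha.excel", "file_path": "/sales/haha.excel", "chunk_number": 2}},
    {"text": "2024 marketing plan",
     "metadata": {"file_name": "test.pdf", "file_path": "/marketing/test.pdf", "chunk_number": 1}},
]


def find_chunks_by_file(store: list[Chunk], file_name: str) -> list[Chunk]:
    """Return every chunk whose metadata label matches the given file name."""
    return [c for c in store if c["metadata"].get("file_name") == file_name]


haha_chunks = find_chunks_by_file(chunks, "haha.excel")
print(len(haha_chunks))  # 2
```

Because the label travels with each chunk, the same filter can later drive deletion without touching chunks from test.pdf.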

Step 2: Identify the file (File Identification)

Problem: When a user uploads a file, how do you know whether it already exists?

Method 1: Use the file name (simple)

def check_file_exists(filename):
    # Look up chunks in the database by file name
    existing_chunks = search_database(file_name=filename)
    return len(existing_chunks) > 0

# Example
if check_file_exists("haha.excel"):
    print("File already exists!")

Pros: Easy to understand and implement
Cons: Two haha.excel files in different folders are treated as the same file

Method 2: Use the full path (recommended)

def check_file_exists(full_path):
    # Example: "/sales/haha.excel" vs "/marketing/haha.excel"
    existing_chunks = search_database(file_path=full_path)
    return len(existing_chunks) > 0

# Example
if check_file_exists("/sales/haha.excel"):
    print("This file already exists!")

Pros: Files with the same name in different folders are no longer confused, as long as paths are stored in one consistent form
Cons: Slightly more complex
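
The difference between the two methods can be shown in a small sketch. The `indexed_files` list and both helper functions are hypothetical. The path lookup also normalizes paths with `posixpath.normpath`, on the assumption that paths are POSIX-style strings as in the examples above.

```python
import posixpath

# Hypothetical index: two different files share the name "haha.excel".
indexed_files = [
    {"file_name": "haha.excel", "file_path": "/sales/haha.excel"},
    {"file_name": "haha.excel", "file_path": "/marketing/haha.excel"},
]


def matches_by_name(files, file_name):
    """Method 1: look up by bare file name."""
    return [f for f in files if f["file_name"] == file_name]


def matches_by_path(files, file_path):
    """Method 2: look up by full path, normalized so equivalent spellings match."""
    wanted = posixpath.normpath(file_path)
    return [f for f in files if posixpath.normpath(f["file_path"]) == wanted]


by_name = matches_by_name(indexed_files, "haha.excel")
by_path = matches_by_path(indexed_files, "/sales/./haha.excel")
print(len(by_name), len(by_path))  # 2 1
```

The name lookup returns both files, so a replacement keyed on the name alone would delete the marketing copy as well.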

Method 3: Use a file "fingerprint" (advanced)

import hashlib

def get_file_fingerprint(file_content):
    # Build a "fingerprint" from the file's bytes (file_content must be bytes).
    # MD5 is enough to spot accidental duplicates; prefer hashlib.sha256
    # if files may come from untrusted sources.
    return hashlib.md5(file_content).hexdigest()

def check_file_exists(file_content):
    fingerprint = get_file_fingerprint(file_content)
    existing_chunks = search_database(file_fingerprint=fingerprint)
    return len(existing_chunks) > 0

Pros: Detects byte-identical files regardless of name or path (hash collisions are possible in theory but negligible for accidental duplicates)
Cons: More complex, and a fingerprint means nothing to end users
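
For large files, the fingerprint can be computed block by block instead of loading the whole file into memory. The sketch below swaps in SHA-256 for the MD5 used above; the function name `fingerprint_stream` and the block size are assumptions, and `io.BytesIO` stands in for real uploaded files.

```python
import hashlib
import io
from typing import BinaryIO


def fingerprint_stream(stream: BinaryIO, block_size: int = 64 * 1024) -> str:
    """Hash a binary file-like object block by block and return the hex digest."""
    digest = hashlib.sha256()
    # iter() with a sentinel keeps reading until read() returns empty bytes.
    for block in iter(lambda: stream.read(block_size), b""):
        digest.update(block)
    return digest.hexdigest()


old_version = io.BytesIO(b"Q1 revenue: 100")
same_bytes = io.BytesIO(b"Q1 revenue: 100")
new_version = io.BytesIO(b"Q1 revenue: 120")

fp_old = fingerprint_stream(old_version)
fp_same = fingerprint_stream(same_bytes)
fp_new = fingerprint_stream(new_version)

print(fp_old == fp_same)  # True: identical bytes give the same fingerprint
print(fp_old == fp_new)   # False: changed content gives a different fingerprint
```

Storing this value in each chunk's metadata (for example under a `file_fingerprint` key) is what makes the content comparison in the next step possible.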

Step 3: Duplicate detection algorithm (Duplicate Detection)

Goal: When a user uploads a file, decide automatically what to do with it.

def handle_file_upload(uploaded_file):
    file_path = get_full_path(uploaded_file)

    # Step 1: Check whether the file already exists
    existing_file = find_existing_file(file_path)

    if existing_file is None:
        # Case 1: Completely new file
        print("New file, importing normally")
        return import_new_file(uploaded_file)

    # Step 2: The file exists, so check whether the content differs
    old_fingerprint = existing_file.fingerprint
    new_fingerprint = calculate_fingerprint(uploaded_file)

    if old_fingerprint == new_fingerprint:
        # Case 2: Identical file, nothing to do
        print("File is identical to the existing one, skipping")
        return "SKIPPED"

    # Case 3: Same path but different content
    print("File exists but the content differs, replacing")
    return replace_existing_file(file_path, uploaded_file)
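
The three cases can also be separated from the database calls so the decision is easy to test on its own. This is a minimal sketch; the `UploadAction` enum and `decide_action` function are hypothetical names, not part of the code above.

```python
from enum import Enum
from typing import Optional


class UploadAction(Enum):
    IMPORT = "import"    # path not seen before
    SKIP = "skip"        # same path, same content
    REPLACE = "replace"  # same path, different content


def decide_action(existing_fingerprint: Optional[str], new_fingerprint: str) -> UploadAction:
    """Map the three cases to an action without touching the database."""
    if existing_fingerprint is None:
        return UploadAction.IMPORT
    if existing_fingerprint == new_fingerprint:
        return UploadAction.SKIP
    return UploadAction.REPLACE


print(decide_action(None, "abc"))   # UploadAction.IMPORT
print(decide_action("abc", "abc"))  # UploadAction.SKIP
print(decide_action("abc", "xyz"))  # UploadAction.REPLACE
```

Keeping the decision pure means the upload handler only has to look up the stored fingerprint and dispatch on the result.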

File replacement process (step by step)

Step 1: Preparation

from datetime import datetime

def prepare_file_replacement(file_path, new_file):
    # 1. Find all chunks of the old file
    old_chunks = find_chunks_by_path(file_path)
    print(f"Found {len(old_chunks)} old chunks to delete")

    # 2. Back up the old file's data (for rollback if needed)
    backup_data = {
        "chunks": old_chunks,
        "backup_time": datetime.now(),
        "file_path": file_path
    }
    save_backup(backup_data)

    # 3. Process the new file into chunks
    new_chunks = process_file_to_chunks(new_file)
    print(f"New file has {len(new_chunks)} chunks")

    return old_chunks, new_chunks, backup_data
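
If the backup is written to disk as JSON, the timestamp has to be stored as a string, because a raw `datetime` is not JSON-serializable. The sketch below shows one way to do that; `build_backup` is a hypothetical helper, and the chunk layout follows the metadata example from Step 1.

```python
import json
from datetime import datetime, timezone


def build_backup(file_path, old_chunks):
    """Snapshot the old chunks in a JSON-friendly structure."""
    return {
        "file_path": file_path,
        # isoformat() yields a string that json.dumps can handle.
        "backup_time": datetime.now(timezone.utc).isoformat(),
        "chunks": old_chunks,
    }


old_chunks = [
    {"text": "Q1 2024 revenue grew 20%",
     "metadata": {"file_path": "/sales/haha.excel", "chunk_number": 1}},
]

# Round trip: serialize the backup, then load it back as a rollback would.
serialized = json.dumps(build_backup("/sales/haha.excel", old_chunks), ensure_ascii=False)
restored = json.loads(serialized)
print(restored["chunks"] == old_chunks)  # True
```

A real backup would usually also need the chunk embeddings, or a way to recompute them, to restore the index exactly.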

Step 2: Safe replacement

def replace_file_safely(old_chunks, new_chunks, file_path):
    try:
        # Step 1: Add the new chunks FIRST (to avoid losing data)
        print("Adding new chunks...")
        add_chunks_to_database(new_chunks)

        # Step 2: Verify the new chunks were added. Old and new chunks share
        # the same file_path at this point, so both sets should be present.
        verify_chunks = find_chunks_by_path(file_path)
        if len(verify_chunks) < len(old_chunks) + len(new_chunks):
            raise RuntimeError("Failed to add new chunks!")

        # Step 3: Delete the old chunks only AFTER the new ones are confirmed
        print("Deleting old chunks...")
        delete_chunks_from_database(old_chunks)

        print("File replaced successfully!")
        return True

    except Exception as e:
        print(f"Error: {e}")
        # Rollback: remove the new chunks if anything failed
        cleanup_failed_chunks(new_chunks)
        return False
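
The add, verify, delete order can be tried out against a plain dictionary. In this sketch the store layout (chunk ID mapped to text and path) and the function `replace_chunks` are assumptions. New chunks are verified by ID, so leftover old chunks under the same path cannot hide a failed insert.

```python
import uuid


def replace_chunks(store, file_path, new_texts):
    """Add-verify-delete replacement on a dict keyed by chunk ID."""
    old_ids = [cid for cid, c in store.items() if c["file_path"] == file_path]

    # 1. Add the new chunks first, remembering their IDs.
    new_ids = []
    for text in new_texts:
        cid = str(uuid.uuid4())
        store[cid] = {"text": text, "file_path": file_path}
        new_ids.append(cid)

    # 2. Verify by ID rather than by counting chunks under the path.
    if not all(cid in store for cid in new_ids):
        for cid in new_ids:
            store.pop(cid, None)  # roll back the partial insert
        return False

    # 3. Only now delete the old chunks.
    for cid in old_ids:
        del store[cid]
    return True


store = {
    "a1": {"text": "old revenue figures", "file_path": "/sales/haha.excel"},
    "b1": {"text": "2024 marketing plan", "file_path": "/marketing/test.pdf"},
}
ok = replace_chunks(store, "/sales/haha.excel", ["new revenue figures", "new forecast"])
print(ok, len(store))  # True 3
```

Between steps 1 and 3 both versions are searchable, so queries may briefly return old and new content together; that is the price of never having a window with no data.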

Step 3: Verification

def verify_replacement(file_path):
    # 1. Run a test query to make sure only the new content comes back
    test_query = "Q1 revenue"
    results = search_rag_system(test_query)

    # 2. Check that every result from this file was imported today
    #    (today() is assumed to return a value in the same format as imported_date)
    for result in results:
        if result.metadata.file_path == file_path:
            assert result.metadata.imported_date == today()

    # 3. Check that the other files still work as before
    other_files = ["test.pdf", "report.docx"]
    for other_file in other_files:
        test_results = search_file_content(other_file)
        assert len(test_results) > 0, f"File {other_file} was affected!"

    print("Verification succeeded!")
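
A stricter check than a single test query is to compare snapshots of the store taken before and after the replacement. This is a sketch under assumptions: chunks are dictionaries with a `metadata` dict, each chunk records a `file_fingerprint`, and `check_replacement` is a hypothetical helper.

```python
from collections import Counter


def chunk_counts_by_path(chunks):
    """Count chunks per file_path."""
    return Counter(c["metadata"]["file_path"] for c in chunks)


def check_replacement(before, after, replaced_path, new_fingerprint):
    """Return a list of problems; an empty list means the replacement looks clean."""
    problems = []
    # Every chunk left under the replaced path should carry the new fingerprint.
    for c in after:
        meta = c["metadata"]
        if meta["file_path"] == replaced_path and meta.get("file_fingerprint") != new_fingerprint:
            problems.append(f"stale chunk under {replaced_path}")
    # Chunk counts of all other files should be unchanged.
    counts_before = chunk_counts_by_path(before)
    counts_after = chunk_counts_by_path(after)
    for path, count in counts_before.items():
        if path != replaced_path and counts_after.get(path, 0) != count:
            problems.append(f"chunk count changed for {path}")
    return problems


def make_chunk(path, fingerprint):
    return {"metadata": {"file_path": path, "file_fingerprint": fingerprint}}


before = [make_chunk("/sales/haha.excel", "old"), make_chunk("/marketing/test.pdf", "pdf")]
clean_after = [make_chunk("/sales/haha.excel", "new"), make_chunk("/marketing/test.pdf", "pdf")]
stale_after = clean_after + [make_chunk("/sales/haha.excel", "old")]

print(check_replacement(before, clean_after, "/sales/haha.excel", "new"))
print(check_replacement(before, stale_after, "/sales/haha.excel", "new"))
```

The first call reports no problems, while the second flags the old chunk that was left behind.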

Complete code example

from datetime import datetime

class FileUpdateHandler:
    def __init__(self, vector_db):
        self.db = vector_db

    def update_file(self, file_path, new_file_content):
        """
        Main entry point for updating a file in the RAG system
        """
        print(f"Starting update for file: {file_path}")

        # Step 1: Check whether the file exists
        old_chunks = self.find_existing_chunks(file_path)

        if not old_chunks:
            print("New file, importing normally")
            return self.import_new_file(file_path, new_file_content)

        print(f"File already exists with {len(old_chunks)} chunks")

        # Step 2: Process the new file
        new_chunks = self.process_file(new_file_content, file_path)
        print(f"New file has {len(new_chunks)} chunks")

        # Step 3: Back up the old data
        backup_id = self.create_backup(old_chunks)

        # Step 4: Perform the replacement
        try:
            # Add new chunks
            self.db.add_chunks(new_chunks)
            print("✓ New chunks added")

            # Verify new chunks
            if not self.verify_new_chunks(file_path, len(new_chunks)):
                raise Exception("New chunks were not added correctly")

            # Delete old chunks
            self.db.delete_chunks([c.id for c in old_chunks])
            print("✓ Old chunks deleted")

            # Final verification
            self.verify_replacement_success(file_path)
            print("✓ File updated successfully!")

            return True

        except Exception as e:
            print(f"✗ Error: {e}")
            print("Rolling back...")
            self.rollback_from_backup(backup_id)
            return False

    def find_existing_chunks(self, file_path):
        """Find all chunks of a file by its path"""
        return self.db.query(
            where={"file_path": file_path}
        )

    def process_file(self, content, file_path):
        """Split a file into chunks and attach metadata to each one"""
        chunks = self.chunk_text(content)
        processed_chunks = []

        for i, chunk_text in enumerate(chunks):
            chunk = {
                "text": chunk_text,
                "metadata": {
                    "file_path": file_path,
                    "chunk_index": i,
                    "imported_at": datetime.now().isoformat()
                }
            }
            processed_chunks.append(chunk)

        return processed_chunks
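
The class calls `self.chunk_text` without defining it. One possible stand-in is a fixed-size character splitter with overlap, sketched below as a free function. The default sizes are arbitrary assumptions, and a production system would more likely split on sentences or tokens.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


pieces = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(pieces)  # ['abcd', 'defg', 'ghij']
```

Each window repeats the last character of the previous one, which keeps a little context across chunk boundaries.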

Quick summary

Remember: Every chunk needs a "label" that says which file it came from
Identify: Use the file path as the unique identifier
Detect: Check whether the file already exists, then decide between a fresh import and a replacement
Replace: Add new → Verify → Delete old → Check the result
Safety: Always back up first, and roll back on error

Golden rule: "Add first, delete later" to avoid losing data.