
CAPTCHA-Solver
Completed
Two-Stage Computer Vision System
Overview
CAPTCHA-Solver is a computer vision system designed to automatically solve SHEIN CAPTCHA challenges using a two-stage YOLO11 architecture. It handles the full interaction loop: it first interprets the visual instructions, then executes the sequence of clicks the CAPTCHA requires.
Architecture
The system relies on a sequential two-stage YOLO11 pipeline. Stage 1 parses the instruction prompt to identify target icons and determine the required click order. Stage 2 then scans the main CAPTCHA image to locate the coordinates of these specific targets and executes clicks in the exact sequence identified by the first stage. Training is supported by a dynamic data generator creating 10k+ highly augmented samples.
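The handoff between the two stages can be sketched in plain Python. This is a minimal illustration rather than the project's actual code: the `Detection` type and the functions `order_prompt_icons` and `plan_clicks` are hypothetical names, and it assumes that the prompt icons are read left to right and that each detection has already been produced by a model as a label, a confidence, and a pixel box.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected icon (hypothetical container, not a YOLO API type)."""
    label: str
    confidence: float
    box: tuple  # (x1, y1, x2, y2) in pixels


def order_prompt_icons(prompt_detections):
    """Stage 1: derive the target order from the instruction prompt.

    Assumption: the prompt lists icons left to right, so sorting by the
    left edge of each box recovers the required order.
    """
    ordered = sorted(prompt_detections, key=lambda d: d.box[0])
    return [d.label for d in ordered]


def plan_clicks(target_order, image_detections):
    """Stage 2: map each target label to the center of its best detection."""
    best = {}
    for det in image_detections:
        current = best.get(det.label)
        if current is None or det.confidence > current.confidence:
            best[det.label] = det

    points = []
    for label in target_order:
        if label not in best:
            # Fail loudly instead of guessing a location.
            raise LookupError(f"target '{label}' not found in main image")
        x1, y1, x2, y2 = best[label].box
        points.append(((x1 + x2) / 2, (y1 + y2) / 2))
    return points
```

Keeping the order as an explicit list of labels means Stage 2 needs no knowledge of the prompt layout; it only has to resolve each label to a location.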
Key Features
Two-Stage Vision Architecture
Stage 1 identifies the target icons and their specific order, while Stage 2 precisely locates and interacts with the targets in the required sequence.
Synthetic Data Generation Engine
Built an automated engine that synthesized 10k+ labeled training samples, incorporating 360° rotation, color variations, and scaling.
Real-World Background Integration
Improved model robustness by compositing synthetic samples onto actual backgrounds from the original SHEIN CAPTCHAs, which reduces overfitting to artificial backgrounds.
YOLO11 Detection
Uses the YOLO11 architecture for object detection in both stages, achieving near-perfect solve rates.
Tech Stack
AI/ML
Backend
Data
Challenges & Solutions
Challenge: Standard object detection models cannot inherently understand the required interaction sequence for complex CAPTCHAs.
Solution: Engineered a two-stage vision system where Stage 1 explicitly extracts the target icons and their order from the prompt, then passes that state to Stage 2 for sequential execution.
Challenge: Labeled CAPTCHA datasets containing the necessary edge cases and background noise were extremely scarce.
Solution: Built a data generation engine that synthesized 10k+ labeled training samples.
Challenge: The model struggled to generalize across visual distortions such as rotation and variable sizing.
Solution: Incorporated 360° rotation, scaling, color variations, and actual CAPTCHA backgrounds into the synthetic data pipeline.
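The label geometry behind rotation and scaling augmentation can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's generator: the function names, the `scale_range` default, and the choice to label each icon with the axis-aligned box of its rotated rectangle are all assumptions, and the image compositing itself is omitted. The label line follows the standard YOLO text format of a class id followed by normalized center x, center y, width, and height.

```python
import math
import random


def rotated_box_size(width, height, angle_deg):
    """Axis-aligned bounding box size of a width x height rectangle
    rotated by angle_deg about its center."""
    theta = math.radians(angle_deg)
    cos_t = abs(math.cos(theta))
    sin_t = abs(math.sin(theta))
    return width * cos_t + height * sin_t, width * sin_t + height * cos_t


def yolo_label(class_id, cx, cy, box_w, box_h, img_w, img_h):
    """Format one YOLO label line with coordinates normalized to [0, 1]."""
    return (
        f"{class_id} {cx / img_w:.6f} {cy / img_h:.6f} "
        f"{box_w / img_w:.6f} {box_h / img_h:.6f}"
    )


def sample_placement(icon_w, icon_h, img_w, img_h, rng, scale_range=(0.8, 1.2)):
    """Draw a random rotation, scale, and center so the rotated icon
    stays fully inside the background.

    Assumption: the scaled, rotated icon is smaller than the background.
    """
    angle = rng.uniform(0.0, 360.0)
    scale = rng.uniform(*scale_range)
    box_w, box_h = rotated_box_size(icon_w * scale, icon_h * scale, angle)
    cx = rng.uniform(box_w / 2, img_w - box_w / 2)
    cy = rng.uniform(box_h / 2, img_h - box_h / 2)
    return angle, scale, (cx, cy, box_w, box_h)
```

Because the box is derived from the icon's full rectangle, it can be slightly looser than the visible pixels of a non-rectangular icon; a tighter box would require inspecting the rotated alpha mask.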