Overview
FKApi scrapes football kit data from footballkitarchive.com using a robust, ethical scraping system built with Python’s requests and BeautifulSoup4 libraries. The system includes retry logic, proxy support, rate limiting, and comprehensive error handling.
All scraping logic is in fkapi/core/scrapers.py (~1400 lines).
Architecture
Core Scraping Functions
scrape_kit()
Scrapes a single kit using its slug and optional kit_id.

Parameters:
- The kit’s base slug from the URL (e.g., “barcelona-2024-25-home-kit”)
- Kit ID for the new URL format (e.g., “402174”)
- Whether to use a proxy for the request
- ID of an existing kit to update (instead of creating a new one)

Returns: Kit | None - the created/updated Kit object, or None if scraping failed
Features:
- Automatic retry with exponential backoff (max 3 retries)
- Handles URL format changes (old vs new format)
- 403 detection triggers automatic proxy use
- 404 handling with new URL format fallback
- Validates page existence before parsing
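The retry-and-fallback flow above can be sketched as a small decision function. This is a simplified illustration; the function name, return values, and exact ordering are assumptions, not the actual code in fkapi/core/scrapers.py:

```python
# Simplified sketch of scrape_kit()'s retry/fallback decisions.
MAX_RETRIES = 3

def next_action(status_code: int, attempt: int, using_proxy: bool) -> str:
    """Map an HTTP status and retry state to the scraper's next step."""
    if status_code == 200:
        return "parse"                # success: proceed to parse the page
    if attempt >= MAX_RETRIES:
        return "give_up"              # retry budget exhausted
    if status_code == 403 and not using_proxy:
        return "retry_with_proxy"     # 403 detection triggers proxy use
    if status_code == 404:
        return "try_new_url_format"   # fall back to the new URL format
    return "retry_with_backoff"       # other errors: retry with backoff
```

The real implementation interleaves these decisions with the actual requests; the sketch only captures the branching.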
scrape_club_details()
Scrapes club information including name, logo, and country.

Parameters:
- The club’s slug (e.g., “barcelona”)
- Whether to use a proxy

Returns: Club | None
Extracted Data:
- Club name (from page title)
- Logo URL (both light and dark mode)
- Creates or updates Club record
scrape_whole_club()
Scrapes all kits for a given club across all seasons. Always uses a proxy to prevent IP bans.

Parameters:
- The Club model instance to scrape

Returns: Club | None
Process:
- Fetches club’s kit archive page
- Iterates through each season container
- Extracts brand information
- Processes each kit in the season
- Uses atomic transactions for data integrity
scrape_latest()
Scrapes the “Latest Kits” page to find newly added kits.

Parameters:
- Page number to scrape
- Whether to use a proxy

Returns: tuple[bool, bool] - (success, all_kits_exist)
Behavior:
- Checks if kits already exist in database
- Only queues new kits for scraping
- Returns early if all kits on page exist
- Dispatches scraping to Celery tasks for parallel processing
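The early-exit behavior amounts to a set-difference check before dispatching any tasks. A minimal sketch (the function name is hypothetical):

```python
def kits_to_queue(page_slugs: list[str], existing_slugs: set[str]) -> list[str]:
    """Return only the kits on the page that are not yet in the database.
    An empty result means every kit on the page already exists, so the
    scraper can return early instead of dispatching Celery tasks."""
    return [slug for slug in page_slugs if slug not in existing_slugs]
```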
scrape_latest_pages()
Scrapes multiple pages from the latest kits section. Pages are processed in reverse order (newest to oldest).

Parameters:
- First page to scrape
- Last page to scrape
- Whether to use a proxy
- Delay in seconds between pages
- Optional callback function for progress reporting

Returns: tuple[int, int] - (success_count, failure_count)
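The reverse-order traversal with a per-page delay can be sketched as a generator (names and structure are illustrative, not the actual implementation):

```python
import time

def iter_pages(first_page: int, last_page: int, delay: float = 2.0):
    """Yield page numbers newest-to-oldest (last_page down to first_page),
    sleeping `delay` seconds between pages, matching the default 2-second
    delay described above."""
    for i, page in enumerate(range(last_page, first_page - 1, -1)):
        if i:  # no sleep before the first page
            time.sleep(delay)
        yield page
```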
HTTP Layer
http_get()
Centralized HTTP request function with retry logic and proxy support (defined in core/http.py).
Features:
- Automatic retries on failure
- Rotating proxy support
- Custom headers (User-Agent, Accept, etc.)
- Connection pooling
- Timeout handling
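A stripped-down sketch of such a wrapper follows. The real http_get() signature is unknown; here the transport is injected as a callable (a stand-in for requests.get) so the retry loop is visible without touching the network:

```python
import time

MAX_RETRIES = 3
RETRY_DELAY = 2  # seconds

def http_get(url, fetch, proxies=None, retries=MAX_RETRIES, delay=RETRY_DELAY):
    """Retry `fetch` up to `retries` times, sleeping between attempts.
    Returns the response on HTTP 200, or None if all attempts fail."""
    for attempt in range(retries):
        try:
            response = fetch(url, proxies=proxies, timeout=10)
            if response.get("status") == 200:
                return response
        except OSError:
            pass  # connection error: fall through and retry
        if attempt < retries - 1:
            time.sleep(delay)
    return None
```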
HTML Parsing
BeautifulSoup4
All HTML parsing uses BeautifulSoup4 with the lxml parser.
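For example, kit cards can be pulled out of the results grid like this. The CSS classes come from the constants table below, but the surrounding HTML is invented for illustration:

```python
from bs4 import BeautifulSoup

HTML_PARSER = "lxml"  # the HTML_PARSER constant; fallback below if absent

html = """
<div class="archive-results-grid">
  <div class="archive-result"><h3>Barcelona 2024-25 Home Kit</h3></div>
  <div class="archive-result"><h3>Barcelona 2024-25 Away Kit</h3></div>
</div>
"""

try:
    soup = BeautifulSoup(html, HTML_PARSER)
except Exception:
    soup = BeautifulSoup(html, "html.parser")  # stdlib fallback

kits = soup.find_all("div", class_="archive-result")
names = [kit.h3.get_text(strip=True) for kit in kits]
```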
Common Extraction Patterns

The parsing helpers share a few recurring extraction patterns:
- Extract kit name
- Extract logo
- Extract fact table
- Extract kit details
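As an illustration of the fact-table pattern, a key/value table can be flattened into a dict. The table markup and the fact-table class name are invented for this example; production code uses the lxml parser rather than the stdlib one used here:

```python
from bs4 import BeautifulSoup

html = """
<table class="fact-table">
  <tr><th>Brand</th><td>Nike</td></tr>
  <tr><th>Season</th><td>2024-25</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# One dict entry per row: header cell -> data cell
facts = {
    row.th.get_text(strip=True): row.td.get_text(strip=True)
    for row in soup.select("table.fact-table tr")
}
```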
Error Handling
Exception Hierarchy
Custom exceptions defined in core/exceptions.py:
- ScrapingError: Base exception for scraping errors
- KitNotFoundError: Kit page not found (404)
- ClubNotFoundError: Club page not found (404)
- RateLimitExceededError: Rate limit exceeded (403)
- InvalidSeasonError: Invalid season format
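The hierarchy presumably looks like the following; that every specific error subclasses ScrapingError is an assumption beyond ScrapingError being named the base:

```python
class ScrapingError(Exception):
    """Base exception for scraping errors."""

class KitNotFoundError(ScrapingError):
    """Kit page not found (404)."""

class ClubNotFoundError(ScrapingError):
    """Club page not found (404)."""

class RateLimitExceededError(ScrapingError):
    """Rate limit exceeded (403)."""

class InvalidSeasonError(ScrapingError):
    """Invalid season format."""
```

Catching ScrapingError then handles every scraping failure in one except clause.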
Retry Logic
Check Response
- 200: Success, proceed to parse
- 403: Rate limited, retry with proxy
- 404: Page not found, try URL format fallback
- Other errors: Retry with backoff
Retry with Backoff
- Max retries: 3 (from the MAX_RETRIES constant)
- Retry delay: 2 seconds (from the RETRY_DELAY constant)
- Each retry increments the attempt counter
Rate Limiting & Proxies
Ethical Scraping Practices
Built-in Protections:

Delay Between Requests
- scrape_latest_pages() enforces a configurable delay (default 2 seconds)
- scrape_user_collection_api() uses a 0.5-second delay between pages

Automatic Proxy Use
- 403 responses trigger an automatic proxy retry
- Bulk operations (scrape_whole_club) always use a proxy
- Proxy rotation prevents IP bans

Request Throttling
- The Celery task queue prevents overwhelming the server
- Tasks process kits sequentially or in controlled parallelism
Proxy Configuration
Proxy settings are configured in environment variables (specifics in the deployment config).
URL Format Handling
Old vs New Format
FootballKitArchive.com uses two URL formats:
- Old format: the kit ID digits were appended directly to the slug (e.g., /barcelona-2024-25-home-kit402174)
- New format: the kit ID (e.g., “402174”) is handled separately from the base slug

Automatic Conversion
The _try_new_url_format() function automatically detects old-format slugs and converts them.
URL Building
Season Parsing
Season Format Variations
The scraper handles multiple season input formats:

| Input Format | Normalized | first_year | second_year |
|---|---|---|---|
| 2024 | 2024 | 2024 | None |
| 23-24 | 2023-24 | 2023 | 2024 |
| 2023-24 | 2023-24 | 2023 | 2024 |
| 2023-2024 | 2023-2024 | 2023 | 2024 |
| 99-00 | 1999-00 | 1999 | 2000 |
get_season()
Parses the season from kit slugs.
Year Validation
Valid Range
- Years must be between 1800 and 2100
- Prevents obvious input errors
- Historical and future seasons are supported

Span Validation
- Modern seasons (1960+): max 2-year span
- Historical seasons (pre-1960): max 20-year span
- Prevents incorrect season parsing

Order Validation
- second_year must be ≥ first_year
- Catches reversed or invalid input
API Scraping
In addition to HTML scraping, FKApi can scrape from FootballKitArchive’s internal APIs.
scrape_user_collection_api()
Scrapes a user’s kit collection using the collection-feed API.

Parameters:
- User ID from FootballKitArchive

Returns: dict with:
- success: bool
- entries: list of kit entries
- total_entries: int
- pages_scraped: int
- user: user info dict (if available)

Features:
- Automatic pagination (fetches all pages)
- Filters out custom entries (custom_team, custom_type, etc.)
- Cleans unwanted fields from response
- 0.5 second delay between pages
- Returns enriched data with metadata
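The custom-entry filter might look like this; field names beyond custom_team and custom_type are guesses, and the exact filtering rule (any truthy custom_* field) is an assumption:

```python
def filter_entries(entries: list[dict]) -> list[dict]:
    """Drop entries that are user-created customs rather than real kits:
    any entry with a truthy custom_* field is excluded."""
    def is_custom(entry: dict) -> bool:
        return any(key.startswith("custom_") and entry[key] for key in entry)
    return [entry for entry in entries if not is_custom(entry)]
```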
scrape_user_info_api()
Scrapes user profile information.

Parameters:
- User ID from FootballKitArchive

Returns: dict | None with user data:
- id: User ID
- name: Username
- image: Profile image URL
- Additional profile fields
Data Processing Flow
ScrapingService.process_kit_data()
The central processing function for scraped kit data.
Transaction Management
All scraping operations use atomic transactions.
Constants
Scraping configuration constants (defined in core/constants.py):
| Constant | Value | Purpose |
|---|---|---|
| BASE_URL | https://www.footballkitarchive.com | Base website URL |
| MAX_RETRIES | 3 | Maximum retry attempts |
| RETRY_DELAY | 2 | Seconds between retries |
| HTTP_STATUS_FORBIDDEN | 403 | Rate limit status code |
| HTML_PARSER | lxml | BeautifulSoup parser |
| KIT_CLASS | archive-result | CSS class for kit elements |
| KIT_CONTAINER_CLASS | archive-results-grid | CSS class for kit grid |
| COLLECTION_CONTAINER_CLASS | collection-container | CSS class for season containers |
| SECTION_DETAILS_CLASS | section-details | CSS class for section details |
| KIT_SEASON_CLASS | kit-season | CSS class for kit season |
| LATEST_PAGE_URL | /latest/ | Latest kits page path |
| DEFAULT_LOGO_URL | /static/logos/not_found.png | Fallback logo |
Celery Integration
For parallel processing and scheduled scraping, FKApi integrates with Celery.
Task Definitions
Scheduled Scraping
Configured in settings.py:
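For instance, a periodic scrape of the latest kits could be wired up like this (the task path and cadence are hypothetical, not the project’s actual schedule):

```python
# settings.py (sketch) - Celery beat entry for scheduled scraping
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "scrape-latest-kits": {
        "task": "fkapi.tasks.scrape_latest_task",   # hypothetical task path
        "schedule": crontab(minute=0, hour="*/6"),  # every six hours
    },
}
```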
Logging
Scraping operations use Python’s logging module:
- DEBUG: Detailed scraping progress
- INFO: Successful operations
- WARNING: Recoverable errors (rate limits, retries)
- ERROR: Failed operations
Best Practices
Use Proxies for Bulk
Always enable proxies when scraping multiple pages or entire clubs
Respect Rate Limits
Use appropriate delays between requests (2+ seconds recommended)
Handle Errors Gracefully
Use try-except blocks and return None on failure
Use Transactions
Wrap database operations in atomic transactions
Log Everything
Comprehensive logging helps debug scraping issues
Validate Data
Always validate scraped data before saving
Troubleshooting
403 Forbidden Errors
Cause: Rate limiting by the source website
Solution:
- Enable the proxy: use_proxy=True
- Increase the delay between requests
- Check whether the IP is banned
- Use Celery for distributed scraping
404 Not Found Errors
Cause: Page moved or URL format changed
Solution:
- The scraper automatically tries the new URL format
- Verify the slug is correct
- Check whether the kit exists on the website
- Use the kit_id parameter if available
Parsing Errors
Cause: HTML structure changed on the source website
Solution:
- Update the CSS selectors in core/parsers.py
- Check the KIT_CLASS and SECTION_DETAILS_CLASS constants
- Verify the BeautifulSoup selectors
Season Parsing Issues
Cause: Unusual season format
Solution:
- Check the get_season() logic
- Add a new format handler in the _parse_*_year() functions
- Validate year ranges (1800-2100)
Database Integrity Errors
Cause: Missing foreign key relationships
Solution:
- Ensure clubs, brands, and seasons exist before creating kits
- Use scrape_club_details() to create missing clubs
- Use get_season() to auto-create seasons
- Wrap operations in transaction.atomic()