Overview

FKApi scrapes football kit data from footballkitarchive.com using a robust, ethical scraping system built with Python’s requests and BeautifulSoup4 libraries. The system includes retry logic, proxy support, rate limiting, and comprehensive error handling. All scraping logic is in fkapi/core/scrapers.py (~1400 lines).

Architecture

Core Scraping Functions

scrape_kit()

Scrapes a single kit using its slug and an optional kit_id.
Parameters:
  • slug (str, required): The kit's base slug from the URL (e.g., "barcelona-2024-25-home-kit")
  • kit_id (str, optional): Kit ID for the new URL format (e.g., "402174")
  • use_proxy (bool, default False): Whether to use a proxy for the request
  • existing_kit_id (int, optional): ID of an existing kit to update (instead of creating a new one)
Returns: Kit | None - the created or updated Kit object, or None on failure.
Features:
  • Automatic retry with exponential backoff (max 3 retries)
  • Handles URL format changes (old vs new format)
  • 403 detection triggers automatic proxy use
  • 404 handling with new URL format fallback
  • Validates page existence before parsing
from core.scrapers import scrape_kit

# Basic usage
kit = scrape_kit('barcelona-2024-25-home-kit')

# With kit_id (new URL format)
kit = scrape_kit('barcelona-2024-25-home-kit', kit_id='402174')

# With proxy
kit = scrape_kit('kit-slug', use_proxy=True)

# Update existing kit
kit = scrape_kit('kit-slug', existing_kit_id=123)

scrape_club_details()

Scrapes club information including name, logo, and country.
Parameters:
  • slug (str, required): The club's slug (e.g., "barcelona")
  • use_proxy (bool, default False): Whether to use a proxy
Returns: Club | None
Extracted Data:
  • Club name (from page title)
  • Logo URL (both light and dark mode)
  • Creates or updates Club record
from core.scrapers import scrape_club_details

club = scrape_club_details('barcelona')
print(club.name)  # "FC Barcelona"
print(club.logo)  # Logo URL

scrape_whole_club()

Scrapes all kits for a given club across all seasons. Always uses a proxy to prevent IP bans.
Parameters:
  • club (Club, required): The Club model instance to scrape
Returns: Club | None
Process:
  1. Fetches club’s kit archive page
  2. Iterates through each season container
  3. Extracts brand information
  4. Processes each kit in the season
  5. Uses atomic transactions for data integrity
This function always uses a proxy since it makes many requests in sequence. Expect longer execution time.
from core.models import Club
from core.scrapers import scrape_whole_club

club = Club.objects.get(slug='barcelona')
result = scrape_whole_club(club)
# Scrapes all Barcelona kits from all seasons
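The five-step process can be sketched as a loop. This is a structural illustration only; fetch_page, parse_seasons, and save_kit stand in for the real helpers in scrapers.py, and the actual code wraps each save in transaction.atomic():

```python
def scrape_all_seasons(club_slug, fetch_page, parse_seasons, save_kit):
    """Structural sketch of scrape_whole_club's main loop (helper names assumed)."""
    html = fetch_page(club_slug)            # 1. fetch the club's kit archive page
    saved = 0
    for season in parse_seasons(html):      # 2. iterate season containers
        brand = season["brand"]             # 3. extract brand information
        for kit in season["kits"]:          # 4. process each kit in the season
            save_kit(brand, kit)            # 5. persist (atomic in the real code)
            saved += 1
    return saved
```

The dependency injection here is only for illustration; the real function resolves its collaborators directly.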

scrape_latest()

Scrapes the “Latest Kits” page to find newly added kits.
Parameters:
  • page (int, default 1): Page number to scrape
  • use_proxy (bool, default False): Whether to use a proxy
Returns: tuple[bool, bool] - (success, all_kits_exist)
Behavior:
  • Checks if kits already exist in database
  • Only queues new kits for scraping
  • Returns early if all kits on page exist
  • Dispatches scraping to Celery tasks for parallel processing
from core.scrapers import scrape_latest

# Scrape page 1
success, all_exist = scrape_latest(page=1)

if all_exist:
    print("All kits already in database")
else:
    print("Found new kits")

scrape_latest_pages()

Scrapes multiple pages from the latest kits section.
Parameters:
  • page_start (int, default 1): First page to scrape
  • page_end (int, default 1): Last page to scrape
  • use_proxy (bool, default False): Whether to use a proxy
  • delay (int, default 2): Delay in seconds between pages
  • progress_callback (callable, optional): Callback function for progress reporting
  • reverse_order (bool, default False): Process pages in reverse order (newest to oldest)
Returns: tuple[int, int] - (success_count, failure_count)
from core.scrapers import scrape_latest_pages

# Scrape pages 1-10
success, failed = scrape_latest_pages(
    page_start=1,
    page_end=10,
    use_proxy=True,
    delay=3
)

print(f"Success: {success}, Failed: {failed}")

HTTP Layer

http_get()

Centralized HTTP request function with retry logic and proxy support (defined in core/http.py).
Features:
  • Automatic retries on failure
  • Rotating proxy support
  • Custom headers (User-Agent, Accept, etc.)
  • Connection pooling
  • Timeout handling
from core.http import http_get
from core.constants import BASE_URL

url = f"{BASE_URL}/barcelona-2024-25-home-kit"

# Basic request
response = http_get(url)

# With proxy
response = http_get(url, use_proxy=True)

# With custom headers
response = http_get(url, headers={'Accept': 'application/json'})

HTML Parsing

BeautifulSoup4

All HTML parsing uses BeautifulSoup4 with the lxml parser:
from bs4 import BeautifulSoup
from core.constants import HTML_PARSER  # "lxml"

soup = BeautifulSoup(response.text, HTML_PARSER)

Common Extraction Patterns

# Extract the kit title; guard against a missing element
title_elem = soup.find("span", class_="main-title")
kit_name = title_elem.text.strip() if title_elem else None
from core.parsers import extract_fact_table

soup = BeautifulSoup(html, HTML_PARSER)
data = extract_fact_table(soup)

# Returns structured data including:
# - kit_name
# - brand
# - season
# - competition
# - colors
# - design
# - rating
from core.parsers import parse_kit_page

data = parse_kit_page(soup)
# Returns complete kit data dictionary

Error Handling

Exception Hierarchy

Custom exceptions defined in core/exceptions.py:

ScrapingError

Base exception for scraping errors

KitNotFoundError

Kit page not found (404)

ClubNotFoundError

Club page not found (404)

RateLimitExceededError

Rate limit exceeded (403)

InvalidSeasonError

Invalid season format
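Based on the names above, the hierarchy in core/exceptions.py presumably looks like this (a sketch; the real classes may carry extra attributes such as the offending URL or slug):

```python
class ScrapingError(Exception):
    """Base exception for scraping errors."""

class KitNotFoundError(ScrapingError):
    """Kit page not found (404)."""

class ClubNotFoundError(ScrapingError):
    """Club page not found (404)."""

class RateLimitExceededError(ScrapingError):
    """Rate limit exceeded (403)."""

class InvalidSeasonError(ScrapingError):
    """Invalid season format."""
```

Because everything derives from ScrapingError, callers can catch all scraping failures with a single `except ScrapingError:` clause.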

Retry Logic

  1. Initial Request: Make an HTTP request to the target URL.
  2. Check Response:
    • 200: Success, proceed to parse
    • 403: Rate limited, retry with proxy
    • 404: Page not found, try URL format fallback
    • Other errors: Retry with backoff
  3. Retry with Backoff:
    • Max retries: 3 (from MAX_RETRIES constant)
    • Retry delay: 2 seconds (from RETRY_DELAY constant)
    • Each retry increments the counter
  4. Final Attempt: If all retries are exhausted, return None or raise an exception.
import time
import logging

import requests

from core.http import http_get
from core.constants import MAX_RETRIES, RETRY_DELAY

logger = logging.getLogger(__name__)

max_retries = MAX_RETRIES  # 3
retry_count = 0

while retry_count < max_retries:
    try:
        response = http_get(url, use_proxy=use_proxy)

        if response.status_code == 403:
            logger.warning("Rate limited, retrying with proxy")
            use_proxy = True
            retry_count += 1
            time.sleep(RETRY_DELAY)
            continue

        response.raise_for_status()
        # Process successful response
        break

    except requests.exceptions.RequestException as e:
        logger.error(f"Request failed: {e}")
        retry_count += 1
        if retry_count == max_retries:
            return None
        time.sleep(RETRY_DELAY)

Rate Limiting & Proxies

Ethical Scraping Practices

Always respect the source website’s rate limits and terms of service.
Built-in Protections:
  1. Delay Between Requests
    • scrape_latest_pages() enforces configurable delay (default 2 seconds)
    • scrape_user_collection_api() uses 0.5 second delay between pages
  2. Automatic Proxy Use
    • 403 responses trigger automatic proxy retry
    • Bulk operations (scrape_whole_club) always use proxy
    • Proxy rotation prevents IP bans
  3. Request Throttling
    • Celery task queue prevents overwhelming the server
    • Tasks process kits sequentially or in controlled parallelism
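The per-request delay can be illustrated with a minimal helper (a sketch, not the project's actual throttling code; `fetch` stands in for http_get):

```python
import time

def throttled(urls, fetch, delay=2.0):
    """Fetch each URL in turn, sleeping `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)   # pause between requests, not before the first
        results.append(fetch(url))
    return results
```

scrape_latest_pages() follows the same pattern with its configurable `delay` parameter.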

Proxy Configuration

Proxy settings are configured in environment variables (specifics in deployment config).
# Automatic proxy on rate limit (illustrative; retry_request stands in
# for the retry loop shown earlier)
if response.status_code == 403:
    use_proxy = True
    retry_request()

# Force proxy for bulk operations: scrape_whole_club always passes
# use_proxy=True to http_get
def scrape_whole_club(club):
    response = http_get(url, use_proxy=True)

URL Format Handling

Old vs New Format

FootballKitArchive.com uses two URL formats:
  • Old format: /kit-slug-with-trailing-number123 (e.g., /barcelona-2024-25-home-kit402174). The trailing digits were part of the slug.
  • New format: /kit-slug/kit-id/ (e.g., /barcelona-2024-25-home-kit/402174/). The kit ID is a separate path segment.

Automatic Conversion

The _try_new_url_format() function automatically detects and converts:
def _split_slug_trailing_digits(slug: str) -> tuple[str, str] | None:
    """Split slug into (base, trailing_digits) if it ends with digits."""
    if not slug or not slug[-1].isdigit():
        return None
    i = len(slug) - 1
    while i >= 0 and slug[i].isdigit():
        i -= 1
    if i < len(slug) - 1:
        return slug[: i + 1], slug[i + 1 :]
    return None

# Usage
base_slug, kit_id = _split_slug_trailing_digits(
    'barcelona-2024-25-home-kit402174'
)
# Returns: ('barcelona-2024-25-home-kit', '402174')

URL Building

def build_kit_url(slug: str, kit_id: str | None = None) -> str:
    """Builds the complete URL for scraping a kit."""
    if kit_id:
        return f"{BASE_URL}/{slug}/{kit_id}/"
    
    split = _split_slug_trailing_digits(slug)
    if split:
        base_slug, extracted_id = split
        return f"{BASE_URL}/{base_slug}/{extracted_id}/"
    
    return f"{BASE_URL}/{slug}"

Season Parsing

Season Format Variations

The scraper handles multiple season input formats:
Input Format   Normalized   first_year   second_year
2024           2024         2024         None
23-24          2023-24      2023         2024
2023-24        2023-24      2023         2024
2023-2024      2023-2024    2023         2024
99-00          1999-00      1999         2000
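The normalization in the table above can be sketched as a standalone function. This is an illustration only; the real logic lives in the _parse_* helpers, and the two-digit century cutoff (`pivot`) is an assumption:

```python
def parse_season(text, pivot=50):
    """Normalize a season string to (normalized, first_year, second_year)."""
    parts = text.split("-")
    if len(parts) == 1:
        return text, int(text), None    # single year, e.g. "2024"
    a, b = parts
    # Two-digit first years map to 1900s or 2000s around the assumed pivot
    first = int(a) if len(a) == 4 else (1900 + int(a) if int(a) >= pivot else 2000 + int(a))
    # Second year inherits the first year's century...
    second = int(b) if len(b) == 4 else first - first % 100 + int(b)
    if second < first:                  # ...unless it rolls over, e.g. 99-00
        second += 100
    normalized = f"{first}-{b}" if len(a) == 2 else text
    return normalized, first, second
```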

get_season()

Parses season from kit slugs:
def get_season(season_slug: str) -> Season:
    """
    Retrieves or creates a Season object from a slug.
    
    Handles:
    - Full kit slugs: 'barcelona-2023-24-home-kit'
    - Direct years: '2024', '2023-24', '99-00'
    - Carry-over removal: '2023-24 (carry-over)' -> '2023-24'
    """
    slug = _normalize_season_slug(season_slug)
    
    # Try two-digit year format (99-00)
    parsed = _parse_two_digit_year(slug)
    if parsed:
        year_str, first_year, second_year = parsed
        return _get_or_create_season(year_str, first_year, second_year)
    
    # Try direct year format (2024, 2023-24)
    parsed = _parse_direct_year(slug)
    if parsed is None:
        # Extract from full kit slug
        parsed = _parse_year_from_kit_slug(slug)
    
    year_str, first_year, second_year = parsed
    _validate_season_years(season_slug, first_year, second_year)
    return _get_or_create_season(year_str, first_year, second_year)

Year Validation

  • Year range: years must be between 1800 and 2100, preventing obvious input errors while still supporting historical and future seasons.
  • Season span: modern seasons (1960+) may span at most 2 calendar years; historical seasons (pre-1960) may span up to 20. This prevents incorrect season parsing.
  • Ordering: second_year must be ≥ first_year, catching reversed or invalid input.
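These rules can be sketched as a standalone check. This is an illustration; the real _validate_season_years may differ in signature and details, and "span" is interpreted here as the number of calendar years a season covers:

```python
def validate_season_years(first_year, second_year=None):
    """Raise ValueError if parsed season years violate the documented rules."""
    if not 1800 <= first_year <= 2100:
        raise ValueError(f"first_year out of range: {first_year}")
    if second_year is None:
        return                              # single-year season, nothing more to check
    if not 1800 <= second_year <= 2100:
        raise ValueError(f"second_year out of range: {second_year}")
    if second_year < first_year:
        raise ValueError("second_year must be >= first_year")
    span = second_year - first_year + 1     # calendar years covered
    limit = 2 if first_year >= 1960 else 20
    if span > limit:
        raise ValueError(f"season span too large: {first_year}-{second_year}")
```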

API Scraping

In addition to HTML scraping, FKApi can scrape from FootballKitArchive’s internal APIs.

scrape_user_collection_api()

Scrapes a user’s kit collection using the collection-feed API:
Parameters:
  • userid (int, required): User ID from FootballKitArchive
Returns: dict with:
  • success: bool
  • entries: list of kit entries
  • total_entries: int
  • pages_scraped: int
  • user: user info dict (if available)
Features:
  • Automatic pagination (fetches all pages)
  • Filters out custom entries (custom_team, custom_type, etc.)
  • Cleans unwanted fields from response
  • 0.5 second delay between pages
  • Returns enriched data with metadata
from core.scrapers import scrape_user_collection_api

data = scrape_user_collection_api(userid=12345)

print(f"Total entries: {data['total_entries']}")
print(f"Pages scraped: {data['pages_scraped']}")

for entry in data['entries']:
    print(entry['kit']['name'])
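The pagination behavior can be sketched as follows. This is a simplified stand-in: `fetch_page` represents one collection-feed API call, only one custom field is checked, and the real function also collects user metadata:

```python
import time

def fetch_all_entries(fetch_page, delay=0.5):
    """Fetch collection pages until one comes back empty, skipping custom entries."""
    entries, page = [], 1
    while True:
        batch = fetch_page(page)
        if not batch:
            break
        # drop user-created custom entries (custom_team, custom_type, ...)
        entries.extend(e for e in batch if not e.get("custom_team"))
        page += 1
        time.sleep(delay)       # be polite between API pages
    return {"entries": entries,
            "total_entries": len(entries),
            "pages_scraped": page - 1}
```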

scrape_user_info_api()

Scrapes user profile information:
Parameters:
  • userid (int, required): User ID from FootballKitArchive
Returns: dict | None with user data:
  • id: User ID
  • name: Username
  • image: Profile image URL
  • Additional profile fields

Data Processing Flow

ScrapingService.process_kit_data()

The central processing function for scraped kit data.

Transaction Management

All scraping operations use atomic transactions:
from django.db import transaction

with transaction.atomic():
    club.save()
    kit.save()
    kit.competition.add(*competitions)
    # All-or-nothing: Rolls back on any error

Constants

Scraping configuration constants (defined in core/constants.py):
Constant                    Value                                Purpose
BASE_URL                    https://www.footballkitarchive.com   Base website URL
MAX_RETRIES                 3                                    Maximum retry attempts
RETRY_DELAY                 2                                    Seconds between retries
HTTP_STATUS_FORBIDDEN       403                                  Rate limit status code
HTML_PARSER                 lxml                                 BeautifulSoup parser
KIT_CLASS                   archive-result                       CSS class for kit elements
KIT_CONTAINER_CLASS         archive-results-grid                 CSS class for kit grid
COLLECTION_CONTAINER_CLASS  collection-container                 CSS class for season containers
SECTION_DETAILS_CLASS       section-details                      CSS class for section details
KIT_SEASON_CLASS            kit-season                           CSS class for kit season
LATEST_PAGE_URL             /latest/                             Latest kits page path
DEFAULT_LOGO_URL            /static/logos/not_found.png          Fallback logo

Celery Integration

For parallel processing and scheduled scraping:

Task Definitions

from celery import shared_task
from core.scrapers import scrape_kit

@shared_task(bind=True, max_retries=3)
def scrape_kit_task(self, slug, kit_id=None, use_proxy=False):
    """Celery task for scraping a single kit."""
    try:
        kit = scrape_kit(slug, kit_id=kit_id, use_proxy=use_proxy)
        return kit.id if kit else None
    except Exception as exc:
        # Retry with exponential backoff
        raise self.retry(exc=exc, countdown=60 * (2 ** self.request.retries))

Scheduled Scraping

Configured in settings.py:
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    'scrape_daily': {
        'task': 'core.tasks.scrape_daily',
        'schedule': crontab(hour=0, minute=0),  # Midnight daily
    },
}

Logging

Scraping operations use Python’s logging module:
import logging

logger = logging.getLogger(__name__)

logger.debug(f"Scraping kit: {slug}")
logger.info(f"Successfully created kit: {kit.name}")
logger.warning("Rate limited, retrying with proxy")
logger.error(f"Failed to scrape kit: {slug}")
Log Levels:
  • DEBUG: Detailed scraping progress
  • INFO: Successful operations
  • WARNING: Recoverable errors (rate limits, retries)
  • ERROR: Failed operations

Best Practices

Use Proxies for Bulk

Always enable proxies when scraping multiple pages or entire clubs

Respect Rate Limits

Use appropriate delays between requests (2+ seconds recommended)

Handle Errors Gracefully

Use try-except blocks and return None on failure

Use Transactions

Wrap database operations in atomic transactions

Log Everything

Comprehensive logging helps debug scraping issues

Validate Data

Always validate scraped data before saving
Important: Always check robots.txt and respect the source website’s terms of service. FKApi is designed for educational and archival purposes.

Troubleshooting

403 Forbidden Errors
Cause: Rate limiting by the source website.
Solution:
  • Enable proxy: use_proxy=True
  • Increase the delay between requests
  • Check if your IP is banned
  • Use Celery for distributed scraping

404 Not Found Errors
Cause: Page moved or URL format changed.
Solution:
  • The scraper automatically tries the new URL format
  • Verify the slug is correct
  • Check if the kit exists on the website
  • Use the kit_id parameter if available

Parsing Failures
Cause: HTML structure changed on the source website.
Solution:
  • Update CSS selectors in core/parsers.py
  • Check the KIT_CLASS and SECTION_DETAILS_CLASS constants
  • Verify BeautifulSoup selectors

Season Parsing Errors
Cause: Unusual season format.
Solution:
  • Check get_season() logic
  • Add a new format handler in the _parse_*_year() functions
  • Validate year ranges (1800-2100)

Database Integrity Errors
Cause: Missing foreign key relationships.
Solution:
  • Ensure clubs, brands, and seasons exist before creating kits
  • Use scrape_club_details() to create missing clubs
  • Use get_season() to auto-create seasons
  • Wrap operations in transaction.atomic()