GraphQL integration, data correctness
140  REFACTORING_SUMMARY.md  Normal file
@@ -0,0 +1,140 @@
# Scaev Scraper Refactoring Summary

## Date: 2025-12-07

## Objectives Completed

### 1. Image Download Integration ✅
- **Changed**: Enabled `DOWNLOAD_IMAGES = True` in `config.py` and `docker-compose.yml`
- **Added**: Unique constraint on `images(lot_id, url)` to prevent duplicates
- **Added**: Automatic duplicate cleanup migration in `cache.py`
- **Result**: Images are now downloaded to `/mnt/okcomputer/output/images/{lot_id}/` and marked as `downloaded=1`
- **Impact**: Eliminates 57M+ duplicate image downloads by the monitor app
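
The cleanup step amounts to keeping the first-seen row per `(lot_id, url)` pair and then adding the unique index. A minimal sketch of that migration (the exact statements in `cache.py` may differ; `dedupe_images` is an illustrative helper name):

```python
import sqlite3

def dedupe_images(db_path: str) -> None:
    """Delete duplicate image rows (keeping the smallest id per lot_id/url), then enforce uniqueness."""
    with sqlite3.connect(db_path) as conn:
        # Keep the row with the smallest id (first occurrence) for each (lot_id, url) pair
        conn.execute("""
            DELETE FROM images
            WHERE id NOT IN (SELECT MIN(id) FROM images GROUP BY lot_id, url)
        """)
        conn.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_unique_lot_url ON images(lot_id, url)")
```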

### 2. Data Completeness Fix ✅
- **Problem**: 99.9% of lots missing closing_time, 100% missing bid data
- **Root Cause**: Troostwijk loads bid/timing data dynamically via GraphQL API, not in HTML
- **Solution**: Added GraphQL client to fetch real-time bidding data

## Key Changes

### New Files
1. **src/graphql_client.py** - GraphQL API client for fetching lot bidding data
   - Endpoint: `https://storefront.tbauctions.com/storefront/graphql`
   - Fetches: current_bid, starting_bid, minimum_bid, bid_count, closing_time

### Modified Files
1. **src/config.py:22** - `DOWNLOAD_IMAGES = True`
2. **docker-compose.yml:13** - `DOWNLOAD_IMAGES: "True"`
3. **src/cache.py**
   - Added unique index on `images(lot_id, url)`
   - Added columns `starting_bid`, `minimum_bid` to `lots` table
   - Added migration to clean duplicates and add missing columns
4. **src/scraper.py**
   - Integrated GraphQL API calls for each lot
   - Fetches real-time bidding data after parsing HTML
   - Removed unicode characters causing Windows encoding issues
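
The `lots` migration in `cache.py` follows the usual SQLite pattern: read `PRAGMA table_info`, then `ALTER TABLE` only for columns that are missing, so it is safe to re-run. Condensed from the `src/cache.py` hunk later in this commit (`migrate_lots_table` is an illustrative wrapper name):

```python
import sqlite3

def migrate_lots_table(conn: sqlite3.Connection) -> None:
    """Add the new bid columns only if they are not present yet (idempotent migration)."""
    cursor = conn.execute("PRAGMA table_info(lots)")
    columns = {row[1] for row in cursor.fetchall()}  # row[1] is the column name

    if 'starting_bid' not in columns:
        conn.execute("ALTER TABLE lots ADD COLUMN starting_bid TEXT")
    if 'minimum_bid' not in columns:
        conn.execute("ALTER TABLE lots ADD COLUMN minimum_bid TEXT")
```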

## Database Schema Updates

### lots table - New Columns
```sql
ALTER TABLE lots ADD COLUMN starting_bid TEXT;
ALTER TABLE lots ADD COLUMN minimum_bid TEXT;
```

### images table - New Index
```sql
CREATE UNIQUE INDEX idx_unique_lot_url ON images(lot_id, url);
```

## Data Flow (New Architecture)
```
┌────────────────────────────────────────────────────┐
│ Phase 3: Scrape Lot Page                           │
└────────────────────────────────────────────────────┘
    │
    ├─▶ Parse HTML (__NEXT_DATA__)
    │     └─▶ Extract: title, location, images, description
    │
    ├─▶ Fetch GraphQL API
    │     └─▶ Query: LotBiddingData(lot_display_id)
    │           └─▶ Returns:
    │               - currentBidAmount (cents)
    │               - initialAmount (starting_bid)
    │               - nextMinimalBid (minimum_bid)
    │               - bidsCount
    │               - endDate (Unix timestamp)
    │               - startDate
    │               - biddingStatus
    │
    └─▶ Save to Database
          - lots table: complete bid & timing data
          - images table: deduplicated URLs
          - Download images immediately
```
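
In code, the per-lot flow above reduces to: parse the static fields from HTML, overlay the GraphQL result, then persist. A sketch condensed from the `src/scraper.py` changes below (`parser` and `cache` stand in for the scraper's `DataParser` and `CacheManager` instances):

```python
from graphql_client import fetch_lot_bidding_data, format_bid_data

async def scrape_lot(page_html: str, url: str, parser, cache) -> dict:
    """Parse static lot fields from HTML, then overlay live bid/timing data from the GraphQL API."""
    page_data = parser.parse_page(page_html, url)   # title, location, images, description

    bidding = await fetch_lot_bidding_data(page_data['lot_id'])
    if bidding:                                     # GraphQL wins for bid/timing fields
        page_data.update(format_bid_data(bidding))

    cache.save_lot(page_data)                       # lots table: bid & timing data
    cache.save_images(page_data['lot_id'], page_data.get('images', []))  # deduplicated by unique index
    return page_data
```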

## Testing Results

### Test Lot: A1-28505-5
```
Current Bid:   EUR 50.00 ✅
Starting Bid:  EUR 50.00 ✅
Minimum Bid:   EUR 55.00 ✅
Bid Count:     1 ✅
Closing Time:  2025-12-16 19:10:00 ✅
Images:        Downloaded 2 ✅
```

## Deployment Checklist
- [x] Enable DOWNLOAD_IMAGES in config
- [x] Update docker-compose environment
- [x] Add GraphQL client
- [x] Update scraper integration
- [x] Add database migrations
- [x] Test with live lot
- [ ] Deploy to production
- [ ] Run full scrape to populate data
- [ ] Verify monitor app sees downloaded images

## Post-Deployment Verification

### Check Data Quality
```sql
-- Bid data completeness
SELECT
    COUNT(*) as total,
    SUM(CASE WHEN closing_time != '' THEN 1 ELSE 0 END) as has_closing,
    SUM(CASE WHEN bid_count > 0 THEN 1 ELSE 0 END) as has_bids,
    SUM(CASE WHEN starting_bid IS NOT NULL THEN 1 ELSE 0 END) as has_starting_bid
FROM lots
WHERE scraped_at > datetime('now', '-1 hour');

-- Image download rate
SELECT
    COUNT(*) as total,
    SUM(downloaded) as downloaded,
    ROUND(100.0 * SUM(downloaded) / COUNT(*), 2) as success_rate
FROM images
WHERE id IN (
    SELECT i.id FROM images i
    JOIN lots l ON i.lot_id = l.lot_id
    WHERE l.scraped_at > datetime('now', '-1 hour')
);

-- Duplicate check (should be 0)
SELECT lot_id, url, COUNT(*) as dup_count
FROM images
GROUP BY lot_id, url
HAVING COUNT(*) > 1;
```

## Notes
- GraphQL API requires no authentication
- API rate limits: handled by existing `RATE_LIMIT_SECONDS = 0.5`
- Currency format: Changed from € to EUR for Windows compatibility
- Timestamps: API returns Unix timestamps in seconds (not milliseconds)
- Existing data: Old lots still have missing data; re-scrape required to populate
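
The currency and timestamp notes correspond to conversions like these, simplified from `format_bid_data` in `src/graphql_client.py` (helper names are illustrative):

```python
from datetime import datetime

def cents_to_display(cents: int, currency: str = "EUR") -> str:
    """The API returns integer cents; render as 'EUR 50.00' (ASCII only, for Windows consoles)."""
    return f"{currency} {cents / 100:.2f}"

def end_date_to_closing_time(ts: int) -> str:
    """endDate is a Unix timestamp in seconds, so no division by 1000 is needed."""
    return datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')

assert cents_to_display(5000) == "EUR 50.00"
```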
54  check_apollo_state.py  Normal file
@@ -0,0 +1,54 @@
#!/usr/bin/env python3
"""Check for Apollo state or other embedded data"""
import asyncio
import json
import re
from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        await page.goto("https://www.troostwijkauctions.com/a/woonunits-generatoren-reinigingsmachines-en-zakelijke-goederen-A1-37889", wait_until='networkidle')
        content = await page.content()

        # Look for embedded data structures
        patterns = [
            (r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.+?)</script>', "NEXT_DATA"),
            (r'window\.__APOLLO_STATE__\s*=\s*({.+?});', "APOLLO_STATE"),
            (r'"lots"\s*:\s*\[(.+?)\]', "LOTS_ARRAY"),
        ]

        for pattern, name in patterns:
            match = re.search(pattern, content, re.DOTALL)
            if match:
                print(f"\n{'='*60}")
                print(f"FOUND: {name}")
                print(f"{'='*60}")
                try:
                    if name == "LOTS_ARRAY":
                        print(f"Preview: {match.group(1)[:500]}")
                    else:
                        data = json.loads(match.group(1))
                        print(json.dumps(data, indent=2)[:2000])
                except Exception:
                    print(f"Preview: {match.group(1)[:1000]}")

        # Also check for any script tags with "lot" and "bid" and "end"
        print(f"\n{'='*60}")
        print("SEARCHING FOR LOT DATA IN ALL SCRIPTS")
        print(f"{'='*60}")

        scripts = re.findall(r'<script[^>]*>(.+?)</script>', content, re.DOTALL)
        for i, script in enumerate(scripts):
            if all(term in script.lower() for term in ['lot', 'bid', 'end']):
                print(f"\nScript #{i} (first 500 chars):")
                print(script[:500])
                if i > 3:  # Limit output
                    break

        await browser.close()


if __name__ == "__main__":
    asyncio.run(main())
54  check_data.py  Normal file
@@ -0,0 +1,54 @@
#!/usr/bin/env python3
"""Check current data quality in cache.db"""
import sqlite3


conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')

print("=" * 60)
print("CURRENT DATA QUALITY CHECK")
print("=" * 60)

# Check lots table
print("\n[*] Sample Lot Data:")
cursor = conn.execute("""
    SELECT lot_id, current_bid, bid_count, closing_time
    FROM lots
    LIMIT 10
""")
for row in cursor:
    print(f"  Lot: {row[0]}")
    print(f"    Current Bid: {row[1]}")
    print(f"    Bid Count: {row[2]}")
    print(f"    Closing Time: {row[3]}")

# Check auctions table
print("\n[*] Sample Auction Data:")
cursor = conn.execute("""
    SELECT auction_id, title, closing_time, first_lot_closing_time
    FROM auctions
    LIMIT 5
""")
for row in cursor:
    print(f"  Auction: {row[0]}")
    print(f"    Title: {row[1][:50]}...")
    print(f"    Closing Time: {row[2] if len(row) > 2 else 'N/A'}")
    print(f"    First Lot Closing: {row[3]}")

# Data completeness stats
print("\n[*] Data Completeness:")
cursor = conn.execute("""
    SELECT
        COUNT(*) as total,
        SUM(CASE WHEN current_bid IS NULL OR current_bid = '' THEN 1 ELSE 0 END) as missing_current_bid,
        SUM(CASE WHEN closing_time IS NULL OR closing_time = '' THEN 1 ELSE 0 END) as missing_closing_time,
        SUM(CASE WHEN bid_count IS NULL OR bid_count = 0 THEN 1 ELSE 0 END) as zero_bid_count
    FROM lots
""")
row = cursor.fetchone()
print(f"  Total lots: {row[0]:,}")
print(f"  Missing current_bid: {row[1]:,} ({100*row[1]/row[0]:.1f}%)")
print(f"  Missing closing_time: {row[2]:,} ({100*row[2]/row[0]:.1f}%)")
print(f"  Zero bid_count: {row[3]:,} ({100*row[3]/row[0]:.1f}%)")

conn.close()
print("\n" + "=" * 60)
69  debug_lot_structure.py  Normal file
@@ -0,0 +1,69 @@
#!/usr/bin/env python3
"""Debug lot data structure from cached page"""
import sqlite3
import zlib
import json
import re
import sys
sys.path.insert(0, 'src')

from parse import DataParser


conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')

# Get a recent lot page
cursor = conn.execute("""
    SELECT url, content
    FROM cache
    WHERE url LIKE '%/l/%'
    ORDER BY timestamp DESC
    LIMIT 1
""")

row = cursor.fetchone()
if not row:
    print("No lot pages found")
    exit(1)

url, content_blob = row
content = zlib.decompress(content_blob).decode('utf-8')

parser = DataParser()
result = parser.parse_page(content, url)

if result:
    print(f"URL: {url}")
    print(f"\nParsed Data:")
    print(f"  type: {result.get('type')}")
    print(f"  lot_id: {result.get('lot_id')}")
    print(f"  title: {result.get('title', '')[:50]}...")
    print(f"  current_bid: {result.get('current_bid')}")
    print(f"  bid_count: {result.get('bid_count')}")
    print(f"  closing_time: {result.get('closing_time')}")
    print(f"  location: {result.get('location')}")

# Also dump the raw JSON
match = re.search(r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.+?)</script>', content, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    page_props = data.get('props', {}).get('pageProps', {})

    if 'lot' in page_props:
        lot = page_props['lot']
        print(f"\nRAW __NEXT_DATA__.lot keys: {list(lot.keys())}")
        print(f"\nSearching for bid/timing fields...")

        # Deep search for these fields
        def deep_search(obj, prefix=""):
            if isinstance(obj, dict):
                for k, v in obj.items():
                    if any(term in k.lower() for term in ['bid', 'end', 'close', 'date', 'time']):
                        print(f"  {prefix}{k}: {v}")
                    if isinstance(v, (dict, list)):
                        deep_search(v, prefix + k + ".")
            elif isinstance(obj, list) and len(obj) > 0:
                deep_search(obj[0], prefix + "[0].")

        deep_search(lot)

conn.close()
53  extract_graphql_query.py  Normal file
@@ -0,0 +1,53 @@
#!/usr/bin/env python3
"""Extract the GraphQL query being used"""
import asyncio
import json
from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        graphql_requests = []

        async def capture_request(request):
            if 'graphql' in request.url:
                graphql_requests.append({
                    'url': request.url,
                    'method': request.method,
                    'post_data': request.post_data,
                    'headers': dict(request.headers)
                })

        page.on('request', capture_request)

        await page.goto("https://www.troostwijkauctions.com/l/%25282x%2529-duo-bureau-160x168-cm-A1-28505-5", wait_until='networkidle')
        await asyncio.sleep(2)

        print(f"Captured {len(graphql_requests)} GraphQL requests\n")

        for i, req in enumerate(graphql_requests):
            print(f"{'='*60}")
            print(f"REQUEST #{i+1}")
            print(f"{'='*60}")
            print(f"URL: {req['url']}")
            print(f"Method: {req['method']}")

            if req['post_data']:
                try:
                    data = json.loads(req['post_data'])
                    print(f"\nQuery Name: {data.get('operationName', 'N/A')}")
                    print(f"\nVariables:")
                    print(json.dumps(data.get('variables', {}), indent=2))
                    print(f"\nQuery:")
                    print(data.get('query', '')[:1000])
                except Exception:
                    print(f"\nPOST Data: {req['post_data'][:500]}")

            print()

        await browser.close()


if __name__ == "__main__":
    asyncio.run(main())
64  find_api_endpoint.py  Normal file
@@ -0,0 +1,64 @@
#!/usr/bin/env python3
"""Find the API endpoint by monitoring network requests"""
import asyncio
import json
from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        requests = []
        responses = []

        async def log_request(request):
            if any(term in request.url for term in ['api', 'graphql', 'lot', 'auction', 'bid']):
                requests.append({
                    'url': request.url,
                    'method': request.method,
                    'headers': dict(request.headers),
                    'post_data': request.post_data
                })

        async def log_response(response):
            if any(term in response.url for term in ['api', 'graphql', 'lot', 'auction', 'bid']):
                try:
                    body = await response.text()
                    responses.append({
                        'url': response.url,
                        'status': response.status,
                        'body': body[:1000]
                    })
                except Exception:
                    pass

        page.on('request', log_request)
        page.on('response', log_response)

        print("Loading lot page...")
        await page.goto("https://www.troostwijkauctions.com/l/woonunit-type-tp-4-b-6m-nr-102-A1-37889-102", wait_until='networkidle')

        # Wait for dynamic content
        await asyncio.sleep(3)

        print(f"\nFound {len(requests)} relevant requests")
        print(f"Found {len(responses)} relevant responses\n")

        for req in requests[:10]:
            print(f"REQUEST: {req['method']} {req['url']}")
            if req['post_data']:
                print(f"  POST DATA: {req['post_data'][:200]}")

        print("\n" + "="*60 + "\n")

        for resp in responses[:10]:
            print(f"RESPONSE: {resp['url']}")
            print(f"  Status: {resp['status']}")
            print(f"  Body: {resp['body'][:300]}")
            print()

        await browser.close()


if __name__ == "__main__":
    asyncio.run(main())
70  find_api_valid_lot.py  Normal file
@@ -0,0 +1,70 @@
#!/usr/bin/env python3
"""Find API endpoint using a valid lot from database"""
import asyncio
import sqlite3
from playwright.async_api import async_playwright


# Get a valid lot URL
conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
cursor = conn.execute("SELECT url FROM lots WHERE url LIKE '%/l/%' LIMIT 5")
lot_urls = [row[0] for row in cursor.fetchall()]
conn.close()


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        api_calls = []

        async def log_response(response):
            url = response.url
            # Look for API calls
            if ('api' in url.lower() or 'graphql' in url.lower() or
                    '/v2/' in url or '/v3/' in url or '/v4/' in url or
                    'query' in url.lower() or 'mutation' in url.lower()):
                try:
                    body = await response.text()
                    api_calls.append({
                        'url': url,
                        'status': response.status,
                        'body': body
                    })
                    print(f"\nAPI: {url}")
                except Exception:
                    pass

        page.on('response', log_response)

        for lot_url in lot_urls[:2]:
            print(f"\n{'='*60}")
            print(f"Loading: {lot_url}")
            print(f"{'='*60}")

            try:
                await page.goto(lot_url, wait_until='networkidle', timeout=30000)
                await asyncio.sleep(2)

                # Check if page has bid info
                content = await page.content()
                if 'currentBid' in content or 'Current bid' in content or 'Huidig bod' in content:
                    print("[+] Page contains bid information")
                    break
            except Exception as e:
                print(f"[!] Error: {e}")
                continue

        print(f"\n\n{'='*60}")
        print(f"CAPTURED {len(api_calls)} API CALLS")
        print(f"{'='*60}")

        for call in api_calls:
            print(f"\n{call['url']}")
            print(f"Status: {call['status']}")
            if 'json' in call['body'][:100].lower() or call['body'].startswith('{'):
                print(f"Body (first 500 chars): {call['body'][:500]}")

        await browser.close()


if __name__ == "__main__":
    asyncio.run(main())
48  find_auction_with_lots.py  Normal file
@@ -0,0 +1,48 @@
#!/usr/bin/env python3
"""Find an auction page with lots data"""
import sqlite3
import zlib
import json
import re


conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')

cursor = conn.execute("""
    SELECT url, content
    FROM cache
    WHERE url LIKE '%/a/%'
""")

for row in cursor:
    url, content_blob = row
    content = zlib.decompress(content_blob).decode('utf-8')

    match = re.search(r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.+?)</script>', content, re.DOTALL)
    if not match:
        continue

    data = json.loads(match.group(1))
    page_props = data.get('props', {}).get('pageProps', {})

    if 'auction' in page_props:
        auction = page_props['auction']
        lots = auction.get('lots', [])

        if lots and len(lots) > 0:
            print(f"Found auction with {len(lots)} lots: {url}\n")

            lot = lots[0]
            print(f"SAMPLE LOT FROM AUCTION.LOTS[]:")
            print(f"  displayId: {lot.get('displayId')}")
            print(f"  title: {lot.get('title', '')[:50]}...")
            print(f"  urlSlug: {lot.get('urlSlug')}")
            print(f"\nBIDDING FIELDS:")
            for key in ['currentBid', 'highestBid', 'startingBid', 'minimumBidAmount', 'bidCount', 'numberOfBids']:
                print(f"  {key}: {lot.get(key)}")
            print(f"\nTIMING FIELDS:")
            for key in ['endDate', 'startDate', 'closingTime']:
                print(f"  {key}: {lot.get(key)}")
            print(f"\nALL KEYS: {list(lot.keys())[:30]}...")
            break

conn.close()
69  inspect_cached_page.py  Normal file
@@ -0,0 +1,69 @@
#!/usr/bin/env python3
"""Extract and inspect __NEXT_DATA__ from a cached lot page"""
import sqlite3
import zlib
import json
import re


conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')

# Get a cached auction page
cursor = conn.execute("""
    SELECT url, content
    FROM cache
    WHERE url LIKE '%/a/%'
    LIMIT 1
""")

row = cursor.fetchone()
if not row:
    print("No cached lot pages found")
    exit(1)

url, content_blob = row
print(f"Inspecting: {url}\n")

# Decompress
content = zlib.decompress(content_blob).decode('utf-8')

# Extract __NEXT_DATA__
match = re.search(r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.+?)</script>', content, re.DOTALL)
if not match:
    print("No __NEXT_DATA__ found")
    exit(1)

data = json.loads(match.group(1))
page_props = data.get('props', {}).get('pageProps', {})

if 'auction' in page_props:
    auction = page_props['auction']
    print("AUCTION DATA STRUCTURE:")
    print("=" * 60)
    print(f"displayId: {auction.get('displayId')}")
    print(f"name: {auction.get('name', '')[:50]}...")
    print(f"lots count: {len(auction.get('lots', []))}")

    if auction.get('lots'):
        lot = auction['lots'][0]
        print(f"\nFIRST LOT STRUCTURE:")
        print(f"  displayId: {lot.get('displayId')}")
        print(f"  title: {lot.get('title', '')[:50]}...")
        print(f"\n  BIDDING:")
        print(f"    currentBid: {lot.get('currentBid')}")
        print(f"    highestBid: {lot.get('highestBid')}")
        print(f"    startingBid: {lot.get('startingBid')}")
        print(f"    minimumBidAmount: {lot.get('minimumBidAmount')}")
        print(f"    bidCount: {lot.get('bidCount')}")
        print(f"    numberOfBids: {lot.get('numberOfBids')}")
        print(f"\n  TIMING:")
        print(f"    endDate: {lot.get('endDate')}")
        print(f"    startDate: {lot.get('startDate')}")
        print(f"    closingTime: {lot.get('closingTime')}")
        print(f"\n  ALL KEYS: {list(lot.keys())}")

    print(f"\nAUCTION TIMING:")
    print(f"  minEndDate: {auction.get('minEndDate')}")
    print(f"  maxEndDate: {auction.get('maxEndDate')}")
    print(f"  ALL KEYS: {list(auction.keys())}")

conn.close()
45  intercept_api.py  Normal file
@@ -0,0 +1,45 @@
#!/usr/bin/env python3
"""Intercept API calls to find where lot data comes from"""
import asyncio
import json
from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()

        # Track API calls
        api_calls = []

        async def handle_response(response):
            if 'api' in response.url.lower() or 'graphql' in response.url.lower():
                try:
                    body = await response.json()
                    api_calls.append({
                        'url': response.url,
                        'status': response.status,
                        'body': body
                    })
                    print(f"\nAPI CALL: {response.url}")
                    print(f"Status: {response.status}")
                    if 'lot' in response.url.lower() or 'auction' in response.url.lower():
                        print(f"Body preview: {json.dumps(body, indent=2)[:500]}")
                except Exception:
                    pass

        page.on('response', handle_response)

        # Visit auction page
        print("Loading auction page...")
        await page.goto("https://www.troostwijkauctions.com/a/woonunits-generatoren-reinigingsmachines-en-zakelijke-goederen-A1-37889", wait_until='networkidle')

        # Wait a bit for lazy loading
        await asyncio.sleep(5)

        print(f"\n\nCaptured {len(api_calls)} API calls")

        await browser.close()


if __name__ == "__main__":
    asyncio.run(main())
51  scrape_fresh_auction.py  Normal file
@@ -0,0 +1,51 @@
#!/usr/bin/env python3
"""Scrape a fresh auction page to see the lots array structure"""
import asyncio
import json
import re
from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        # Get first auction
        await page.goto("https://www.troostwijkauctions.com/auctions", wait_until='networkidle')
        content = await page.content()

        # Find first auction link
        match = re.search(r'href="(/a/[^"]+)"', content)
        if not match:
            print("No auction found")
            return

        auction_url = f"https://www.troostwijkauctions.com{match.group(1)}"
        print(f"Scraping: {auction_url}\n")

        await page.goto(auction_url, wait_until='networkidle')
        content = await page.content()

        # Extract __NEXT_DATA__
        match = re.search(r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.+?)</script>', content, re.DOTALL)
        if not match:
            print("No __NEXT_DATA__ found")
            return

        data = json.loads(match.group(1))
        page_props = data.get('props', {}).get('pageProps', {})

        if 'auction' in page_props:
            auction = page_props['auction']
            print(f"Auction: {auction.get('name', '')[:50]}...")
            print(f"Lots in array: {len(auction.get('lots', []))}")

            if auction.get('lots'):
                lot = auction['lots'][0]
                print(f"\nFIRST LOT:")
                print(json.dumps(lot, indent=2)[:1500])

        await browser.close()


if __name__ == "__main__":
    asyncio.run(main())
20  src/cache.py
@@ -50,6 +50,8 @@ class CacheManager:
                 url TEXT UNIQUE,
                 title TEXT,
                 current_bid TEXT,
+                starting_bid TEXT,
+                minimum_bid TEXT,
                 bid_count INTEGER,
                 closing_time TEXT,
                 viewing_time TEXT,
@@ -72,6 +74,15 @@ class CacheManager:
             )
         """)

+        # Add new columns to lots table if they don't exist
+        cursor = conn.execute("PRAGMA table_info(lots)")
+        columns = {row[1] for row in cursor.fetchall()}
+
+        if 'starting_bid' not in columns:
+            conn.execute("ALTER TABLE lots ADD COLUMN starting_bid TEXT")
+        if 'minimum_bid' not in columns:
+            conn.execute("ALTER TABLE lots ADD COLUMN minimum_bid TEXT")
+
         # Remove duplicates before creating unique index
         # Keep the row with the smallest id (first occurrence) for each (lot_id, url) pair
         conn.execute("""
@@ -165,15 +176,18 @@ class CacheManager:
         with sqlite3.connect(self.db_path) as conn:
             conn.execute("""
                 INSERT OR REPLACE INTO lots
-                (lot_id, auction_id, url, title, current_bid, bid_count, closing_time,
-                 viewing_time, pickup_date, location, description, category, scraped_at)
-                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+                (lot_id, auction_id, url, title, current_bid, starting_bid, minimum_bid,
+                 bid_count, closing_time, viewing_time, pickup_date, location, description,
+                 category, scraped_at)
+                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
             """, (
                 lot_data['lot_id'],
                 lot_data.get('auction_id', ''),
                 lot_data['url'],
                 lot_data['title'],
                 lot_data.get('current_bid', ''),
+                lot_data.get('starting_bid', ''),
+                lot_data.get('minimum_bid', ''),
                 lot_data.get('bid_count', 0),
                 lot_data.get('closing_time', ''),
                 lot_data.get('viewing_time', ''),
138  src/graphql_client.py  Normal file
@@ -0,0 +1,138 @@
#!/usr/bin/env python3
"""
GraphQL client for fetching lot bidding data from Troostwijk API
"""
import aiohttp
from typing import Dict, Optional

GRAPHQL_ENDPOINT = "https://storefront.tbauctions.com/storefront/graphql"

LOT_BIDDING_QUERY = """
query LotBiddingData($lotDisplayId: String!, $locale: String!, $platform: Platform!) {
  lotDetails(displayId: $lotDisplayId, locale: $locale, platform: $platform) {
    estimatedFullPrice {
      saleTerm
    }
    lot {
      id
      displayId
      auctionId
      currentBidAmount {
        cents
        currency
      }
      initialAmount {
        cents
        currency
      }
      nextMinimalBid {
        cents
        currency
      }
      nextBidStepInCents
      vat
      markupPercentage
      biddingStatus
      bidsCount
      startDate
      endDate
      assignedExplicitly
      minimumBidAmountMet
    }
  }
}
"""


async def fetch_lot_bidding_data(lot_display_id: str) -> Optional[Dict]:
    """
    Fetch lot bidding data from GraphQL API

    Args:
        lot_display_id: The lot display ID (e.g., "A1-28505-5")

    Returns:
        Dict with bidding data or None if request fails
    """
    variables = {
        "lotDisplayId": lot_display_id,
        "locale": "nl",
        "platform": "TWK"
    }

    payload = {
        "query": LOT_BIDDING_QUERY,
        "variables": variables
    }

    try:
        async with aiohttp.ClientSession() as session:
            async with session.post(GRAPHQL_ENDPOINT, json=payload, timeout=30) as response:
                if response.status == 200:
                    data = await response.json()
                    lot_details = data.get('data', {}).get('lotDetails', {})

                    if lot_details and lot_details.get('lot'):
                        return lot_details
                    return None
                else:
                    print(f"    GraphQL API error: {response.status}")
                    return None
    except Exception as e:
        print(f"    GraphQL request failed: {e}")
        return None


def format_bid_data(lot_details: Dict) -> Dict:
    """
    Format GraphQL lot details into scraper format

    Args:
        lot_details: Raw lot details from GraphQL API

    Returns:
        Dict with formatted bid data
    """
    lot = lot_details.get('lot', {})

    current_bid_amount = lot.get('currentBidAmount')
    initial_amount = lot.get('initialAmount')
    next_minimal_bid = lot.get('nextMinimalBid')

    # Format currency amounts
    def format_cents(amount_obj):
        if not amount_obj or not isinstance(amount_obj, dict):
            return None
        cents = amount_obj.get('cents')
        currency = amount_obj.get('currency', 'EUR')
        if cents is None:
            return None
        return f"EUR {cents / 100:.2f}" if currency == 'EUR' else f"{currency} {cents / 100:.2f}"

    current_bid = format_cents(current_bid_amount) or "No bids"
    starting_bid = format_cents(initial_amount) or ""
    minimum_bid = format_cents(next_minimal_bid) or ""

    # Format timestamps (Unix timestamps in seconds)
    start_date = lot.get('startDate')
    end_date = lot.get('endDate')

    def format_timestamp(ts):
        if ts:
            from datetime import datetime
            try:
                # Timestamps are already in seconds
                return datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
            except Exception:
                return ''
        return ''

    return {
        'current_bid': current_bid,
        'starting_bid': starting_bid,
        'minimum_bid': minimum_bid,
        'bid_count': lot.get('bidsCount', 0),
        'closing_time': format_timestamp(end_date),
        'bidding_status': lot.get('biddingStatus', ''),
        'vat_percentage': lot.get('vat', 0),
    }
src/scraper.py
@@ -19,6 +19,7 @@ from config import (
 )
 from cache import CacheManager
 from parse import DataParser
+from graphql_client import fetch_lot_bidding_data, format_bid_data

 class TroostwijkScraper:
     """Main scraper class for Troostwijk Auctions"""
@@ -176,29 +177,44 @@ class TroostwijkScraper:
         self.visited_lots.add(url)

         if page_data.get('type') == 'auction':
-            print(f"  → Type: AUCTION")
-            print(f"  → Title: {page_data.get('title', 'N/A')[:60]}...")
-            print(f"  → Location: {page_data.get('location', 'N/A')}")
-            print(f"  → Lots: {page_data.get('lots_count', 0)}")
+            print(f"  Type: AUCTION")
+            print(f"  Title: {page_data.get('title', 'N/A')[:60]}...")
+            print(f"  Location: {page_data.get('location', 'N/A')}")
+            print(f"  Lots: {page_data.get('lots_count', 0)}")
             self.cache.save_auction(page_data)

         elif page_data.get('type') == 'lot':
-            print(f"  → Type: LOT")
-            print(f"  → Title: {page_data.get('title', 'N/A')[:60]}...")
-            print(f"  → Bid: {page_data.get('current_bid', 'N/A')}")
-            print(f"  → Location: {page_data.get('location', 'N/A')}")
+            print(f"  Type: LOT")
+            print(f"  Title: {page_data.get('title', 'N/A')[:60]}...")
+
+            # Fetch bidding data from GraphQL API
+            lot_id = page_data.get('lot_id')
+            print(f"  Fetching bidding data from API...")
+            bidding_data = await fetch_lot_bidding_data(lot_id)
+
+            if bidding_data:
+                formatted_data = format_bid_data(bidding_data)
+                # Update page_data with real bidding info
+                page_data.update(formatted_data)
+                print(f"  Bid: {page_data.get('current_bid', 'N/A')}")
+                print(f"  Bid Count: {page_data.get('bid_count', 0)}")
+                print(f"  Closing: {page_data.get('closing_time', 'N/A')}")
+            else:
+                print(f"  Bid: {page_data.get('current_bid', 'N/A')} (from HTML)")
+
+            print(f"  Location: {page_data.get('location', 'N/A')}")
             self.cache.save_lot(page_data)

             images = page_data.get('images', [])
             if images:
                 self.cache.save_images(page_data['lot_id'], images)
-                print(f"  → Images: {len(images)}")
+                print(f"  Images: {len(images)}")

                 if self.download_images:
                     for i, img_url in enumerate(images):
                         local_path = await self._download_image(img_url, page_data['lot_id'], i)
                         if local_path:
-                            print(f"    ✓ Downloaded: {Path(local_path).name}")
+                            print(f"    Downloaded: {Path(local_path).name}")

         return page_data
64  test_full_scraper.py  Normal file
@@ -0,0 +1,64 @@
#!/usr/bin/env python3
"""Test the full scraper with one lot"""
import asyncio
import sys
sys.path.insert(0, 'src')

from scraper import TroostwijkScraper


async def main():
    scraper = TroostwijkScraper()

    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page(
            viewport={'width': 1920, 'height': 1080},
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        )

        # Test with a known lot
        lot_url = "https://www.troostwijkauctions.com/l/%25282x%2529-duo-bureau-160x168-cm-A1-28505-5"

        print(f"Testing with: {lot_url}\n")
        result = await scraper.crawl_page(page, lot_url)

        if result:
            print(f"\n{'='*60}")
            print("FINAL RESULT:")
            print(f"{'='*60}")
            print(f"Lot ID: {result.get('lot_id')}")
            print(f"Title: {result.get('title', '')[:50]}...")
            print(f"Current Bid: {result.get('current_bid')}")
            print(f"Starting Bid: {result.get('starting_bid')}")
            print(f"Minimum Bid: {result.get('minimum_bid')}")
            print(f"Bid Count: {result.get('bid_count')}")
            print(f"Closing Time: {result.get('closing_time')}")
            print(f"Location: {result.get('location')}")

        await browser.close()

    # Verify database
    import sqlite3
    conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
    cursor = conn.execute("""
        SELECT current_bid, starting_bid, minimum_bid, bid_count, closing_time
        FROM lots
        WHERE lot_id = 'A1-28505-5'
    """)
    row = cursor.fetchone()
    conn.close()

    if row:
        print(f"\n{'='*60}")
        print("DATABASE VERIFICATION:")
        print(f"{'='*60}")
        print(f"Current Bid: {row[0]}")
        print(f"Starting Bid: {row[1]}")
        print(f"Minimum Bid: {row[2]}")
        print(f"Bid Count: {row[3]}")
        print(f"Closing Time: {row[4]}")


if __name__ == "__main__":
    asyncio.run(main())
32  test_graphql_scraper.py  Normal file
@@ -0,0 +1,32 @@
#!/usr/bin/env python3
"""Test the updated scraper with GraphQL integration"""
import asyncio
import sys
sys.path.insert(0, 'src')

from graphql_client import fetch_lot_bidding_data, format_bid_data


async def main():
    # Test with known lot ID
    lot_id = "A1-28505-5"

    print(f"Testing GraphQL API with lot: {lot_id}\n")

    bidding_data = await fetch_lot_bidding_data(lot_id)

    if bidding_data:
        print("Raw GraphQL Response:")
        print("="*60)
        import json
        print(json.dumps(bidding_data, indent=2))

        print("\n\nFormatted Data:")
        print("="*60)
        formatted = format_bid_data(bidding_data)
        for key, value in formatted.items():
            print(f"  {key}: {value}")
    else:
        print("Failed to fetch bidding data")


if __name__ == "__main__":
    asyncio.run(main())
43  test_live_lot.py  Normal file
@@ -0,0 +1,43 @@
#!/usr/bin/env python3
"""Test scraping a single live lot page"""
import asyncio
import sys
sys.path.insert(0, 'src')

from scraper import TroostwijkScraper


async def main():
    scraper = TroostwijkScraper()

    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        # Get a lot URL from the database
        import sqlite3
        conn = sqlite3.connect('/mnt/okcomputer/output/cache.db')
        cursor = conn.execute("SELECT url FROM lots LIMIT 1")
        row = cursor.fetchone()
        conn.close()

        if not row:
            print("No lots in database")
            return

        lot_url = row[0]
        print(f"Fetching: {lot_url}\n")

        result = await scraper.crawl_page(page, lot_url)

        if result:
            print(f"\nExtracted Data:")
            print(f"  current_bid: {result.get('current_bid')}")
            print(f"  bid_count: {result.get('bid_count')}")
            print(f"  closing_time: {result.get('closing_time')}")

        await browser.close()


if __name__ == "__main__":
    asyncio.run(main())