Reviewer's Guide

Refactors URL handling and detail extraction to better support Kuaishou mobile pages (chenzhongtech/gifshow), enriches extracted metadata (including collection count), prioritizes mobile INIT_STATE-based parsing, and adds new request utilities for collection and comment counts, while adjusting API models to return plain dict params.

Sequence diagram for the detail endpoint using mobile Kuaishou INIT_STATE parsing

sequenceDiagram
actor User
participant APIApp as FastAPI_app
participant Examiner
participant Detail as DetailLink
participant HTTP as Kuaishou_servers
participant Extractor as HTMLExtractor
User->>APIApp: POST /detail (DetailModel)
APIApp->>Examiner: run(text, type_=detail, proxy)
Examiner-->>APIApp: list urls
APIApp->>APIApp: select target_url
Note over APIApp: Prefer chenzhongtech.com
APIApp->>Detail: detail_one(target_url, False, proxy, cookie)
Detail->>Detail: request_url(target_url, proxy, cookie)
alt chenzhongtech.com or gifshow.com
Detail->>Detail: _get_mobile_headers(cookie)
else other domain
Detail->>Detail: use pc_headers
end
Detail->>HTTP: GET target_url with headers
HTTP-->>Detail: HTML with window.INIT_STATE
Detail-->>APIApp: HTML text
APIApp->>Extractor: __convert_object(text, web=False)
Extractor->>Extractor: parse APP_KEYWORD INIT_STATE
Extractor-->>APIApp: photo object with _counts
APIApp->>Extractor: __extract_detail(data, id_, web=False)
Extractor->>Extractor: __extract_detail_app(photo, id_)
Extractor-->>APIApp: detail dict (includes collectionCount)
APIApp-->>User: ResponseModel(message, params as dict, data as dict)
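The header-selection branch in the diagram (mobile headers for chenzhongtech/gifshow hosts, PC headers otherwise) can be sketched as follows. The User-Agent strings and the helper name are placeholders for illustration, not the project's actual constants:

```python
from urllib.parse import urlparse

MOBILE_DOMAINS = ("chenzhongtech.com", "gifshow.com")

def pick_headers(url: str, cookie: str = "") -> dict:
    """Choose mobile vs PC headers based on the target host, as request_url does."""
    host = urlparse(url).hostname or ""
    if any(host == d or host.endswith("." + d) for d in MOBILE_DOMAINS):
        # Placeholder mobile UA; stands in for _get_mobile_headers(cookie)
        headers = {"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X)"}
    else:
        # Placeholder PC UA; stands in for pc_headers
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    if cookie:
        headers["Cookie"] = cookie
    return headers
```

Matching on the registrable domain (including subdomains such as `v.m.chenzhongtech.com`) avoids accidental substring matches in unrelated hostnames.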
ER diagram for the updated record schema with collectionCount

erDiagram
RECORD {
INTEGER realLikeCount
INTEGER shareCount
INTEGER commentCount
INTEGER collectionCount
TEXT timestamp
TEXT viewCount
TEXT download
}
RECORD ||--o{ DOWNLOAD_URL : has
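For illustration, the RECORD schema above can be materialized in SQLite. Column names and types are taken from the diagram; the real DDL managed by RecordManager may differ:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE record (
        realLikeCount INTEGER,
        shareCount INTEGER,
        commentCount INTEGER,
        collectionCount INTEGER,
        timestamp TEXT,
        viewCount TEXT,
        download TEXT
    )
    """
)
# Sample row; values are made up for the example
conn.execute(
    "INSERT INTO record VALUES (?, ?, ?, ?, ?, ?, ?)",
    (100, 5, 12, 7, "2024-01-01 00:00:00", "1000", "https://example.com/video.mp4"),
)
row = conn.execute("SELECT collectionCount FROM record").fetchone()
```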
Class diagram for updated Kuaishou mobile extraction and count utilities

classDiagram
direction LR
class Examiner {
+Pattern PC_COMPLETE_URL
+Pattern C_COMPLETE_URL
+Pattern REDIRECT_URL
+Pattern GIFSHOW_URL
+__init__(manager)
+__validate_links(urls) list~str~
+__request_redirect(url, proxy, cookie) str
+_extract_params_detail(url, redirect, user_id, photo_id) tuple
}
class HTMLExtractor {
+str SCRIPT
+str WEB_KEYWORD
+str APP_KEYWORD
+Pattern PHOTO_REGEX
+__init__(manager)
+__convert_object(text, web) dict
+__extract_detail(data, id_, web) dict
+__extract_detail_web(data, id_) dict
+__extract_detail_app(data, id_) dict
+_extract_download_urls(data, index) list~str~
}
class DetailLink {
+__init__(manager)
+run(url, proxy, cookie) str
+_get_mobile_headers(cookie) dict
+request_url(url, proxy, cookie) str
}
class CollectionCount {
+str API_URL_TEMPLATE
+Pattern INIT_STATE_PATTERN
+client
+headers
+console
+int retry
+str photo_id
+str note
+int collection_count
+__init__(manager, photo_id)
+run() int
+run_single() void
+_extract_collection_count(html) void
+_find_collection_count(data) int
+_recursive_find(obj, key) int
}
class CommentCount {
+str API_URL
+str photo_id
+str pcursor
+str note
+int comment_count
+__init__(manager, photo_id, pcursor)
+run() int
+run_single() void
+deal_response(response) void
}
class API {
+client
+headers
+console
+max_retry
+__init__(manager)
+get_data(url, json, method) dict
}
class ResponseModel {
+str message
+dict params
+dict data
+url str
}
class UrlResponse {
+str message
+list~str~ urls
+dict params
+url str
}
class RecordManager {
+list~tuple~ fields
}
Examiner --> DetailLink : uses
Examiner --> HTMLExtractor : uses
DetailLink --> HTMLExtractor : returns HTML for parsing
CollectionCount --> Manager : uses
CommentCount --> API : extends
API <|-- CommentCount
ResponseModel --> DetailModel : previously held
UrlResponse --> ShortUrl : previously held
RecordManager --> "1" Record : manages
class Record {
+int realLikeCount
+int shareCount
+int commentCount
+int collectionCount
+str timestamp
+str viewCount
+str download
}
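The ResponseModel change described in the guide (params and data returned as plain dicts rather than nested models) can be sketched with a plain dataclass standing in for the pydantic model; the field values here are made up:

```python
from dataclasses import dataclass

@dataclass
class ResponseModel:
    """Stand-in for the API's pydantic model; params/data are plain dicts now."""
    message: str
    params: dict
    data: dict
    url: str

resp = ResponseModel(
    message="ok",
    params={"proxy": None},                # hypothetical request params
    data={"collectionCount": 7},           # detail dict including the new field
    url="https://chenzhongtech.com/fw/photo/3xabc123",
)
```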
File-Level Changes
Hey - I've found 4 issues, and left some high level feedback:
- __validate_links now collects photo_ids in a set and then expands them, which makes the returned URL list order non-deterministic; if any callers rely on stable ordering, consider preserving input order (e.g., using a list + membership check instead of a set).
- In HTMLExtractor.__convert_object you are still using str.lstrip() with WEB_KEYWORD/APP_KEYWORD, which removes any of those characters rather than a fixed prefix; it would be safer to strip a known prefix via startswith + slicing (or replace the exact prefix) to avoid corrupting JSON when the payload happens to start with similar characters.
- _extract_params_detail uses url.hostname directly in an 'in' check; for consistency with __validate_links (where you guard with (url.hostname or "")), consider normalizing hostname the same way to avoid potential NoneType issues on edge-case URLs.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- __validate_links now collects photo_ids in a set and then expands them, which makes the returned URL list order non-deterministic; if any callers rely on stable ordering, consider preserving input order (e.g., using a list + membership check instead of a set).
- In HTMLExtractor.__convert_object you are still using str.lstrip() with WEB_KEYWORD/APP_KEYWORD, which removes any of those characters rather than a fixed prefix; it would be safer to strip a known prefix via startswith + slicing (or replace the exact prefix) to avoid corrupting JSON when the payload happens to start with similar characters.
- _extract_params_detail uses url.hostname directly in an 'in' check; for consistency with __validate_links (where you guard with (url.hostname or "")), consider normalizing hostname the same way to avoid potential NoneType issues on edge-case URLs.
## Individual Comments
### Comment 1
<location path="source/link/examiner.py" line_range="184-191" />
<code_context>
url = urlparse(url)
params = parse_qs(url.query)
- if "chenzhongtech" in url.hostname:
+ if "chenzhongtech" in url.hostname or "gifshow" in url.hostname:
+ # chenzhongtech 和 gifshow 域名都是移动版页面
return (
False,
params.get("userId", [""])[0],
- params.get("photoId", [""])[0],
+ url.path.split("/")[-1],
)
- elif "short-video" in url.path or "fw/photo" in url.path:
+ elif "short-video" in url.path:
return (
True,
</code_context>
<issue_to_address>
**issue (bug_risk):** Dropping the `"fw/photo" in url.path` branch could break handling for some Kuaishou URLs.
This used to treat both `"short-video"` and `"fw/photo"` paths as `web=True`. Now `"fw/photo"` URLs (e.g. `https://www.kuaishou.com/fw/photo/...`) will fall through and change behavior. If these URLs are still expected, this is a regression; consider either restoring the `"fw/photo"` condition here or adding a separate branch for it.
</issue_to_address>
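A quick urlparse check illustrates the fall-through (the photo id in the URL is a made-up placeholder):

```python
from urllib.parse import urlparse

url = urlparse("https://www.kuaishou.com/fw/photo/3xabc123")

# Neither branch matches after the change:
is_mobile = "chenzhongtech" in (url.hostname or "") or "gifshow" in (url.hostname or "")
is_web = "short-video" in url.path

# The URL falls through to the unknown-url branch, even though the dropped
# "fw/photo" condition would have caught it.
```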
### Comment 2
<location path="source/extract/extractor.py" line_range="76-85" />
<code_context>
web: bool,
) -> dict:
if web:
- text = text.lstrip(self.WEB_KEYWORD if web else self.APP_KEYWORD)
+ text = text.lstrip(self.WEB_KEYWORD)
text = text.replace(
";(function(){var s;(s=document.currentScript||document.scripts["
"document.scripts.length-1]).parentNode.removeChild(s);}());",
"",
)
+ return safe_load(text)
else:
- text = text[1] if (text := self.PHOTO_REGEX.search(text)) else ""
- return safe_load(text)
+ # 移动版:解析整个 INIT_STATE 来同时获取 photo 和 counts
+ init_state_text = text.lstrip(self.APP_KEYWORD)
+ try:
+ init_state = json.loads(init_state_text)
</code_context>
<issue_to_address>
**issue (bug_risk):** Using `str.lstrip` with the keyword constants can corrupt the JSON payload.
Because `lstrip` treats its argument as a set of characters, `text.lstrip(self.WEB_KEYWORD)` / `text.lstrip(self.APP_KEYWORD)` will strip any leading combination of those characters until a different one appears, not just the exact keyword. This can remove valid leading JSON characters and break parsing. Use an explicit prefix check and slice instead, e.g.:
```python
if text.startswith(self.WEB_KEYWORD):
text = text[len(self.WEB_KEYWORD):]
```
(and similarly for `APP_KEYWORD`).
</issue_to_address>
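A minimal repro of the character-set behavior, assuming a keyword of the form `window.INIT_STATE=` (the project's actual APP_KEYWORD value may differ):

```python
import json

# Hypothetical keyword value for illustration only.
APP_KEYWORD = "window.INIT_STATE="

good = APP_KEYWORD + '{"collectionCount": 42}'
bad = APP_KEYWORD + "null"  # JSON payload that begins with a character from the set

# lstrip strips any leading run of characters drawn from APP_KEYWORD,
# not the exact prefix:
assert good.lstrip(APP_KEYWORD) == '{"collectionCount": 42}'  # happens to work
assert bad.lstrip(APP_KEYWORD) == "ull"                       # leading "n" eaten too

# The safe prefix removal suggested above:
def strip_prefix(text: str, prefix: str) -> str:
    return text[len(prefix):] if text.startswith(prefix) else text
```

(On Python 3.9+, `str.removeprefix` does the same thing as `strip_prefix`.)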
### Comment 3
<location path="source/link/examiner.py" line_range="57" />
<code_context>
- self.C_COMPLETE_URL.finditer(urls),
- )
- ]
+ """从各种URL格式中提取photoId,并构造统一的四种格式URL."""
+ photo_ids = set()
+
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting shared helpers for photo ID extraction and URL normalization so that __validate_links orchestrates them instead of inlining all parsing logic.
You can reduce the new complexity in `__validate_links` by extracting the URL → `photo_id` logic into a shared helper and separating extraction from normalization. This also lets you reuse logic already present in `_extract_params_detail`.
For example:
```python
def _extract_photo_id(self, url_str: str) -> str | None:
url = urlparse(url_str)
params = parse_qs(url.query)
host = url.hostname or ""
path = url.path or ""
if "chenzhongtech" in host:
# chenzhongtech: photoId in query
return params.get("photoId", [""])[0] or None
if "gifshow" in host:
# gifshow: photoId in path
return path.split("/")[-1] or None
if "short-video" in path or "fw/photo" in path:
# kuaishou: photoId in path
return path.split("/")[-1] or None
return None
```
Then `__validate_links` can focus on:
```python
def __validate_links(self, urls: str) -> list[str]:
photo_ids: set[str] = set()
for pattern in (
self.REDIRECT_URL,
self.GIFSHOW_URL,
self.PC_COMPLETE_URL,
self.C_COMPLETE_URL,
):
for match in pattern.finditer(urls):
photo_id = self._extract_photo_id(match.group())
if photo_id:
photo_ids.add(photo_id)
return [url for pid in photo_ids for url in self._build_normalized_urls(pid)]
```
And normalization is isolated:
```python
def _build_normalized_urls(self, photo_id: str) -> list[str]:
return [
f"https://m.gifshow.com/fw/photo/{photo_id}",
f"https://v.m.chenzhongtech.com/fw/photo/{photo_id}",
f"https://chenzhongtech.com/fw/photo/{photo_id}",
f"https://1.gifshow.com/fw/photo/{photo_id}",
]
```
You can also reuse `_extract_photo_id` inside `_extract_params_detail` to avoid diverging domain rules:
```python
def _extract_params_detail(self, url: str) -> tuple[bool | None, str, str]:
parsed = urlparse(url)
params = parse_qs(parsed.query)
host = parsed.hostname or ""
path = parsed.path or ""
if "chenzhongtech" in host or "gifshow" in host:
return (
False,
params.get("userId", [""])[0],
self._extract_photo_id(url) or "",
)
if "short-video" in path:
return True, "", self._extract_photo_id(url) or ""
self.console.error(f"Unknown url: {urlunparse(parsed)}")
return None, "", ""
```
This keeps all behavior, but:
- Removes duplicated domain/path parsing.
- Splits responsibilities (collecting URLs, extracting `photo_id`, building normalized URLs).
- Makes `__validate_links` a simple orchestration of these helpers instead of a deeply nested method.
</issue_to_address>
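Exercising the normalization sketch above as a free function (the photo id is a made-up placeholder):

```python
def build_normalized_urls(photo_id: str) -> list[str]:
    # Mirrors the _build_normalized_urls sketch from the comment above
    return [
        f"https://m.gifshow.com/fw/photo/{photo_id}",
        f"https://v.m.chenzhongtech.com/fw/photo/{photo_id}",
        f"https://chenzhongtech.com/fw/photo/{photo_id}",
        f"https://1.gifshow.com/fw/photo/{photo_id}",
    ]

urls = build_normalized_urls("3xabc123")
```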
### Comment 4
<location path="source/request/collection.py" line_range="54" />
<code_context>
+ html = response.text
+ self._extract_collection_count(html)
+
+ def _extract_collection_count(self, html: str) -> None:
+ """从 HTML 中提取 window.INIT_STATE 并解析 collectionCount"""
+ if not html:
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting the INIT_STATE parsing and collectionCount search into a single pure helper function so the CollectionCount class focuses only on HTTP/orchestration concerns.
You can reduce the cognitive load by separating the HTTP concern from the INIT_STATE parsing/search logic and collapsing the two search helpers into a single, focused function.
**1. Extract a pure helper for INIT_STATE → collectionCount**
Move the parsing + search into a standalone helper (module-level or shared util), so `CollectionCount` only worries about HTTP, logging, and wiring:
```python
def extract_collection_count_from_init_state(init_state: dict) -> int | None:
# 常见路径 1: 根级别
if "collectionCount" in init_state:
return init_state["collectionCount"]
# 常见路径 2: photo
photo = init_state.get("photo")
if isinstance(photo, dict) and "collectionCount" in photo:
return photo["collectionCount"]
# 常见路径 3: visionVideoDetail
detail = init_state.get("visionVideoDetail")
if isinstance(detail, dict) and "collectionCount" in detail:
return detail["collectionCount"]
# 备选: 通用递归查找
def recursive_find(obj) -> int | None:
if isinstance(obj, dict):
if "collectionCount" in obj:
return obj["collectionCount"]
for v in obj.values():
found = recursive_find(v)
if found is not None:
return found
elif isinstance(obj, list):
for item in obj:
found = recursive_find(item)
if found is not None:
return found
return None
return recursive_find(init_state)
```
Then the class no longer needs `_find_collection_count` and `_recursive_find` methods:
```python
def _extract_collection_count(self, html: str) -> None:
if not html:
self.collection_count = -1
return
match = self.INIT_STATE_PATTERN.search(html)
if not match:
self.console.warning(_("未找到 window.INIT_STATE"))
self.collection_count = -1
return
try:
init_state = json.loads(match.group(1))
except json.JSONDecodeError as e:
self.console.error(_("解析 INIT_STATE JSON 失败: {error}").format(error=e))
self.collection_count = -1
return
collection_count = extract_collection_count_from_init_state(init_state)
self.collection_count = collection_count if collection_count is not None else -1
```
This keeps all existing behavior (regex, JSON parsing, prioritized paths, recursive fallback) while:
- Making `CollectionCount` responsible only for HTTP + orchestration.
- Encapsulating the “complex” tree search into a single, reusable, and testable function.
- Removing the need to understand two methods (`_find_collection_count` + `_recursive_find`) to follow the extraction logic.
</issue_to_address>
| if "chenzhongtech" in url.hostname or "gifshow" in url.hostname: | ||
| # chenzhongtech 和 gifshow 域名都是移动版页面 | ||
| return ( | ||
| False, | ||
| params.get("userId", [""])[0], | ||
| params.get("photoId", [""])[0], | ||
| url.path.split("/")[-1], | ||
| ) | ||
| elif "short-video" in url.path or "fw/photo" in url.path: | ||
| elif "short-video" in url.path: |
issue (bug_risk): Dropping the "fw/photo" in url.path branch could break handling for some Kuaishou URLs.

This used to treat both "short-video" and "fw/photo" paths as web=True. Now "fw/photo" URLs (e.g. https://www.kuaishou.com/fw/photo/...) will fall through and change behavior. If these URLs are still expected, this is a regression; consider either restoring the "fw/photo" condition here or adding a separate branch for it.
```diff
         if web:
-            text = text.lstrip(self.WEB_KEYWORD if web else self.APP_KEYWORD)
+            text = text.lstrip(self.WEB_KEYWORD)
             text = text.replace(
                 ";(function(){var s;(s=document.currentScript||document.scripts["
                 "document.scripts.length-1]).parentNode.removeChild(s);}());",
                 "",
             )
+            return safe_load(text)
         else:
-            text = text[1] if (text := self.PHOTO_REGEX.search(text)) else ""
-            return safe_load(text)
+            # Mobile: parse the whole INIT_STATE to get both photo and counts
```
issue (bug_risk): Using str.lstrip with the keyword constants can corrupt the JSON payload.

Because lstrip treats its argument as a set of characters, text.lstrip(self.WEB_KEYWORD) / text.lstrip(self.APP_KEYWORD) will strip any leading combination of those characters until a different one appears, not just the exact keyword. This can remove valid leading JSON characters and break parsing. Use an explicit prefix check and slice instead, e.g.:

```python
if text.startswith(self.WEB_KEYWORD):
    text = text[len(self.WEB_KEYWORD):]
```

(and similarly for APP_KEYWORD).
```diff
-                self.C_COMPLETE_URL.finditer(urls),
-            )
-        ]
+        """Extract photoId from various URL formats and build the four normalized URL variants."""
```
issue (complexity): 建议抽取用于 photo ID 提取和 URL 规范化的共享辅助方法,让 __validate_links 只负责调度这些方法,而不是内联所有解析逻辑。
你可以通过把 URL → photo_id 的逻辑抽成一个公共 helper,并将“提取”和“规范化”分离,来降低 __validate_links 新增的复杂度。这也可以让你重用 _extract_params_detail 中已有的逻辑。
例如:
def _extract_photo_id(self, url_str: str) -> str | None:
url = urlparse(url_str)
params = parse_qs(url.query)
host = url.hostname or ""
path = url.path or ""
if "chenzhongtech" in host:
# chenzhongtech: photoId in query
return params.get("photoId", [""])[0] or None
if "gifshow" in host:
# gifshow: photoId in path
return path.split("/")[-1] or None
if "short-video" in path or "fw/photo" in path:
# kuaishou: photoId in path
return path.split("/")[-1] or None
return None然后 __validate_links 可以专注于:
def __validate_links(self, urls: str) -> list[str]:
photo_ids: set[str] = set()
for pattern in (
self.REDIRECT_URL,
self.GIFSHOW_URL,
self.PC_COMPLETE_URL,
self.C_COMPLETE_URL,
):
for match in pattern.finditer(urls):
photo_id = self._extract_photo_id(match.group())
if photo_id:
photo_ids.add(photo_id)
return [url for pid in photo_ids for url in self._build_normalized_urls(pid)]并将规范化逻辑单独隔离:
def _build_normalized_urls(self, photo_id: str) -> list[str]:
return [
f"https://m.gifshow.com/fw/photo/{photo_id}",
f"https://v.m.chenzhongtech.com/fw/photo/{photo_id}",
f"https://chenzhongtech.com/fw/photo/{photo_id}",
f"https://1.gifshow.com/fw/photo/{photo_id}",
]你也可以在 _extract_params_detail 中重用 _extract_photo_id,以避免不同方法之间的域名规则逐渐产生偏差:
def _extract_params_detail(self, url: str) -> tuple[bool | None, str, str]:
parsed = urlparse(url)
params = parse_qs(parsed.query)
host = parsed.hostname or ""
path = parsed.path or ""
if "chenzhongtech" in host or "gifshow" in host:
return (
False,
params.get("userId", [""])[0],
self._extract_photo_id(url) or "",
)
if "short-video" in path:
return True, "", self._extract_photo_id(url) or ""
self.console.error(f"Unknown url: {urlunparse(parsed)}")
return None, "", ""这样既能保持现有行为,又可以:
- 去掉重复的域名/路径解析逻辑;
- 明确职责划分(收集 URL、提取
photo_id、构造规范化 URL); - 让
__validate_links成为这些 helper 的简单调度器,而不是一个高度嵌套的方法。
Original comment in English
issue (complexity): Consider extracting shared helpers for photo ID extraction and URL normalization so that __validate_links orchestrates them instead of inlining all parsing logic.
You can reduce the new complexity in __validate_links by extracting the URL → photo_id logic into a shared helper and separating extraction from normalization. This also lets you reuse logic already present in _extract_params_detail.
For example:
```python
def _extract_photo_id(self, url_str: str) -> str | None:
    url = urlparse(url_str)
    params = parse_qs(url.query)
    host = url.hostname or ""
    path = url.path or ""
    if "chenzhongtech" in host:
        # chenzhongtech: photoId in query
        return params.get("photoId", [""])[0] or None
    if "gifshow" in host:
        # gifshow: photoId in path
        return path.split("/")[-1] or None
    if "short-video" in path or "fw/photo" in path:
        # kuaishou: photoId in path
        return path.split("/")[-1] or None
    return None
```

Then `__validate_links` can focus on:
```python
def __validate_links(self, urls: str) -> list[str]:
    photo_ids: set[str] = set()
    for pattern in (
        self.REDIRECT_URL,
        self.GIFSHOW_URL,
        self.PC_COMPLETE_URL,
        self.C_COMPLETE_URL,
    ):
        for match in pattern.finditer(urls):
            photo_id = self._extract_photo_id(match.group())
            if photo_id:
                photo_ids.add(photo_id)
    return [url for pid in photo_ids for url in self._build_normalized_urls(pid)]
```

And normalization is isolated:
```python
def _build_normalized_urls(self, photo_id: str) -> list[str]:
    return [
        f"https://m.gifshow.com/fw/photo/{photo_id}",
        f"https://v.m.chenzhongtech.com/fw/photo/{photo_id}",
        f"https://chenzhongtech.com/fw/photo/{photo_id}",
        f"https://1.gifshow.com/fw/photo/{photo_id}",
    ]
```

You can also reuse `_extract_photo_id` inside `_extract_params_detail` to avoid diverging domain rules:
```python
def _extract_params_detail(self, url: str) -> tuple[bool | None, str, str]:
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    host = parsed.hostname or ""
    path = parsed.path or ""
    if "chenzhongtech" in host or "gifshow" in host:
        return (
            False,
            params.get("userId", [""])[0],
            self._extract_photo_id(url) or "",
        )
    if "short-video" in path:
        return True, "", self._extract_photo_id(url) or ""
    self.console.error(f"Unknown url: {urlunparse(parsed)}")
    return None, "", ""
```

This keeps all existing behavior, but:

- Removes duplicated domain/path parsing.
- Splits responsibilities: collecting URLs, extracting `photo_id`, building normalized URLs.
- Makes `__validate_links` a simple orchestration of these helpers instead of a deeply nested method.
```python
html = response.text
self._extract_collection_count(html)


def _extract_collection_count(self, html: str) -> None:
```
issue (complexity): Consider extracting the INIT_STATE parsing and collectionCount search into a single pure helper function so the CollectionCount class focuses only on HTTP/orchestration concerns.
You can reduce the cognitive load by separating the HTTP concern from the INIT_STATE parsing/search logic and collapsing the two search helpers into a single, focused function.
1. Extract a pure helper for INIT_STATE → collectionCount
Move the parsing + search into a standalone helper (module-level or shared util), so CollectionCount only worries about HTTP, logging, and wiring:
```python
def extract_collection_count_from_init_state(init_state: dict) -> int | None:
    # Common path 1: root level
    if "collectionCount" in init_state:
        return init_state["collectionCount"]
    # Common path 2: photo
    photo = init_state.get("photo")
    if isinstance(photo, dict) and "collectionCount" in photo:
        return photo["collectionCount"]
    # Common path 3: visionVideoDetail
    detail = init_state.get("visionVideoDetail")
    if isinstance(detail, dict) and "collectionCount" in detail:
        return detail["collectionCount"]

    # Fallback: generic recursive search
    def recursive_find(obj) -> int | None:
        if isinstance(obj, dict):
            if "collectionCount" in obj:
                return obj["collectionCount"]
            for v in obj.values():
                found = recursive_find(v)
                if found is not None:
                    return found
        elif isinstance(obj, list):
            for item in obj:
                found = recursive_find(item)
                if found is not None:
                    return found
        return None

    return recursive_find(init_state)
```

Then the class no longer needs the `_find_collection_count` and `_recursive_find` methods:
```python
def _extract_collection_count(self, html: str) -> None:
    if not html:
        self.collection_count = -1
        return
    match = self.INIT_STATE_PATTERN.search(html)
    if not match:
        self.console.warning(_("未找到 window.INIT_STATE"))
        self.collection_count = -1
        return
    try:
        init_state = json.loads(match.group(1))
    except json.JSONDecodeError as e:
        self.console.error(_("解析 INIT_STATE JSON 失败: {error}").format(error=e))
        self.collection_count = -1
        return
    collection_count = extract_collection_count_from_init_state(init_state)
    self.collection_count = collection_count if collection_count is not None else -1
```

This keeps all existing behavior (regex, JSON parsing, prioritized paths, recursive fallback) while:
- Making `CollectionCount` responsible only for HTTP + orchestration.
- Encapsulating the "complex" tree search into a single, reusable, and testable function.
- Removing the need to understand two methods (`_find_collection_count` + `_recursive_find`) to follow the extraction logic.
Summary by Sourcery
Add support for mobile gifshow/chenzhongtech URLs and enrich extraction of video metadata and counters while standardizing response formats.
New Features:
Bug Fixes:
Enhancements: