
Restore stability, return more information #48

Open
844704781 wants to merge 1 commit into JoeanAmier:master from 844704781:master

Conversation


@844704781 844704781 commented Mar 16, 2026


Summary by Sourcery

Add support for mobile gifshow/chenzhongtech URLs and enrich extraction of video metadata and counters while standardizing response formats.

New Features:

  • Support extracting and normalizing links from gifshow and additional chenzhongtech mobile URL formats.
  • Add separate request handlers to fetch collection counts from mobile HTML INIT_STATE and comment counts from the Kuaishou comment API.

Bug Fixes:

  • Fix detail extraction for mobile pages by properly parsing INIT_STATE and handling photo/counts data structures.
  • Improve robustness of photo JSON extraction from HTML by using a non-greedy, DOTALL regex and preferring direct JSON parsing for web pages.
  • Prevent unintended redirects to PC pages when requesting mobile detail URLs by using dedicated lightweight headers.

Enhancements:

  • Unify extracted URLs around a canonical set of mobile detail endpoints built from photoId values.
  • Include collectionCount in extracted video statistics and record schema, and prefer chenzhongtech URLs during detail resolution.
  • Return params as plain dicts in API response models for better serialization and client handling.
  • Prioritize mainMvUrls when extracting downloadable resources and better handle different container types.


sourcery-ai bot commented Mar 16, 2026


Reviewer's Guide

Refactors URL handling and detail extraction to better support Kuaishou mobile pages (chenzhongtech/gifshow), enriches extracted metadata (including collection count), prioritizes mobile INIT_STATE-based parsing, and adds new request utilities for collection and comment counts while adjusting API models to return plain dict params.

Sequence diagram for detail endpoint using mobile Kuaishou INIT_STATE parsing

sequenceDiagram
    actor User
    participant APIApp as FastAPI_app
    participant Examiner
    participant Detail as DetailLink
    participant HTTP as Kuaishou_servers
    participant Extractor as HTMLExtractor

    User->>APIApp: POST /detail (DetailModel)
    APIApp->>Examiner: run(text, type_=detail, proxy)
    Examiner-->>APIApp: list urls

    APIApp->>APIApp: select target_url
    Note over APIApp: Prefer chenzhongtech.com

    APIApp->>Detail: detail_one(target_url, False, proxy, cookie)
    Detail->>Detail: request_url(target_url, proxy, cookie)
    alt chenzhongtech.com or gifshow.com
        Detail->>Detail: _get_mobile_headers(cookie)
    else other domain
        Detail->>Detail: use pc_headers
    end

    Detail->>HTTP: GET target_url with headers
    HTTP-->>Detail: HTML with window.INIT_STATE
    Detail-->>APIApp: HTML text

    APIApp->>Extractor: __convert_object(text, web=False)
    Extractor->>Extractor: parse APP_KEYWORD INIT_STATE
    Extractor-->>APIApp: photo object with _counts

    APIApp->>Extractor: __extract_detail(data, id_, web=False)
    Extractor->>Extractor: __extract_detail_app(photo, id_)
    Extractor-->>APIApp: detail dict (includes collectionCount)

    APIApp-->>User: ResponseModel(message, params as dict, data as dict)

ER diagram for updated record schema with collectionCount

erDiagram
    RECORD {
        INTEGER realLikeCount
        INTEGER shareCount
        INTEGER commentCount
        INTEGER collectionCount
        TEXT timestamp
        TEXT viewCount
        TEXT download
    }

    RECORD ||--o{ DOWNLOAD_URL : has

Class diagram for updated Kuaishou mobile extraction and count utilities

classDiagram
    direction LR

    class Examiner {
        +Pattern PC_COMPLETE_URL
        +Pattern C_COMPLETE_URL
        +Pattern REDIRECT_URL
        +Pattern GIFSHOW_URL
        +__init__(manager)
        +__validate_links(urls) list~str~
        +__request_redirect(url, proxy, cookie) str
        +_extract_params_detail(url, redirect, user_id, photo_id) tuple
    }

    class HTMLExtractor {
        +str SCRIPT
        +str WEB_KEYWORD
        +str APP_KEYWORD
        +Pattern PHOTO_REGEX
        +__init__(manager)
        +__convert_object(text, web) dict
        +__extract_detail(data, id_, web) dict
        +__extract_detail_web(data, id_) dict
        +__extract_detail_app(data, id_) dict
        +_extract_download_urls(data, index) list~str~
    }

    class DetailLink {
        +__init__(manager)
        +run(url, proxy, cookie) str
        +_get_mobile_headers(cookie) dict
        +request_url(url, proxy, cookie) str
    }

    class CollectionCount {
        +str API_URL_TEMPLATE
        +Pattern INIT_STATE_PATTERN
        +client
        +headers
        +console
        +int retry
        +str photo_id
        +str note
        +int collection_count
        +__init__(manager, photo_id)
        +run() int
        +run_single() void
        +_extract_collection_count(html) void
        +_find_collection_count(data) int
        +_recursive_find(obj, key) int
    }

    class CommentCount {
        +str API_URL
        +str photo_id
        +str pcursor
        +str note
        +int comment_count
        +__init__(manager, photo_id, pcursor)
        +run() int
        +run_single() void
        +deal_response(response) void
    }

    class API {
        +client
        +headers
        +console
        +max_retry
        +__init__(manager)
        +get_data(url, json, method) dict
    }

    class ResponseModel {
        +str message
        +dict params
        +dict data
        +url str
    }

    class UrlResponse {
        +str message
        +list~str~ urls
        +dict params
        +url str
    }

    class RecordManager {
        +list~tuple~ fields
    }

    Examiner --> DetailLink : uses
    Examiner --> HTMLExtractor : uses

    DetailLink --> HTMLExtractor : returns HTML for parsing

    CollectionCount --> Manager : uses
    CommentCount --> API : extends
    API <|-- CommentCount

    ResponseModel --> DetailModel : previously held
    UrlResponse --> ShortUrl : previously held

    RecordManager --> "1" Record : manages

    class Record {
        +int realLikeCount
        +int shareCount
        +int commentCount
        +int collectionCount
        +str timestamp
        +str viewCount
        +str download
    }

File-Level Changes

Normalize various Kuaishou URL formats to photoIds and generate canonical mobile URLs, including new gifshow domains.
  • Add GIFSHOW_URL regex pattern to match gifshow.com/cn fw/photo links.
  • Rewrite __validate_links to extract photoId from multiple domain/path patterns and return four standardized mobile URLs per id.
  • Adjust _extract_params_detail to treat chenzhongtech.com and gifshow.com as mobile pages, pulling photoId from the path and restricting the web branch to short-video URLs.
source/link/examiner.py
Improve HTML/INIT_STATE parsing for both web and mobile detail extraction and surface additional metadata such as collectionCount and mainMvUrls.
  • Relax PHOTO_REGEX to a non-greedy, DOTALL JSON object match for photo blocks.
  • Change web branch of __convert_object to strip WEB_KEYWORD only and return safe_load of the remaining JSON state.
  • Replace app/mobile branch of __convert_object to parse the full INIT_STATE JSON, find the node containing photo, and attach counts as _counts on the returned object.
  • Simplify __extract_detail dispatcher so mobile branch passes the photo dict directly to __extract_detail_app.
  • Extend __extract_detail_app to read collectionCount from _counts.collectionCount.
  • Update _extract_download_urls to first return mainMvUrls (handling SimpleNamespace and dict entries) and fall back to image atlas extraction.
source/extract/extractor.py
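The non-greedy, DOTALL matching described above can be sketched with a toy pattern and sample markup (both hypothetical; the real PHOTO_REGEX and page HTML handled by source/extract/extractor.py may differ):

```python
import re
from json import loads

# Hypothetical pattern illustrating a non-greedy, DOTALL match of a
# "photo" JSON block; the actual PHOTO_REGEX may differ. Note that a
# non-greedy `\{.*?\}` stops at the first closing brace, so it only
# captures flat (non-nested) photo objects.
PHOTO_REGEX = re.compile(r'"photo"\s*:\s*(\{.*?\})', re.DOTALL)

html = (
    '<script>window.INIT_STATE = {"page": {"photo": {\n'
    '  "photoId": "abc123",\n'
    '  "caption": "demo"\n'
    '}, "counts": {"collectionCount": 7}}}</script>'
)

match = PHOTO_REGEX.search(html)
photo = loads(match[1]) if match else {}
print(photo["photoId"])  # abc123
```

The DOTALL flag lets the match span the line breaks inside the photo block, while the non-greedy quantifier keeps it from running past the first closing brace.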
Adjust FastAPI handlers to use the new URL normalization and mobile detail behavior and to serialize request models as dicts in responses.
  • Change share handler to call examiner.run with type_='detail' and to place extract.model_dump() into UrlResponse.params.
  • In detail handler, choose a chenzhongtech.com URL if present (else first URL) and call detail_one with an explicit False flag and provided proxy/cookie.
  • Return ResponseModel with params as extract.model_dump() instead of the model instance.
source/app/app.py
Specialize request headers for mobile detail URLs to avoid redirection and align response models to carry dict params.
  • Introduce _get_mobile_headers that returns a minimal mobile-style User-Agent and Accept header.
  • In request_url, use mobile headers for chenzhongtech.com and gifshow.com URLs; otherwise copy default headers, still applying Cookie and proxy if present.
source/link/detail.py
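The header-selection behaviour can be sketched as follows; `_get_mobile_headers` is named in the diff, but the concrete header values and this standalone shape are illustrative assumptions:

```python
from urllib.parse import urlparse

# Assumed mobile User-Agent string for illustration only.
MOBILE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"
)

def get_mobile_headers(cookie: str = "") -> dict:
    # Minimal mobile-style headers, intended to avoid redirects to the PC page.
    headers = {"User-Agent": MOBILE_UA, "Accept": "text/html"}
    if cookie:
        headers["Cookie"] = cookie
    return headers

def pick_headers(url: str, pc_headers: dict, cookie: str = "") -> dict:
    # Mobile domains get the lightweight headers; others get a copy of
    # the defaults, with Cookie still applied when present.
    host = urlparse(url).hostname or ""
    if "chenzhongtech.com" in host or "gifshow.com" in host:
        return get_mobile_headers(cookie)
    headers = dict(pc_headers)
    if cookie:
        headers["Cookie"] = cookie
    return headers

mobile = pick_headers("https://v.m.chenzhongtech.com/fw/photo/xyz", {"User-Agent": "pc"})
print(mobile["User-Agent"] == MOBILE_UA)  # True
```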
Change response and share models to store params as plain dicts rather than pydantic models.
  • Update ResponseModel.params type from BaseModel to dict.
  • Update UrlResponse.params type from ShortUrl to dict.
source/model/response.py
source/model/share.py
Persist the new collection count metric in records.
  • Add collectionCount (收藏数量) INTEGER column to the RecordManager schema.
source/record/manager.py
Introduce a dedicated HTML scraper to derive collectionCount from chenzhongtech mobile pages using window.INIT_STATE.
  • Create CollectionCount class that builds v.m.chenzhongtech.com/fw/photo/{photoId} URLs and issues GET requests with retry and error handling utilities.
  • Parse HTML for window.INIT_STATE with a regex, JSON-decode it, and search common and recursive paths for collectionCount, defaulting to -1 on failure.
  • Store the resulting collection_count on the instance and have run() return it.
source/request/collection.py
Add a POST-based API client to retrieve commentCountV2 for a given photo via Kuaishou’s comment list endpoint.
  • Create CommentCount API subclass that posts photoId and pcursor to the comment list endpoint and sets note to 评论数.
  • Implement deal_response to read commentCountV2 from the JSON response into comment_count or set -1 on failure, with run() returning that value.
source/request/comment.py
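The deal_response contract described above boils down to a small parsing rule; the payload shape shown here is an assumption based on this summary:

```python
# Sketch: read commentCountV2 from the JSON payload, or fall back to
# -1 on any failure (missing key, bad type, non-numeric value).
def parse_comment_count(payload) -> int:
    try:
        return int(payload["commentCountV2"])
    except (TypeError, KeyError, ValueError):
        return -1

print(parse_comment_count({"commentCountV2": 42}))    # 42
print(parse_comment_count({"error_msg": "no data"}))  # -1
print(parse_comment_count(None))                      # -1
```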

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help


@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 4 issues, and left some high level feedback:

  • __validate_links now collects photo_ids in a set and then expands them, which makes the returned URL list order non-deterministic; if any callers rely on stable ordering, consider preserving input order (e.g., using a list + membership check instead of a set).
  • In HTMLExtractor.__convert_object you are still using str.lstrip() with WEB_KEYWORD/APP_KEYWORD, which removes any of those characters rather than a fixed prefix; it would be safer to strip a known prefix via startswith + slicing (or replace the exact prefix) to avoid corrupting JSON when the payload happens to start with similar characters.
  • _extract_params_detail uses url.hostname directly in an 'in' check; for consistency with __validate_links (where you guard with (url.hostname or "")), consider normalizing hostname the same way to avoid potential NoneType issues on edge-case URLs.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- __validate_links now collects photo_ids in a set and then expands them, which makes the returned URL list order non-deterministic; if any callers rely on stable ordering, consider preserving input order (e.g., using a list + membership check instead of a set).
- In HTMLExtractor.__convert_object you are still using str.lstrip() with WEB_KEYWORD/APP_KEYWORD, which removes any of those characters rather than a fixed prefix; it would be safer to strip a known prefix via startswith + slicing (or replace the exact prefix) to avoid corrupting JSON when the payload happens to start with similar characters.
- _extract_params_detail uses url.hostname directly in an 'in' check; for consistency with __validate_links (where you guard with (url.hostname or "")), consider normalizing hostname the same way to avoid potential NoneType issues on edge-case URLs.
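A minimal sketch of the order-preserving alternative suggested in the first point:

```python
# Keep first-seen order while deduplicating photo ids, instead of
# collecting them into a set (which loses input order).
def dedupe_preserving_order(ids):
    seen = set()
    ordered = []
    for photo_id in ids:
        if photo_id not in seen:
            seen.add(photo_id)
            ordered.append(photo_id)
    return ordered

print(dedupe_preserving_order(["b", "a", "b", "c", "a"]))  # ['b', 'a', 'c']
```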

## Individual Comments

### Comment 1
<location path="source/link/examiner.py" line_range="184-191" />
<code_context>
         url = urlparse(url)
         params = parse_qs(url.query)
-        if "chenzhongtech" in url.hostname:
+        if "chenzhongtech" in url.hostname or "gifshow" in url.hostname:
+            # chenzhongtech 和 gifshow 域名都是移动版页面
             return (
                 False,
                 params.get("userId", [""])[0],
-                params.get("photoId", [""])[0],
+                url.path.split("/")[-1],
             )
-        elif "short-video" in url.path or "fw/photo" in url.path:
+        elif "short-video" in url.path:
             return (
                 True,
</code_context>
<issue_to_address>
**issue (bug_risk):** Dropping the `"fw/photo" in url.path` branch could break handling for some Kuaishou URLs.

This used to treat both `"short-video"` and `"fw/photo"` paths as `web=True`. Now `"fw/photo"` URLs (e.g. `https://www.kuaishou.com/fw/photo/...`) will fall through and change behavior. If these URLs are still expected, this is a regression; consider either restoring the `"fw/photo"` condition here or adding a separate branch for it.
</issue_to_address>

### Comment 2
<location path="source/extract/extractor.py" line_range="76-85" />
<code_context>
         web: bool,
     ) -> dict:
         if web:
-            text = text.lstrip(self.WEB_KEYWORD if web else self.APP_KEYWORD)
+            text = text.lstrip(self.WEB_KEYWORD)
             text = text.replace(
                 ";(function(){var s;(s=document.currentScript||document.scripts["
                 "document.scripts.length-1]).parentNode.removeChild(s);}());",
                 "",
             )
+            return safe_load(text)
         else:
-            text = text[1] if (text := self.PHOTO_REGEX.search(text)) else ""
-        return safe_load(text)
+            # 移动版:解析整个 INIT_STATE 来同时获取 photo 和 counts
+            init_state_text = text.lstrip(self.APP_KEYWORD)
+            try:
+                init_state = json.loads(init_state_text)
</code_context>
<issue_to_address>
**issue (bug_risk):** Using `str.lstrip` with the keyword constants can corrupt the JSON payload.

Because `lstrip` treats its argument as a set of characters, `text.lstrip(self.WEB_KEYWORD)` / `text.lstrip(self.APP_KEYWORD)` will strip any leading combination of those characters until a different one appears, not just the exact keyword. This can remove valid leading JSON characters and break parsing. Use an explicit prefix check and slice instead, e.g.:

```python
if text.startswith(self.WEB_KEYWORD):
    text = text[len(self.WEB_KEYWORD):]
```

(and similarly for `APP_KEYWORD`).
</issue_to_address>
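The character-set behaviour of str.lstrip flagged here can be demonstrated directly (KEYWORD is a stand-in, not the project's actual constant):

```python
# str.lstrip treats its argument as a *character set*, not a prefix:
# any leading run of those characters is removed, which can eat into
# the payload when it starts with a character from the set.
KEYWORD = "window.INIT_STATE="

good = KEYWORD + '{"ok": 1}'
print(good.lstrip(KEYWORD))   # '{"ok": 1}' - survives: '{' is not in the set

risky = KEYWORD + "null"
print(risky.lstrip(KEYWORD))  # 'ull' - the payload's leading 'n' was eaten

def strip_prefix(text: str, prefix: str) -> str:
    # Safe alternative: remove the exact prefix only.
    return text[len(prefix):] if text.startswith(prefix) else text

print(strip_prefix(risky, KEYWORD))  # 'null'
```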

### Comment 3
<location path="source/link/examiner.py" line_range="57" />
<code_context>
-                self.C_COMPLETE_URL.finditer(urls),
-            )
-        ]
+        """从各种URL格式中提取photoId,并构造统一的四种格式URL."""
+        photo_ids = set()
+
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting shared helpers for photo ID extraction and URL normalization so that __validate_links orchestrates them instead of inlining all parsing logic.

You can reduce the new complexity in `__validate_links` by extracting the URL → `photo_id` logic into a shared helper and separating extraction from normalization. This also lets you reuse logic already present in `_extract_params_detail`.

For example:

```python
def _extract_photo_id(self, url_str: str) -> str | None:
    url = urlparse(url_str)
    params = parse_qs(url.query)

    host = url.hostname or ""
    path = url.path or ""

    if "chenzhongtech" in host:
        # chenzhongtech: photoId in query
        return params.get("photoId", [""])[0] or None
    if "gifshow" in host:
        # gifshow: photoId in path
        return path.split("/")[-1] or None
    if "short-video" in path or "fw/photo" in path:
        # kuaishou: photoId in path
        return path.split("/")[-1] or None
    return None
```

Then `__validate_links` can focus on:

```python
def __validate_links(self, urls: str) -> list[str]:
    photo_ids: set[str] = set()

    for pattern in (
        self.REDIRECT_URL,
        self.GIFSHOW_URL,
        self.PC_COMPLETE_URL,
        self.C_COMPLETE_URL,
    ):
        for match in pattern.finditer(urls):
            photo_id = self._extract_photo_id(match.group())
            if photo_id:
                photo_ids.add(photo_id)

    return [url for pid in photo_ids for url in self._build_normalized_urls(pid)]
```

And normalization is isolated:

```python
def _build_normalized_urls(self, photo_id: str) -> list[str]:
    return [
        f"https://m.gifshow.com/fw/photo/{photo_id}",
        f"https://v.m.chenzhongtech.com/fw/photo/{photo_id}",
        f"https://chenzhongtech.com/fw/photo/{photo_id}",
        f"https://1.gifshow.com/fw/photo/{photo_id}",
    ]
```

You can also reuse `_extract_photo_id` inside `_extract_params_detail` to avoid diverging domain rules:

```python
def _extract_params_detail(self, url: str) -> tuple[bool | None, str, str]:
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    host = parsed.hostname or ""
    path = parsed.path or ""

    if "chenzhongtech" in host or "gifshow" in host:
        return (
            False,
            params.get("userId", [""])[0],
            self._extract_photo_id(url) or "",
        )
    if "short-video" in path:
        return True, "", self._extract_photo_id(url) or ""

    self.console.error(f"Unknown url: {urlunparse(parsed)}")
    return None, "", ""
```

This keeps all behavior, but:

- Removes duplicated domain/path parsing.
- Splits responsibilities (collecting URLs, extracting `photo_id`, building normalized URLs).
- Makes `__validate_links` a simple orchestration of these helpers instead of a deeply nested method.
</issue_to_address>

### Comment 4
<location path="source/request/collection.py" line_range="54" />
<code_context>
+        html = response.text
+        self._extract_collection_count(html)
+
+    def _extract_collection_count(self, html: str) -> None:
+        """从 HTML 中提取 window.INIT_STATE 并解析 collectionCount"""
+        if not html:
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting the INIT_STATE parsing and collectionCount search into a single pure helper function so the CollectionCount class focuses only on HTTP/orchestration concerns.

You can reduce the cognitive load by separating the HTTP concern from the INIT_STATE parsing/search logic and collapsing the two search helpers into a single, focused function.

**1. Extract a pure helper for INIT_STATE → collectionCount**

Move the parsing + search into a standalone helper (module-level or shared util), so `CollectionCount` only worries about HTTP, logging, and wiring:

```python
def extract_collection_count_from_init_state(init_state: dict) -> int | None:
    # 常见路径 1: 根级别
    if "collectionCount" in init_state:
        return init_state["collectionCount"]

    # 常见路径 2: photo
    photo = init_state.get("photo")
    if isinstance(photo, dict) and "collectionCount" in photo:
        return photo["collectionCount"]

    # 常见路径 3: visionVideoDetail
    detail = init_state.get("visionVideoDetail")
    if isinstance(detail, dict) and "collectionCount" in detail:
        return detail["collectionCount"]

    # 备选: 通用递归查找
    def recursive_find(obj) -> int | None:
        if isinstance(obj, dict):
            if "collectionCount" in obj:
                return obj["collectionCount"]
            for v in obj.values():
                found = recursive_find(v)
                if found is not None:
                    return found
        elif isinstance(obj, list):
            for item in obj:
                found = recursive_find(item)
                if found is not None:
                    return found
        return None

    return recursive_find(init_state)
```

Then the class no longer needs `_find_collection_count` and `_recursive_find` methods:

```python
def _extract_collection_count(self, html: str) -> None:
    if not html:
        self.collection_count = -1
        return

    match = self.INIT_STATE_PATTERN.search(html)
    if not match:
        self.console.warning(_("未找到 window.INIT_STATE"))
        self.collection_count = -1
        return

    try:
        init_state = json.loads(match.group(1))
    except json.JSONDecodeError as e:
        self.console.error(_("解析 INIT_STATE JSON 失败: {error}").format(error=e))
        self.collection_count = -1
        return

    collection_count = extract_collection_count_from_init_state(init_state)
    self.collection_count = collection_count if collection_count is not None else -1
```

This keeps all existing behavior (regex, JSON parsing, prioritized paths, recursive fallback) while:

- Making `CollectionCount` responsible only for HTTP + orchestration.
- Encapsulating the “complex” tree search into a single, reusable, and testable function.
- Removing the need to understand two methods (`_find_collection_count` + `_recursive_find`) to follow the extraction logic.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +184 to +191

```diff
-        if "chenzhongtech" in url.hostname:
+        if "chenzhongtech" in url.hostname or "gifshow" in url.hostname:
+            # chenzhongtech 和 gifshow 域名都是移动版页面
             return (
                 False,
                 params.get("userId", [""])[0],
-                params.get("photoId", [""])[0],
+                url.path.split("/")[-1],
             )
-        elif "short-video" in url.path or "fw/photo" in url.path:
+        elif "short-video" in url.path:
```

issue (bug_risk): Dropping the `"fw/photo" in url.path` branch could break handling for some Kuaishou URLs.

This used to treat both `"short-video"` and `"fw/photo"` paths as `web=True`. Now `"fw/photo"` URLs (e.g. `https://www.kuaishou.com/fw/photo/...`) will fall through and change behavior. If these URLs are still expected, this is a regression; consider either restoring the `"fw/photo"` condition here or adding a separate branch for it.
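The regression is easy to confirm with `urlparse`; the photo ID below is made up for illustration:

```python
from urllib.parse import urlparse

# A Kuaishou fw/photo URL of the shape mentioned above (hypothetical photo ID):
url = urlparse("https://www.kuaishou.com/fw/photo/3xExampleId")

print("short-video" in url.path)  # False -> the remaining branch no longer matches
print("fw/photo" in url.path)     # True  -> only the removed branch matched it
```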

Comment on lines 76 to +85

```diff
         if web:
-            text = text.lstrip(self.WEB_KEYWORD if web else self.APP_KEYWORD)
+            text = text.lstrip(self.WEB_KEYWORD)
             text = text.replace(
                 ";(function(){var s;(s=document.currentScript||document.scripts["
                 "document.scripts.length-1]).parentNode.removeChild(s);}());",
                 "",
             )
+            return safe_load(text)
         else:
-            text = text[1] if (text := self.PHOTO_REGEX.search(text)) else ""
-        return safe_load(text)
+            # 移动版:解析整个 INIT_STATE 来同时获取 photo 和 counts
```

issue (bug_risk): Using `str.lstrip` with the keyword constants can corrupt the JSON payload.

Because `lstrip` treats its argument as a set of characters, `text.lstrip(self.WEB_KEYWORD)` / `text.lstrip(self.APP_KEYWORD)` will strip any leading combination of those characters until a different one appears, not just the exact keyword. This can remove valid leading JSON characters and break parsing. Use an explicit prefix check and slice instead, e.g.:

```python
if text.startswith(self.WEB_KEYWORD):
    text = text[len(self.WEB_KEYWORD):]
```

(and similarly for `APP_KEYWORD`).
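The failure mode is easy to reproduce in isolation; the keyword and payload below are made up for illustration and are not the extractor's real constants:

```python
KEYWORD = "window.INIT_STATE="  # illustrative prefix, not necessarily the real constant
payload = KEYWORD + 'INIT_DATA{"a": 1}'  # hypothetical body starting with characters from KEYWORD

# lstrip treats KEYWORD as a *set* of characters, so it keeps stripping into the body:
print(payload.lstrip(KEYWORD))        # DATA{"a": 1}  -- the leading 'INIT_' was eaten too

# Exact-prefix removal is safe (str.removeprefix, Python 3.9+):
print(payload.removeprefix(KEYWORD))  # INIT_DATA{"a": 1}
```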

```diff
-                self.C_COMPLETE_URL.finditer(urls),
-            )
-        ]
+        """从各种URL格式中提取photoId,并构造统一的四种格式URL."""
```

issue (complexity): Consider extracting shared helpers for photo ID extraction and URL normalization so that `__validate_links` orchestrates them instead of inlining all parsing logic.

You can reduce the new complexity in `__validate_links` by extracting the URL → `photo_id` logic into a shared helper and separating extraction from normalization. This also lets you reuse logic already present in `_extract_params_detail`.

For example:

```python
def _extract_photo_id(self, url_str: str) -> str | None:
    url = urlparse(url_str)
    params = parse_qs(url.query)

    host = url.hostname or ""
    path = url.path or ""

    if "chenzhongtech" in host:
        # chenzhongtech: photoId in query
        return params.get("photoId", [""])[0] or None
    if "gifshow" in host:
        # gifshow: photoId in path
        return path.split("/")[-1] or None
    if "short-video" in path or "fw/photo" in path:
        # kuaishou: photoId in path
        return path.split("/")[-1] or None
    return None
```

Then `__validate_links` can focus on:

```python
def __validate_links(self, urls: str) -> list[str]:
    photo_ids: set[str] = set()

    for pattern in (
        self.REDIRECT_URL,
        self.GIFSHOW_URL,
        self.PC_COMPLETE_URL,
        self.C_COMPLETE_URL,
    ):
        for match in pattern.finditer(urls):
            photo_id = self._extract_photo_id(match.group())
            if photo_id:
                photo_ids.add(photo_id)

    return [url for pid in photo_ids for url in self._build_normalized_urls(pid)]
```

And normalization is isolated:

```python
def _build_normalized_urls(self, photo_id: str) -> list[str]:
    return [
        f"https://m.gifshow.com/fw/photo/{photo_id}",
        f"https://v.m.chenzhongtech.com/fw/photo/{photo_id}",
        f"https://chenzhongtech.com/fw/photo/{photo_id}",
        f"https://1.gifshow.com/fw/photo/{photo_id}",
    ]
```

You can also reuse `_extract_photo_id` inside `_extract_params_detail` to avoid diverging domain rules:

```python
def _extract_params_detail(self, url: str) -> tuple[bool | None, str, str]:
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    host = parsed.hostname or ""
    path = parsed.path or ""

    if "chenzhongtech" in host or "gifshow" in host:
        return (
            False,
            params.get("userId", [""])[0],
            self._extract_photo_id(url) or "",
        )
    if "short-video" in path:
        return True, "", self._extract_photo_id(url) or ""

    self.console.error(f"Unknown url: {urlunparse(parsed)}")
    return None, "", ""
```

This keeps all behavior, but:

- Removes duplicated domain/path parsing.
- Splits responsibilities (collecting URLs, extracting `photo_id`, building normalized URLs).
- Makes `__validate_links` a simple orchestration of these helpers instead of a deeply nested method.
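If the suggestion is adopted, the routing rules become easy to unit-test once `_extract_photo_id` is lifted out as a plain function; it is restated below without the class context, and all photo IDs are made up:

```python
from urllib.parse import urlparse, parse_qs

def extract_photo_id(url_str: str):
    """Routing rules from the suggested _extract_photo_id, as a standalone function."""
    url = urlparse(url_str)
    params = parse_qs(url.query)
    host = url.hostname or ""
    path = url.path or ""

    if "chenzhongtech" in host:
        return params.get("photoId", [""])[0] or None
    if "gifshow" in host:
        return path.split("/")[-1] or None
    if "short-video" in path or "fw/photo" in path:
        return path.split("/")[-1] or None
    return None

print(extract_photo_id("https://v.m.chenzhongtech.com/fw/photo?photoId=3xAAA"))  # 3xAAA
print(extract_photo_id("https://m.gifshow.com/fw/photo/3xBBB"))                  # 3xBBB
print(extract_photo_id("https://www.kuaishou.com/short-video/3xCCC"))            # 3xCCC
print(extract_photo_id("https://example.com/other"))                             # None
```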

```diff
+        html = response.text
+        self._extract_collection_count(html)
+
+    def _extract_collection_count(self, html: str) -> None:
```

issue (complexity): Consider extracting the INIT_STATE parsing and collectionCount search into a single pure helper function so the `CollectionCount` class focuses only on HTTP/orchestration concerns.

You can reduce the cognitive load by separating the HTTP concern from the INIT_STATE parsing/search logic and collapsing the two search helpers into a single, focused function.

**1. Extract a pure helper for INIT_STATE → collectionCount**

Move the parsing + search into a standalone helper (module-level or shared util), so `CollectionCount` only worries about HTTP, logging, and wiring:

```python
def extract_collection_count_from_init_state(init_state: dict) -> int | None:
    # 常见路径 1: 根级别
    if "collectionCount" in init_state:
        return init_state["collectionCount"]

    # 常见路径 2: photo
    photo = init_state.get("photo")
    if isinstance(photo, dict) and "collectionCount" in photo:
        return photo["collectionCount"]

    # 常见路径 3: visionVideoDetail
    detail = init_state.get("visionVideoDetail")
    if isinstance(detail, dict) and "collectionCount" in detail:
        return detail["collectionCount"]

    # 备选: 通用递归查找
    def recursive_find(obj) -> int | None:
        if isinstance(obj, dict):
            if "collectionCount" in obj:
                return obj["collectionCount"]
            for v in obj.values():
                found = recursive_find(v)
                if found is not None:
                    return found
        elif isinstance(obj, list):
            for item in obj:
                found = recursive_find(item)
                if found is not None:
                    return found
        return None

    return recursive_find(init_state)
```

Then the class no longer needs `_find_collection_count` and `_recursive_find` methods:

```python
def _extract_collection_count(self, html: str) -> None:
    if not html:
        self.collection_count = -1
        return

    match = self.INIT_STATE_PATTERN.search(html)
    if not match:
        self.console.warning(_("未找到 window.INIT_STATE"))
        self.collection_count = -1
        return

    try:
        init_state = json.loads(match.group(1))
    except json.JSONDecodeError as e:
        self.console.error(_("解析 INIT_STATE JSON 失败: {error}").format(error=e))
        self.collection_count = -1
        return

    collection_count = extract_collection_count_from_init_state(init_state)
    self.collection_count = collection_count if collection_count is not None else -1
```

This keeps all existing behavior (regex, JSON parsing, prioritized paths, recursive fallback) while:

- Making `CollectionCount` responsible only for HTTP + orchestration.
- Encapsulating the "complex" tree search into a single, reusable, and testable function.
- Removing the need to understand two methods (`_find_collection_count` + `_recursive_find`) to follow the extraction logic.

@JoeanAmier
Owner

Contribution guide
