.. index:: api

This is work in progress and is likely to change in future versions.

``process.Processor([debug=False, preserve_remote_urls=True])``
    Creates a processor instance that you can feed HTML and URLs.

    The arguments:

    ``debug=False``
        Currently does nothing in particular.

    ``preserve_remote_urls=True``
        If you run a URL like ``http://www.example.org`` that references
        ``http://cdn.cloudware.com/foo.css``, which contains
        ``url(/background.png)``, then the CSS will be rewritten to become
        ``url(http://cdn.cloudware.com/background.png)``.

    ``phantomjs=None``
        If ``True``, defaults to ``phantomjs``. If a string, it is assumed to
        be the path to the ``phantomjs`` executable.

    ``phantomjs_options={}``
        Additional options/switches to pass to the ``phantomjs`` command. This
        has to be a dict. For example, ``{'script-encoding': 'latin1'}``
        becomes ``--script-encoding=latin1``.

    ``optimize_lookup=True``
        If true, builds a set of all ids and class names found in all
        processed documents and uses it to avoid some expensive CSS query
        searches.

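
The two transformations described for ``preserve_remote_urls`` and ``phantomjs_options`` can be illustrated with the standard library alone. This is a minimal sketch and not the library's actual implementation; the helper names ``absolutize_css_url`` and ``options_to_switches`` are hypothetical and exist only for this example.

```python
from urllib.parse import urljoin


def absolutize_css_url(stylesheet_url, reference):
    """Resolve a url(...) reference against the stylesheet that contains it."""
    # Hypothetical helper: a root-relative reference such as /background.png
    # resolves against the host of the stylesheet, not the host of the page.
    return urljoin(stylesheet_url, reference)


def options_to_switches(options):
    """Turn a dict of options into --key=value command-line switches."""
    # Hypothetical helper: mirrors the documented mapping for phantomjs_options.
    return ["--{}={}".format(key, value) for key, value in options.items()]


rewritten = absolutize_css_url("http://cdn.cloudware.com/foo.css", "/background.png")
switches = options_to_switches({"script-encoding": "latin1"})
print(rewritten)
print(switches)
```

The first call yields the CDN-hosted URL from the ``preserve_remote_urls`` description, and the second yields the single switch from the ``phantomjs_options`` description.
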
Instances of this class provide the following methods:

``process(*urls)``
    Downloads the HTML from the given URL(s) and expects each response to have a 200 status code. The content will be converted to a unicode string using UTF-8.

    Once all URLs have been processed, the CSS is analyzed.

``process_url(url)``
    Downloads the given URL and parses the HTML. This method downloads the HTML and then calls ``process_html()``.

``process_html(html, url)``
    If you for some reason already have the HTML, you can jump straight to this method. Note that you still need to provide the URL where you got the HTML from, so that it can be used to download any external CSS.

When calling ``process_url()`` or ``process_html()``, you have to call ``process()`` at the end without arguments in order to post-process the pages that were processed individually.

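
The calling order described above (feed pages individually, then call ``process()`` with no arguments) might look like the sketch below. It relies only on the documented method names. ``process_pages`` and ``RecordingProcessor`` are hypothetical names for this example, and the stub merely records calls so the snippet runs without network access; in real use you would pass an actual ``Processor`` instance.

```python
def process_pages(processor, pages):
    """Feed (html, url) pairs to a processor, then trigger post-processing.

    The processor is assumed to expose process_html(html, url) and process()
    as documented above.
    """
    for html, url in pages:
        processor.process_html(html, url)
    # Final step: call process() with no arguments to post-process
    # everything that was fed in individually.
    processor.process()
    return processor


class RecordingProcessor:
    """Hypothetical stand-in that only records calls (not the real Processor)."""

    def __init__(self):
        self.calls = []

    def process_html(self, html, url):
        self.calls.append(("process_html", url))

    def process(self, *urls):
        self.calls.append(("process", urls))


stub = process_pages(
    RecordingProcessor(),
    [("<html></html>", "http://www.example.org/")],
)
print(stub.calls)
```
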
The ``Processor`` instance will make two attributes available:

``instance.inlines``
    A list of ``InlineResult`` instances (see below).

``instance.links``
    A list of ``LinkResult`` instances (see below).

``InlineResult``
    This is where the results are stored for inline CSS. It holds the following attributes:

    ``line``
        The line in the original HTML on which the inline CSS starts.

    ``url``
        The URL this was found on.

    ``before``
        The inline CSS before it was analyzed.

    ``after``
        The new CSS, with the selectors presumed to be unused removed.

``LinkResult``
    This is where the results are stored for all referenced links to CSS files, i.e. from things like ``<link rel="stylesheet" href="foo.css">``. It contains the following attributes:

    ``href``
        The ``href`` attribute on the link tag, e.g. ``/static/main.css``.

    ``before``
        The CSS before it was analyzed.

    ``after``
        The new CSS, with the selectors presumed to be unused removed.

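
Once processing has finished, the ``inlines`` and ``links`` attributes can be walked to see how much CSS was removed. The sketch below uses only the attribute names documented above. ``css_savings`` is a hypothetical helper, and the ``SimpleNamespace`` objects are made-up stand-ins for a real ``Processor`` and its results so that the example runs on its own.

```python
from types import SimpleNamespace


def css_savings(processor):
    """Return (label, chars_before, chars_after) for each inline and link result."""
    rows = []
    for inline in processor.inlines:
        # Inline results carry the page URL and the line the CSS starts on.
        label = "{}:{}".format(inline.url, inline.line)
        rows.append((label, len(inline.before), len(inline.after)))
    for link in processor.links:
        # Link results are identified by the href of the <link> tag.
        rows.append((link.href, len(link.before), len(link.after)))
    return rows


# Made-up data standing in for a processor that has already run.
fake = SimpleNamespace(
    inlines=[
        SimpleNamespace(
            line=5,
            url="http://www.example.org/",
            before="a{color:red} .x{color:blue}",
            after="a{color:red}",
        )
    ],
    links=[
        SimpleNamespace(
            href="/static/main.css",
            before="p{margin:0} .y{margin:1px}",
            after="p{margin:0}",
        )
    ],
)

rows = css_savings(fake)
for label, before, after in rows:
    print(label, before, "->", after)
```
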