
id and class are simply attributes of an HTML tag such as div.

This is akin to href being an attribute of an a tag (e.g. <a href="http://www...)

The id attribute should be unique within a page.

The class attribute doesn't need to be unique.

<div class="you-are-classy"> is a div tag whose class attribute is assigned the value "you-are-classy".

xpath = '//table' directs to all table elements within the entire HTML document.

xpath = '/html/body/div[2]//table' directs to all table elements within the second child div of the
body.

xpath = '//span[@class="span-class"]' collects all span elements that have a class attribute
equal to "span-class". We could substitute any other tag, such as div, for span.

The asterisk (*) is a wildcard character. For example, xpath = '/html/body/*' will lead to all child
elements within the body, regardless of tag.

xpath = '//p[@class="class-1"]' directs to all paragraph elements with a class attribute equal to
"class-1".

xpath = '//*[@id="uid"]' uses the wildcard to match any element, whatever its tag, that has an id attribute
equal to "uid".

xpath = '//div[@id="uid"]/p[2]' goes a step further: it selects the second paragraph child of the div element that
has an id attribute equal to "uid".

Contains function: contains(@attr-name, "string-expr")

xpath = '//*[contains(@class,"class-1")]' chooses all elements whose class
attribute contains the substring "class-1". This may include "class-1", "class-1 class-2", or "class-12", for
example.

xpath = '/html/body/div/p[2]/@class' returns an attribute rather than an element. With xpath = '/html/body/div/p[2]' we are directed to the
paragraph element itself. By appending /@class, we get the value of its class attribute instead.
Selecting Selectors

An HTML string is passed to Scrapy's Selector class to create a Selector object. An example setup is as follows:

sel = Selector(text=html)

Here, "html" is a previously defined string. Think of "sel" as having selected the entire HTML
document.

The xpath method of a selector creates new selector objects. An example:

sel.xpath("//p") selects all paragraphs from our running example (the html text). The output is a
selector list of two selector objects, such as [<Selector xpath='//p' data='<p>Hello World!</p>'>,
<Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]

ps = sel.xpath('//p') stores that selector list, with all paragraph selector objects, in ps.

second_p = ps[1] picks the second selector object from that list (ps).

second_p.extract() applies the extract method to the single selector. A selector holds only one piece of
data, so the output may look like: '<p>Enjoy DataCamp!</p>'

Xpath Chaining

Starting from a Selector (named "sel" in this case), you can chain xpath calls to produce the same
result as a single longer expression. For example:

sel.xpath('/html/body/div[2]')

sel.xpath('/html').xpath('./body/div[2]')

sel.xpath('/html').xpath('./body').xpath('./div[2]')

These all produce the same results. Make certain to "glue" them together with the period that
comes before the forward slash of each subsequent link in the chain; the period makes the path relative to the current selection.

HTML text to Selector

We eventually need to get a webpage's HTML code. This can be accomplished with the requests.get
function. First import the Python library "requests" (import requests).

Create a string identifying the url. For example:

url = 'https://www.datacamp.com/courses/all'

Create the html variable by passing the url to requests.get. For example:

html = requests.get(url).content

The above puts the HTML contents of https://www.datacamp.com/courses/all into a variable called
"html". Note that .content holds the raw bytes of the response; .text holds the same body decoded to a string.

sel = Selector(text=html)

The above passes the content of the HTML source to the selector.

CSS LOCATOR

CSS Locator notation is similar to XPath, with a few substitutions:

/ replaced by > (except first character)

XPath: /html/body/div

CSS Locator: html > body > div

Each of the two examples above moves forward one generation on the html tree

// replaced by a blank space (except first character)

XPath: //div/span//p

CSS Locator: div > span p

From the two examples immediately above, the double forward slash in XPath is equivalent to a blank
space in CSS Locator notation. Both look forward to all future generations (descendants at any depth).

[N] replaced by :nth-of-type(N)

XPath: //div/p[2]

CSS Locator: div > p:nth-of-type(2)

The following two locators are equivalent:

xpath = '/html/body//div/p[2]'

css = 'html > body div > p:nth-of-type(2)'

To find an element by class in CSS, use a period. For example:

p.class-1 selects all paragraph elements belonging to class-1


To find an element by id, use a pound (#) sign. For example:

div#uid selects the div element with id equal to uid

Usage example:

css_locator = 'div#uid > p.class1'

The above line first navigates to the div element whose id is uid and then to its child paragraph
elements whose class is class1.

An alternative to the above: css_locator = '.class1'

This directs to all elements in the HTML document that belong to class1, whatever their tag.

It matches elements belonging to that class even if they also belong to other classes. For example: <p
class="class1"> … </p> and <div class="class1 class2"> … </div>

This is different from xpath = '//*[@class="class1"]', which forces an exact match of the whole attribute value. It is also different from
xpath = '//*[contains(@class,"class1")]', which matches any class attribute containing the substring "class1" (so "class12" would match too).

To find all the children of an element whose id is equal to “uid”:

css_locator = "#uid > *"

Remember the star: "#uid" alone selects the element itself, and "> *" is what moves on to its child elements.

To select an attribute of an element:

XPath: <xpath-to-element>/@attr-name

xpath = '//div[@id="uid"]/a/@href'

CSS: <css-to-element>::attr(attr-name)

css_locator = 'div#uid > a::attr(href)'

The double colon followed by attr(...) selects the desired attribute (the one named between the parentheses); a web address
in this case. Note that ::attr() is an extension provided by Scrapy's selectors, not part of standard CSS.
Text Extraction

In some instances, you may want to extract text from an element. For example:

<p id="p-example">

Hello world!

Try <a href="http://www.datacamp.com">DataCamp</a> today!

</p>

We may just want to extract the text. To do this, we navigate to the paragraph whose id is equal to
"p-example" and append text():

sel.xpath('//p[@id="p-example"]/text()').extract()

The above line takes a Scrapy selector with an xpath that navigates to what we need. With a single slash before text(), only the text directly inside the paragraph is extracted:
['\n Hello world!\n Try ', ' today!\n']

sel.xpath('//p[@id="p-example"]//text()').extract()

With a double slash before text(), the text of all descendant elements is included as well: ['\n Hello world!\n Try ', 'DataCamp', ' today!\n']
