# Website scraping

FreshRSS has a built-in [Web scraping](https://en.wikipedia.org/wiki/Web_scraping) engine that can generate a feed from websites that do not publish an RSS/Atom feed.

## How to add

Go to “Subscription Management”, where a new feed can be added.
Change the “Type of feed source” to one of:
- “HTML + XPath (Web scraping)”
- “JSON Feed” (see [`jsonfeed.org`](https://www.jsonfeed.org/))
- “JSON (Dotted paths)”

Additional text boxes for configuring the Web scraping will then appear.

For HTML + XPath, [XPath 1.0](https://www.w3.org/TR/xpath-10/) is used as the traversal language.

### Get the XPath path

Firefox: the built-in “Inspect” tool can help create a valid XPath expression.
Select the node in the HTML, right-click it, and choose “Copy” and then “XPath”.
The XPath is now stored in your clipboard.
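
To get a feel for how XPath expressions select nodes, the following minimal sketch may help. It uses Python's standard library, which implements only a subset of XPath (FreshRSS itself uses XPath 1.0), and a made-up, well-formed page fragment. The markup, the class name `news` and the split into one expression for the items plus expressions relative to each item are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; it is well-formed so the XML parser accepts it.
PAGE = """
<html>
  <body>
    <div class="news">
      <a href="https://example.net/1">Some news item</a>
    </div>
    <div class="news">
      <a href="https://example.net/2">Some other news item</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# One expression selects the repeating item nodes...
items = root.findall(".//div[@class='news']")

# ...and further expressions, relative to each item, pick out its fields.
entries = [(item.find("a").text, item.find("a").get("href")) for item in items]

for title, link in entries:
    print(title, link)
```

An expression copied from the browser usually describes one specific node, so it often needs to be generalised (for example by matching on a class instead of a position) before it selects every item.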

### Get the JSON dotted path

Suppose the JSON to which you are subscribing (or which you are scraping) looks like this:

```json
{
	"data": {
		"items": [
			{
				"meta": {"title": "Some news item"},
				"content": "Content of the news",
				"links": ["https://example.net/1", "https://example.org/1"]
			},
			{
				"meta": {"title": "Some other news item"},
				"content": "Yet more content",
				"links": ["https://example.net/2", "https://example.org/2"]
			}
		]
	}
}
```

Both *dot notation* and *bracket notation* (numeric indexes only) are supported.

In the example above, the items are under `data.items`; within each item, the title is `meta.title`,
and the link would be `links[1]`.

The syntax is similar to the JavaScript way of accessing JSON: `object.object.array[2].property`.
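
As a rough illustration of how such a path walks through the data, here is a minimal Python sketch. It is not the FreshRSS implementation: the `resolve` helper is hypothetical, and it assumes zero-based indexes as in JavaScript, so `links[1]` is the second link.

```python
import json
import re

def resolve(data, path):
    """Follow a dotted path such as "data.items" or "links[1]" through parsed JSON."""
    current = data
    # Each token is either a key name or a numeric index in square brackets.
    for key, index in re.findall(r"([^.\[\]]+)|\[(\d+)\]", path):
        current = current[int(index)] if index else current[key]
    return current

# The example document from above.
DOC = json.loads("""
{
  "data": {
    "items": [
      {
        "meta": {"title": "Some news item"},
        "content": "Content of the news",
        "links": ["https://example.net/1", "https://example.org/1"]
      },
      {
        "meta": {"title": "Some other news item"},
        "content": "Yet more content",
        "links": ["https://example.net/2", "https://example.org/2"]
      }
    ]
  }
}
""")

# Paths for the fields are evaluated relative to each item.
items = resolve(DOC, "data.items")
titles = [resolve(item, "meta.title") for item in items]
second_links = [resolve(item, "links[1]") for item in items]

print(titles)
print(second_links)
```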

String concatenation is supported with a syntax like `meta.title & " some text"`, using either single or double quotes.
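
The effect of such an expression can be sketched in Python as follows. This is only an illustration of the idea, not the parser FreshRSS uses: the `lookup` and `evaluate` helpers are hypothetical, and the sketch assumes that quoted operands are literal text and that all other operands are plain dotted paths.

```python
import re

def lookup(item, path):
    """Resolve a plain dotted path (no brackets) inside one item."""
    for key in path.split("."):
        item = item[key]
    return item

def evaluate(item, expression):
    """Concatenate the operands of an expression such as: meta.title & " some text"."""
    parts = []
    # Operands are separated by "&"; quoted operands are literals, the rest are paths.
    for operand in re.findall(r"""'[^']*'|"[^"]*"|[^&\s]+""", expression):
        if operand[0] in "'\"":
            parts.append(operand[1:-1])  # strip the surrounding quotes
        else:
            parts.append(str(lookup(item, operand)))
    return "".join(parts)

item = {"meta": {"title": "Some news item"}, "content": "Content of the news"}
result = evaluate(item, 'meta.title & " some text"')
print(result)
```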

## Tips & tricks

- [Timezone of date](https://github.com/FreshRSS/FreshRSS/discussions/5483)

## Recommended external manuals

- [XPath Scraping with FreshRSS, by Dan Q](https://danq.me/2022/09/27/freshrss-xpath/) (September 2022)