diff --git a/specs/crawler/common/parameters.yml b/specs/crawler/common/parameters.yml
index 940efb097b..4bf171bbdc 100644
--- a/specs/crawler/common/parameters.yml
+++ b/specs/crawler/common/parameters.yml
@@ -17,7 +17,7 @@ TaskIdParameter:
 CrawlerVersionParameter:
   name: version
   in: path
-  description: The version of the targeted Crawler revision.
+  description: This crawler's version number.
   required: true
   schema:
     type: integer
@@ -88,7 +88,7 @@ UrlsCrawledGroup:
       description: Number of URLs with this status.
     readable:
       type: string
-      description: Readable representation of the reason for the status message.
+      description: Reason for this status.
   example:
     status: SKIPPED
     reason: forbidden_by_robotstxt
@@ -98,7 +98,10 @@ UrlsCrawledGroup:
 urlsCrawledGroupStatus:
   type: string
-  description: Status of crawling these URLs.
+  description: |
+    Crawled URL status.
+
+    For more information, see [Troubleshooting by crawl status](https://www.algolia.com/doc/tools/crawler/troubleshooting/crawl-status/).
   enum:
     - DONE
     - SKIPPED
@@ -106,7 +109,10 @@ urlsCrawledGroupStatus:
 urlsCrawledGroupCategory:
   type: string
-  description: Step where the status information was generated.
+  description: |
+    Step where the status information was generated.
+
+    For more information, see [Troubleshooting by crawl status](https://www.algolia.com/doc/tools/crawler/troubleshooting/crawl-status/).
   enum:
     - fetch
     - extraction
diff --git a/specs/crawler/common/schemas/action.yml b/specs/crawler/common/schemas/action.yml
index d1d7449bc1..353bd3fc14 100644
--- a/specs/crawler/common/schemas/action.yml
+++ b/specs/crawler/common/schemas/action.yml
@@ -1,22 +1,29 @@
 Action:
   type: object
-  description: Instructions about how to process crawled URLs.
+  description: |
+    How to process crawled URLs.
+
+    Each action defines:
+
+    - The targeted subset of URLs it processes.
+    - What information to extract from the web pages.
+    - The Algolia indices where the extracted records will be stored.
+
+    If a single web page matches several actions,
+    one record is generated for each action.
   properties:
     autoGenerateObjectIDs:
       type: boolean
-      description: |
-        Whether to generate `objectID` properties for each extracted record.
-
-        If false, you must manually add `objectID` properties to the extracted records.
+      description: Whether to generate an `objectID` for records that don't have one.
       default: true
     cache:
      $ref: '#/cache'
    discoveryPatterns:
      type: array
      description: |
-        Patterns for additional pages to visit to find links without extracting records.
+        Indicates additional pages that the crawler should visit.
 
-        The crawler looks for matching pages and crawls them for links, but doesn't extract records from the (intermediate) pages themselves.
+        For more information, see the [`discoveryPatterns` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/discovery-patterns/).
      items:
        $ref: '#/urlPattern'
    fileTypesToMatch:
@@ -24,15 +31,7 @@ Action:
      description: |
        File types for crawling non-HTML documents.
 
-        Non-HTML documents are first converted to HTML by an [Apache Tika](https://tika.apache.org/) server.
-
-        Crawling non-HTML documents has the following limitations:
-
-        - It's slower than crawling HTML documents.
-        - PDFs must include the used fonts.
-        - The produced HTML pages might not be semantic. This makes achieving good relevance more difficult.
-        - Natural language detection isn't supported.
-        - Extracted metadata might vary between files produced by different programs and versions.
+        For more information, see [Extract data from non-HTML documents](https://www.algolia.com/doc/tools/crawler/extracting-data/non-html-documents/).
      maxItems: 100
      items:
        $ref: '#/fileTypes'
@@ -47,8 +46,8 @@ Action:
      type: string
      maxLength: 256
      description: |
-        Index name where to store the extracted records from this action.
-        The name is combined with the prefix you specified in the `indexPrefix` option.
+        Reference to the index used to store the action's extracted records.
+        `indexName` is combined with the prefix you specified in `indexPrefix`.
      example: algolia_website
    name:
      type: string
@@ -57,7 +56,10 @@ Action:
      $ref: '#/pathAliases'
    pathsToMatch:
      type: array
-      description: Patterns for URLs to which this action should apply.
+      description: |
+        URLs to which this action should apply.
+
+        Uses [micromatch](https://github.com/micromatch/micromatch) for negation, wildcards, and more.
      minItems: 1
      maxItems: 100
      items:
@@ -72,9 +74,11 @@ Action:
        source:
          type: string
          description: |
-            JavaScript function (as a string) for extracting information from a crawled page and transforming it into Algolia records for indexing.
-            The [Crawler dashboard](https://crawler.algolia.com/admin) has an editor with autocomplete and validation,
-            which makes editing the `recordExtractor` property easier.
+            A JavaScript function (as a string) that returns one or more Algolia records for each crawled page.
+
+            For details, consult the [`recordExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/actions/#parameter-param-recordextractor).
+
+            The Crawler has an [editor](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#the-editor) with autocomplete and validation to help you update the `recordExtractor` property.
    selectorsToMatch:
      type: array
      description: |
@@ -107,13 +111,8 @@ fileTypes:
  type: string
  description: |
    Supported file type for indexing non-HTML documents.
-    A single type can match multiple file formats:
-
-    - `doc`: `.doc`, `.docx`
-    - `ppt`: `.ppt`, `.pptx`
-    - `xls`: `.xls`, `.xlsx`
-
-    The `email` type supports crawling Microsoft Outlook mail message (`.msg`) documents.
+
+    For more information, see [Extract data from non-HTML documents](https://www.algolia.com/doc/tools/crawler/extracting-data/non-html-documents/).
  enum:
    - doc
    - email
@@ -129,7 +128,8 @@ urlPattern:
  type: string
  description: |
    Pattern for matching URLs.
-    Wildcards and negations are supported via the [micromatch](https://github.com/micromatch/micromatch) library.
+
+    Uses [micromatch](https://github.com/micromatch/micromatch) for negation, wildcards, and more.
  example: https://www.algolia.com/**
 
 hostnameAliases:
@@ -137,11 +137,10 @@ hostnameAliases:
  example:
    'dev.example.com': 'example.com'
  description: |
-    Key-value pairs to replace matching hostnames found in a sitemap, on a page, in canonical links, or redirects.
+    Key-value pairs to replace matching hostnames found in a sitemap,
+    on a page, in canonical links, or redirects.
 
-    The crawler continues from the _transformed_ URLs.
-    The mapping doesn't transform URLs listed in the `startUrls`, `siteMaps`, `pathsToMatch`, and other settings.
-    The mapping also doesn't replace hostnames found in extracted text.
+    For more information, see the [`hostnameAliases` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/hostname-aliases/).
  additionalProperties:
    type: string
    description: Hostname that should be added in the records.
@@ -154,10 +153,13 @@ pathAliases:
    '/foo': '/bar'
  description: |
    Key-value pairs to replace matching paths with new values.
+
+    It doesn't replace:
+
+    - URLs in the `startUrls`, `siteMaps`, `pathsToMatch`, and other settings.
+    - Paths found in extracted text.
 
    The crawl continues from the _transformed_ URLs.
-    The mapping doesn't transform URLs listed in the `startUrls`, `siteMaps`, `pathsToMatch`, and other settings.
-    The mapping also doesn't replace paths found in extracted text.
  additionalProperties:
    type: object
    description: Hostname for which matching paths should be replaced.
@@ -172,17 +174,7 @@ cache:
  description: |
    Whether the crawler should cache crawled pages.
 
-    With caching, the crawler only crawls changed pages.
-    To detect changed pages, the crawler makes [HTTP conditional requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Conditional_requests) to your pages.
-    The crawler uses the `ETag` and `Last-Modified` response headers returned by your web server during the previous crawl.
-    The crawler sends this information in the `If-None-Match` and `If-Modified-Since` request headers.
-
-    If your web server responds with `304 Not Modified` to the conditional request, the crawler reuses the records from the previous crawl.
-
-    Caching is ignored in these cases:
-
-    - If your crawler configuration changed between two crawls.
-    - If `externalData` changed between two crawls.
+    For more information, see the [`cache` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/cache/).
  properties:
    enabled:
      type: boolean
diff --git a/specs/crawler/common/schemas/configuration.yml b/specs/crawler/common/schemas/configuration.yml
index fd529e4d1d..a180ed4c2d 100644
--- a/specs/crawler/common/schemas/configuration.yml
+++ b/specs/crawler/common/schemas/configuration.yml
@@ -8,17 +8,7 @@ Configuration:
  properties:
    actions:
      type: array
-      description: |
-        Instructions about how to process crawled URLs.
-
-        Each action defines:
-
-        - The targeted subset of URLs it processes.
-        - What information to extract from the web pages.
-        - The Algolia indices where the extracted records will be stored.
-
-        A single web page can match multiple actions.
-        In this case, the crawler produces one record for each matched action.
+      description: A list of actions.
      minItems: 1
      maxItems: 30
      items:
@@ -28,11 +18,7 @@ Configuration:
      description: |
        Algolia API key for indexing the records.
 
-        The API key must have the following access control list (ACL) permissions:
-        `search`, `browse`, `listIndexes`, `addObject`, `deleteObject`, `deleteIndex`, `settings`, `editSettings`.
-        The API key must not be the admin API key of the application.
-        The API key must have access to create the indices that the crawler will use.
-        For example, if `indexPrefix` is `crawler_`, the API key must have access to all `crawler_*` indices.
+        For more information, see the [`apiKey` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/api-key/).
    appId:
      $ref: '../parameters.yml#/applicationID'
    exclusionPatterns:
@@ -46,9 +32,9 @@ Configuration:
      items:
        type: string
        description: |
-          Pattern for matching URLs to exclude from crawling.
+          URLs to exclude from crawling.
 
-          The pattern support globs and wildcard matching with [micromark](https://github.com/micromatch/micromatch).
+          Uses [micromatch](https://github.com/micromatch/micromatch) for negation, wildcards, and more.
    externalData:
      type: array
      description: |
@@ -66,8 +52,7 @@ Configuration:
      description: |
        URLs from where to start crawling.
 
-        These are the same as `startUrls`.
-        URLs you [crawl manually](#tag/actions/operation/testUrl) can be added to `extraUrls`.
+        For more information, see the [`extraUrls` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/extra-urls/).
      items:
        type: string
    ignoreCanonicalTo:
@@ -76,19 +61,22 @@ Configuration:
      type: boolean
      description: |
        Whether to ignore the `nofollow` meta tag or link attribute.
-        If true, links with the `rel="nofollow"` attribute or links on pages with the `nofollow` robots meta tag will be crawled.
+
+        For more information, see the [`ignoreNoFollowTo` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/ignore-no-follow-to/).
    ignoreNoIndex:
      type: boolean
      description: |
        Whether to ignore the `noindex` robots meta tag.
-        If true, pages with this meta tag will be crawled.
+        If `true`, pages with this meta tag _will_ be crawled.
    ignoreQueryParams:
      type: array
      description: |
        Query parameters to ignore while crawling.
 
        All URLs with the matching query parameters will be treated as identical.
-        This prevents indexing duplicated URLs, that just differ by their query parameters.
+        This prevents indexing URLs that differ only by their query parameters.
+
+        You can use wildcard characters to pattern match.
      maxItems: 9999
      example:
        - ref
@@ -107,28 +95,24 @@ Configuration:
    initialIndexSettings:
      type: object
      description: |
-        Initial index settings, one settings object per index.
+        Crawler index settings.
 
-        This setting is only applied when the index is first created.
-        Settings are not re-applied.
-        This prevents overriding any settings changes after the index was created.
+        For more information, see the [`initialIndexSettings` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/initial-index-settings/).
      additionalProperties:
        $ref: '../../../common/schemas/IndexSettings.yml#/indexSettings'
        x-additionalPropertiesName: indexName
    linkExtractor:
      title: linkExtractor
      type: object
-      description: Function for extracting URLs for links found on crawled pages.
+      description: |
+        Function for extracting URLs from links on crawled pages.
+
+        For more information, see the [`linkExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/link-extractor/).
      properties:
        __type:
          $ref: './action.yml#/configurationRecordExtractorType'
        source:
          type: string
-          description: |
-            JavaScript function (as a string) for extracting URLs for links found on crawled pages.
-            By default, all URLs that comply with the `pathsToMatch`, `fileTypesToMatch`, and `exclusions` settings are added to the crawl.
-            The [Crawler dashboard](https://crawler.algolia.com/admin) has an editor with autocomplete and validation,
-            which makes editing the `linkExtractor` property easier.
          example: |
            ({ $, url, defaultExtractor }) => {
              if (/example.com\/doc\//.test(url.href)) {
@@ -225,41 +209,35 @@ ignoreCanonicalTo:
 renderJavaScript:
  description: |
-    Crawl JavaScript-rendered pages by rendering them with a headless browser.
+    Crawl JavaScript-rendered pages with a headless browser.
 
-    Rendering JavaScript-based pages is slower than crawling regular HTML pages.
+    For more information, see the [`renderJavaScript` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/render-java-script/).
  oneOf:
    - type: boolean
-      description: Whether to render all pages with a headless browser.
+      description: Whether to render all pages.
    - type: array
-      description: URLs or patterns which to render with a headless browser.
+      description: URLs or URL patterns to render.
      items:
        type: string
-        description: |
-          URL or pattern for matching URLs which to render with a headless browser.
-
-          The pattern support globs and wildcard matching with [micromark](https://github.com/micromatch/micromatch).
+        description: URL or URL pattern to render.
      example: https://www.example.com
    - title: headlessBrowserConfig
      type: object
-      description: Configuration for rendering HTML with a headless browser.
+      description: Configuration for rendering HTML.
      properties:
        enabled:
          type: boolean
-          description: Whether to render matching URLs with a headless browser.
+          description: Whether to render matching URLs.
        patterns:
          type: array
-          description: |
-            URLs or patterns for matching URLs that should be rendered with a headless browser.
-
-            The pattern support globs and wildcard matching with [micromark](https://github.com/micromatch/micromatch).
+          description: URLs or URL patterns to render.
          items:
            type: string
        adBlock:
          type: boolean
          description: |
            Whether to turn on the built-in adblocker.
-            This blocks most ads and tracking scripts but can break some websites.
+            This blocks most ads and tracking scripts but can break some sites.
        waitTime:
          $ref: '#/waitTime'
      required:
@@ -445,42 +423,32 @@ extraParameters:
 safetyChecks:
  type: object
-  description: Checks to ensure the crawl was successful.
+  description: |
+    Checks to ensure the crawl was successful.
+
+    For more information, see the [Safety checks](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#safety-checks) documentation.
  properties:
    beforeIndexPublishing:
      $ref: '#/beforeIndexPublishing'
 
 beforeIndexPublishing:
  type: object
-  description: These checks are triggered after the crawl finishes but before the records are added to the Algolia index.
+  description: Checks triggered after the crawl finishes but before the records are added to the Algolia index.
  properties:
    maxLostRecordsPercentage:
      type: number
-      description: |
-        Maximum difference in percent between the numbers of records between crawls.
-
-        If the current crawl results in fewer than `1 - maxLostPercentage` records compared to the previous crawl,
-        the current crawling task is stopped with a `SafeReindexingError`.
-        The crawler will be blocked until you cancel the blocking task.
+      description: Maximum allowed difference, as a percentage, in the number of records between two consecutive crawls.
      minimum: 1
      maximum: 100
      default: 10
    maxFailedUrls:
      type: number
-      description: |
-        Stops the crawler if a specified number of pages fail to crawl.
-        If undefined, the crawler won't stop if it encounters such errors.
+      description: Stops the crawler if a specified number of pages fail to crawl.
 
 schedule:
  type: string
  description: |
-    Schedule for running the crawl, expressed in [Later.js](https://bunkat.github.io/later/) syntax.
-    If omitted, you must start crawls manually.
-
-    - The interval between two scheduled crawls must be at least 24 hours.
-    - Times are in UTC.
-    - Minutes must be explicit: `at 3:00 pm` not `at 3 pm`.
-    - Everyday is `every 1 day`.
-    - Midnight is `at 12:00 pm`.
-    - If you omit the time, a crawl might start any time after midnight UTC.
+    Schedule for running the crawl.
+
+    For more information, see the [`schedule` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/schedule/).
  example: every weekday at 12:00 pm
diff --git a/specs/crawler/common/schemas/getCrawlerResponse.yml b/specs/crawler/common/schemas/getCrawlerResponse.yml
index b1591ecbf7..1a160ff6e4 100644
--- a/specs/crawler/common/schemas/getCrawlerResponse.yml
+++ b/specs/crawler/common/schemas/getCrawlerResponse.yml
@@ -19,13 +19,14 @@ BaseResponse:
      description: Whether this crawler is active.
    reindexing:
      type: boolean
-      description: Whether this crawler is currently completely reindexing your content.
+      description: Whether this crawler is completely reindexing your content.
    blocked:
      type: boolean
      description: |
        Whether this crawler is currently blocked.
 
-        If true, you need to unblock this crawler in the [Crawler dashboard](https://crawler.algolia.com/admin/) or by [cancelling the blocking task](#tag/tasks/operation/cancelBlockingAction).
+        If `true`, you can unblock it from the [Crawler page](https://dashboard.algolia.com/crawler) in the Algolia dashboard
+        or by [cancelling the blocking task](#tag/tasks/operation/cancelBlockingAction).
    blockingError:
      type: string
      description: Reason why the crawler is blocked.
diff --git a/specs/crawler/paths/crawler.yml b/specs/crawler/paths/crawler.yml
index 8bb58662e3..b345baba3c 100644
--- a/specs/crawler/paths/crawler.yml
+++ b/specs/crawler/paths/crawler.yml
@@ -27,17 +27,15 @@ get:
    $ref: '../common/schemas/responses.yml#/NoRightsOnCrawler'
 patch:
  operationId: patchCrawler
-  summary: Update crawler
+  summary: Change crawler name
  description: |
-    Updates the crawler, either its name or its configuration.
+    Changes the crawler's name.
 
-    Use this endpoint to update the crawler's name.
-    While you can use this endpoint to completely replace the crawler's configuration,
-    you should [update the crawler configuration](#tag/config/operation/patchConfig) instead.
+    While you _could_ use this endpoint to replace the crawler configuration,
+    you should [update it](#tag/config/operation/patchConfig) instead, since configuration changes made here aren't [versioned](#tag/config/operation/listConfigVersions).
 
    If you replace the configuration, you must provide the full configuration,
-    including the settings you want to keep.
-    Configuration changes from this endpoint aren't [versioned](#tag/config/operation/listConfigVersions).
+    including any settings you want to keep.
  tags:
    - crawlers
  parameters:
@@ -62,3 +60,20 @@ patch:
      $ref: '../common/schemas/responses.yml#/MissingAuthorization'
    '403':
      $ref: '../common/schemas/responses.yml#/NoRightsOnCrawler'
+delete:
+  operationId: deleteCrawler
+  summary: Delete a crawler
+  description: Deletes the specified crawler.
+  tags:
+    - crawlers
+  parameters:
+    - $ref: '../common/parameters.yml#/CrawlerIdParameter'
+  responses:
+    '200':
+      $ref: '../common/schemas/responses.yml#/ActionAcknowledged'
+    '400':
+      $ref: '../common/schemas/responses.yml#/InvalidRequest'
+    '401':
+      $ref: '../common/schemas/responses.yml#/MissingAuthorization'
+    '403':
+      $ref: '../common/schemas/responses.yml#/NoRightsOnCrawler'
diff --git a/specs/crawler/paths/crawlerConfigVersions.yml b/specs/crawler/paths/crawlerConfigVersions.yml
index 6f2a5df413..197c119570 100644
--- a/specs/crawler/paths/crawlerConfigVersions.yml
+++ b/specs/crawler/paths/crawlerConfigVersions.yml
@@ -3,7 +3,7 @@ get:
  summary: List configuration versions
  description: |
    Lists previous versions of the specified crawler's configuration, including who authored the change.
 
-    Every time you [update the configuration](#tag/config/operation/patchConfig) of a crawler,
+    Every time you update a crawler's [configuration](#tag/config/operation/patchConfig),
    a new version is added.
  tags:
    - config
diff --git a/specs/crawler/paths/crawlerCrawl.yml b/specs/crawler/paths/crawlerCrawl.yml
index 80ad0ab816..03aec4715c 100644
--- a/specs/crawler/paths/crawlerCrawl.yml
+++ b/specs/crawler/paths/crawlerCrawl.yml
@@ -3,7 +3,7 @@ post:
  summary: Crawl URLs
  description: |
    Crawls the specified URLs, extracts records from them, and adds them to the index.
-    If a crawl is currently running (the crawler's `reindexing` property is true),
+    If a crawl is currently running (the crawler's `reindexing` property is `true`),
    the records are added to a temporary index.
  tags:
    - actions
diff --git a/specs/crawler/paths/crawlerTest.yml b/specs/crawler/paths/crawlerTest.yml
index e57913b597..42b2a6eaa8 100644
--- a/specs/crawler/paths/crawlerTest.yml
+++ b/specs/crawler/paths/crawlerTest.yml
@@ -1,10 +1,10 @@
 post:
  operationId: testUrl
-  summary: Test crawling a URL
+  summary: Test crawl a URL
  description: |
    Tests a URL with the crawler's configuration and shows the extracted records.
 
-    You can override parts of the configuration to test your changes before updating the configuration.
+    You can test configuration changes by overriding specific parts before updating the full configuration.
  tags:
    - actions
  parameters:
@@ -108,6 +108,7 @@ post:
                type: object
                description: |
                  External data associated with the tested URL.
+                  External data is refreshed automatically at the beginning of the crawl.
                example:
                  externalData1: {data1: 'val1', data2: 'val2'}
diff --git a/specs/crawler/spec.yml b/specs/crawler/spec.yml
index f59e25ec3a..b723ca923a 100644
--- a/specs/crawler/spec.yml
+++ b/specs/crawler/spec.yml
@@ -25,8 +25,8 @@ info:
    - ``. The Crawler user ID.
    - ``. The Crawler API key.
 
-    You can find both in the [Crawler dashboard](https://crawler.algolia.com/admin/settings/).
-    The Crawler dashboard and API key are different from the regular Algolia dashboard and API keys.
+    You can find both on the [Crawler settings](https://dashboard.algolia.com/crawler/settings) page in the Algolia dashboard.
+    The Crawler credentials are different from your regular Algolia credentials.
 
    ## Request format
 
@@ -45,7 +45,8 @@ info:
    The Crawler API returns JSON responses. Since JSON doesn't guarantee any specific ordering, don't rely on the order of attributes in the API response.
 
-    Successful responses return a `2xx` status. Client errors return a `4xx` status. Server errors are indicated by a `5xx` status.
+    Successful responses return a `2xx` status. Client errors return a `4xx` status.
+    Server errors are indicated by a `5xx` status.
 
    Error responses have a `message` property with more information.
 
    ## Version
@@ -66,13 +67,15 @@ tags:
  - name: actions
    x-displayName: Actions
    description: |
-      Actions change the state of crawlers, such as pausing and unpausing crawl schedules or testing the crawler with specific URLs.
+      Actions change the state of crawlers, such as pausing and unpausing schedules or testing the crawler with specific URLs.
  - name: config
    x-displayName: Configuration
    description: |
      In the Crawler configuration, you specify which URLs to crawl, when to crawl,
      how to extract records from the crawl, and where to index the extracted records.
+      The configuration is versioned, so you can always restore a previous version.
-      It's easiest to make configuration changes in the [Crawler dashboard](https://crawler.algolia.com/admin/).
+
+      It's easiest to make configuration changes on the [Crawler page](https://dashboard.algolia.com/crawler) in the Algolia dashboard.
      The editor has autocomplete and built-in validation so you can try your configuration changes before committing them.
  - name: crawlers
    x-displayName: Crawler
@@ -84,7 +87,7 @@ tags:
    description: List registered domains.
  - name: tasks
    x-displayName: Tasks
-    description: Tasks
+    description: Task operations
 paths:
  /1/crawlers:
    $ref: 'paths/crawlers.yml'
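For context on the `Action` and `recordExtractor` description changes above, the sketch below shows roughly what one crawler action looks like when written in the Crawler editor, in the same JavaScript style as the `linkExtractor` example in `configuration.yml`. It's a minimal, illustrative sketch only: the index name, URL pattern, and extracted attributes are placeholders, not values taken from this spec, and in the raw API payload `recordExtractor` is sent as an object whose `source` string contains a function like this one.

```js
// Illustrative sketch of a single crawler action (not part of the spec).
// Placeholder values: index name, URL pattern, and extracted attributes.
const action = {
  indexName: 'algolia_website', // combined with the configuration's `indexPrefix`
  pathsToMatch: ['https://www.algolia.com/**'], // micromatch patterns, as described for `pathsToMatch`
  autoGenerateObjectIDs: true, // add an `objectID` to records that don't have one
  recordExtractor: ({ url, $ }) => {
    // Return one or more Algolia records for the crawled page.
    // `$` is the parsed page, `url` the crawled URL.
    return [
      {
        url: url.href,
        title: $('title').text(),
        content: $('p')
          .map((_, element) => $(element).text())
          .get()
          .join(' '),
      },
    ];
  },
};
```

Because the page content here is placeholder logic, adjust the extracted attributes to whatever your pages actually contain; the schemas in `action.yml` only constrain the action's shape, not the record fields you return.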