
task stuck in status “pending”. How to troubleshoot? #384

Closed
SyllaJay opened this issue Dec 28, 2024 · 5 comments

@SyllaJay

{
"status": "pending",
"created_at": 1733664419.2874162
}

@unclecode unclecode self-assigned this Dec 30, 2024
@unclecode
Owner

@SyllaJay I assume you are using Docker. Could you share your code samples and setup so I can replicate the problem? It's difficult to suggest anything without them.

@SyllaJay
Author

Yes, I'm using Docker (image unclecode/crawl4ai:all-amd64). Below is my code, which performs a simple HTTP request.

func GetCrawlTaskID(url string) (string, error) {
	if len(url) == 0 {
		return "", errors.New("url is empty")
	}
	c, err := client.NewClient(
		client.WithClientReadTimeout(3 * 60 * time.Second),
	)
	if err != nil {
		return "", err
	}

	req := &protocol.Request{}
	res := &protocol.Response{}
	req.Reset()
	req.SetMethod(consts.MethodPost)
	req.Header.SetContentTypeBytes([]byte("application/json"))
	req.Header.Set("Authorization", AuthorizationToken)
	req.SetRequestURI(crawlServer)

	reqBodyData := struct {
		Urls                 string `json:"urls,omitempty"`
		Priority             int32  `json:"priority,omitempty"`
		WordCountThreshold   int32  `json:"word_count_threshold,omitempty"`
		ExcludeExternalLinks string `json:"exclude_external_links,omitempty"`
		OnlyText             string `json:"only_text,omitempty"`
		Magic                string `json:"magic,omitempty"`
	}{
		Urls:                 url,
		Priority:             10,
		WordCountThreshold:   100,
		ExcludeExternalLinks: "true",
		OnlyText:             "true",
		Magic:                "true",
	}
	reqBodyJsonByte, err := json.Marshal(reqBodyData)
	if err != nil { // don't silently drop the marshal error
		return "", err
	}
	req.SetBody(reqBodyJsonByte)
	if err = c.Do(context.Background(), req, res); err != nil {
		return "", err
	}

	type responseBody struct {
		Detail string `json:"detail,omitempty"`
		TaskID string `json:"task_id,omitempty"`
	}
	var resBody responseBody
	if err = json.Unmarshal(res.Body(), &resBody); err != nil {
		return "", err
	}
	if resBody.Detail != "" {
		return "", fmt.Errorf("GetCrawlTaskID Failed! Detail=%s", resBody.Detail)
	}
	return resBody.TaskID, nil
}

func GetCrawlContentByTaskID(taskID string) (string, error) {
	if len(taskID) == 0 {
		return "", errors.New("taskID is empty")
	}
	c, err := client.NewClient(
		client.WithClientReadTimeout(3 * 60 * time.Second),
	)
	if err != nil {
		return "", err
	}

	req := &protocol.Request{}
	res := &protocol.Response{}
	req.Reset()
	req.SetMethod(consts.MethodGet)
	// req.Header.SetContentTypeBytes([]byte("application/json"))
	req.Header.Set("Authorization", AuthorizationToken)
	req.SetRequestURI(crawlServerTask + "/" + taskID)
	if err = c.Do(context.Background(), req, res); err != nil {
		return "", err
	}

	type responseBody struct {
		Detail    string  `json:"detail,omitempty"` // e.g. "detail": "Task not found"
		Status    string  `json:"status,omitempty"`
		CreatedAt float64 `json:"created_at,omitempty"`
		Result    struct {
			URL          string `json:"url,omitempty"`
			Success      bool   `json:"success,omitempty"`
			Html         string `json:"html,omitempty"`
			CleanedHtml  string `json:"cleaned_html,omitempty"`
			Markdown     string `json:"markdown,omitempty"`
			ErrorMessage string `json:"error_message,omitempty"`
			StatusCode   int    `json:"status_code,omitempty"`
		} `json:"result,omitempty"`
	}
	var resBody responseBody
	if err = json.Unmarshal(res.Body(), &resBody); err != nil {
		return "", err
	}
	if resBody.Detail != "" {
		return "", fmt.Errorf("GetCrawlContentByTaskID Failed! Detail=%s", resBody.Detail)
	}
	if resBody.Status != "completed" {
		return "", fmt.Errorf("GetCrawlContentByTaskID Failed! Status=%s", resBody.Status)
	}
	markdown := resBody.Result.Markdown
	if markdown == "" {
		return "", fmt.Errorf("GetCrawlContentByTaskID Failed! markdown content is empty, resBody=%s", string(res.Body()))
	}
	return markdown, nil
}

@unclecode
Owner

@SyllaJay I don't have much experience with Go (I wish it were Rust), so give me a few days to figure out why this happened. We haven't seen this issue from users of other languages so far. I will try to replicate what you are doing and determine why it keeps returning pending for you.

@SyllaJay
Author

SyllaJay commented Jan 18, 2025

@unclecode Thank you for your attention. It's just simple HTTP request code written in Go. I guess the code is not the key problem; it's probably the Docker server. The issue occurs when performing HTTP requests, especially when making many requests within a short period.

@unclecode
Owner

@SyllaJay You're welcome. I suggest waiting a week, because we will release a new version that includes many optimizations specifically for crawling multiple URLs simultaneously, along with some changes to the documentation. The new version ships by Monday, and the documentation will be updated about a week later. Then you can try again with that version. I'm closing the issue, but you're welcome to continue the discussion here.
