Conflicting results between methods 'canCrawl', 'check' and 'parse' #77

Open · sanderheilbron opened this issue Jan 1, 2017 · 10 comments

@sanderheilbron

The output of the canCrawl and check methods does not match the result of parsing (parse) the contents of a robots.txt file.

Example:

User-Agent: *
Disallow: /page-a

User-Agent: *
Disallow: /page-b

User-Agent: Googlebot
Crawl-Delay: 20

When checking whether the user agent Googlebot is allowed to crawl /page-a, Robotto gives the following results:

  • isAllowed: false
  • check: false

Results of parse:

"*": {
  "allow": [],
  "disallow": [
    "/page-a",
    "/page-b"
  ]
},
"Googlebot": {
  "allow": [],
  "disallow": []
}

Given how Googlebot handles robots.txt files, both isAllowed and check should return true here.
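
Google documents that a crawler obeys only the single most specific matching user-agent group; * is a fallback for crawlers without a group of their own, not something merged into every group. A rough sketch of that selection rule (illustrative only, with simplified prefix matching, not robotto's code; the rules object mirrors the parse output above):

function selectGroup(rules, userAgent) {
  // The most specific matching group wins; '*' applies only when the
  // crawler has no group of its own.
  return rules[userAgent] || rules['*'] || { allow: [], disallow: [] };
}

function isAllowed(rules, userAgent, path) {
  const group = selectGroup(rules, userAgent);
  // Simplified: treat each Disallow rule as a path prefix.
  return !group.disallow.some((rule) => path.startsWith(rule));
}

const rules = {
  '*': { allow: [], disallow: ['/page-a', '/page-b'] },
  Googlebot: { allow: [], disallow: [] },
};

isAllowed(rules, 'Googlebot', '/page-a');    // true: Googlebot's own (empty) group wins
isAllowed(rules, 'SomeOtherBot', '/page-a'); // false: falls back to '*'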

@lucasfcosta
Member

Hi @sanderheilbron, thanks for your issue!
This issue has the same root cause as #75.

Merging #76 will fix this. I'll merge that and release a new version.
Let me know if the problem persists.

@sanderheilbron
Author

Hi @lucasfcosta, thanks for the update and your effort in fixing these issues!

Yesterday I did some local tests with the fix for #76 and noticed this issue. Are you sure it will be fixed by merging #76?

@lucasfcosta
Member

@sanderheilbron Yup! This was happening because whenever robotto found a user-agent line it would create a new object to hold that user agent's rules, wiping out any rules already collected for the same user agent.
Now we check whether a rules object already exists when we find a user-agent line, instead of always creating a new one.
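
In sketch form, the change looks roughly like this (illustrative names, not robotto's actual source):

function parse(robotsTxt) {
  const rules = {};
  let current = null;

  robotsTxt.split('\n').forEach((line) => {
    const [field, ...rest] = line.split(':');
    const value = rest.join(':').trim();

    switch (field.trim().toLowerCase()) {
      case 'user-agent':
        // Before the fix this assignment ran unconditionally, so a repeated
        // "User-Agent: *" wiped out the rules collected for the first group.
        rules[value] = rules[value] || { allow: [], disallow: [] };
        current = rules[value];
        break;
      case 'allow':
        if (current) current.allow.push(value);
        break;
      case 'disallow':
        if (current) current.disallow.push(value);
        break;
    }
  });

  return rules;
}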

I'll be releasing a fix in a few minutes.

@lucasfcosta
Member

@sanderheilbron Done!
A new version of Robotto has just been released. The version with this fix is 1.0.15.
Let me know if you need anything else.
I'm always happy to be able to help 😄

@sanderheilbron
Author

@lucasfcosta Thanks!

@sanderheilbron
Author

Hi @lucasfcosta, just did some tests with v1.0.15, and unfortunately got the same results.

@lucasfcosta
Member

lucasfcosta commented Jan 3, 2017

Hi @sanderheilbron, thanks for getting in touch.
Are you using the exact same input for your tests? I'll do some further investigation.

EDIT: Actually, I think this behavior is correct.
Given what you've just posted, I think that * should apply to every user agent, so Googlebot really should not be allowed to access that page, since it has been disallowed for every user agent. Am I right?
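
The two readings can be compared directly against the parse output posted earlier (a small runnable check; both helpers are illustrative, not robotto's API):

const assert = require('assert');

const rules = {
  '*': { allow: [], disallow: ['/page-a', '/page-b'] },
  Googlebot: { allow: [], disallow: [] },
};

// Reading 1: '*' rules apply to every crawler, on top of its own group.
const cumulative = (ua, path) =>
  !(rules[ua] ? rules[ua].disallow : [])
    .concat(rules['*'].disallow)
    .some((rule) => path.startsWith(rule));

// Reading 2: only the most specific matching group applies; '*' is a fallback.
const mostSpecific = (ua, path) =>
  !(rules[ua] || rules['*']).disallow.some((rule) => path.startsWith(rule));

// The readings disagree exactly on the case in question:
assert.strictEqual(cumulative('Googlebot', '/page-a'), false);  // blocked
assert.strictEqual(mostSpecific('Googlebot', '/page-a'), true); // allowed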

@sanderheilbron
Author

Hi @lucasfcosta, you can test Googlebot's behaviour with the robots.txt Tester in Google Search Console (https://www.google.com/webmasters/tools/robots-testing-tool).

You can also use some other tools which follow how Googlebot handles robots.txt files:

@lucasfcosta
Member

lucasfcosta commented Jan 4, 2017

@sanderheilbron thank you very much!
That's great info! Hopefully I'll have some time to refactor this module entirely by the end of the month. I'll let you know when this issue is fixed.

For now I will reopen it.

Thanks for your help and sorry for not being able to solve it right now. However, I promise I'll work on this whenever I have some spare time.

@lucasfcosta lucasfcosta reopened this Jan 4, 2017
@sanderheilbron
Author

Thanks @lucasfcosta, I appreciate your time and effort.
