Identify sites that go to "Access Denied" or "Verify you are a human" pages when loaded #51

Closed
katehausladen opened this issue Jun 21, 2023 · 5 comments

@katehausladen
Collaborator

I've noticed that some sites load a page that says some variation of "Access Denied" or "Verify you are a human." I think this is mostly caused by the VPN (i.e., the VPN IP address is blocked or recognized), but the crawler probably contributes, too. The affected sites seem to vary from run to run, particularly in the "Verify you are a human" category.
We want to:

  1. identify these sites, since the crawl data will be inaccurate
  2. hopefully find a way to bypass at least some of these issues

Currently, I am identifying them by the title of the site (i.e. "Access Denied" and "Just a moment..." in the pictures below).

[Screenshots: an "Access Denied" page and a "Just a moment..." page]

The sites I do identify are logged as either a "VerifyHumanError" or an "AccessDeniedError". The crawler won't restart after these errors, so it doesn't affect the overall flow.
In the latest full crawl, the sites that asked to verify you were a human were:
VerifyHumanError: [
'https://www.legacy.com',
'https://www.ticketsatwork.com',
'https://www.fixya.com',
'https://www.cameo.com',
'https://www.cardinalcommerce.com',
'https://www.cinemark.com',
'https://www.securecafe.com',
'https://www.lordandtaylor.com',
'https://www.moneytalksnews.com',
'https://www.allegiantair.com',
'https://www.newspapers.com',
'https://www.rentcafe.com',
'https://www.babylonbee.com',
'https://www.fleetfarm.com',
'https://www.jegs.com',
'https://www.appurse.com',
'https://www.123-movies.com',
'https://www.camelcamelcamel.com',
'https://www.muscleandstrength.com']
And the sites with Access Denied were:
AccessDeniedError: [
'https://www.sprint.com',
'https://www.kroger.com',
'https://www.petsmart.com',
'https://www.tacobell.com',
'https://www.subway.com',
'https://www.zoosk.com',
'https://www.officedepot.com',
'https://www.hotwire.com',
'https://www.meijer.com',
'https://www.jcrew.com',
'https://www.backcountry.com',
'https://www.littlecaesars.com',
'https://www.fisglobal.com',
'https://www.fossil.com',
'https://www.flyertalk.com',
'https://www.citizensbank.com',
'https://www.earthlink.net',
'https://www.apartmentfinder.com',
'https://www.demandforce.com',
'https://www.pizzahut.com',
'https://www.t-mobile.com']
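
The title-based identification described above could be sketched roughly as follows. The function name `classifyTitle` and the exact patterns are assumptions for illustration, not the crawler's actual code; only the two titles and the two error names come from this thread.

```javascript
// Hypothetical helper: map a page title to one of the error names used above.
// Only "Access Denied" and "Just a moment..." are taken from this thread.
function classifyTitle(title) {
  const t = (title || '').trim();
  if (/access denied/i.test(t)) return 'AccessDeniedError';
  if (/just a moment/i.test(t)) return 'VerifyHumanError';
  return null; // no block page detected
}

console.log(classifyTitle('Access Denied'));     // AccessDeniedError
console.log(classifyTitle('Just a moment...'));  // VerifyHumanError
console.log(classifyTitle('Cinemark Theatres')); // null
```

Returning a name instead of throwing keeps the check easy to call from wherever the crawler reads the page title.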

@katehausladen katehausladen self-assigned this Jun 21, 2023
@OliverWang13
Collaborator

I've put some thought into this too and we could search for text like "Access Denied" or "Verify you are a human" by using a method similar to how we search for Do Not Sell links. Also, while we have access to @sophieeng's IP address, we could have her run a crawl and see if sites are still blocked.
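
A minimal sketch of that text-search idea, assuming the page's visible text has already been extracted into a string (for example, from the body element). The function name and phrase list are illustrative; the Do Not Sell link search is not shown in this thread, so this does not reproduce it.

```javascript
// Phrases quoted in this thread; the list itself is an assumption.
const BLOCK_PHRASES = ['access denied', 'verify you are a human'];

// Return the first phrase found in the page text, or null if none match.
function findBlockPhrase(pageText) {
  // Lowercase and collapse whitespace so phrases split across lines still match.
  const normalized = pageText.toLowerCase().replace(/\s+/g, ' ');
  return BLOCK_PHRASES.find((phrase) => normalized.includes(phrase)) || null;
}

console.log(findBlockPhrase('Please verify you\n are a human to continue'));
console.log(findBlockPhrase('Welcome to our store'));
```

Searching body text is broader than checking the title, so it may also flag ordinary pages that merely mention one of these phrases.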

katehausladen added a commit that referenced this issue Jun 21, 2023
now, errors will be logged in error-logging.json
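
One way such a log could be structured, mirroring the per-error-type lists shown earlier in this issue. The helper name `recordError` is hypothetical, and the actual layout of error-logging.json is not shown in this thread.

```javascript
// Hypothetical helper: group failing URLs by error name, skipping duplicates.
function recordError(log, errorName, url) {
  if (!log[errorName]) log[errorName] = [];
  if (!log[errorName].includes(url)) log[errorName].push(url);
  return log;
}

const log = {};
recordError(log, 'VerifyHumanError', 'https://www.cinemark.com');
recordError(log, 'AccessDeniedError', 'https://www.kroger.com');
recordError(log, 'AccessDeniedError', 'https://www.kroger.com'); // duplicate ignored

// This string is what could be written to error-logging.json.
const serialized = JSON.stringify(log, null, 2);
console.log(serialized);
```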
@katehausladen
Collaborator Author

some more examples:

[Screenshots: ask, petco, zoominfo]
@katehausladen
Collaborator Author

katehausladen commented Jun 22, 2023

I wrote some code to click Cloudflare's "Verify you are a human" button (as in the cinemark.com example). When the button is clicked, the same captcha page loads again. Since clicking the button alone does not work, Cloudflare is definitely detecting that we are using Selenium. I haven't found any packages that would help with this for Selenium with Node.js and Firefox, and since it's only impacting ~20 sites, I'm not sure it's worth spending a ton of time trying to bypass it. We can discuss this more at the meeting.
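
The observation above (the click goes through but the same challenge page comes back) can be checked programmatically. This is a sketch only: `stillChallenged` and `clickChallenge` are hypothetical names, the driver is assumed to expose an async `getTitle()` as selenium-webdriver's `WebDriver` does, and locating the Cloudflare button is left to the caller because no locator is given in this thread.

```javascript
// Hypothetical check: after clicking, is the challenge title still showing?
async function stillChallenged(driver, clickChallenge, waitMs = 5000) {
  const challengeTitle = /just a moment/i;
  // If the page is not a challenge page to begin with, there is nothing to do.
  if (!challengeTitle.test(await driver.getTitle())) return false;
  // The caller supplies how to find and click the button.
  await clickChallenge(driver);
  // Give the page time to reload before reading the title again.
  await new Promise((resolve) => setTimeout(resolve, waitMs));
  return challengeTitle.test(await driver.getTitle());
}
```

A `true` result corresponds to the behavior reported here: the click did not clear the check.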

katehausladen added a commit that referenced this issue Jun 22, 2023
- Matching access and denied separately (for titles like "Access to this page has been denied.")

- matching "error" and "service unavailable"

- maximize browser screen size

- clicking on Cloudflare captcha is commented in lines 143-152
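
Matching "access" and "denied" separately could look roughly like the sketch below. The patterns are an assumption about how the commit does it; the example title and the phrases "error" and "service unavailable" are the ones listed in the commit message.

```javascript
// Both words must appear somewhere in the title, in any order.
const accessDenied = /^(?=.*\baccess\b)(?=.*\bdenied\b)/i;
const otherBlocks = /\berror\b|service unavailable/i;

function isBlockedTitle(title) {
  return accessDenied.test(title) || otherBlocks.test(title);
}

console.log(isBlockedTitle('Access to this page has been denied.')); // true
console.log(isBlockedTitle('Access Denied'));                        // true
console.log(isBlockedTitle('503 Service Unavailable'));              // true
console.log(isBlockedTitle('Taco Bell | Order Online'));             // false
```

A bare "error" match is broad, so it may also flag legitimate pages whose titles happen to contain that word.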
@SebastianZimmeck SebastianZimmeck added the enhancement New feature label Jun 24, 2023
@SebastianZimmeck
Member

SebastianZimmeck commented Jun 24, 2023

> The crawler won't restart after these errors, so it doesn't affect the overall flow.

The crawler continues crawling (as a clarification).

> impacting ~20 sites

That is out of about 1,800 sites in total, though; the ~20 are specifically the Cloudflare human checks.

katehausladen added a commit that referenced this issue Jun 26, 2023
- instead of having VerifyHumanError, AccessDeniedError, SiteError, just have one HumanCheckError
- adding "pardon our interruption" to the list to check, as it occasionally occurs on realtor.com
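
A sketch of how a single error type might replace the three earlier ones while letting the crawl continue. The class body, the title pattern, and the loop are assumptions; only the name HumanCheckError and the phrase "pardon our interruption" come from the commit message.

```javascript
class HumanCheckError extends Error {
  constructor(url, title) {
    super(`Human check or block page at ${url}: "${title}"`);
    this.name = 'HumanCheckError';
    this.url = url;
  }
}

// Illustrative pattern; the crawler's real regex is not shown in this thread.
const HUMAN_CHECK_TITLE = /just a moment|pardon our interruption/i;

function assertNotHumanCheck(url, title) {
  if (HUMAN_CHECK_TITLE.test(title)) throw new HumanCheckError(url, title);
}

// Record the flagged site and move on to the next one instead of restarting.
function crawlTitles(pages) {
  const flagged = [];
  for (const { url, title } of pages) {
    try {
      assertNotHumanCheck(url, title);
    } catch (err) {
      if (!(err instanceof HumanCheckError)) throw err; // unrelated errors propagate
      flagged.push(err.url);
    }
  }
  return flagged;
}
```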
katehausladen added a commit that referenced this issue Jun 30, 2023
describing updates from issue #51 in the readme
@katehausladen
Collaborator Author

We decided in the last meeting not to try to bypass any human checks. Instead, as we see more sites like this, we will add their titles to the regex in the visit_site function in local-crawler.js. Section 4 of the readme is also updated to describe this.
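
If the list of titles keeps growing, one option is to build the regex from an array so that each new title is a one-line addition. This is a sketch under that assumption; the actual regex in visit_site is not shown in this thread, and the names below are hypothetical.

```javascript
// Titles mentioned in this thread; extend the array as new ones are observed.
const HUMAN_CHECK_TITLES = [
  'just a moment',
  'pardon our interruption',
  'service unavailable',
];

// Escape regex metacharacters so each title is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Join the escaped titles into one case-insensitive alternation.
function buildTitleRegex(titles) {
  return new RegExp(titles.map(escapeRegExp).join('|'), 'i');
}

const titleRegex = buildTitleRegex(HUMAN_CHECK_TITLES);
console.log(titleRegex.test('Pardon Our Interruption')); // true
```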
