PKP Support Forum


OJS invisible to various bots

Post by szmigieldesign » Wed Nov 28, 2012 10:09 am

Hello,

I'm responsible for the OJS installation at http://forumoswiatowe.pl. The first journal issue was published recently, but the website is still invisible to various web crawlers and bots. I did some research and was perplexed to see that some services report odd errors, such as missing meta tags.

For example, scanning the website at http://www.submitexpress.com/analyzer/ reports no basic meta tags (description, title, keywords), which isn't true, as they appear in the website's source. Also, the validator at http://validator.w3.org/ reports one error:
Code:
Line 1, Column 1: end of document in prolog
which suggests that the document it retrieved is empty. However, I can still open the website in several browsers without any problems.

I believe the website's invisibility in search engines is caused by something blocking bots and crawlers from indexing it. So far I've checked robots.txt and the .htaccess files for errors, without success. robots.txt disallows only /cache/. My .htaccess file excludes a long list of bots from the site; I temporarily removed that list to check whether it was stopping Googlebot or other services from indexing the website, but it made no difference.
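
[Editor's sketch] The robots.txt check described above can be reproduced offline with Python's standard `urllib.robotparser`. The robots.txt content below is an assumption reconstructed from the description (only /cache/ disallowed), the `/cache/x.html` path is made up for illustration, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content, reconstructed from the post
# (only /cache/ is disallowed); the real file may differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cache/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://forumoswiatowe.pl/",
        "http://forumoswiatowe.pl/index.php/czasopismo",
        "http://forumoswiatowe.pl/cache/x.html",  # made-up path under /cache/
    ):
        print(url, is_allowed(ROBOTS_TXT, "Googlebot", url))
```

If a file like this allows both the root and the journal URL, robots.txt is unlikely to be what keeps crawlers away.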

Does anyone have a suggestion of what might be the cause of this problem?
szmigieldesign
 
Posts: 30
Joined: Thu Aug 30, 2012 2:22 am
Location: PL

Re: OJS invisible to various bots

Post by asmecher » Wed Nov 28, 2012 10:27 am

Hi szmigieldesign,

I can't think of anything in OJS that would cause this; it may be a web server configuration/security/filtering issue. I'd suggest getting a tool or web browser plugin that allows impersonation of different user agents; in Firefox, for example, there's a "default user agent" extension that can do this.
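
[Editor's sketch] For anyone without a browser extension, the same user-agent impersonation can be scripted. This is a minimal sketch using only Python's standard library; the user-agent strings are illustrative and `build_request` and `probe` are hypothetical helper names. Note that `urlopen` raises `HTTPError` for 4xx/5xx responses, which is itself a useful signal if only some agents are rejected.

```python
from urllib.request import Request, urlopen

# Illustrative user-agent strings; real clients send versioned values.
USER_AGENTS = {
    "browser": "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20100101 Firefox/17.0",
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "wget": "Wget/1.13.4 (linux-gnu)",
}


def build_request(url: str, user_agent: str) -> Request:
    """Build a GET request that identifies itself as user_agent."""
    return Request(url, headers={"User-Agent": user_agent})


def probe(url: str, timeout: float = 10.0) -> dict:
    """Fetch url once per user agent; return {name: (status, body_length)}."""
    results = {}
    for name, user_agent in USER_AGENTS.items():
        with urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            results[name] = (resp.status, len(resp.read()))
    return results


# Example (requires network access):
#   print(probe("http://forumoswiatowe.pl/"))
```

Different statuses or body lengths per agent would point to server-side filtering; identical results would suggest the cause lies elsewhere.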

Regards,
Alec Smecher
Public Knowledge Project Team
asmecher
 
Posts: 8575
Joined: Wed Aug 10, 2005 12:56 pm

Re: OJS invisible to various bots

Post by szmigieldesign » Wed Nov 28, 2012 10:44 am

I really didn't expect OJS to behave this strangely; after all, it was made to work well out of the box. I'll look into it with my hosting provider and see if I can find something.

Re: OJS invisible to various bots

Post by szmigieldesign » Wed Nov 28, 2012 10:59 am

I have just found out something interesting:

When I paste "http://forumoswiatowe.pl/index.php/czasopismo" into the W3C validator, it opens the page and reports that the markup is valid (no empty-file error), but it gets an empty response for "http://forumoswiatowe.pl". It looks like web crawlers and other bots aren't being redirected properly. Since there's only one journal on this OJS installation, I've set the site to redirect to it site-wide. However, since PHP is doing the redirect, something may be misconfigured. Unfortunately, I don't know enough about Apache configuration to diagnose it myself. I also temporarily removed .htaccess (which contains a www to non-www redirect) to see whether it was conflicting, but that didn't fix the problem.

Perhaps one solution would be a hard redirect from "http://forumoswiatowe.pl/" to "http://forumoswiatowe.pl/index.php/czasopismo", but I don't know whether OJS's internal redirect mechanism would kick in before .htaccess.

Re: OJS invisible to various bots

Post by asmecher » Wed Nov 28, 2012 11:22 am

Hi szmigieldesign,

The redirects are implemented in lib/pkp/classes/core/PKPRequest.inc.php as:
Code:
header("Refresh: 0; url=$url");
As far as I'm aware we've had no reports that crawlers have a problem with this, but stranger things have happened.
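
[Editor's sketch] To act on a header like the one above, a client has to parse the `Refresh` value itself, because it is not a 3xx redirect. A minimal parser might look like the following; `parse_refresh` is a hypothetical helper, and the accepted syntax (a delay, then an optional `url=` part after `;` or `,`) is an assumption based on how browsers commonly treat the header.

```python
import re
from typing import Optional, Tuple

# delay, optionally followed by ";" or "," and an optional "url=" prefix
_REFRESH_RE = re.compile(
    r"^\s*(\d+)\s*(?:[;,]\s*(?:url\s*=\s*)?(.*?))?\s*$",
    re.IGNORECASE,
)


def parse_refresh(value: str) -> Optional[Tuple[int, Optional[str]]]:
    """Parse a Refresh header value into (delay_seconds, url or None).

    Returns None if the value does not start with an integer delay.
    """
    match = _REFRESH_RE.match(value)
    if match is None:
        return None
    delay = int(match.group(1))
    url = match.group(2) or None
    if url:
        url = url.strip("'\"")  # tolerate a quoted URL
    return delay, url


if __name__ == "__main__":
    print(parse_refresh("0; url=http://forumoswiatowe.pl/index.php/czasopismo"))
```

A client that only understands status codes and `Location` never runs logic like this, so it simply sees a 200 response.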

Regards,
Alec Smecher
Public Knowledge Project Team

Re: OJS invisible to various bots

Post by szmigieldesign » Wed Nov 28, 2012 11:33 am

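This is the relevant part of the code in my PKPRequest.inc.php:
Code:
function redirectUrl($url) {
   PKPRequest::_checkThis();

   // If a registered hook reports that it handled the redirect, stop here.
   if (HookRegistry::call('Request::redirect', array(&$url))) {
      return;
   }

   // Redirect via a Refresh header (not a 3xx status with Location).
   header("Refresh: 0; url=$url");
   exit();
}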


Also, I'm using OJS 2.3.8, the latest stable.

Re: OJS invisible to various bots

Post by szmigieldesign » Wed Nov 28, 2012 1:34 pm

Hello,

I've started a ticket at my hosting company's help desk and the solution suggested by the administrator is to change
Code:
header("Refresh: 0; url=$url");
to
Code:
header("Location: $url");


The reason is that "Refresh" is supported by web browsers but may not be supported by crawlers or by tools such as wget or curl. The administrator also accessed the website with wget with debugging enabled; here is the response:

Code:
DEBUG output created by Wget 1.13.4 on linux-gnu.

URI encoding = `UTF-8'
--2012-11-28 20:46:48--  http://forumoswiatowe.pl/
Resolving forumoswiatowe.pl (forumoswiatowe.pl)... 85.17.223.149
Caching forumoswiatowe.pl => 85.17.223.149
Connecting to forumoswiatowe.pl (forumoswiatowe.pl)|85.17.223.149|:80... connected.
Created socket 3.
Releasing 0x080b1ec8 (new refcount 1).

---request begin---
GET / HTTP/1.1
User-Agent: Wget/1.13.4 (linux-gnu)
Accept: */*
Host: forumoswiatowe.pl
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Date: Wed, 28 Nov 2012 19:46:47 GMT
Server: Apache
Refresh: 0; url=http://forumoswiatowe.pl/index.php/czasopismo
Set-Cookie: OJSSID=c8647ee118d94e8a392a2619124e0664; path=/
Vary: Accept-Encoding,User-Agent
Content-Length: 0
Keep-Alive: timeout=2, max=10000
Connection: Keep-Alive
Content-Type: text/html

---response end---
200 OK

Stored cookie forumoswiatowe.pl -1 (ANY) / <session> <insecure> [expiry none] OJSSID c8647ee118d94e8a392a2619124e0664
Registered socket 3 for persistent reuse.
Length: 0 [text/html]
Saving to: `index.html'

    [ <=>                                                                                               ] 0           --.-K/s   in 0s     

2012-11-28 20:46:48 (0.00 B/s) - `index.html' saved [0/0]


Everything works well now that I've changed Refresh to Location. However, I wonder why other OJS installations don't have this problem. I'd like to hear from the developers whether it's safe to leave this line changed, or whether it might compromise the journal's security. Is there perhaps a better way of dealing with this problem? I'm also worried about future OJS updates: will I have to patch this file manually each time?
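
[Editor's sketch] The difference between the two headers can be reproduced locally without touching OJS. The sketch below is a Python stand-in, not OJS code: a throwaway server answers `/refresh` with 200 plus a `Refresh` header and an empty body, and `/location` with a 302 plus `Location`. The paths, the page content, and the helper name `body_length_seen_by_client` are all made up for illustration. The client is plain `urllib`, which follows 3xx redirects but, like wget here, ignores `Refresh`.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

PAGE = b"<html><head><title>Journal</title></head><body>Issue 1</body></html>"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/refresh":
            # 200 OK + Refresh + empty body, as in the wget trace.
            self.send_response(200)
            self.send_header("Refresh", "0; url=/journal")
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif self.path == "/location":
            # A real HTTP redirect: 3xx status + Location.
            self.send_response(302)
            self.send_header("Location", "/journal")
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif self.path == "/journal":
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(PAGE)))
            self.end_headers()
            self.wfile.write(PAGE)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo quiet


def body_length_seen_by_client(path: str) -> int:
    """Serve one client request on a random local port; return the body size."""
    server = HTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        opener = build_opener(ProxyHandler({}))  # bypass any configured proxy
        url = "http://127.0.0.1:%d%s" % (server.server_address[1], path)
        with opener.open(url, timeout=5) as resp:
            return len(resp.read())
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print("refresh :", body_length_seen_by_client("/refresh"))
    print("location:", body_length_seen_by_client("/location"))
```

With `/refresh` the client ends up with an empty document, which matches the validator's "end of document in prolog" error; with `/location` it receives the journal page.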

Re: OJS invisible to various bots

Post by asmecher » Wed Nov 28, 2012 1:56 pm

Hi szmigieldesign,

We used to use Location: headers but changed to the current Refresh: per http://pkp.sfu.ca/bugzilla/show_bug.cgi?id=3520. That change was made quite a while ago, so I'm surprised nobody has reported crawling problems until now. I'd be interested in potentially returning to Location:, as it's more standards-friendly, but only if it doesn't cause major problems with IE. I'll see if we can do some testing with the team on this. Meanwhile, changing back to Location: won't have any other side effects.

Regards,
Alec Smecher
Public Knowledge Project Team

Re: OJS invisible to various bots

Post by szmigieldesign » Wed Nov 28, 2012 2:24 pm

Hello,

I've done some quick research and found two example journals, based on OJS 2.3.4 and OJS 2.3.6 (http://journal.rsw.edu.pl and http://www.qualitative-research.net), that suffer from the same problem (check with http://www.submitexpress.com/analyzer/).

Generally speaking, it seems that, from some version onward, OJS installations that host a single journal with the site-wide redirect enabled are invisible to Google and other search engines.

Re: OJS invisible to various bots

Post by asmecher » Wed Nov 28, 2012 2:33 pm

Hi szmigieldesign,

Thanks -- we'll follow up, probably in http://pkp.sfu.ca/bugzilla/show_bug.cgi?id=6670.

Regards,
Alec Smecher
Public Knowledge Project Team

Re: OJS invisible to various bots

Post by szmigieldesign » Wed Nov 28, 2012 2:47 pm

You're welcome.

I'm glad I could help.

Re: OJS invisible to various bots

Post by szmigieldesign » Wed Dec 12, 2012 1:25 pm

Two weeks after the problem was fixed on forumoswiatowe.pl, Google still isn't indexing the website, even though Google Webmaster Tools is set up. I've tried fetching particular subpages as Google and submitting them for re-indexing, and even generating a sitemap.xml. Still, the website doesn't seem to be indexed at all. In fact, when I check the index status (advanced view), I can see 32 pages that were not selected for indexing (green), which is strange since a journal has already been published on forumoswiatowe.pl with several articles available.

Does anyone have any suggestion as to what the cause might be? Could the redirect from the root to index.php/czasopismo be the culprit? The only other redirect I have set up is www to non-www in .htaccess. However, the site in Google Webmaster Tools is configured as forumoswiatowe.pl, not http://www.forumoswiatowe.pl, so indexing shouldn't even trigger the www to non-www redirect.

Re: OJS invisible to various bots

Post by szmigieldesign » Thu Dec 13, 2012 4:39 am

I've checked Google Webmaster Tools and it looks like Googlebot has no problem reaching the site. Also, robots.txt isn't blocking anything other than the /cache/ directory (and the cache is off anyway). What worries me is that the advanced view shows up to 32 pages being ignored because they are either very similar to each other or have too many redirects (or redirect to pages with similar content).

I'm worried that a basic redirect from domain root to journal might be the culprit.

Re: OJS invisible to various bots

Post by asmecher » Thu Dec 13, 2012 9:59 am

Hi szmigieldesign,

Can you confirm that you're seeing this behavior even after the redirect type has been changed?

I doubt that a root redirect would cause this, as it's not unusual behavior -- but I'm not aware of Google's internal decision-making. One way you could probably test this is to disable the redirect in Site Administration; that way the homepage will include a link to your journal rather than a redirect directly to it.

Regards,
Alec Smecher
Public Knowledge Project Team

Re: OJS invisible to various bots

Post by szmigieldesign » Fri Dec 14, 2012 11:08 am

Hello,

I'm happy to report that the problem with this particular OJS installation was resolved after Google acknowledged that the domain had previously been blocked for being a link farm. Clearly, I made a mistake by not thoroughly investigating whether forumoswiatowe.pl had been misused before it was registered for the e-journal. I'm already seeing positive feedback in Google Webmaster Tools, and two subpages are already indexed by Google, so I believe it's now just a matter of time before all pages are indexed and visible in organic search.

Still, I'm happy that my investigation led to the discovery of a redirect problem in OJS. It's great to know that OJS is evolving.

I would also like to thank everyone for their support. I haven't seen such a quick and responsive community in a long time.
