You are viewing the PKP Support Forum | PKP Home Wiki



[SOLVED] HOWTO import old articles (pdfs&regional charset)


Postby mbria » Thu Feb 23, 2006 6:15 am

Hi,

I will try to summarize here some "bottlenecks" I found while importing articles from a custom-made eJournal into OJS, and I will also post any new problems I find (which will surely be due to my inexperience). It will help me clarify and archive things, and I hope it will also help new users.

As the title shows, my goal is to import my old system's articles (including their PDF galleys) into a clean OJS 2.1 installation in my localized environment (the articles include special regional characters).

Note for desperate readers: this won't be a step-by-step howto. It assumes you know basic GNU/Linux administration and enough OJS to find the right links in your application menus.

Base System:
OJS: 2.1.0.1
Operating System: Linux
PHP Version: 4.3.10
Apache Version: Apache/2.0.53 mod_perl/1.99_14 Perl/v5.8.4 PHP/4.3.10-10
Database: mysql
Database version: 4.0.23_Debian
Languages: [es],en,po
i18n: locale es_ES | client_charset utf-8 | database_charset Off
files: umask 18(000 010 010 [rwxr-xr-x] -> www-data:www-data)
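
The umask of 18 in the listing above looks odd only because the system info page shows it in decimal. A minimal sketch (Python is used here purely for illustration, and describe_umask is a made-up helper name) converts it to the familiar octal form and shows the permissions it produces:

```python
import stat


def describe_umask(decimal_umask: int) -> tuple[str, str, str]:
    """Return (octal umask, directory mode string, file mode string)."""
    dir_mode = 0o777 & ~decimal_umask   # new directories start from rwxrwxrwx
    file_mode = 0o666 & ~decimal_umask  # new regular files start from rw-rw-rw-
    return (
        format(decimal_umask, "04o"),
        stat.filemode(stat.S_IFDIR | dir_mode),
        stat.filemode(stat.S_IFREG | file_mode),
    )


print(describe_umask(18))
```

For 18 this gives octal 0022, that is rwxr-xr-x for directories, which matches the listing.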

1) Review your base system at: index.php/index/admin/systemInfo
Two sections are especially important for us: i18n and files.
They can give you a clue about why your import doesn't work as you expect.

2) Where is all the stuff? The OJS team included the native import PHP scripts in /plugins/importexport/native of your installed application.
There you will find the DTD and an XML example called sample.xml.
Extra documentation can be found at /docs/IMPORTEXPORT.

3) An example? As reported in some posts (viewtopic.php?t=398), the sample won't work with OJS 2.1. The DTD shows that email is mandatory and that the first date is not in the right order, so my suggestion is to try this simplest example first:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE issues SYSTEM "native.dtd">
<issues>
 <issue>
  <title>Test Issue</title>
  <access_date>2001-01-01</access_date>
  <section>
   <title>Articles</title>
   <article>
    <title>Jane Doe's Article</title>
    <date_published>2004-10-05</date_published>
    <author>
     <firstname>Jane</firstname>
     <lastname>Doe</lastname>
     <email>none@mail.net</email>
    </author>
   </article>
  </section>
 </issue>
</issues>


4) Encoding correctly: This is where I got stuck most of the time. :-)
I tested different combinations of encoding until I found the one that worked on my system:
a) Varying the encoding in the XML declaration (UTF-8, ISO-8859-1).
b) Varying the file encoding (ASCII, Unicode and UTF-8).
c) Saving the file in DOS and Unix formats.
I didn't look into the code, but I suspect the UTF-8 header is hardcoded (it is also what the OJS sample suggests), so I believe it's the only choice. BTW, be careful if you are working from an MS Windows system, because the file needs to be saved in Unix format.

At the end my working combination is:
xml encode UTF-8 and
UTF-8 file (with BOM)
saved as unix (LF only).
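
The three properties of that working combination can be checked before attempting an import. The following is only a sketch in Python (not part of OJS; both function names are invented for illustration), and it assumes an ISO-8859-1 source file unless told otherwise:

```python
import codecs


def inspect_encoding(data: bytes) -> dict:
    """Report BOM, UTF-8 validity and line endings for raw file bytes."""
    try:
        data.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    return {
        "utf8_bom": data.startswith(codecs.BOM_UTF8),
        "valid_utf8": valid_utf8,
        "crlf_line_endings": b"\r\n" in data,
    }


def to_working_combination(data: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Re-encode as UTF-8 with BOM and LF-only line endings.

    This does not rewrite the encoding named in the XML declaration;
    that still has to say UTF-8.
    """
    text = data.decode(source_encoding)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return codecs.BOM_UTF8 + text.encode("utf-8")
```

Reading the file with open(path, "rb").read() and passing the bytes to inspect_encoding shows at a glance which of the three settings differs from the combination above.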

This example extends the former one to include special characters:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE issues PUBLIC "-//PKP//OJS Articles and Issues XML//EN" "http://pkp.sfu.ca/ojs/dtds/native.dtd">

<issues>
 <issue published="true" current="false">
  <title>Publicación de prueba 01</title>
  <description>[Description]Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.</description>
  <volume>4</volume>
  <number>2</number>
  <year>2006</year>
  <date_published>2006-02-23</date_published>
  <access_date>2006-02-23</access_date>
  <section>
   <title locale="es_ES">Artículos</title>
   <abbrev locale="es_ES">ART</abbrev> 
   <article>
    <title>Título del artículo 01</title>
    <abstract>[Resumen] Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.</abstract>
    <date_published>1975-10-05</date_published>
    <author>
     <firstname>Nombre autor 01</firstname>
     <lastname>Apellidos autor</lastname>
     <email>user@mail.net</email>
    </author>
   </article>
  </section>
 </issue>
</issues>


As has been said on the forum, "xmllint is your friend". :-)

If the encoding is wrong, you will notice it quickly in the issue and article titles.
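
A quick way to look at those titles before importing is to parse the file and print them. This is a rough Python sketch (list_titles is a hypothetical helper); it only checks that the file is well formed and decodable, and it does not validate against native.dtd the way xmllint can:

```python
import xml.etree.ElementTree as ET


def list_titles(xml_bytes: bytes) -> list[str]:
    """Parse the import file and return every <title> text in document order.

    Raises xml.etree.ElementTree.ParseError if the bytes are not well-formed
    XML, which includes bytes that do not match the declared encoding.
    """
    root = ET.fromstring(xml_bytes)
    return [element.text or "" for element in root.iter("title")]
```

If the printed titles show the accented characters correctly, the declared encoding and the real file encoding agree.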

5) Adding a PDF galley (under construction)
This is the place to comment on the nuances of PDF importing: relative paths, embed vs. file, web interface vs. command line...

My conclusions are:
Using the web interface: fully qualified URLs are mandatory (for instance http://www.server.net/files/myfile.pdf).
The command line accepts everything (including absolute file paths such as /var/www/files/myfile.pdf or relative ones such as ../../../files/myfile.pdf).

Am I right? What is your experience?



AND THE QUESTION IS... ?

It's about this fifth point. In practice, when importing through the web interface with PDFs referenced by URL, I get the following error:

A specified URL "http://www.myserver.net/myJournal/oldNumbers/myfile.pdf" could not be copied to a local file.


How is this possible? Which local directory??

From any browser I can download the PDF using the "specified URL", and all my OJS files now have mode 777 and www-data:www-data as owner. The final code is like the former example, with the following added after </author>:

<galley>
  <label>PDF</label>
    <file>
        <href src="http://www.myserver.net/myJournal/oldNumbers/myfile.pdf" mime_type="application/pdf"/>
    </file>
</galley>


Any idea about what I'm doing wrong?

Thanks for your help or feedback

Cheers,

m.
Last edited by mbria on Wed Mar 21, 2012 2:23 am, edited 3 times in total.

Postby asmecher » Thu Feb 23, 2006 12:22 pm

Hello mbria,

Your description is correct on all points -- I'm sure this will be useful to people assessing the import process.

In regards to the error message you're receiving, there are a few failures that could cause this message:
  • PHP's copy(...) function returns failure. This can happen if allow_url_fopen is disabled in php.ini, if the source file couldn't be read, or if the target file couldn't be written.
  • chmod(...) returns failure. This is only called if a umask is specified in the [files] section of config.inc.php; it will fail if the user doesn't have sufficient permissions to perform the chmod.
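
The two failure points listed above can be reproduced outside OJS to see which one applies on a given server. What follows is a Python analogue, not the OJS PHP code: copy_url_to_file is an invented name, and urllib stands in for PHP's URL-aware copy(...), so it says nothing about allow_url_fopen itself:

```python
import os
import shutil
import urllib.request


def copy_url_to_file(url: str, target: str, mode: int = 0o644) -> str:
    """Return 'ok', or a message naming the first step that failed."""
    try:
        with urllib.request.urlopen(url) as source:   # the source must be readable
            with open(target, "wb") as destination:   # the target must be writable
                shutil.copyfileobj(source, destination)
    except (OSError, ValueError) as exc:
        return f"copy failed: {exc}"
    try:
        os.chmod(target, mode)                        # needs sufficient permissions
    except OSError as exc:
        return f"chmod failed: {exc}"
    return "ok"
```

Running it as the web server user (www-data in this thread) against the PDF URL and a path inside the files directory gives a first hint about whether the read, the write or the chmod is the problem.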

Public files are saved as [files path]/journals/[journal ID]/articles/[article ID]/public/[article ID]-[galley ID]-[revision #]-PB.[file extension] -- the www-data user has sufficient permissions to write to this location.
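
As an illustration of that naming pattern only (a Python sketch; the function name and the example values are made up), the expected location of an imported galley can be built like this:

```python
def public_galley_path(files_dir, journal_id, article_id, galley_id, revision, extension):
    """Build the public file path following the pattern described above."""
    return (
        f"{files_dir}/journals/{journal_id}/articles/{article_id}/public/"
        f"{article_id}-{galley_id}-{revision}-PB.{extension}"
    )


# Hypothetical values, only to show the shape of the result.
print(public_galley_path("/srv/ojs/files", 1, 7, 3, 1, "pdf"))
```

Checking whether that path exists after an import attempt shows how far the copy got.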

I think the most likely problem you're running into is allow_url_fopen being disabled. If not, I'd recommend putting some debug outputs into the relevant functions to check where the failure is occurring. If you'd like help with this process, feel free to contact me at pkp-support@sfu.ca.

If allow_url_fopen is the problem and you'd like to maintain this setting for security purposes, I'd recommend using the ini_set function to enable it before the call to handleUpload in ArticleFileManager::uploadPublicFile (classes/file/ArticleFileManager.inc.php) and disable it again afterwards.

Regards,
Alec Smecher
Open Journal Systems Team

Postby mbria » Fri Feb 24, 2006 3:41 am

Thanks a lot Alec,

It's very relaxing working with OJS knowing you are always there. :-)

I tested both suggestions:

a) Check allow_url_fopen in php.ini. This is my php.ini setting:

; Whether to allow the treatment of URLs (like http:// or ftp://) as files.
allow_url_fopen = On



b) Change umask at config.inc.php:

; Permissions mask for created files and directories
umask = 0022


As noted above, the owner user and group are www-data:www-data, but I (temporarily) changed the umask to 0000 to be sure it isn't the problem.

c) Permissions: I checked both the web files and the storage files, setting them to 777 and www-data:www-data. I reviewed my "...../journals/1/articles/" directory: new article ID directories are created, but they are empty.

In the end... no changes, same error. :-(

I will try debugging with the ancient "print" method, but is there any OJS log file? Which classes/files do you recommend starting with? Is classes/file/ArticleFileManager.inc.php OK?

Cheers and thanks a lot for your help,

m.

it works!!!

Postby dudu » Tue Feb 28, 2006 5:54 am

hi all,
the code written by mbria has worked very well in my case. I only changed the location of the .pdf file. Instead of the location in the original code, which was:

<href src="http://www.myserver.net/myJournal/oldNumbers/myfile.pdf" mime_type="application/pdf"/>

I have changed it to :

<href src="http://www.server.net/myjournal directory/ojs-2.1.0-1/files/journals/1/articles/1.pdf" mime_type="application/pdf"/>

thanks mbria!!! It is good to know that guys like you are also there :) :D

thanks

Postby mbria » Thu Mar 02, 2006 5:59 am

Thanks, dudu, for your kind words and for your suggestion.
Let me thank the OJS developers, testers and writers for their unbelievable work on this platform. This micro-howto is less than a drop of water compared with the sea of effort those guys have put into OJS.

Ok... let's work :-P
Some feedback about your suggested solution.

It's nice if it works... but I think it will only work if you "web-map" your OJS files directory, and I believe that isn't a recommended practice, because the files can then be accessed without authorization.

If you understand the "open knowledge" principle from a radical perspective, it won't be a problem for you to offer all your documentation to people who are "very interested" in it, but others will probably think that original articles, reviewed ones or heavily modified galleys need to be secured and hidden from outsiders.

Summarizing: your suggestion will work fine with the OJS web import interface if you map your OJS files directory into your web site, which IMHO isn't a recommended practice.

Following your suggestion, I will try to import my PDF by pointing to the absolute system path of my OJS files... which means I need to run the import from the command line, because the web interface only works with URLs.

If I'm talking rubbish or somebody has a better idea... speak now or forever hold your peace. :-P

Cheers,

m.

its solved and "xml writer xls file"

Postby dudu » Thu Mar 02, 2006 8:11 am

hi all,
first of all, the code works fine in the way mbria stated, but in a somewhat different way. As far as I can understand, the import function of OJS renames the files and copies them under /files/journals/1/articles etc. in OJS's main directory. When I first read your message I thought that OJS just records the URL entered in the XML file. However, to find out whether it really does so, I deleted the original files from the web-mapped location (that is, the place outside OJS's main directory) and saw that readers could still access the PDFs. Thus, once the import is completed, you can delete the PDFs from the location stated in the XML transfer files. So our (at least my) problem of securely exporting the old issues is solved :)

By the way, I have written an Excel file that generates mbria's XML code for the transfer once you enter the necessary information, such as titles, abstracts, etc. Of course, at least intermediate Excel knowledge is necessary to use my "xml writer xls file". I think it will be very useful for those who need to import lots of issues: instead of struggling with XML code, they can enter the information easily.
I am considering making it public, but first, since mbria wrote the code originally, I thought I needed mbria's permission; second, if mbria gives permission, I will work on it to make it more "user friendly". I am waiting for mbria's permission.
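
The same idea, generating the import XML from tabular data, can be sketched in a few lines of Python. This is not dudu's Excel file: build_issue_xml and its dictionary keys are invented here, and the element order is simply copied from the examples earlier in this thread rather than checked against native.dtd:

```python
import xml.etree.ElementTree as ET


def build_issue_xml(issue_title, access_date, section_title, articles):
    """Build a single-issue import document.

    articles: list of dicts with the keys title, date_published, firstname,
    lastname, email and, optionally, pdf_src (a URL or local path to a PDF).
    """
    issues = ET.Element("issues")
    issue = ET.SubElement(issues, "issue")
    ET.SubElement(issue, "title").text = issue_title
    ET.SubElement(issue, "access_date").text = access_date
    section = ET.SubElement(issue, "section")
    ET.SubElement(section, "title").text = section_title
    for row in articles:
        article = ET.SubElement(section, "article")
        ET.SubElement(article, "title").text = row["title"]
        ET.SubElement(article, "date_published").text = row["date_published"]
        author = ET.SubElement(article, "author")
        for field in ("firstname", "lastname", "email"):
            ET.SubElement(author, field).text = row[field]
        if row.get("pdf_src"):
            galley = ET.SubElement(article, "galley")
            ET.SubElement(galley, "label").text = "PDF"
            file_element = ET.SubElement(galley, "file")
            ET.SubElement(file_element, "href", src=row["pdf_src"],
                          mime_type="application/pdf")
    header = ('<?xml version="1.0" encoding="UTF-8"?>\n'
              '<!DOCTYPE issues SYSTEM "native.dtd">\n')
    return header + ET.tostring(issues, encoding="unicode")
```

The returned string should be written out as UTF-8 and run through xmllint before importing, since nothing here validates it.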

command line importation...

Postby mbria » Thu Mar 02, 2006 10:05 am

Hi dudu,

It would be wonderful if you found a way to make the import more "user friendly".
I would love to use it too, since I'm still having trouble with my import (I'm probably still missing something, and your Excel file will help me clarify it). Thanks in advance.

The XML files I wrote here are just variations (simplified for teaching purposes) of the original XML samples you can find in the OJS distribution, so I'm not the real "owner". The original files (like mine) are distributed under the GPL v2 license, which basically means that if you credit the original author you are free to do whatever you like. See docs/COPYING if you need more details.

In this collaborative "research", I thought this was the moment to start playing with the command-line import. I'm posting my notes here, just in case they are helpful for those who are more of a newbie than me (if there is anybody with that profile :-D).

Dudu, your first post made me think about the import differently from how I thought about it at the beginning. Your last correction (the import is a COPY of the files, not just a link) fits better with my initial approach and, looking into the code, it's how it really works.

I will update the main post with the following text as soon as we find it useful for our final import goals... probably dudu's Excel file will be enough to do the job and this won't be useful to anybody.

This mini-howto continuation is still incomplete. I will finish it tomorrow.

--------

It looks like our "trip" around the OJS import plugin is leading us to the command-line import (see former posts).

I went to the plugin directory, which points to the IMPORTEXPORT documentation, where we can find the following information useful for our purposes:

Each import/export plugin has a unique name, defined in its PHP source code, by
which it can be invoked using the command line tool. To get a list of all
plugins via the command line, execute the following:
php tools/importExport.php list


So I tried it, to be sure everything was working well, and I got:

Available plugins:
EruditExportPlugin
NativeImportExportPlugin
UserImportExportPlugin


Great. We have NativeImportExportPlugin installed.
Let's keep reading... and in the plugin developer's standards we find:

- Journals should be addressed by path;
- Local hrefs, such as <href src="localFile"/>, should be supported
only by the command line tool and should be discarded by the web-
based tool for security reasons;


Good. It makes sense. So...
Important mental note: local system paths such as /srv/storage/myfiles/file.pdf are accepted only by the command-line tool.
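
That rule can be turned into a small pre-flight check on every href src in an import file. The sketch below is Python, with invented helper names, and it encodes only the conclusions reached in this thread (URLs for the web interface, anything for the command line), not the plugin's actual logic:

```python
from urllib.parse import urlparse


def href_kind(src: str) -> str:
    """Classify an href src as a remote 'url' or a 'local' file path."""
    scheme = urlparse(src).scheme.lower()
    return "url" if scheme in ("http", "https", "ftp") else "local"


def usable_from(src: str) -> list[str]:
    """List the import front ends that, per this thread, accept the src."""
    return ["web", "cli"] if href_kind(src) == "url" else ["cli"]
```

Any src for which usable_from returns only "cli" means the file has to be imported with tools/importExport.php rather than through the browser.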

And we arrive at our point: "Articles & Issues XML Plugin", where it says:

To get usage information for the command line tool, execute the following:

php tools/importExport.php NativeImportExportPlugin usage


And we test... without getting any usage information from the plugin.
OK, never mind if the programmer was a little lazy when it came to giving user information. I'm not the right computer guy to criticize, because I always do the same. So... let's deduce the import usage from the export example we have:

php tools/importExport.php NativeImportExportPlugin export out.xml demo issue 3


So I suspect the importation usage is:

php tools/importExport.php NativeImportExportPlugin import input.xml


Let's see what happens...

(Going to dinner. I will continue afterwards.)

Postby asmecher » Thu Mar 02, 2006 10:14 am

Hi Mbria,

The command
php tools/importExport.php NativeImportExportPlugin usage
should indeed present you with usage information... If you're getting dumped back onto the command line without any feedback, check to make sure that PHP isn't encountering any startup errors. On some systems, the command-line version of PHP has a different configuration file than the CGI (or module-based) PHP; if, for example, the command-line version isn't configured to include the MySQL module, it'll drop to the command line without any messages.

Double-check that the MySQL module is in place for the CLI PHP executable, make sure display_startup_errors is enabled in php.ini and try again. This will vary from system to system, but on Debian, the CLI PHP4 executable uses /etc/php4/cli/php.ini (as opposed to the Apache one, which is in /etc/php4/apache/php.ini).

Regards,
Alec Smecher
Open Journal Systems Team

you are right... again.

Postby mbria » Thu Mar 02, 2006 10:53 am

asmecher, aren't you tired of being always right :lol:

I checked my PHP CLI php.ini configuration and it's exactly as you said: my Apache php.ini loads the mysql module, but the CLI one does not.

Newbie question: How can you tell?

Solution: Ask PHP about its configuration by executing the following command:
$ php -i | grep mysql


and the output will be something like this:
Configure Command => '../configure' '--prefix=/usr' '--disable-cgi' '--with-config-file-path=/etc/php4/cli' '--enable-memory-limit' '--disable-debug' '--with-regex=php' '--disable-rpath' '--disable-static' '--with-pic' '--with-layout=GNU' '--with-pear=/usr/share/php' '--enable-calendar' '--enable-sysvsem' '--enable-sysvshm' '--enable-sysvmsg' '--enable-track-vars' '--enable-trans-sid' '--enable-bcmath' '--with-bz2' '--enable-ctype' '--with-db4' '--with-iconv' '--enable-exif' '--enable-filepro' '--enable-ftp' '--with-gettext' '--enable-mbstring' '--with-pcre-regex=/usr' '--enable-shmop' '--enable-sockets' '--enable-wddx' '--disable-xml' '--with-expat-dir=/usr' '--with-xmlrpc' '--enable-yp' '--with-zlib' '--without-pgsql' '--with-kerberos=/usr' '--with-openssl=/usr' '--enable-dbx' '--with-mime-magic=/usr/share/misc/file/magic.mime' '--with-exec-dir=/usr/lib/php4/libexec' '--without-mm' '--without-mysql' '--without-sybase-ct' '--enable-pcntl' '--with-ncurses=/usr'


"--without-mysql" shows that asmecher was right.
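
Spotting that one flag in such a long line is easier with a few lines of code. A minimal Python sketch (both function names are made up) that takes the text printed by php -i and extracts the configure flags:

```python
import re


def configure_flags(php_info: str) -> set[str]:
    """Collect the quoted '--flag' entries from php -i output."""
    return set(re.findall(r"'(--[^']+)'", php_info))


def mysql_compiled_out(php_info: str) -> bool:
    """True if PHP was configured with --without-mysql."""
    return "--without-mysql" in configure_flags(php_info)
```

A True result only says that MySQL support was not compiled in; it can still be loaded as a shared extension from php.ini.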

Newbie: OK, in plain words... How do I fix it?

Solution: As asmecher said, edit your PHP CLI configuration file with your favorite editor:

$ vi /etc/php4/cli/php.ini


And around line 530 (it depends on the configuration) you will find an entry like:

;extension=mysql.so


You only need to delete the ";" character to uncomment this line and make the mysql module active.
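
For anyone who prefers to script that edit, here is a hedged Python sketch (enable_extension is an invented helper; it works on the text of the file and leaves reading and writing /etc/php4/cli/php.ini to the caller):

```python
import re


def enable_extension(ini_text: str, extension: str = "mysql.so") -> str:
    """Uncomment a ';extension=<name>' line, leaving all other lines untouched."""
    pattern = re.compile(
        r"^[ \t]*;[ \t]*(extension[ \t]*=[ \t]*" + re.escape(extension) + r")[ \t]*$",
        re.MULTILINE,
    )
    return pattern.sub(r"\1", ini_text)
```

A line that is already uncommented is left as it is, so running the function twice does no harm.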

Let's execute the plugin again with the usage parameter:

$ php tools/importExport.php NativeImportExportPlugin usage


And now you will get the plugin's usage output (my OS locale is Spanish, so it is translated here):

Usage: tools/importExport.php NativeImportExportPlugin [command] ...
Commands:
import [xmlFileName] [journal_path] [user_name] ...
export [xmlFileName] [journal_path] articles [articleId1] [articleId2] ...
export [xmlFileName] [journal_path] article [articleId]
export [xmlFileName] [journal_path] issues [issueId1] [issueId2] ...
export [xmlFileName] [journal_path] issue [issueId]

Additional parameters are required for importing data as follows, depending on the
root node of the XML document.


If the root node is <article> or <articles>, additional parameters are required.
The following formats are accepted:

tools/importExport.php NativeImportExportPlugin import [xmlFileName] [journal_path] [user_name]
issue_id [issueId] section_id [sectionId]

tools/importExport.php NativeImportExportPlugin import [xmlFileName] [journal_path] [user_name]
issue_id [issueId] section_name [name]

tools/importExport.php NativeImportExportPlugin import [xmlFileName] [journal_path]
issue_id [issueId] section_abbrev [abbrev]
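
Going by that synopsis, an import call can be assembled programmatically, which helps when many XML files have to be imported in a loop. This is a Python sketch with an invented helper name; "demo" is the journal path from the export example above and "admin" is a hypothetical user name:

```python
import shlex


def import_command(xml_file, journal_path, user_name, extra=()):
    """Build the argument list for a NativeImportExportPlugin import."""
    return [
        "php", "tools/importExport.php", "NativeImportExportPlugin",
        "import", xml_file, journal_path, user_name, *extra,
    ]


# Print the command line instead of running it.
print(shlex.join(import_command("input.xml", "demo", "admin")))
```

The list can be handed to subprocess.run from the OJS base directory; the extra argument carries the issue_id and section parameters needed when the root node is <article> or <articles>.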


Conclusion: As was to be expected, the NativeImportExportPlugin's developer wasn't a lazy programmer. :P

Thanks asmecher for your help.

------------

Talking about a different (but thread-related) problem... as you will notice from reading this post, I'm stuck trying to make the plugin import my old PDF files through the web interface.

In your former post you suggested that I debug the code to see where it fails, so I asked where you think I should start. Any new suggestions now? Any ideas?

You encouraged me to annoy you by mail, but (if that's also OK for you) I think it will be better for the OJS user community if I post the whole process here.

Although I'm not an expert PHP OOP programmer, I can read the code quite well, so with your suggestions I'm sure I can follow the trace until I find the problem. BTW, is there any OJS log file anywhere? Any debug methods?

Once again, thanks a lot for your help and the great work done in this platform,

marc

Postby asmecher » Thu Mar 02, 2006 11:49 am

Hi Mbria,

Just a lucky guess, I'm afraid.

FYI, The conclusion you came to above is correct -- when you import articles using an "href" to refer to an external file, OJS only needs to access that file during the import. It's then copied into the appropriate place in the OJS files directory and you don't need to leave it at the URL specified. Likewise, when importing files from a local directory, they're copied into OJS's files path and you don't need to leave the originals in place afterwards.

For debugging the import problem you're encountering, I haven't been able to find a more effective way than using well-placed "print" statements; PHP doesn't lend itself to tool-based debugging, unfortunately. I'd start in classes/file/ArticleFileManager.inc.php, specifically in the handleCopy function, and in classes/file/FileManager.inc.php, in the copyFile and setMode functions. One of these functions is returning false, triggering the error message you were encountering, and you'll need to investigate why.

Regards,
Alec Smecher
Open Journal Systems Team

oki doki

Postby mbria » Thu Mar 02, 2006 12:11 pm

Thanks Alec. I will follow your debugging plan.

Leaving work right now... but I promise I will annoy you a little bit more tomorrow. :P

Cheers,

m.

xmlwriterxls version 1.0

Postby dudu » Fri Mar 17, 2006 11:31 am

hi all,
The Excel file that generates the code written by mbria is available at

http://www.metu.edu.tr/~dudu/

Click the link above and go to the "Download" page by selecting the "Download" link on the left side of the site. Follow the instructions for the download.
After the download is complete, first run a virus check. (I don't think the file is infected, but it is always good to run a virus check before you open any file.)
Then extract the xls file from the zip file (using 7-zip, WinZip, WinRAR or something similar) and open the Excel file. (When you open the file, Excel may refuse to run it, since the file contains a macro that generates the code automatically. If you are familiar with Excel and do not want to do everything automatically, just disable the macro: go to Excel's security settings page and adjust the settings so that Excel asks you what to do about macros. If you are not familiar with Excel and want to run the process automatically, you need to enable the macros after changing the security settings as described.)
After opening the file, just follow the instructions on the issue data page. Don't forget to delete the example entries before entering your actual data.
I hope it works...

Postby dudu » Wed May 03, 2006 1:59 am

Version 1.0 of xmlwriterxls transfers wrong page numbers. The problem is fixed in version 1.1, which is available at
http://www.metu.edu.tr/~dudu/

Thanks !

Postby mbria » Wed May 03, 2006 3:26 am

Thanks, dudu, for sharing.

I was too overwhelmed to answer your earlier mail, but I'd like to thank you for your effort and your sharing spirit.

I'm working on a PHP version similar to your Excel file. It doesn't offer many benefits over your development, but I had an unusual database to import and I feel more comfortable with PHP than with Excel.

I will publish my version here as soon as I finish it, probably this week or next.

Cheers,

m.

anything need?

Postby dudu » Mon Jul 24, 2006 12:17 pm

hi,
I am sure that a PHP file will be a "major" contribution and very much appreciated (especially by me), since even I am having trouble importing with the Excel file. Anyway, if there is anything I can help with, just say so.
