Building a thematic harvester with Harvester2


Postby obi » Mon Jun 23, 2008 5:55 am

We are planning to use PKP Harvester2 to build a thematic harvester. We have three categories of data we would like to harvest:

1. Repositories whose entire repository or set we would like to harvest. This function is already in Harvester2.

2. Repositories whose records we would preprocess, with each record automatically loaded into Harvester2 if it contains “SURE” specific scientific names. As I understand it, this can be achieved by editing “RegexPreprocessorPlugin.inc.php”.

3. Repositories whose records we would preprocess, where a record containing “UNSURE” scientific names is deposited into a temp database for manual checking and, if OK, is loaded into Harvester2 later.

Has anyone done points 2 and 3 before? Any suggestions or examples are very welcome.

Thanks in advance.
Obi
University of Tromsø
obi
 
Posts: 52
Joined: Wed Jun 09, 2004 5:56 am

Re: Building a thematic harvester with Harvester2

Postby asmecher » Mon Jun 23, 2008 5:00 pm

Hi Obi,

1) and 2) should be pretty straightforward, and your suggested approach is fine; the 3rd point is the one that will take some further thought. What kind of review of the record contents will be necessary? Will the record be corrected at the source end and then re-harvested?

Regards,
Alec Smecher
Public Knowledge Project Team
asmecher
 
Posts: 8860
Joined: Wed Aug 10, 2005 12:56 pm

Re: Building a thematic harvester with Harvester2

Postby obi » Tue Jun 24, 2008 1:57 am

For the 3rd point (the UNSURE process):
We were thinking of keeping a list of scientific terms (words) in an SQL table; call it the “unsureWordmatching” table. If any of a record's title, abstract, subject, description or url metadata values matches any of the scientific terms in “unsureWordmatching”, we deposit the record in a temp table. A manual check is then done against the temp table. If we are satisfied that the record in the temp table should go to Harvester2, we either:

1. Insert the record and all its info into Harvester2, or
2. Insert its unique id (e.g. url) into a SURE SQL table and use the id to re-harvest next time.

It is the SURE table that “RegexPreprocessorPlugin.inc.php” will use to determine which records are automatically loaded into Harvester2.
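
The matching step described above might be sketched roughly as follows. This is only an illustration, not Harvester2 code: triageRecord and containsAnyTerm are hypothetical helper names, the term lists are passed in as plain arrays rather than read from the SQL tables, and whole-word, case-insensitive matching is an assumption about how terms should match.

```php
<?php
// Hypothetical helpers, not part of Harvester2.

// True if any term occurs in $text as a whole word (case-insensitive).
function containsAnyTerm(string $text, array $terms): bool {
    foreach ($terms as $term) {
        // preg_quote() escapes characters that are special in a regex.
        $pattern = '/\b' . preg_quote($term, '/') . '\b/i';
        if (preg_match($pattern, $text) === 1) {
            return true;
        }
    }
    return false;
}

// Classify one record's metadata (field name => value) as
// 'sure' (load directly), 'unsure' (send to the temp table) or 'skip'.
// SURE terms are checked first, so they win over UNSURE terms.
function triageRecord(array $metadata, array $sureTerms, array $unsureTerms): string {
    $text = implode(' ', $metadata);
    if (containsAnyTerm($text, $sureTerms)) {
        return 'sure';
    }
    if (containsAnyTerm($text, $unsureTerms)) {
        return 'unsure';
    }
    return 'skip';
}
```

For example, with 'Gadus morhua' as a SURE term and 'cod' as an UNSURE term (both made up for illustration), a title such as "Cod fisheries" would come back as 'unsure', while "Codex studies" would be skipped because only whole words match.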

What do you think about the two alternatives?

I am still studying the implementation of Harvester2. How can I extract the title, abstract, subject, description and url metadata values from the variable “$value” in /plugins/preprocessors RegexPreprocessorPlugin.inc.php and assign them to new variables like $tem_title, $new_abstract, …?

Thanks
Obi

Re: Building a thematic harvester with Harvester2

Postby asmecher » Fri Jun 27, 2008 2:47 am

Hi Obi,

I would suggest merely tracking the record's identifier, along with a text dump of all the data you need to review in whatever format is most convenient, in the temp table. Then you'll need to add some code to the harvester's archive management form to allow the user to enter a single ID for harvesting -- this shouldn't be too much work, once you've got your head wrapped around the code, and it would probably be useful for the community if you're able to contribute the implementation back.

The regexp preprocessor plugin's preprocessEntry function is called once per field, i.e. once for a title, once for an abstract, once for an author name, etc., rather than once for an entire record. Use the $field object to determine what particular field is being processed at a given moment. (Judicious use of the print_r function to dump variables will help you to figure out what's happening here.)
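
Because the callback sees one field at a time, one way to get at a whole record's title, abstract, subject and so on is to collect the values as they arrive. A minimal sketch, assuming modern PHP: FieldCollector and StubField are hypothetical names, StubField only stands in for the real $field object (of which only getName() is confirmed by the plugin's own comments), and the record key is assumed to be some identifier you can obtain from $record.

```php
<?php
// Stand-in for the harvester's field object, used only to make this
// sketch runnable on its own. Only getName() mirrors the real object.
class StubField {
    private $name;
    public function __construct(string $name) { $this->name = $name; }
    public function getName(): string { return $this->name; }
}

// Hypothetical collector: groups incoming values by record key and
// field name, so a record's fields can be inspected together.
class FieldCollector {
    private $values = [];

    // Call once per preprocessEntry invocation.
    public function add(string $recordKey, $field, string $value): void {
        $this->values[$recordKey][$field->getName()][] = $value;
    }

    // All values seen so far for one field of one record
    // (an empty array if none have arrived yet).
    public function get(string $recordKey, string $fieldName): array {
        return $this->values[$recordKey][$fieldName] ?? [];
    }
}
```

Note that a field may repeat (several subjects, several authors), which is why each field name maps to a list of values rather than to a single variable.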

Regards,
Alec Smecher
Public Knowledge Project Team

Re: Building a thematic harvester with Harvester2

Postby obi » Fri Jun 27, 2008 5:53 am

Thanks very much. I will report back to the forum.

Obi

Re: Building a thematic harvester with Harvester2

Postby obi » Tue Sep 09, 2008 6:22 am

Hi Alec,
I have been trying to use the regexp preprocessor plugin's preprocessEntry function to filter out entries I do not want stored in Harvester2. How do I stop the script from storing empty records, that is, records whose metadata I have deleted entirely? An empty record is displayed when I browse by id; for example,

/index.php/record/view/33914 shows empty fields and values.

Thanks in advance.
Obi

Re: Building a thematic harvester with Harvester2

Postby asmecher » Tue Sep 09, 2008 6:42 pm

Hi Obi,

The preprocessor plugin handles individual pieces of metadata, rather than entire records, and was intended to alter metadata before indexing, rather than stopping the record from being recorded entirely. That said, you may be able to use the RecordDAO to delete a record during the preprocessor plugin's callback.

Regards,
Alec Smecher
Public Knowledge Project Team

Re: Building a thematic harvester with Harvester2

Postby obi » Fri Sep 19, 2008 6:22 am

Hi Alec,
I am not quite sure I understand what you mean here. Are you saying that I can delete a record after it has been inserted into the database? Is a given record already inserted into the database before the preprocessing phase is called? Can you tell me specifically which files I can alter to add this function? I am not very good at OO programming.

Thanks.
Obi

Re: Building a thematic harvester with Harvester2

Postby asmecher » Fri Sep 19, 2008 8:43 am

Hi Obi,

The record is inserted when the harvester parses the beginning of a record entry, before the preprocessor plugin is invoked. If you write a preprocessor plugin, and if you determine in the process of handling a record that you'd like to prevent it from getting indexed, you may be able to delete the record by getting the record object from the harvester:
Code:
$record =& $harvester->getRecord();
...then deleting the record using the RecordDAO:
Code:
$recordDao =& DAORegistry::getDAO('RecordDAO');
$recordDao->deleteRecord($record);
However, this isn't the intended use of a preprocessor plugin, so your mileage may vary.
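
The decision logic around such a delete might look something like the sketch below. It is written under assumptions and is not tested against Harvester2: RecordRejector is a hypothetical class, the record key is assumed to be some identifier taken from the record, and the $deleteRecord callback is a stand-in that, in a real plugin, would wrap the RecordDAO call quoted above. How the harvester behaves when later fields arrive for an already-deleted record is not covered here.

```php
<?php
// Hypothetical guard, not part of Harvester2. It rejects a record as
// soon as one of its field values matches an exclusion pattern, and
// makes sure the delete callback runs at most once per record.
class RecordRejector {
    private $pattern;
    private $deleteRecord;
    private $deleted = [];

    // $pattern is a PCRE pattern; $deleteRecord receives the record key.
    public function __construct(string $pattern, callable $deleteRecord) {
        $this->pattern = $pattern;
        $this->deleteRecord = $deleteRecord;
    }

    // Call once per field. Returns true if the record has been
    // rejected, either now or on an earlier field.
    public function check(string $recordKey, string $value): bool {
        if (isset($this->deleted[$recordKey])) {
            return true;
        }
        if (preg_match($this->pattern, $value) === 1) {
            ($this->deleteRecord)($recordKey);
            $this->deleted[$recordKey] = true;
            return true;
        }
        return false;
    }
}
```

Since the callback is invoked per field, remembering which records have already been rejected avoids attempting the same delete again for every remaining field.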

Regards,
Alec Smecher
Public Knowledge Project Team

Re: Building a thematic harvester with Harvester2

Postby dmoses » Thu Dec 18, 2008 12:29 pm

I'm just trying to clarify how you might preprocess your results to include only records with a particular word/term using RegexPreprocessorPlugin.inc.php.
If I were harvesting records and only wanted to include those with dc.subject=potatoes, how would I modify RegexPreprocessorPlugin.inc.php? The example in the script shows how you might modify the content of a result, but how would you include or exclude full records from harvesting?

Code:
function preprocessEntry(&$archive, &$record, &$field, &$value, &$attributes) {
    /*
     * Add your regular expressions, and any conditional logic, here. You can use
     * methods like $archive->getArchiveId() and $field->getName() in your code.
     * This example removes periods from the ends of subject elements:
     *
     * if ($field->getName() == 'subject') {
     *    $value = preg_replace('/\.$/', '', $value);
     * }
     */
    return false;
}


How would the function be modified? Is it possible ... looking at other posts on the list it doesn't look like it is?
Thanks,
Don
dmoses
 
Posts: 1
Joined: Thu Dec 18, 2008 11:46 am

