Sunday, 29 December 2013

Indexing a docx file with Apache Solr

When you need to search inside a file's contents, you can use the Solr Data Import Handler. The response should show the line of content in which the search word appears, so to process the file line by line you can use the LineEntityProcessor. The data-config file is as follows:


<dataConfig>
<dataSource type="FileDataSource" name="fds"/>
<document>
<entity name="filelist" processor="FileListEntityProcessor" fileName="sample.docx"
          rootEntity="false"   baseDir="C:\SampleDocuments" >
        <entity name="fileline" processor="LineEntityProcessor"
                url="${filelist.fileAbsolutePath}" format="text">                   
                <!-- LineEntityProcessor returns each line in a column named rawLine -->
                <field column="rawLine" name="rawLine"/>
        </entity>
</entity>
</document>
</dataConfig>

The schema.xml needs an entry for rawLine:
<field name="rawLine"  type="text" indexed="true" stored="true"/>
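For Solr to pick up the data-config file, the Data Import Handler usually has to be registered in solrconfig.xml. The following is a minimal sketch, not taken from the original setup: the handler path /dataimport and the file name data-config.xml are assumptions, and the Data Import Handler jar is assumed to be already on Solr's classpath.

```xml
<!-- solrconfig.xml: registers the Data Import Handler (sketch) -->
<!-- Assumption: the config shown above is saved as data-config.xml in the core's conf directory -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <!-- Name of the data-config file to load -->
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```

With a handler like this in place, a full import is normally started by requesting the handler with command=full-import, after which a query on rawLine should return the matching lines.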
