Massive Technical Interviews Tips: Tika Miscs

Tuesday, September 15, 2015

Tika Miscs

https://www.tutorialspoint.com/tika/tika_language_detection.htm

To detect the language of a document, a language profile is constructed and compared with the profile of the known languages. The text set of these known languages is known as a corpus.

A corpus is a collection of texts of a written language that explains how the language is used in real situations.

The corpus is developed from books, transcripts, and other data resources like the Internet. The accuracy of the corpus depends upon the profiling algorithm we use to frame the corpus.
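As a rough illustration of comparing a document profile against the profiles of known languages, here is a minimal plain-Java sketch. It assumes character trigram counts as the profile and cosine similarity as the comparison; the source does not say which profiling algorithm Tika uses, so the class name `TrigramProfileDemo`, its methods, and the tiny sample corpus are all hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class TrigramProfileDemo {
    /** Counts overlapping character trigrams in the lower-cased text. */
    static Map<String, Integer> profile(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two trigram count vectors (0 when either is empty). */
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the key of the corpus text whose profile is most similar, or "unknown". */
    static String closest(String text, Map<String, String> corpus) {
        Map<String, Integer> target = profile(text);
        String best = "unknown";
        double bestScore = 0;
        for (Map.Entry<String, String> e : corpus.entrySet()) {
            double score = similarity(target, profile(e.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy corpus for illustration only; real profiles are built from far more text.
        Map<String, String> corpus = Map.of(
            "en", "the quick brown fox jumps over the lazy dog and the cat is on the mat",
            "sv", "alla människor är födda fria och lika i värde och rättigheter");
        System.out.println(closest("the dog is on the mat", corpus));
    }
}
```

With a corpus this small the scores are only meaningful for text that shares trigrams with the samples; a usable detector needs much larger profiles.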

The common way of detecting languages is by using dictionaries. The words used in a given piece of text will be matched with those that are in the dictionaries.

A list of common words used in a language is the simplest effective corpus for detecting that language, for example the articles a, an, and the in English.
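The dictionary approach can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class `CommonWordLanguageGuesser`, its word lists, and the "unknown" fallback are hypothetical and are not part of Tika.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class CommonWordLanguageGuesser {
    // Tiny hand-picked word lists for illustration only; a real list would be far larger.
    private static final Map<String, Set<String>> COMMON_WORDS = new LinkedHashMap<>();
    static {
        COMMON_WORDS.put("en", Set.of("a", "an", "the", "and", "of", "is"));
        COMMON_WORDS.put("de", Set.of("der", "die", "das", "und", "ist", "ein"));
        COMMON_WORDS.put("fr", Set.of("le", "la", "les", "et", "est", "un"));
    }

    /** Returns the language whose common-word list matches the most tokens, or "unknown". */
    static String guess(String text) {
        // Split on anything that is not a letter, so punctuation and digits are ignored.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : COMMON_WORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guess("The cat sat on a mat and the dog is asleep"));
        System.out.println(guess("Der Hund ist ein Freund und die Katze auch"));
    }
}
```

Counting raw hits favors longer word lists and fails on short inputs that contain none of the listed words, which is one reason profile-based detectors are usually preferred.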

Of the 184 languages standardized by ISO 639-1, Tika can detect 18.

https://issues.apache.org/jira/browse/TIKA-1723

The language-detector project at https://github.com/optimaize/language-detector is faster, has more languages (70 vs 13) and better accuracy than the built-in language detector.

This is a stab at integrating it, with some initial findings. There are a number of issues this raises, especially if Chris A. Mattmann moves forward with turning language detection into a pluggable extension point.

https://github.com/apache/tika/blob/master/tika-example/src/main/java/org/apache/tika/example/Language.java
LanguageDetector detector = new OptimaizeLangDetector().loadModels();
LanguageResult result = detector.detect("Alla människor är födda fria och lika i värde och rättigheter.");
https://www.tutorialspoint.com/tika/tika_language_detection.htm
http://stackoverflow.com/questions/7025915/what-is-the-best-language-detect-library-or-web-api-available-even-paid
LanguageIdentifier (used below) is @Deprecated; use org.apache.tika.language.detect.LanguageDetector instead.

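import org.apache.tika.language.LanguageIdentifier;

public String detectLangTika(String text) throws Exception {
    LanguageIdentifier li = new LanguageIdentifier(text);
    // Only return the guess when Tika reports it as reasonably certain.
    if (li.isReasonablyCertain()) {
        return li.getLanguage();
    }
    // The declared exception type now matches what is actually thrown.
    throw new Exception("Tika lang detection not reasonably certain");
}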

Tika IOException: mark/reset not supported

The stream you provide to this method must therefore support the optional mark/reset functionality. If it does not, decorate your resource stream with a BufferedInputStream.

//read audio data from whatever source (file/classloader/etc.)
InputStream audioSrc = getClass().getResourceAsStream("mySound.au");
//add buffer for mark/reset support
InputStream bufferedIn = new BufferedInputStream(audioSrc);
AudioInputStream audioStream = AudioSystem.getAudioInputStream(bufferedIn);

stream = new BufferedInputStream(stream);

tika.detect(stream)
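To see why the BufferedInputStream wrapper helps, here is a small JDK-only sketch. It does not call Tika; `withoutMarkSupport` is a hypothetical stand-in for a raw stream that cannot rewind, and `peek` imitates a detector that reads the first few bytes and then resets the stream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class MarkResetDemo {
    /** Hypothetical stand-in for a raw stream without mark/reset support. */
    static InputStream withoutMarkSupport(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override public boolean markSupported() { return false; }
            @Override public void mark(int readLimit) { /* ignored, as in InputStream */ }
            @Override public void reset() throws IOException {
                throw new IOException("mark/reset not supported");
            }
        };
    }

    /** Wraps the stream only when it cannot already rewind. */
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Reads the first n bytes, then rewinds so a later reader sees the whole stream. */
    static byte[] peek(InputStream in, int n) throws IOException {
        in.mark(n);
        byte[] head = in.readNBytes(n);
        in.reset();
        return head;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, 0x01, 0x02};
        InputStream in = ensureMarkSupported(withoutMarkSupport(data));
        byte[] head = peek(in, 3);
        System.out.println("peeked " + head.length + " bytes, "
            + in.readAllBytes().length + " bytes still readable");
    }
}
```

Calling `peek` directly on the unwrapped stream throws the same kind of IOException as in the heading above, since `reset()` has nothing to rewind to.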

tika.parse(fileInputStream, metadata);

metadata.get(Metadata.IMAGE_WIDTH)

How to check that a file's content is really an image

You'll see that MediaType has a getType() and a getSubtype() method. What you are looking for is the type, i.e. the "image" part of "image/jpeg". The subtype in this case would be "jpeg".

So your test should be:

if (mediaType.getType().equals("image")) {
   // Deal with image
}
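The type/subtype split can be illustrated without Tika on the classpath. The sketch below works on a plain MIME string such as "image/jpeg"; the class `MimeTypeCheck` and its methods are hypothetical helpers that only mimic what getType() and getSubtype() return.

```java
import java.util.Locale;

final class MimeTypeCheck {
    /** Strips parameters such as "; charset=..." and normalizes case. */
    private static String baseType(String mime) {
        return mime.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
    }

    /** Returns the primary type, e.g. "image" for "image/jpeg". */
    static String primaryType(String mime) {
        String base = baseType(mime);
        int slash = base.indexOf('/');
        return slash < 0 ? base : base.substring(0, slash);
    }

    /** Returns the subtype, e.g. "jpeg" for "image/jpeg", or "" if there is none. */
    static String subtype(String mime) {
        String base = baseType(mime);
        int slash = base.indexOf('/');
        return slash < 0 ? "" : base.substring(slash + 1);
    }

    /** True when the primary type is "image", whatever the subtype. */
    static boolean isImage(String mime) {
        return "image".equals(primaryType(mime));
    }

    public static void main(String[] args) {
        System.out.println(isImage("image/jpeg"));
        System.out.println(isImage("application/pdf"));
    }
}
```

Comparing only the primary type accepts every image format, which matches the test above; checking the subtype as well would restrict it to specific formats.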

http://www.tutorialspoint.com/tika/tika_extracting_image_file.htm
http://www.jeroenreijn.com/2010/04/metadata-extraction-with-apache-tika.html
final Metadata metadata = new Metadata();
final ParseContext context = new ParseContext();
new JpegParser().parse(fileInputStream, new DefaultHandler(), metadata, context);

// Prefer the tiff: keys and fall back to the "Image Width" / "Image Height" keys.
final String imageWidth =
    MoreObjects.firstNonNull(metadata.get("tiff:ImageWidth"), metadata.get("Image Width"));
final String imageHeight =
    MoreObjects.firstNonNull(metadata.get("tiff:ImageLength"), metadata.get("Image Height"));
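One caveat with this pattern: Guava's MoreObjects.firstNonNull throws a NullPointerException when both arguments are null, so a file with neither key fails at this line. A minimal JDK-only alternative is sketched below; the class `MetadataFallback`, the use of a plain Map in place of Tika's Metadata, and the sample values are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.Optional;

final class MetadataFallback {
    /** Returns the value of the first key that is present and non-null, if any. */
    static Optional<String> firstPresent(Map<String, String> metadata, String... keys) {
        for (String key : keys) {
            String value = metadata.get(key);
            if (value != null) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical metadata: only the fallback keys are present.
        Map<String, String> metadata = Map.of("Image Width", "1920", "Image Height", "1080");
        String width = firstPresent(metadata, "tiff:ImageWidth", "Image Width").orElse("unknown");
        String height = firstPresent(metadata, "tiff:ImageLength", "Image Height").orElse("unknown");
        System.out.println(width + " x " + height);
    }
}
```

Returning an Optional leaves it to the caller to decide whether a missing dimension is an error or simply unknown.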

https://imagemagick.org/discourse-server/viewtopic.php?t=27037

The formal designations of the fields are:

PixelXDimension (often reported as Exif Image Width)
PixelYDimension (often reported as Exif Image Height)

http://superuser.com/questions/994666/what-fields-in-exif-files-provide-image-height-width-information

What fields in exif files provide image Height/Width information?

Here are the relevant Exif tags as defined in the Exif 2.3 standard:

+------------+------------+--------+-------------------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Tag (hex)  | Tag (dec)  |  IFD   |          Key            | Type  |                                                                Tag description                                                                 |
+------------+------------+--------+-------------------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------+
| 0x0100     |       256  | Image  | Exif.Image.ImageWidth   | Long  | The number of columns of image data, equal to the number of pixels per row. In JPEG compressed data a JPEG marker is used instead of this tag. |
| 0x0101     |       257  | Image  | Exif.Image.ImageLength  | Long  | The number of rows of image data. In JPEG compressed data a JPEG marker is used instead of this tag.                                           |
+------------+------------+--------+-------------------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------+

Source: Standard Exif Tags

http://lucene.472066.n3.nabble.com/Thread-Safety-td646195.html
Is AutoDetectParser thread-safe and/or all the other parsers?
Yes. Once initialized, a parser instance should be immutable and thus fully thread-safe.
Correct, they are just simple instances. Almost all the work is done in the parse() method.
