Tuesday, September 15, 2015

Tika Miscs



https://www.tutorialspoint.com/tika/tika_language_detection.htm
To detect the language of a document, a language profile is constructed and compared with the profile of the known languages. The text set of these known languages is known as a corpus.
A corpus is a collection of texts of a written language that explains how the language is used in real situations.
The corpus is developed from books, transcripts, and other data resources like the Internet. The accuracy of the corpus depends upon the profiling algorithm we use to frame the corpus.
The common way of detecting languages is by using dictionaries. The words used in a given piece of text will be matched with those that are in the dictionaries.
A list of common words used in a language is the simplest effective corpus for detecting that language, for example, the articles a, an, and the in English.
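The common-word idea above can be sketched in a few lines of plain Java. This is only an illustration of dictionary matching, not how Tika builds its language profiles. The class name, the three tiny word lists, and the "unknown" fallback are all assumptions made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class StopWordLanguageGuesser {
    // Tiny illustrative word lists (hypothetical, not Tika's language profiles).
    private static final Map<String, Set<String>> COMMON_WORDS = new LinkedHashMap<>();
    static {
        COMMON_WORDS.put("en", Set.of("a", "an", "the", "and", "of", "is"));
        COMMON_WORDS.put("de", Set.of("der", "die", "das", "und", "ist", "ein"));
        COMMON_WORDS.put("fr", Set.of("le", "la", "les", "et", "est", "un"));
    }

    /** Returns the language code with the most common-word hits, or "unknown" if nothing matched. */
    static String guess(String text) {
        // Lower-case the text and split on anything that is not a letter.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : COMMON_WORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            // Strict comparison: on a tie the language listed first wins.
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guess("The cat is on the roof of a house")); // prints "en"
        System.out.println(guess("Der Hund und die Katze"));            // prints "de"
    }
}
```

Short texts give very few hits, which is one reason real detectors fall back on character n-gram profiles rather than word lists alone.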
Of the 184 languages standardized by ISO 639-1, Tika can detect 18.
https://issues.apache.org/jira/browse/TIKA-1723
The language-detector project at https://github.com/optimaize/language-detector is faster, has more languages (70 vs 13) and better accuracy than the built-in language detector.
This is a stab at integrating it, with some initial findings. There are a number of issues this raises, especially if Chris A. Mattmann moves forward with turning language detection into a pluggable extension point.
https://github.com/apache/tika/blob/master/tika-example/src/main/java/org/apache/tika/example/Language.java
LanguageDetector detector = new OptimaizeLangDetector().loadModels();
LanguageResult result = detector.detect("Alla människor är födda fria och lika i värde och rättigheter.");
http://stackoverflow.com/questions/7025915/what-is-the-best-language-detect-library-or-web-api-available-even-paid
LanguageIdentifier is @Deprecated; use org.apache.tika.language.detect.LanguageDetector instead.
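// Legacy API: needs org.apache.tika.language.LanguageIdentifier.
public String detectLangTika(String text) throws Exception {
    LanguageIdentifier li = new LanguageIdentifier(text);
    // Only trust the result when the profile match is reasonably certain.
    if (li.isReasonablyCertain()) {
        return li.getLanguage();
    }
    // The declared exception type now matches what is actually thrown.
    throw new Exception("Tika lang detection not reasonably certain");
}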

Tika IOException: mark/reset not supported
Therefore, the stream you provide to this method must support the optional mark/reset functionality. Decorate your resource stream with a BufferedInputStream.
//read audio data from whatever source (file/classloader/etc.)
InputStream audioSrc = getClass().getResourceAsStream("mySound.au");
//add buffer for mark/reset support
InputStream bufferedIn = new BufferedInputStream(audioSrc);
AudioInputStream audioStream = AudioSystem.getAudioInputStream(bufferedIn);
//the same fix for Tika: wrap the stream before detection
stream = new BufferedInputStream(stream);
tika.detect(stream);
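Why the wrapper helps can be shown with the JDK alone. The sketch below is an assumption-laden illustration, not Tika code: withoutMarkSupport() is a hypothetical helper that imitates a raw stream lacking mark/reset, and peek() imitates a detector that reads a few leading bytes and then rewinds.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class MarkResetDemo {
    /** Hypothetical helper: hides mark/reset support, like many raw streams do. */
    static InputStream withoutMarkSupport(InputStream in) {
        return new FilterInputStream(in) {
            @Override public boolean markSupported() { return false; }
            @Override public synchronized void mark(int readLimit) { /* ignored */ }
            @Override public synchronized void reset() throws IOException {
                throw new IOException("mark/reset not supported");
            }
        };
    }

    /** Reads up to n leading bytes and rewinds, the way a type detector would. */
    static byte[] peek(InputStream in, int n) throws IOException {
        in.mark(n);
        byte[] head = new byte[n];
        int read = in.readNBytes(head, 0, n);
        in.reset(); // throws if the stream cannot rewind
        return Arrays.copyOf(head, read);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {'%', 'P', 'D', 'F', '-', '1'};

        InputStream raw = withoutMarkSupport(new ByteArrayInputStream(data));
        try {
            peek(raw, 4);
        } catch (IOException e) {
            System.out.println("raw stream: " + e.getMessage());
        }

        // BufferedInputStream keeps the peeked bytes in memory, so reset() works.
        InputStream buffered = new BufferedInputStream(
                withoutMarkSupport(new ByteArrayInputStream(data)));
        System.out.println("peeked bytes: " + peek(buffered, 4).length);
        System.out.println("next byte after reset: " + (char) buffered.read());
    }
}
```

After the peek, the buffered stream is back at its first byte, so a parser that runs next still sees the whole file.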


tika.parse(fileInputStream, metadata);

The keys are defined as constants on Metadata:
metadata.get(Metadata.IMAGE_WIDTH)

How to check that file content is really an image
You'll see that MediaType has a getType() and a getSubtype() method. What you are looking for is the type (i.e. the "image" in "image/*"). The sub-type for a JPEG file would be "jpeg".
So your test should be:
if (mediaType.getType().equals("image")) {
   // Deal with image
}
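For code that only has the MIME string and no MediaType object, the same type/sub-type split can be sketched with plain string handling. This is a stand-in written for illustration, not Tika's MediaType; the class and method names are hypothetical.

```java
import java.util.Locale;

class MimeTypeCheck {
    /** Strips parameters such as "; charset=utf-8" and normalizes case. */
    private static String baseType(String mime) {
        return mime.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
    }

    /** Returns the primary type, e.g. "image" for "image/jpeg". */
    static String primaryType(String mime) {
        String base = baseType(mime);
        int slash = base.indexOf('/');
        return slash < 0 ? base : base.substring(0, slash);
    }

    /** Returns the sub-type, e.g. "jpeg" for "image/jpeg", or "" if there is none. */
    static String subType(String mime) {
        String base = baseType(mime);
        int slash = base.indexOf('/');
        return slash < 0 ? "" : base.substring(slash + 1);
    }

    /** True for any image/* type, regardless of sub-type. */
    static boolean isImage(String mime) {
        return "image".equals(primaryType(mime));
    }

    public static void main(String[] args) {
        System.out.println(isImage("image/jpeg"));      // prints "true"
        System.out.println(isImage("application/pdf")); // prints "false"
    }
}
```

Comparing only the primary type keeps the check valid for PNG, GIF, TIFF and any other image sub-type.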
http://www.tutorialspoint.com/tika/tika_extracting_image_file.htm
http://www.jeroenreijn.com/2010/04/metadata-extraction-with-apache-tika.html
final Metadata metadata = new Metadata();
final ParseContext context = new ParseContext();
new JpegParser().parse(fileInputStream, new DefaultHandler(), metadata, context);

final String imageWidth =
        MoreObjects.firstNonNull(metadata.get("tiff:ImageWidth"), metadata.get("Image Width"));
final String imageHeight =
        MoreObjects.firstNonNull(metadata.get("tiff:ImageLength"), metadata.get("Image Height"));

https://imagemagick.org/discourse-server/viewtopic.php?t=27037

The formal designations of the fields are:
PixelXDimension (often reported as Exif Image Width)
PixelYDimension (often reported as Exif Image Height)
http://superuser.com/questions/994666/what-fields-in-exif-files-provide-image-height-width-information

What fields in exif files provide image Height/Width information?

Here are the relevant Exif tags as defined in the Exif 2.3 standard:
+------------+------------+--------+-------------------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Tag (hex)  | Tag (dec)  |  IFD   |          Key            | Type  |                                                                Tag description                                                                 |
+------------+------------+--------+-------------------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------+
| 0x0100     |       256  | Image  | Exif.Image.ImageWidth   | Long  | The number of columns of image data, equal to the number of pixels per row. In JPEG compressed data a JPEG marker is used instead of this tag. |
| 0x0101     |       257  | Image  | Exif.Image.ImageLength  | Long  | The number of rows of image data. In JPEG compressed data a JPEG marker is used instead of this tag.                                           |
+------------+------------+--------+-------------------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------+
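The table notes that JPEG data carries its dimensions in a JPEG marker rather than in these tags. The sketch below reads them from the first start-of-frame (SOF) segment using only the JDK. It is an illustration under assumptions, not what Tika's JpegParser does: the class name is hypothetical, and any truncated or malformed input is simply reported as null.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

class JpegSize {
    /** SOF markers are 0xC0..0xCF except DHT (0xC4), JPG (0xC8) and DAC (0xCC). */
    static boolean isStartOfFrame(int marker) {
        return marker >= 0xC0 && marker <= 0xCF
                && marker != 0xC4 && marker != 0xC8 && marker != 0xCC;
    }

    /** Returns {width, height} from the first SOF segment, or null if none is found. */
    static int[] read(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        try {
            if (data.readUnsignedShort() != 0xFFD8) {
                return null; // no start-of-image marker, so not a JPEG
            }
            while (true) {
                int b = data.readUnsignedByte();
                if (b != 0xFF) {
                    continue; // scan forward to the next marker prefix
                }
                int marker = data.readUnsignedByte();
                while (marker == 0xFF) {
                    marker = data.readUnsignedByte(); // skip fill bytes
                }
                if (marker == 0xD9 || marker == 0xDA) {
                    return null; // end of image or scan data reached before any SOF
                }
                if (marker == 0x00 || marker == 0x01 || (marker >= 0xD0 && marker <= 0xD7)) {
                    continue; // stuffed byte or standalone marker without a length field
                }
                int length = data.readUnsignedShort(); // includes the two length bytes
                if (length < 2) {
                    return null;
                }
                if (isStartOfFrame(marker)) {
                    data.readUnsignedByte();               // sample precision
                    int height = data.readUnsignedShort(); // number of rows
                    int width = data.readUnsignedShort();  // number of columns
                    return new int[] {width, height};
                }
                data.readFully(new byte[length - 2]);      // skip this segment's payload
            }
        } catch (EOFException e) {
            return null; // truncated input
        }
    }
}
```

A call such as JpegSize.read(new FileInputStream(file)) returns the pixel width and height, or null when the stream is not a JPEG or contains no frame header.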
http://lucene.472066.n3.nabble.com/Thread-Safety-td646195.html
Are AutoDetectParser and all the other parsers thread-safe?
Yes. Once initialized, a parser instance should be immutable and thus fully thread-safe. 
Correct, they are just simple instances. Almost all the work is done in the parse() method. 
