Thursday, July 3, 2014

Linux Interview Questions: AWK



http://man7.org/linux/man-pages/man1/gawk.1.html
http://www.tecmint.com/use-linux-awk-command-to-filter-text-string-in-files/
# awk '/localhost/{print}' /etc/hosts

http://stackoverflow.com/questions/12739515/setting-multiple-field-to-awk-variables-at-once
SAVEIFS=$IFS;
IFS=',';
while read line; do
    set -- $line;
    var1=$1;
    var2=$2;
    var3=$3;
    ...
done < file.txt

IFS=$SAVEIFS;
while IFS=, read var1 var2 var3; do
  ...
done < file.txt
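For concreteness, a minimal sketch of the second form, assuming a hypothetical file.txt containing comma-separated lines such as "alice,30,admin":
while IFS=, read -r name age role; do
    echo "name=$name age=$age role=$role"   # one record per iteration
done < file.txt
Note the -r flag, which stops read from treating backslashes as escape characters.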
http://stackoverflow.com/questions/16317961/how-to-process-each-line-received-as-a-result-of-grep-command
One of the easy ways is not to store the output in a variable, but directly iterate over it with a while/read loop.
Something like:
grep xyz abc.txt | while read -r line ; do
    echo "Processing $line"
    # your code goes here
done

If you need to change variables inside the loop (and have that change be visible outside of it), you can use process substitution as stated in fedorqui's answer:
while read -r line ; do
    echo "Processing $line"
    # your code goes here
done < <(grep xyz abc.txt)
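To see why process substitution matters, here is a hedged sketch (abc.txt and the counter are illustrative): a variable modified inside a piped while loop is lost because the loop runs in a subshell, whereas the process-substitution form keeps the loop in the current shell:
count=0
grep xyz abc.txt | while read -r line; do count=$((count+1)); done
echo "$count"    # still 0: the loop ran in a subshell
count=0
while read -r line; do count=$((count+1)); done < <(grep xyz abc.txt)
echo "$count"    # the actual number of matching lines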
1. Use egrep -o:
echo 'employee_id=1234' | egrep -o '[0-9]+'
1234
2. using grep -oP (PCRE):
echo 'employee_id=1234' | grep -oP 'employee_id=\K([0-9]+)'
1234
3. Using sed:
echo 'employee_id=1234' | sed 's/^.*employee_id=\([0-9][0-9]*\).*$/\1/'
1234

http://unix.stackexchange.com/questions/13466/can-grep-output-only-specified-groupings-that-match
GNU grep has the -P option for perl-style regexes, and the -o option to print only what matches the pattern. These can be combined using look-around assertions (described under Extended Patterns in the perlre manpage) to remove part of the grep pattern from what is determined to have matched for the purposes of -o.
$ grep -oP 'foobar \K\w+' test.txt
bash
happy
$
The \K is the short-form (and more efficient form) of (?<=pattern) which you use as a zero-width look-behind assertion before the text you want to output. (?=pattern) can be used as a zero-width look-ahead assertion after the text you want to output.
For instance, if you wanted to match the word between foo and bar, you could use:
$ grep -oP 'foo \K\w+(?= bar)' test.txt
or (for symmetry)
$ grep -oP '(?<=foo )\w+(?= bar)' test.txt


  • ^ is the beginning of the string anchor
  • $ is the end of the string anchor
  • \s is the character class for whitespace
  • \S is the negation of \s (note the upper and lower case difference)
  • * is "zero-or-more" repetition
  • + is "one-or-more" repetition
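Combining a few of these, a small illustrative one-liner (the input string is made up) that grabs the first non-whitespace run at the start of a line:
echo '   hello   world' | grep -oP '^\s*\S+'
# prints "   hello" (the leading whitespace is part of the match)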

awk '/li/ { print $2 }' mail-list
awk '{ if ($1 ~ /J/) print }' inventory-shipped
exp ~ /regexp/
exp !~ /regexp/


With gawk, you can use the match function to capture parenthesized groups.
gawk 'match($0, pattern, ary) {print ary[1]}' 
example:
echo "abcdef" | gawk 'match($0, /b(.*)e/, a) {print a[1]}' 
In gawk, use the match() function and give it the optional array parameter (a in the example). When you do this, the 0-th element of the array will be the part that matched the regex:
$ echo "blah foo123bar blah" | awk '{match($2,"[a-z]+[0-9]+",a)}END{print a[0]}'
foo123

http://www.thegeekstuff.com/2011/06/awk-nawk-gawk/
http://unix.stackexchange.com/questions/29576/difference-between-gawk-vs-awk
AWK is a programming language. There are several implementations of AWK (mostly in the form of interpreters). AWK has been codified in POSIX. The main implementations in use today are:
  • nawk (“new awk”, an evolution of oawk, the original UNIX implementation), used on *BSD and widely available on Linux;
  • mawk, a fast implementation that mostly sticks to standard features;
  • gawk, the GNU implementation, with many extensions;
  • the BusyBox awk implementation (small, intended for embedded systems, with few features).
If you only care about standard features, call awk, which may be Gawk or nawk or mawk or some other implementation. If you want the features in GNU awk, use gawk or Perl or Python.
http://www.cyberciti.biz/faq/unix-linux-bsd-appleosx-skip-fields-command/
echo 'This is a test' | awk '{print substr($0, index($0,$3))}'
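The command above prints "a test" (everything from field 3 onward, with the original spacing preserved). A hedged alternative that blanks the leading fields instead, at the cost of collapsing the original spacing:
echo 'This is a test' | awk '{$1=$2=""; sub(/^ +/, ""); print}'
# a test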
http://stackoverflow.com/questions/4198138/printing-everything-except-the-first-field-with-awk
I have a file that looks like this:
AE  United Arab Emirates
AG  Antigua & Barbuda
AN  Netherlands Antilles
AS  American Samoa
BA  Bosnia and Herzegovina
BF  Burkina Faso
BN  Brunei Darussalam
And I'd like to invert the order, printing first everything except $1 and then $1:
United Arab Emirates AE
Assigning $1 works but it will leave a leading space: awk '{first = $1; $1 = ""; print $0, first; }'
You can also find the number of columns in NF and use that in a loop.
$1="" leaves a space as Ben Jackson mentioned, so use a for loop
awk '{for (i=2; i<=NF; i++) print $i}' filename
so if your string was "one two three" the output will be:
two
three
if you want the result in one row, you could do as follows:
awk '{for (i=2; i<NF; i++) printf $i " "; print $NF}' filename
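For the original question (everything except $1, then $1, with no leading space), one minimal sketch is to strip the first field and its trailing spaces from $0 before printing:
awk '{first = $1; sub(/^[^ ]+ +/, ""); print $0, first}' filename
# United Arab Emirates AE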
http://www.cnblogs.com/ggjucheng/archive/2013/01/13/2858470.html
AWK really is a language of its own: the AWK programming language, which its three creators formally defined as a "pattern scanning and processing language".
awk reads a file line by line, slices each line into fields (whitespace is the default separator), and then analyzes and processes the pieces.

awk 'pattern { action }' input-file(s)
awk [-F  field-separator]  'commands'  input-file(s)
awk -f awk-script-file input-file(s)
You can also put all the awk commands into a file, make that file executable, and put the awk interpreter on the script's first line (a shebang), so the script can be run simply by typing its name.
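A minimal sketch of such a script (the file name print_users.awk is hypothetical, and the shebang assumes awk lives at /usr/bin/awk):
#!/usr/bin/awk -f
# print_users.awk - print user names and login shells from a passwd-style file
BEGIN { FS = ":"; print "name,shell" }
{ print $1 "," $7 }
After chmod +x print_users.awk, run it as ./print_users.awk /etc/passwd.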

cat /etc/passwd |awk  -F ':'  'BEGIN {print "name,shell"}  {print $1","$7} END {print "blue,/bin/nosh"}'
-F sets the field separator to ':'.
#cat /etc/passwd |awk  -F ':'  '{print $1"\t"$7}'

awk's workflow: it first runs the BEGIN block, then reads a record (records are delimited by the newline \n by default), splits the record into fields using the specified field separator ($0 is the whole record, $1 the first field, $n the nth field), and runs the action of every pattern the record matches. It then reads the next record, and so on, until all records have been read; finally it runs the END block.
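A tiny demonstration of that flow, on made-up two-line input:
printf 'a b\nc d\n' | awk 'BEGIN { print "start" } { print NR, $1 } END { print "end" }'
# start
# 1 a
# 2 c
# end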

Search /etc/passwd for all the lines containing the keyword root:
awk -F: '/root/' /etc/passwd
This is an example of using a pattern: the action runs only for lines matching the pattern (here, root); since no action is given, the default action prints each matching line.
Patterns support regular expressions, e.g. to find the lines starting with root: awk -F: '/^root/' /etc/passwd
awk -F: '/root/{print $7}' /etc/passwd
awk built-in variables (an FNR/NR usage example follows this list):
ARGC               number of command-line arguments
ARGV               array of the command-line arguments
ENVIRON            array of the environment variables
FILENAME           name of the current input file
FNR                record number within the current input file
FS                 input field separator, equivalent to the -F command-line option
NF                 number of fields in the current record
NR                 number of records read so far
OFS                output field separator
ORS                output record separator
RS                 input record separator
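To illustrate FNR versus NR, a sketch assuming two hypothetical two-line files a.txt and b.txt:
awk '{ print FILENAME, FNR, NR }' a.txt b.txt
# a.txt 1 1
# a.txt 2 2
# b.txt 1 3
# b.txt 2 4
FNR resets for each input file while NR keeps counting across files.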

$0 refers to the whole record; $1 is the first field of the current line, $2 the second field, and so on.

awk  -F ':'  '{print "filename:" FILENAME ",linenumber:" NR ",columns:" NF ",linecontent:"$0}' /etc/passwd
Using printf instead of print can make the code cleaner and easier to read:
awk  -F ':'  '{printf("filename:%10s,linenumber:%s,columns:%s,linecontent:%s\n",FILENAME,NR,NF,$0)}' /etc/passwd

awk also supports user-defined variables.
awk '{count++;print $0;} END{print "user count is ", count}' /etc/passwd
awk 'BEGIN {count=0;print "[start]user count is ", count} {count=count+1;print $0;} END{print "[end]user count is ", count}' /etc/passwd
An action block {} may contain multiple statements, separated by semicolons.

Count the bytes used by the files in a directory:
ls -l |awk 'BEGIN {size=0;} {size=size+$5;} END{print "[end]size is ", size}'

Conditional statements
Count the bytes used by the files in a directory, skipping entries of size 4096 (usually directories):
ls -l |awk 'BEGIN {size=0;print "[start]size is ", size} {if($5!=4096){size=size+$5;}} END{print "[end]size is ", size/1024/1024,"M"}'

Arrays
In awk, array subscripts can be numbers or strings, so subscripts are usually called keys. Keys and values are stored internally in a hash table; because hash storage is unordered, the contents of an array will not necessarily print in the order you expect. Like variables, arrays are created automatically on first use, and awk automatically determines whether a stored value is a number or a string.
awk -F ':' 'BEGIN {count=0;} {name[count] = $1;count++;}; END{for (i = 0; i < NR; i++) print i, name[i]}' /etc/passwd
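Since the example above relies on numeric keys, it is worth showing for (key in array) iteration, whose order is unspecified, as the paragraph above warns. A minimal sketch that counts users per login shell:
awk -F ':' '{ shell[$7]++ } END { for (s in shell) print s, shell[s] }' /etc/passwd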
How to write awk commands and scripts
The basic format:
awk 'BEGIN { action; }
/search/ { action; }
END { action; }' input_file

awk -F, '{ print $3 }' table1.txt > output1.txt
awk '$7=="$7.30" { print $3 }' table1.txt
awk '/30/ { print $3 }' table1.txt

awk '{ sum=0; for (col=1; col<=NF; col++) sum += $col; print sum; }'

How To Use the AWK language to Manipulate Text in Linux | DigitalOcean
awk '/search_pattern/ { action_to_take_on_matches; another_action; }' file_to_parse
awk '/^UUID/ {print $1;}' /etc/fstab

Awk Internal Variables and Expanded Format
FILENAME: References the current input file.
FNR: References the number of the current record relative to the current input file. For instance, if you have two input files, this would tell you the record number of each file instead of as a total.
FS: The current field separator used to denote each field in a record. By default, this is set to whitespace.
NF: The number of fields in the current record.
NR: The number of the current record.
OFS: The field separator for the outputted data. By default, this is a single space.
ORS: The record separator for the outputted data. By default, this is a newline character.
RS: The record separator used to distinguish separate records in the input file. By default, this is a newline character.

We can change some of the internal variables in the BEGIN section:
sudo awk 'BEGIN { FS=":"; }
{ print $1; }' /etc/passwd

awk '$2 ~ /^sa/' favorite_food.txt
awk '$2 !~ /^sa/' favorite_food.txt
awk '$2 !~ /^sa/ && $1 < 5' favorite_food.txt
http://calvin1978.blogcn.com/articles/awk_accesslog.html

You can quickly write the whole program on one line inside single quotes ' '.
You can also use -f to point at a script file, which may contain arbitrary line breaks, improving readability and reusability.
All action statements are wrapped in {}; outside the {} go the higher-level parts, such as filter conditions.

Field references
$0 is the entire line; $1 is the first field (for once, counting does not start at 0).
NF is a built-in variable holding the number of fields, so $NF is the last field, and $(NF-1) refers to the second-to-last.
Arithmetic between fields is also supported; for example, $NF-$(NF-1) subtracts the second-to-last field from the last.
A bare print is shorthand for print $0, printing the whole line.
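A quick illustration of these references on a made-up line:
echo 'one two three' | awk '{ print NF, $NF, $(NF-1) }'
# 3 three two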

Input field separator
Whitespace is the default separator; you can specify another one. The example below uses ':':
awk -F ':' '{print $1,$2}' access.log
A regular expression can also define several separators at once. The example below uses '-' and ':':
awk -F '[-:]' '{print $1,$2}' access.log
Output field separators
In print $1,$2 the ',' comma means the default output separator (a space) is printed between fields 1 and 2; you can also use "..." strings to insert any other characters:
 awk '{print $1 "\t" $2 " - " $3$4xxxxx$5}' access.log
In the example above, a tab separates fields 1 and 2, and " - " separates fields 2 and 3.
Writing nothing between field references means no separator at all (as between fields 3 and 4); and an unquoted bareword such as xxxxx is just an uninitialized variable, which expands to the empty string, so it too prints nothing (as between fields 4 and 5).

Numeric and string types
The values of the last two fields in the example above are strings carrying an 'ms' suffix, so they look unusable for arithmetic.
But when the two fields are subtracted, AWK magically coerces them to pure numbers. Likewise, when accumulating with sum=sum+$NF, the conversion to a number is automatic.
The filter can therefore be written compactly, and it even runs slightly faster than using sed:
 awk '$NF*1>100 {print}' access.log
or awk 'int($NF)>100 {print}' access.log

Statements after BEGIN and END run before and after all the input has been processed, respectively.

1. Computing a running total and an average
 awk '{sum+=$NF} END {print sum, sum/NR}'
The example accumulates the value of the last field of every input line; the statement after END prints the total and the average. NR is the built-in variable holding the total number of records.

2. Printing a header
A BEGIN block can also print a table header, define variables, and so on.
 awk 'BEGIN{print "Date\t\tTime\t\tCost"} {print $1 "\t"$2 "\t" $NF}' access.log
Filtering lines
1. Simple pattern matching
Filtering with grep first works fine, but awk can also define a regular expression between / /, outside the action block:

 awk '/192.168.0.4[1-5]/ {print $1}' access.log
which is equivalent to

 grep "192.168.0.4[1-5]" access.log | awk '{print $1}'
2. Matching against a specific field
Match the address range against field 4; ~ means 'matches' and !~ means 'does not match'.

 awk '$4 ~ /192.168.0.4[1-5]/ {print}'
3. Numeric filters
The operators ==, !=, <, >, <=, >= are all supported:

 awk '$(NF-1)*1==100 {print}'
 awk '$NF-$(NF-1)>100 {print}'
As shown earlier, an arithmetic operation coerces a non-numeric field back into a number.

4. Combining conditions
 awk '($12 >150 || $(13)>250) {print}'
5. Using if statements
When the logic gets more complex, consider if/else statements:

 awk '{ if ($(NF-1)*1>100) print}'


Miscellaneous
1. Passing in external parameters
For example, pass in a timeout threshold from the outside; note where threshold sits on the command line.

 awk '{if($(NF)*1>threshold) print}' threshold=20 access.log
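An alternative worth knowing (standard awk, not specific to this article): the -v option assigns the variable before the program starts, which also makes it visible inside a BEGIN block, unlike the trailing-assignment form above:
 awk -v threshold=20 '$NF*1 > threshold { print }' access.log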
2. Common functions
The most useful are gsub, sub, match, and index. gsub replaces every occurrence of a string with a target string, either across the whole line or within a single chosen field.
 awk '{gsub("ms]","",$NF); if( $NF>100 ) print}' access.log

Some examples
1. Extracting the data within a time range
There are many ways to do this, all improvised around the particular log format.
For example, the following extracts the data from 17:30:30 to 17:31:00 by pulling out the hour, minute, and second fields and concatenating them into one number for comparison:

 awk -F "[ :.]" '$2$3$4>=173030 && $2$3$4<173100 {print}'
You can also match an entire hour. The example below takes the 11 o'clock logs (note that [ is a regex metacharacter and must be escaped):

 awk '/\[2015-08-20 11:/ {print $1}' access.log
To take the data from 11:01 to 11:05:

 awk '/\[2015-08-20 11:0[1-5]:/ {print $1}' access.log
2. Finding when the timeouts cluster
The first stage finds the records that timed out; the second strips the sub-second part of the timestamp; sort and uniq -c then group by second and count how many timeouts occurred in each second.
 awk '$(NF)*1>100 {print}' access.log | awk -F"." '{print $1}' | sort | uniq -c





http://calvin1978.blogcn.com/articles/awk.html
The requirement: after a performance test, filter out of the access log all slow calls that took more than 100 milliseconds. The log looks like the line below, with the total call time in the last field:

 [2015-08-20 00:00:55.600] - [192.168.0.73/192.168.0.75:1080 32379 com.vip.xxx.MyService_2.0 0 106046 100346 90ms 110ms]
 sed "s|ms]||g" access.*.log | awk ' $NF>100 {print}'

 awk ' $NF*1>100 {print}' access.log

 awk ' int($NF)>100 {print}' access.log

For example, it is easy to star in the classic Linux "useless use of cat" joke: naively opening the file with cat cost an extra 10 seconds.

 cat access.*.log | sed "s|ms]||g" | awk ' $NF>100 {print}'
Another attempt used awk alone, doing a text substitution on the last field to strip "ms]" and saving the extra pipe through sed, but it actually came out more than thirty seconds slower.
 awk '{gsub("ms]","",$NF); if( $NF>100 ) print}' access.*.log

Finally there was a hopelessly slow pure-AWK failure: using -F with a regular expression to split on both spaces and "ms", so that the last two fields after splitting are the total call time and ']'. It looked like it would take an hour to run.
 awk -F' |ms' '$NF>100 {print}' access.*.log

https://linux.cn/article-1699-1.html
  Suppose you have a 4.2 GB CSV file with more than 12 million records, each with 50 columns, and all you want to do is add up the values of one of the columns.
  cat data.csv | awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }'
  If you consider yourself a Linux command-line guru, then congratulations: you have won today's "Useless Use of Cat" award. You should have written the command like this:
  awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }' < data.csv
  If you consider yourself a Linux command-line guru, then congratulations: you have won today's "Useless Use of Redirection" award. You should have written the command like this:
  awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }' data.csv
Redirection vs. pipes for large files
cat huge_dump.sql | ./b.out
The systemtap log shows that bash forks two processes and execve's cat and b.out respectively; the two processes communicate through a pipe: cat reads the data out of huge_dump.sql and writes it to the pipe, and b.out reads it from the pipe and processes it.
Now look at case two, the redirection:
$ ./b.out < huge_dump.sql
bash forks a single process, opens the data file, moves the file descriptor onto fd 0 (stdin), execve's b.out, and b.out reads the data directly.
Now it is clear why there is a 3x speed difference between the two scenarios:
Case 1: two reads, one write, plus extra process context switches.
Case 2: a single read.

http://www.gregreda.com/2013/07/15/unix-commands-for-data-science/
References:
How To Use the AWK language to Manipulate Text in Linux | DigitalOcean
How to write awk commands and scripts
