Attention to encodings
Published: 2019-06-14

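I ran into an interesting bug at work. A unit test on Jenkins that checks the decoding of a UTF-8 file suddenly started failing, even though it had always passed before and we had not changed any code. The relevant code is as follows: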

public Context process(Context context) throws Exception {
    Config config = context.getConfig();
    Charset charset = Charset.forName(config.getEncoding());
    StreamFactory factory = context.getFactory();
    InputStreamReader reader = new InputStreamReader(FRFUtils.resolveFile(config.getReadFile()), charset);
    BeanReaderIterable parsingReader = new BeanReaderIterable(
            factory.createReader(config.getStreamName(), Utils.removeBlankLines(reader)));
    context.setBMSResults(fileParser.parse(parsingReader, config, factory, context.getTimestamp()));
    return context;
}

public static InputStreamReader removeBlankLines(InputStreamReader in) {
    BufferedReader reader = new BufferedReader(in);
    StringBuilder sb = new StringBuilder();
    reader.lines()
        .filter(StringUtils::isNotBlank)
        .forEach(line -> {
            sb.append(line);
            sb.append('\n');
        });
    try {
        reader.close();
    } catch (IOException e) {
        LOG.warn("removeBlankLines: unable to close the reader:{}", e.getMessage());
    }
    // Bug: getBytes() and this InputStreamReader both fall back to the platform default charset
    return new InputStreamReader(new ByteArrayInputStream(sb.toString().getBytes()));
}

@Test
public void shouldEncodeWell() throws Exception {
    String input = "data/utf8_encoded_file.txt";
    String conf = "file-utf8.xml";
    String[] args = Util.constructArgs(input, null, conf, false, false);
    //when
    List results = service.parse(args);
    assertThat(results.get()).isEqualTo("\u00c1\u00c9\u00d3-123");
}

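After some digging, it turned out that removeBlankLines(InputStreamReader in) builds a new InputStreamReader internally and drops the original Charset, so both the re-encoding and the decoding always used the server's default charset. The test used to pass because the default encoding on the Jenkins Linux server was UTF-8, but the ops team in the US quietly updated the server and the default encoding became US-ASCII. Passing the Charset from the config into removeBlankLines, and using it in both sb.toString().getBytes(...) and new InputStreamReader(...), fixed the bug. Note that converting between a byte stream and a String is exactly the process of decoding and encoding, so the charset has to be specified explicitly.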

new String(bytes, "UTF-8");     // decoding
"AString".getBytes("UTF-8");    // encoding
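To see the failure mode in isolation, here is a minimal sketch that mimics the encode-then-decode round trip inside removeBlankLines. The class name CharsetRoundTrip and the method roundTrip are made up for illustration; the explicit charset parameter stands in for the platform default, so the result does not depend on the machine running it.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original project.
final class CharsetRoundTrip {

    // Encodes the text to bytes, then decodes it again through an
    // InputStreamReader, using the same charset for both steps.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset); // encoding: String -> bytes
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
            return reader.readLine();          // decoding: bytes -> String
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String expected = "\u00c1\u00c9\u00d3-123";

        // UTF-8 can represent every character, so the text survives: prints "true".
        System.out.println(expected.equals(roundTrip(expected, StandardCharsets.UTF_8)));

        // US-ASCII cannot encode the accented letters, so getBytes substitutes
        // '?' for each of them: prints "???-123".
        System.out.println(roundTrip(expected, StandardCharsets.US_ASCII));
    }
}
```

With US-ASCII the damage happens at the encoding step already, which is why the assertion in the unit test could no longer match once the server default changed.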

The corrected code is as follows:

public Context process(Context context) throws Exception {
    Config config = context.getConfig();
    Charset charset = Charset.forName(config.getEncoding());
    StreamFactory factory = context.getFactory();
    InputStreamReader reader = new InputStreamReader(FRFUtils.resolveFile(config.getReadFile()), charset);
    BeanReaderIterable parsingReader = new BeanReaderIterable(
            factory.createReader(config.getStreamName(), Utils.removeBlankLines(reader, charset)));
    context.setBMSResults(fileParser.parse(parsingReader, config, factory, context.getTimestamp()));
    return context;
}

public static InputStreamReader removeBlankLines(InputStreamReader in, Charset charset) {
    BufferedReader reader = new BufferedReader(in);
    StringBuilder sb = new StringBuilder();
    reader.lines()
        .filter(StringUtils::isNotBlank)
        .forEach(line -> {
            sb.append(line);
            sb.append('\n');
        });
    try {
        reader.close();
    } catch (IOException e) {
        LOG.warn("removeBlankLines: unable to close the reader:{}", e.getMessage());
    }
    return new InputStreamReader(new ByteArrayInputStream(sb.toString().getBytes(charset)), charset);
}

@Test
public void shouldEncodeWell() throws Exception {
    String input = "data/utf8_encoded_file.txt";
    String conf = "file-utf8.xml";
    String[] args = Util.constructArgs(input, null, conf, false, false);
    //when
    List results = service.parse(args);
    assertThat(results.get()).isEqualTo("\u00c1\u00c9\u00d3-123");
}
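The byte round trip can also be avoided altogether. The sketch below is one possible alternative, not the fix used above. It assumes the caller only needs a java.io.Reader (whether factory.createReader accepts one is not shown in this post), and it substitutes a plain trim().isEmpty() check for StringUtils::isNotBlank to stay dependency-free, which treats some Unicode whitespace differently. BlankLineFilter and nonBlankLines are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original project.
final class BlankLineFilter {

    // Reads all lines from 'in', drops the blank ones, and joins the rest
    // with a trailing '\n' each. Closes 'in' when done.
    static String nonBlankLines(Reader in) {
        try (BufferedReader reader = new BufferedReader(in)) {
            return reader.lines()
                    .filter(line -> !line.trim().isEmpty())
                    .map(line -> line + "\n")
                    .collect(Collectors.joining());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The characters are already decoded, so the result is wrapped in a
    // StringReader: no bytes, no charset, nothing to lose.
    static Reader removeBlankLines(Reader in) {
        return new StringReader(nonBlankLines(in));
    }

    public static void main(String[] args) {
        Reader in = new StringReader("\u00c1\u00c9\u00d3-123\n\n   \nsecond\n");
        String filtered = nonBlankLines(in);
        // Prints "true": the accented letters are untouched and blank lines are gone.
        System.out.println("\u00c1\u00c9\u00d3-123\nsecond\n".equals(filtered));
    }
}
```

The trade-off is a changed return type: callers that really need an InputStreamReader would still have to use the charset-aware version shown above.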

InputStream vs InputStreamReader

InputStream is the abstract superclass of all byte input streams. It is not useful by itself, but its subclasses (such as FileInputStream) are well suited to handling binary data.

In contrast, InputStreamReader (and its parent class Reader) is meant specifically for characters, and therefore strings, so it takes care of charset encodings (UTF-8, ISO-8859-1, and so on), provided you tell it which charset to use.

The simple answer is: if you need binary data, use an InputStream (possibly a specific one such as DataInputStream); if you need to work with text, use an InputStreamReader.
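As a rough illustration of the difference, the following sketch reads the same data once as raw bytes and once as decoded characters. BytesVsChars and its two methods are names invented for this example; a ByteArrayInputStream stands in for a file so the snippet is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original project.
final class BytesVsChars {

    // Counts raw bytes by reading the InputStream directly.
    static int countBytes(byte[] data) {
        try (InputStream in = new ByteArrayInputStream(data)) {
            int n = 0;
            while (in.read() != -1) {
                n++;
            }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts characters by decoding the same bytes through an
    // InputStreamReader with an explicit charset.
    static int countChars(byte[] data, Charset charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            int n = 0;
            while (reader.read() != -1) {
                n++;
            }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "\u00c1\u00c9\u00d3-123".getBytes(StandardCharsets.UTF_8);
        System.out.println(countBytes(data));                          // prints 10
        System.out.println(countChars(data, StandardCharsets.UTF_8)); // prints 7
    }
}
```

Each of the three accented letters takes two bytes in UTF-8, which is why the byte count (10) and the character count (7) differ.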

Stream vs Reader/Writer

Streams work at the byte level: they read (InputStream) and write (OutputStream) single bytes or arrays of bytes.

Readers and Writers add the concept of a character on top of a stream. Since characters can only be converted to and from bytes by using an encoding, readers and writers that wrap a stream have an encoding component (which may be chosen implicitly, since Java has a default charset). A Reader decodes the bytes it reads from the stream into characters, and a Writer encodes the characters it is given into bytes before sending them to the stream.
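The writing direction can be sketched the same way. In the example below, WriterEncodingDemo and encode are illustrative names; the method pushes text through an OutputStreamWriter and returns whatever bytes arrived in the underlying stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original project.
final class WriterEncodingDemo {

    // Writes the text through an OutputStreamWriter and returns the bytes
    // that reached the underlying stream.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer has been closed (and therefore flushed) at this point.
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u00c1"; // one character
        System.out.println(encode(text, StandardCharsets.UTF_8).length);      // prints 2
        System.out.println(encode(text, StandardCharsets.ISO_8859_1).length); // prints 1
    }
}
```

The same single character becomes two bytes or one byte depending on the charset handed to the writer, so leaving that choice to the platform default makes the output machine-dependent.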

Reposted from: https://www.cnblogs.com/codingforum/p/8552128.html
