sungang_1120
total number of created files now is 100385, which exceeds 100000. Killing the j


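Today I was inserting data from a temporary table into a production table, partitioned by day, when Hive reported that the job had created more than 100,000 files. My SQL was: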

 

hive> insert overwrite table test partition(dt)
    > select * from table_tmp;

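table_tmp holds a little over 570 GB of data, which falls into 76 partitions. When the SQL ran, it launched 2163 Mappers and 0 Reducers. About halfway through, the job threw the following error: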

 

 

[Fatal Error] total number of created files now is 100385, which exceeds 100000. Killing the job.

 

 

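This ultimately caused the SQL to fail. The cause is that Hive limits the total number of files a job may create (hive.exec.max.created.files, 100000 by default). In this job every Map creates 76 files, one per partition, so the SQL would create 2163 * 76 = 164388 files in total and is bound to hit the error above. The simplest way to get the SQL to run is to raise hive.exec.max.created.files. The trouble is that this leaves a large number of small files in Hadoop: table_tmp holds only a little over 570 GB, so the average file size is about 570 GB / 164388 ≈ 3.55 MB. More than a hundred thousand files that small are bad for Hadoop, because every file adds metadata that the NameNode has to keep in memory. A better approach is needed.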
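For reference, the simple workaround described above would look roughly like the following. This is only a sketch: the value 200000 is an assumption, chosen to sit above the 164388 files this job would create, and it does nothing about the small-file problem.

```sql
-- Assumed value: any limit above the 164388 files this job would create.
set hive.exec.max.created.files=200000;

insert overwrite table test partition(dt)
select * from table_tmp;
```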

A better approach is to send all rows with the same dt to the same Reduce, so that at most 76 files are produced. DISTRIBUTE BY dt does exactly that, so the modified SQL is:

 

hive> insert overwrite table test partition(dt)
    > select * from table_tmp
    > DISTRIBUTE BY dt;

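The modified SQL ran without the error above, but it has its own problem: the data is distributed very unevenly across the 76 partitions. Some Reduces received more than 30 GB while others received only a few KB, which made the SQL very slow.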
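One way to see this kind of skew before choosing a distribution key is to count rows per dt value. The query below is a diagnostic sketch that is not part of the original run; it assumes dt is an ordinary column of table_tmp, and it scans the whole table, so it is itself a full job.

```sql
-- Rows per dt value; a wide spread means DISTRIBUTE BY dt gives uneven Reduces.
select dt, count(*) as row_cnt
from table_tmp
group by dt
order by row_cnt desc;
```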

 

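Can the 570 GB be spread evenly across the Reduces? Yes. DISTRIBUTE BY rand() assigns rows to Reduces at random, so each Reduce handles roughly the same amount of data. I set each Reduce to handle about 5 GB, which for 570 GB launches roughly 110 Reduces. The modified SQL is: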

hive> set hive.exec.reducers.bytes.per.reducer=5120000000;
hive> insert overwrite table test partition(dt)
    > select * from table_tmp
    > DISTRIBUTE BY rand();

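This SQL ran in a very reasonable time, and the number of files produced is at most (number of Reduces) * (number of partitions), which is under 10,000.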
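The file counts for the three variants can be compared with a quick calculation. This is a minimal sketch, assuming a Hive version that accepts SELECT without a FROM clause (0.13 or later) and taking the approximate figure of 110 Reduces from above; the column aliases are made up for illustration.

```sql
-- Upper bounds on the number of output files for each variant of the INSERT:
--   map-only:            every Mapper writes one file per partition
--   DISTRIBUTE BY dt:     each dt goes to one Reduce, so one file per partition
--   DISTRIBUTE BY rand(): every Reduce may write to every partition
select
  2163 * 76 as files_map_only,
  76        as files_distribute_by_dt,
  110 * 76  as files_distribute_by_rand;
-- 164388    76    8360
```

The random distribution trades a higher file count (8360 rather than 76) for evenly loaded Reduces, while staying well below the 100000 limit.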
