线程也不是越多越好，多少是好？-技术圈

前面我们评估了不同大小基因组构建索引所需的计算资源和时间资源和不同大小数据集比对所需的计算资源和时间资源。

下面我们进一步看下不同线程数的影响。

测试电脑配置

这是一个10核、20线程的处理器，主频2.8 G HZ，可超频到5.2 GhZ。
在Windows系统上安装了Ubuntu子系统进行的测试。

测试指定不同线程数对速度提升的影响

因为测试电脑是最多20线程，这里指定1-25共25个测试线程，程序如下：

# 10核心 20 线程

for thread in `seq 1 25`; do
 i=SRR1039517
    mkdir -p ${i}
    /usr/bin/time -v -o star.${i}.thread${thread}.log STAR --runMode alignReads \
          --runThreadN ${thread} \
        --readFilesIn ${i}_1.fastq.gz ${i}_2.fastq.gz \
        --readFilesCommand zcat --genomeDir star_GRCh38 \
        --outFileNamePrefix ${i}/${i}. --outFilterType BySJout --outSAMattributes NH HI AS NM MD \
       --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 \
       --alignIntronMin 20 --alignIntronMax 1000000 \
       --alignMatesGapMax 1000000 \
       --outFilterMatchNminOverLread 0.66 --outFilterScoreMinOverLread 0.66 \
       --winAnchorMultimapNmax 70 --seedSearchStartLmax 45 \
       --outSAMattrIHstart 0 --outSAMstrandField intronMotif \
       --genomeLoad LoadAndKeep \
       --outTmpDir /tmp/${i}/ \
       --outSAMtype BAM Unsorted --quantMode GeneCounts
  du -s ${i} | awk 'BEGIN{OFS="\t"}{print "Output_size: "$1/10^6}' >>star.${i}.thread${thread}.log
done

运行完成后，整理所需的计算资源和时间资源数据。

/bin/rm -f GRCh38_39517_star_reads_map_thread.summary
i=SRR1039517
for thread in `seq 1 25`; do
 echo ${thread} | \
 awk 'BEGIN{OFS="\t"}{print "nThreads"; print $1}' | \
  awk -v outputHeader=${thread} -f ./timeIntegrate2.awk - star.${i}.thread${thread}.log \
  >>GRCh38_39517_star_reads_map_thread.summary
done

汇总后的数据如下：

Time_cost    Memory_cost    nCPU    Output_size    nThreads
25.962    28.9048    0.98    5.58423    1
13.98    29.311    1.97    5.58424    2
9.95217    29.5176    2.93    5.58425    3
7.77033    29.7221    3.85    5.58426    4
6.356    29.9266    4.78    5.58428    5
5.1585    30.1311    5.61    5.58422    6
4.69233    30.3356    6.37    5.58426    7
4.51    30.5401    6.69    5.58429    8
4.39683    30.7445    6.94    5.58423    9
4.38017    30.949    6.99    5.58426    10
4.41233    31.1535    6.99    5.58424    11
4.45333    31.358    6.94    5.58424    12
4.41033    31.5624    6.95    5.58429    13
4.44267    31.7669    6.88    5.58428    14
4.4595    31.9714    6.87    5.58426    15
4.50567    32.0859    6.85    5.58424    16
4.458    32.2639    6.92    5.58429    17
4.46417    32.4802    6.86    5.58428    18
4.497    32.6487    6.91    5.58425    19
4.4425    32.8489    6.95    5.58426    20
4.46817    32.9927    6.92    5.5843    21
4.4555    33.1738    6.97    5.58426    22
4.45483    33.3675    6.94    5.58426    23
4.46133    33.5499    6.99    5.58428    24
4.42733    33.7143    6.99    5.58426    25

STAR比对的时间随指定的线程数的变化

在给定的线程数少于10个时，随着线程数增加时间逐渐减少，尤其是在线程数从1-6的过程中，下降幅度更明显。
线程也不是越多越好，给定多于10个进程对速度提升基本没有贡献。
（因为测试电脑只有10个核心，不知道这里的节点10是否是受此影响；
还需要后续在服务器更多测试来判断；
如果是这样，对我们的指导是设定的线程数不应该超过CPU的核心数。
）

library(ImageGP) 、sp_scatterplot(“GRCh38_39517_star_reads_map_thread.summary”, melted = T, xvariable = “nThreads”,               yvariable = “Time_cost”, smooth_method = “auto”,               x_label =”Number of specified threads”, y_label = “Running time (minutes)”) +  scale_x_continuous(breaks=seq(1,25, by=1)) +  scale_y_continuous(breaks=seq(1,25, by=1))

STAR比对所需内存随指定的线程数的变化

线程数越多，内存需求越大；
但整体相差不大。

# 这时绘图要注意，是否加limits=c(0,34)图给人的第一印象不同。
sp_scatterplot("GRCh38_39517_star_reads_map_thread.summary", melted = T, xvariable = "nThreads",
               yvariable = "Memory_cost", smooth_method = "auto",
               x_label ="Number of specified threads", y_label = "Maximum physical memory required (Gb)") +
  scale_x_continuous(breaks=seq(1,25, by=1)) +
  scale_y_continuous(breaks=seq(1,34, by=1),limits=c(0,34))

不加limits=c(0,34)的效果。是不是感觉内存变化很大？？？

STAR比对过程中CPU利用率随指定的线程数的变化

多线程的效率一般很难达到100%。如下图，在指定线程数小于10时，给定的线程越多，利用起来的线程也越多，但整体利用率是越来越低的。

不同线程是不影响程序输出的

这个统计没什么意义。

sp_scatterplot("GRCh38_39517_star_reads_map_thread.summary", melted = T, xvariable = "nThreads",
               yvariable = "Output_size", smooth_method = "auto",
               x_label ="Number of specified threads", y_label = "Disk space usages (Gb)") +
  scale_x_continuous(breaks=seq(1,25, by=1)) +
  scale_y_continuous(breaks=seq(0,6, by=1),limits=c(0,6))

往期精品(点击图片直达文字对应教程)

机器学习

后台回复“生信宝典福利第一波”或点击阅读原文获取教程合集