SNA_Reg2 - Hamaoka@fbc.keio.ac.jp's home page

Points to Note in Regression Analysis

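load(file="0MGPdat.rda")
names(MGPdat)  # person-level data; the variables are listed below
attach(MGPdat)
# id          ID
# n_down      number of times the software developed / files uploaded by the person were downloaded
# n_file      number of software packages developed / files uploaded by the person
# first_dev   date the person first released software / a file
# n_post      number of messages posted
# n_res0      number of messages received
# first_post  date the person first posted a message
# handle2     handle name
# id2         the id above combined with the handle name
summary(MGPdat)  # descriptive statistics for the variables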

# Explain the download count by the number of posts
summary(res<-lm(n_down~ n_post,data=MGPdat))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  41.9594    37.3371   1.124    0.261    
n_post        3.2671     0.3272   9.984   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Multiple R-squared: 0.08559,	Adjusted R-squared: 0.08473

# Explain the download count by the number of messages received
summary(res<-lm(n_down~ n_res0,data=MGPdat))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  22.2607    36.7912   0.605    0.545     
n_res0       10.9080     0.9229  11.819   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Multiple R-squared: 0.1159,	Adjusted R-squared: 0.1151

# Explain the download count by both the number of posts and the number of messages received
summary(res0<-lm(n_down~ n_post+n_res0,data=MGPdat))

How should this result be interpreted?

Coefficients:
           Estimate Std. Error t value Pr(>|t|)    
(Intercept)   23.207     36.402   0.638    0.524    
n_post        -5.995      1.225  -4.894 1.14e-06 ***
n_res0        27.513      3.514   7.830 1.17e-14 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Multiple R-squared: 0.1354,	Adjusted R-squared: 0.1338

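plot(n_down,res0$fitted)  # plot the observed download counts against the download counts predicted from the estimated regression parameters
# Note: R2 is (variance of y predicted by the model) / (variance of the original y)
  var(res0$fitted)/var(n_down) # for a linear model with an intercept this equals the Multiple R-squared above (0.1354)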
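As a small check of the variance-ratio reading of R2, the following sketch uses simulated data (the variables x1, x2, y and the seed are assumptions for illustration, not the MGPdat data) and compares var(fitted)/var(y) with the R-squared reported by summary() for a linear model with an intercept.

```r
# Simulated data for illustration only (not MGPdat)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# Variance of the fitted values relative to the variance of y
ratio <- var(fitted(fit)) / var(y)
# R-squared as reported by summary()
r2 <- summary(fit)$r.squared

print(c(ratio = ratio, r2 = r2))
print(all.equal(ratio, r2))
```

The two numbers agree up to floating-point error, which is why the ratio computed for an lm() fit should match the Multiple R-squared line of its summary.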

# Check the relationship between the explanatory variables
plot(n_post,n_res0)
cor(data.frame(n_post,n_res0)) # correlation coefficients
#           n_post    n_res0
#n_post 1.0000000 0.9656394
#n_res0 0.9656394 1.0000000
# Regress one of these two variables on the other
summary(lm(n_post~ n_res0,data=MGPdat))

Coefficients:
           Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.15781    0.91061   0.173    0.862    
n_res0       2.76993    0.02284 121.257   <2e-16 *** 
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Residual standard error: 28.62 on 1065 degrees of freedom
Multiple R-squared: 0.9325,	Adjusted R-squared: 0.9324
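One common way to summarise this kind of overlap between explanatory variables is the variance inflation factor, VIF = 1/(1 - R2), where R2 comes from regressing one explanatory variable on the other(s). The sketch below plugs in the R2 reported above and then repeats the calculation on simulated data; the helper name vif_pair and the simulated x1, x2 are assumptions for illustration, not part of the original analysis.

```r
# VIF from the auxiliary regression reported above (R2 = 0.9325)
r2_aux <- 0.9325
vif_reported <- 1 / (1 - r2_aux)
print(vif_reported)   # roughly 14.8

# Hypothetical helper: VIF for a model with exactly two explanatory variables
vif_pair <- function(a, b) {
  r2 <- summary(lm(a ~ b))$r.squared
  1 / (1 - r2)
}

# Simulated, strongly correlated predictors (illustration only)
set.seed(1)
x1 <- rnorm(200)
x2 <- 2.77 * x1 + rnorm(200, sd = 0.5)
print(vif_pair(x2, x1))
```

A frequently quoted rule of thumb treats a VIF above about 10 as a sign of serious multicollinearity, so the value obtained here would count as high under that convention.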

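When explanatory variables are highly correlated with each other (multicollinearity), entering them into the model together produces puzzling results like the one above: the coefficient on n_post is positive (3.27) when it is used alone but turns negative (-6.00) once n_res0 is added.
- Remedies: Use only whichever of the two is more important.

            (If it is meaningful to do so) combine them into a single variable.

            Transforming the variables can also lower the correlation.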
ln_down<-log(1+n_down)
ln_post<-log(1+n_post)
ln_res0<-log(1+n_res0)
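
R also provides log1p(x), which computes log(1 + x) directly. A minimal check on made-up counts (the vector x is an assumption for illustration):

```r
# Made-up count values, including zero
x <- c(0, 1, 9, 99, 1000)

a <- log(1 + x)   # the form used above
b <- log1p(x)     # built-in equivalent

print(all.equal(a, b))
print(log1p(0))   # zero counts map to zero
```

Adding 1 before taking the log is what keeps zero counts in the data, since log(0) is -Inf.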

par(mfrow=c(2,4))   # arrange 8 plots in a 2 x 4 grid
hist(n_down)
hist(n_post)
hist(n_res0)
plot(n_post,n_res0)

hist(ln_down)
hist(ln_post)
hist(ln_res0)
plot(ln_post,ln_res0)
cor(ln_post,ln_res0)
#[1] 0.8937551

Explain using the log-transformed variables

 summary(res2<-lm(n_down~ ln_post+ln_res0))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -185.44      69.18  -2.680  0.00747 **
ln_post        63.50      60.92   1.042  0.29751   
ln_res0       167.88      65.23   2.574  0.01019 * 
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Residual standard error: 1196 on 1064 degrees of freedom
Multiple R-squared: 0.0552,	Adjusted R-squared: 0.05342

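plot(n_down,res2$fitted)  # plot the observed download counts against the download counts predicted from the estimated regression parameters
# Note: R2 is (variance of y predicted by the model) / (variance of the original y)
  var(res2$fitted)/var(n_down) # for a linear model with an intercept this equals the Multiple R-squared above (0.0552)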

Take the log of the download count as well

 summary(res<-lm(ln_down~ ln_post+ln_res0))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.16756    0.09155  -1.830 0.067479 .  
ln_post      0.12310    0.08062   1.527 0.127086    
ln_res0      0.29100    0.08631   3.371 0.000775 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Residual standard error: 1.582 on 1064 degrees of freedom
Multiple R-squared: 0.09663,	Adjusted R-squared: 0.09493
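
In a log-log model the coefficients can be read approximately as elasticities. As a worked illustration using the estimate for ln_res0 above, and holding ln_post fixed (a strong assumption given how correlated the two variables are), doubling 1 + n_res0 multiplies the back-transformed fitted value of 1 + n_down by 2^0.291:

```r
# Coefficient on ln_res0 taken from the output above
b_res0 <- 0.29100

# Multiplicative change in (1 + n_down) when (1 + n_res0) doubles
mult <- 2^b_res0
print(mult)   # roughly 1.22
```

That is, under this model a doubling of messages received goes with an increase of a little over 20% in downloads, not a doubling.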

Reference) The dependent variable is a count (0 or a positive integer), so Poisson regression is used here instead of linear regression

 summary(res.p<-glm(n_down~ ln_post+ln_res0,family="poisson"))
plot(n_down,res.p$fitted)
# An R2-like measure
 var(res.p$fitted)/var(n_down) #[1] 0.1047923

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.956653   0.006322  467.70   <2e-16 ***
ln_post     -0.341131   0.005295  -64.42   <2e-16 ***
ln_res0      1.288934   0.006111  210.92   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 1128172  on 1066  degrees of freedom
Residual deviance:  735679  on 1064  degrees of freedom
AIC: 736262
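
Another R2-like summary for a GLM compares the residual deviance with the null deviance. The following is a minimal sketch using the two deviances printed above; the names pseudo_r2 and disp_ratio are assumptions for illustration, and with the fitted object the same quantity would be 1 - res.p$deviance/res.p$null.deviance.

```r
# Deviances and residual degrees of freedom copied from the output above
null_dev  <- 1128172
resid_dev <- 735679
resid_df  <- 1064

# Share of the null deviance accounted for by the model
pseudo_r2 <- 1 - resid_dev / null_dev
print(pseudo_r2)    # roughly 0.35

# Residual deviance per degree of freedom; values far above 1 suggest overdispersion
disp_ratio <- resid_dev / resid_df
print(disp_ratio)
```

The residual deviance here is several hundred times its degrees of freedom, which suggests strong overdispersion relative to the Poisson assumption, so the very small standard errors in the table above should be read with caution.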